update incident log

* update 012
* create 013 - 015
2022-09-09 01:50:22 +02:00 · dfab9afcde
parent 67ebf9b9bb
commit dfab9afcde
1 changed files with 112 additions and 4 deletions
--- a/documentation/INCIDENTS.md
+++ b/documentation/INCIDENTS.md
@@ -403,7 +403,7 @@ I added a `persistent_keepalive` to the tunnel to stop this from happening again
 * [ ] monitor wireguard state (probably needs a custom lua exporter)


-012: 2022.09.01 17:24, 18:10  |  ongoing reboots of gw-core01
+012: 2022.09.01 - 2022.09.08  | ongoing reboots of gw-core01
 -------------------------------------------------------------

 Unfortunately zip tying back the protective cap of the power strip did not stop the random reboots of `gw-core01`.
@@ -413,17 +413,125 @@ Either the power supply or the device itself is broken.

 **solution**:
 * [x] replace power supply
-* [ ] plug power supply into "normal" socket
-* [ ] replace device itself
+* [x] plug power supply into "normal" socket
+* [x] replace device itself

 **updates**:
 * 2022.09.02 ~20:00: I tried replacing the power supply but nobody could let me into the facilities.
 * 2022.09.03 ~14:40: Successfully replaced the power supply. While doing so the new power supply slipped out of its socket multiple times. It seems like the sockets in the power strip are deeper than normal. Maybe the old supply was not broken but simply slipped out because the strip is weird? Some zip ties are holding the new supply in place.
-* 2022.09.04 ~14:50: `gw-core01` rebooted again :( The next step is to put a "normal" power strip between the weard one and `gw-core01`
+* 2022.09.04 ~14:50: `gw-core01` rebooted again => put psu of `gw-core01` into a "normal" power strip
+* 2022.09.06 ~11:00: put psu of `gw-core01` into a "normal" power strip
+* 2022.09.07 ~09:40: `gw-core01` rebooted again => replace device
+* 2022.09.08 ~15:40: replaced `gw-core01` with a `Ubiquiti Edge Router X SFP`

 **impact**:
 * 2022.09.01 17:24, 17:47
 * 2022.09.02 14:31, 18:10
 * 2022.09.03 ~14:40
 * 2022.09.04 ~14:50
+* 2022.09.07 ~09:40
+* 2022.09.08 ~15:40
 * routing outage for a few minutes
+
+**router replacement**:
+We replaced a `Ubiquiti Edge Router X` with a `Ubiquiti Edge Router X SFP`.
+The two devices are nearly identical, except that the new router has an additional SFP port and can deliver passive PoE on all ports.
+
+After building a custom openwrt image with [garet](https://git.sr.ht/~hirnpfirsich/garet)
+(profile: `ubiquiti-edgerouter-x-sfp_21.02.3`, commit: `6f7c75c8064e7e0241cdba8f87efc9492dd860d0`)
+we transferred the config to the new device.
+
+The custom gpio mappings in `/etc/config/system` differ between the two devices, so they were edited accordingly.
+
+
+013: 2022.09.07 10:17 - 11:47 | public wifi vpn blackholed traffic
+------------------------------------------------------------------
+
+The public wifi had no upstream internet connectivity from 10:17 till 11:47.
+
+This time the wireguard interface was up and connected (i.e. latest handshake < 120 seconds ago):
+```
+root@gw-core01:~# wg
+interface: wg1
+  public key: Sqz0LEJVmgNlq6ZgmR9YqUu3EcJzFw0bJNixGUV9Nl8=
+  private key: (hidden)
+  listening port: 36986
+
+peer: uC0C1H4zE6WoDjOq65DByv1dSZt2wAv6gXQ5nYOLiQM=
+  endpoint: 185.209.196.70:51820
+  allowed ips: 0.0.0.0/0
+  latest handshake: 1 minute, 2 seconds ago
+  transfer: 5.84 GiB received, 679.59 MiB sent
+  persistent keepalive: every 15 seconds
+
+interface: wg0
+  public key: 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
+  private key: (hidden)
+  listening port: 51820
+
+peer: 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
+  preshared key: (hidden)
+  endpoint: 162.55.53.85:51820
+  allowed ips: 0.0.0.0/0
+  latest handshake: 1 minute, 31 seconds ago
+  transfer: 5.48 MiB received, 4.51 MiB sent
+  persistent keepalive: every 15 seconds
+```
+
+After restarting all interfaces (via `/etc/init.d/network restart`) traffic started flowing again.
+
+**impact**:
+no internet on the public wifi from 10:17 till 11:47
+
+**reason**: unknown. My guess is that Mullvad had an oopsie on their side.
+
+
+014: 2022.09.08 15:07 | (maintenance) additional ap in tent 5
+-------------------------------------------------------------
+
+The residents of tent 5 complained about bad wifi performance.
+
+**root cause**:
+A speedtest in the office revealed `~60 Mbit/s`. The same test in tent 5 only got `~15 Mbit/s`.
+(This was under normal network load.)
+Additionally, the monitoring showed that the ap in tent 5 had the most connected clients (~35) while the aps in other tents only had 15 to 20.
+
+Therefore my assumption was that the ap could not keep up with the number of connected clients.
+__This is unscientific, I know__.
+
+**solution**:
+We installed the additional ap (`ap-8f39`) on the opposite side of the tent to distribute the load evenly.
+The network cable for `ap-8f39` could be terminated right inside tent 5 because `sw-access02` also lives there.
+Because we did not want to crawl behind the separated rooms inside the tent, we decided to route the cable for `ap-8f39` via the outside.
+
+
+015: 2022.09.08 18:45 - ??:?? | gw-core01 unreachable
+-----------------------------------------------------
+
+At 18:45 `gw-core01` lost its wireguard connection to `eae-adp-jump01` and has been unreachable since.
+Either Vodafone is down or the new router died on us.
+
+```
+eae-adp-jump01# date
+eae-adp-jump01# date && ospfctl show neigh && ifconfig wg
+Thu Sep  8 20:44:05 CEST 2022
+ID              Pri State        DeadTime Address         Iface     Uptime
+10.84.8.1       1   DOWN/P2P     01:56:07 10.84.254.1     wg0       -
+192.168.0.2     1   DOWN/P2P     05:06:46 10.84.254.1     wg0       -
+wg0: flags=80c3<UP,BROADCAST,RUNNING,NOARP,MULTICAST> mtu 1350
+	index 5 priority 0 llprio 3
+	wgport 51820
+	wgpubkey 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
+	wgpeer 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
+		wgpsk (present)
+		wgendpoint 109.42.241.116 9749
+		tx: 1858427748, rx: 836384108
+		last handshake: 7036 seconds ago
+		wgaip 0.0.0.0/0
+	groups: wg
+	inet 10.84.254.0 netmask 0xfffffffe
+eae-adp-jump01#
+```
+
+**impact**:
+no routing to the internet
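
The stale handshake seen here (`last handshake: 7036 seconds ago`) and the 120 second threshold used in incident 013 suggest a simple freshness check for the open "monitor wireguard state" item. The following is only a minimal sketch in Python, not the planned lua exporter. It assumes the input is the tab-separated output of `wg show all latest-handshakes` (interface, peer public key, unix timestamp, with `0` meaning no handshake yet). The function names and the peer keys in the usage example are hypothetical placeholders.

```python
import time

# Threshold taken from incident 013: a peer counts as connected
# if its latest handshake is less than 120 seconds old.
STALE_AFTER_SECONDS = 120


def parse_latest_handshakes(text):
    """Parse lines of the form '<interface> <peer public key> <unix timestamp>'.

    Assumption: this matches the output of `wg show all latest-handshakes`.
    Lines that do not have exactly three fields are skipped.
    """
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        iface, pubkey, epoch = fields
        peers.append((iface, pubkey, int(epoch)))
    return peers


def stale_peers(text, now=None, max_age=STALE_AFTER_SECONDS):
    """Return (interface, peer, age_in_seconds) for every stale peer.

    The age is None when the peer has never completed a handshake
    (timestamp 0).
    """
    now = time.time() if now is None else now
    stale = []
    for iface, pubkey, epoch in parse_latest_handshakes(text):
        age = None if epoch == 0 else now - epoch
        if age is None or age > max_age:
            stale.append((iface, pubkey, age))
    return stale


if __name__ == "__main__":
    # Hypothetical sample input; the keys are placeholders, not real peers.
    sample = "wg0\tPEERKEY_A\t1000\nwg1\tPEERKEY_B\t0\nwg1\tPEERKEY_C\t7000\n"
    for iface, peer, age in stale_peers(sample, now=7062):
        print(iface, peer, "never" if age is None else f"{age}s ago")
```

A check like this would have flagged incident 015, but not incident 013, where the handshake was fresh while traffic was blackholed. Catching that case would additionally need an end-to-end probe through the tunnel.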