incidents: add incidents 009 till 011

2022-09-01 13:06:15 +02:00 · 2022-09-01 13:06:15 +02:00 · d57b0ae362
parent 1d01fa7020
commit d57b0ae362
1 changed files with 73 additions and 0 deletions
--- a/documentation/INCIDENTS.md
+++ b/documentation/INCIDENTS.md
@@ -269,3 +269,76 @@ root@gw-core01:~# /etc/init.d/firewall restart
 root@gw-core01:~# /etc/init.d/dnsmasq reload
 ```
 * `ap-XXXX`: see `playbook_provision_accesspoints.yml`
+
+
+009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn
+------------------------------------------------------------------------------
+
+_This is already implemented, documentation will follow_
+
+
+010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01
+-----------------------------------------------------------------------------
+
+`gw-core01` randomly reboots.
+The other devices on the same circuit (e.g. `sw-access01`) did not reboot.
+Therefore it is not an issue with the circuit itself.
+
+After we called facility management, they confirmed that the power supply was not correctly seated.
+The cause of the misalignment is still the protective cap of the power strip (see incident `001` for details).
+
+Facility management will either remove the protective cap or disable the latching mechanism with zip ties.
+
+**impact**:
+* dhcp and routing downtime
+  * for a few minutes on 2022.08.28 13:00
+  * for about 80 minutes on 2022.08.29 (09:10 till 10:30)
+
+**monitoring enhancements**:
+* [ ] alert on rebooted nodes (via `node_boot_time_seconds`)
+
+
+011: 2022.09.01 01:06 - 10:00 | public wifi lost upstream vpn connection
+------------------------------------------------------------------------
+
+The wireguard vpn (which launders the traffic of the public wifi) did not handshake for 9 hours:
+```
+root@gw-core01:~# date && wg
+Thu Sep  1 07:55:49 UTC 2022
+[...]
+interface: wg1
+  public key: [redacted]
+  private key: (hidden)
+  listening port: 48603
+
+peer: [redacted]
+  endpoint: [redacted]:51820
+  allowed ips: 0.0.0.0/0
+  latest handshake: 8 hours, 49 minutes, 4 seconds ago
+  transfer: 201.20 GiB received, 16.12 GiB sent
+```
+
+**impact**:
+the public wifi `GU Deutscher Platz` had no internet access for 9 hours
+
+**solution**:
+add persistent keepalive statement to `wg1` and restart tunnel (via `/etc/init.d/network restart`)
+
+**discussion**:
+Any traffic traversing the interface should trigger a new handshake.
+Therefore I do not really understand why there was no handshake for 9 hours.
+
+Root cause theory:
+The kernel decided that the default route via that interface was unreachable and dropped the traffic, so nothing reached the tunnel to trigger a new handshake.
+
+I added a `persistent_keepalive` to the tunnel to stop this from happening again (if my theory for the root cause is correct).
+
+**changes**:
+* 09:56:10 `/etc/init.d/network restart`                 (to kickstart new handshake)
+* 09:58:00 add `persistent_keepalive` statement to `wg1` (longterm fix)
+* 10:00:00 `/etc/init.d/network restart`                 (restart again to apply wg changes)
+
+**monitoring enhancements**:
+* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
+* [ ] prometheus instance on `eap-adp-jump01` to get alerts if upstream is down in facility
+* [ ] monitor wireguard state (probably needs a custom lua exporter)
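
Note on the last monitoring item: until a proper exporter exists, a stale handshake could be detected with a small shell check. The sketch below is only an illustration, not part of this commit. It assumes the input format of `wg show <interface> latest-handshakes` (one line per peer: public key, a tab, then the unix timestamp of the last handshake, `0` meaning never). The function name `stale_peers` and the 300 second threshold are hypothetical choices.

```shell
#!/bin/sh
# Sketch: list wireguard peers whose latest handshake is older than a threshold.
# Expects lines in the format of `wg show <interface> latest-handshakes`:
#   <public key><TAB><unix timestamp of latest handshake, 0 = never>

# stale_peers NOW MAX_AGE
# Reads handshake lines on stdin and prints "<peer> <age in seconds>"
# (or "<peer> never") for every peer that exceeds MAX_AGE.
stale_peers() {
    now="$1"
    max_age="$2"
    while read -r peer ts; do
        # skip empty lines
        [ -n "$peer" ] || continue
        if [ "$ts" -eq 0 ]; then
            echo "$peer never"
            continue
        fi
        age=$((now - ts))
        if [ "$age" -gt "$max_age" ]; then
            echo "$peer $age"
        fi
    done
}

# Hypothetical usage on the gateway (interface name taken from incident 011,
# threshold is an assumption):
#   wg show wg1 latest-handshakes | stale_peers "$(date +%s)" 300
```

Any output means at least one peer is stale, so a cron job or exporter could alert on non-empty output. The threshold would need tuning against the configured `persistent_keepalive` interval.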