From d57b0ae362e72c5189cca3a5fb74ec51d8098f3d Mon Sep 17 00:00:00 2001
From: Gregor Michels
Date: Thu, 1 Sep 2022 13:06:15 +0200
Subject: [PATCH] incidents: add incidents 009 till 011

---
 documentation/INCIDENTS.md | 73 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/documentation/INCIDENTS.md b/documentation/INCIDENTS.md
index 42b3e80..d6af885 100644
--- a/documentation/INCIDENTS.md
+++ b/documentation/INCIDENTS.md
@@ -269,3 +269,76 @@ root@gw-core01:~# /etc/init.d/firewall restart
 root@gw-core01:~# /etc/init.d/dnsmasq reload
 ```
 * `ap-XXXX`: see `playbook_provision_accesspoints.yml`
+
+
+009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn
+------------------------------------------------------------------------------
+
+_This is already implemented, documentation will follow_
+
+
+010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01
+-----------------------------------------------------------------------------
+
+`gw-core01` randomly reboots.
+The other devices on the same circuit (i.e. `sw-access01`) did not reboot.
+Therefore, it is not an issue with the circuit itself.
+
+After we called facility management, they confirmed that the power supply was not seated correctly.
+The cause of the misalignment is still the protective cap of the power strip (see incident `001` for details).
+
+Facility management will either remove the protective cap or disable the latching mechanism with zip ties.
+
+**impact**:
+* dhcp and routing downtime
+* for a few minutes on 2022.08.28 13:00
+* for about an hour on 2022.08.29 09:10 (till 10:30)
+
+**monitoring enhancements**:
+* [ ] alert on rebooted nodes (via `node_boot_time_seconds`)
+
+
+011: 2022.08.31 01:06 - 10:00 | public wifi lost upstream vpn connection
+------------------------------------------------------------------------
+
+The wireguard vpn (which launders the traffic of the public wifi) did not handshake for 9 hours:
+```
+root@gw-core01:~# date && wg
+Thu Sep 1 07:55:49 UTC 2022
+[...]
+interface: wg1
+  public key: [redacted]
+  private key: (hidden)
+  listening port: 48603
+
+peer: [redacted]
+  endpoint: [redacted]:51820
+  allowed ips: 0.0.0.0/0
+  latest handshake: 8 hours, 49 minutes, 4 seconds ago
+  transfer: 201.20 GiB received, 16.12 GiB sent
+```
+
+**impact**:
+the public wifi `GU Deutscher Platz` had no internet access for 9 hours
+
+**solution**:
+add persistent keepalive statement to `wg1` and restart tunnel (via `/etc/init.d/network restart`)
+
+**discussion**:
+Any traffic traversing the interface should trigger a new handshake.
+Therefore I do not really understand why there was no handshake for 9 hours.
+
+Root cause theory:
+The kernel decided that the default route at that interface was unreachable, dropped traffic and therefore stopped the handshake from triggering again.
+
+I added a `persistent_keepalive` to the tunnel to stop this from happening again (if my theory for the root cause is correct).
+
+**changes**:
+* 09:56:10 `/etc/init.d/network restart` (to kick-start a new handshake)
+* 09:58:00 add `persistent_keepalive` statement to `wg1` (long-term fix)
+* 10:00:00 `/etc/init.d/network restart` (restart again to apply wg changes)
+
+**monitoring enhancements**:
+* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
+* [ ] prometheus instance on `eap-adp-jump01` to get alerts if upstream is down in facility
+* [ ] monitor wireguard state (probably needs a custom lua exporter)
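
Note on the last monitoring item: a wireguard state check could start from the handshake age, which is what exposed the outage in incident 011. The following is only a minimal sketch and not part of the patch. It assumes the input is the text printed by `wg show wg1 latest-handshakes` (one public key and one Unix timestamp per line, with `0` meaning no handshake yet). The function name `stale_peers` and the 180 second threshold are hypothetical choices for illustration, and Python is used here only for readability; the exporter on `gw-core01` would probably be written in Lua, as the item suggests.

```python
import time


def stale_peers(handshake_dump, max_age_seconds=180, now=None):
    """Return {peer_public_key: age_in_seconds} for peers with a stale handshake.

    handshake_dump: text as printed by `wg show wg1 latest-handshakes`.
    A peer that never completed a handshake (timestamp 0) is reported
    with an age of None.
    max_age_seconds: hypothetical threshold, tune it to the keepalive interval.
    """
    now = time.time() if now is None else now
    stale = {}
    for line in handshake_dump.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip empty or unexpected lines
        peer = fields[0]
        try:
            timestamp = int(fields[1])
        except ValueError:
            continue  # second column is not a timestamp
        age = None if timestamp == 0 else now - timestamp
        if age is None or age > max_age_seconds:
            stale[peer] = age
    return stale
```

The dump could be collected with `subprocess.run(["wg", "show", "wg1", "latest-handshakes"], capture_output=True, text=True, check=True).stdout`, and a non-empty result could then be exposed as a metric or turned into an alert. Keeping the parsing separate from the command call makes the check easy to test without a running tunnel.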