incidents: add incidents 009 till 011

Gregor Michels 2022-09-01 13:06:15 +02:00
parent 1d01fa7020
commit d57b0ae362
1 changed files with 73 additions and 0 deletions


@@ -269,3 +269,76 @@ root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
```
* `ap-XXXX`: see `playbook_provision_accesspoints.yml`
009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn
------------------------------------------------------------------------------
_This is already implemented; documentation will follow._
010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01
-----------------------------------------------------------------------------
`gw-core01` rebooted at random.
The other devices on the same circuit (i.e. `sw-access01`) did not reboot.
Therefore it is not an issue with the circuit itself.
When we called the facility management, they confirmed that the power supply was not seated correctly.
The cause of the misalignment is still the protective cap of the power strip (see incident `001` for details).
The facility management is going to either remove the protective cap or disable the latching mechanism with zip ties.
**impact**:
* dhcp and routing downtime
* for a few minutes on 2022.08.28 13:00
* for about an hour on 2022.08.29 09:10 (till 10:30)
**monitoring enhancements**:
* [ ] alert on rebooted nodes (via `node_boot_time_seconds`)
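A minimal sketch of how such a reboot check could look, assuming a Prometheus instance that already scrapes `node_boot_time_seconds`. The function names, the `PROM_URL` variable and the 600 second window are hypothetical; the same expression could serve as the `expr` of an alerting rule.

```shell
#!/bin/sh
# Sketch only: PROM_URL, the function names and the window are assumptions.

# reboot_query WINDOW
# Prints a PromQL expression that matches nodes which booted
# within the last WINDOW seconds.
reboot_query() {
    echo "time() - node_boot_time_seconds < $1"
}

# recently_rebooted WINDOW
# Sends the expression to the Prometheus HTTP query API and prints
# the raw JSON response. PROM_URL defaults to a local instance.
recently_rebooted() {
    curl -s --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
        --data-urlencode "query=$(reboot_query "$1")"
}

# Example (not executed here):
#   recently_rebooted 600
```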
011: 2022.08.31 01:06 - 10:00 | public wifi lost upstream vpn connection
------------------------------------------------------------------------
The wireguard vpn (which launders the traffic of the public wifi) did not handshake for 9 hours:
```
root@gw-core01:~# date && wg
Thu Sep 1 07:55:49 UTC 2022
[...]
interface: wg1
public key: [redacted]
private key: (hidden)
listening port: 48603
peer: [redacted]
endpoint: [redacted]:51820
allowed ips: 0.0.0.0/0
latest handshake: 8 hours, 49 minutes, 4 seconds ago
transfer: 201.20 GiB received, 16.12 GiB sent
```
**impact**:
the public wifi `GU Deutscher Platz` had no internet access for 9 hours
**solution**:
add persistent keepalive statement to `wg1` and restart tunnel (via `/etc/init.d/network restart`)
**discussion**:
Any traffic traversing the interface should trigger a new handshake.
Therefore I do not really understand why there was no handshake for 9 hours.
Root cause theory:
The kernel decided that the default route on that interface was unreachable and dropped the traffic, so nothing triggered a new handshake.
I added a `persistent_keepalive` to the tunnel to stop this from happening again (if my theory for the root cause is correct).
**changes**:
* 09:56:10 `/etc/init.d/network restart` (to kickstart new handshake)
* 09:58:00 add `persistent_keepalive` statement to `wg1` (long-term fix)
* 10:00:00 `/etc/init.d/network restart` (restart again to apply wg changes)
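The keepalive change above could be applied roughly as follows. This is a sketch under assumptions: it presumes the peer is the first `wireguard_wg1` section in `/etc/config/network`, and the interval of 25 seconds is only illustrative, since the value actually used is not recorded here. The helper name `set_keepalive` is hypothetical.

```shell
#!/bin/sh
# Sketch only: section name and interval are assumptions, see text above.

# set_keepalive IFACE SECONDS
# Sets persistent_keepalive on the first peer section of IFACE
# and commits the change to /etc/config/network.
set_keepalive() {
    iface="$1"
    seconds="$2"
    uci set "network.@wireguard_${iface}[0].persistent_keepalive=${seconds}"
    uci commit network
}

# Example (not executed here):
#   set_keepalive wg1 25
#   /etc/init.d/network restart        # apply the change
#   wg show wg1 persistent-keepalive   # verify the active value
```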
**monitoring enhancements**:
* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
* [ ] prometheus instance on `eap-adp-jump01` to get alerts if the upstream in the facility is down
* [ ] monitor wireguard state (probably needs a custom lua exporter)
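Until a proper exporter exists, the wireguard state could be checked with a small script. The following is a minimal sketch, not the planned lua exporter: it reads `wg show <interface> latest-handshakes` (one line per peer with the public key and the unix timestamp of the last handshake) and reports peers whose handshake is older than a threshold. The function names and the threshold of 180 seconds are assumptions.

```shell
#!/bin/sh
# Sketch only: function names and the threshold are assumptions.

# handshake_age NOW LAST
# Prints the seconds since the last handshake,
# or -1 if there has been no handshake yet (LAST is 0).
handshake_age() {
    now="$1"
    last="$2"
    if [ "$last" -eq 0 ]; then
        echo -1
    else
        echo $((now - last))
    fi
}

# is_stale AGE MAX_AGE
# Succeeds if there was no handshake at all or if it is older than MAX_AGE.
is_stale() {
    [ "$1" -lt 0 ] || [ "$1" -gt "$2" ]
}

# check_interface IFACE MAX_AGE
# Prints one line for every peer of IFACE with a stale handshake.
check_interface() {
    iface="$1"
    max_age="$2"
    now="$(date +%s)"
    wg show "$iface" latest-handshakes | while read -r peer last; do
        age="$(handshake_age "$now" "$last")"
        if is_stale "$age" "$max_age"; then
            echo "STALE $iface $peer age=${age}s"
        fi
    done
}

# Example (not executed here):
#   check_interface wg1 180
```

With a keepalive configured, a healthy tunnel should renew its handshake every few minutes, so an age of several hours as in the output above would be reported as stale.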