incidents: add incidents 009 till 011
parent 1d01fa7020
commit d57b0ae362
@ -269,3 +269,76 @@ root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
```
* `ap-XXXX`: see `playbook_provision_accesspoints.yml`


009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn
------------------------------------------------------------------------------

_This is already implemented; documentation will follow._


010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01
-----------------------------------------------------------------------------

`gw-core01` reboots randomly.
The other devices on the same circuit (e.g. `sw-access01`) did not reboot.
Therefore it is not an issue with the circuit itself.

After we called the facility management, they confirmed that the power supply was not correctly seated.
The cause of the misalignment is still the protective cap of the power strip (see incident `001` for details).

The facility management is either going to remove the protective cap or disable the latching mechanism with zip ties.

**impact**:
* dhcp and routing downtime
  * for a few minutes on 2022.08.28 13:00
  * for about an hour on 2022.08.29 09:10 (till 10:30)

**monitoring enhancements**:
* [ ] alert on rebooted nodes (via `node_boot_time_seconds`)

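The reboot check behind such an alert can be sketched in shell. This is only an illustration, not the planned alerting rule: on Linux, `node_boot_time_seconds` reflects the `btime` field of `/proc/stat`, and the equivalent PromQL condition would be something like `time() - node_boot_time_seconds < 600`. The function names and the 600 second window are assumptions.

```shell
#!/bin/sh
# boot_time [FILE]: print the boot time as a unix timestamp, read from the
# "btime" line of /proc/stat (FILE can be overridden for testing).
boot_time() {
    awk '$1 == "btime" { print $2 }' "${1:-/proc/stat}"
}

# rebooted_within NOW BOOT WINDOW: succeed if BOOT is less than WINDOW
# seconds before NOW, i.e. the node came up recently.
rebooted_within() {
    [ $(( $1 - $2 )) -lt "$3" ]
}

# Usage (600 seconds is an assumed window, not a value from this log):
#   rebooted_within "$(date +%s)" "$(boot_time)" 600 && echo "recently rebooted"
```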
011: 2022.09.01 01:06 - 10:00 | public wifi lost upstream vpn connection
------------------------------------------------------------------------

The wireguard vpn (which launders the traffic of the public wifi) did not complete a handshake for about 9 hours:
```
root@gw-core01:~# date && wg
Thu Sep 1 07:55:49 UTC 2022
[...]
interface: wg1
  public key: [redacted]
  private key: (hidden)
  listening port: 48603

peer: [redacted]
  endpoint: [redacted]:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 8 hours, 49 minutes, 4 seconds ago
  transfer: 201.20 GiB received, 16.12 GiB sent
```

**impact**:
The public wifi `GU Deutscher Platz` had no internet access for about 9 hours.

**solution**:
Add a `persistent_keepalive` statement to `wg1` and restart the tunnel (via `/etc/init.d/network restart`).

**discussion**:
Any traffic traversing the interface should trigger a new handshake.
Therefore I do not really understand why there was no handshake for 9 hours.

Root cause theory:
The kernel decided that the default route via that interface was unreachable and dropped the traffic, so nothing triggered a new handshake.

I added a `persistent_keepalive` to the tunnel to stop this from happening again (if my root cause theory is correct).

**changes**:
* 09:56:10 `/etc/init.d/network restart` (to kick-start a new handshake)
* 09:58:00 add `persistent_keepalive` statement to `wg1` (long-term fix)
* 10:00:00 `/etc/init.d/network restart` (restart again to apply the wg changes)

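The keepalive change could be applied with `uci` roughly as follows. This is a sketch under assumptions, not a record of what was run: it assumes `gw-core01` is configured through OpenWrt's `uci`, that the peer is the first `wireguard_wg1` section in `/etc/config/network`, and that an interval of 25 seconds is used (the log states neither the section nor the interval). `set_keepalive` and the `UCI` override are hypothetical helpers.

```shell
#!/bin/sh
# UCI can be overridden (e.g. UCI=echo) to preview the commands
# without touching the configuration.
UCI="${UCI:-uci}"

# set_keepalive IFACE SECONDS: set persistent_keepalive on the first
# peer section of the wireguard interface IFACE and commit the change.
set_keepalive() {
    $UCI set "network.@wireguard_${1}[0].persistent_keepalive=${2}"
    $UCI commit network
}
```

Running `set_keepalive wg1 25` followed by `/etc/init.d/network restart` would mirror the last two entries of the list above.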
**monitoring enhancements**:
* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
* [ ] prometheus instance on `eap-adp-jump01` to get alerts if the upstream is down at the facility
* [ ] monitor wireguard state (probably needs a custom lua exporter)

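Until an exporter exists, the wireguard state could be checked with a small shell filter. This is a stand-in for the lua exporter mentioned above, not a replacement for it: `wg show wg1 latest-handshakes` prints one `<public-key> <unix-timestamp>` line per peer, and the sketch flags peers whose last handshake is too old. The function name `stale_peers` and the 300 second threshold are assumptions.

```shell
#!/bin/sh
# stale_peers NOW MAX_AGE: read "<public-key> <unix-timestamp>" lines on
# stdin and print "<public-key> <age-in-seconds>" for every peer whose
# last handshake is more than MAX_AGE seconds before NOW.
# A peer that never completed a handshake reports timestamp 0 and is
# therefore always flagged.
stale_peers() {
    awk -v now="$1" -v max="$2" '{ age = now - $2; if (age > max) print $1, age }'
}
```

A cron job could run `wg show wg1 latest-handshakes | stale_peers "$(date +%s)" 300` and alert on any output.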