Collection of all incidents
===========================
001: 2022.06.30 17:00 - 18:00 | power issues on gw-core01
---------------------------------------------------------
**issue**:
The protective cap of the power strip yanked the power supply of `gw-core01` out of its socket.
Therefore `gw-core01` had no power.
**solution**:
Tape the protective cap of the power strip back and reinsert the power supply.
**impact**:
No internet access for 2 hours
002: 2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5
----------------------------------------------------------------
### issue
A resident in tent 5 reported slow internet speeds; I do not have more information.
While trying to check the logs of the ap I noticed that `ap-ac7c` is very slow and hangs/freezes a lot over ssh.
Rebooting did not solve the problem.
### cause
Unknown
I checked the ap in person the next day. I tested the ap with a different lan cable on a different switch port.
The issues I had noticed the night before were not reproducible.
But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.
### solution
_01.07.2022 ~ 03:00_ (short-term):
After noticing the issue myself I tried rebooting the ap.
Unfortunately that did not solve the problem.
To spare the clients from connecting to a bonkers ap I disabled poe for the switch port to take the ap offline:
```
root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart
```
_01.07.2022 ~ 19:00_ (long-term):
I could not reproduce the issue in person. To be on the safe side I replaced both the short patch cable (connecting the ap to the switch) and the ap:
`ap-ac7c -> ap-1a38`.
Afterwards I re-enabled poe on the corresponding switch port.
### impact
* `2022.06.30 12:00 - 2022.07.01 03:30`: (probably) unreliable wifi for clients connected to `ap-ac7c`
* `2022.07.01 03:30 - 2022.07.01 18:30`: bad signal strength to clients in and around tent 5
### notes
While disabling poe on the port connecting `ap-ac7c` I restarted the `poe` service.
That resulted in all ports shortly dropping power.
Therefore I also accidentally rebooted `ap-2bbf`.
Next time I'll just reload the service (shame on me).
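For next time, a sketch of the reload-based sequence (assuming the `poe` init script supports `reload`; procd init scripts fall back to `restart` when no reload handler is defined, so this is worth verifying on the switch first):

```shell
# Disable poe on a single port without restarting the whole service.
# 'reload' only reapplies changed settings, while 'restart' briefly
# drops power on every port (which is what rebooted ap-2bbf).
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe reload
```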
### logs
This was my test to show that ssh was slow and froze a lot on `ap-ac7c`.
good ap:
```
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:04:43 up 1 day, 6:21, load average: 0.01, 0.02, 0.00
real 0m1.438s
user 0m0.071s
sys 0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:17:49 up 1 day, 6:34, load average: 0.00, 0.01, 0.00
real 0m1.924s
user 0m0.070s
sys 0m0.010s
```
bad ap:
```
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:05:00 up 1 day, 6:33, load average: 0.01, 0.08, 0.03
real 0m29.526s
user 0m0.070s
sys 0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:06:22 up 1 day, 6:34, load average: 0.00, 0.06, 0.03
real 1m15.379s
user 0m0.081s
sys 0m0.015s
user@freifunk-admin:~$
```
003: 2022.07.03 00:30 | (maintenance) rolling wifi channel updates
------------------------------------------------------------------
We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.
Therefore I just did that (commit: 41752ef9bdfe0041359e09d08e107d330f10fcf2).
I ran ansible with `--forks=1` to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).
**impact**:
Either clients "roamed" (no fast roaming - short interruptions) to a different ap or waited till the original ap came back online.
No client experienced more than 10-15 seconds of service interruption.
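The rolling update itself is just a serialized playbook run; a sketch of the invocation (the inventory file name `hosts` is an assumption, the playbook name is taken from the repo):

```shell
# --forks=1 makes ansible reconfigure one ap at a time, so at most
# one radio is down (and its clients roaming) at any moment.
user@freifunk-admin:~$ ansible-playbook -i hosts --forks=1 playbook_provision_accesspoints.yml
```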
004: 2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients
-------------------------------------------------------------------------
The dnsmasq instance for the client network (`10.84.4.0/22`) only used the dhcp pool `10.84.4.100 - .250`.
To be able to actually assign the full `/22` to clients I've changed the pool to `10.84.4.2 - 10.84.7.254`.
Afterwards I've reloaded `dnsmasq` on `gw-core01`.
**impact**: none
Currently `dnsmasq` has handed out 104 leases, so we presumably never ran out of ips in the old pool.
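On OpenWrt this amounts to changing the `start`/`limit` of the dhcp section and reloading dnsmasq; a sketch (the section name `client` is an assumption):

```shell
# Widen the pool from 10.84.4.100-.250 to almost the full /22.
# 'start' is the offset from the network address 10.84.4.0,
# 'limit' the pool size: 10.84.4.2 = offset 2, 10.84.7.254 = offset
# 1022, so the pool holds 1022 - 2 + 1 = 1021 addresses.
root@gw-core01:~# uci set dhcp.client.start='2'
root@gw-core01:~# uci set dhcp.client.limit='1021'
root@gw-core01:~# uci commit dhcp
root@gw-core01:~# /etc/init.d/dnsmasq reload
```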
005: 2022.07.23 07:40 - 12:50 | power outage in tent 5
------------------------------------------------------
There was a power outage in tent 5 taking down `sw-access02` and therefore also `ap-1a38` (tent 5) and `ap-2bbf` (tent 4).
**impact**: no access points and therefore no wifi in tents 4 and 5. Maybe some clients roamed to a different tent.
**problems accessing the equipment**:
Currently every visit from Freifunkas needs to be coordinated with the object management of the facility.
This is fine for scheduled maintenance but not doable for incident response, and often leads to discussions and inaccessible equipment (which is totally understandable).
Maybe we can "whitelist" certain people at the facility so they can always access the equipment without further authorization.
**Update 2022.07.28**: The facility management created id cards for `@hirnpfirsich`, `@katzenparadoxon` and `@martin`
006: 2022.07.25 01:00 | (maintenance) os upgrades
-------------------------------------------------
OS upgrades of non-customer facing machines/services
**impact**: short downtime of the monitoring
**upgrades**:
* `monitoring01`: `apt update && apt dist-upgrade -y && reboot` (downtime: 00:58 - 00:59)
* `hyper01`: `apt update && apt dist-upgrade -y && reboot` (downtime: 01:09 - 01:10)
* `eae-adp-jump01`: `syspatch && rcctl restart cron`
007: 2022.08.01 14:00 - 2022.08.15 14:15 | no internet access
-------------------------------------------------------------
Vodafone ended their free internet offering for refugee camps on 01.08.2022.
**impact**:
No internet access for ~2 weeks
**solution**:
`Saxonia Catering` entered into an internet contract with Vodafone
008: 2022.08.13 ~13:30 | (maintenance) add backoffice wifi
----------------------------------------------------------
The facility management asked us if we could build a backoffice wifi that is inaccessible from the rest of the network.
* vlan: `8`
* subnet: `10.84.8.1/24`
* wifi ssid: `GU Deutscher Platz Backoffice`
* wifi password: `wifi/GU_Deutscher_Platz_Backoffice` in `pass`
**impact**:
* `gw-core01` and `sw-access0{1,2}` only got reloaded (so no downtime)
* on `ap-XXXX`s networking restarted so the wifi was unavailable for a few seconds
* either way there was no upstream internet connectivity at this time (see incident `007` for details)
* therefore impact "calculations" are irrelevant
**changes**:
* `sw-access0{1,2}`:
```
root@sw-access01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
	option device 'switch'
	option vlan '8'
	option ports 'lan1:t lan2:t lan3:t lan4:t lan5:t lan6:t lan7:t lan8:t'
EOF
root@sw-access01:~# /etc/init.d/network reload
```
* `gw-core01`:
```
root@gw-core01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
	option vlan '8'
	option device 'switch'
	list ports 'eth2:t'
	list ports 'eth3:t'
	list ports 'eth4:t'

config interface 'backoffice'
	option device 'switch.8'
	option proto 'static'
	option ipaddr '10.84.8.1'
	option netmask '255.255.255.0'
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
	option name backoffice
	list network 'backoffice'
	option input REJECT
	option output ACCEPT
	option forward REJECT

config forwarding
	option src backoffice
	option dest wan

config rule
	option name BACKOFFICE_Allow-DHCP
	option src backoffice
	option proto udp
	option dest_port 67-68
	option target ACCEPT
	option family ipv4

config rule
	option name BACKOFFICE_Allow-DNS
	option src backoffice
	option proto udp
	option dest_port 53
	option target ACCEPT
	option family ipv4
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/dhcp << EOF
config dhcp 'backoffice'
	option interface 'backoffice'
	option start '100'
	option limit '150'
	option leasetime '12h'
	option dhcpv4 'server'
	option dhcpv6 'server'
	option ra 'server'
	option ra_slaac '1'
	list ra_flags 'managed-config'
	list ra_flags 'other-config'
EOF
root@gw-core01:~#
root@gw-core01:~# /etc/init.d/network reload
root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
```
* `ap-XXXX`: see `playbook_provision_accesspoints.yml`
009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn
------------------------------------------------------------------------------
_This is already implemented, documentation will follow_
010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01
-----------------------------------------------------------------------------
`gw-core01` randomly reboots.
The other devices on the same circuit (i.e. `sw-access01`) did not reboot, so it is not an issue with the circuit itself.
After a call, the facility management confirmed that the power supply is not seated correctly.
The misalignment is again caused by the protective cap of the power strip (see incident `001` for details).
The facility management is either going to remove the protective cap or disable the latching mechanism with zip ties.
**impact**:
* dhcp and routing downtime
* for a few minutes on 2022.08.28 13:00
* for about an hour on 2022.08.29 09:10 (till 10:30)
**monitoring enhancements**:
* [ ] alert on rebooted nodes (via `node_boot_time_seconds`)
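A possible shape for that alerting rule (the rule file path and threshold are assumptions; `node_boot_time_seconds` changes its value when a node reboots):

```shell
# Alert when a node's boot time changed within the last 10 minutes,
# i.e. the node rebooted recently. The quoted heredoc ('EOF') keeps
# the shell from expanding $labels inside the annotation template.
root@monitoring01:~# cat >> /etc/prometheus/rules/reboots.yml << 'EOF'
groups:
  - name: reboots
    rules:
      - alert: NodeRebooted
        expr: changes(node_boot_time_seconds[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} rebooted recently"
EOF
```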
011: 2022.08.31 01:06 - 10:00 | public wifi lost upstream vpn connection
------------------------------------------------------------------------
The wireguard vpn (which launders the traffic of the public wifi) did not complete a handshake for 9 hours:
```
root@gw-core01:~# date && wg
Thu Sep 1 07:55:49 UTC 2022
[...]
interface: wg1
public key: [redacted]
private key: (hidden)
listening port: 48603
peer: [redacted]
endpoint: [redacted]:51820
allowed ips: 0.0.0.0/0
latest handshake: 8 hours, 49 minutes, 4 seconds ago
transfer: 201.20 GiB received, 16.12 GiB sent
```
**impact**:
The public wifi `GU Deutscher Platz` had no internet access for 9 hours.
**solution**:
Add a persistent keepalive statement to `wg1` and restart the tunnel (via `/etc/init.d/network restart`).
**discussion**:
Any traffic traversing the interface should trigger a new handshake.
Therefore I do not really understand why there was no handshake for 9 hours.
Root cause theory:
The kernel decided that the default route via that interface was unreachable, dropped the traffic, and therefore stopped the handshake from being triggered again.
I added a `persistent_keepalive` to the tunnel to stop this from happening again (assuming my root cause theory is correct).
**changes**:
* 09:56:10 `/etc/init.d/network restart` (to kickstart new handshake)
* 09:58:00 add `persistent_keepalive` statement to `wg1` (long-term fix)
* 10:00:00 `/etc/init.d/network restart` (restart again to apply wg changes)
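On OpenWrt the keepalive lives in the wireguard peer section of `/etc/config/network`; a sketch of the change (the peer section index `[0]` is an assumption):

```shell
# A keepalive packet every 25 seconds generates traffic on the tunnel
# even when no client traffic reaches the interface, so a new
# handshake keeps getting triggered.
root@gw-core01:~# uci set network.@wireguard_wg1[0].persistent_keepalive='25'
root@gw-core01:~# uci commit network
root@gw-core01:~# /etc/init.d/network restart
```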
**monitoring enhancements**:
* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
* [ ] prometheus instance on `eae-adp-jump01` to get alerts if the upstream in the facility is down
* [ ] monitor wireguard state (probably needs a custom lua exporter)