eae-am-deutschen-platz/documentation/INCIDENTS.md


Collection of all incidents
===========================

001: 2022.06.30 17:00 - 18:00 | power issues on gw-core01
---------------------------------------------------------

**issue**:

The protective cap of the power strip yanked the power supply of `gw-core01` out of its socket.
Therefore `gw-core01` had no power.

**solution**:

Taped back the protective cap of the power strip and reinserted the power supply.

**impact**:

No internet access for 2 hours


002: 2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5
----------------------------------------------------------------

### issue

A resident of tent 5 reported slow internet speeds. I do not have more information.
While trying to check the logs of the ap I noticed that `ap-ac7c` was very slow and hung/froze a lot over ssh.

Rebooting did not solve the problem.

### cause

Unknown

I checked the ap in person the next day. I tested the ap with a different lan cable on a different switch port.
The issues I had noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

### solution

_01.07.2022 ~ 03:00_ (short-term):
After noticing the issue myself I tried rebooting the ap.
Unfortunately that did not solve the problem.
To spare the clients from connecting to a bonkers ap I disabled poe on the switch port to take the ap offline:
```
root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart
```

_01.07.2022 ~ 19:00_ (long-term):
I could not reproduce the issue in person. To be on the safe side I replaced both the short patch cable (connecting the ap to the switch) and the ap:
`ap-ac7c -> ap-1a38`.
Afterwards I re-enabled poe on the corresponding switch port.


### impact

* `2022.06.30 12:00 - 2022.07.01 03:30`: (probably) unreliable wifi for clients connected to `ap-ac7c`
* `2022.07.01 03:30 - 2022.07.01 18:30`: bad signal strength to clients in and around tent 5

### notes

While disabling poe on the port connecting `ap-ac7c` I restarted the `poe` service.
That resulted in all ports briefly dropping power.
Therefore I also accidentally rebooted `ap-2bbf`.

Next time I'll just reload the service (shame on me).

### logs

This was my test to show that ssh to `ap-ac7c` was slow and froze a lot.

good ap:
```
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:04:43 up 1 day,  6:21,  load average: 0.01, 0.02, 0.00

real	0m1.438s
user	0m0.071s
sys	0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:17:49 up 1 day,  6:34,  load average: 0.00, 0.01, 0.00

real	0m1.924s
user	0m0.070s
sys	0m0.010s
```

bad ap:
```
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:05:00 up 1 day,  6:33,  load average: 0.01, 0.08, 0.03

real	0m29.526s
user	0m0.070s
sys	0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:06:22 up 1 day,  6:34,  load average: 0.00, 0.06, 0.03

real	1m15.379s
user	0m0.081s
sys	0m0.015s
user@freifunk-admin:~$
```
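
The manual comparison above could be scripted roughly like this. It is only a sketch: the helper names `elapsed_seconds` and `check_latency` and the 5 second threshold are made up for illustration and were not used during the incident.

```shell
#!/bin/sh
# Run a command and print how many whole seconds it took.
elapsed_seconds() {
    begin=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - begin ))
}

# Print "slow" if the command took longer than the threshold (seconds), else "ok".
check_latency() {
    threshold=$1
    shift
    if [ "$(elapsed_seconds "$@")" -gt "$threshold" ]; then
        echo "slow"
    else
        echo "ok"
    fi
}

# Example (hostnames from the logs above, the threshold is an assumption):
# for ap in ap-2bbf ap-ac7c; do
#     echo "$ap: $(check_latency 5 ssh -o ConnectTimeout=120 "$ap" uptime)"
# done
```

With the timings logged above, `ap-2bbf` (under 2 seconds) would come out as "ok" and `ap-ac7c` (29 to 75 seconds) as "slow".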

003: 2022.07.03 00:30 | (maintenance) rolling wifi channel updates
------------------------------------------------------------------

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

I have now done that (commit: 41752ef9bdfe0041359e09d08e107d330f10fcf2).

I ran ansible with `--forks=1` to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).

**impact**:
Clients either "roamed" to a different ap (no fast roaming, so with a short interruption) or waited until the original ap came back online.
Each client had at most 10-15 seconds of service interruption.


004: 2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients
-------------------------------------------------------------------------

The dnsmasq instance for the client network (`10.84.4.0/22`) only used the dhcp pool `10.84.4.100 - .250`.

To be able to actually assign the full `/22` to clients I've changed the pool to `10.84.4.2 - 10.84.7.254`.

Afterwards I've reloaded `dnsmasq` on `gw-core01`.

**impact**: none

Currently `dnsmasq` has handed out 104 leases, so we presumably never ran out of ips in the old pool.
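
To put the numbers into perspective, the size of both pools can be calculated like this. This is a small sketch in plain `sh`. The helper names `ip_to_int` and `pool_size` are only for illustration and are not part of the setup.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    # Split the address on dots into the positional parameters $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Number of addresses in an inclusive range (first address, last address).
pool_size() {
    echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

echo "old pool: $(pool_size 10.84.4.100 10.84.4.250) addresses"   # 151
echo "new pool: $(pool_size 10.84.4.2 10.84.7.254) addresses"     # 1021
```

So the old pool had room for 151 leases (more than the 104 currently handed out) and the new pool has room for 1021.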


005: 2022.07.23 07:40 - 12:50 | power outage in tent 5
------------------------------------------------------

There was a power outage in tent 5 taking down `sw-access02` and therefore also `ap-1a38` (tent 5) and `ap-2bbf` (tent 4).

**impact**: no access points and therefore no wifi in tent 4 and 5. Maybe some clients roamed to a different tent.

**problems accessing the equipment**:
Currently every visit from Freifunkas needs to be coordinated with the object management of the facility.
This is fine for scheduled maintenance but not feasible for incident response and often leads to discussions and inaccessible equipment (which is totally understandable).
Maybe we can "whitelist" certain people at the facility so they can always access the equipment without further authorization.

**Update 2022.07.28**: The facility management created id cards for `@hirnpfirsich`, `@katzenparadoxon` and `@martin`.


006: 2022.07.25 01:00 | (maintenance) os upgrades
-------------------------------------------------

OS upgrades of non-customer-facing machines/services

**impact**: short downtime of the monitoring

**upgrades**:
* `monitoring01`:   `apt update && apt dist-upgrade -y && reboot` (downtime: 00:58 - 00:59)
* `hyper01`:        `apt update && apt dist-upgrade -y && reboot` (downtime: 01:09 - 01:10)
* `eae-adp-jump01`: `syspatch && rcctl restart cron`


007: 2022.08.01 14:00 - 2022.08.15 14:15 | no internet access
-------------------------------------------------------------

Vodafone ended its free internet offering for refugee camps on 01.08.2022.

**impact**:
No internet access for ~2 weeks

**solution**:
`Saxonia Catering` entered into an internet contract with Vodafone.


008: 2022.08.13 ~13:30 | (maintenance) add backoffice wifi
----------------------------------------------------------

The facility management asked us if we could build a backoffice wifi that is inaccessible from the rest of the network.

* vlan: `8`
* subnet: `10.84.8.0/24` (gateway: `10.84.8.1`)
* wifi ssid: `GU Deutscher Platz Backoffice`
* wifi password: `wifi/GU_Deutscher_Platz_Backoffice` in `pass`

**impact**:
* `gw-core01` and `sw-access0{1,2}` only got reloaded (so no downtime)
* on the `ap-XXXX`s networking was restarted, so the wifi was unavailable for a few seconds
* either way there was no upstream internet connectivity at this time (see incident `007` for details)
* therefore impact "calculations" are irrelevant

**changes**:
* `sw-access0{1,2}`:
```
root@sw-access01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
	option device 'switch'
	option vlan '8'
	option ports 'lan1:t lan2:t lan3:t lan4:t lan5:t lan6:t lan7:t lan8:t'
EOF
root@sw-access01:~# /etc/init.d/network reload
```

* `gw-core01`:
```
root@gw-core01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
        option vlan '8'
        option device 'switch'
        list ports 'eth2:t'
        list ports 'eth3:t'
        list ports 'eth4:t'

config interface 'backoffice'
        option device 'switch.8'
        option proto 'static'
        option ipaddr '10.84.8.1'
        option netmask '255.255.255.0'
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
        option name             backoffice
        list   network          'backoffice'
        option input            REJECT
        option output           ACCEPT
        option forward          REJECT

config forwarding
        option src              backoffice
        option dest             wan

config rule
        option name             BACKOFFICE_Allow-DHCP
        option src              backoffice
        option proto            udp
        option dest_port        67-68
        option target           ACCEPT
        option family           ipv4

config rule
        option name             BACKOFFICE_Allow-DNS
        option src              backoffice
        option proto            udp
        option dest_port        53
        option target           ACCEPT
        option family           ipv4
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/dhcp << EOF
config dhcp 'backoffice'
        option interface 'backoffice'
        option start '100'
        option limit '150'
        option leasetime '12h'
        option dhcpv4 'server'
        option dhcpv6 'server'
        option ra 'server'
        option ra_slaac '1'
        list ra_flags 'managed-config'
        list ra_flags 'other-config'
EOF
root@gw-core01:~#
root@gw-core01:~# /etc/init.d/network reload
root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
```
* `ap-XXXX`: see `playbook_provision_accesspoints.yml`
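
For reference, the `start` and `limit` options in the `dhcp` section above translate into a lease range as follows. This is a sketch assuming the usual OpenWrt semantics (`start` is the offset from the network address, `limit` is the number of addresses in the pool). The helper `dhcp_range` is made up for illustration and only works for a `/24`.

```shell
#!/bin/sh
# Print the first and last leasable address of a /24.
# $1 = first three octets, $2 = option start, $3 = option limit
dhcp_range() {
    echo "$1.$2 - $1.$(( $2 + $3 - 1 ))"
}

# Values taken from the backoffice dhcp section above.
dhcp_range 10.84.8 100 150   # 10.84.8.100 - 10.84.8.249
```

So under this assumption backoffice clients get addresses from `10.84.8.100` to `10.84.8.249`.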