Collection of all incidents
===========================
2022.06.30 17:00 - 18:00 | power issues on gw-core01
----------------------------------------------------
**issue**:
The protective cap of the power strip yanked the power supply of `gw-core01` out of its socket.
Therefore `gw-core01` had no power.
**solution**:
Taped the protective cap of the power strip back down and reinserted the power supply.
**impact**:
No internet access for 2 hours
2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5
-----------------------------------------------------------
### issue
A resident in tent 5 reported slow internet speeds. I do not have more information.
While trying to check the ap's logs I noticed that `ap-ac7c` is very slow over ssh and hangs/freezes a lot.
Rebooting did not solve the problem.
### cause
Unknown
I checked the ap in person the next day. I tested the ap with a different lan cable on a different switch port.
The issues I had noticed the night before were not reproducible.
But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.
### solution
_01.07.2022 ~ 03:00_ (short-term):
After noticing the issue myself I tried rebooting the ap.
Unfortunately that did not solve the problem.
To spare the clients from connecting to a bonkers ap, I disabled poe for the switch port to take the ap offline:
```
root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart
```
_01.07.2022 ~ 19:00_ (long-term):
I could not reproduce the issue in person. To be on the safe side I replaced both the short patch cable (connecting the ap to the switch) and the ap:
`ap-ac7c -> ap-1a38`.
Afterwards I re-enabled poe on the corresponding switch port.
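The re-enable step mirrors the disable sequence from the short-term fix (a sketch, assuming the port still sits at index `@port[1]` and that the `poe` init script supports `reload`):

```shell
# Sketch: re-enable poe for the replaced ap (assumes the same port index).
uci set poe.@port[1].enable=1
uci commit poe
# reload instead of restart so the other poe ports keep power
/etc/init.d/poe reload
```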
### impact
* `2022.06.30 12:00 - 2022.07.01 03:30`: (probably) unreliable wifi for clients connected to `ap-ac7c`
* `2022.07.01 03:30 - 2022.07.01 18:30`: bad signal strength to clients in and around tent 5
### notes
While disabling poe on the port connecting `ap-ac7c` I restarted the `poe` service.
That resulted in all ports shortly dropping power.
Therefore I also accidentally rebooted `ap-2bbf`.
Next time I'll just reload the service (shame on me).
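For reference, the difference between the two service actions (a sketch; whether `reload` is actually implemented depends on the `poe` init script):

```shell
# restart = stop + start: the poe daemon briefly drops power on every port
/etc/init.d/poe restart
# reload = reapply the configuration in place: unaffected ports keep power
/etc/init.d/poe reload
```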
### logs
This was my test showing that ssh was slow and froze a lot on `ap-ac7c`.
good ap:
```
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:04:43 up 1 day, 6:21, load average: 0.01, 0.02, 0.00
real 0m1.438s
user 0m0.071s
sys 0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:17:49 up 1 day, 6:34, load average: 0.00, 0.01, 0.00
real 0m1.924s
user 0m0.070s
sys 0m0.010s
```
bad ap:
```
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:05:00 up 1 day, 6:33, load average: 0.01, 0.08, 0.03
real 0m29.526s
user 0m0.070s
sys 0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:06:22 up 1 day, 6:34, load average: 0.00, 0.06, 0.03
real 1m15.379s
user 0m0.081s
sys 0m0.015s
user@freifunk-admin:~$
```
2022.07.03 00:30 | (maintenance) rolling wifi channel updates
-------------------------------------------------------------
We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.
Therefore I just did that (commit: 41752ef9bdfe0041359e09d08e107d330f10fcf2).
I ran ansible with `--forks=1` to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).
**impact**:
Clients either "roamed" to a different ap (no fast roaming, so short interruptions) or waited until the original ap came back online.
No client saw more than 10-15 seconds of service interruption.
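The rolling update boils down to a serialized ansible run; the playbook name and host pattern below are hypothetical:

```shell
# --forks=1 makes ansible configure one host at a time, so at most
# one ap's radios are down at any given moment.
ansible-playbook site.yml --forks=1 --limit 'ap-*'
```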
2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients
--------------------------------------------------------------------
The dnsmasq instance for the client network (`10.84.4.0/22`) only used the dhcp pool `10.84.4.100 - .250`.
To be able to actually assign the full `/22` to clients I changed the pool to `10.84.4.2 - 10.84.7.254`.
Afterwards I reloaded `dnsmasq` on `gw-core01`.
**impact**: none
Currently `dnsmasq` has handed out 104 leases, so we presumably never ran out of IPs in the old pool.
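As a sanity check on the pool sizes (pure arithmetic; not taken from the live config):

```shell
#!/bin/sh
# Lease capacity of the old vs. the new dhcp pool.
# old: 10.84.4.100 - 10.84.4.250
# new: 10.84.4.2   - 10.84.7.254 (nearly the whole /22)
old=$((250 - 100 + 1))
new=$(( (7 - 4) * 256 + (254 - 2 + 1) ))
echo "old pool: $old leases, new pool: $new leases"
# -> old pool: 151 leases, new pool: 1021 leases
```

With 104 active leases the old pool still had headroom, which matches the "impact: none" assessment.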