Collection of all incidents

2022.06.30 17:00 - 18:00 | power issues on gw-core01

issue:

The protective cap of the power strip pulled the power supply of gw-core01 out of its socket, so gw-core01 lost power.

solution:

Tape back the protective cap of the power strip and reinsert the power supply

impact:

No internet access for 2 hours

2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5

issue

A resident of tent 5 reported slow internet speeds; I have no further details. While trying to check the logs of the ap, I noticed that ap-ac7c was very slow and hung/froze a lot over ssh.

Rebooting did not solve the problem.

cause

Unknown

I checked the ap in person the next day and tested it with a different lan cable on a different switch port. The issues I had noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

solution

2022.07.01 ~03:00 (short-term): After noticing the issue myself I tried rebooting the ap. Unfortunately that did not solve the problem. To spare clients from connecting to a bonkers ap, I disabled poe on its switch port to take the ap offline:

root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart

2022.07.01 ~19:00 (long-term): I could not reproduce the issue in person. To be on the safe side I replaced both the short patch cable (connecting the ap to the switch) and the ap itself: ap-ac7c -> ap-1a38. Afterwards I re-enabled poe on the corresponding switch port.

impact

  • 2022.06.30 12:00 - 2022.07.01 03:30: (probably) unreliable wifi for clients connected to ap-ac7c
  • 2022.07.01 03:30 - 2022.07.01 18:30: bad signal strength to clients in and around tent 5

notes

While disabling poe on the port connecting ap-ac7c, I restarted the poe service. That caused all ports to briefly drop power, so I also accidentally rebooted ap-2bbf.

Next time I'll just reload the service (shame on me).
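
A minimal sketch of the reload variant mentioned in the note above. It assumes that the poe init script accepts `reload` and that a reload applies the new port setting without power-cycling the other ports; neither is verified here. The `run` and `set_poe` helpers and the `DRY_RUN` switch are made up for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: toggle poe on one switch port and reload (not restart) the service.
# Assumption: "reload" leaves the other ports powered; check this on a spare
# port before relying on it.

# Print the commands by default; set DRY_RUN=0 on the switch to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_poe INDEX STATE
#   INDEX: port index as shown by "uci show poe" (lan2 was index 1 above)
#   STATE: 0 to cut power, 1 to enable it again
set_poe() {
    run uci set "poe.@port[$1].enable=$2"
    run uci commit poe
    run /etc/init.d/poe reload
}

set_poe 1 0
```

Running it with `DRY_RUN=0` on sw-access02 would execute the three commands instead of printing them.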

logs

This was my test to show that ssh to ap-ac7c was slow and froze a lot.

good ap:

user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:04:43 up 1 day,  6:21,  load average: 0.01, 0.02, 0.00

real	0m1.438s
user	0m0.071s
sys	0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:17:49 up 1 day,  6:34,  load average: 0.00, 0.01, 0.00

real	0m1.924s
user	0m0.070s
sys	0m0.010s

bad ap:

user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:05:00 up 1 day,  6:33,  load average: 0.01, 0.08, 0.03

real	0m29.526s
user	0m0.070s
sys	0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:06:22 up 1 day,  6:34,  load average: 0.00, 0.06, 0.03

real	1m15.379s
user	0m0.081s
sys	0m0.015s
user@freifunk-admin:~$
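
The manual check above could be repeated for several aps with a small loop. This is only a sketch: the `measure` helper and the `APS` variable are hypothetical, and it assumes a `date` that supports `+%s` (GNU, BusyBox and BSD do).

```shell
#!/bin/sh
# measure LABEL CMD...: run CMD once and print "LABEL <seconds>s".
# Whole seconds are enough here: the good ap answered in under 2 s,
# the bad one needed roughly 30 s and more.
measure() {
    label="$1"
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$label $((end - start))s"
}

# APS is a hypothetical variable holding the aps to probe, for example:
#   APS="ap-2bbf ap-ac7c" sh ssh-timing.sh
# With APS unset the loop does nothing.
for ap in ${APS:-}; do
    # BatchMode=yes makes ssh fail instead of waiting for a password prompt.
    measure "$ap" ssh -o BatchMode=yes "$ap" uptime
done
```

The exit status of the ssh call is ignored, so an unreachable ap shows up only through its duration.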

2022.07.03 00:30 | (maintenance) rolling wifi channel updates

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

Therefore I just did that (commit: 41752ef9bd).

I ran ansible with --forks=1 to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).

impact: Clients either "roamed" to a different ap (no fast roaming, so with short interruptions) or waited until the original ap came back online. No client saw more than 10-15 seconds of service interruption.
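
A rough sketch of such a rolling run. The playbook name `site.yml` and the inventory group `aps` are placeholders, not taken from the repository; `--forks` and `--limit` are regular `ansible-playbook` options. The script only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to the real playbook and inventory group.
AP_PLAYBOOK="${AP_PLAYBOOK:-site.yml}"
AP_GROUP="${AP_GROUP:-aps}"

# With --forks=1 ansible works on a single host at a time, so only one ap
# should be reconfiguring its radios at any moment.
rollout_cmd() {
    echo "ansible-playbook --forks=1 --limit $AP_GROUP $AP_PLAYBOOK"
}

rollout_cmd
```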