
Collection of all incidents

2022.06.30 17:00 - 18:00 | power issues on gw-core01

issue:

The protective cap of the power strip yanked the power supply of gw-core01 out of its socket. Therefore gw-core01 had no power.

solution:

Taped back the protective cap of the power strip and reinserted the power supply.

impact:

No internet access for 2 hours

2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5

issue

A resident in tent 5 reported slow internet speeds; I do not have more information than that. While trying to check the ap's logs I noticed that ap-ac7c is very slow and hangs/freezes a lot over ssh.

Rebooting did not solve the problem.

cause

Unknown

I checked the ap in person the next day. I tested the ap with a different lan cable on a different switch port. The issues I had noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

solution

2022.07.01 ~ 03:00 (short-term): After noticing the issue myself I tried rebooting the ap. Unfortunately that did not solve the problem. To spare clients from connecting to a bonkers ap I disabled poe on the switch port to take the ap offline:

root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart

2022.07.01 ~ 19:00 (long-term): I could not reproduce the issue in person. To be on the safe side I replaced the short patch cable (connecting the ap to the switch) and the ap itself: ap-ac7c -> ap-1a38. Afterwards I re-enabled poe on the corresponding switch port.
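
For reference, re-enabling poe is just the inverse of the commands above (a sketch, assuming the replacement ap sits on the same port, lan2):

uci set poe.@port[1].enable=1   # same port index as in the disable commands above (assumed)
uci commit poe
/etc/init.d/poe reload          # reload instead of restart, see the notes below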

impact

  • 2022.06.30 12:00 - 2022.07.01 03:30: (probably) unreliable wifi for clients connected to ap-ac7c
  • 2022.07.01 03:30 - 2022.07.01 18:30: bad signal strength to clients in and around tent 5

notes

While disabling poe on the port connecting ap-ac7c I restarted the poe service. That resulted in all ports briefly dropping power. Therefore I also accidentally rebooted ap-2bbf.

Next time I'll just reload the service (shame on me).
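
The gentler variant would look like this (a sketch; whether this build's poe init script implements a separate reload action is an assumption):

/etc/init.d/poe reload   # re-apply the changed port config without power-cycling unrelated ports (assumed behaviour)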

logs

This was my test showing that ssh was slow/froze a lot on ap-ac7c.

good ap:

user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:04:43 up 1 day,  6:21,  load average: 0.01, 0.02, 0.00

real	0m1.438s
user	0m0.071s
sys	0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:17:49 up 1 day,  6:34,  load average: 0.00, 0.01, 0.00

real	0m1.924s
user	0m0.070s
sys	0m0.010s

bad ap:

user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:05:00 up 1 day,  6:33,  load average: 0.01, 0.08, 0.03

real	0m29.526s
user	0m0.070s
sys	0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:06:22 up 1 day,  6:34,  load average: 0.00, 0.06, 0.03

real	1m15.379s
user	0m0.081s
sys	0m0.015s
user@freifunk-admin:~$

2022.07.03 00:30 | (maintenance) rolling wifi channel updates

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

Therefore I just did that (commit: 41752ef9bd).

I ran ansible with --forks=1 to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).
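
The invocation looked roughly like this (a sketch; the playbook and inventory names are placeholders, not the actual repo layout):

# reconfigure all accesspoints, one host at a time
ansible-playbook -i inventory accesspoints.yml --forks=1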

impact: Clients either "roamed" to a different ap (no fast roaming, so short interruptions) or waited until their original ap came back online. For every client the service interruption was no more than 10-15 seconds.

2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients

The dnsmasq instance for the client network (10.84.4.0/22) only used the dhcp pool 10.84.4.100 - 10.84.4.250.

To be able to actually assign the full /22 to clients I changed the pool to 10.84.4.2 - 10.84.7.254.

Afterwards I reloaded dnsmasq on gw-core01.
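
In plain dnsmasq terms the change boils down to something like this (a sketch; the lease time is just an example and the config on gw-core01 may be generated differently):

# old pool: ~150 addresses
dhcp-range=10.84.4.100,10.84.4.250,255.255.252.0,12h
# new pool: (almost) the full /22
dhcp-range=10.84.4.2,10.84.7.254,255.255.252.0,12h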

impact: none

Currently dnsmasq has handed out 104 leases, so we presumably never ran out of IPs in the old pool.
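
The lease count can be read from dnsmasq's lease file (a sketch, assuming the OpenWrt default path; other distros typically use /var/lib/misc/dnsmasq.leases):

wc -l < /tmp/dhcp.leases   # one line per currently active lease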

2022.07.23 07:40 - 12:50 | power outage in tent 5

There was a power outage in tent 5 taking down sw-access02 and therefore also ap-1a38 (tent 5) and ap-2bbf (tent 4).

impact: no access points and therefore no wifi in tents 4 and 5. Maybe some clients roamed to a different tent.

problems accessing the equipment: Currently every visit by Freifunkas needs to be coordinated with the facility management. This is fine for scheduled maintenance but not doable for incident response, and it often leads to discussions and inaccessible equipment (which is totally understandable). Maybe we can "whitelist" certain people at the facility so they can always access the equipment without further authorization.

Update 2022.07.28: The facility management created id cards for @hirnpfirsich, @katzenparadoxon and @martin

2022.07.25 01:00 | (maintenance) os upgrades

OS upgrades of non-customer-facing machines/services

impact: short downtime of the monitoring

upgrades:

  • monitoring01: apt update && apt dist-upgrade -y && reboot (downtime: 00:58 - 00:59)
  • hyper01: apt update && apt dist-upgrade -y && reboot (downtime: 01:09 - 01:10)
  • eae-adp-jump01: syspatch && rcctl restart cron