
Collection of all incidents

001: 2022.06.30 17:00 - 18:00 | power issues on gw-core01

issue:

The protective cap of the power strip yanked the power supply of gw-core01 out of its socket. Therefore gw-core01 had no power.

solution:

Tape back the protective cap of the power strip and reinsert the power supply

impact:

No internet access for 2 hours

002: 2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5

issue

A resident in tent 5 reported slow internet speeds; I do not have more information. While trying to check the logs on the ap I noticed that ap-ac7c is very slow and hangs/freezes a lot via ssh.

Rebooting did not solve the problem.

cause

Unknown

I checked the ap the next day in person. I tested the ap with a different lan cable on a different switch port. The issues I had noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

solution

01.07.2022 ~ 03:00 (short term): After noticing the issue myself I tried rebooting the ap. Unfortunately that did not solve the problem. To spare the clients from connecting to a bonkers ap I disabled poe on the switch port to take the ap offline:

root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart

01.07.2022 ~ 19:00 (long term): I could not reproduce the issue in person. To be on the safe side I replaced the short patch cable (connecting the ap to the switch) and the ap itself: ap-ac7c -> ap-1a38. Afterwards I re-enabled poe on the corresponding switch port.

impact

  • 2022.06.30 12:00 - 2022.07.01 03:30: (probably) unreliable wifi for clients connected to ap-ac7c
  • 2022.07.01 03:30 - 2022.07.01 18:30: bad signal strength for clients in and around tent 5

notes

While disabling poe on the port connecting ap-ac7c I restarted the poe service. That resulted in all ports briefly dropping power, so I also accidentally rebooted ap-2bbf.

Next time I'll just reload the service (shame on me).
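
For reference, a gentler version of the same change would look roughly like this (assuming the poe init script supports a reload action; the uci commands match the ones above):

root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe reload    # reload instead of restart (assuming the init script supports it), so the other ports keep power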

logs

This was my test to show that ssh was slow and froze a lot on ap-ac7c.

good ap:

user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:04:43 up 1 day,  6:21,  load average: 0.01, 0.02, 0.00

real	0m1.438s
user	0m0.071s
sys	0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:17:49 up 1 day,  6:34,  load average: 0.00, 0.01, 0.00

real	0m1.924s
user	0m0.070s
sys	0m0.010s

bad ap:

user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:05:00 up 1 day,  6:33,  load average: 0.01, 0.08, 0.03

real	0m29.526s
user	0m0.070s
sys	0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:06:22 up 1 day,  6:34,  load average: 0.00, 0.06, 0.03

real	1m15.379s
user	0m0.081s
sys	0m0.015s
user@freifunk-admin:~$

003: 2022.07.03 00:30 | (maintenance) rolling wifi channel updates

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

Therefore I just did that (commit: 41752ef9bd).

I ran ansible with --forks=1 to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).
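
The invocation was roughly the following (the inventory path is an assumption, the playbook name is borrowed from incident 008):

user@freifunk-admin:~$ ansible-playbook -i inventory --forks=1 playbook_provision_accesspoints.yml    # inventory path is an assumption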

impact: Either clients "roamed" (no fast roaming - short interruptions) to a different ap or waited till the original ap came back online. No client saw more than 10-15 seconds of service interruption.

004: 2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients

The dnsmasq instance for the client network (10.84.4.0/22) only used the dhcp pool 10.84.4.100 - .250.

To be able to actually assign the full /22 to clients I've changed the pool to 10.84.4.2 - 10.84.7.254.

Afterwards I've reloaded dnsmasq on gw-core01.
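
In uci terms the change boils down to something like this on gw-core01 (the dhcp section name 'clients' is an assumption; start and limit are offsets counted from the network address 10.84.4.0):

root@gw-core01:~# uci set dhcp.clients.start='2'        # 'clients' section name is an assumption
root@gw-core01:~# uci set dhcp.clients.limit='1021'     # offset 2 + 1020 = offset 1022 = 10.84.7.254
root@gw-core01:~# uci commit dhcp
root@gw-core01:~# /etc/init.d/dnsmasq reload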

impact: none

Currently dnsmasq has handed out 104 leases, so we presumably never ran out of ips in the old pool.

005: 2022.07.23 07:40 - 12:50 | power outage in tent 5

There was a power outage in tent 5 taking down sw-access02 and therefore also ap-1a38 (tent 5) and ap-2bbf (tent 4).

impact: no access points and therefore no wifi in tents 4 and 5. Maybe some clients roamed to a different tent.

problems accessing the equipment: Currently every visit by Freifunkas needs to be coordinated with the object management of the facility. This is fine for scheduled maintenance but not doable for incident response and often leads to discussions and inaccessible equipment (which is totally understandable). Maybe we can "whitelist" certain people at the facility so they can always access the equipment without further authorization.

Update 2022.07.28: The facility management created id cards for @hirnpfirsich, @katzenparadoxon and @martin

006: 2022.07.25 01:00 | (maintenance) os upgrades

OS upgrades of non-customer facing machines/services

impact: short downtime of the monitoring

upgrades:

  • monitoring01: apt update && apt dist-upgrade -y && reboot (downtime: 00:58 - 00:59)
  • hyper01: apt update && apt dist-upgrade -y && reboot (downtime: 01:09 - 01:10)
  • eae-adp-jump01: syspatch && rcctl restart cron

007: 2022.08.01 14:00 - 2022.08.15 14:15 | no internet access

Vodafone ended their free internet offering for refugee camps on 01.08.2022.

impact: No internet access for ~2 weeks

solution: Saxonia Catering entered into an internet contract with Vodafone

008: 2022.08.13 ~13:30 | (maintenance) add backoffice wifi

The facility management asked us if we could build a backoffice wifi that is inaccessible from the rest of the network.

  • vlan: 8
  • subnet: 10.84.8.0/24 (gateway 10.84.8.1)
  • wifi ssid: GU Deutscher Platz Backoffice
  • wifi password: wifi/GU_Deutscher_Platz_Backoffice in pass

impact:

  • on gw-core01 and sw-access0{1,2} the network config only got reloaded (so no downtime)
  • on the ap-XXXXs networking was restarted, so the wifi was unavailable for a few seconds
  • either way there was no upstream internet connectivity at this time (see incident 007 for details)
  • therefore impact "calculations" are irrelevant

changes:

  • sw-access0{1,2}:
root@sw-access01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
	option device 'switch'
	option vlan '8'
	option ports 'lan1:t lan2:t lan3:t lan4:t lan5:t lan6:t lan7:t lan8:t'
EOF
root@sw-access01:~# /etc/init.d/network reload
  • gw-core01:
root@gw-core01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
        option vlan '8'
        option device 'switch'
        list ports 'eth2:t'
        list ports 'eth3:t'
        list ports 'eth4:t'

config interface 'backoffice'
        option device 'switch.8'
        option proto 'static'
        option ipaddr '10.84.8.1'
        option netmask '255.255.255.0'
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
        option name             backoffice
        list   network          'backoffice'
        option input            REJECT
        option output           ACCEPT
        option forward          REJECT

config forwarding
        option src              backoffice
        option dest             wan

config rule
        option name             BACKOFFICE_Allow-DHCP
        option src              backoffice
        option proto            udp
        option dest_port        67-68
        option target           ACCEPT
        option family           ipv4

config rule
        option name             BACKOFFICE_Allow-DNS
        option src              backoffice
        option proto            udp
        option dest_port        53
        option target           ACCEPT
        option family           ipv4
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/dhcp << EOF
config dhcp 'backoffice'
        option interface 'backoffice'
        option start '100'
        option limit '150'
        option leasetime '12h'
        option dhcpv4 'server'
        option dhcpv6 'server'
        option ra 'server'
        option ra_slaac '1'
        list ra_flags 'managed-config'
        list ra_flags 'other-config'
EOF
root@gw-core01:~#
root@gw-core01:~# /etc/init.d/network reload
root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
  • ap-XXXX: see playbook_provision_accesspoints.yml

009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn

To help the refugee camp not get into legal trouble for providing free internet access (gotta love germany) we've put a vpn in place that launders the traffic from the public wifi through a vpn provider.

Only the clients network gets laundered. This is accomplished by using policy-based routing.

The vpn interface is put into its own routing table (20 - launder). Two ip rules steer the traffic from the clients network into the tunnel, with a failsafe: if the vpn connection dies, the second rule prohibits forwarding instead of letting traffic leak out through the normal wan interface.
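
To sanity check the setup, the resulting rules and the laundering table can be inspected on gw-core01 (commands only, output omitted here):

root@gw-core01:~# ip rule show
root@gw-core01:~# ip route show table launder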

changes:

root@gw-core01:~# echo "20 launder" >> /etc/iproute2/rt_tables
root@gw-core01:~# cat >> /etc/config/network << EOF
config interface 'wg1'
        option mtu 1350
        option proto 'wireguard'
        option private_key '[redacted]'
        list addresses '[redacted]'
        option ip4table 'launder'

config wireguard_wg1 'mullvad_fr'
        option public_key '[redacted]'
        option endpoint_host '[redacted]'
        option endpoint_port '51820'
        option route_allowed_ips '1'
        list allowed_ips '0.0.0.0/0'

config rule
        option in 'clients'
        option lookup 'launder'
        option priority 50

config rule
        option in 'clients'
        option action prohibit
        option priority 51
EOF
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
        option name launder
        list network wg1
        option input REJECT
        option output ACCEPT
        option forward REJECT
        option masq 1
        option mtu_fix 1

config forwarding
        option src clients
        option dest launder
EOF
root@gw-core01:~# /etc/init.d/network restart
root@gw-core01:~# /etc/init.d/firewall restart

impact:

  • short service interruptions for public wifi clients at around ~03:00 lasting a few minutes

010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01

gw-core01 rebooted randomly. The other devices on the same circuit (e.g. sw-access01) did not reboot, so it is not an issue with the circuit itself.

After calling the facility management they confirmed that the power supply is not correctly seated. The cause for the misalignment is still the protective cap of the power strip (see incident 001 for details).

The facility management is either going to remove the protective cap or disable the latching mechanism with zip ties.

Update: Now the protective cap is held back by zip ties and should finally stop interfering with the power supply

impact:

  • dhcp and routing downtime
  • for a few minutes on 2022.08.28 13:00
  • for about an hour on 2022.08.29 09:10 (till 10:30)

monitoring enhancements:

  • alert on rebooted nodes (via node_boot_time_seconds)
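
A minimal rule for this could look like the following on monitoring01 (prompt, rule file path and threshold are assumptions; the file also has to be listed under rule_files in prometheus.yml):

root@monitoring01:~# cat >> /etc/prometheus/rules/reboots.yml << 'EOF'
groups:
  - name: reboots
    rules:
      - alert: NodeRebooted
        # node_boot_time_seconds jumps forward after a reboot, so this
        # fires for ~15 minutes after every (re)boot of a node
        expr: time() - node_boot_time_seconds < 900
        labels:
          severity: warning
        annotations:
          summary: '{{ $labels.instance }} rebooted recently'
EOF
root@monitoring01:~# systemctl reload prometheus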

011: 2022.08.31 01:06 - 10:00 | public wifi lost upstream vpn connection

The wireguard vpn (which launders the traffic of the public wifi) did not handshake for 9 hours:

root@gw-core01:~# date && wg
Thu Sep  1 07:55:49 UTC 2022
[...]
interface: wg1
  public key: [redacted]
  private key: (hidden)
  listening port: 48603

peer: [redacted]
  endpoint: [redacted]:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 8 hours, 49 minutes, 4 seconds ago
  transfer: 201.20 GiB received, 16.12 GiB sent

impact: the public wifi GU Deutscher Platz had no internet access for 9 hours

solution: add persistent keepalive statement to wg1 and restart tunnel (via /etc/init.d/network restart)

discussion: Any traffic traversing the interface should trigger a new handshake. Therefore I do not really understand why there was no handshake for 9 hours.

Root cause theory: The kernel decided that the default route at that interface was unreachable, dropped traffic and therefore stopped the handshake from triggering again.

I added a persistent_keepalive to the tunnel to stop this from happening again (if my theory for the root cause is correct).

changes:

  • 09:56:10 /etc/init.d/network restart (to kickstart new handshake)
  • 09:58:00 add persistent_keepalive statement to wg1 (longterm fix)
  • 10:00:00 /etc/init.d/network restart (restart again to apply wg changes)
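
The persistent_keepalive change itself boils down to a one-liner against the peer section from incident 009 (the 15 second interval matches what the wg output in incident 013 later shows):

root@gw-core01:~# uci set network.mullvad_fr.persistent_keepalive='15'    # interval taken from the wg output in incident 013
root@gw-core01:~# uci commit network
root@gw-core01:~# /etc/init.d/network restart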

monitoring enhancements:

  • monitor connectivity for the public wifi (blackbox exporter in client network) and create alerting rules (rough sketch below)
  • prometheus instance on eae-adp-jump01 to get alerts if the upstream in the facility is down
  • monitor wireguard state (probably needs a custom lua exporter)
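
For the connectivity check, the probe job in prometheus could look roughly like this (exporter address, module name and probe target are assumptions, see the comments):

scrape_configs:
  - job_name: 'blackbox_public_wifi'
    metrics_path: /probe
    params:
      module: [icmp]               # ping module defined in blackbox.yml (assumption)
    static_configs:
      - targets: ['9.9.9.9']       # some host "on the internet" to probe (assumption)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.84.4.53:9115    # blackbox exporter inside the client network (address is an assumption)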

012: 2022.09.01 - 2022.09.08 | ongoing reboots of gw-core01

Unfortunately zip tying back the protective cap of the power strip did not stop the random reboots of gw-core01. See incidents 001 and 010 for details.

Either the power supply or the device itself is broken.

solution:

  • replace power supply
  • plug power supply into "normal" socket
  • replace device itself

updates:

  • 2022.09.02 ~20:00: I tried replacing the power supply but nobody could let me into the facilities.
  • 2022.09.03 ~14:40: Successfully replaced the power supply. While doing so the new power supply slipped out of its socket multiple times. It seems like the sockets in the power strip are deeper than normal. Maybe the old supply was not broken but simply slipped out because the strip is weird? Some zip ties are now holding the new supply in place.
  • 2022.09.04 ~14:50: gw-core01 rebooted again => put psu of gw-core01 into a "normal" power strip
  • 2022.09.06 ~11:00: put the psu of gw-core01 into a "normal" power strip
  • 2022.09.07 ~09:40: gw-core01 rebooted again => replace device
  • 2022.09.08 ~15:40: replaced gw-core01 with an Ubiquiti Edge Router X SFP

impact: routing outage for a few minutes at each of the following times:

  • 2022.09.01 17:24, 17:47
  • 2022.09.02 14:31, 18:10
  • 2022.09.03 ~14:40
  • 2022.09.04 ~14:50
  • 2022.09.07 ~09:40
  • 2022.09.08 ~15:40

router replacement: We replaced the Ubiquiti Edge Router X with a Ubiquiti Edge Router X SFP. Those devices are nearly identical, except that the new router has an additional sfp port and can deliver passive PoE on all ports.

After building a custom openwrt image with garet (profile: ubiquiti-edgerouter-x-sfp_21.02.3, commit: 6f7c75c8064e7e0241cdba8f87efc9492dd860d0) we transferred the config to the new device.

There are custom gpio mappings in /etc/config/system which differ between these devices, so they were edited accordingly.
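
A typical way to do the config transfer on openwrt is a sysupgrade backup/restore, roughly like this (file name is an assumption):

root@gw-core01:~# sysupgrade -b /tmp/gw-core01-config.tar.gz    # on the old Edge Router X
root@gw-core01:~# sysupgrade -r /tmp/gw-core01-config.tar.gz    # on the new Edge Router X SFP, after flashing the garet image
root@gw-core01:~# reboot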

013: 2022.09.07 10:17 - 11:47 | public wifi vpn blackholed traffic

The public wifi had no upstream internet connectivity from 10:17 till 11:47.

This time the wireguard interface was up and connected (i.e. latest handshake < 120 seconds):

root@gw-core01:~# wg
interface: wg1
  public key: Sqz0LEJVmgNlq6ZgmR9YqUu3EcJzFw0bJNixGUV9Nl8=
  private key: (hidden)
  listening port: 36986

peer: uC0C1H4zE6WoDjOq65DByv1dSZt2wAv6gXQ5nYOLiQM=
  endpoint: 185.209.196.70:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 1 minute, 2 seconds ago
  transfer: 5.84 GiB received, 679.59 MiB sent
  persistent keepalive: every 15 seconds

interface: wg0
  public key: 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
  private key: (hidden)
  listening port: 51820

peer: 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
  preshared key: (hidden)
  endpoint: 162.55.53.85:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 1 minute, 31 seconds ago
  transfer: 5.48 MiB received, 4.51 MiB sent
  persistent keepalive: every 15 seconds

After restarting all interfaces (via /etc/init.d/network restart) traffic started flowing again.

impact: no internet on the public wifi from 10:17 till 11:47

reason: unknown. My guess is that mullvad had an oopsie.

014: 2022.09.08 15:07 | (maintenance) additional ap in tent 5

The residents of tent 5 complained about bad wifi performance.

root cause: A speedtest in the office showed ~60 Mbit/s. The same test in tent 5 only got ~15 Mbit/s (both under normal network load). Additionally the monitoring showed that the ap in tent 5 had the most connected clients (~35) while other tents only had 15 to 20.

Therefore my assumption was that the ap could not keep up with the number of connected clients. This is unscientific, I know.

solution: We installed an additional ap (ap-8f39) on the opposite side of the tent to distribute the load more evenly. The network cable for ap-8f39 could be terminated right inside tent 5 because sw-access02 also lives there. Because we did not want to crawl behind the separated rooms inside the tent we decided to route the cable for ap-8f39 along the outside.

015: 2022.09.08 18:45 - ??:?? | gw-core01 unreachable

gw-core01 lost its wireguard connection to eae-adp-jump01 at 18:45 and has been unreachable since. Either Vodafone is down or the new router died on us.

eae-adp-jump01# date && ospfctl show neigh && ifconfig wg
Thu Sep  8 20:44:05 CEST 2022
ID              Pri State        DeadTime Address         Iface     Uptime
10.84.8.1       1   DOWN/P2P     01:56:07 10.84.254.1     wg0       -
192.168.0.2     1   DOWN/P2P     05:06:46 10.84.254.1     wg0       -
wg0: flags=80c3<UP,BROADCAST,RUNNING,NOARP,MULTICAST> mtu 1350
	index 5 priority 0 llprio 3
	wgport 51820
	wgpubkey 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
	wgpeer 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
		wgpsk (present)
		wgendpoint 109.42.241.116 9749
		tx: 1858427748, rx: 836384108
		last handshake: 7036 seconds ago
		wgaip 0.0.0.0/0
	groups: wg
	inet 10.84.254.0 netmask 0xfffffffe
eae-adp-jump01#

impact: no routing into the internet