eae-am-deutschen-platz/documentation/INCIDENTS.md


Collection of all incidents

001: 2022.06.30 17:00 - 18:00 | power issues on gw-core01

issue:

The protective cap of the power strip yanked the power supply of gw-core01 out of its socket. Therefore gw-core01 had no power.

solution:

Tape back the protective cap of the power strip and reinsert the power supply

impact:

No internet access for 2 hours

002: 2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5

issue

A resident in tent 5 reported slow internet speeds. I do not have more information. While trying to check the logs of the ap I noticed that ap-ac7c is very slow and hangs/freezes a lot via ssh.

Rebooting did not solve the problem.

cause

Unknown

I've checked the ap the next day in person. I tested the ap with a different lan cable on a different switch port. The issues I noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

solution

01.07.2022 ~ 03:00 (short-term): After noticing the issue myself I tried rebooting the ap. Unfortunately that did not solve the problem. To spare the clients from connecting to a bonkers ap I disabled poe on the switch port to take the ap offline:

root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart

01.07.2022 ~ 19:00 (long-term): I could not reproduce the issue in person. To be on the safe side I replaced the short patch cable (connecting the ap to the switch) and the ap itself: ap-ac7c -> ap-1a38. Afterwards I reenabled poe on the corresponding switch port.

impact

  • 2022.06.30 12:00 - 2022.07.01 03:30: (probably) unreliable wifi for clients connected to ap-ac7c
  • 2022.07.01 03:30 - 2022.07.01 18:30: bad signal strength to clients in and around tent 5

notes

While disabling poe on the port connecting ap-ac7c I restarted the poe service. That resulted in all ports briefly dropping power. Therefore I also accidentally rebooted ap-2bbf.

Next time I'll just reload the service (shame on me).
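
For the record, the gentler sequence would have been the following (a sketch; on procd-based services reload falls back to a restart when the service does not implement it, so this only helps if the poe service supports reloading):

root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe reload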

logs

This was my test to show that ssh was slow and froze a lot on ap-ac7c.

good ap:

user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:04:43 up 1 day,  6:21,  load average: 0.01, 0.02, 0.00

real	0m1.438s
user	0m0.071s
sys	0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
 01:17:49 up 1 day,  6:34,  load average: 0.00, 0.01, 0.00

real	0m1.924s
user	0m0.070s
sys	0m0.010s

bad ap:

user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:05:00 up 1 day,  6:33,  load average: 0.01, 0.08, 0.03

real	0m29.526s
user	0m0.070s
sys	0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
 01:06:22 up 1 day,  6:34,  load average: 0.00, 0.06, 0.03

real	1m15.379s
user	0m0.081s
sys	0m0.015s
user@freifunk-admin:~$

003: 2022.07.03 00:30 | (maintenance) rolling wifi channel updates

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

Therefore I just did that (commit: 41752ef9bd).

I ran ansible with --forks=1 to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).
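
For reference, a sketch of the invocation (assuming the channel plan from 41752ef9bd is rolled out via the accesspoint provisioning playbook mentioned in incident 008):

user@freifunk-admin:~$ ansible-playbook playbook_provision_accesspoints.yml --forks=1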

impact: Either clients "roamed" (no fast roaming - short interruptions) to a different ap or waited till the original ap came back online. No client saw more than 10-15 seconds of service interruption.

004: 2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients

The dnsmasq instance for the client network (10.84.4.0/22) only used the dhcp pool 10.84.4.100 - .250.

To be able to actually assign the full /22 to clients I've changed the pool to 10.84.4.2 - 10.84.7.254.

Afterwards I've reloaded dnsmasq on gw-core01.
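
A sketch of the corresponding change in /etc/config/dhcp, assuming the dhcp section for the client network is named 'clients' (start and limit count from the network address of the /22, so 2 and 1021 cover 10.84.4.2 - 10.84.7.254):

root@gw-core01:~# uci set dhcp.clients.start='2'
root@gw-core01:~# uci set dhcp.clients.limit='1021'
root@gw-core01:~# uci commit dhcp
root@gw-core01:~# /etc/init.d/dnsmasq reload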

impact: none

Currently dnsmasq has handed out 104 leases, so we presumably never ran out of ips in the old pool.
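
On OpenWrt the current lease count can be read straight from dnsmasq's default lease file:

root@gw-core01:~# wc -l < /tmp/dhcp.leases
104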

005: 2022.07.23 07:40 - 12:50 | power outage in tent 5

There was a power outage in tent 5 taking down sw-access02 and therefore also ap-1a38 (tent 5) and ap-2bbf (tent 4).

impact: no accesspoints and therefore no wifi in tents 4 and 5. Maybe some clients roamed to a different tent.

problems accessing the equipment: Currently every visit from Freifunkas needs to be coordinated with the object management of the facility. This is fine for scheduled maintenance but not doable for incident response and often leads to discussions and inaccessible equipment (which is totally understandable). Maybe we can "whitelist" certain people at the facility so they can always access the equipment without further authorization.

Update 2022.07.28: The facility management created id cards for @hirnpfirsich, @katzenparadoxon and @martin

006: 2022.07.25 01:00 | (maintenance) os upgrades

OS upgrades of non-customer facing machines/services

impact: short downtime of the monitoring

upgrades:

  • monitoring01: apt update && apt dist-upgrade -y && reboot (downtime: 00:58 - 00:59)
  • hyper01: apt update && apt dist-upgrade -y && reboot (downtime: 01:09 - 01:10)
  • eae-adp-jump01: syspatch && rcctl restart cron

007: 2022.08.01 14:00 - 2022.08.15 14:15 | no internet access

Vodafone's free internet offering for refugee camps expired on 01.08.2022.

impact: No internet access for ~2 weeks

solution: Saxonia Catering entered into an internet contract with Vodafone

008: 2022.08.13 ~13:30 | (maintenance) add backoffice wifi

The facility management asked us if we could build a backoffice wifi that is inaccessible from the rest of the network.

  • vlan: 8
  • subnet: 10.84.8.0/24
  • wifi ssid: GU Deutscher Platz Backoffice
  • wifi password: wifi/GU_Deutscher_Platz_Backoffice in pass

impact:

  • gw-core01 and sw-access0{1,2} only got reloaded (so no downtime)
  • on ap-XXXXs networking restarted so the wifi was unavailable for a few seconds
  • either way there was no upstream internet connectivity at this time (see incident 007 for details)
  • therefore impact "calculations" are irrelevant

changes:

  • sw-access0{1,2}:
root@sw-access01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
	option device 'switch'
	option vlan '8'
	option ports 'lan1:t lan2:t lan3:t lan4:t lan5:t lan6:t lan7:t lan8:t'
EOF
root@sw-access01:~# /etc/init.d/network reload
  • gw-core01:
root@gw-core01:~# cat >> /etc/config/network << EOF
config bridge-vlan 'backoffice_vlan'
        option vlan '8'
        option device 'switch'
        list ports 'eth2:t'
        list ports 'eth3:t'
        list ports 'eth4:t'

config interface 'backoffice'
        option device 'switch.8'
        option proto 'static'
        option ipaddr '10.84.8.1'
        option netmask '255.255.255.0'
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
        option name             backoffice
        list   network          'backoffice'
        option input            REJECT
        option output           ACCEPT
        option forward          REJECT

config forwarding
        option src              backoffice
        option dest             wan

config rule
        option name             BACKOFFICE_Allow-DHCP
        option src              backoffice
        option proto            udp
        option dest_port        67-68
        option target           ACCEPT
        option family           ipv4

config rule
        option name             BACKOFFICE_Allow-DNS
        option src              backoffice
        option proto            udp
        option dest_port        53
        option target           ACCEPT
        option family           ipv4
EOF
root@gw-core01:~#
root@gw-core01:~# cat >> /etc/config/dhcp << EOF
config dhcp 'backoffice'
        option interface 'backoffice'
        option start '100'
        option limit '150'
        option leasetime '12h'
        option dhcpv4 'server'
        option dhcpv6 'server'
        option ra 'server'
        option ra_slaac '1'
        list ra_flags 'managed-config'
        list ra_flags 'other-config'
EOF
root@gw-core01:~#
root@gw-core01:~# /etc/init.d/network reload
root@gw-core01:~# /etc/init.d/firewall restart
root@gw-core01:~# /etc/init.d/dnsmasq reload
  • ap-XXXX: see playbook_provision_accesspoints.yml

009: 2022.08.23 ~03:00 | (maintenance) launder public wifi traffic through vpn

To help the refugee camp not get into legal trouble for providing free internet access (gotta love germany) we've put a vpn in place that launders the traffic from the public wifi through a vpn provider.

Only the clients network gets laundered. This is accomplished by using policy-based routing.

The vpn interface is put into its own routing table (20 - launder). Two ip rules steer the traffic from the clients network into the tunnel with a failsafe. If the vpn connection dies no traffic leaks through the normal wan interface.

changes:

root@gw-core01:~# echo "20 launder" >> /etc/iproute2/rt_tables
root@gw-core01:~# cat >> /etc/config/network << EOF
config interface 'wg1'
        option mtu 1350
        option proto 'wireguard'
        option private_key '[redacted]'
        list addresses '[redacted]'
        option ip4table 'launder'

config wireguard_wg1 'mullvad_fr'
        option public_key '[redacted]'
        option endpoint_host '[redacted]'
        option endpoint_port '51820'
        option route_allowed_ips '1'
        list allowed_ips '0.0.0.0/0'

config rule
        option in 'clients'
        option lookup 'launder'
        option priority 50

config rule
        option in 'clients'
        option action prohibit
        option priority 51
EOF
root@gw-core01:~# cat >> /etc/config/firewall << EOF
config zone
        option name launder
        list network wg1
        option input REJECT
        option output ACCEPT
        option forward REJECT
        option masq 1
        option mtu_fix 1

config forwarding
        option src clients
        option dest launder
EOF
root@gw-core01:~# /etc/init.d/network restart
root@gw-core01:~# /etc/init.d/firewall restart
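
To verify the policy routing, the rules and the launder table can be inspected (a sketch; output elided):

root@gw-core01:~# ip rule show                 # should list priority 50 (lookup launder) and 51 (prohibit)
root@gw-core01:~# ip route show table launder  # should only contain the default route via wg1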

impact:

  • short service interruptions for public wifi clients at around ~03:00 lasting a few minutes

010: 2022.08.28 13:00, 2022.08.29 09:10 - 10:30 | random reboots of gw-core01

gw-core01 randomly reboots. The other devices on the same circuit (ie. sw-access01) did not reboot. Therefore it is not an issue with the circuit itself.

After a call, the facility management confirmed that the power supply was not correctly seated. The cause of the misalignment is still the protective cap of the power strip (see incident 001 for details).

The facility management is either going to remove the protective cap or disable the latching mechanism with zip ties.

Update: Now the protective cap is held back by zip ties and should finally stop interfering with the power supply

impact:

  • dhcp and routing downtime
  • for a few minutes on 2022.08.28 13:00
  • for about an hour on 2022.08.29 09:10 (till 10:30)

monitoring enhancements:

  • alert on rebooted nodes (via node_boot_time_seconds)
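
A minimal sketch of such a rule (rule file path, alert name and the 10 minute threshold are assumptions; only the metric name comes from the note above):

root@monitoring01:~# cat > /etc/prometheus/rules/reboots.yml << EOF
groups:
  - name: reboots
    rules:
      - alert: NodeRebooted
        # fires while a node's boot time is less than 10 minutes in the past
        expr: time() - node_boot_time_seconds < 600
        labels:
          severity: warning
EOF
root@monitoring01:~# promtool check rules /etc/prometheus/rules/reboots.yml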

011: 2022.08.31 01:06 - 10:00 | public wifi lost upstream vpn connection

The wireguard vpn (which launders the traffic of the public wifi) did not handshake for 9 hours:

root@gw-core01:~# date && wg
Thu Sep  1 07:55:49 UTC 2022
[...]
interface: wg1
  public key: [redacted]
  private key: (hidden)
  listening port: 48603

peer: [redacted]
  endpoint: [redacted]:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 8 hours, 49 minutes, 4 seconds ago
  transfer: 201.20 GiB received, 16.12 GiB sent

impact: the public wifi GU Deutscher Platz had no internet access for 9 hours

solution: add persistent keepalive statement to wg1 and restart tunnel (via /etc/init.d/network restart)

discussion: Any traffic traversing the interface should trigger a new handshake. Therefore I do not really understand why there was no handshake for 9 hours.

Root cause theory: The kernel decided that the default route at that interface was unreachable, dropped traffic and therefore stopped the handshake from triggering again.

I added a persistent_keepalive to the tunnel to stop this from happening again (if my theory for the root cause is correct).

changes:

  • 09:56:10 /etc/init.d/network restart (to kickstart new handshake)
  • 09:58:00 add persistent_keepalive statement to wg1 (longterm fix)
  • 10:00:00 /etc/init.d/network restart (restart again to apply wg changes)
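
For reference, the keepalive change as uci commands (the peer section name mullvad_fr is taken from incident 009; the 15 second interval matches the wg output shown in incident 013):

root@gw-core01:~# uci set network.mullvad_fr.persistent_keepalive='15'
root@gw-core01:~# uci commit network
root@gw-core01:~# /etc/init.d/network restart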

monitoring enhancements:

  • monitor connectivity for the public wifi (blackbox exporter in client network) and create alerting rules
  • prometheus instance on eap-adp-jump01 to get alerts if upstream is down in facility
  • monitor wireguard state (probably needs a custom lua exporter)

012: 2022.09.01 - 2022.09.08 | ongoing reboots of gw-core01

Unfortunately zip tying back the protective cap of the power strip did not stop the random reboots of gw-core01. See incidents 001 and 010 for details.

Either the power supply or the device itself is broken.

solution:

  • replace power supply
  • plug power supply into "normal" socket
  • replace device itself

updates:

  • 2022.09.02 ~20:00: I tried replacing the power supply but nobody could let me into the facilities.
  • 2022.09.03 ~14:40: Successfully replaced the power supply. While doing so the new power supply slipped out of its socket multiple times. It seems like the sockets in the power strip are deeper than normal. Maybe the old supply was not broken but simply slipped out because the strip is weird? Some zip ties are holding the new supply in place.
  • 2022.09.04 ~14:50: gw-core01 rebooted again => put psu of gw-core01 into a "normal" power strip
  • 2022.09.06 ~11:00: put psu of gw-core01 into a "normal" power strip
  • 2022.09.07 ~09:40: gw-core01 rebooted again => replace device
  • 2022.09.08 ~15:40: replaced gw-core01 with an Ubiquiti Edge Router X SFP

impact:

  • routing outage for a few minutes at each of:
    • 2022.09.01 17:24, 17:47
    • 2022.09.02 14:31, 18:10
    • 2022.09.03 ~14:40
    • 2022.09.04 ~14:50
    • 2022.09.07 ~09:40
    • 2022.09.08 ~15:40

router replacement: We replaced a Ubiquiti Edge Router X with a Ubiquiti Edge Router X SFP. Those devices are nearly identical, except that the new router has an additional sfp port and can deliver passive PoE on all ports.

After building a custom openwrt image with garet (profile: ubiquiti-edgerouter-x-sfp_21.02.3, commit: 6f7c75c8064e7e0241cdba8f87efc9492dd860d0) we transferred the config to the new device.

There are custom gpio mappings in /etc/config/system which differ between those devices, so they were edited accordingly.

013: 2022.09.07 10:17 - 11:47 | public wifi vpn blackholed traffic

The public wifi had no upstream internet connectivity from 10:17 till 11:47.

This time the wireguard interface was up and connected (ie. handshake < 120 seconds):

root@gw-core01:~# wg
interface: wg1
  public key: Sqz0LEJVmgNlq6ZgmR9YqUu3EcJzFw0bJNixGUV9Nl8=
  private key: (hidden)
  listening port: 36986

peer: uC0C1H4zE6WoDjOq65DByv1dSZt2wAv6gXQ5nYOLiQM=
  endpoint: 185.209.196.70:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 1 minute, 2 seconds ago
  transfer: 5.84 GiB received, 679.59 MiB sent
  persistent keepalive: every 15 seconds

interface: wg0
  public key: 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
  private key: (hidden)
  listening port: 51820

peer: 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
  preshared key: (hidden)
  endpoint: 162.55.53.85:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 1 minute, 31 seconds ago
  transfer: 5.48 MiB received, 4.51 MiB sent
  persistent keepalive: every 15 seconds

After restarting all interfaces (via /etc/init.d/network restart) traffic started flowing again.

impact: no internet on the public wifi from 10:17 till 11:47

reason: unknown. My guess is that mullvad had an oopsie on their end.

014 2022.09.08 15:07 | (maintenance) additional ap in tent 5

The residents complained about bad wifi performance in tent 5.

root cause: A speedtest in the office revealed ~60mbit/s. The same test in tent 5 only got ~15mbit/s (both under normal network load). Additionally the monitoring showed that the ap in tent 5 had the most connected clients (~35) while other tents only had 15 to 20.

Therefore my assumption was that the ap could not keep up with the number of connected clients. This is unscientific, I know.

solution: We installed an additional ap (ap-8f39) on the opposite side of the tent to distribute the load evenly. The network cable for ap-8f39 could be terminated right inside tent 5 because sw-access02 also lives there. Because we did not want to crawl behind the separated rooms inside the tent we decided to route the cable for ap-8f39 along the outside.

015 2022.09.08 18:45 - 2022.09.09 10:15 | gw-core01 unreachable

At 18:45 gw-core01 lost its wireguard connection to eae-adp-jump01. Either Vodafone is down or the new router died on us.

eae-adp-jump01# date && ospfctl show neigh && ifconfig wg
Thu Sep  8 20:44:05 CEST 2022
ID              Pri State        DeadTime Address         Iface     Uptime
10.84.8.1       1   DOWN/P2P     01:56:07 10.84.254.1     wg0       -
192.168.0.2     1   DOWN/P2P     05:06:46 10.84.254.1     wg0       -
wg0: flags=80c3<UP,BROADCAST,RUNNING,NOARP,MULTICAST> mtu 1350
	index 5 priority 0 llprio 3
	wgport 51820
	wgpubkey 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
	wgpeer 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
		wgpsk (present)
		wgendpoint 109.42.241.116 9749
		tx: 1858427748, rx: 836384108
		last handshake: 7036 seconds ago
		wgaip 0.0.0.0/0
	groups: wg
	inet 10.84.254.0 netmask 0xfffffffe
eae-adp-jump01#

diagnosis: The next morning I immediately went to the facility to diagnose and fix the problem.

Connecting directly into the management port of gw-core01 did not result in an ssh connection. The only way I could establish an ssh connection into gw-core01 was by plugging myself into sw-access01.

The device itself responded and seemed to still try answering dhcp requests. But any kind of routing failed. /etc/init.d/network restart and ip a ran into internal timeouts:

root@gw-core01:~# date
Fri Sep  9 08:07:33 UTC 2022
root@gw-core01:~# /etc/init.d/network restart
Command failed: Request timed out
Command failed: Request timed out
Command failed: Request timed out
Command failed: Request timed out
^C^CCommand failed: Request timed out

root@gw-core01:~# ip a





^C^C

root@gw-core01:~#

There were no kernel messages indicating failures:

root@gw-core01:~# dmesg
[    0.000000] Linux version 5.4.188 (builder@buildhost) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r16554-1d4dea6d4f)) #0 SMP Sat Apr 16 12:59:34 2022
[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[    0.000000] printk: bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[    0.000000] MIPS: machine is Ubiquiti EdgeRouter X SFP
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] VPE topology {2,2} total 4
[    0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000]   HighMem  empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000] On node 0 totalpages: 65536
[    0.000000]   Normal zone: 576 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 14 pages/cpu s26768 r8192 d22384 u57344
[    0.000000] pcpu-alloc: s26768 r8192 d22384 u57344 alloc=14*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 64960
[    0.000000] Kernel command line: console=ttyS0,57600 rootfstype=squashfs,jffs2
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] Writing ErrCtl register=0006c560
[    0.000000] Readback ErrCtl register=0006c560
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 250792K/262144K available (6097K kernel code, 210K rwdata, 748K rodata, 1252K init, 238K bss, 11352K reserved, 0K cma-reserved, 0K highmem)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 256
[    0.000000] random: get_random_bytes called from 0x806e7a3c with crng_init=0
[    0.000000] CPU Clock: 880MHz
[    0.000000] clocksource: GIC: mask: 0xffffffffffffffff max_cycles: 0xcaf478abb4, max_idle_ns: 440795247997 ns
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 4343773742 ns
[    0.000008] sched_clock: 32 bits at 440MHz, resolution 2ns, wraps every 4880645118ns
[    0.015500] Calibrating delay loop... 583.68 BogoMIPS (lpj=1167360)
[    0.055845] pid_max: default: 32768 minimum: 301
[    0.065197] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.079604] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.097727] rcu: Hierarchical SRCU implementation.
[    0.107838] smp: Bringing up secondary CPUs ...
[    6.638149] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    6.638160] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    6.638172] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    6.638275] CPU1 revision is: 0001992f (MIPS 1004Kc)
[    0.145018] Synchronize counters for CPU 1: done.
[    6.729209] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    6.729217] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    6.729225] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    6.729283] CPU2 revision is: 0001992f (MIPS 1004Kc)
[    0.239464] Synchronize counters for CPU 2: done.
[    6.820315] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    6.820323] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    6.820331] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    6.820392] CPU3 revision is: 0001992f (MIPS 1004Kc)
[    0.327062] Synchronize counters for CPU 3: done.
[    0.386672] smp: Brought up 1 node, 4 CPUs
[    0.399152] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.418438] futex hash table entries: 1024 (order: 3, 32768 bytes, linear)
[    0.432279] pinctrl core: initialized pinctrl subsystem
[    0.444212] NET: Registered protocol family 16
[    0.458990] FPU Affinity set after 4688 emulations
[    0.477155] clocksource: Switched to clocksource GIC
[    0.488388] thermal_sys: Registered thermal governor 'step_wise'
[    0.488922] NET: Registered protocol family 2
[    0.509581] IP idents hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.525403] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 6144 bytes, linear)
[    0.542061] TCP established hash table entries: 2048 (order: 1, 8192 bytes, linear)
[    0.557199] TCP bind hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.571359] TCP: Hash tables configured (established 2048 bind 2048)
[    0.584096] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.596987] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.611063] NET: Registered protocol family 1
[    0.619627] PCI: CLS 0 bytes, default 32
[    0.717095] 4 CPUs re-calibrate udelay(lpj = 1167360)
[    0.728609] workingset: timestamp_bits=14 max_order=16 bucket_order=2
[    0.742148] random: fast init done
[    0.754345] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.765830] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.786977] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.803185] GPIO line 487 (sfp_i2c_clk_gate) hogged as output/high
[    0.815644] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.826917] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.838184] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.850028] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
[    0.866274] printk: console [ttyS0] disabled
[    0.874717] 1e000c00.uartlite: ttyS0 at MMIO 0x1e000c00 (irq = 19, base_baud = 3125000) is a 16550A
[    0.892649] printk: console [ttyS0] enabled
[    0.909210] printk: bootconsole [early0] disabled
[    0.930438] mt7621-nand 1e003000.nand: Using programmed access timing: 31c07388
[    0.945315] nand: device found, Manufacturer ID: 0x01, Chip ID: 0xda
[    0.957964] nand: AMD/Spansion S34ML02G2
[    0.965772] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 128
[    0.981019] mt7621-nand 1e003000.nand: ECC strength adjusted to 12 bits
[    0.994229] mt7621-nand 1e003000.nand: Using programmed access timing: 21005134
[    1.008785] mt7621-nand 1e003000.nand: Using programmed access timing: 21005134
[    1.023342] Scanning device for bad blocks
[    4.991028] 6 fixed-partitions partitions found on MTD device mt7621-nand
[    5.004548] Creating 6 MTD partitions on "mt7621-nand":
[    5.014961] 0x000000000000-0x000000080000 : "u-boot"
[    5.026320] 0x000000080000-0x0000000e0000 : "u-boot-env"
[    5.038154] 0x0000000e0000-0x000000140000 : "factory"
[    5.049647] 0x000000140000-0x000000440000 : "kernel1"
[    5.060986] 0x000000440000-0x000000740000 : "kernel2"
[    5.072507] 0x000000740000-0x00000ff00000 : "ubi"
[    5.111797] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    5.128943] mtk_soc_eth 1e100000.ethernet dsa: mediatek frame engine at 0xbe100000, irq 20
[    5.146701] i2c-mt7621 1e000900.i2c: clock 100 kHz
[    5.160084] NET: Registered protocol family 10
[    5.170415] Segment Routing with IPv6
[    5.177872] NET: Registered protocol family 17
[    5.186801] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[    5.212902] 8021q: 802.1Q VLAN Support v1.8
[    5.223128] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    5.246580] mt7530 mdio-bus:1f eth0 (uninitialized): PHY [dsa-0.0:00] driver [Generic PHY]
[    5.264582] mt7530 mdio-bus:1f eth1 (uninitialized): PHY [dsa-0.0:01] driver [Generic PHY]
[    5.282533] mt7530 mdio-bus:1f eth2 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY]
[    5.300513] mt7530 mdio-bus:1f eth3 (uninitialized): PHY [dsa-0.0:03] driver [Generic PHY]
[    5.318518] mt7530 mdio-bus:1f eth4 (uninitialized): PHY [dsa-0.0:04] driver [Generic PHY]
[    5.336462] mt7530 mdio-bus:1f eth5 (uninitialized): PHY [mdio-bus:07] driver [Atheros 8031 ethernet]
[    5.356180] mt7530 mdio-bus:1f: configuring for fixed/rgmii link mode
[    5.373826] DSA: tree 0 setup
[    5.380199] mt7530 mdio-bus:1f: Link is Up - 1Gbps/Full - flow control off
[    5.381345] UBI: auto-attach mtd5
[    5.400526] ubi0: attaching mtd5
[    7.967018] ubi0: scanning is finished
[    7.993716] ubi0: attached mtd5 (name "ubi", size 247 MiB)
[    8.004666] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[    8.018357] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[    8.031871] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[    8.045746] ubi0: good PEBs: 1982, bad PEBs: 0, corrupted PEBs: 0
[    8.057881] ubi0: user volume: 2, internal volumes: 1, max. volumes count: 128
[    8.072269] ubi0: max/mean erase counter: 2/0, WL threshold: 4096, image sequence number: 231599107
[    8.090289] ubi0: available PEBs: 0, total reserved PEBs: 1982, PEBs reserved for bad PEB handling: 40
[    8.108849] ubi0: background thread "ubi_bgt0d" started, PID 480
[    8.111258] block ubiblock0_0: created from ubi0:0(rootfs)
[    8.131782] ubiblock: device ubiblock0_0 (rootfs) set to be root filesystem
[    8.145658] hctosys: unable to open rtc device (rtc0)
[    8.163435] VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
[    8.181987] Freeing unused kernel memory: 1252K
[    8.191032] This architecture does not have kernel memory protection.
[    8.203856] Run /sbin/init as init process
[    8.682869] init: Console is alive
[    8.689902] init: - watchdog -
[    8.904196] kmodloader: loading kernel modules from /etc/modules-boot.d/*
[    9.001394] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
[    9.021446] init: - preinit -
[    9.693086] mtk_soc_eth 1e100000.ethernet dsa: configuring for fixed/rgmii link mode
[    9.709069] mtk_soc_eth 1e100000.ethernet dsa: Link is Up - 1Gbps/Full - flow control rx/tx
[    9.725759] IPv6: ADDRCONF(NETDEV_CHANGE): dsa: link becomes ready
[    9.883838] random: jshn: uninitialized urandom read (4 bytes read)
[    9.954270] random: jshn: uninitialized urandom read (4 bytes read)
[    9.999580] random: jshn: uninitialized urandom read (4 bytes read)
[   10.308223] device dsa entered promiscuous mode
[   10.317845] mt7530 mdio-bus:1f eth1: configuring for phy/gmii link mode
[   10.331434] 8021q: adding VLAN 0 to HW filter on device eth1
[   14.562377] UBIFS (ubi0:1): Mounting in unauthenticated mode
[   14.574081] UBIFS (ubi0:1): background thread "ubifs_bgt0_1" started, PID 587
[   14.616572] urandom_read: 6 callbacks suppressed
[   14.616584] random: procd: uninitialized urandom read (4 bytes read)
[   14.658895] UBIFS (ubi0:1): recovery needed
[   14.833294] UBIFS (ubi0:1): recovery completed
[   14.842330] UBIFS (ubi0:1): UBIFS: mounted UBI device 0, volume 1, name "rootfs_data"
[   14.857938] UBIFS (ubi0:1): LEB size: 126976 bytes (124 KiB), min./max. I/O unit sizes: 2048 bytes/2048 bytes
[   14.877699] UBIFS (ubi0:1): FS size: 236429312 bytes (225 MiB, 1862 LEBs), journal size 11808768 bytes (11 MiB, 93 LEBs)
[   14.899365] UBIFS (ubi0:1): reserved for root: 4952683 bytes (4836 KiB)
[   14.912550] UBIFS (ubi0:1): media format: w5/r0 (latest is w5/r0), UUID 787684E8-A245-4FE7-9437-3D9F0B3BD798, small LPT model
[   14.941199] mount_root: switching to ubifs overlay
[   14.969476] urandom-seed: Seeding with /etc/urandom.seed
[   15.073478] device dsa left promiscuous mode
[   15.091161] procd: - early -
[   15.097018] procd: - watchdog -
[   15.649297] procd: - watchdog -
[   15.659598] procd: - ubus -
[   15.720914] procd: - init -
[   16.294778] kmodloader: loading kernel modules from /etc/modules.d/*
[   16.320888] i2c /dev entries driver
[   16.330637] pca953x 0-0025: 0-0025 supply vcc not found, using dummy regulator
[   16.345230] pca953x 0-0025: using no AI
[   16.421511] pca953x 0-0025: interrupt support not compiled in
[   16.465255] sfp sfp_eth5: Host maximum power 1.0W
[   16.480258] urngd: v1.0.2 started.
[   16.501156] sfp sfp_eth5: No tx_disable pin: SFP modules will always be emitting.
[   16.524577] xt_time: kernel timezone is -0000
[   16.549198] PPP generic driver version 2.4.2
[   16.559155] NET: Registered protocol family 24
[   16.571409] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[   16.587065] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[   16.616800] kmodloader: done loading kernel modules from /etc/modules.d/*
[   16.627960] crng init done
[   21.588816] mtk_soc_eth 1e100000.ethernet dsa: Link is Down
[   21.608948] mtk_soc_eth 1e100000.ethernet dsa: configuring for fixed/rgmii link mode
[   21.624843] mtk_soc_eth 1e100000.ethernet dsa: Link is Up - 1Gbps/Full - flow control rx/tx
[   21.630145] mt7530 mdio-bus:1f eth0: configuring for phy/gmii link mode
[   21.655323] 8021q: adding VLAN 0 to HW filter on device eth0
[   21.669925] IPv6: ADDRCONF(NETDEV_CHANGE): dsa: link becomes ready
[   21.683342] switch: port 1(eth0) entered blocking state
[   21.693825] switch: port 1(eth0) entered disabled state
[   21.705836] device eth0 entered promiscuous mode
[   21.715085] device dsa entered promiscuous mode
[   21.798255] mt7530 mdio-bus:1f eth1: configuring for phy/gmii link mode
[   21.812164] 8021q: adding VLAN 0 to HW filter on device eth1
[   21.827686] switch: port 2(eth1) entered blocking state
[   21.838224] switch: port 2(eth1) entered disabled state
[   21.850382] device eth1 entered promiscuous mode
[   21.876743] mt7530 mdio-bus:1f eth2: configuring for phy/gmii link mode
[   21.890510] 8021q: adding VLAN 0 to HW filter on device eth2
[   21.905871] switch: port 3(eth2) entered blocking state
[   21.916366] switch: port 3(eth2) entered disabled state
[   21.928552] device eth2 entered promiscuous mode
[   21.959473] mt7530 mdio-bus:1f eth3: configuring for phy/gmii link mode
[   21.973384] 8021q: adding VLAN 0 to HW filter on device eth3
[   21.988082] switch: port 4(eth3) entered blocking state
[   21.998586] switch: port 4(eth3) entered disabled state
[   22.010746] device eth3 entered promiscuous mode
[   22.041595] mt7530 mdio-bus:1f eth4: configuring for phy/gmii link mode
[   22.055465] 8021q: adding VLAN 0 to HW filter on device eth4
[   22.070479] switch: port 5(eth4) entered blocking state
[   22.080938] switch: port 5(eth4) entered disabled state
[   22.093369] device eth4 entered promiscuous mode
[   44.201668] mt7530 mdio-bus:1f eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   44.216647] switch: port 1(eth0) entered blocking state
[   44.227108] switch: port 1(eth0) entered forwarding state
[   44.238996] IPv6: ADDRCONF(NETDEV_CHANGE): switch: link becomes ready
[   44.252780] IPv6: ADDRCONF(NETDEV_CHANGE): switch.1: link becomes ready
[   44.266682] IPv6: ADDRCONF(NETDEV_CHANGE): switch.2: link becomes ready
[   44.280599] IPv6: ADDRCONF(NETDEV_CHANGE): switch.3: link becomes ready
[   44.294436] IPv6: ADDRCONF(NETDEV_CHANGE): switch.8: link becomes ready
[   53.674078] mt7530 mdio-bus:1f eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[   53.689068] switch: port 3(eth2) entered blocking state
[   53.699526] switch: port 3(eth2) entered forwarding state
[   55.786325] mt7530 mdio-bus:1f eth3: Link is Up - 1Gbps/Full - flow control rx/tx
[   55.801371] switch: port 4(eth3) entered blocking state
[   55.811823] switch: port 4(eth3) entered forwarding state
[   59.946845] mt7530 mdio-bus:1f eth4: Link is Up - 1Gbps/Full - flow control rx/tx
[   59.961897] switch: port 5(eth4) entered blocking state
[   59.972332] switch: port 5(eth4) entered forwarding state

logread was full of messages from dnsmasq-dhcp which I am not going to share publicly.

issue: Unknown. Probably faulty hardware or a bug in OpenWrt

solution: Because the "original" gw-core01 (see incident 012 for details) was way more stable, I replaced gw-core01 again with the old node

impact: no routing, dhcp and dns in the specified timeframe

016 2022.09.11 21:39, etc. | power outages on site

There were power outages in the facility.

outages:

  • 09.11 21:39: office and tent 5
  • 09.12 17:47: office
  • 09.13 01:47: office

affected equipment:

  • tent 5: sw-access02, ap-2bbf, ap-1a38, ap-8f39
  • office: all equipment except tent 5

impact:

  • service interruption in the mentioned timeframes till power was restored and equipment back online

017 2022.09.13 - 2022.09.19 | wifi instabilities reported by the facility management


Most of the problems reported by the facility management/social workers are solved now. The documentation for this incident was added well after all the work was done, therefore some info will be missing.


the facility management reached out and reported that

  • they are having issues with the wifi on their new notebooks
  • sometimes the wifi is marked as having no internet by windows
  • they need to switch to the public wifi instead of using the backoffice wifi
  • according to the security the wifi is unusable at around 12:00

thoughts: To help diagnose the problem I expanded the monitoring with end-to-end monitors. They ping two sites on the internet: one via the normal wan (like the backoffice) and one via the vpn (like the public wifi). See commit f011562 for details.

I am noticing that sometimes the icmp probes fail, which is odd. But at the same time the node exporter on eae-adp-jump01 is reachable.

I implemented another icmp probe that uses the gigacube directly, without traversing gw-core01.

While checking the gigacube for the size of the dhcp pool I noticed that the signal strength is only -91 dBm. That seems way too low. Maybe this is the cause of the wan instability?

changes in reponse:

  • set password for gigacube-2001 (see pass)
    • the gui forced me to set a password
    • reused the password from the old gigacube that was set by the facility management
    • will write the changed pw onto the current gigacube the next time I visit
    • probably going to create a new secure password beforehand :)
  • configured two static ip bindings on the gigacube
    • for gw-core01: 192.168.0.2
    • for mon-e2e-wan01: 192.168.0.3
  • added WAN/gigacube vlan onto proxmox port via gw-core01
  • added vmbr3 on proxmox which is used for the WAN/gigacube vlan
  • created mon-e2e-wan01 that lives inside the gigacube network and probes the same stuff as mon-e2e-clients01

update short version:

TLDR: The devices inside the social work container had a pretty bad wifi experience because they were sitting inside a big metal box without an ap. After installing an accesspoint in that container the problems went away. Additionally we installed an access switch (sw-access03) in the container to directly connect a few notebooks via ethernet.

update long version:

  • 15.09.2022: meeting with the facility management to talk about their problems
  • 16.09.2022:
    • call from the social workers because they could not print (surfing via a hotspot)
    • unsuccessful fix with a usb wifi card for the new notebooks
  • 17.09.2022:
    • another visit (this time with @drbroiler)
    • installation of access switch into social work container
    • wifi tests in social work container
  • 19.09.2022:
    • installation of ap-ac7c into social work container
    • limit tx power of ap inside office containers (ap-c5d1, ap-ac7c)
  • 26.09.2022:
    • acknowledgement from the social workers that everything network related is doing just fine

afterthoughts:

  • the security is also having a bad wifi experience because they can only connect to far away aps that live inside metal boxes (either tents or containers)
  • we are still going to install a directional LTE antenna to hopefully combat the instabilities with the icmp probes

018 2022.09.26 15:28, 2022.09.27 21:14 | power outage in facility management container

There were power outages in the facility management containers at around

  • 2022.09.26 15:28 and
  • 2022.09.27 21:14

taking down everything except sw-access02 (plus connected aps). I don't know if the social worker container still had power.

impact:

  • full network downtime till systems were back online (not more than a few minutes)

todos:

  • get acknowledgement that the power outage happened

019 2022.09.26 01:00 | ap-0b99 unreachable via ssh

While trying to upgrade the firmware on ap-0b99 (see incident 020 for details) I was unable to ssh into the ap. ssh -v revealed that a successful ssh handshake happened but afterwards ap-0b99 immediately closed the connection.

monitoring01 was still able to scrape the target. Therefore my plan was to collect logs via the serial console on my next visit.

The recent power outages (see incident 018 for details) destroyed the logs.

impact:

  • earliest occurrence: 2022.09.26 ~01:00
  • till: 2022.09.26 ~15:30
  • maybe just an issue with the management plane
  • possibly also an issue with the forwarding plane and wifi stuff

020 2022.09.27 02:30 - 04:00 | (maintenance) firmware unification on accesspoints

To have all APs on the same firmware revision I've sysupgraded them in the early morning hours.

 https://git.sr.ht/~hirnpfirsich/garet
 garet ce38181, aruba-ap-105_21.02.3

This was not necessary but helped with two things:

  • the current garet releases include a version number in the firmware
    • because not all aps were installed at the same time there were different firmware/garet versions installed
    • nothing serious but it forced the ansible playbooks to always think about all versions
    • now we can simplify the playbooks :)
    • also a clear version indicator is always nice
  • the aps needed some packages for new ansible shenanigans

I've started with the APs in the office containers (because nobody is working there at these hours). After doing the upgrades I noticed that the images were missing some files (which disabled the wifi monitoring).

So after fixing that and testing the images (by flashing the APs in the office containers again) I upgraded the firmware on all APs.

Because ap-0b99 was unreachable at that time (see incident 019 for details) it was still stuck on some older revision of the firmware.

timetable:

  • 02:50: ap-ac7c
  • 03:05: ap-c5d1
  • 03:30: ap-ac7c
  • 03:43: ap-c5d1
  • 03:56:
    • ap-8f42
    • ap-c495
    • ap-2bbf
    • ap-1a38
    • ap-8f39

playbook: To automate the upgrade process I've written a small playbook (playbook_sysupgrade). One has to specify the firmware image via -e firmware_file=<file>. Also currently the playbook tries to upgrade all accesspoints (limit via -l <device>).
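
An example invocation, reusing the placeholders above (the .yml suffix is an assumption):

user@freifunk-admin:~$ ansible-playbook playbook_sysupgrade.yml -e firmware_file=<file> -l <device>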

impact: While doing the sysupgrade the APs disable all services (including wifi), write the new firmware to flash and then reboot. This process takes roughly 5 minutes. In this window the ap drops all clients.

(Nearly) all APs upgraded at the same time, so clients could not roam to a different AP. This means there was complete wifi downtime for about 10 minutes.

update: After the power outage (see incident 018 for details) ap-0b99 was reachable again. Therefore I've upgraded the ap on 2022.09.28 from 00:21 till 00:27.

021 2022.09.29 10:30 - 11:30 | (maintenance) replace gw-core01, reorg cabling

To finally combat the random reboots of gw-core01 (see incidents 010 and 012 for details) I've replaced the device again.

The last time I tried to replace gw-core01 the replacement device stopped working after 4 hours (see incident 015 for details). Because the original and the replacement device have the same SoC (Mediatek MT7621) this smells like an OS issue. There are a few OpenWrt forum entries about people having issues with the MT7621 on kernel 5.10.

Therefore this replacement was done using an x64 based platform (Sophos SG-125r2).

After replacing the device I also replugged some switch ports on sw-access01 to bring some order into the network cables.

firmware and configuration of gw-core01:

  • firmware: garet commit ce38181, garet profile sophos-sg-125r2_22.03.0
  • playbook_provision_gateway.yml: e7054c1 ported the config of gw-core01 onto the new platform.
  • synced dhcp leases via scp between devices
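
The lease sync boils down to copying dnsmasq's lease file, which lives at /tmp/dhcp.leases by default on OpenWrt (a sketch; hostnames are placeholders):

root@gw-core01:~# /etc/init.d/dnsmasq stop
root@gw-core01:~# scp <old-gw>:/tmp/dhcp.leases /tmp/dhcp.leases
root@gw-core01:~# /etc/init.d/dnsmasq start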

timetable:

  • 10:53: transferred all ports to the new gw-core01
  • 11:00: replugged ap-8f42 (tent 1) to reorg the network cable (=> ap reboot)
  • 11:02: replugged sw-access02 (tent 5) to reorg the network cable (=> only a short link interruption)
  • 11:14: shutdown hyper01 to reorg power cord (=> shutdown all vms)

022 2022.10.04 | (maintenance) install directional LTE antenna

The gigacube-2001 sits inside one of the orga containers, aka inside a big enclosed metal box. To combat possible issues regarding the LTE connection I've installed a directional LTE antenna on top of the container on an already existing mounting pole.

antenna details:

  • 800/900/1800/2100/2600 MHz
  • one vertical and one horizontal array
  • inside single enclosure
  • type-N connectors

antenna connection: antenna (type-N female) <-> (type-N male) 10m Coaxial Cable (SMA male) <-> (SMA female) Adapter (TS-9 male) <-> (TS-9 female) Gigacube for both arrays

antenna orientation:

  • loosely towards the top of the city block at Philipp-Rosenthal-Straße 66, 04103 Leipzig (nearest Vodafone eNodeB)
  • "fine"-tuned by moving the antenna vertically and horizontally while monitoring signal strength

gain:

  • increased rsrp from -90 dBm to -74 dBm
  • current rssi is 51

important: from now on please be careful when moving things around near the gigacube. The SMA to TS-9 adapters are too long for the jacks on the gigacube, so the antenna cables have a very unsatisfying fit. They could damage the gigacube, be yanked out, or both.


023 2022.10.16 ~18:00 - 2022.10.18 ~13:00 | public wifi lost upstream connectivity

issue: The public wifi stopped routing into the internet

cause: The wireguard tunnel towards mullvad stopped handshaking. It turns out we forgot to recharge the prepaid account.

hotfix: disable traffic laundering (6297531)

solution: recharged mullvad account

timetable:

  • 2022.10.16 17:50: mullvad vpn stopped handshaking; blackholed public wifi traffic
  • 2022.10.18 12:20: notification from facility management that public wifi stopped working
  • 2022.10.18 12:50: disabled traffic laundering for public wifi as a hotfix (6297531)
  • 2022.10.18 21:15: recharged pre-paid mullvad account
  • 2022.10.19 01:40: reenabled traffic laundering (466fefe)
  • 2022.10.19 02:10: added alarming rule for this exact case (ec917a2)

impact:

  • 2022.10.16 ~18:00 - 2022.10.18 ~13:00: public wifi not working

extended monitoring: This is not the first time the public wifi selectively stopped working because something was wrong with the vpn. To be proactively notified when this happens again I've created an alarm that should trigger when every end-to-end test from the public wifi/client network stops working (ec917a2).

024 2022.10.24 ~ 01:00 | (maintenance) upgrade firmware on accesspoints and gw-core01

upgrade firmware on all accesspoints to the latest old stable version of OpenWrt:

 -----------------------------------------------------
 OpenWrt 21.02.5, r16688-fa9a932fdb
 -----------------------------------------------------
 https://git.sr.ht/~hirnpfirsich/garet
 garet 845a6ba, aruba-ap-105_21.02
 -----------------------------------------------------

upgrade firmware on gw-core01 to the latest stable version of OpenWrt:

 -----------------------------------------------------
 OpenWrt 22.03.2, r19803-9a599fee93
 -----------------------------------------------------
 https://git.sr.ht/~hirnpfirsich/garet
 garet 89cbd27, sophos-sg-125r2_22.03
 -----------------------------------------------------

all updates were done using the new "idempotent" playbook_sysupgrade (since 8d79518).

timetable:

  • 2022.10.24 00:16 - 00:20: ap-c5d1 (office container)
  • 2022.10.24 00:27 - 00:32: ap-ac7c (social work)
  • 2022.10.24 00:39 - 00:44: ap-0b99, ap-1a38, ap-2bbf, ap-8f39, ap-8f42, ap-c495 (tents)
  • 2022.10.24 01:44 - 01:46: gw-core01 => downtime of gw-core01 (and therefore of all accesspoints behind it) in the specified timeframe

025 2022.11.19 04:00 (ANS) | (maintenance) (try to) steer clients into 5 GHz band


this log entry was added way after doing the actual work. Please read it with a grain of salt


problem:

  • (if I remember correctly) way more clients in the 2.4 GHz band than in the 5 GHz band (3/4 to 1/4)

solution:

  • halved the transmit power in the 2.4 GHz band (i.e. -3 dB)
  • increased the transmit power in the 5 GHz band by 1 dB
  • implemented by 5017cb5 (see the sketch below)
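
A hypothetical sketch of what the change looks like on a single ap (which radio index maps to which band, and the absolute dBm values, are assumptions; 5017cb5 has the real change):

root@ap-XXXX:~# uci set wireless.radio0.txpower='18'   # 5 GHz: +1 dB
root@ap-XXXX:~# uci set wireless.radio1.txpower='9'    # 2.4 GHz: halved, i.e. -3 dB
root@ap-XXXX:~# uci commit wireless
root@ap-XXXX:~# wifi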

impact: This restarted wifi on all APs at the same time. Downtime for all clients for a few seconds at 04:00 in the morning.

validation: One day afterwards it seemed like there were more clients in the 5 GHz band (50/50), but the datarates dropped for most of them.

criticism:

  • placement, transmit power and supported bands of the clients impact 5 GHz utilization
  • unsure what actually is the problem
  • also did not correctly validate for a few days

026 2022.11.20 15:30 (ANS) | (maintenance) replace SFP modules


this log entry was added way after doing the actual work. Please read it with a grain of salt


intro: The needed SFP modules for ans did not arrive in time for the installation. Therefore we've installed super old and shitty transceivers (> 10 years old, >70°C, ...) to get a working network.

impact:

  • L2 interruption (<= 10 seconds) for all tents

027 2022.11.21 02:00 | (maintenance) attach volume to eae-adp-jump01 for prometheus


this log entry was added way after doing the actual work. Please read it with a grain of salt


problem: After installing a prometheus stack onto eae-adp-jump01 (8389a18) the /var/ partition filled up after a few days. Limiting the size of the TSDB did not resolve the issue (maybe I've misconfigured the limit).

solution:

  • sysupgrade to OpenBSD 7.2
  • attach 20GB block device onto vm and mount it as /var/prometheus:
eae-adp-jump01# rcctl stop prometheus
eae-adp-jump01# rm -r /var/prometheus/*
eae-adp-jump01# sysctl hw.disknames
eae-adp-jump01# fdisk -iy sd1
eae-adp-jump01# disklabel -E sd1
> a a
>
> *
> q
eae-adp-jump01# newfs sd1a
eae-adp-jump01# diff -Naur /etc/fstab.20221121 /etc/fstab
--- /etc/fstab.20221121	Sun Jun 26 23:00:39 2022
+++ /etc/fstab	Mon Nov 21 02:01:03 2022
@@ -8,3 +8,4 @@
 e1c3571d54635852.j /usr/obj ffs rw,nodev,nosuid 1 2
 e1c3571d54635852.i /usr/src ffs rw,nodev,nosuid 1 2
 e1c3571d54635852.e /var ffs rw,nodev,nosuid 1 2
+a0469c9f38992e1d.a /var/prometheus ffs rw,nodev,nosuid 1 2
eae-adp-jump01# mount /var/prometheus
eae-adp-jump01# chown _prometheus:_prometheus /var/prometheus
eae-adp-jump01# rcctl start prometheus
eae-adp-jump01# syspatch
eae-adp-jump01# reboot

028 2022.11.29 02:00 | periodically restart prometheus


this log entry was added way after doing the actual work. Please read it with a grain of salt


problem: prometheus crashed regularly on eae-adp-jump01. It seems like OpenBSD is missing some file-handle functionality, which makes prometheus crash. There is a github issue (for an older OpenBSD release) that describes the same problems.

solution: until I've got time to install a new linux machine somewhere that does the monitoring, regularly restart prometheus:

eae-adp-jump01# crontab -e
[...]
0	*/2	*	*	*	rcctl restart prometheus

029 2022.11.29 03:00 (ANS) | (maintenance) automagically start offloader


this log entry was added way after doing the actual work. Please read it with a grain of salt


problem: ANS washes its traffic via a FFLPZ/FFDD offloader vm. There was only a script that manually started the offloader vm; on reboots the offloader vm would not automagically start.

solution: implement a service that starts the vm
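
A minimal sketch of what such a service could look like as a procd init script (the name and path of the wrapped start script are assumptions; the real script is not part of this repo, see the disclaimer below):

root@ffl-ans-gw-core01:~# cat > /etc/init.d/offloader << "EOF"
#!/bin/sh /etc/rc.common
# hypothetical service wrapping the existing manual start script
START=99
USE_PROCD=1

start_service() {
        procd_open_instance
        # assumption: the manual start script lives at /root/start-offloader.sh
        procd_set_param command /root/start-offloader.sh
        procd_set_param respawn
        procd_close_instance
}
EOF
root@ffl-ans-gw-core01:~# chmod +x /etc/init.d/offloader
root@ffl-ans-gw-core01:~# /etc/init.d/offloader enable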

impact: after validating the script on another openwrt machine I tested the script in production. This created the following downtimes:

  • offloader down from 02:50 to 03:05 -- service interruption for the public wifi
  • ffl-ans-gw-core01 down from 02:53 to 02:55 -- service interruption for everybody

disclaimer: The script is manually deployed on ffl-ans-gw-core01 and therefore not part of this repo at the moment

030 2022.11.30 15:30 (ANS) | (maintenance) replace switches


this log entry was added way after doing the actual work. Please read it with a grain of salt


intro: The switches installed at ans were defective: not every boot had working PoE, meaning that a power outage could result in no power for the APs. Fortunately Zyxel replaced the devices.

replacement log:

  • 16:34:30 - 16:34:50: ffl-ans-sw-distribution01
    • quickly replaced device and connections
    • => l2 interruption for ffl-ans-sw-access01 and ffl-ans-sw-access02
    • => power cycle of APs in social, security and facility container
  • 16:49: ffl-ans-sw-access01
    • power up new device alongside
    • bridge old and new device with short patch cable
    • move sfp uplink to new device
    • move first ap to new switch
    • wait till ap was back up and serving clients
    • move second ap
    • teardown old device
    • => minimal l2 downtime
    • => rolling AP downtimes
  • 17:09:30 - 17:10:15: ffl-ans-sw-access02
    • quickly replaced device and connections
    • => power cycle of all APs in tents 2 & 3

031 2022.12.23 14:00 (ADP) | enable backoffice wifi in tent 1

intro: The facility management moved into the "logistics" container

problem: Because there is no AP inside the container the wifi experience sucks. The printer is unable to connect to the wifi and the notebook has 0 bars.

hotfix: Also distribute the backoffice wifi from tent 1 because it hosts the nearest ap. Committed as d808775. Rolled out at around 14:00.

impact:

  • quick wifi outage for tent 1 (a few seconds)

validation:

  • the wifi experience still sucks
  • scanning sometimes works
  • I see the backoffice devices from the logistics container associating and reassociating multiple times per minute

longterm:

  • install an ap inside the logistics container
  • ETA for installation: 29.12.2022

032 2022.12.29 14:00 - 16:00 (ADP) | add ap to new facility management container

We installed an accesspoint into the new facility management container (ap-1293). Afterwards we disabled the temporary backoffice wifi in tent 1 (e3d8369).

The new ap is connected via a 30m CAT.7 outdoor cable and is plugged into sw-access01 port 6.

This closes incident 031.

033 2023.01.02 17:45 (ANS) | network core rebooted

someone accidentally unplugged the power for the network core in the facility management container

impact:

  • downtime from 17:40 - 17:45
  • public wifi was down 5 minutes longer because batman needed to converge on the offloader

034 2023.01.08 04:00 - 05:30 | (maintenance) bring accesspoints onto OpenWrt 22.03

target version:

 -----------------------------------------------------
 https://git.sr.ht/~hirnpfirsich/garet
 garet 9974455, aruba-ap-105_22.03
 -----------------------------------------------------

changelog:

  • OpenWrt 22.03.2
  • support for lldp -- needs to be configured

canary test:

  • ap-c5d1 (adp)
  • ap-b6ee (ans)
  • down from 04:00 - 04:05
  • playbook_provision_accesspoints restarted wifi on ap-c5d1 at 04:21 one more time

Deutscher Platz:

  • down from 05:00 - 05:06
  • playbook_provision_accesspoints restarted wifi on the following aps again at around 05:10
    • ap-8f39
    • ap-c495
    • ap-ac7c
    • ap-0b99

Arno-Nitzsche-Str:

  • down from 05:21 - 05:27
  • playbook_provision_accesspoints did not restart the wifi again

summarized impact:

  • Arno-Nitzsche-Str: wifi down from 05:21 - 05:27
    • ap-b6ee from 04:00 - 04:05
  • Deutscher Platz: wifi down from 05:00 - 05:06
    • ap-c5d1 from 04:00 - 04:05
    • ap-8f39 additional wifi restart at 05:10
    • ap-c495 additional wifi restart at 05:10
    • ap-ac7c additional wifi restart at 05:10
    • ap-0b99 additional wifi restart at 05:10

facility management rebooted the gigacube

036 2023.01.24 02:15 (RGS) | (maintenance) increase tx power of aps

RUNNING HANDLER [reload wireless] *********************************************************************************************************************
Tuesday 24 January 2023  02:16:45 +0100 (0:00:31.967)       0:02:06.789 *******

see 191b7f2 for details

impact: very unstable uplink for the ap in tent-3

hotfix: shut down the ap via poe to move clients onto another accesspoint (there really is no other ap in this tent though :()

problem: someone butchered the ethernet cables (from the network core) by squeezing and bending them through cable guides.

fix: Tried "unbending" them and the link came back!

038 2023.02.01 04:00 (ADP) | move to different mullvad account

old pubkey: Sqz0LEJVmgNlq6ZgmR9YqUu3EcJzFw0bJNixGUV9Nl8=

RUNNING HANDLER [reload network] **********************************************************************************************************************
Wednesday 01 February 2023  03:59:45 +0100 (0:00:03.789)       0:01:22.895 ****
changed: [gw-core01]

see commit 68ee430 for details

introduction: the uplink for tent-3 went flaky again

problem: the cables took irreparable damage from mishandling (see incident 035 for details)

fix:

  • install new access switch into tent-2 (sw-access04: 220bb14)
  • migrate uplink for tent-3 from the core onto sw-access04

040 2023.02.28 08:00 (ADP) | dns issues

introduction: Someone on site called and notified me that "the internet is not working".

problem: gw-core01 stopped serving dns queries:

root@gw-core01:~# logread | grep max
Tue Feb 28 08:44:16 2023 daemon.warn dnsmasq[1]: Maximum number of concurrent DNS queries reached (max: 150)

fix:

  • increased maxdnsqueries
  • increased dnscache
  • changed upstream dns to 9.9.9.9 (quad9) and 1.1.1.1 (cloudflare)

see a236643 for details
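
For illustration, the relevant knobs in the dnsmasq section of /etc/config/dhcp (values are guesses; dnsforwardmax maps to dnsmasq's --dns-forward-max, cachesize to --cache-size):

root@gw-core01:~# uci set dhcp.@dnsmasq[0].dnsforwardmax='300'
root@gw-core01:~# uci set dhcp.@dnsmasq[0].cachesize='1000'
root@gw-core01:~# uci add_list dhcp.@dnsmasq[0].server='9.9.9.9'
root@gw-core01:~# uci add_list dhcp.@dnsmasq[0].server='1.1.1.1'
root@gw-core01:~# uci commit dhcp
root@gw-core01:~# /etc/init.d/dnsmasq restart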

041 2023.03.11 19:20 - 2023.03.13 20:30 (ADP) | broken management vpn tunnel

root@gw-core01:~# date
Mon Mar 13 19:40:48 2023
root@gw-core01:~# wg
interface: wg0
  public key: 1lYOjFZBY4WbaVmyWFuesVbgfFrfqDTnmAIrXTWLkh4=
  private key: (hidden)
  listening port: 51820

peer: 9j6aZs+ViG9d9xw8AofRo10FPosW6LpDIv0IHtqP4UM=
  preshared key: (hidden)
  endpoint: 162.55.53.85:51820
  allowed ips: 0.0.0.0/0
  latest handshake: 1 day, 23 hours, 55 minutes, 49 seconds ago
  transfer: 1.17 GiB received, 16.71 GiB sent
  persistent keepalive: every 15 seconds
root@gw-core01:~# ifdown wg0
root@gw-core01:~# ifup wg0
root@gw-core01:~# echo wg0 still not handshaking properly
root@gw-core01:~# uci delete network.wg0.listen_port
root@gw-core01:~# /etc/init.d/network reload
root@gw-core01:~# echo wg0 is up again !
root@gw-core01:~# uci commit network

042 2023.03.12 18:00 - 2023.03.22 19:30 (RGS) | ap-1374 (kitchen-og) down

ap-1374 has been (mostly) down since 2023.03.12 18:00. Neither the ethernet link nor the poe is coming up reliably.

user@freifunk-admin:~$ date && ssh sax-rgs-sw-access02 
Wed 15 Mar 2023 12:07:55 AM CET
[...]
sax-rgs-sw-access02# show logging buffered 

Log messages in buffer
[...]
5;Feb 17 2000 05:37:36;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
4;Feb 17 2000 05:37:37;%TRUNK-4-INFO: Power-Over-Ethernet on gi0/7 Powered Down!
4;Feb 17 2000 05:37:48;%TRUNK-4-INFO: Power-Over-Ethernet on gi0/7: Detected Standard PD, Delivering power!
5;Feb 17 2000 05:37:54;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 17 2000 05:38:26;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 17 2000 05:38:28;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 17 2000 05:38:32;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 17 2000 05:38:35;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 17 2000 05:38:38;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 17 2000 05:38:59;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 20 2000 10:02:32;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/6, changed state to down
5;Feb 20 2000 10:02:35;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/6, changed state to up
5;Feb 24 2000 22:50:15;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 24 2000 22:50:15;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 24 2000 22:50:15;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
4;Feb 24 2000 22:50:18;%TRUNK-4-INFO: Power-Over-Ethernet on gi0/7 Powered Down!
5;Feb 25 2000 13:57:06;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/6, changed state to down
5;Feb 25 2000 13:57:09;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/6, changed state to up
4;Feb 26 2000 21:52:17;%TRUNK-4-INFO: Power-Over-Ethernet on gi0/7: Detected Standard PD, Delivering power!
5;Feb 26 2000 21:52:22;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 26 2000 21:52:54;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 26 2000 21:52:57;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 26 2000 21:53:01;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 26 2000 21:53:03;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 26 2000 21:53:06;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
5;Feb 26 2000 21:53:26;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to up
5;Feb 27 2000 00:31:56;%LINEPROTO-5-UPDOWN: Line protocol on GigabitEthernet0/7, changed state to down
4;Feb 27 2000 00:31:57;%TRUNK-4-INFO: Power-Over-Ethernet on gi0/7 Powered Down!
5;Feb 27 2000 05:24:48;%AAA-5-LOGIN: New ssh connection for user admin, source 10.86.254.0  ACCEPTED 
6;Feb 27 2000 05:25:00;%AAA-6-INFO: User 'admin' enter privileged mode from ssh with level '15' success 
sax-rgs-sw-access02# show clock                                                                                                                                                                                                   
2000-02-27 05:43:37 Coordinated(UTC+0)

needed fix:

  • check keystone modules on site
  • also check module for 0/6 (there are some ifInErrors)

additional work - set correct time on switches (done):

sax-rgs-sw-access0X> enable
sax-rgs-sw-access0X# configure terminal
sax-rgs-sw-access0X(config)# clock timezone CET +1
sax-rgs-sw-access0X(config)# clock set 00:26:15 mar 15 2023
sax-rgs-sw-access0X(config)# clock source ntp
sax-rgs-sw-access0X(config)# ntp server pool.ntp.org
sax-rgs-sw-access0X(config)# exit
sax-rgs-sw-access0X# write

disable port till fix is there - done 16.03.2023 00:40:

sax-rgs-sw-access02> enable
sax-rgs-sw-access02# configure terminal
sax-rgs-sw-access02(config)# interface GigabitEthernet0/7
sax-rgs-sw-access02(config-if-GigabitEthernet0/7)# no poe enable
sax-rgs-sw-access02(config-if-GigabitEthernet0/7)# exit
sax-rgs-sw-access02(config)# exit
sax-rgs-sw-access02# write

actual fix - done 22.03.2023:

  • reterminate keystone modules for both links (GigabitEthernet0/6 and GigabitEthernet0/7)
  • reenable poe on GigabitEthernet0/7
  • test by
    • resetting link counters on sax-rgs-sw-access02
    • iperf3 from ap to core gateway (bidirectional)
    • looking at the counters again

043 2023.03.20 01:30 | (maintenance) update eae-adp-jump01

eae-adp-jump01# syspatch
eae-adp-jump01# pkg_add -uU
eae-adp-jump01# reboot

044 2023.03.25 23:45 - 2023.03.26 13:00 (ANS) | broken upstream

ffl-ans-gw-core01 hasn't completed a wireguard handshake with eae-adp-jump01 since 2023.03.25 at around 23:45. Additionally the facility management called and said that there was "no internet" on site.

The facility management will drive to the ANS and check in with me to talk about the next steps

solution: after power cycling the gigacube the upstream came back

045 2023.04.01 - 2023.04.02 (ANS) | fibre cut to tent-1

issue: fibre cut from facility management container to tent-1

solution: replace fibre with outdoor copper cable

discussion:

  • longterm: replace copper with fibre