2022-07-01 00:00:43 +00:00

Collection of all incidents
===========================

2022.06.30 17:00 - 18:00 | power issues on gw-core01
----------------------------------------------------

**issue**:

The protective cap of the power strip yanked the power supply of `gw-core01` out of its socket.
Therefore `gw-core01` had no power.

**solution**:

Tape back the protective cap of the power strip and reinsert the power supply.

**impact**:

No internet access for 2 hours.

2022.06.30 12:00 - 2022.07.01 19:00 | wifi issues in tent 5
-----------------------------------------------------------

### issue

A resident of tent 5 reported slow internet speeds. I have no further information.
While trying to check the logs of the ap I noticed that `ap-ac7c` was very slow and hung/froze a lot over ssh.

Rebooting did not solve the problem.

### cause

Unknown

I checked the ap in person the next day. I tested the ap with a different lan cable on a different switch port.
The issues I had noticed the night before were not reproducible.

But I did notice that the short patch cable (connecting the ap to the switch) had some light rust on it.

### solution

_01.07.2022 ~ 03:00_ (short term):
After noticing the issue myself I tried rebooting the ap.
Unfortunately that did not solve the problem.
To spare the clients from connecting to a bonkers ap I disabled poe on the switch port to take the ap offline:
```
root@sw-access02:~# uci show poe | grep lan2
poe.@port[1].name='lan2'
root@sw-access02:~# uci set poe.@port[1].enable=0
root@sw-access02:~# uci commit poe
root@sw-access02:~# /etc/init.d/poe restart
```

_01.07.2022 ~ 19:00_ (long term):
I could not reproduce the issue in person. To be on the safe side I replaced the short patch cable (connecting the ap to the switch) and the ap itself:
`ap-ac7c -> ap-1a38`.
Afterwards I re-enabled poe on the corresponding switch port.

### impact

* `2022.06.30 12:00 - 2022.07.01 03:30`: (probably) unreliable wifi for clients connected to `ap-ac7c`
* `2022.07.01 03:30 - 2022.07.01 18:30`: bad signal strength for clients in and around tent 5

### notes

While disabling poe on the port connecting `ap-ac7c` I restarted the `poe` service.
That resulted in all ports briefly dropping power.
Therefore I also accidentally rebooted `ap-2bbf`.

Next time I'll just reload the service (shame on me).
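
A minimal sketch of how the port toggle could look with a reload instead of a restart. The helper name `set_poe_port` and the `RUN` dry-run switch are hypothetical, and it is an assumption (not verified here) that the `poe` init script handles `reload` without power-cycling the other ports.

```shell
# Hypothetical helper: toggle PoE for one port of the uci "poe" config.
# By default it only prints the commands (dry run); set RUN to an empty
# string on the switch to actually execute them.
set_poe_port() {
    idx=$1      # index into poe.@port[], e.g. 1 for 'lan2' (see `uci show poe`)
    state=$2    # 0 = power off, 1 = power on
    ${RUN-echo} uci set "poe.@port[$idx].enable=$state"
    ${RUN-echo} uci commit poe
    # Assumption: reload applies the change without cutting power on other ports.
    ${RUN-echo} /etc/init.d/poe reload
}

set_poe_port 1 0    # dry run: prints the three commands for 'lan2'
```
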

### logs

This was my test to show that ssh was slow and froze a lot on `ap-ac7c`.

good ap:
```
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:04:43 up 1 day, 6:21, load average: 0.01, 0.02, 0.00

real 0m1.438s
user 0m0.071s
sys 0m0.011s
user@freifunk-admin:~$ time ssh ap-2bbf uptime
01:17:49 up 1 day, 6:34, load average: 0.00, 0.01, 0.00

real 0m1.924s
user 0m0.070s
sys 0m0.010s
```

bad ap:
```
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:05:00 up 1 day, 6:33, load average: 0.01, 0.08, 0.03

real 0m29.526s
user 0m0.070s
sys 0m0.014s
user@freifunk-admin:~$ time ssh ap-ac7c uptime
01:06:22 up 1 day, 6:34, load average: 0.00, 0.06, 0.03

real 1m15.379s
user 0m0.081s
sys 0m0.015s
user@freifunk-admin:~$
```
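
The comparison above could be repeated in a loop over several aps. This is only a sketch, assuming key-based ssh access from the admin host; the helper name `elapsed` and the `BatchMode` option are my additions, and its whole-second resolution is coarser than `time`.

```shell
# Hypothetical helper: print the wall-clock seconds a command takes.
# The body runs in a subshell so its variables do not leak.
elapsed() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
)

# Usage on the admin host (hostnames taken from the logs above):
#   for ap in ap-2bbf ap-ac7c; do
#       echo "$ap: $(elapsed ssh -o BatchMode=yes "$ap" uptime)s"
#   done
```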

2022.07.03 00:30 | (maintenance) rolling wifi channel updates
-------------------------------------------------------------

We skipped mapping every ap to non-overlapping channels while installing the infrastructure because of time constraints.

Therefore I just did that (commit: 41752ef9bdfe0041359e09d08e107d330f10fcf2).

I ran ansible with `--fork=1` to update one ap at a time (every ap takes about 5 seconds to reconfigure its radios).

**impact**:
Either clients "roamed" (no fast roaming, so short interruptions) to a different ap or waited until the original ap came back online.
No client saw more than 10-15 seconds of service interruption.

2022.07.17 01:50 | (maintenance) increase dhcp pool size for clients
--------------------------------------------------------------------

The dnsmasq instance for the client network (`10.84.4.0/22`) only used the dhcp pool `10.84.4.100 - .250`.

To be able to actually assign the full `/22` to clients I changed the pool to `10.84.4.2 - 10.84.7.254`.

Afterwards I reloaded `dnsmasq` on `gw-core01`.

**impact**: none

Currently `dnsmasq` has handed out 104 leases, so we presumably never ran out of ips in the old pool.
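
A quick sanity check of the two pool sizes, as a sketch. The helper names `ip_to_int` and `pool_size` are hypothetical; the addresses are the ones quoted above.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Number of addresses in an inclusive range (first and last address included).
pool_size() {
    echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

pool_size 10.84.4.100 10.84.4.250   # old pool: prints 151
pool_size 10.84.4.2 10.84.7.254     # new pool: prints 1021
```

With 104 leases handed out, the old pool of 151 addresses was roughly two thirds full.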