Compare commits

2 Commits

Author SHA1 Message Date
Gregor Michels 0bf94d10a2 add incident 030: replace switches in ans 2022-12-23 01:45:29 +01:00
Gregor Michels ec0cfc908a add incident 029: ans create a service for the offloader vm 2022-12-23 01:39:26 +01:00
1 changed files with 64 additions and 0 deletions

@ -1235,3 +1235,67 @@ eae-adp-jump01# crontab -e
[...]
0 */2 * * * rcctl restart prometheus
```
029 2022.11.29 03:00 (ANS) | (maintenance) automagically start offloader
------------------------------------------------------------------------
---
_this log entry was added way after doing the actual work.
Please read it with a grain of salt_
---
**problem**:
ANS washes its traffic via an FFLPZ/FFDD offloader vm.
There was only a script that started the offloader vm manually.
After a reboot the offloader vm would not start automagically.
**solution**:
implement a service that starts the vm on boot
**impact**:
After validating the script on another OpenWrt machine, I tested it in production.
This caused the following downtimes:
* `offloader` down from 02:50 to 03:05 -- service interruption for the public wifi
* `ffl-ans-gw-core01` down from 02:53 to 02:55 -- service interruption for everybody
**disclaimer**:
The script is manually deployed on `ffl-ans-gw-core01` and therefore not part of this repo at the moment
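
Since the deployed script is not in this repo, the following is only a minimal sketch of what such a service could look like on OpenWrt, using a procd init script. The file name `/etc/init.d/offloader`, the path in `PROG` and the `START`/`STOP` priorities are assumptions for illustration, not the values used on `ffl-ans-gw-core01`. It also assumes the start script keeps the vm process in the foreground so that procd can supervise it.

```shell
#!/bin/sh /etc/rc.common
# /etc/init.d/offloader -- hypothetical file name, not the deployed script

START=99        # start late in the boot sequence (assumed ordering)
STOP=10
USE_PROCD=1

# Assumption: path of the existing manual start script. It has to keep
# the vm process in the foreground so procd can supervise it.
PROG=/usr/local/bin/start-offloader.sh

start_service() {
	procd_open_instance
	procd_set_param command "$PROG"
	procd_set_param respawn          # restart the vm if the process exits
	procd_set_param stdout 1         # forward stdout to the system log
	procd_set_param stderr 1         # forward stderr to the system log
	procd_close_instance
}
```

With such a file in place, `/etc/init.d/offloader enable` would register it for boot and `/etc/init.d/offloader start` would start it immediately.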
030 2022.11.30 15:30 (ANS) | (maintenance) replace switches
-----------------------------------------------------------
---
_this log entry was added way after doing the actual work.
Please read it with a grain of salt_
---
**intro**:
The switches installed in ANS were defective.
PoE did not work after every boot,
meaning that a power outage could leave the APs without power.
Fortunately `Zyxel` replaced the devices.
**replacement log**:
* 16:34:30 - 16:34:50: `ffl-ans-sw-distribution01`
  * quickly replaced device and connections
  * => l2 interruption for `ffl-ans-sw-access01` and `ffl-ans-sw-access02`
  * => power cycle of APs in social, security and facility container
* 16:49: `ffl-ans-sw-access01`
  * power up new device alongside
  * bridge old and new device with short patch cable
  * move sfp uplink to new device
  * move first ap to new switch
  * wait till ap was back up and serving clients
  * move second ap
  * teardown old device
  * => minimal l2 downtime
  * => rolling AP downtimes
* 17:09:30 - 17:10:15: `ffl-ans-sw-access02`
  * quickly replaced device and connections
  * => power cycle of all APs in `tent 2&3`
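
The "wait till ap was back up" step of the rolling replacement could be scripted along the following lines. This is a sketch, not what was used during the maintenance: `wait_until` is a hypothetical helper and `AP_ADDR` a placeholder for the address of the AP that was just moved. A successful ping only shows that the AP is reachable again, not that it is already serving clients.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
	tries=$1
	delay=$2
	shift 2
	while [ "$tries" -gt 0 ]; do
		if "$@" >/dev/null 2>&1; then
			return 0
		fi
		tries=$((tries - 1))
		# only sleep if another attempt follows
		[ "$tries" -gt 0 ] && sleep "$delay"
	done
	return 1
}

# Example (AP_ADDR is a placeholder): probe every 5 seconds, up to 60 times.
# wait_until 60 5 ping -c 1 -W 1 "$AP_ADDR" && echo "ap reachable, move the next one"
```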