incidents: (maintenance) add incident 020 about ap firmware upgrades

Gregor Michels 2022-09-28 00:34:25 +02:00
parent 87e7767ea5
commit 836436e625
2 changed files with 74 additions and 1 deletion


@ -928,7 +928,7 @@ I don't know if the social worker container still had power.
019 2022.09.26 01:00 | ap-0b99 unreachable via ssh
--------------------------------------------------
While trying to upgrade the firmware on ap-0b99 (future incident which will be linked) I was unable to ssh into the ap.
While trying to upgrade the firmware on ap-0b99 (see `incident 020` for details) I was unable to ssh into the ap.
`ssh -v` revealed that a successful ssh handshake happened but afterwards `ap-0b99` immediately closed the connection.
`monitoring01` was still able to scrape the target.
@ -941,3 +941,57 @@ The recent power outages (see `incident 018` for details) destroyed the logs.
* till: 2022.09.26 ~15:30
* maybe just an issue with the management plane
* possibly also an issue with the forwarding plane and wifi stuff
020 2022.09.27 02:30 - 04:00 | (maintenance) firmware unification on accesspoints
-------------------------------------------------------------------
To get all APs onto the same firmware revision, I've `sysupgrade`d them in the early morning hours.
```
https://git.sr.ht/~hirnpfirsich/garet
garet ce38181, aruba-ap-105_21.02.3
```
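For reference, the manual equivalent of what the playbook further down automates is just copying the image onto the AP and running `sysupgrade` on it. A minimal sketch for a single AP, assuming root ssh access and an image filename (the exact `.bin` name is a guess):
```
# copy the image to the AP and flash it; hostname and filename are assumptions
scp aruba-ap-105_21.02.3.bin root@ap-ac7c:/tmp/
ssh root@ap-ac7c 'sysupgrade /tmp/aruba-ap-105_21.02.3.bin'
```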
This was not necessary but helped with two things:
* the current garet releases include a version number in the firmware
  * because not all APs were installed at the same time, there were different firmware/garet versions installed
  * nothing serious, but it forced the ansible playbooks to always think about all versions
  * now we can simplify the playbooks :)
  * also a clear version indicator is always nice
* the APs needed some packages for new ansible shenanigans
I started with the APs in the office containers (because nobody is working there at these hours).
After doing those upgrades I noticed that the images were missing some files (which disabled the wifi monitoring).
So after fixing that and testing the new images (by flashing the APs in the office containers again) I upgraded the firmware on all APs.
Because `ap-0b99` was unreachable at that time (see `incident 019` for details) it was still stuck on an older revision of the firmware.
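To double check which AP ended up on which revision, a quick loop over the APs works. This is only a sketch and assumes the garet/OpenWrt version is visible in `/etc/openwrt_release` (where exactly garet puts its version string is an assumption):
```
# print the release info of every AP from the timetable below
for ap in ap-ac7c ap-c5d1 ap-8f42 ap-c495 ap-2bbf ap-1a38 ap-8f39; do
  echo "== ${ap}"
  ssh "root@${ap}" 'cat /etc/openwrt_release'
done
```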
**timetable**:
* 02:50: `ap-ac7c`
* 03:05: `ap-c5d1`
* 03:30: `ap-ac7c`
* 03:43: `ap-c5d1`
* 03:56:
  * `ap-8f42`
  * `ap-c495`
  * `ap-2bbf`
  * `ap-1a38`
  * `ap-8f39`
**playbook**:
To automate the upgrade process I've written a small playbook (`playbook_sysupgrade`).
One has to specify the firmware image via `-e firmware_file=<file>`.
Currently the playbook targets all accesspoints; limit the run to specific devices via `-l <device>`.
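A typical invocation looks roughly like this (a sketch; the image filename and an inventory defining the `accesspoints` group are assumptions):
```
# upgrade only ap-0b99; drop -l to hit all accesspoints
ansible-playbook playbook_sysupgrade.yml \
  -e firmware_file=aruba-ap-105_21.02.3.bin \
  -l ap-0b99
```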
**impact**:
While doing the `sysupgrade` the APs disable all services (including wifi), write the new firmware into flash and then reboot.
This process takes roughly 5 minutes.
In this window the AP drops all clients.
(Nearly) all APs upgraded at the same time, so clients could not roam to a different AP.
This means that there was complete wifi downtime for about 10 minutes.
**update**:
After the power outage (see `incident 018` for details) `ap-0b99` was reachable again.
Therefore I've upgraded that AP on 2022.09.28 from 00:21 till 00:27.

19 playbook_sysupgrade.yml Normal file

@ -0,0 +1,19 @@
---
- name: upgrade firmware on openwrt device(s)
  gather_facts: no
  hosts: accesspoints
  tasks:
    - name: upload new firmware
      copy:
        src: "{{ firmware_file }}"
        dest: "/tmp/{{ firmware_file | basename }}"
    - name: issue sysupgrade command
      command:
        cmd: "sysupgrade /tmp/{{ firmware_file | basename }}"
      # sysupgrade flashes the image and reboots, which drops the ssh
      # connection and makes the task look failed, hence ignore_errors
      ignore_errors: yes
    - name: wait till device is back online
      # give the AP up to 10 minutes to come back after the reboot
      wait_for_connection:
        delay: 10
        timeout: 600