incidents: (maintenance) add incident 020 about ap firmware upgrades
parent 87e7767ea5
commit 836436e625
@@ -928,7 +928,7 @@ I don't know if the social worker container still had power.
 019 2022.09.26 01:00 | ap-0b99 unreachable via ssh
 --------------------------------------------------

-While trying to upgrade the firmware on ap-0b99 (future incident which will be linked) I was unable to ssh into the ap.
+While trying to upgrade the firmware on ap-0b99 (see `incident 020` for details) I was unable to ssh into the ap.
 `ssh -v` revealed that a successful ssh handshake happened but afterwards `ap-0b99` immediately closed the connection.

 `monitoring01` was still able to scrape the target.
@@ -941,3 +941,57 @@ The recent power outages (see `incident 018` for details) destroyed the logs.
 * till: 2022.09.26 ~15:30
 * maybe just an issue with the management plane
 * possibly also an issue with the forwarding plane and wifi stuff
+
+
+020 2022.09.27 02:30 - 04:00 | (maintenance) firmware unification on accesspoints
+-------------------------------------------------------------------
+
+To have all APs on the same firmware revision I `sysupgraded` them in the early morning hours.
+```
+https://git.sr.ht/~hirnpfirsich/garet
+garet ce38181, aruba-ap-105_21.02.3
+```
+
+This was not necessary but helped with two things:
+* the current garet releases include a version number in the firmware
+  * because not all aps were installed at the same time there were different firmware/garet versions installed
+  * nothing serious but it forced the ansible playbooks to always account for all versions
+  * now we can simplify the playbooks :)
+  * also a clear version indicator is always nice
+* the aps needed some packages for new ansible shenanigans
+
+I started by upgrading the APs in the office containers (because nobody is working there at these hours).
+After these upgrades I noticed that the images were missing some files (disabling the wifi monitoring).
+
+So after fixing that and testing the images (by flashing the APs in the office containers again) I upgraded the firmware on all APs.
+
+Because `ap-0b99` was unreachable at that time (see `incident 019` for details) it was still stuck on an older revision of the firmware.
+
+**timetable**:
+* 02:50: `ap-ac7c`
+* 03:05: `ap-c5d1`
+* 03:30: `ap-ac7c`
+* 03:43: `ap-c5d1`
+* 03:56:
+  * `ap-8f42`
+  * `ap-c495`
+  * `ap-2bbf`
+  * `ap-1a38`
+  * `ap-8f39`
+
+**playbook**:
+To automate the upgrade process I've written a small playbook (`playbook_sysupgrade`).
+One has to specify the firmware image via `-e firmware_file=<file>`.
+By default the playbook currently tries to upgrade all accesspoints (limit it via `-l <device>`).
+
+**impact**:
+While doing the `sysupgrade` the APs disable all services (including wifi), write the new firmware into flash and then reboot.
+This process takes roughly 5 minutes.
+During this window the ap drops all clients.
+
+(Nearly) all APs upgraded at the same time, so clients could not roam to a different AP.
+This means that there was complete wifi downtime for about 10 minutes.
+
+**update**:
+After the power outage (see `incident 018` for details) `ap-0b99` was reachable again.
+Therefore I upgraded the ap on 2022.09.28 between 00:21 and 00:27.
@@ -0,0 +1,19 @@
+---
+- name: upgrade firmware on openwrt device(s)
+  gather_facts: no
+  hosts: accesspoints
+  tasks:
+    - name: upload new firmware
+      copy:
+        src: "{{ firmware_file }}"
+        dest: "/tmp/{{ firmware_file | basename }}"
+
+    - name: issue sysupgrade command
+      command:
+        cmd: "sysupgrade /tmp/{{ firmware_file | basename }}"
+      ignore_errors: yes
+
+    - name: wait till device is back online
+      wait_for_connection:
+        delay: 10
+        timeout: 600
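
The impact notes in incident 020 say that (nearly) all APs upgraded at the same time, which left clients nothing to roam to. A minimal sketch of a rolling variant is given here; it is not part of this commit and is untested against these devices. It assumes that neighbouring APs overlap enough in coverage for clients to roam while one AP is down. The only change compared to the playbook in this commit is the standard Ansible play keyword `serial`, which makes the play finish all tasks on one host before starting the next.

```yaml
---
# Sketch only: same tasks as the playbook in this commit, run one AP at a time.
- name: upgrade firmware on openwrt device(s), one at a time
  gather_facts: no
  hosts: accesspoints
  # Assumption: upgrading a single AP at a time leaves other APs for clients to roam to.
  serial: 1
  tasks:
    - name: upload new firmware
      copy:
        src: "{{ firmware_file }}"
        dest: "/tmp/{{ firmware_file | basename }}"

    # sysupgrade reboots the device, so the ssh connection is expected to drop.
    - name: issue sysupgrade command
      command:
        cmd: "sysupgrade /tmp/{{ firmware_file | basename }}"
      ignore_errors: yes

    # Block until the device answers again before moving on to the next AP.
    - name: wait till device is back online
      wait_for_connection:
        delay: 10
        timeout: 600
```

It would be invoked like the original, for example `ansible-playbook <playbook> -e firmware_file=<file>`, optionally with `-l <device>`. The trade-off is run time: at roughly 5 minutes per AP the whole run takes correspondingly longer than upgrading all APs in parallel.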