incidents: (maintenance) add incident 020 about ap firmware upgrades

Gregor Michels 2022-09-28 00:34:25 +02:00
parent 87e7767ea5
commit 836436e625
2 changed files with 74 additions and 1 deletion


@ -928,7 +928,7 @@ I don't know if the social worker container still had power.
019 2022.09.26 01:00 | ap-0b99 unreachable via ssh
--------------------------------------------------
While trying to upgrade the firmware on ap-0b99 (future incident which will be linked) I was unable to ssh into the ap.
While trying to upgrade the firmware on ap-0b99 (see `incident 020` for details) I was unable to ssh into the ap.
`ssh -v` revealed that a successful ssh handshake happened but afterwards `ap-0b99` immediately closed the connection.
`monitoring01` was still able to scrape the target.
@ -941,3 +941,57 @@ The recent power outages (see `incident 018` for details) destroyed the logs.
* till: 2022.09.26 ~15:30
* maybe just an issue with the management plane
* possibly also an issue with the forwarding plane and wifi stuff
020 2022.09.27 02:30 - 04:00 | (maintenance) firmware unification on accesspoints
-------------------------------------------------------------------
To get all APs onto the same firmware revision, I've `sysupgrade`d them in the early morning hours.
```
https://git.sr.ht/~hirnpfirsich/garet
garet ce38181, aruba-ap-105_21.02.3
```
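For reference, the manual equivalent of what the playbook further down automates is just copying the image onto the AP and running `sysupgrade` on it. A minimal sketch for a single AP, assuming root ssh access and an image filename (the exact `.bin` name is a guess):
```
# copy the image to the AP and flash it; hostname and filename are assumptions
scp aruba-ap-105_21.02.3.bin root@ap-ac7c:/tmp/
ssh root@ap-ac7c 'sysupgrade /tmp/aruba-ap-105_21.02.3.bin'
```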
This was not necessary but helped with two things:
* the current garet releases include a version number in the firmware
  * because not all APs were installed at the same time, there were different firmware/garet versions installed
  * nothing serious, but it forced the ansible playbooks to always think about all versions
  * now we can simplify the playbooks :)
  * also a clear version indicator is always nice
* the APs needed some packages for new ansible shenanigans
I started with the APs in the office containers (because nobody is working there at these hours).
After doing those upgrades I noticed that the images were missing some files (which disabled the wifi monitoring).
So after fixing that and testing the new images (by flashing the APs in the office containers again) I upgraded the firmware on all APs.
Because `ap-0b99` was unreachable at that time (see `incident 019` for details) it was still stuck on an older revision of the firmware.
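To double check which AP ended up on which revision, a quick loop over the APs works. This is only a sketch and assumes the garet/OpenWrt version is visible in `/etc/openwrt_release` (where exactly garet puts its version string is an assumption):
```
# print the release info of every AP from the timetable below
for ap in ap-ac7c ap-c5d1 ap-8f42 ap-c495 ap-2bbf ap-1a38 ap-8f39; do
  echo "== ${ap}"
  ssh "root@${ap}" 'cat /etc/openwrt_release'
done
```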
**timetable**:
* 02:50: `ap-ac7c`
* 03:05: `ap-c5d1`
* 03:30: `ap-ac7c`
* 03:43: `ap-c5d1`
* 03:56:
  * `ap-8f42`
  * `ap-c495`
  * `ap-2bbf`
  * `ap-1a38`
  * `ap-8f39`
**playbook**:
To automate the upgrade process I've written a small playbook (`playbook_sysupgrade`).
One has to specify the firmware image via `-e firmware_file=<file>`.
Currently the playbook targets all accesspoints; limit the run to specific devices via `-l <device>`.
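A typical invocation looks roughly like this (a sketch; the image filename and an inventory defining the `accesspoints` group are assumptions):
```
# upgrade only ap-0b99; drop -l to hit all accesspoints
ansible-playbook playbook_sysupgrade.yml \
  -e firmware_file=aruba-ap-105_21.02.3.bin \
  -l ap-0b99
```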
**impact**:
While doing the `sysupgrade` the APs disable all services (including wifi), write the new firmware into flash and then reboot.
This process takes roughly 5 minutes.
In this window the AP drops all clients.
(Nearly) all APs upgraded at the same time, so clients could not roam to a different AP.
This means that there was complete wifi downtime for about 10 minutes.
**update**:
After the power outage (see `incident 018` for details) `ap-0b99` was reachable again.
Therefore I've upgraded that AP on 2022.09.28 from 00:21 till 00:27.

19 playbook_sysupgrade.yml Normal file

@ -0,0 +1,19 @@
---
- name: upgrade firmware on openwrt device(s)
  gather_facts: no
  hosts: accesspoints
  tasks:
    - name: upload new firmware
      copy:
        src: "{{ firmware_file }}"
        dest: "/tmp/{{ firmware_file | basename }}"
    - name: issue sysupgrade command
      command:
        cmd: "sysupgrade /tmp/{{ firmware_file | basename }}"
      # sysupgrade flashes the image and reboots, which drops the ssh
      # connection and makes the task look failed, hence ignore_errors
      ignore_errors: yes
    - name: wait till device is back online
      # give the AP up to 10 minutes to come back after the reboot
      wait_for_connection:
        delay: 10
        timeout: 600