Compare commits

3 Commits

Author SHA1 Message Date
Gregor Michels 03e2543f95 incidents: add incident 025 2022-12-23 01:03:39 +01:00
Gregor Michels 0475923590 alerting: only alarm on devices that are unreachable for 1m at least 2022-12-22 16:37:15 +01:00
Gregor Michels 69834a8d2b alerting: also alert on reboots of snmp devices 2022-12-22 16:37:15 +01:00
2 changed files with 106 additions and 1 deletions

@ -1118,3 +1118,99 @@ all updates where doing using the new "idempotent" `playbook_sysupgrade` (since
* 2022.10.24 01:44 - 01:46: `gw-core01`
=> downtime of the accesspoints in the specified timeframe
=> downtime of `gw-core01` in the specified timeframe
025 2022.11.19 04:00 (ANS) | (maintenance) (try to) steer clients into 5 GHz band
---------------------------------------------------------------------------------
---
_this log entry was added well after the actual work was done.
Please take it with a grain of salt_
---
**problem**:
* (if I remember correctly) far more clients in the 2.4 GHz band than in the 5 GHz band (roughly 3/4 to 1/4)
**solution**:
* halved the transmit power in the 2.4 GHz band
* increased the transmit power in the 5 GHz band by 1 dBm
* implemented by `5017cb5`
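For orientation, the two power changes above can be expressed in decibels. This is a minimal sketch; the 20 dBm starting value is a hypothetical placeholder, since the log does not record the actual transmit power settings.

```python
import math


def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw: float) -> float:
    """Convert a power in milliwatts to dBm."""
    return 10 * math.log10(mw)


def halve_power_dbm(dbm: float) -> float:
    """Return the dBm level that corresponds to half the power."""
    return mw_to_dbm(dbm_to_mw(dbm) / 2)


def db_to_ratio(db: float) -> float:
    """Linear power ratio for a relative change given in dB."""
    return 10 ** (db / 10)


# Hypothetical starting value, NOT taken from the log.
assumed_tx_power_dbm = 20.0

# Halving the power lowers the level by about 3 dB.
print(f"{halve_power_dbm(assumed_tx_power_dbm):.2f} dBm")  # 16.99 dBm

# Raising the level by 1 dB is roughly a 26 % increase in power.
print(f"x{db_to_ratio(1.0):.2f}")  # x1.26
```

In other words, the 2.4 GHz change (about -3 dB) is considerably larger than the 5 GHz change (+1 dB).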
**impact**:
This restarted wifi on all APs at the same time.
Downtime for all clients for a few seconds at 04:00 in the morning.
**validation**:
One day afterwards there seemed to be more clients in the 5 GHz band (roughly 50/50), but the data rates dropped for most of them.
**criticism**:
* placement, transmit power and supported bands of the clients impact 5 GHz utilization
* unsure what the actual problem is
* also did not validate the change properly over a few days
026 2022.11.20 15:30 (ANS) | (maintenance) replace SFP modules
--------------------------------------------------------------
...
027 2022.11.21 02:00 | (maintenance) attach volume to `eae-adp-jump01` for prometheus
-------------------------------------------------------------------------------------
attached a 20 GB volume to the VM
reboot: shortly before 02:00
ran syspatch beforehand
```
eae-adp-jump01# rcctl stop prometheus
eae-adp-jump01# rm -r /var/prometheus/*
eae-adp-jump01# sysctl hw.disknames
eae-adp-jump01# fdisk -iy sd1
eae-adp-jump01# disklabel -E sd1
> a a
>
> *
> q
eae-adp-jump01# newfs sd1a
eae-adp-jump01# diff -Naur /etc/fstab.20221121 /etc/fstab
--- /etc/fstab.20221121 Sun Jun 26 23:00:39 2022
+++ /etc/fstab Mon Nov 21 02:01:03 2022
@@ -8,3 +8,4 @@
e1c3571d54635852.j /usr/obj ffs rw,nodev,nosuid 1 2
e1c3571d54635852.i /usr/src ffs rw,nodev,nosuid 1 2
e1c3571d54635852.e /var ffs rw,nodev,nosuid 1 2
+a0469c9f38992e1d.a /var/prometheus ffs rw,nodev,nosuid 1 2
eae-adp-jump01# mount /var/prometheus
eae-adp-jump01# chown _prometheus:_prometheus /var/prometheus
eae-adp-jump01# rcctl start prometheus
```
028 2022.11.29 02:00 | periodically restart prometheus
-------------
028 2022.11.29 03:00 | (maintenance) activate auto start for offloader
-------------
offloader down from 02:50 to 03:05
gw-core down from 02:53 to 02:55
029 2022.11.30 15:30 | (maintenance) replace switches
----
* 16:34:30 - 16:34:50: `ffl-ans-sw-distribution01`
* quickly replaced device and connections
* 16:49: `ffl-ans-sw-access01`: minimal L2 downtime, accesspoint needed a reboot
* power up new device alongside
* bridge old and new device with short patch cable
* move sfp uplink to new device
* move first ap to new switch
* wait till ap was back up and serving clients
* move second ap
* teardown old device
* 17:09:30 - 17:10:15: `ffl-ans-sw-access02`
* quickly replaced device and connections

@ -4,7 +4,7 @@ groups:
# from https://awesome-prometheus-alerts.grep.to/rules.html#rule-prometheus-self-monitoring-1-2
- alert: PrometheusTargetMissing
expr: up == 0
-    for: 0m
+    for: 1m
labels:
severity: critical
annotations:
@ -64,3 +64,12 @@ groups:
annotations:
summary: A switch port changed its state {{ $value }}x time
description: "For some reason a switch port changed its state\n LABELS = {{ $labels }}"
- alert: SNMPNodeRebooted
expr: (sysUpTime / 100) <= (60 * 60 * 2)
for: 0m
labels:
severity: critical
annotations:
summary: An SNMP node rebooted in the last 2 hours (instance {{ $labels.instance }})
description: "The uptime of an SNMP node is below two hours. VALUE = {{ $value }}\n LABELS = {{ $labels }}"
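
The `SNMPNodeRebooted` expression relies on SNMP `sysUpTime` being reported in hundredths of a second (TimeTicks), so dividing by 100 yields seconds. The following is a minimal sketch of that comparison outside of Prometheus; the function and constant names are made up for illustration and are not part of the rule file.

```python
# Window used by the alert expression: 60 * 60 * 2 seconds.
REBOOT_WINDOW_S = 60 * 60 * 2

# sysUpTime is a 32-bit TimeTicks counter (1 tick = 1/100 s).
TICKS_PER_SECOND = 100
WRAP_AFTER_DAYS = 2**32 / TICKS_PER_SECOND / 86400


def rebooted_recently(sys_uptime_ticks: int, window_s: int = REBOOT_WINDOW_S) -> bool:
    """Mirror of `(sysUpTime / 100) <= (60 * 60 * 2)`."""
    return sys_uptime_ticks / TICKS_PER_SECOND <= window_s


# 30 minutes of uptime: inside the window, the alert condition holds.
print(rebooted_recently(30 * 60 * TICKS_PER_SECOND))  # True

# 3 hours of uptime: outside the window, the condition no longer holds.
print(rebooted_recently(3 * 60 * 60 * TICKS_PER_SECOND))  # False

# The counter wraps to zero after roughly 497 days.
print(f"{WRAP_AFTER_DAYS:.1f} days")  # 497.1 days
```

Two consequences follow from this shape: with `for: 0m` the alert stays active until the device has been up for more than two hours, and a counter wrap after roughly 497 days of uptime would look the same as a reboot to this expression.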