incidents: add 012 about the ongoing random reboots of gw-core01

Gregor Michels 2022-09-02 22:04:28 +02:00
parent b5698a6c90
commit b57200bd6c
1 changed files with 20 additions and 0 deletions

@@ -401,3 +401,23 @@ I added a `persistent_keepalive` to the tunnel to stop this from happening again
* [ ] monitor connectivity for the public wifi (`blackbox exporter` in `client` network) and create alerting rules
* [ ] prometheus instance on `eap-adp-jump01` to get alerts if upstream is down in facility
* [ ] monitor wireguard state (probably needs a custom lua exporter)
012: 2022.09.01 17:24, 18:10 | ongoing reboots of gw-core01
-------------------------------------------------------------
Unfortunately, zip-tying back the protective cap of the power strip did not stop the random reboots of `gw-core01`.
See incidents `001` and `010` for details.
Either the power supply or the device itself is broken.
**solution**:
* [ ] replace power supply
* [ ] replace device itself (if replacing the power supply does not work)
I tried to replace the power supply today (2022.09.01 ~20:00), but nobody could let me into the facility.
I am going to try again tomorrow.
**impact**:
* 2022.09.01 17:24, 17:47
* 2022.09.02 14:31, 18:10