Over the last 72 hours we’ve had quite an eventful time with our panel! I want to run through what went wrong, and how we are making sure it won’t happen again.
10AM – I got a message informing me that our internal k8s certificates had expired, preventing our web nodes from communicating with each other. If you don’t know what “k8s” (Kubernetes) is, it basically lets our servers work together.
This wasn’t a good situation, but it wasn’t a crisis: our systems are designed to handle this gracefully. We temporarily mitigated the issue by routing all traffic to Canada until we could push new certificates later.
7PM – When it came time to propagate the cert changes to all nodes, restarting the k8s management process (kubelet) caused all pods to be recreated, for a reason we still don’t understand. This rescheduling caused the pod IDs to get mixed up, which impacted stateful data being read and written.
This effectively reset our web servers to “factory settings” until Trixter manually rebuilt the cluster. While we got the panel back up on October 1st, issues impacting certain services continued into the next day.
The above caused longer-lasting problems for a minority of customers, who were unable to access their services for more than 12 hours in some cases. This isn’t acceptable, and we’ve reached out to compensate impacted customers.
Unfortunately, this was caused by DNS resolution issues resulting from the invalid certificates. These residual effects of the main incident that started on the 1st continued into October 3rd.
They also caused a database crash, which wasn’t recovered automatically because the k8s cluster was still unstable after the recent trauma. This caused an error 500 to appear for about 5 minutes. This isn’t normal: under normal circumstances the database would have been recovered automatically.
It doesn’t look great
From the above paragraphs, it’s easy to draw many conclusions. So it’s important to put things in perspective: 4.6 hours of downtime against 8254+ hours of uptime (344 days).
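For anyone who wants to check the maths, here is a minimal sketch of how those two figures turn into an uptime percentage. It takes the numbers quoted above at face value; the function name `uptime_percent` is just something made up for this illustration, not part of any tooling we run.

```python
def uptime_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return uptime as a percentage of total elapsed hours."""
    total_hours = uptime_hours + downtime_hours
    return 100.0 * uptime_hours / total_hours


# Figures quoted in this post (the "8254+" is treated as exactly 8254 here).
UPTIME_HOURS = 8254.0
DOWNTIME_HOURS = 4.6

print(f"{uptime_percent(UPTIME_HOURS, DOWNTIME_HOURS):.2f}%")  # prints 99.94%
```

This only covers the panel-wide downtime figure above; it does not account for the longer outages that a minority of customers experienced.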
We will still be hitting over 99% uptime this year. However, we will try harder in the future, which brings me onto my next point.
WISP is a new SaaS Crident is working on that will “house” Dino Panel in the future. We’ve taken everything we learnt with Dino & put it into making WISP better, which is why the issues impacting us over the last few days wouldn’t be possible on WISP.
WISP separates the clusters geographically, so any software failure should be contained to a specific region. In the future, issues like this would be isolated and routed around very quickly.
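To give a rough feel for what “routing around” a failed region means, here is a purely illustrative sketch. It is not how WISP is actually implemented; the region names, the health map and the `pick_region` function are all assumptions invented for this example.

```python
from typing import Mapping, Optional, Sequence


def pick_region(preferred: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first region in preference order that is marked healthy."""
    for region in preferred:
        # A region missing from the health map is treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None  # No healthy region is available.


# Hypothetical health state: one region is down, the other is fine.
health = {"uk": False, "canada": True}
print(pick_region(["uk", "canada"], health))  # prints canada
```

The point is simply that when each region is independent, a failure in one leaves the others available to take the traffic.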
WISP is going into beta in October, and Crident will be looking to move onto the platform in late 2019 or early 2020.