What are we doing?
Over the past week, we’ve been rolling out upgrades to every server on our network. This is part of our effort to continue providing high-quality, reliable service. We’ll be upgrading various software running on our servers, and generally doing a health checkup on all of them. Since Crident started, we’ve improved and changed how we maintain, set up & operate our machines. This maintenance window gives us an opportunity to bring our more legacy server setups in line with how (& with what software) we deploy new servers now.
We’re also upgrading the kernels on all our servers to apply fixes for certain variants of the recently discovered Spectre & Meltdown vulnerabilities, which affect certain CPUs. Neither Crident nor our upstream providers have any evidence that these exploits have ever been used outside of research. To take full advantage of this maintenance window, we’ll also be rolling out some changes to enable our CS:GO game hosting shortly after this maintenance is complete.
Why now & why all servers?
Spectre & Meltdown sent the IT industry into a bit of a meltdown (pardon the pun). Due to how our services are designed, we weren’t exposed to high risk as a result of these exploits, nor do the machines that were at any risk hold information we deem business-critical or critically sensitive. We opted to react carefully, rolling out only changes whose stability we were confident in; we think this is what our clients would have wanted us to do. That’s why we waited until now to fully tighten everything up regarding these design flaws in the CPUs we use.
This isn’t to say we took no mitigation steps, or that we weren’t taking these exploits seriously. We simply relied on security through obscurity, until now, while the IT industry worked out how to react to these exploits (& found new variants of them), and in turn developed (and deployed) the mitigations for these attacks. Now that those mitigations are in place and have proven stable over time, we’ve spent the last week fully deploying mitigation techniques for all variants of these exploits on our host systems.
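For the technically curious, it’s possible to check what the running kernel reports about these mitigations yourself. The sketch below uses the standard sysfs interface Linux kernels have exposed since 4.15; the exact set of files varies by kernel version, and this is our illustration rather than Crident’s own tooling.

```shell
# Linux kernels since 4.15 report per-vulnerability mitigation status
# under sysfs; each file holds a one-line status, e.g. "Mitigation: PTI".
vulndir=/sys/devices/system/cpu/vulnerabilities
if [ -d "$vulndir" ]; then
  # grep -r prints "path:status" for every vulnerability file present.
  report=$(grep -r . "$vulndir" 2>/dev/null)
else
  report="kernel does not expose $vulndir (pre-4.15 kernel?)"
fi
echo "$report"
```

Either branch produces a human-readable line, so this is safe to drop into a fleet-wide audit script.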
Impact & issues encountered
Things haven’t gone perfectly, regardless of how careful we were. One machine had stability issues because its BIOS/CPU microcode was updated before the corresponding kernel, and another ran into trouble because its /boot directory filled up with all the new kernel images being pushed out.
These machines were UK09 & UK03, respectively. We managed to get UK09 stable within 48 hours of becoming aware of the issue. UK03, embarrassingly, continued to have issues for about 7 days, because I (Drizzy) assumed the issue on UK03 was the same as the one on UK09. This was an oversight, and we’ll do better in the future to prevent these human errors from happening.
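The UK03 failure mode is a classic one: old kernel images accumulate until the /boot partition fills and new images fail to install cleanly. A minimal check like the sketch below can catch it early; the 80% threshold, the fallback to `/`, and the suggested cleanup command are our illustrative assumptions, not Crident’s actual monitoring.

```shell
# Hypothetical check for the UK03-style failure: /boot filling up
# with old kernel images. Threshold and paths are assumptions.
THRESHOLD=80
target=/boot
# df reports the filesystem containing /boot (or / if /boot is absent).
[ -d "$target" ] || target=/
usage=$(df --output=pcent "$target" | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARN: $target at ${usage}% - prune old kernels (e.g. 'apt-get autoremove --purge' on Debian/Ubuntu)"
else
  echo "OK: $target at ${usage}%"
fi
```

Run from cron, a check like this turns a silent disk-full failure into an alert before a kernel rollout breaks.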
All in all, everything is OK. You have nothing to reasonably worry about regarding the security of your services on Crident. These exploits, while serious, aren’t especially dangerous in our situation, due to how we manage & design our systems and the kinds of data our systems handle.
So, why are you making this post?
We want to be transparent. It’s important people understand why their services were impacted by issues on UK09 & UK03. It’s also important our users understand why we’re doing what we’re doing, and that we are paying attention to the current global events and threats impacting their servers.