Alright, let’s keep this to the point. Our second Dallas node has been experiencing more issues than I’m comfortable with, so I want to document the issues this node has had, and the steps we’ve taken to ensure they don’t happen again.
Why is Dallas 02 having issues?
Our Dallas location is new. Dallas 01 was deployed with a provider we had spent months testing; however, between the first and second machine we decided to change our Dallas provider. This swap allowed us to have more control over our infrastructure, keep our prices competitive, and offer better hardware.
These events took place within a short period of time and naturally concerned the impacted customers, which is why we're making this post.
What have the issues been?
Mic Lag (Firewall Issue)
Mic lag was reported on servers with higher player counts. It was caused by a firewall rule that rate-limited traffic too aggressively.
Once we became aware of it, the issue was fixed within two hours, though it had likely been impacting certain servers on the node for a while. Unfortunately, it wasn't picked up by our internal tests because it only occurs at higher player counts.
Our Dallas firewall setup is different from all our other locations, and we're working to bring it up to speed. Future firewall configurations will be modelled after Dallas 02, so this issue won't happen again.
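For illustration, an over-aggressive per-source rate limit of the kind described might look like the sketch below. The post doesn't name the firewall stack we use, so this is a hypothetical iptables example; the port and thresholds are made up.

```shell
# HYPOTHETICAL sketch, not our actual config. An over-tight per-source
# limit: drop UDP traffic from any source sending more than 50 packets/sec.
# A busy game server's voice traffic easily exceeds this, so voice packets
# get dropped intermittently -- perceived by players as mic lag.
iptables -A INPUT -p udp --dport 27015 \
  -m hashlimit --hashlimit-name voice \
  --hashlimit-mode srcip --hashlimit-above 50/sec \
  -j DROP

# The same rule sized for high player counts (thresholds still illustrative):
iptables -A INPUT -p udp --dport 27015 \
  -m hashlimit --hashlimit-name voice \
  --hashlimit-mode srcip --hashlimit-above 2000/sec \
  --hashlimit-burst 4000 \
  -j DROP
```

The point is that a limit which looks safe in low-population testing only shows its teeth once real traffic scales up, which is exactly why our internal tests missed it.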
Downtime (Upstream Provider Mistake)
Due to a technical bug in an upstream provider's ticketing system, the wrong machine was tagged for a hardware change. The issue was compounded by poor communication from the provider.
Once the issue was identified, the machine was taken down to correct the hardware change and then rebooted once more, totalling three periods of downtime on the node that day.
While this is frustrating, it was out of our control. We would have broken some wrists if we could. Our upstream provider is aware of the bug, and we are too, so in the future we'll compensate for any technical issues with clearer communication.
Steam Auth Issues (DDoS/Firewall Issue)
All our locations outside of Dallas use a universal, consistent firewall and DDoS mitigation model. Unfortunately, that model can't be deployed in Dallas, meaning we can't take direct advantage of years of tuning.
Because of this, Dallas was a little behind on DDoS mitigation, and today a small attack caused issues. In response, I re-modelled the Dallas firewall to better protect servers; it now effectively matches our other locations.
Unfortunately, a small oversight led to Valve authentication being blocked for a brief period. After updating the firewall, these connections are working as expected.
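As an illustration of the class of oversight involved: a DDoS-hardened ruleset that defaults to dropping unsolicited traffic can silently break outbound authentication if reply traffic isn't allowed back in. This is a hypothetical iptables sketch, not our actual ruleset.

```shell
# HYPOTHETICAL sketch. After hardening, the firewall defaults to dropping
# inbound traffic that no rule explicitly permits:
iptables -P INPUT DROP

# Without the rule below, replies from Valve's authentication servers to
# the game server's own outbound requests are dropped too, so Steam auth
# fails. Allowing established/related connection state restores them:
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
```

A default-drop posture is the right call against attack traffic; the gotcha is remembering every legitimate flow that now needs an explicit allowance.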
Dallas 02 has had more issues than normal because of the number of new things happening at once, from the firewall to the provider's systems.
The biggest issue was the firewall, because Dallas uses a different system from our other locations. Now that the Dallas firewall effectively has the same rules as our other locations, I don't expect further issues to originate from it. This was mostly a learning curve for us.
We will over-communicate with the upstream provider moving forwards to ensure there are no future mistakes. Additionally, they have reported the technical bug to their admins, who will hopefully fix it and prevent that exact issue from happening again.
I want to apologise to any customers on this machine; the recent quality isn't what we aim to provide. To end on a slightly more positive note, the lessons learnt from these teething issues should enable us to deliver the quality of service we expect in the future.
If you have any questions or concerns – please drop us a ticket!