Server Management in Action – Fully Managed VPS Servers
It’s the holiday weekend here in the UK and Ireland, and whilst many of you were off enjoying the long weekend, our support and monitoring teams were busy beavering away in the background. Here is a monitoring event from Sunday 30th August 2015 to give you some insight into how our Server Management and Pro Active Monitoring work in practice, and how our team reacts to a server monitoring event.
Each event is unique, so our staff are trained to think on their feet and put themselves in the shoes of the client. We try to react as we feel the client would want us to. This was of particular importance in this instance, as we made a decision to reboot a client server without being able to contact the client. This caused 5-10 minutes of downtime for a very busy website.
Have a read and let us know if you would have done anything differently. We’re genuinely interested to know, as we strive to improve our service.
The client is a web design studio and they have a Managed VPS Server from us. The server hosts a number of websites from their portfolio. One such website was for a large one-day festival that kicked off at 1pm on Sunday 30th August 2015 in a large city with a population of over 500,000.
Everything was great until 11.29am, when our server monitoring system alerted us that the load on the server had suddenly spiked to over 100 and Apache concurrent connections were maxing out. Essentially the server was struggling under the number of hits. As the load increased, the rate at which it could serve visitors fell (diminishing returns), and the server was in danger of entering a downward spiral resulting in a crash.
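To give a feel for what an alert like this is based on, here is a minimal Python sketch of a load-average check. It is an illustration only and not our actual monitoring system: the function names, the `critical_ratio` threshold and the 8-core server are all assumptions made for the example. The load figures are the ones quoted in the log further down.

```python
import os


def classify_load(load_1m, load_5m, load_15m, cpu_count, critical_ratio=4.0):
    """Classify a load-average triple as 'ok', 'rising' or 'critical'.

    critical_ratio is a hypothetical threshold: the 1-minute load per
    CPU core at or above which a human should be alerted.
    """
    per_core = load_1m / cpu_count
    if per_core >= critical_ratio:
        return "critical"
    # 1m > 5m > 15m means the load is still climbing.
    if load_1m > load_5m > load_15m and per_core >= 1.0:
        return "rising"
    return "ok"


def classify_this_machine():
    """Classify the load of the machine this runs on (Unix only)."""
    load_1m, load_5m, load_15m = os.getloadavg()
    return classify_load(load_1m, load_5m, load_15m, os.cpu_count() or 1)


# Figures quoted in the log, on an assumed 8-core server.
print(classify_load(101.38, 89.66, 75.96, cpu_count=8))  # critical
print(classify_load(3.41, 3.82, 2.84, cpu_count=8))      # ok
```

Dividing by the core count matters because a load of 4 is a busy single-core VPS but a quiet 16-core machine. A load above 100 is critical under any realistic core count for a VPS.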
What we did to ensure Best Effort Management
Here’s a log of what our team did with this server. We hope it shows that our monitoring is pro-active and that we always have your best interests at heart.
|11.29am||Our monitoring system alerted our team that the load on the server was spiralling out of control. Load average was rising: 101.38, 89.66, 75.96. Our Team checked the server and saw hundreds of concurrent Apache connections to one website. We checked this website and it was for a cultural festival starting at 1pm.
The IPs used to access the website were checked to see if it was ‘normal’ traffic and not a DDoS attack. All traffic was confirmed to be from legitimate places and 4 examples are shown here:
2.xxx.187.xxx Sky Broadband
We then did a quick search of the website’s Twitter feed and hashtag and discovered that the local tourist board, city council, radio station and local BBC regional news had all tweeted a link to the website within a short period of time.|
|11.44am||We opened a ticket to alert the client to the issue. The load was high, but the server was still serving all connections in what we felt was a reasonable length of time, so at this point we opted to monitor the server rather than add resources and cause downtime by rebooting without permission from the client.|
|12.02pm||This traffic was generated by a series of badly timed tweets from accounts with many followers, so we knew the traffic would slow with time.
We told the client we could move the website to an isolated server for them, but as the TTL value on the DNS was 4 hours (so some visitors could keep reaching the old server for up to 4 hours after a change) and the festival was due to end at 6pm, we decided against that. We did leave that option open to the client and we promised to continue to monitor. At this point the website was still serving all requests, albeit more slowly than would be ideal.|
|12.21pm||RAM usage on the server was now the issue. It was starting to creep up, and a lot of RAM was being used for cache, which would slow the server down. We alerted the client to this and asked if we could add more RAM. We cleared the cached RAM, but this is a very temporary fix as it will be used up again.
As it was 40 minutes to the start of the festival, we chose not to reboot without the client’s permission, as the server was still up and serving all traffic. We felt that rebooting so close to the start of the event would be a mistake.|
|12.30pm||Unfortunately RAM was up to 96% used and the server was in danger of crashing. Usually we will not reboot a server without permission from a client but in this instance we could not contact the client. The client had not come back to us so we took the decision it was in the best interests of the client to increase the server resources and reboot the server despite it being only 30 minutes to festival start time.|
|12.45pm||The server came back on line and the extra resources did the trick. Load was reduced significantly and page load time was snappy. Load average: 3.41, 3.82, 2.84|
|3.30pm||The server has remained stable and we will reach out to the client on Tuesday after the Bank Holiday to see if they want to keep the extra resources or scale back again. Thankfully we were able to keep the server online apart from a short reboot to increase resources.|
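The 12.02pm decision not to move the website comes down to simple arithmetic on the DNS TTL. The short Python sketch below works it through using the times from the log. It assumes, for illustration, that the DNS record could have been changed the moment the decision was made, and it looks at the worst case only: a resolver that cached the old record just before the change and keeps it for the full TTL. The function name is ours.

```python
from datetime import datetime, timedelta


def time_left_after_ttl(change_time, ttl, event_end):
    """Time remaining in the event once every resolver has the new record.

    Worst case: a resolver cached the old record just before the change
    and holds it for the full TTL.
    """
    worst_case_switch = change_time + ttl
    return max(event_end - worst_case_switch, timedelta(0))


decision = datetime(2015, 8, 30, 12, 2)      # 12.02pm log entry
ttl = timedelta(hours=4)                     # TTL on the DNS record
festival_end = datetime(2015, 8, 30, 18, 0)  # festival due to end at 6pm

remaining = time_left_after_ttl(decision, ttl, festival_end)
print(remaining)  # 1:58:00
```

In the worst case the last visitors would only have reached the isolated server at 4.02pm, with under two hours of the festival left. Many resolvers would have switched sooner, but the move could not be relied on to help during the busiest period.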
Of course, we prefer to get some notice of time-critical events that are likely to generate high traffic, so we can isolate sites before they get busy, but we know this is not always possible. In this instance, once we received notification of the load, we reacted to keep the server online for the benefit of the fully managed client.
If you have servers hosted elsewhere and you want to avail of our monitoring service, why not talk to our sales team today about moving over to us?
● 14 day initial trial to test our service with no obligation
● Free website migrations from your current provider
● Powerful servers with a 2 hour hardware replacement SLA
● Spare servers always on line in the event of issues
● Arbor Networks DDoS Protection as standard
● RAID10 for additional data protection
● R1Soft Enterprise Managed Backups
● Remote Backup Space in a different physical location to which you get full access
● Best Effort Management
● 24/7 Server Monitoring
● Accredited support team with access to a technician 24/7 via live chat
● Priority Support SLA for helpdesk tickets
We also build custom clusters to order, so if you need something a little more complex, feel free to speak to our Engineering Team.
An example is here: Server Setup for the Strategic Investment Board