Server Monitoring – Don’t Panic!  We got your back….

Here at Bigwetfish Hosting we have been working hard behind the scenes to develop a comprehensive server monitoring system for the benefit of our clients.  We wanted to let you know what is available to you as a BWF Client and what we do 24/7 behind the scenes to ensure the smooth running and high performance of our servers.

Service Monitoring – Shared / Reseller Servers

Some call Nagios ‘The Industry Standard in IT Infrastructure Monitoring’, and this is the software we have chosen as the backbone of our server monitoring.  With Chrome and Firefox plugins as well as iOS and Android apps, it means that notifications of outages come through right away, no matter where our staff are.

Our helpdesk is manned 24/7/365 by our support staff, so it was the ideal place to send any notices.

The above picture shows an example of what our technicians see when they visit the web-based Nagios Monitor. In this example, Server 29 is in perfect health and there are no active alerts.

All our Shared and Reseller servers have been added to the Nagios Monitor, and if any of the following events occurs on any of them we get an alert instantly:

  • A degraded RAID array
  • Server load rises above a predefined level (level depends on CPU power)
  • Apache service fails
  • IMAP service fails
  • MySQL service fails
  • Ping returns packet loss
  • POP service fails
  • SMTP service fails
  • SSH service fails
  • Mail queue gets large, perhaps indicating spamming
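
To give a rough idea of what a service check in the list above involves, here is a minimal Python sketch. It is not our Nagios configuration: it simply tries to open a TCP connection to each service, and the port numbers are the conventional defaults, which we are assuming here purely for illustration. The names `SERVICE_PORTS`, `port_is_open` and `failed_services` are made up for this example.

```python
import socket

# Assumption: conventional default ports for the services named in the list.
# A real server may use different ports.
SERVICE_PORTS = {
    "Apache": 80,
    "IMAP": 143,
    "MySQL": 3306,
    "POP": 110,
    "SMTP": 25,
    "SSH": 22,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat the service as down.
        return False


def failed_services(host, ports=SERVICE_PORTS):
    """Return the names of the services whose port did not accept a connection."""
    return [name for name, port in ports.items() if not port_is_open(host, port)]
```

Calling `failed_services("server.example.com")` (a placeholder hostname) would return a list such as `["IMAP"]` if only the IMAP port refused connections. A successful connection only shows that something is listening; a fuller check would also talk to the service.
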
Firefox Plugin

Pictured is the Nagios Firefox Plugin on a BWF staff member’s MacBook. As you can see, most of the time it shows that there are no problems. If a problem occurs it changes to a Red Alert for critical issues or a Yellow Alert for non-critical issues, so we get instant notification of problems. All our technicians have this running in their browsers, and there is a similar plugin for Chrome as well.

This alert is instant, and an email is sent to our helpdesk where the technician on duty is instructed to check the server right away.  Sometimes it is simply a matter of stopping some processes to reduce load, and sometimes, in the case of a degraded RAID array, we need to get our data centre partners to ‘hot swap’ a disk.

RAID health monitoring is probably one of the most important aspects of our monitoring.  If a drive fails and its array becomes 'degraded', we are alerted immediately and can have a technician in our data centre 'hot swap' the drive and kick off a rebuild straight away.  This ensures that problematic disks are replaced the moment they throw errors, before they become a larger problem.
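
As an illustration of how a degraded array can be spotted, the sketch below assumes Linux software RAID (md), where `/proc/mdstat` shows a member map such as `[UU]` for a healthy mirror and `[U_]` when a disk is missing. That is an assumption made for this example only; hardware RAID controllers report their state through their own vendor tools. The `SAMPLE_MDSTAT` text is invented sample data and `degraded_arrays` is a hypothetical helper name.

```python
import re

# Invented sample in the style of /proc/mdstat: md0 is healthy, md1 has lost a member.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      488279488 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map contains a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        # The member map uses U for a working disk and _ for a missing one.
        status = re.search(r"\[([U_]+)\]", line)
        if status and current and "_" in status.group(1):
            degraded.append(current)
    return degraded
```

On a live server the same function could be fed the contents of `/proc/mdstat`; with the sample text above it reports `md1` as the degraded array.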

Management also have apps on their iPhones and receive the alerts as ‘Push Messages’ in iOS, so even outside office hours management sometimes know of events before a technician on the helpdesk calls them.

This is an example of a Red Alert from Nagios showing us that a service has failed on a server. This allows our techs to react immediately, restart the service and fix the issue. Most of the time our clients are not aware of issues, as we have fixed them before they notice.

Service Monitoring – VPS Nodes / VPS Servers / Dedicated Clients

We use Nagios to monitor the health of the RAID arrays on our VPS Nodes as well as monitor the server load.

Clients who have VPS servers with us will get their services monitored if they have bought the ‘Server Management and Backup’ add on from our website:  http://www.bigwetfish.co.uk/vps/server-management-options/

Clients with Dedicated Servers get the monitoring as standard on their servers.

Hardware Monitoring

The most important part of hardware monitoring is drive health, because if a drive or an array fails there can be data loss.  At BWF we monitor our drive and array health in a number of ways:

  • We use Nagios to monitor our RAID arrays for degraded disks, and we get instant alerts when there is a problem.
  • We use SMART checking on our drives to check general drive health. Specifically, we check for reallocated sectors, as these can indicate a drive that is about to go bad.  Where we see problems we act quickly to replace the drive, hopefully even before it fails.
SMART checking raw data

Here you will see examples of raw data from drive SMART checking. All the drives we use have this capability built in and we check all drives regularly
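
For readers curious how the reallocated sector count can be pulled out of raw SMART data, here is a small Python sketch. It assumes the attribute table layout printed by the smartmontools command `smartctl -A` for ATA drives, which is our assumption for this example and not a description of our exact tooling. The `SAMPLE_SMART` text is invented, and `reallocated_sectors` is a hypothetical helper name.

```python
# Invented sample in the style of the smartctl -A attribute table.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
"""


def reallocated_sectors(smart_text):
    """Return the raw Reallocated_Sector_Ct value, or None if the attribute is absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns: the name is the second, the raw value the tenth.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None


def drive_needs_attention(smart_text):
    """Flag a drive as soon as it reports any reallocated sectors."""
    count = reallocated_sectors(smart_text)
    return count is not None and count > 0
```

With the sample above, the drive is flagged because it reports 8 reallocated sectors. Flagging on any non-zero count is a deliberately cautious choice for the sketch; a real policy might allow a small number before replacing a drive.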

Seeing what is happening at a glance – BWFMonitor Graph Portal

We also have a fully featured graphing portal available to all VPS and Dedicated server clients, and our technicians have full access to the same graphs for our shared and reseller servers.  These graphs allow us to see ‘at a glance’ what is happening on a server.  Examples of what we have detected this way include:

  • A high rate of inbound traffic to a shared server caused the graph to spike.  Technicians were able to quickly stop a small inbound DDoS against a specific website on that server by blocking the IP ranges that were causing the issue.  Had we not seen this on our monitor, the problem would have lasted longer and more clients might have been affected.
  • Our Nagios monitor indicated a high load on a server and at the same time we noticed a spike in outbound traffic on a particular OpenVZ VPS Node.  The server owner was running compressed cPanel backups and the compression was causing the load issues.  We were able to work with the client and help him implement an rsync backup solution that required no compression, and the server load issues were resolved.

If you are a VPS or Dedicated Server client and do not have access to these graphs yet just open a ticket and we can get you access right away.

Here you will see a number of examples of graphs taken from our BWF Monitor from three servers.  All these graphs were taken around 4pm on Thursday 31 January 2013 and a brief explanation of each one will also follow.  There are lots more graphs available such as the number of logged in users, number of running processes etc and we can customize the graphs clients see on request.

Here you can see a Traffic Graph taken from the Server 14 Shared Server Page

This is an example of a 24 hour load graph from a shared server and it is useful in helping to track problems. The load spiking to 5 overnight is simply a result of the backup processes running on the server and is perfectly normal. The server shown has 8 CPU Cores so a load of 5 is acceptable
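
The reasoning behind "a load of 5 is acceptable on 8 cores" can be sketched as a load-per-core calculation. The Python below maps the result onto the Nagios state names OK, WARNING and CRITICAL. The thresholds of 1.0 and 2.0 per core are hypothetical values chosen for this example, not the levels we actually configure.

```python
def load_state(load, cpu_cores, warn_per_core=1.0, crit_per_core=2.0):
    """Classify a load average relative to the number of CPU cores.

    The per-core thresholds are illustrative assumptions.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    per_core = load / cpu_cores
    if per_core >= crit_per_core:
        return "CRITICAL"
    if per_core >= warn_per_core:
        return "WARNING"
    return "OK"
```

For the server in the graph, `load_state(5, 8)` works out to 5 / 8 = 0.625 per core, which is below the warning level. The same load of 5 on a 2 core server would be 2.5 per core and would be classed as critical, which is why the alert level depends on CPU power.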

An example of one of our most stable shared servers. Not much happens in terms of RAM usage, but this is a good sign as it means the server is working as normal.

This is a graph from a client's dedicated server.  We monitor dedicated servers belonging to clients as a matter of routine.  This server is actually in a different Data Centre and belongs to a client who simply pays us to manage the server.  Any server we remotely manage as part of our Third Party Remote Server Management Addon is treated just like a server belonging to us.  We monitor it in Nagios and in BWFMonitor just as if it were our own server.  This is one reason why a growing number of clients are renting unmanaged servers from another provider and paying us a monthly fee for management.

We trust you now see a little more of what goes on behind the scenes to make BWF your number one choice for shared, reseller, VPS or Dedicated server hosting.  We never take our clients for granted, and we wanted to give you a little flavour of what our technicians do on a daily basis to monitor the servers your websites are located on.  We really have seen an increase in our server uptime as a direct result of monitoring things so closely, as it allows us to get to 99% of small issues before they become large ones.

Whilst comprehensive server monitoring will never guarantee there will be no outages or downtime, we firmly believe our proactive monitoring of all critical services for our clients helps immensely in keeping things working as they should.

We also have the strong backing of our Hosting Partners (Hostdime) and their Data Centre techs from DIMEnoc to quickly react if we do experience any outages.  The backing of a global company will give any of our clients confidence in the quality of our hardware solutions.

Trust our growing team of experts to manage your hosting account professionally.

We would like to thank Praveen, one of our Red Hat Linux Certified Level 3 techs, for taking this on as his project and implementing this complete solution for us. You can find out a little more about Praveen on the ‘About Us’ page on our website.