Reliability / Business Continuity

Reliability is about ensuring that your IT systems continue to perform when hit by any of the problems that could stop them.

Reliability Assessment

 

Often known as a Business Continuity Report, this need only be a few pages, but it is a critical document for any company. Bazaar Systems can help you produce one.

 

A Reliability Assessment should go through every system your company uses and answer the question: "What would the impact on the Business be if this system were offline for an hour, a day, a week, a month or permanently?"
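As a rough illustration, the question above can be kept as a simple table with one entry per system and one answer per outage duration. The sketch below is only one possible layout: the `SystemEntry` class, the system name and the impact wording are all invented for the example.

```python
from dataclasses import dataclass, field

# Outage durations taken from the assessment question.
DURATIONS = ["hour", "day", "week", "month", "permanent"]

@dataclass
class SystemEntry:
    name: str
    # Maps each outage duration to a short description of the business impact.
    impact: dict = field(default_factory=dict)

    def unanswered(self):
        """Return the durations that still have no impact recorded."""
        return [d for d in DURATIONS if d not in self.impact]

# Hypothetical entry: the system and its impacts are made up for illustration.
email = SystemEntry("email", {"hour": "minor delay", "day": "orders missed"})
print(email.unanswered())  # ['week', 'month', 'permanent']
```

An entry with unanswered durations is a sign that the Assessment for that system is not yet complete.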

 

You need to ask how you use the system: is it the system itself that is important to business operations (like an instant messaging system, for example), or is it the data held in the system that is important (like all the VAT data for the past year)?

 

Once you have answered questions like these, you can begin to assess the risk to the business presented by the commonest forms of IT system failure.

 

Finally, it is a simple question of assessing how much time, effort and money you are prepared to spend in order to mitigate each of these risks. Once you write that down, you have your Reliability Assessment.

 

Below are a couple of techniques that are in use in most companies - even those without a formal Assessment.

 

Fault Tolerance

Making a computer system fault tolerant means that certain types of failures won't stop the system working. Most IT systems can be made Fault Tolerant to a certain extent. The questions are: "What Faults?" and "How Tolerant?"

Inevitably, each company has to decide how cost-effective any proposed solution is. It's a finely balanced judgment to decide how much to spend to remove any given risk. However, here are a couple of the commonest IT system failures and the sort of things you can do about them. As always, Bazaar Systems will be more than happy to discuss these with you:

Power supplies
This is the part that takes the mains electricity supply and converts it into voltages the computer can use. Power Supplies (PSUs) get hot, and things that get hot generally have higher failure rates than things that don't. PSU failure is so common that most commercial-strength servers have two or more PSUs built in, so that if one fails the other can carry on. Most domestic PCs have only one PSU, and while you can often put a second PSU into the chassis, it won't automatically take over if the original PSU fails. This is a limitation of the motherboard, which needs special circuitry to support dual PSUs.
This should be in your Reliability Assessment: a dual-PSU server will continue to operate with a failed PSU, while a server with a Primary and Standby PSU will be offline for as long as it takes to unplug the Primary and plug in the Standby - maybe 30 minutes, assuming you have someone on-site who can do that job.
Of course, if your system only has one PSU it will be offline as long as it takes to find and fetch the replacement, and if your Server is more than a few years old you may find that a modern PSU won't fit your motherboard...
 
 
Disk Failure
Hard disk failure is probably the commonest form of computer failure, so there are a lot of solutions to this issue. For a simple PC, a disk failure usually has two impacts:
  • The system becomes unusable.
  • All the data on the disk will be lost.
Unlike a PSU failure, it is not sufficient to simply plug in a replacement disk: you have to load it with all the software and data that was on the failed disk, but you can't get that data from the failed disk because it's broken (see Backup and Recovery). Disk failure risks and their impact on your Business should be in your Reliability Assessment.
RAID systems provide a solution in some circumstances. RAID originally stood for Redundant Array of Inexpensive Disks, which simply meant that instead of using a single 100G disk, you used five 25G disks. Yes, five 25G disks add up to 125G: the extra 25G holds the redundant information that makes recovery possible. The idea is that if the data from the original 100G disk is spread around the five 25G disks in the right way, the system will cope with a failure of any one of the disks. So RAID arrays behave like dual PSUs: if a disk fails, you unplug it and plug in a replacement - the system carries on regardless. Because a single disk failure will not visibly affect the system, it is important to monitor the array and replace a failed disk immediately, as a second failure in the array will crash everything.
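One common way of spreading data "in the right way" is parity: four disks hold data and the fifth holds the XOR of the other four, so any single missing disk can be recomputed from the survivors. The sketch below illustrates the idea with tiny byte blocks. It is a simplification, not a description of any particular RAID product: real arrays work on much larger blocks and often rotate the parity between disks, and the block contents here are invented.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four "data disks", each holding one small block (illustrative values).
data = [b"ACCT", b"VAT1", b"CUST", b"MAIL"]
parity = xor_blocks(data)      # contents of the fifth disk
disks = data + [parity]

# Simulate losing disk 2, then rebuild it from the four survivors.
lost = 2
survivors = [blk for i, blk in enumerate(disks) if i != lost]
rebuilt = xor_blocks(survivors)
print(rebuilt == disks[lost])  # True
```

The same calculation cannot recover two missing blocks at once, which is why a second failure before the first disk is replaced is so serious.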

Finally, the most important thing about Fault Tolerant systems is their monitoring: it's all very well for a system to carry on working in the face of a disk failure, but if nobody knows it has failed and does something about it, you are back to square one.
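As a minimal sketch of what such monitoring might look like, the example below assumes a Linux server using software RAID, where the kernel reports array status as text in which each working disk is shown as `U` and each missing disk as `_`. The function name and the sample text are illustrative; hardware RAID controllers report their status differently and usually come with their own tools.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of arrays whose status flags show a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        name = re.match(r"^(md\d+)\s*:", line)
        if name:
            current = name.group(1)
        # Status flags look like [UU_U]: one character per disk in the array.
        flags = re.search(r"\[([U_]+)\]", line)
        if flags and current and "_" in flags.group(1):
            degraded.append(current)
    return degraded

# Sample status text, written by hand for this example.
SAMPLE = """\
md0 : active raid5 sdd1[3] sdc1[2] sda1[0]
      293049600 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
md1 : active raid1 sdb2[1] sda2[0]
      1048576 blocks [2/2] [UU]
"""
print(degraded_arrays(SAMPLE))  # ['md0']
```

On such a server the text would normally come from the file `/proc/mdstat`, and a non-empty result would be the trigger to alert whoever is responsible for replacing the disk.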

Backup and Recovery

A backup is a copy of the data on your system held somewhere else. It is an essential component of any company's IT systems. Bazaar Systems can help you decide the most appropriate form of backup and help set up a Backup Regime.

Backups are most often taken onto Magnetic Tape, or burnt onto CDs or DVDs. They can even be made onto another Server, which in turn may put the data onto Mag Tape or CD/DVD. There are wide variations as to how and when this is all done: some backup systems require that the server is taken offline while the backup is being made, while other systems can make backups while the system is running.

However, one thing is critically important: the Recovery cycle MUST be tested. Backups are complex, and it is all too easy to make mistakes in the Backup configuration and end up NOT having the data you thought you had on the tape. So recovery testing, usually onto a spare machine, is an essential part of a Backup and Recovery Regime.
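One simple way to check a test recovery is to compare every original file against its restored copy using checksums. The sketch below assumes both the original data and the restored data are reachable as ordinary directories from the machine running the check; the function names are illustrative and not part of any backup product.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or different after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(str(rel))
    return problems
```

An empty list means every original file was found in the restore with identical contents. Files that changed on the live system after the backup was taken will also be reported, so the comparison is most useful straight after a backup or against a frozen copy of the data.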

Assuming you are backing up onto removable media (Tape, CD/DVD), another important question that should be covered in your Reliability Assessment is what happens to the tapes: where are they kept? They should never be kept near to a CRT screen, for instance, as the degauss pulse when the screen starts up may erase the tape. Keeping the tapes in a desk drawer next to the server may be very convenient when changing tapes for the backup cycle, but if there is a fire that takes out your server, the chances are that all the backup tapes will also be destroyed. So a Backup Regime will usually suggest that some tapes are taken off-site to protect against fire risk.

How often Backups are taken is another thing that should be addressed in a Reliability Assessment, and depends on why the Backups are being taken. If you are simply protecting against a system failure, then it's a simple matter to assess how often the system data changes and adjust the Backup cycle to suit. But probably the major reason for Backups is to protect a company against the risk of operator errors. For example, a computer is instructed to delete all the old files on the wrong disk and ends up deleting all the Customer records prior to last week: something of an issue if your company has spent ten years accumulating that Customer data.

Finally, a note on Security. Backup tapes contain copies of data, such as ten years of customer records, that would be extremely useful to competitors: the sort of competitors that a disgruntled employee may be leaving to join. Keeping Backup tapes under lock and key, along with a record of what tapes there are so that missing tapes can be spotted, is a good idea.
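Spotting a missing tape comes down to comparing the record of tapes against what is actually in storage. A minimal sketch, with tape labels invented for the example:

```python
def missing_tapes(register, in_storage):
    """Return tape labels listed in the register but not found in storage."""
    return sorted(set(register) - set(in_storage))

# Hypothetical tape labels for illustration.
register = ["MON-01", "TUE-01", "WED-01", "OFFSITE-01"]
in_storage = ["MON-01", "WED-01", "OFFSITE-01"]
print(missing_tapes(register, in_storage))  # ['TUE-01']
```

The check is only as good as the register, so the register itself needs to be updated whenever a tape is added, retired or taken off-site.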

©2006 Bazaar Systems Limited