Reliability / Business Continuity
Reliability is about ensuring that your IT systems continue to perform when hit by any of the problems that could stop them.
Fault Tolerance
Making a computer system fault tolerant means that certain types of failures won't stop the system working. Most IT systems can be made Fault Tolerant to a certain extent. The questions are: "What Faults?" and "How Tolerant?"
Inevitably each company has to decide how cost-effective any proposed solution is. It's a finely balanced judgment to decide how much to spend to remove any given risk. However, here are a couple of the commonest IT system failures and the sort of things you can do about them. As always Bazaar Systems will be more than happy to discuss these with you:
- Power supplies
- This is the part that takes the mains electricity supply and converts it into voltages the computer can use. Power Supplies (PSUs) get hot, and things that get hot generally have higher failure rates than things that don't. PSU failure is so common that most commercial-strength servers have two or more PSUs built in, so that if one fails the other can carry on. Most domestic PCs only have one PSU, and while you can often put a second PSU into the chassis, it won't automatically take over if the original PSU fails. This is a limitation of the motherboard, which needs special circuitry to allow dual PSUs.
This should be in your Reliability Assessment: a dual-PSU server will continue to operate with a failed PSU, while a server with a Primary and Standby PSU will be offline for as long as it takes to unplug the Primary and plug in the Standby - maybe 30 minutes, assuming you have someone on-site who can do that job.
Of course, if your system only has one PSU it will be offline as long as it takes to find and fetch the replacement, and if your Server is more than a few years old you may find that a modern PSU won't fit your motherboard...
- Disk Failure
- The failure of Hard disks is probably the commonest form of computer failure. So there are a lot of solutions to this issue. For a simple PC a Disk failure usually has two impacts:
- The system becomes unusable
- All the data on the disk will be lost.
RAID systems provide a solution in some circumstances. RAID originally stood for Redundant Array of Inexpensive Disks, which simply meant that instead of using a single 100G disk, you used five 25G disks. Yes, five 25G disks add up to 125G: the extra 25G holds the redundant information. The idea is that if the data from the original 100G disk is spread around the five 25G disks in the right way, the system will cope with a failure of any one of the disks. So RAID arrays behave like dual-PSUs: if a disk fails you unplug it and plug in a replacement, and the system carries on regardless. Because the disk failure will not affect the running system, it is important to monitor the array and replace a failed disk immediately, as a second failure in the array will crash everything.
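One way of spreading data "in the right way" is to keep a parity block alongside the data, as parity-based RAID levels do. The Python sketch below is only an illustration of that idea, not a description of any particular RAID controller. The function names and the sample data are assumptions made for the example, and real arrays work on disk blocks rather than short byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def build_stripe(data_blocks):
    """Return the data blocks plus one parity block (one block per disk)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild_missing(stripe, failed_index):
    """Recompute the block held on the failed disk from the surviving disks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Four data blocks plus one parity block: five disks holding
# four disks' worth of usable data (hypothetical sample content).
data = [b"CUST", b"OMER", b"DATA", b"2024"]
stripe = build_stripe(data)

# Pretend the third disk has failed and rebuild its contents.
recovered = rebuild_missing(stripe, failed_index=2)
print(recovered)  # b'DATA'
```

The same calculation cannot recover two missing blocks at once, which is why a second failure before the first disk is replaced is so serious.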
Finally, the most important thing about Fault Tolerant systems is their monitoring: it's all very well for a system to carry on working in the face of a disk failure, but unless somebody knows the disk has failed and does something about it, you are back to square one.
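As a rough illustration of what such monitoring has to do, the sketch below turns a set of component status readings into an alert. Everything in it is hypothetical: the component names, the "ok" and "failed" status strings and the alert wording are assumptions, and in practice the readings would come from whatever tool your server or RAID controller vendor provides.

```python
def failed_components(status_by_component):
    """Return the names of components not reporting 'ok', in sorted order."""
    return sorted(
        name for name, status in status_by_component.items() if status != "ok"
    )

def build_alert(status_by_component):
    """Return an alert message, or None when every component is healthy."""
    failed = failed_components(status_by_component)
    if not failed:
        return None
    return "Redundancy lost, replace now: " + ", ".join(failed)

# Hypothetical readings: the system is still running, but one disk has failed.
readings = {"psu1": "ok", "psu2": "ok", "disk1": "ok", "disk3": "failed"}
print(build_alert(readings))  # Redundancy lost, replace now: disk3
```

The point of the sketch is that the alert is raised while the system is still working, which is exactly when a failure is easiest to overlook.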
Backup and Recovery
A backup is a copy of the data on your system held somewhere else. It is an essential component of any company's IT systems. Bazaar Systems can help you decide the most appropriate form of backup and help set up a Backup Regime.
Backups are most often taken onto Magnetic Tape, or burnt onto CDs or DVDs. They can even be made onto another Server, which in turn may put the data onto Mag Tape or CD/DVD. There are wide variations as to how and when this is all done: some backup systems require that the server is taken offline while the backup is being made, while other systems can make backups while the system is running.
However one thing is critically important: the Recovery cycle MUST be tested. Backups are complex, and it is all too easy to make mistakes in the Backup configuration and end up NOT having the data you thought you did on the tape. So recovery testing, usually onto a spare machine, is an essential part of a Backup and Recovery Regime.
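A recovery test can be partly automated by comparing the files restored onto the spare machine with the originals. The following Python sketch shows one possible check, assuming both sets of files are reachable as ordinary directories. The function names are made up for this example, and the comparison is only meaningful for data that has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or different in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        # A file is a problem if it was never restored or its contents differ.
        if (not restored_file.is_file()
                or file_digest(source_file) != file_digest(restored_file)):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `verify_restore` means every original file was found in the restored copy with identical contents. Anything else is a list of files to investigate before you rely on that backup.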
Assuming you are backing up onto removable media (Tape, CD/DVD), another important question that should be covered in your Reliability Assessment is what happens to the tapes: where are they kept? They should never be kept near to a CRT screen for instance, as the degauss pulse when the screen starts up may erase the tape. Keeping the tapes in a desk drawer next to the server may be very convenient when changing tapes for the backup cycle, but if there is a fire that takes out your server the chances are that all the backup tapes will also be destroyed. So a Backup regime will usually suggest that some tapes are taken off-site to protect against fire-risk.
How often Backups are taken is another thing that should be addressed in a Reliability Assessment, and depends on why the Backups are being taken. If you are simply protecting against a system failure, then it's a simple matter to assess how often the system data changes and adjust the Backup cycle to suit. But probably the major reason for Backups is to protect a company against the risk of operator errors. For example, an operator instructs the computer to delete all the old files on the wrong disk and ends up deleting all the Customer records prior to last week, which is something of an issue if your company has spent ten years accumulating that Customer data.
Finally a note on Security. Backup tapes will contain copies of data like ten years of customer records that are extremely useful to competitors, the sort of competitors that a disgruntled employee may be leaving to join. Keeping Backup tapes under lock and key, along with a record of what tapes there are, so that missing tapes can be spotted, is a good idea.
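The record of tapes is only useful if it is checked against what is actually in the cupboard. A minimal sketch of that check is shown below. The tape labels are invented for the example, and the register could just as well be a paper list or a spreadsheet.

```python
def audit_tapes(register, tapes_present):
    """Compare the tape register with the tapes actually found.

    Returns (missing, unregistered): tapes that are recorded but not found,
    and tapes that were found but never recorded.
    """
    recorded, found = set(register), set(tapes_present)
    return sorted(recorded - found), sorted(found - recorded)

# Hypothetical labels: one recorded tape is gone, one found tape is unrecorded.
register = ["MON-01", "TUE-01", "WED-01", "OFFSITE-01"]
in_cupboard = ["MON-01", "WED-01", "OFFSITE-01", "SPARE-07"]

missing, unregistered = audit_tapes(register, in_cupboard)
print(missing)       # ['TUE-01']
print(unregistered)  # ['SPARE-07']
```

A missing tape deserves an immediate question about where it went. An unregistered one suggests the record itself is not being kept up to date.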
