How to prevent data centre disasters caused by human error

A university network brought to its knees when someone inadvertently plugged two network cables into the wrong hub. An employee injured after an ill-timed entry into a data centre. Overheated systems shut down after a worker changed a data centre thermostat setting from Fahrenheit to Celsius.

These are just a few of the data centre disasters that have been caused not by technological malfunctions or natural catastrophes, but by human error.

According to the Uptime Institute, a New York-based research and consulting organization that focuses on data-centre performance, human error causes roughly 70 per cent of the problems that plague data centres today. The group analyzed 4,500 data-centre incidents, including 400 full downtime events, says Julian Kudritzki, a vice president at the Uptime Institute, which recently published a set of guidelines for operational sustainability of data centres.

“I’m not surprised,” Kudritzki says of the findings. “The management of operations is your greatest vulnerability, but also is a significant opportunity to avoid downtime. The good news is people can be retrained.”

Whether it’s due to neglect, insufficient training, end-user interference, tight purse strings or simple mistakes, human error is unavoidable. And these days, thanks to the ever-increasing complexity of IT systems — and the related problem of increasingly overworked data centre staffers — even the mishaps that can be avoided often aren’t, says Charles King, an analyst at Pund-IT Inc.

“Whenever you mix high levels of complexity and overwork, the results are typically ugly,” says King. And as companies become more reliant on technology to achieve their business goals, those mistakes become more critical and more costly.

Wrong worker, wrong cable

Take the example of the university data centre switch that overloaded because an IT worker mistakenly plugged two network cables into a downstream hub. That happened about four years ago at the Indiana University School of Medicine in Indianapolis, according to Jeramy Jay Bowers, a security analyst at the school.

The problem arose out of less-than-optimal network design, says Bowers, who worked at the school as a system engineer at the time of the incident. The IT department for the school of medicine was split into two locations, with one room in the school of medicine building and another room at the neighboring university hospital — not an ideal setup to begin with, says Bowers.

The department had run fiber — a purple cable, to be exact — from a switch in the first building to the second, running it up through the ceiling, through a set of doors and across to the hospital’s administrative wing next door. That cable attached to a 12-port switch that sat in the hospital building’s IT room, and staffers could easily disconnect from the school of medicine network and connect to the hospital network through a jack in the wall, Bowers explains.

One day, Bowers had taken some personal time and was out for a jog when his iPhone rang: the switch in the school of medicine’s server room was overloaded, denying service to everything it hosted.

“The green lights go on and off when packets pass through,” he explained. “It had ramped up until the lights were more on than off.”

Bowers quickly began troubleshooting over the phone. He was able to determine that nothing on the school of medicine’s network had changed. Then he remembered that purple cable. He told his co-worker on the phone to unplug it, and activity on the switch went back to normal. Then he had his co-worker plug it back in and the switch overloaded again, proving that the problem was at the other end of the purple cable — in the university hospital building.

It turned out that an IT staffer who was normally based out of a satellite location came to the university hospital’s IT room to work on a project and needed extra connectivity. He inadvertently created a loop by plugging two network cables from the university switch into a hub he had added to the network so he could attach additional devices.

“So it kept trying to send data around in a circle, over and over,” says Bowers, which in turn caused the switch in the school of medicine building to overload.
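What Bowers describes is a classic layer-2 loop, better known as a broadcast storm: the switch floods every broadcast frame out all of its other ports, the hub’s two uplinks feed those frames straight back in, and because Ethernet frames carry no hop count the copies never die off. The toy Python simulation below is only a minimal sketch of that feedback, with made-up port numbers and a simplified tick-by-tick model rather than anything resembling the actual Indiana University network; in practice, features such as spanning tree exist precisely to block a redundant link before this can happen.

```python
# Toy model of a layer-2 loop ("broadcast storm"), for illustration only.
# A switch floods each broadcast frame out every port except the one it
# arrived on. Two switch ports cabled to the same hub form a loop: any copy
# sent out one looped port is repeated by the hub and re-enters on the other.
# Ethernet frames carry no hop count, so looped copies circulate forever,
# and each new broadcast adds to the circulating load.

SWITCH_PORTS = 12        # illustrative port count
LOOPED = (3, 4)          # two switch ports both cabled, by mistake, to one hub

def simulate(ticks):
    circulating = []     # ingress ports of frames re-entering via the loop
    for tick in range(1, ticks + 1):
        # One ordinary broadcast (say, an ARP request) arrives on port 0
        # each tick, on top of everything still going around the loop.
        arrivals = [0] + circulating
        flooded = 0
        next_circulating = []
        for ingress in arrivals:
            for egress in range(SWITCH_PORTS):
                if egress == ingress:
                    continue              # never flooded back out the ingress port
                flooded += 1
                if egress in LOOPED:
                    # The hub repeats the frame back in on the other looped port.
                    other = LOOPED[0] if egress == LOOPED[1] else LOOPED[1]
                    next_circulating.append(other)
        circulating = next_circulating
        print(f"tick {tick:2d}: {flooded:4d} frames flooded, "
              f"{len(circulating):3d} copies stuck in the loop")

simulate(8)
```

Even in this crude model the flooded-frame count climbs every tick and never comes back down, which is exactly the “lights more on than off” behaviour Bowers saw on the overloaded switch.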

Bowers says the network was cobbled together like that when he began working at the university, so he inherited the setup — which a better approach to network planning and design would have no doubt flagged as problematic. But at least now the IT department knows one scenario to avoid going forward: Jury-rigged cabling and traveling techies can be a bad mix.

“We didn’t do an official lessons learned [exercise] after this, it was just more of a ‘don’t do that again,’” says Bowers. However, this event, combined with another incident where a user unwittingly established a rogue wireless access point on the school of medicine’s network and overloaded the switch, has convinced Bowers of one thing: “I hold to the concept that human errors account for more problems than technical errors,” he says.

Save $35, lose all your data

Many data centre mishaps can be traced, directly or indirectly, to employers’ attempts to save money. In this case, it was all about saving $35 on a backup tape.

In 1999, Charles Barber worked as technical support manager at a health-instrument company (one that no longer exists) that made stand-alone, server-based equipment that connected to treadmills to collect data from patient stress tests. One of the company’s customers was a small medical practice in St. Louis where the administrative assistant also served as the IT person.

“She was pretty competent” — but not a trained IT professional, says Barber.

One Friday evening, she heard strange noises coming from the equipment’s server and realized that the hard drive had failed. That Saturday she purchased a new hard drive, installed it and reloaded Microsoft’s Windows Server and SQL Server, since she had saved the discs and documentation. Barber had provided her with written instructions on how to configure the server in case such a thing ever happened, and she followed them successfully. (“I’ve had field engineers call me to get help with these things,” says Barber, but this woman managed it on her own.)

She then spent Sunday and most of Monday restoring the data and testing the system before allowing a live stress test on a patient later that day, and the test appeared to go well.

But on Tuesday morning she called Barber to say all the information that she had restored on the server from the backup tape was gone.

“This is a person who does a full backup of the entire system every day,” Barber explains. “Unfortunately, when she went to reinstall her backup, all she saw was the results of the test patient from Monday.”

Because she had only one backup tape, she used it to back up Monday’s test results, forgetting that the tape also held all of the historical data from the server, which was overwritten in the process.
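The deeper problem was a backup scheme with no rotation: with a single tape, every full backup necessarily overwrites the only existing copy of the history. The sketch below shows the alternative, rotating daily full backups round-robin across a small pool of tapes; the labels, dates and pool sizes are illustrative assumptions, not the clinic’s actual setup.

```python
# Minimal sketch of backup-tape rotation. With a single tape, each full
# backup overwrites the only copy in existence; with a small round-robin
# pool, the most recent backups survive any one bad overwrite.
from datetime import date, timedelta

def last_backups(pool_size, days):
    """Simulate daily full backups onto a round-robin pool of tapes and
    report which backup each tape holds at the end."""
    tapes = {}                                     # tape label -> date it holds
    start = date(1999, 1, 1)                       # illustrative start date
    for day in range(days):
        label = f"TAPE-{day % pool_size}"          # round-robin tape choice
        tapes[label] = start + timedelta(days=day)  # overwrites that tape
    return tapes

for pool in (1, 3):
    held = ", ".join(f"{t} -> {d.isoformat()}"
                     for t, d in sorted(last_backups(pool, days=90).items()))
    print(f"{pool} tape(s): {held}")
```

With even one extra tape in the pool, the previous full backup would still have existed somewhere after Monday’s overwrite.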

“The tapes cost $35. If only her employer had authorized her to buy a second one. Instead, they lost three months of data,” says Barber. “I was choking for about thirty seconds when I realized what had happened; here was someone who was totally competent, but her bosses wouldn’t spend the $35 for an extra backup tape.”

Physical plant, physical fall

Sometimes there are accidents literally waiting to happen in data centres, but the people who work there every day are oblivious to hazards that fresh eyes would recognize right away.

Ed Gould, a retired IT professional, worked for a securities firm (which he’d rather not name) as a systems programmer in Chicago during the mid-1980s. He was a month into the job when he discovered a data-centre danger hidden in plain sight.

At this company, programmers typically would hand their tapes off to data centre operators, who then mounted them in the computer room. One day the operators were too busy to mount the tapes Gould had for them, so he decided to mount them himself. He was only a few steps into the data centre when he stepped into a hole in the floor that was roughly two-and-a-half feet deep and the size of a pizza. (The data centre, located on the seventh story of the building, had been built on a raised floor.)

“My foot just went through,” he remembers. “I felt some pain and started cussing. Someone had to come over to help me out.”

He asked the operators why there was a hole in the middle of the floor, in a high-traffic area no less. The operators answered that they were used to it, since it had been there for two years, and they simply maneuvered around it. Gould then asked the shift supervisor, who told him he wasn’t supposed to be in the data centre in the first place, and that the operators knew enough not to fall into the hole.

After escalating the situation to a vice president, who told him he was the first to report the hole in the floor, and after a trip to the hospital to have his injuries examined, Gould was reimbursed for his emergency room bills and his torn suit pants, and the floor was fixed within a couple of days.

He eventually found out that the hole had been cut in the floor to accommodate cabling for a tape drive system that had already been relocated when Gould took his tumble.

What surprised Gould more than the fact that such a hazard was literally in the middle of the data centre floor was the way the other IT workers reacted to it. “I think I was more stunned at the operators who just went around it,” he says.

Mistakes around the globe

If you need any more evidence that humans can wreak havoc on the data centre, look no further than Computerworld’s own Shark Tank column, where over the years IT managers have offered up hundreds of tales of woe.

During the mid-1980s, an Air Force base in Arizona had to install new cabling throughout the facility, remembers John Eyre, an Air Force engineer at the time. The new cable was needed for an installation of Wang minicomputers — each computer required coaxial cables to connect to terminals, and the vendor had recommended a two-inch conduit to pull the cables into place.

Eyre didn’t think the conduit was wide enough to accommodate the cable, but since Wang had recommended it and the project was running behind schedule, his superiors went ahead with it anyway, he says.

Once all the cable was laid, management discovered that every run pulled through the conduit had been nicked and was unusable, says Eyre. The entire installation had to be redone with a wider conduit, which delayed the rollout by nine months.
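Whether a bundle of cable can safely be pulled through a conduit comes down to a simple area calculation: common electrical-code practice (the oft-cited NEC figure) caps the combined cable cross-section at roughly 40 per cent of the conduit’s cross-section when three or more cables share it, and a tighter fit tends to bind and abrade the jackets during the pull. The back-of-the-envelope check below uses a hypothetical cable count and a guessed coax diameter, since the article gives neither; it simply shows how quickly a two-inch conduit runs out of room.

```python
# Back-of-the-envelope conduit-fill check. Code practice (e.g., the NEC)
# limits combined cable cross-section to about 40% of the conduit's
# cross-section when pulling three or more cables. The diameter and cable
# counts below are hypothetical, for illustration only.
import math

def fill_percent(conduit_id_in, cable_od_in, cable_count):
    """Percentage of the conduit cross-section occupied by the cables."""
    conduit_area = math.pi * (conduit_id_in / 2) ** 2
    cable_area = cable_count * math.pi * (cable_od_in / 2) ** 2
    return 100.0 * cable_area / conduit_area

CONDUIT_ID = 2.0   # inches, nominal two-inch conduit
COAX_OD = 0.25     # inches, assumed outside diameter of one coax run

for n in (10, 20, 26, 40):
    pct = fill_percent(CONDUIT_ID, COAX_OD, n)
    verdict = "over the ~40% limit" if pct > 40 else "within the limit"
    print(f"{n:3d} cables -> {pct:5.1f}% fill ({verdict})")
```

Run against the real cable specs, a check like this is how Eyre’s hunch could have been settled before the pull rather than nine months after it.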

Lesson learned? “When you’re in a hurry to meet a deadline and get a feather in your cap, it only ends up causing problems,” Eyre says.

Other favorite tales of human error from Shark Tank contributors:

  • A jet of Freon shoots out of a disconnected air conditioning line in the middle of a data centre, spraying rows of rack-mounted servers (“with a frantic tech trying to stem the flow with his bare hands,” says the storyteller), resulting in a building evacuation.

  • A university lab testing speech perception in quails (yes, the small ground birds) is forced to close temporarily after a homegrown backup program that hadn’t been beta-tested brought down systems for two weeks and wiped out five months of data.
  • A server room suffers 100-degree-plus conditions, even though the thermostat was set to 64 degrees. The problem — someone changed the setting from Fahrenheit to Celsius. The result? Melted drives.
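The last of those comes down to simple conversion arithmetic: a setpoint of 64 read as Celsius is about 147 degrees Fahrenheit, so the cooling system presumably saw nothing to correct until the room was far past safe. A quick check with the standard formulas:

```python
# The same "64" setpoint in the two temperature scales.
def c_to_f(c):
    return c * 9 / 5 + 32

def f_to_c(f):
    return (f - 32) * 5 / 9

print(f"64 F = {f_to_c(64):.1f} C   (the intended chilly server room)")
print(f"64 C = {c_to_f(64):.1f} F   (what the thermostat ended up holding)")
```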

Minimizing data centre mistakes

So when it comes to data centre disasters, what’s more dangerous — the systems or the people who configure and maintain those systems?

“I think the answer is both,” says the Uptime Institute’s Kudritzki. “If you have a well-kept, well-run data centre, your equipment will run at the highest level. If you have a poorly maintained data centre, you’re going to see problems.”

Part of having a well-run data centre is paying attention to the humans who run or otherwise interact with those systems, adds Kudritzki. Managers who take the time to make careful decisions regarding staffing levels, training, maintenance and the overall rigor of the operation are those most likely to avoid disaster and achieve maximum uptime.

Management shouldn’t take a quick-fix approach to addressing the human factors that contribute to data-centre downtime, warns Pund-IT’s King. Effective personnel management requires well-thought-out strategies.

“Addressing any of these [human issues] requires systemic strategies and solutions, but training programs are often narrow and task-oriented,” he says.

“There’s also a certain irony here in that while most staff members understand the systemic nature of the technologies they work with, fewer recognize that data centres themselves are highly complex, interconnected systems,” King says. “Training programs and exercises that emphasize a holistic approach to data centre management could help address that problem.”

On July 1, 2010, the Uptime Institute released a new set of specifications designed to help data centres improve uptime by outlining operational issues, including the human element.

Called the Data Centre Site Tier Standard: Operational Sustainability, the guidelines address, among other things, how the behaviors and risks of a data centre’s management team can impact long-term performance.

If not managed properly, even the most advanced data centre will suffer downtime, says Kudritzki.

The guidelines address four aspects that management should pay attention to in order to get the most uptime out of their data centres. These include staffing — and that means not just enough people, but enough qualified people to uphold the performance goals of the data centre. For example, at advanced (Tier 4) data centres, the Uptime Institute recommends at least two full-time operations staffers be on-site 24/7.

Management must also make the right decisions regarding all aspects of maintenance, including preventative maintenance, everyday housekeeping and life-cycle-related tasks.

Training, too, is essential: Employees who are able to react to unplanned events can help avert downtime, according to the standard, which recommends on-site, on-the-job training, offsite vendor training and formal certification.

As for overall planning, coordination and management of the data centre, the standard recommends that managers establish site policies and financial management policies, make use of space, power and cooling-capacity management tools, and maintain a site infrastructure reference library, following a framework such as ITIL.

Garretson is a freelance writer in the Washington, D.C., area. She can be reached at [email protected].
