Uptime Assurance through the Mobilization and Augmentation of Data Center IT Teams with Artificially Intelligent Applications

Ask professionals anywhere in the data center landscape what their top concern is within their facilities, and they’ll likely respond with one issue: downtime. Even with the most diligent teams and thorough operational protocols, there are too many factors to juggle to completely safeguard against outages, especially when human error is possible. Because downtime is so debilitating (reports show that in 2017 and 2018 the cost of a single hour of unplanned downtime could reach upwards of $5 million, not to mention the cost of a reputation damaged by unreliability), data center professionals are seeking new and innovative solutions to prevent costly outages.

When it comes to protecting data center systems against failure, many problems can be solved by careful monitoring and vigilance across day-to-day operations. However, without some way to augment a data center’s existing teams of technicians and engineers, it is not feasible to maintain such an extraordinarily high level of constant oversight. Fortunately, the development of newer technologies based on Artificial Intelligence (AI) will enable data centers to successfully combat issues like downtime, allowing them to meet uptime guarantees and prevent costly outages. In describing the top ten strategic technology trends for 2018, Gartner analysts predicted that AI would become a major industry player. AJ Byers, CEO of ROOT Data Center, a Montreal-based next-generation data center company, notes, “the ability to use AI to enhance decision making, reinvent business models and ecosystems, and remake the customer experience will drive the payoff for digital initiatives through 2025.”

ROOT investigated the prospect of using AI as the extended eyes and ears of its data center’s operational teams, adding layers of automated surveillance that can foresee, and potentially correct, issues. ROOT challenged itself to develop a plan for using AI sensors and machine learning to predict possible faults, eliminate human error, reduce downtime and drive efficiency across the data center.

How This Opportunity or Challenge was Met

ROOT developed a five-year strategy consisting of four related projects, the first of which launched in 2017 and ran through the end of 2018. This initial project focused on installing and deploying sensors within the generator platform of a 5MW data hall. The sensors collect data, and machine learning is applied to that data to establish a baseline operating level; the system can then alert data center personnel to operation outside the baseline indices. Since the AI persona was implemented, it has gone through over 3,000 training sessions with the generators, representing 250 hours of monitoring augmentation.
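The baseline-and-alert idea can be illustrated with a minimal Python sketch. This is not ROOT’s implementation, which the article does not describe: it swaps in a simple statistical threshold (mean plus or minus a multiple of the standard deviation) for the machine learning model, and the function names, the readings and the threshold `k` are all hypothetical.

```python
from statistics import mean, stdev

def fit_baseline(readings):
    """Summarize normal-operation readings as (mean, standard deviation)."""
    return mean(readings), stdev(readings)

def out_of_baseline(reading, baseline, k=3.0):
    """Return True when a reading lies more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(reading - mu) > k * sigma

# Hypothetical sensor readings captured during normal generator runs (assumed data).
normal_runs = [0.98, 1.02, 1.00, 1.01, 0.99, 1.03, 0.97, 1.00]
baseline = fit_baseline(normal_runs)

# New readings are checked against the baseline; outliers would trigger an alert to personnel.
new_readings = [1.01, 0.99, 1.45]
alerts = [r for r in new_readings if out_of_baseline(r, baseline)]
```

In this sketch only the last reading falls outside the band and would be flagged. A production system would presumably learn separate baselines for different operating conditions rather than rely on a single fixed threshold.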

In the subsequent phases and projects, ROOT established goals and outlined plans for augmenting data center trend analysis, enhancing AI data center controls and finally moving to an AI-first operator system with a human fail-safe. The plan is to expand the AI persona step by step into primary monitoring systems, where it will predict generator failure and allow for preventative maintenance. From there, it would be incorporated holistically, such that operators would no longer make decisions themselves but would only confirm the AI’s appraisal and decisions.

The AI system used 3,000 training sessions and 25,000 work units to expand the domain knowledge of the sensors, so that they sense, identify and learn about generator maintenance issues across a wide variety of operating conditions.

This project is the world’s first instance of using AI in a colocation data center to measure and reduce customer downtime. Alex, the name ROOT’s team gave the AI system, has become an effectively integrated part of data center operations.

Benefits of the Initiative

By developing and implementing the first stages of its five-year plan, ROOT successfully developed a cost-effective and innovative strategy to reduce the risk of human error and minimize downtime. Alex successfully maintained uptime for ROOT customers.

ROOT’s AI was able to overcome significant technical barriers, including varying ambient noise, individual noise signatures and a range of other occurrences that demand a dynamic approach and adjustment strategy for generator operations. Throughout this project, ROOT not only reduced the risk of downtime and achieved 100 percent uptime within its facility but also increased the operational efficiency of ongoing maintenance.

Overall, Alex set a precedent that has possible applications in other data centers throughout the world, benefitting the industry and its customers that depend on data center uptime. ROOT’s well-read white paper on this use case has resulted in other data centers following in the company’s footsteps, a clear indication of how AI can reduce the risk of downtime.

About the Author

AJ Byers is CEO of ROOT Data Center, a leading Montreal firm that specializes in next-generation colocation that goes beyond reliability and security. AJ leverages over 20 years of experience in the data center industry to support and promote business growth and transformation. Prior to joining ROOT, AJ served as President of Rogers Data Centers, where he was instrumental in leading the team in the development of one of Canada’s largest data center service companies.
