The Infrastructure behind the Outputs: Cloud and HPC Unlock the Power of AI

The remarkable results GPT-4 and ChatGPT can produce have captured headlines and the minds of business leaders alike. Companies are constantly searching for better products, services and internal processes through artificial intelligence (AI), but they need to keep in mind that uses of these technologies must be tailored to their end goals. Whether the use case is wind tunnel simulation, electronic design validation, customized chatbots, “digital twin” complex system simulation or something else, AI has fired imaginations across industries. However, while outputs are currently garnering the most attention, the underlying technologies of cloud, high performance computing (HPC), automation and machine learning (ML) are also surging.

The Impact of the Cloud

Leading organizations have leveraged HPC and AI for decades, using specialized CPU- and GPU-based compute clusters with low-latency network and storage infrastructure. More recently, though, organizations have turned to the cloud, as public cloud vendors have made the infrastructure investments and core technological advances necessary to meet the increased performance demands.

Unlike prior models in which users’ access to compute was governed by job schedulers and on-premises capacity, the cloud-based model allows for nearly instant “no waiting” access to compute where users can work with a cluster that precisely meets the needs of their application. Elements such as high core-count CPUs, large memory footprint nodes and access to bare metal have closed the gap between the capabilities of cloud and those of customized on-premises systems.

However, the key to cloud success with HPC/AI will be access to software and relevant expertise tied to elastic cloud resources that can transform base infrastructure from the major public cloud providers into truly high-performing configurations. In a cloud-based model, each group can have clusters with different configurations and combinations of CPU, GPU, memory, and storage—even specialty processors available only in specific public clouds.

Leveraging the Latest Cloud Innovations

As new technology becomes available in the cloud, researchers and data scientists will benefit from rapid access to the latest advances in performance and capabilities. In the end, business acceleration is about driving better outcomes at lower costs, and cloud-based HPC/AI has emerged as a capability that CIOs can use to spotlight IT as a function where innovation takes place and efficiencies are achieved.

With the right software and services support, the capabilities that have traditionally only been available to the largest organizations can now be rapidly leveraged by innovative enterprises of all sizes on “pay as you go” models that can closely link investments in computing with demonstrated ROI.

To meet these objectives, CIOs are looking to align with cloud services partners that have expertise in both compute infrastructure and usage discount models for various CPU and GPU instance types within the public clouds. This is where digging into the underlying technologies can be so critical, as cost savings associated with seemingly minor infrastructure changes can be significant—turning “good” ROI into “maximum” ROI.
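The effect of a usage discount on ROI is simple arithmetic, and a minimal Python sketch can make it concrete. Every figure below (node count, hourly rate, discount, business value) is a hypothetical placeholder rather than a real provider price, and the function names are illustrative only.

```python
def job_cost(node_hours: float, hourly_rate: float, discount: float = 0.0) -> float:
    """Return the cost of a job after applying a fractional usage discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return node_hours * hourly_rate * (1.0 - discount)


def roi(value_delivered: float, cost: float) -> float:
    """Return ROI as (value - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_delivered - cost) / cost


# Hypothetical workload: 64 nodes for 10 hours at an assumed $3.00 per node-hour.
NODE_HOURS = 64 * 10
ON_DEMAND = job_cost(NODE_HOURS, hourly_rate=3.00)
DISCOUNTED = job_cost(NODE_HOURS, hourly_rate=3.00, discount=0.30)  # assumed 30% discount
VALUE = 5000.00  # assumed business value of the result, for illustration only

print(f"on-demand:  ${ON_DEMAND:,.2f}  ROI {roi(VALUE, ON_DEMAND):.0%}")
print(f"discounted: ${DISCOUNTED:,.2f}  ROI {roi(VALUE, DISCOUNTED):.0%}")
```

Because the same job delivers the same value either way, any reduction in the effective hourly rate flows straight into ROI, which is why the choice of instance type and discount model deserves attention.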

For example, one of the major public cloud providers has recently introduced a highly tuned, cluster-oriented HPC configuration based on nodes with the latest high core-count CPUs, extensive memory and specialty high-speed network interconnects, at extremely attractive prices for users performing large-scale compute jobs. For the right workload types, identifying and leveraging these types of pre-optimized configurations can be a game changer.

Optimizing Infrastructure and Deployment ROI

While the outputs of AI are changing the game across industries, they are the result of the calculations of thousands of processors. In the end, the value of AI is only as good as the breadth of its training data and the speed with which it delivers answers to users. The resources required to train large-scale models and those required to subsequently produce results (known as “inference”) can be dramatically different.

When initiating the AI development process, organizations should concurrently be considering both their needs for training and inferencing. Typically, training is done on a cluster-oriented basis with numerous powerful, interconnected GPU-based nodes working collectively to create a highly tuned model. Performing inference—and delivering the value of the model for users—is usually done by large banks of less powerful inference nodes working independently to service individual requests.

Cloud-based deployment environments offer the potential for users to easily create and test both training and inference configurations based on a variety of CPUs and GPUs for their specific workloads. While GPUs are frequently the right choice for performing large-scale training, the most recent generation of CPUs includes embedded “GPU-like” capabilities that can make them excellent options for inference workloads from both a performance and a cost/ROI perspective. Additionally, as new generations of processors are introduced in the future, the on-demand nature of the cloud makes it possible to rapidly evaluate and pivot to new technologies in a way that is simply not possible with dedicated, on-premises environments.
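One way to compare CPU and GPU options for inference on both performance and cost is to convert each candidate's hourly price and measured throughput into a cost per million requests. The sketch below assumes that approach. The instance names, prices and throughput figures are invented for illustration and would be replaced by real benchmark results for a specific workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceOption:
    name: str                   # hypothetical label, not a real instance type
    hourly_price_usd: float     # assumed price per instance-hour
    requests_per_second: float  # throughput measured for the target workload


def cost_per_million_requests(option: InstanceOption) -> float:
    """Convert an hourly price and sustained throughput into cost per 1M requests."""
    requests_per_hour = option.requests_per_second * 3600
    return option.hourly_price_usd / requests_per_hour * 1_000_000


def cheapest(options: list[InstanceOption]) -> InstanceOption:
    """Return the option with the lowest cost per million requests."""
    return min(options, key=cost_per_million_requests)


# Placeholder candidates; every number here is an assumption.
CANDIDATES = [
    InstanceOption("gpu-node", hourly_price_usd=4.00, requests_per_second=500),
    InstanceOption("cpu-node", hourly_price_usd=1.00, requests_per_second=150),
]

for candidate in CANDIDATES:
    print(f"{candidate.name}: ${cost_per_million_requests(candidate):.2f} per 1M requests")
print("lowest cost:", cheapest(CANDIDATES).name)
```

With these made-up numbers the slower CPU node is still cheaper per request, but the result depends entirely on the measured throughput, so the comparison is worth rerunning whenever a new processor generation becomes available.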

Conclusion

Artificial intelligence has spurred innovation across industries, with its remarkable outputs squarely in the spotlight. However, the underlying technologies like cloud computing, HPC, automation and machine learning play a pivotal role in this revolution. The shift to cloud-based infrastructure marks a significant milestone, making AI more accessible and scalable. As leading organizations continue to embrace HPC and AI, the cloud’s technological advances—coupled with improved data modeling and management—propel industries toward a future of boundless AI potential, laying the foundation for the next wave of innovations.

About the Author

Phil Pokorny serves as the Chief Technology Officer (CTO) for Intelligent Platform Solutions and is responsible for all aspects of leading-edge technology for the company. He reports to Dave Laurello, President of Intelligent Platform Solutions. Mr. Pokorny joined Penguin Computing in February of 2001 as an engineer, and steadily progressed through the organization, taking on more responsibility and influencing the direction of key technology and design decisions. He brings a wealth of engineering experience and customer insight to the design, development and support of Penguin Solutions and Stratus products.

Prior to joining Penguin Computing, he spent 14 years in various engineering and system administration roles with Cummins, Inc. and Cummins Electronics. At Cummins, Pokorny participated in the development of internal network standards, deployed and managed a multisite network of multiprotocol routers and supported a diverse mix of office and engineering workers with a variety of server and desktop operating systems. He has contributed code to Open-Source projects, including the Linux kernel, lm_sensors and LCDproc.

Mr. Pokorny graduated from Rose-Hulman Institute of Technology with Bachelor of Science degrees in math and electrical engineering, with a second major in computer science.
