Storage technology explained: AI and data storage

In this guide, we examine the data storage needs of artificial intelligence, the demands it places on data storage, the suitability of cloud and object storage for AI, and key AI storage products

By Antony Adshead, Storage Editor

Published: 11 Apr 2024

Artificial intelligence (AI) and machine learning (ML) promise a step change in the automation fundamental to IT, with applications ranging from simple chatbots to almost unthinkable levels of complexity, content generation and control.

Storage is a key part of AI. It supplies data for training, holds the potentially huge volumes of data generated, and serves data during inference, when the results of AI are applied to real-world workloads.

In this article, we look at the key characteristics of AI workloads, their storage input/output (I/O) profile, the types of storage suited to AI, the suitability of cloud and object storage for AI, and storage supplier strategy and products for AI.

What are the key features of AI workloads?

AI and ML are based on training an algorithm to detect patterns in data, gain insight into data and often to trigger responses based on those findings. Those could be very simple recommendations based on sales data, such as the “people who bought this also bought” type of recommendation. Or they could be the kind of complex content we see from large language models (LLMs) in generative AI (GenAI), which are trained on vast and multiple datasets to allow them to create convincing text, images and video.

AI workloads have three key phases or deployment types:

Training, in which recognition is worked into the algorithm from the AI model dataset, with varying degrees of human supervision;
Inference, during which the patterns identified in the training phase are put to work in standalone AI deployments; and/or
Deployment of AI to an application or set of applications.

Where and how AI and ML workloads are trained and run can vary significantly. On the one hand, they can be batch or one-off training and inference runs that resemble high-performance computing (HPC) processing on specific datasets in science and research environments. On the other hand, AI, once trained, can be applied to continuous application workloads, such as the types of sales and marketing operations described above.

The types of data in training and operational datasets could vary from a great many small files in, for example, sensor readings in internet of things (IoT) workloads, to very large objects such as image and movie files or discrete batches of scientific data. File size upon ingestion also depends on AI frameworks in use (see below).

Datasets could also form part of primary or secondary data storage, such as sales records or data held in backups, which is increasingly seen as a valuable source of corporate information.

What are the I/O characteristics of AI workloads?

Training and inferencing in AI workloads usually require massively parallel processing, using graphics processing units (GPUs) or similar hardware that offload processing from central processing units (CPUs).

Processing performance needs to be exceptional to handle AI training and inference in a reasonable timeframe and with as many iterations as possible to maximise quality.

Infrastructure may also need to scale massively to handle very large training datasets and the outputs of training and inference. It requires fast I/O between storage and processing, and may also need to move data between locations so that processing happens where it is most efficient.

Data is likely to be unstructured and in large volumes, rather than structured and in databases.

What kind of storage do AI workloads need?

As we’ve seen, massive parallel processing using GPUs is the core of AI infrastructure. So, in short, the task of storage is to supply those GPUs with data as quickly as possible to ensure these very costly hardware items are used optimally.

More often than not, that means flash storage for low latency in I/O. Capacity required will vary according to the scale of workloads and the likely scale of the results of AI processing, but hundreds of terabytes, even petabytes, are likely.
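The sizing reasoning above comes down to simple arithmetic, which can be sketched as follows. This is a rough illustration only: the GPU count, the per-GPU read rate and the dataset size are hypothetical placeholders rather than figures from any supplier, and decimal units (1TB = 1,000GB) are assumed.

```python
def required_read_throughput_gbps(num_gpus: int, per_gpu_gbps: float) -> float:
    """Aggregate read rate (GB per second) storage must sustain to keep every GPU fed."""
    return num_gpus * per_gpu_gbps


def full_pass_hours(dataset_tb: float, throughput_gbps: float) -> float:
    """Hours needed to read the whole dataset once at the given throughput."""
    seconds = dataset_tb * 1000 / throughput_gbps  # 1TB = 1,000GB (decimal units)
    return seconds / 3600


# Hypothetical cluster: 64 GPUs, each assumed to consume 2GB of data per second.
demand_gbps = required_read_throughput_gbps(num_gpus=64, per_gpu_gbps=2.0)

# Hypothetical 500TB training dataset, read once end to end at that rate.
hours_per_pass = full_pass_hours(dataset_tb=500, throughput_gbps=demand_gbps)

print(f"Aggregate read demand: {demand_gbps:.0f}GBps")
print(f"One pass over the dataset: {hours_per_pass:.2f} hours")
```

If storage delivers less than the aggregate demand, the time per pass stretches in proportion and the GPUs sit idle for the difference.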

Adequate throughput is also a factor because different AI frameworks store data differently. PyTorch, for example, tends towards a large number of smaller files, and TensorFlow the reverse. So, it’s not just a case of getting data to GPUs quickly, but also at the right volume and with the right I/O capabilities.

Recently, storage suppliers have pushed flash-based storage – often using high-density QLC flash – as potential general-purpose storage, including for datasets hitherto considered “secondary”, such as backup data, because customers may now want to access it at higher speed using AI.
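To illustrate the difference between many small files and fewer large ones, the sketch below packs small samples into larger tar archives ("shards") using only the Python standard library. It is a generic illustration, not the storage format of PyTorch, TensorFlow or any other framework, and the sample names and the shard size limit are hypothetical.

```python
import io
import tarfile


def pack_into_shards(samples: dict[str, bytes], shard_size_bytes: int) -> list[bytes]:
    """Group small samples into in-memory tar archives of at most roughly shard_size_bytes of payload."""
    shards: list[bytes] = []
    buffer = io.BytesIO()
    tar = tarfile.open(fileobj=buffer, mode="w")
    used = 0
    for name, payload in samples.items():
        # Start a new shard when adding this sample would exceed the limit.
        if used and used + len(payload) > shard_size_bytes:
            tar.close()
            shards.append(buffer.getvalue())
            buffer = io.BytesIO()
            tar = tarfile.open(fileobj=buffer, mode="w")
            used = 0
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
        used += len(payload)
    tar.close()
    if used:
        shards.append(buffer.getvalue())
    return shards


def read_shard(shard: bytes) -> dict[str, bytes]:
    """Read every sample back out of one shard."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        return {member.name: tar.extractfile(member).read() for member in tar.getmembers()}


# Ten hypothetical 1KB sensor readings packed into shards of up to 4KB of payload.
samples = {f"reading_{i:04d}.bin": bytes([i]) * 1024 for i in range(10)}
shards = pack_into_shards(samples, shard_size_bytes=4096)
print(f"{len(samples)} small files became {len(shards)} larger shards")
```

The same data can therefore produce very different I/O patterns: many small random reads in one layout, fewer large sequential reads in the other.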

Storage for AI projects will range from that which provides very high performance during training and inference to various forms of longer-term retention because it won’t always be clear at the outset of an AI project what data will be useful.

Is cloud storage good for AI workloads?

Cloud storage could be a viable option for AI workload data. Holding data in the cloud brings an element of portability, with data able to be “moved” nearer to its processing location.

Many AI projects start in the cloud because you can use the GPUs for the time you need them. The cloud is not cheap, but to deploy hardware on-premises, you need to have committed to a production project before it is justified.

All the key cloud providers offer AI services that range from pre-trained models and application programming interfaces (APIs) into models to AI/ML compute with scalable GPU deployment (Nvidia and their own) and storage infrastructure scalable to multiple petabytes.

Is object storage good for AI workloads?

Object storage is good for unstructured data, can scale massively, is often found in the cloud, and can handle almost any data type as an object. That makes it well-suited for the large, unstructured data workloads likely in AI and ML applications.

The presence of rich metadata is another plus for object storage. It can be searched and read to help find and organise the right data for AI training models. Data can be held almost anywhere, including in the cloud with communication via the S3 protocol.

But metadata, for all its benefits, can also overwhelm storage controllers and affect performance. And, if the cloud is the location for object storage, cloud costs need to be taken into account as data is accessed and moved.
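The two points above, selecting training data by metadata and counting the cost of moving it, can be sketched with an in-memory catalogue. The sketch does not call any real object storage API. The metadata keys, object names and sizes are hypothetical, and the per-GB transfer price is a placeholder, not a quoted cloud tariff.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A stand-in for an object and its user-defined metadata."""
    key: str
    size_bytes: int
    metadata: dict[str, str] = field(default_factory=dict)


def select_for_training(objects: list[StoredObject], **required: str) -> list[StoredObject]:
    """Return the objects whose metadata matches every required key/value pair."""
    return [
        obj for obj in objects
        if all(obj.metadata.get(k) == v for k, v in required.items())
    ]


def estimated_transfer_cost(objects: list[StoredObject], price_per_gb: float) -> float:
    """Estimate the cost of moving the objects out of the cloud, in whatever currency the price uses."""
    total_gb = sum(obj.size_bytes for obj in objects) / 1e9
    return total_gb * price_per_gb


# Hypothetical catalogue entries.
catalogue = [
    StoredObject("scans/a.png", 2_000_000_000, {"modality": "image", "licence": "internal"}),
    StoredObject("scans/b.png", 3_000_000_000, {"modality": "image", "licence": "internal"}),
    StoredObject("scans/c.png", 4_000_000_000, {"modality": "image", "licence": "third-party"}),
    StoredObject("logs/d.txt", 1_000_000_000, {"modality": "text", "licence": "internal"}),
]

selected = select_for_training(catalogue, modality="image", licence="internal")
cost = estimated_transfer_cost(selected, price_per_gb=0.05)  # placeholder price

print([obj.key for obj in selected])
print(f"Estimated transfer cost: {cost:.2f}")
```

Filtering on metadata first means only the data that is actually needed gets read and moved, which is where the cloud charges accrue.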

What do storage suppliers offer for AI?

Nvidia provides reference architectures and hardware stacks that include servers, GPUs and networking. These are the DGX BasePOD reference architecture and DGX SuperPOD turnkey infrastructure stack, which can be specified for industry verticals.

Storage suppliers have also focused on the I/O bottleneck so data can be delivered efficiently to large numbers of (very costly) GPUs.

Those efforts have ranged from integrations with infrastructure from Nvidia – the key player in GPU and AI server technology – via microservices such as NeMo for training and NIM for inference, to storage product validation with AI infrastructure, and to entire storage infrastructure stacks aimed at AI.

Supplier initiatives have also centred on the development of retrieval augmented generation (RAG) pipelines and hardware architectures to support them. RAG grounds the output of a trained AI model by reference to external, trusted information, in part to tackle so-called hallucinations.

Which storage suppliers offer products validated for Nvidia DGX?

Numerous storage suppliers have products validated with DGX offerings, including the following.

DataDirect Networks (DDN) offers its A³I AI400X2 all-NVMe storage appliances with SuperPOD. Each appliance delivers up to 90GBps throughput and three million IOPS.

Dell’s AI Factory is an integrated hardware stack spanning desktop, laptop and PowerEdge XE9680 server compute, PowerScale F710 storage, software and services, validated with Nvidia’s AI infrastructure. It is available via Dell’s Apex as-a-service scheme.

IBM has Spectrum Storage for AI with Nvidia DGX. It is a converged but separately scalable compute, storage and networking solution validated for Nvidia BasePOD and SuperPOD.

Backup provider Cohesity announced at Nvidia’s GTC 2024 event that it would integrate Nvidia NIM microservices and Nvidia AI Enterprise into its Gaia multicloud data platform, which allows use of backup and archive data to form a source of training data.

Hammerspace has GPUDirect certification with Nvidia. Hammerspace markets its Hyperscale NAS as a global file system built for AI/ML workloads and GPU-driven processing.

Hitachi Vantara has its Hitachi iQ, which provides industry-specific AI systems that use Nvidia DGX and HGX GPUs with the company’s storage.

HPE has GenAI supercomputing and enterprise systems with Nvidia components, a RAG reference architecture, and plans to build in NIM microservices. In March 2024, HPE upgraded its Alletra MP storage arrays to connect twice the number of servers and provide four times the capacity in the same rackspace, with 100Gbps connectivity between nodes in a cluster.

NetApp has product integrations with BasePOD and SuperPOD. At GTC 2024 NetApp announced integration of Nvidia’s NeMo Retriever microservice, a RAG software offering, with OnTap customer hybrid cloud storage.

Pure Storage has AIRI, a flash-based AI infrastructure certified with DGX and Nvidia OVX servers and using Pure’s FlashBlade//S storage. At GTC 2024, Pure announced it had created a RAG pipeline that uses Nvidia NeMo-based microservices with Nvidia GPUs and its storage, plus RAGs for specific industry verticals.

Vast Data launched its Vast Data Platform in 2023, which marries its QLC flash-and-fast-cache storage subsystems with database-like capabilities at native storage I/O level, and DGX certification.

In March 2024, hybrid cloud NAS maker Weka announced a hardware appliance certified to work with Nvidia’s DGX SuperPOD AI datacentre infrastructure.
