Decision points in storage for artificial intelligence, machine learning and big data

Storage for artificial intelligence and machine learning is not one-size-fits-all. Analytics workloads vary, and so do their storage requirements for capacity, latency, throughput and IOPS. We look at the key decision points.

Data analytics has rarely been more newsworthy. Throughout the Covid-19 coronavirus pandemic, governments and bodies such as the World Health Organization (WHO) have produced a stream of statistics and mathematical models.

Businesses have run models to test post-lockdown scenarios, planners have looked at traffic flows and public transport journeys, and firms use artificial intelligence (AI) to reduce the workload for hard-pressed customer services teams and to handle record demand for e-commerce.

All that places more demand on storage.

Even before Covid-19, industry analysts at Gartner pointed out that expansion of digital business would “result in the unprecedented growth of unstructured data within the enterprise in the next few years”.

Advanced analytics needs powerful computing to turn data into insights. Machine learning (ML) and AI take this to another level because such systems need rich datasets for training and rapid access to new data for operations. These can run to multiple petabytes.

Sure, all data-rich applications put pressure on storage systems, but the demands can differ.

“Data-intensive applications have multiple storage architectures. It is all about the specific KPI [key performance indicators] of the specific workload,” says Julia Palmer, research vice-president at Gartner. “Some of those workloads require lower latency and some of them require higher throughput.”

AI, ML and big data: Storage demands

All big data and AI projects need to mix performance, capacity and economy. But that mix will vary, depending on the application and where it is in its lifecycle.

Projects based on unstructured data, especially images and video, involve large single files.

AI applications such as surveillance and facial recognition, as well as geological, scientific and medical research, also use large files and so need petabyte-scale storage.

Applications based on business systems data, such as sales or enterprise resource planning (ERP), might only need a few hundred megabytes to be effective.

Sensor-based applications that include maintenance, repair and overhaul technologies in transport and power generation could run to the low hundreds of gigabytes.

Meanwhile, applications based on compute-intensive machine learning and dense neural networks need high throughput and low latency, says Gartner’s Palmer. But they also need access to scalable, low-cost storage for potentially large volumes of data.

AI and ML applications have distinct cycles of storage demand too. The learning or training phase is most data intensive, with more data making for a better model. And storage needs to keep up with the compute engines that run the algorithm. Model training needs high throughput and low latency.
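
To see why training leans so heavily on throughput, a back-of-envelope calculation helps. The short Python sketch below estimates the sustained read bandwidth storage must deliver to keep compute engines fed; the batch size, sample size and step time are purely illustrative assumptions, not figures from any specific system.

```python
# Back-of-envelope estimate of the storage throughput needed to keep
# accelerators busy during model training. All figures are illustrative
# assumptions, not measurements from any particular system.

def required_throughput_gbps(samples_per_step: int,
                             avg_sample_mb: float,
                             step_time_s: float) -> float:
    """Return the sustained read bandwidth (GB/s) the storage layer must
    deliver so the data pipeline does not stall the compute engines."""
    bytes_per_step = samples_per_step * avg_sample_mb * 1024 * 1024
    return bytes_per_step / step_time_s / 1e9

# Example: 512 images of ~1.5 MB each consumed every 0.25 s training step.
print(f"{required_throughput_gbps(512, 1.5, 0.25):.1f} GB/s sustained read")
```

On those assumed numbers the data pipeline alone needs more than 3 GB/s of sustained reads, which is why training clusters are usually paired with parallel or all-flash storage tiers.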

IOPS is not the only measure

Once the system is trained, requirements can be modest because the model only needs to examine relevant data.

Here, latency becomes more important than throughput. But this presents a challenge for IT departments because conventional storage solutions usually struggle to perform well for both sequential and random input/output (I/O).
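
To make the sequential-versus-random distinction concrete, here is a minimal Python sketch that times both access patterns against the same scratch file. The file name, block size and file size are arbitrary assumptions, and results will vary widely with the underlying media and operating system caching.

```python
# Minimal sketch contrasting sequential and random read patterns on the
# same file. Results depend heavily on the media (HDD vs SSD) and on OS
# caching; this is illustrative, not a rigorous benchmark.
import os
import random
import time

PATH = "io_test.bin"            # hypothetical scratch file
BLOCK = 4096                    # 4 KiB reads, a common random-I/O unit
FILE_SIZE = 256 * 1024 * 1024   # 256 MiB test file

with open(PATH, "wb") as f:     # create the test file
    f.write(os.urandom(FILE_SIZE))

offsets = list(range(0, FILE_SIZE, BLOCK))

def timed_read(offset_order):
    """Read BLOCK bytes at each offset in the given order and time it."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offset_order:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

seq = timed_read(offsets)                                # sequential pass
rnd = timed_read(random.sample(offsets, len(offsets)))   # random pass
print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s")

os.remove(PATH)
```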

For data analytics, typical batch-based workflows need to maximise the use of computing resources to speed up processing.

As a result, big data and analytics projects work well with distributed data, notes Ronan McCurtin, vice-president for northern Europe at Acronis.

“It is better to have distributed storage for data analytics and, for example, apply Hadoop or Spark technologies for big data analysis. In this case, the analyst can solve issues with memory limitations and run tasks on many machines. AI/ML training/inference requires fast SSD storage.”
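
As a rough illustration of the distributed approach McCurtin describes, the PySpark sketch below runs an aggregation that Spark parallelises across a cluster rather than pulling the data onto one machine. The dataset path and column names are hypothetical.

```python
# Minimal PySpark sketch of a distributed aggregation over data held in
# shared storage. The path and column names ("events.parquet",
# "device_id", "reading") are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-aggregation").getOrCreate()

# Spark reads the dataset in parallel from distributed storage (HDFS,
# object storage, etc.) instead of loading it on a single node.
events = spark.read.parquet("hdfs:///data/events.parquet")

summary = (events
           .groupBy("device_id")
           .agg(F.avg("reading").alias("avg_reading"),
                F.count("*").alias("samples")))

summary.show()
spark.stop()
```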

But solid-state technology is typically too expensive for large volumes of data and long-term retention, while the need to replicate volumes for distributed processing adds further cost.

As Stephen Gilderdale, a senior director at Dell Technologies, points out, organisations have moved on from a primary focus on ERP and customer relationship management (CRM) systems to heavier use of unstructured data.

And analytics has moved on too. It is no longer simply a study of historical data, “looking backwards to move forwards”. Instead, predictive and real-time analytics, including analysis of sensor data, are growing in importance.

Here, data volumes are lower, but the system will need to process the data very quickly to deliver insights back to the business. System designers need to ensure the network is not the bottleneck. This is prompting architects to look at edge processing, often combined with centralised cloud storage and compute.
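
As a simple illustration of that edge-plus-centre pattern, the Python sketch below assumes a hypothetical edge node that reduces a window of raw sensor readings to a compact summary before forwarding it upstream; the endpoint URL and payload fields are placeholders, not part of any real system.

```python
# Minimal sketch of edge-side reduction: summarise raw sensor readings
# locally so only a compact aggregate crosses the network to central
# cloud storage. The endpoint URL and payload fields are placeholders.
import json
import statistics
import urllib.request

def summarise(readings: list[float]) -> dict:
    """Reduce a window of raw readings to a small summary record."""
    return {
        "count": len(readings),
        "mean": statistics.fmean(readings),
        "max": max(readings),
        "min": min(readings),
    }

def forward(summary: dict, endpoint: str = "https://example.invalid/ingest"):
    """Send the aggregate upstream; the raw samples stay at the edge."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(summary).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Would fail against the placeholder URL; shown for shape only.
    return urllib.request.urlopen(req)

window = [20.1, 20.4, 21.0, 19.8, 22.3]   # e.g. one minute of temperature data
print(summarise(window))
```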
