Data Lakes And Data Warehouses: The Two Sides Of A Modern Cloud Data Platform

When it comes to data, today’s CIOs are being pulled in all directions. They’re asked to provide an effective data platform that supports machine learning and other data-driven innovation, but they must also continue to support traditional business intelligence (BI) tools and weekly data reports. Successful CIOs have figured out how to address both of these needs without making tradeoffs, and the answer lies in understanding the relative strengths of data warehouses and data lakes.

Much has been said about whether data lakes will replace data warehouses, but I view them as complementary. Data warehouses enable fast, complex queries across historical structured data. They help businesses learn from the past, so a retailer can understand what products have sold well in a particular region, for example. Data lakes focus on learning about the present using streaming analytics — and the future using predictive analytics and machine learning. They’re a repository for a wide variety of unstructured and semi-structured data — think video, social streams and IoT sensor data, for example. And they support the technologies data scientists use today for machine learning, such as the Python language and multiple open-source query engines.

So how does a CIO build a cohesive architecture that brings these two platforms together?

With a data warehouse, data must pass through a standardized ETL (extract, transform, load) process before it becomes available to users, and that process takes time. With a data lake, what we think of as ETL happens at the moment the data is read or consumed. Through continuous data engineering, data is kept in efficient, query-ready form so users can access it directly from the data lake. This is critical because data can become stale in a matter of hours or even minutes, and organizations often don't have the luxury of a long ETL cycle before making data available to users.
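To make that contrast concrete, here is a minimal schema-on-read sketch using PySpark, one of the open-source engines commonly used on data lakes. The bucket path and field names are hypothetical; the point is that the raw events are interpreted at query time rather than being transformed into a warehouse schema up front.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Schema-on-read: the raw JSON events are interpreted when queried,
# with no upfront ETL into a fixed warehouse schema.
# The bucket path and field names are illustrative placeholders.
events = spark.read.json("s3a://example-lake/clickstream/2024/*.json")

# A simple aggregation applied directly to the data in the lake.
daily_page_views = (
    events
    .withColumn("day", F.to_date("event_time"))
    .groupBy("day", "page")
    .count()
    .orderBy("day")
)
daily_page_views.show()
```

The specific engine matters less than the pattern: the same files can be read by multiple tools, each applying the schema it needs at read time.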

A modern data architecture accelerates return on investment by combining a data warehouse and a data lake, federating relational and non-relational data stores into a single, cohesive architecture. This enables new practices that complement the core data warehouse without replacing it. The warehouse remains the right platform for the standardized data behind BI reports, dashboards and OLAP (online analytical processing), while the data lake supports newer use cases such as streaming analytics and machine learning.

To visualize this, imagine a cloud object store as the bottom layer of this modern data architecture. Data from all sources resides here, including the structured data from traditional business applications and the unstructured data for your data lake — the clickstreams, images, server logs, IoT readings and other data required for machine learning and advanced analytics. Structured data is then fed from the object store into the data warehouse for BI workloads, while the rest is made available through a data lake platform for streaming analytics and machine learning.
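A rough sketch of that split, again in PySpark, might look like the following. The paths, warehouse connection details and table names are placeholders, not a specific product's API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-path-sketch").getOrCreate()

# Path 1: structured data from the object store is curated and loaded
# into the warehouse for BI. The JDBC URL and table are placeholders;
# a real job would also supply credentials and a JDBC driver.
orders = spark.read.parquet("s3a://example-lake/raw/orders/")
(orders
    .select("order_id", "region", "product_id", "amount", "order_date")
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
    .option("dbtable", "sales.orders")
    .mode("append")
    .save())

# Path 2: semi-structured data stays in the object store and is queried
# in place by the data lake platform.
clicks = spark.read.json("s3a://example-lake/raw/clickstream/")
clicks.createOrReplaceTempView("clickstream")
spark.sql("SELECT page, COUNT(*) AS views FROM clickstream GROUP BY page").show()
```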

In this way, the modern data platform becomes the single ingestion point for all new data. Organizations can transform and process data that was formerly destined only for the data warehouse using a schema-on-read approach, directly from the data lake.
