Why Build a Hybrid Data Lake for Analytics and AI

The current global crisis has intensified calls for cost-cutting across organizations. Data lakes that span on-premises environments and public cloud platforms continue to evolve, with the goal of keeping infrastructure and operational costs low while preserving business agility.

At many large organizations, traditional data lakes were established on-premises, with complex workflows spanning different business units. The on-premises infrastructure in these environments is often stressed, driving up the total cost of infrastructure. At the same time, these data-driven organizations are rapidly onboarding new and unanticipated workloads.

A completely on-premises infrastructure will fail to keep up because of the time it takes to provision new hardware and the heavy operational cost of maintaining every piece of hardware and software acquired. And although the promise of a public cloud vendor's managed, elastic infrastructure sounds appealing, costs quickly add up as workloads scale.

As data volumes continue to grow, the key to keeping costs under control is to remain flexible. By being prepared for infrastructure to be spread across an on-premises data lake and a public cloud, you can get the best of both worlds. But that is easier said than done. Here are four recommendations for how to leverage a hybrid cloud.

Migrate Incrementally

A complete lift-and-shift of the on-premises environment to a public cloud may sound daunting. And oftentimes, the benefits of elastic compute are outweighed by expensive storage, network, and operational costs. Instead, keep a foot in both worlds by moving some workloads off a busy on-premises data lake onto compute in the cloud, and be prepared for data and compute infrastructure to be spread across the on-premises environment and public clouds.
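As a rough illustration, here is a minimal PySpark sketch of one such migrated workload: the input stays on the existing on-premises data lake while the job runs on cloud-hosted compute and lands only its much smaller result in cloud object storage. The hostnames, paths, and bucket names are hypothetical, and the cluster would need network connectivity to the on-premises environment plus the usual HDFS and S3 connector configuration.

```python
# Sketch only: assumes cloud executors can reach the on-premises HDFS namenode
# and that the s3a connector is configured with credentials.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-migration-example").getOrCreate()

# Input stays where it already lives: the on-premises data lake.
events = spark.read.parquet("hdfs://onprem-namenode:8020/warehouse/events")

# The transformation runs on elastic, cloud-hosted executors.
daily_counts = events.groupBy("event_date").count()

# Only the small aggregated result is written to cloud object storage.
daily_counts.write.mode("overwrite").parquet("s3a://analytics-results/daily_counts")
```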

Remain Infrastructure Agnostic

Adopting cloud-native computing naively may require application rewrites, copying data to cloud storage, and redefining structured data catalogs. Such a migration is laborious and expensive. Abstraction is the key to remaining agnostic to the infrastructure provider at every layer of the technology stack. Container orchestration future-proofs the application layer so that workloads can be moved across infrastructure providers when needed. But data has gravity, and moving it incurs network and storage costs. A data orchestration layer similarly decouples applications from the physical location of data, so you can optimize which data resides where and for how long.
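To make the abstraction idea concrete, here is a minimal Python sketch (not any particular product's API): applications refer to datasets by logical name, and a small catalog maps each name to whatever physical location currently holds it. Moving data then only changes the catalog entry, not the application code. The dataset names, URIs, and the catalog itself are hypothetical; in practice this role is played by a metadata service or a data orchestration layer.

```python
from dataclasses import dataclass

@dataclass
class DatasetLocation:
    uri: str      # physical URI, e.g. hdfs://... or s3a://...
    format: str   # storage format, e.g. "parquet"

# Hypothetical catalog; in practice this would be served by a metadata store
# or a data orchestration layer rather than a hard-coded dict.
CATALOG = {
    "sales.orders": DatasetLocation("hdfs://onprem-namenode:8020/warehouse/orders", "parquet"),
    "sales.refunds": DatasetLocation("s3a://lake-bucket/warehouse/refunds", "parquet"),
}

def resolve(dataset: str) -> DatasetLocation:
    """Return the current physical location of a logical dataset name."""
    return CATALOG[dataset]

# Application code stays the same whether the data is on-premises or in the cloud.
location = resolve("sales.orders")
print(f"Reading {location.format} data from {location.uri}")
```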

Don’t Forget About Data Locality

A key premise of the original data analytics ecosystem was that data locality brings performance. When compute workloads are migrated to a cloud and separated from storage, that locality is lost. In a public cloud, performance gains translate into cost savings, because compute can be elastically scaled down when it is not needed. A highly distributed caching capability automatically keeps hot data close to compute for performance while leaving cold data in cheaper storage. Caching also eliminates repeated network transfers and the associated cloud operation costs.
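The caching idea can be sketched in a few lines of Python: keep recently used objects on local storage near compute and fall back to remote cold storage only on a miss, so each object crosses the network at most once. The cache directory and the fetch_from_remote helper are hypothetical stand-ins for a real object-store client and a production-grade cache.

```python
import os

# Hypothetical local cache directory on (or near) the compute nodes.
CACHE_DIR = "/tmp/data-cache"

def fetch_from_remote(key: str, dest_path: str) -> None:
    """Hypothetical stand-in for a cold-storage download (an object-store client in practice)."""
    with open(dest_path, "wb") as f:
        f.write(b"...bytes fetched from remote storage for " + key.encode())

def read_with_cache(key: str) -> str:
    """Return a local path for `key`, paying the network transfer only on a cache miss."""
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if os.path.exists(local_path):
        return local_path                  # cache hit: hot data served locally
    os.makedirs(CACHE_DIR, exist_ok=True)
    fetch_from_remote(key, local_path)     # cache miss: transfer once, then reuse
    return local_path
```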

Use Policies for Everything (as Much as Possible)

Each workload is unique, with different resource usage patterns. Elasticity in the cloud demands policies tailored to those workloads, for both compute and storage. Use auto-scaling policies to control when, and for how long, compute resources stay up. Similarly, use data management policies to determine which data is moved where, and when, to truly enable a hybrid cloud environment.
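As an illustration of what such policies might look like in code, here is a small Python sketch with made-up thresholds: one policy decides when to scale compute up or down based on average utilization, and another decides which storage tier a dataset belongs in based on how recently it was accessed.

```python
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    scale_up_above: float    # average CPU utilization that triggers scale-up
    scale_down_below: float  # utilization that triggers scale-down
    min_nodes: int
    max_nodes: int

    def target_nodes(self, current_nodes: int, avg_cpu: float) -> int:
        """Return the desired cluster size given the current utilization."""
        if avg_cpu > self.scale_up_above:
            return min(current_nodes + 1, self.max_nodes)
        if avg_cpu < self.scale_down_below:
            return max(current_nodes - 1, self.min_nodes)
        return current_nodes

@dataclass
class TieringPolicy:
    hot_after_days: int   # data accessed within this window stays on fast storage

    def tier_for(self, days_since_last_access: int) -> str:
        """Map a dataset's access recency to a storage tier."""
        return "hot-cache" if days_since_last_access <= self.hot_after_days else "cold-object-store"

# Example evaluation with hypothetical numbers.
compute = ScalingPolicy(scale_up_above=0.75, scale_down_below=0.25, min_nodes=2, max_nodes=20)
data = TieringPolicy(hot_after_days=7)
print(compute.target_nodes(current_nodes=4, avg_cpu=0.82))   # -> 5
print(data.tier_for(days_since_last_access=30))              # -> "cold-object-store"
```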
