What is the role of a Structured Data Lake in DW?

The Full 360 Approach

Our approach is a little different from generic data lakes. We build structured data lakes. A structured data lake is just like any other in that it takes all sorts of data in any format, but we feed the lake with special programs called ‘producers’. These producers work independently, store metadata, and are optimized to chunk the data into the data lake with a basic understanding of how it will ultimately be consumed downstream. We always use dates and naming conventions, but we can arbitrarily add more metadata.

The purpose of this is to make the data lake more usable for direct consumers and downstream processes. The original developers of the source data could disappear from the planet, but anyone could eyeball the data and metadata and still have a good idea of what is in a structured data lake and how to use it.
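As a rough illustration of what a producer might do, here is a minimal sketch in Python. The function name `produce_chunk`, the `<source>/YYYY/MM/DD/` directory layout, the JSON Lines format, and the `_metadata.json` sidecar are all assumptions made for this example, not a description of the actual Full 360 producers.

```python
import json
from datetime import date
from pathlib import Path


def produce_chunk(lake_root, source, day, rows, extra_metadata=None):
    """Write one day's rows plus a metadata sidecar into the lake.

    Hypothetical naming convention: <lake_root>/<source>/YYYY/MM/DD/
    """
    chunk_dir = Path(lake_root) / source / f"{day:%Y/%m/%d}"
    chunk_dir.mkdir(parents=True, exist_ok=True)

    # One JSON document per line, so downstream readers can stream the chunk.
    data_path = chunk_dir / f"{source}_{day:%Y%m%d}.jsonl"
    with data_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Metadata that lets a stranger understand the chunk without the authors.
    metadata = {
        "source": source,
        "date": day.isoformat(),
        "row_count": len(rows),
        "columns": sorted({key for row in rows for key in row}),
    }
    # Arbitrary extra metadata can be layered on top of the required fields.
    metadata.update(extra_metadata or {})

    meta_path = chunk_dir / "_metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path
```

The point of the sidecar is that the date, the naming convention, and the column list travel with the data, so a downstream consumer can work out what a chunk contains by reading the lake alone.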

What you get

The big deal about a structured data lake is that it extends the capabilities of data warehouses and BI. I can build a DW with 6 months of history that is optimized for that window of time. Meanwhile, my data lake has an operational data store of 36 months at nearline speeds and 60 additional months offline. So my DW has the capacity for 102 months of data (6 + 36 + 60) because of the way I’ve designed it to consume from the structured data lake. But I can also allow direct consumers to query that history using the slow, cheap data lake.
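The arithmetic behind those windows can be sketched as a small routing function. The tier names, the whole-month age calculation, and the choice to treat the three windows as back-to-back ranges are assumptions for illustration only.

```python
from datetime import date

# Window sizes taken from the example above, in months.
DW_MONTHS = 6
NEARLINE_MONTHS = 36
OFFLINE_MONTHS = 60
TOTAL_MONTHS = DW_MONTHS + NEARLINE_MONTHS + OFFLINE_MONTHS  # 102


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def tier_for(record_date, today):
    """Pick the storage tier expected to hold data of a given age."""
    age = months_between(record_date, today)
    if age < 0:
        raise ValueError("record_date is in the future")
    if age < DW_MONTHS:
        return "warehouse"   # optimized 6-month window
    if age < DW_MONTHS + NEARLINE_MONTHS:
        return "nearline"    # operational data store in the lake
    if age < TOTAL_MONTHS:
        return "offline"     # slow, cheap history
    return "expired"         # older than anything retained
```

A query layer built this way could answer recent questions from the warehouse and fall back to the lake for older history without the consumer needing to know where the data lives.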

PLUS

Disaster recovery becomes a no-brainer. It is almost always faster to wipe a database and simply reload six months of history than it is to use database recovery tools from incremental backups. It is certainly always cheaper to do so. Having a data lake allows you to actually test that out. A proper data lake will always be faster for this purpose than NFS, and Amazon S3 will certainly be cheaper than a SAN of similar dimensions, not to mention more reliable and lower maintenance.
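To make the wipe-and-reload idea concrete, here is a small sketch that works out which lake prefixes a six-month reload would need to read. The `<source>/YYYY/MM/` prefix layout and the function name `reload_prefixes` are hypothetical; the actual load step depends on the warehouse and is left out.

```python
from datetime import date


def reload_prefixes(source, today, months=6):
    """Return monthly lake prefixes, oldest first, covering the reload window.

    Assumes a hypothetical <source>/YYYY/MM/ layout in the lake.
    """
    prefixes = []
    year, month = today.year, today.month
    for _ in range(months):
        prefixes.append(f"{source}/{year:04d}/{month:02d}/")
        # Step back one calendar month, rolling over the year boundary.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(prefixes))
```

Because the list is derived purely from the naming convention, the same plan can be rehearsed against a scratch database to test recovery before it is ever needed.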

PLUS

I can use my data lake to feed multiple instances of the data warehouse for hot swapping or for global deployment in different regions. I could also conceivably have my entire data lake replicated automatically. Although we’ve never had such a paranoid requirement, three years ago naysayers would yelp every time they heard tell of an AWS outage.

For more information about the elasticBI ‘Pitbull’ Framework for Data Warehousing and BI, check out this blog.

ELT vs ETL

Our structured data lakes will perform cleansing transformations in the producers. That is because for most file-based ingestion schemes we don’t have latency issues. That is, when we’re pulling data from a generic source that spits out files, end users can generally wait an hour before querying that data. For API-based ingestion schemes like message queues, or direct queries against upstream databases, we make the data instantly available to end users with minimal transformation and we fork off a copy for the data lake. The forked producers then do the rest of the cleansing and transformation necessary.
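The fork for API-based ingestion can be sketched roughly as follows. Plain Python lists stand in for the serving store, the fork queue, and the lake, and the `cleanse` rules (trim strings, drop empty fields) are placeholders invented for the example rather than the real transformations.

```python
def cleanse(record):
    """Placeholder cleansing: trim strings and drop empty fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in ("", None):
            cleaned[key] = value
    return cleaned


def ingest_message(record, serving_store, lake_queue):
    """Low-latency path: expose the record as-is and fork a copy for the lake."""
    serving_store.append(dict(record))  # minimal transformation, available now
    lake_queue.append(dict(record))     # forked copy, cleansed later


def run_forked_producer(lake_queue, lake):
    """Deferred path: drain the fork queue and write cleansed records to the lake."""
    while lake_queue:
        lake.append(cleanse(lake_queue.pop(0)))
```

End users see the raw record as soon as it arrives, while the lake only ever receives the cleansed version, which is the same result the file-based producers reach in a single step.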

There are cases when we leave data in its raw state and send that to the lake with no transformation. Those tend to be for data science consumers, or for cases when the business really has no idea what the data means and is not yet ready to present it in a way that’s structured for analysis. This is more often the case with straight HDFS data that’s left native and ‘annexed’ to the lake.

I’ll be talking more about data lakes this month. Stay tuned.


Michael David Cobb Bowen
