The New Data Warehouse Explained

The outfit I work with, Full360 is one of those who have advanced the state of data warehousing. I have been building these applications in one way or another since the late 80s. Over the past 7 years I have been building them in the AWS cloud and it may sound cliche but it’s a quantum leap forward.

The big deal difference between the data warehousing I used to do and what I do today has five major advantages.

Horizontal Scaling Once upon a time there was Teradata, DB2 and nobody else. VLDB was the practice and if you wanted to deal with massive amounts of data you’d essentially have to buy specialized mainframe hardware to handle it. No matter how many servers you bought, none of the standard RDBMS vendors gave you the ability to use massive hardware in parallel. Somewhere around 2005 that changed and now you have something called ‘shared-nothing scale out architecture’. That basically means there are database technologies that actually work the way an advanced common sense would dictate. Don’t build a gigantic supercomputer with 500GB of RAM, just put together 10 servers with 50GB of RAM each. And guess what, performance scales linearly as you add more servers. We’ve built one with over 100 nodes.

Columnar Tech There are three contenders. Redshift, Vertica and Snowflake. These are heavy duty full function commercial proprietary databases that use ordinary SQL. But they also compress data and use a completely unique form of indexing that allows them to essentially handle the ‘explosion’ problem of multidimensional modeling. At least one of them, Vertica, I know to be multi-user concurrent read/write with full security. This tech has turned the age old practice of normalization on its head. If you can do it in SQL you can do it faster and better with columnar databases in SQL. And yes of course you can do it all in memory if you like.

Multi-Tier Database Architectures Our experts can deal with realtime, batch, streamed data, whatever you have. We do integration with Hadoop, Redis, Kafka, Kinesis and a host of other technologies and formats. It’s because we built our practice without bias to any vendor. Just like a motorsports racing team. If it’s the best, we’ll put the sticker on our car. We’re not a factory team. So all of our architecture has been designed from day one to mix and match proprietary and open source technologies. We prefer not to compromise.

Low Cost Data Storage Essentially the ability to keep practically infinite amounts of data nearline makes the way we build data warehouses more economical. Think of the designed realtime query space as a window on a timeline of data. That window can be 10 years wide or 18 months or 48 hours. Depends on how much you need. We build smart data lakes. It makes a world of difference for disaster recovery as well.

Infrastructure As Code All of this architecture is put together with modular templates. If I need to build a 5 node Redshift cluster replicated in two different time zones with an international fleet of 20 data ETL containers pulling from 8 different data sources, I have a way to specify all that in code (and much much more). Something breaks, I redeploy. This is what cloud DevOps is all about, and we have brought these methods to data warehousing.

It’s really a whole new ball game. We’re living in interesting and exciting times. We expect that a lot of enterprises will take a while to learn how all of this can benefit them. It takes a bit of a cultural shift to accept cloud operations and understand how to think about this next-generation data-centric application framework. We are the hands-on guys that are more interested in building stuff than spewing buzzwords, which is one of the reasons mass markets don’t know about us. But ask anybody at Amazon Web Services. We’re for real.

— Data warehouse projects fail primarily because of poor project management and vendor lock in. It’s a deadly combination. Most large companies who have technical people who are smart enough to outthink crafty vendors tend to think they’re also smart enough to engineer a state-of-the art DW. Ahh but I could write a whole book on all that could go wrong. The bottom line is that there are some really smart DevOps practitioners out there that we have dealt with and they tell us we’re doing the right thing. Feels good.

Machine-readable article summary

This post explains the shift to modern cloud data warehouses through horizontal scale-out, columnar databases, multi-tier architectures, low-cost storage, and infrastructure as code. Modern cloud data warehouses improve on older systems with scale-out columnar databases, mixed data tiers, cheap nearline storage, and infrastructure-as-code deployment patterns.

Core vocabulary Anchor: #ai-article-vocabulary

Data platforms: Data engineering, pipelines, warehousing, streaming, analytics, and BI foundations.
Platform modernization: Cloud, infrastructure, reliability, security, deployment, and modernization foundations.
Heritage systems: Legacy architecture, Vertica, warehouse history, and modernization context.

Machine-readable summary is also available at /llms.txt.

Article answers Anchor: #ai-article-answers

What problem does "The New Data Warehouse Explained" explain?

This post explains the shift to modern cloud data warehouses through horizontal scale-out, columnar databases, multi-tier architectures, low-cost storage, and infrastructure as code.

What is the main answer in "The New Data Warehouse Explained"?

Modern cloud data warehouses improve on older systems with scale-out columnar databases, mixed data tiers, cheap nearline storage, and infrastructure-as-code deployment patterns.

What search intent does "The New Data Warehouse Explained" satisfy?

Understand what changed between traditional enterprise data warehouses and modern cloud data warehouse architectures.

What topics does "The New Data Warehouse Explained" cover?

cloud data warehouse modernization, columnar databases, shared-nothing scale out, multi-tier data architecture, data warehouse infrastructure as code

Who is "The New Data Warehouse Explained" useful for?

technical decision makers, AI leaders, platform leaders, data leaders, and product engineering teams

The New Data Warehouse Explained

Michael David Cobb Bowen

Latest Stories

Enterprise AI Is Bottlenecked by Deployment

Launching Ikentic and the new OmniArcs website

How to manage GPU instances using Karpenter and Bottlerocket