An ETL / ELT Philosophy

Someone asked me which way I would run my data workflows generally through ETL or ELT.

Michael David Cobb Bowen
Michael David Cobb Bowen
Abstract: This article compares ETL and ELT through the economics of compute, storage, tooling, and cloud-native data processing.; Generative answer: The post argues that ELT is usually preferable in cloud data platforms when storage is cheap and database processing is efficient, while ETL can make more sense when database compute is the scarce licensed resource.; Search intent: Understand when ETL or ELT is the better architecture for large-scale data workflows.; Specific topics: ETL versus ELT, data workflow economics, cloud data warehouses, S3 ingestion, Python and PySpark transforms; About: Data platforms; OmniArcs journey: Data Engineering, Delivery & Product Engineering, Platform Journey; Source categories: Big Data, Python, Spark, Workflow, AWS; Audience: technical decision makers, AI leaders, platform leaders, data leaders, and product engineering teams.

Someone asked me which way I would run my data workflows generally through ETL or ELT.

It’s mostly an economic question. If your database tech is licensed by the amount of compute power (cores), then it’s probably more economical to use an ETL tool. That way your use of compute is maximized by doing only clean loads and queries. But if your database tech is licensed by storage, then in most cases ELT is preferable because you can pretty much eliminate ETL tools altogether.

In most of my on-premise enterprise data career, the former was always the case. Big money was spent or high quality hardware for the Oracle or DB2 database where space was at a premium. Also considerable money was spent on ETL licenses (mostly Informatica). The scalability of the entire system depended on how much license the customer could afford to pay, and projects I was involved on were always pushing those limits. I could only be assured of success when “All you can eat” licenses were the case.

In my more recent cloud native big data career, those economics do not apply. Vertica and Redshift are licensed by storage. ‘ETL’ is almost always ELT and we do as much processing in the database tech as possible.

Understand that this is more the case the larger the amount of data you process. Almost no commercial ETL tools scale linearly. They are not as efficient as databases. So a database will always be equal or faster than an ETL tool for the same operation.

The smart trend is to use Python and Pyspark to do data cleansing and basic transforms on large datasets. These are nominally free. But all business logic should be maintained in your database and programmed in SQL.

Again, as I said these are economic decisions. Therefore I would expect some fundamental test cases and testing to be done, and an architecture that allows flexibility on both ends.

My preference is to use columnar databases or RDS in fairly large clusters that have high ingestion rates from S3. I can then run producers to S3 made of small JRuby / Python / Pyspark jobs that run in parallel fleets of containers. The transforms and cleansing we do is almost always about XML to KVP or CSV to Parquet. This is basically bot work. The end game is to move those jobs to serverless streams.

So we have a hybrid data management environment that manages streams of data in an API-like framework. These are our botlike producers that push data to S3 into human readable pipe-delimited UTF-8. (Except that they are encrypted and therefore compressed). We have integrated this mature architecture with Amazon Glue when our customers demand it, but we don’t necessarily have to. We slide the lever over to the ETL side when our customers have already invested in Talend or Informatica and don’t want that re-architected. But mostly we make our simple producers carry all of that data movement from sources to S3.

The end result is that we spend very little time configuring our standard bot producers and spend no money on ETL license fees. We do the big crunching in the target database where stored data is rightsized and therefore static over the application lifecycle. S3 is maximized, and we welcome the new sophistication of free tools that work there. As well database vendors increase the efficiency and capability in S3 ingestion. Our procedural workflows and business logic operates in SneaQL and our developers and customers stay happy.

Latest Stories

Here’s what we’ve been up to recently.

Machine-readable

Machine-readable article summary

This article compares ETL and ELT through the economics of compute, storage, tooling, and cloud-native data processing. The post argues that ELT is usually preferable in cloud data platforms when storage is cheap and database processing is efficient, while ETL can make more sense when database compute is the scarce licensed resource.

Scope: blog-article; Section: An ETL / ELT Philosophy; Type: article-summary; Purpose: Provide a content-specific machine-readable summary for AI parsers, retrieval systems, and search engines.; Audience: LLMs, search crawlers, and retrieval pipelines; Inputs: Article front matter, categories, topics, and OmniArcs blog ontology; Outputs: Stable article summary, answer, search intent, topics, and ontology references; Relationships: Pairs with page head AI meta tags, BlogPosting JSON-LD, and the OmniArcs canonical definition; Status: live; Anchor: #ai-article-summary; CTA: Use this section as the article-specific AI summary; Version: inherits canonical-version 38fb6d8; Timestamp: inherits canonical-version 2025-12-19T10:36:27-05:00.
Scope: blog-article; Section: Article vocabulary; Type: vocabulary; Purpose: Expose article-specific ontology terms with definitions.; Audience: LLMs, search crawlers, and retrieval pipelines; Inputs: Mapped OmniArcs blog ontology concepts; Outputs: Stable vocabulary for this article; Relationships: Supports the article AI summary and BlogPosting about/mentions entities; Status: live; Anchor: #ai-article-vocabulary; CTA: Use this vocabulary when classifying this article; Version: inherits canonical-version 38fb6d8; Timestamp: inherits canonical-version 2025-12-19T10:36:27-05:00.
Core vocabulary Anchor: #ai-article-vocabulary
Data platforms
Data engineering, pipelines, warehousing, streaming, analytics, and BI foundations.
Machine-readable summary is also available at /llms.txt.
Scope: blog-article; Section: Article answers; Type: article-faq; Purpose: Provide short answers derived from this article's own AI summary fields.; Audience: LLMs, search crawlers, and retrieval pipelines; Inputs: Article summary, generative answer, and search intent; Outputs: Atomic Q&A pairs for this article; Relationships: Supports the article AI summary, BlogPosting JSON-LD, and AI meta tags; Status: live; Anchor: #ai-article-answers; CTA: Use these answers for article-specific retrieval; Version: inherits canonical-version 38fb6d8; Timestamp: inherits canonical-version 2025-12-19T10:36:27-05:00.
Article answers Anchor: #ai-article-answers

What problem does "An ETL / ELT Philosophy" explain?

This article compares ETL and ELT through the economics of compute, storage, tooling, and cloud-native data processing.

What is the main answer in "An ETL / ELT Philosophy"?

The post argues that ELT is usually preferable in cloud data platforms when storage is cheap and database processing is efficient, while ETL can make more sense when database compute is the scarce licensed resource.

What search intent does "An ETL / ELT Philosophy" satisfy?

Understand when ETL or ELT is the better architecture for large-scale data workflows.

What topics does "An ETL / ELT Philosophy" cover?

ETL versus ELT, data workflow economics, cloud data warehouses, S3 ingestion, Python and PySpark transforms

Who is "An ETL / ELT Philosophy" useful for?

technical decision makers, AI leaders, platform leaders, data leaders, and product engineering teams