What Is ETL? Extract, Transform, Load Explained Simply
ETL is one of those acronyms that sounds more intimidating than it is. If you have ever copied data from one spreadsheet, cleaned it up, and pasted it into another, you have done ETL manually. The acronym just names the three steps.
The Three Steps
Extract
Extraction means pulling data out of a source system. The source could be anything: a database, a SaaS application's API, a CSV file, an Excel spreadsheet, or a log file.
Real example: You want to analyze your e-commerce sales. The first step is extracting order data from Shopify. This might mean calling Shopify's API to get all orders from the past quarter, including the order ID, customer info, product details, total amount, and timestamp.
Extraction sounds simple, but it has its challenges:
- Rate limits: APIs often restrict how many requests you can make per minute.
- Pagination: Large datasets come in pages. You need to request page 1, then page 2, then page 3, and so on.
- Authentication: You need proper credentials (API keys, OAuth tokens) to access the data.
- Incremental vs. full: Do you pull all data every time, or just what changed since last run? Incremental extraction is more efficient but harder to implement.
    return orders
Transform
Transformation means cleaning, restructuring, and enriching the extracted data so it is useful for analysis.
Real example (continuing): Your Shopify order data has some issues:
- Dates are in different formats (some ISO, some human-readable)
- Customer names have inconsistent capitalization
- Shipping addresses need to be split into city, state, and ZIP
- You need to calculate "days to ship" from order date and fulfillment date
- Some orders are test orders that should be filtered out
Transformation handles all of this. Common transformations include:
- Cleaning: Removing duplicates, fixing null values, standardizing formats
- Filtering: Removing irrelevant records (test data, internal transactions)
- Enriching: Adding calculated fields (for example, profit = revenue minus cost)
- Joining: Combining data from multiple sources (match Shopify customers with Mailchimp email engagement)
- Aggregating: Summarizing data (daily orders into weekly totals)
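Several of these transformations can be shown on the Shopify example: standardizing dates, fixing capitalization, filtering test orders, and deriving "days to ship." A minimal sketch using only the standard library, where the field names and the two date formats are assumptions about the raw data:

```python
from datetime import datetime

def parse_date(value):
    """Handle the two formats assumed in the raw data: ISO and human-readable."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%b %d, %Y"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def transform(raw_orders):
    rows = []
    for order in raw_orders:
        if order.get("test"):  # filtering: drop test orders
            continue
        ordered = parse_date(order["created_at"])
        shipped = parse_date(order["fulfilled_at"])
        rows.append({
            "order_id": order["id"],
            "customer": order["customer"].strip().title(),  # cleaning: capitalization
            "ordered_at": ordered.date().isoformat(),       # cleaning: one date format
            "days_to_ship": (shipped - ordered).days,       # enriching: derived field
        })
    return rows

raw = [
    {"id": 1, "customer": "ADA lovelace", "created_at": "2024-03-01T09:30:00",
     "fulfilled_at": "Mar 04, 2024", "test": False},
    {"id": 2, "customer": "test user", "created_at": "2024-03-02T10:00:00",
     "fulfilled_at": "Mar 02, 2024", "test": True},
]
print(transform(raw))
```

The test order is dropped, the name becomes "Ada Lovelace", and both dates end up in one format, which is exactly the cleanup the warehouse queries will depend on.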
Load
Loading means putting the transformed data into its destination, typically a data warehouse or analytics database.
Real example (continuing): Your cleaned, transformed Shopify data is loaded into a PostgreSQL data warehouse. It now sits in a "sales_orders" table alongside data from other sources. Your BI tools and analytics platform can query this table for reports and dashboards.
Loading considerations:
- Full load vs. incremental load: Replace all data each time, or only add new/changed records?
- Schema management: The destination table needs to match the transformed data structure.
- Performance: Loading millions of rows needs to be efficient (batch inserts, not one at a time).
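The batch-insert point can be demonstrated concretely. Here sqlite3 stands in for the PostgreSQL warehouse so the sketch stays self-contained; with a real PostgreSQL connection the same `executemany` pattern applies (via psycopg2, with `%s` placeholders instead of `?`):

```python
import sqlite3

# Transformed rows ready to load: (order_id, customer, ordered_at, days_to_ship)
rows = [
    (1, "Ada Lovelace", "2024-03-01", 2),
    (2, "Grace Hopper", "2024-03-02", 1),
]

# sqlite3 stands in for the warehouse; the loading pattern is what matters.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    ordered_at TEXT,
    days_to_ship INTEGER)""")

# Batch insert: one statement for many rows, not one INSERT per row.
conn.executemany("INSERT INTO sales_orders VALUES (?, ?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM sales_orders").fetchone()[0])  # 2
```

For incremental loads you would typically use an upsert (`INSERT ... ON CONFLICT` in PostgreSQL) keyed on `order_id` rather than a plain insert.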
A Complete ETL Example
Here is a realistic scenario for a SaaS company:
Sources:
- Stripe (payment data)
- Intercom (support tickets)
- PostgreSQL production database (user accounts)
Extract: Pull last 24 hours of Stripe charges, new Intercom conversations, and new user signups from the production database.
Transform:
- Match Stripe charges to user accounts by email
- Categorize Intercom tickets by topic using keyword matching
- Calculate MRR (Monthly Recurring Revenue) from subscription data
- Flag churned users (active last month, not active this month)
Load: Insert into the analytics warehouse tables: daily_revenue, support_metrics, user_lifecycle.
Result: The analytics team can now query a single warehouse for cross-source questions like "What is the support ticket volume for users in their first 30 days?" or "Do customers on the annual plan submit fewer tickets?"
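Two of the transform rules above, MRR and churn flagging, reduce to small functions. A sketch under stated assumptions: the subscription fields are hypothetical, and annual plans are normalized as price divided by 12, which is a common convention but not the only one:

```python
def monthly_mrr(subscriptions):
    """MRR: normalize each active subscription to a monthly amount.

    Assumes each subscription dict has `price` and `interval` fields;
    annual plans count as price / 12 (a common convention).
    """
    total = 0.0
    for sub in subscriptions:
        total += sub["price"] / 12 if sub["interval"] == "year" else sub["price"]
    return round(total, 2)

def flag_churned(active_last_month, active_this_month):
    """Churned = active last month, not active this month (the rule above)."""
    return sorted(set(active_last_month) - set(active_this_month))

subs = [{"price": 30, "interval": "month"}, {"price": 240, "interval": "year"}]
print(monthly_mrr(subs))                            # 50.0
print(flag_churned({"u1", "u2", "u3"}, {"u2"}))     # ['u1', 'u3']
```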
ETL vs. ELT
In traditional ETL, you transform data before loading it. In ELT (Extract, Load, Transform), you load the raw data first and transform it inside the destination.
Why ELT is becoming popular:
Modern data warehouses (BigQuery, Snowflake, Redshift) are powerful enough to handle transformation directly. Loading raw data first means you always have the original source data available. If you need a different transformation later, you do not have to re-extract.
The practical difference:
- ETL: Extract from Shopify, transform in a Python script, load into PostgreSQL.
- ELT: Extract from Shopify, load raw data into BigQuery, transform using SQL views in BigQuery.
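The ELT ordering can be sketched end to end. Here sqlite3 stands in for a warehouse like BigQuery, since the point is the order of operations, not the engine: raw data lands first, untransformed, and the transformation is a SQL view defined inside the destination. Table and column names are illustrative:

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load step comes first in ELT: raw rows land exactly as extracted.
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, test INTEGER)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, 1999, 0), (2, 500, 1), (3, 4350, 0)])

# Transform step happens inside the destination, as a SQL view over raw data.
wh.execute("""CREATE VIEW orders AS
              SELECT id, amount_cents / 100.0 AS amount
              FROM raw_orders
              WHERE test = 0""")

total = wh.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 63.49
```

Because the raw table is untouched, a different transformation later is just a new view, with no re-extraction, which is the ELT advantage described above.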
When ETL is still better:
- When the raw data is too large and you need to filter before loading
- When data needs to be cleaned before it touches your warehouse (PII removal, for example)
- When your warehouse is not powerful enough for heavy transformations
Modern ETL Tools
Dedicated ETL platforms: Fivetran, Airbyte, Stitch, Hevo. These handle the extract and load steps with pre-built connectors for hundreds of data sources.
Transformation tools: dbt (data build tool) is the standard for defining transformations in SQL. It works inside your data warehouse, making it an ELT tool.
Orchestration: Apache Airflow, Dagster, Prefect. These schedule and coordinate ETL jobs, handling dependencies and failures.
All-in-one: Some platforms combine extraction, transformation, and loading with scheduling and monitoring.
When Do You Need ETL?
You probably need ETL if:
- You have data in 5+ different tools that needs to be analyzed together
- Your team spends hours manually combining data from different sources
- You need historical data that SaaS tools do not retain (many tools only keep 90 days)
- You want consistent, automated reporting instead of ad-hoc spreadsheet work
- Different departments are reporting different numbers for the same metric
You might not need ETL if:
- All your data lives in one system
- Your analysis needs are simple and ad-hoc
- You have fewer than 1,000 records to analyze
- A tool like Skopx can query your sources directly in real-time
How Conversational Analytics Reduces ETL Need
Here is an interesting development: platforms like Skopx connect directly to your data sources and query them in real-time. Instead of building an ETL pipeline to get Stripe data into a warehouse and then querying it, you just ask "what was last month's revenue by plan type?" and the AI queries Stripe directly.
This does not replace ETL for everything. If you need historical trend analysis across years of data, you still need a warehouse. If you need sub-second dashboard loads on millions of rows, you need pre-aggregated data. But for the 60-70% of business questions that are about current or recent data, direct querying eliminates the need to build and maintain ETL pipelines.
The future is likely hybrid: ETL pipelines for heavy analytical workloads and historical data, and direct AI-powered querying for ad-hoc questions and current data.
Skip the manual work. Ask your data in plain English.
Skopx connects to 47+ data sources and lets your whole team get answers without writing SQL or building dashboards.