SQL for Data Analysis: The Essential Commands Every Analyst Needs
SQL is the language of data analysis. Whether you use Python, Excel, or AI-powered tools, understanding SQL gives you the ability to ask any question of any database. This guide covers the commands and patterns analysts use daily, organized by analytical task rather than by SQL syntax.
The Analytical SQL Toolkit
Counting and Summarizing
-- How many customers do we have?
SELECT COUNT(*) AS total_customers FROM customers;
-- How many by segment?
SELECT segment, COUNT(*) AS count
FROM customers
GROUP BY segment
ORDER BY count DESC;
-- How many unique values?
SELECT COUNT(DISTINCT country) AS countries FROM customers;
Aggregating Metrics
-- Revenue summary
SELECT
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order,
MIN(amount) AS smallest_order,
MAX(amount) AS largest_order,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2026-01-01';
Grouping and Comparing
-- Revenue by product category
SELECT
p.category,
COUNT(DISTINCT o.customer_id) AS customers,
COUNT(*) AS orders,
SUM(o.amount) AS revenue,
AVG(o.amount) AS avg_order_value
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.order_date >= '2026-01-01'
GROUP BY p.category
ORDER BY revenue DESC;
Time-Based Analysis
-- Monthly revenue trend
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(*) AS orders,
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
GROUP BY 1
ORDER BY 1;
Period-Over-Period Comparison
-- Month-over-month growth
WITH monthly AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly
ORDER BY month DESC
LIMIT 12;
Ranking and Top-N
-- Top 10 customers by revenue
SELECT
customer_id,
customer_name,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY 1, 2
ORDER BY total_revenue DESC
LIMIT 10;
-- Rank within groups
SELECT
department,
employee_name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
Cohort Analysis
-- Customer retention by signup cohort
WITH first_purchase AS (
SELECT customer_id, MIN(order_date) AS cohort_date
FROM orders GROUP BY 1
)
SELECT
DATE_TRUNC('month', fp.cohort_date) AS cohort_month,
DATEDIFF('month', fp.cohort_date, o.order_date) AS months_since_first,
COUNT(DISTINCT o.customer_id) AS active_customers
FROM orders o
JOIN first_purchase fp ON o.customer_id = fp.customer_id
GROUP BY 1, 2
ORDER BY 1, 2;
Funnel Analysis
-- Conversion funnel
SELECT
COUNT(DISTINCT CASE WHEN visited THEN user_id END) AS visitors,
COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) AS signups,
COUNT(DISTINCT CASE WHEN trial_started THEN user_id END) AS trials,
COUNT(DISTINCT CASE WHEN paid THEN user_id END) AS paying,
ROUND(COUNT(DISTINCT CASE WHEN paid THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS overall_conversion
FROM user_events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
Window Functions for Running Calculations
-- Running total and moving average
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_7day_avg
FROM (
SELECT order_date, SUM(amount) AS daily_revenue
FROM orders GROUP BY 1
) daily;
Finding Anomalies
-- Days with unusual order volume (> 2 standard deviations from mean)
WITH daily_stats AS (
SELECT
order_date,
COUNT(*) AS order_count
FROM orders
GROUP BY 1
),
stats AS (
SELECT AVG(order_count) AS avg_count, STDDEV(order_count) AS std_count
FROM daily_stats
)
SELECT ds.order_date, ds.order_count
FROM daily_stats ds
CROSS JOIN stats s
WHERE ds.order_count > s.avg_count + 2 * s.std_count
OR ds.order_count < s.avg_count - 2 * s.std_count
ORDER BY ds.order_date DESC;
Segmentation
-- RFM segmentation
WITH customer_metrics AS (
SELECT
customer_id,
DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS recency_days,
COUNT(*) AS frequency,
SUM(amount) AS monetary
FROM orders
GROUP BY 1
)
SELECT
customer_id,
CASE
WHEN recency_days <= 30 AND frequency >= 10 AND monetary >= 5000 THEN 'Champion'
WHEN recency_days <= 60 AND frequency >= 5 THEN 'Loyal'
WHEN recency_days <= 30 AND frequency <= 2 THEN 'New'
WHEN recency_days > 90 AND frequency >= 5 THEN 'At Risk'
WHEN recency_days > 180 THEN 'Lost'
ELSE 'Regular'
END AS segment
FROM customer_metrics;
CTEs: Organizing Complex Analysis
Common Table Expressions (WITH clauses) make complex queries readable:
WITH
-- Step 1: Get customer-level metrics
customer_summary AS (
SELECT
customer_id,
MIN(order_date) AS first_order,
MAX(order_date) AS last_order,
COUNT(*) AS total_orders,
SUM(amount) AS lifetime_revenue
FROM orders
GROUP BY 1
),
-- Step 2: Calculate retention
retention AS (
SELECT
customer_id,
lifetime_revenue,
DATEDIFF('day', first_order, last_order) AS customer_lifespan_days,
DATEDIFF('day', last_order, CURRENT_DATE) AS days_since_last_order
FROM customer_summary
)
-- Step 3: Classify
SELECT
CASE
WHEN days_since_last_order > 90 THEN 'churned'
WHEN days_since_last_order > 30 THEN 'at_risk'
ELSE 'active'
END AS status,
COUNT(*) AS customer_count,
AVG(lifetime_revenue) AS avg_ltv
FROM retention
GROUP BY 1;
Key SQL Patterns for Analysts
| Pattern | When to Use |
|---|---|
| GROUP BY + aggregates | Summarize by category |
| Window functions (OVER) | Running totals, rankings, comparisons |
| CTEs (WITH clause) | Break complex queries into readable steps |
| CASE WHEN | Create categories, conditional logic |
| JOINs | Combine data from multiple tables |
| Subqueries | Filter based on calculated values |
| DATE_TRUNC | Group by time periods |
| LAG/LEAD | Compare to previous/next period |
| PERCENTILE_CONT | Find medians and percentiles |
| COALESCE | Handle NULL values gracefully |
Beyond Manual SQL
For teams that want analytical power without writing SQL, platforms like Skopx translate natural language questions into optimized SQL automatically. Ask "What is our customer retention rate by acquisition channel?" and get the query, result, and visualization instantly. This democratizes the analytical patterns above to anyone in the organization.
Summary
SQL for data analysis is not about memorizing syntax. It is about knowing which patterns solve which business questions. Master GROUP BY for summarization, window functions for sequential analysis, CTEs for organization, and JOINs for combining data sources. These patterns cover 90% of analytical queries you will ever need.
Saad Selim
The Skopx engineering and product team