Data Manipulation: Techniques, Tools, and Best Practices
Data manipulation is the process of changing data to make it more organized, readable, and useful for analysis. It includes sorting, filtering, aggregating, transforming, joining, and reshaping data. Every analytics workflow, from simple reporting to machine learning, starts with data manipulation.
Core Data Manipulation Operations
1. Filtering (Selecting Rows)
Keep only the rows that match specific criteria.
SQL:
SELECT * FROM orders
WHERE status = 'completed'
AND order_date >= '2026-01-01'
AND amount > 100;
Python (pandas):
filtered = orders[
(orders['status'] == 'completed') &
(orders['order_date'] >= '2026-01-01') &
(orders['amount'] > 100)
]
2. Sorting (Ordering Rows)
Arrange rows in a specific order.
SQL:
SELECT customer_name, total_spend
FROM customers
ORDER BY total_spend DESC
LIMIT 10;
3. Aggregating (Summarizing)
Reduce many rows to summary statistics.
SQL:
SELECT
product_category,
COUNT(*) AS order_count,
SUM(revenue) AS total_revenue,
AVG(revenue) AS avg_revenue
FROM orders
GROUP BY product_category;
4. Joining (Combining Tables)
Merge data from multiple tables using shared keys.
SQL:
SELECT
o.order_id,
o.amount,
c.customer_name,
c.segment
FROM orders o
JOIN customers c ON o.customer_id = c.id;
5. Transforming (Creating New Columns)
Derive new values from existing columns.
SQL:
SELECT
order_id,
amount,
quantity,
amount / quantity AS unit_price,
CASE
WHEN amount > 1000 THEN 'large'
WHEN amount > 100 THEN 'medium'
ELSE 'small'
END AS order_size,
DATE_TRUNC('month', order_date) AS order_month
FROM orders;
6. Pivoting (Reshaping)
Convert rows to columns or columns to rows.
SQL (pivot using CASE):
SELECT
customer_id,
SUM(CASE WHEN product = 'Basic' THEN revenue ELSE 0 END) AS basic_revenue,
SUM(CASE WHEN product = 'Pro' THEN revenue ELSE 0 END) AS pro_revenue,
SUM(CASE WHEN product = 'Enterprise' THEN revenue ELSE 0 END) AS enterprise_revenue
FROM subscriptions
GROUP BY customer_id;
7. Deduplication (Removing Duplicates)
Eliminate duplicate rows.
SQL:
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at DESC
) AS rn
FROM contacts
)
SELECT * FROM ranked WHERE rn = 1;
8. String Manipulation
Clean and transform text data.
SQL:
SELECT
LOWER(TRIM(email)) AS clean_email,
UPPER(LEFT(first_name, 1)) || LOWER(SUBSTRING(first_name, 2)) AS proper_name,
SPLIT_PART(email, '@', 2) AS email_domain,
REPLACE(phone, '-', '') AS clean_phone
FROM users;
9. Date Manipulation
Extract and transform temporal data.
SQL:
SELECT
order_date,
EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(MONTH FROM order_date) AS month,
EXTRACT(DOW FROM order_date) AS day_of_week,
order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS days_between_orders,
CURRENT_DATE - order_date AS days_since_order
FROM orders;
10. Null Handling
Deal with missing values.
SQL:
SELECT
customer_id,
COALESCE(phone, email, 'no contact') AS primary_contact,
COALESCE(discount_rate, 0) AS effective_discount,
NULLIF(status, '') AS clean_status -- Convert empty string to NULL
FROM customers;
Data Manipulation Tools
| Tool | Language | Best For | Scale |
|---|---|---|---|
| SQL | SQL | Database-resident data, production analytics | Billions of rows |
| pandas | Python | Exploratory analysis, complex transformations | Millions of rows |
| Polars | Python/Rust | Fast pandas alternative, larger datasets | Tens of millions |
| dbt | SQL | Automated transformations in warehouses | Billions of rows |
| Excel/Sheets | Point-and-click | Quick manipulation of small datasets | Thousands of rows |
| Spark | Python/Scala/SQL | Distributed processing of massive data | Billions+ rows |
| Skopx | Natural language | Non-technical users who need data answers | Any scale |
Common Data Manipulation Patterns
Pattern 1: Customer 360
Combine data from multiple sources into a unified customer view:
SELECT
c.customer_id,
c.name,
c.segment,
o.total_orders,
o.total_revenue,
o.last_order_date,
s.open_tickets,
s.avg_resolution_hours,
p.feature_adoption_score,
p.last_login
FROM customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, MAX(order_date) AS last_order_date
FROM orders GROUP BY 1
) o ON c.customer_id = o.customer_id
LEFT JOIN (
SELECT customer_id, COUNT(*) AS open_tickets, AVG(resolution_hours) AS avg_resolution_hours
FROM support_tickets WHERE status = 'open' GROUP BY 1
) s ON c.customer_id = s.customer_id
LEFT JOIN (
SELECT user_id AS customer_id, feature_score AS feature_adoption_score, last_login
FROM product_usage
) p ON c.customer_id = p.customer_id;
Pattern 2: Period-over-Period Comparison
Compare current period to previous period:
WITH monthly AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly;
Pattern 3: Funnel Analysis
Track conversion through sequential steps:
SELECT
COUNT(DISTINCT CASE WHEN visited THEN user_id END) AS visitors,
COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) AS signups,
COUNT(DISTINCT CASE WHEN activated THEN user_id END) AS activated,
COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) AS subscribers,
ROUND(COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS visit_to_signup_pct,
ROUND(COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS overall_conversion_pct
FROM user_journey
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
Best Practices
- Manipulate at the source. Do transformations in SQL (where the data lives) rather than pulling raw data and manipulating in application code.
- Be explicit about NULLs. Always decide how to handle missing values rather than letting them propagate silently.
- Document transformations. Others (including future you) need to understand what and why.
- Test edge cases. Empty sets, NULLs, duplicates, and boundary values should all be handled.
- Optimize for readability. Use CTEs, meaningful aliases, and consistent formatting.
- Validate results. After manipulation, spot-check that outputs make sense (row counts, value ranges, known answers).
Summary
Data manipulation is the essential skill that connects raw data to actionable analysis. Whether you use SQL, Python, or AI-powered tools like Skopx (which generates the manipulation logic from natural language), the core operations remain the same: filter, sort, aggregate, join, transform, and reshape. Master these patterns and you can answer any business question your data can support.
Saad Selim
The Skopx engineering and product team