Back to Resources
Data Engineering

Data Manipulation: Techniques, Tools, and Best Practices

Saad Selim
May 4, 2026
10 min read

Data manipulation is the process of changing data to make it more organized, readable, and useful for analysis. It includes sorting, filtering, aggregating, transforming, joining, and reshaping data. Every analytics workflow, from simple reporting to machine learning, starts with data manipulation.

Core Data Manipulation Operations

1. Filtering (Selecting Rows)

Keep only the rows that match specific criteria.

SQL:

SELECT * FROM orders
WHERE status = 'completed'
AND order_date >= '2026-01-01'
AND amount > 100;

Python (pandas):

filtered = orders[
    (orders['status'] == 'completed') &
    (orders['order_date'] >= '2026-01-01') &
    (orders['amount'] > 100)
]

2. Sorting (Ordering Rows)

Arrange rows in a specific order.

SQL:

SELECT customer_name, total_spend
FROM customers
ORDER BY total_spend DESC
LIMIT 10;

3. Aggregating (Summarizing)

Reduce many rows to summary statistics.

SQL:

SELECT
    product_category,
    COUNT(*) AS order_count,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_revenue
FROM orders
GROUP BY product_category;

4. Joining (Combining Tables)

Merge data from multiple tables using shared keys.

SQL:

SELECT
    o.order_id,
    o.amount,
    c.customer_name,
    c.segment
FROM orders o
JOIN customers c ON o.customer_id = c.id;

5. Transforming (Creating New Columns)

Derive new values from existing columns.

SQL:

SELECT
    order_id,
    amount,
    quantity,
    amount / quantity AS unit_price,
    CASE
        WHEN amount > 1000 THEN 'large'
        WHEN amount > 100 THEN 'medium'
        ELSE 'small'
    END AS order_size,
    DATE_TRUNC('month', order_date) AS order_month
FROM orders;

6. Pivoting (Reshaping)

Convert rows to columns or columns to rows.

SQL (pivot using CASE):

SELECT
    customer_id,
    SUM(CASE WHEN product = 'Basic' THEN revenue ELSE 0 END) AS basic_revenue,
    SUM(CASE WHEN product = 'Pro' THEN revenue ELSE 0 END) AS pro_revenue,
    SUM(CASE WHEN product = 'Enterprise' THEN revenue ELSE 0 END) AS enterprise_revenue
FROM subscriptions
GROUP BY customer_id;

7. Deduplication (Removing Duplicates)

Eliminate duplicate rows.

SQL:

WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at DESC
        ) AS rn
    FROM contacts
)
SELECT * FROM ranked WHERE rn = 1;

8. String Manipulation

Clean and transform text data.

SQL:

SELECT
    LOWER(TRIM(email)) AS clean_email,
    UPPER(LEFT(first_name, 1)) || LOWER(SUBSTRING(first_name, 2)) AS proper_name,
    SPLIT_PART(email, '@', 2) AS email_domain,
    REPLACE(phone, '-', '') AS clean_phone
FROM users;

9. Date Manipulation

Extract and transform temporal data.

SQL:

SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    EXTRACT(DOW FROM order_date) AS day_of_week,
    order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS days_between_orders,
    CURRENT_DATE - order_date AS days_since_order
FROM orders;

10. Null Handling

Deal with missing values.

SQL:

SELECT
    customer_id,
    COALESCE(phone, email, 'no contact') AS primary_contact,
    COALESCE(discount_rate, 0) AS effective_discount,
    NULLIF(status, '') AS clean_status  -- Convert empty string to NULL
FROM customers;

Data Manipulation Tools

ToolLanguageBest ForScale
SQLSQLDatabase-resident data, production analyticsBillions of rows
pandasPythonExploratory analysis, complex transformationsMillions of rows
PolarsPython/RustFast pandas alternative, larger datasetsTens of millions
dbtSQLAutomated transformations in warehousesBillions of rows
Excel/SheetsPoint-and-clickQuick manipulation of small datasetsThousands of rows
SparkPython/Scala/SQLDistributed processing of massive dataBillions+ rows
SkopxNatural languageNon-technical users who need data answersAny scale

Common Data Manipulation Patterns

Pattern 1: Customer 360

Combine data from multiple sources into a unified customer view:

SELECT
    c.customer_id,
    c.name,
    c.segment,
    o.total_orders,
    o.total_revenue,
    o.last_order_date,
    s.open_tickets,
    s.avg_resolution_hours,
    p.feature_adoption_score,
    p.last_login
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, MAX(order_date) AS last_order_date
    FROM orders GROUP BY 1
) o ON c.customer_id = o.customer_id
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS open_tickets, AVG(resolution_hours) AS avg_resolution_hours
    FROM support_tickets WHERE status = 'open' GROUP BY 1
) s ON c.customer_id = s.customer_id
LEFT JOIN (
    SELECT user_id AS customer_id, feature_score AS feature_adoption_score, last_login
    FROM product_usage
) p ON c.customer_id = p.customer_id;

Pattern 2: Period-over-Period Comparison

Compare current period to previous period:

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(revenue) AS revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
        LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly;

Pattern 3: Funnel Analysis

Track conversion through sequential steps:

SELECT
    COUNT(DISTINCT CASE WHEN visited THEN user_id END) AS visitors,
    COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) AS signups,
    COUNT(DISTINCT CASE WHEN activated THEN user_id END) AS activated,
    COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) AS subscribers,
    ROUND(COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) * 100.0 /
        NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS visit_to_signup_pct,
    ROUND(COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) * 100.0 /
        NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS overall_conversion_pct
FROM user_journey
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

Best Practices

  1. Manipulate at the source. Do transformations in SQL (where the data lives) rather than pulling raw data and manipulating in application code.
  2. Be explicit about NULLs. Always decide how to handle missing values rather than letting them propagate silently.
  3. Document transformations. Others (including future you) need to understand what and why.
  4. Test edge cases. Empty sets, NULLs, duplicates, and boundary values should all be handled.
  5. Optimize for readability. Use CTEs, meaningful aliases, and consistent formatting.
  6. Validate results. After manipulation, spot-check that outputs make sense (row counts, value ranges, known answers).

Summary

Data manipulation is the essential skill that connects raw data to actionable analysis. Whether you use SQL, Python, or AI-powered tools like Skopx (which generates the manipulation logic from natural language), the core operations remain the same: filter, sort, aggregate, join, transform, and reshape. Master these patterns and you can answer any business question your data can support.

Share this article

Saad Selim

The Skopx engineering and product team

Stay Updated

Get the latest insights on AI-powered code intelligence delivered to your inbox.