Data Engineering

Data Manipulation: Techniques, Tools, and Best Practices

Saad Selim

May 4, 2026

10 min read

Data manipulation is the process of changing data to make it more organized, readable, and useful for analysis. It includes sorting, filtering, aggregating, transforming, joining, and reshaping data. Every analytics workflow, from simple reporting to machine learning, starts with data manipulation.

Core Data Manipulation Operations

1. Filtering (Selecting Rows)

Keep only the rows that match specific criteria.

SQL:

SELECT * FROM orders
WHERE status = 'completed'
AND order_date >= '2026-01-01'
AND amount > 100;

Python (pandas):

filtered = orders[
    (orders['status'] == 'completed') &
    (orders['order_date'] >= '2026-01-01') &
    (orders['amount'] > 100)
]

2. Sorting (Ordering Rows)

Arrange rows in a specific order.

SQL:

SELECT customer_name, total_spend
FROM customers
ORDER BY total_spend DESC
LIMIT 10;

3. Aggregating (Summarizing)

Reduce many rows to summary statistics.

SQL:

SELECT
    product_category,
    COUNT(*) AS order_count,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_revenue
FROM orders
GROUP BY product_category;

4. Joining (Combining Tables)

Merge data from multiple tables using shared keys.

SQL:

SELECT
    o.order_id,
    o.amount,
    c.customer_name,
    c.segment
FROM orders o
JOIN customers c ON o.customer_id = c.id;

5. Transforming (Creating New Columns)

Derive new values from existing columns.

SQL:

SELECT
    order_id,
    amount,
    quantity,
    amount / quantity AS unit_price,
    CASE
        WHEN amount > 1000 THEN 'large'
        WHEN amount > 100 THEN 'medium'
        ELSE 'small'
    END AS order_size,
    DATE_TRUNC('month', order_date) AS order_month
FROM orders;

6. Pivoting (Reshaping)

Convert rows to columns or columns to rows.

SQL (pivot using CASE):

SELECT
    customer_id,
    SUM(CASE WHEN product = 'Basic' THEN revenue ELSE 0 END) AS basic_revenue,
    SUM(CASE WHEN product = 'Pro' THEN revenue ELSE 0 END) AS pro_revenue,
    SUM(CASE WHEN product = 'Enterprise' THEN revenue ELSE 0 END) AS enterprise_revenue
FROM subscriptions
GROUP BY customer_id;

7. Deduplication (Removing Duplicates)

Eliminate duplicate rows.

SQL:

WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at DESC
        ) AS rn
    FROM contacts
)
SELECT * FROM ranked WHERE rn = 1;

8. String Manipulation

Clean and transform text data.

SQL:

SELECT
    LOWER(TRIM(email)) AS clean_email,
    UPPER(LEFT(first_name, 1)) || LOWER(SUBSTRING(first_name, 2)) AS proper_name,
    SPLIT_PART(email, '@', 2) AS email_domain,
    REPLACE(phone, '-', '') AS clean_phone
FROM users;

9. Date Manipulation

Extract and transform temporal data.

SQL:

SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    EXTRACT(DOW FROM order_date) AS day_of_week,
    order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS days_between_orders,
    CURRENT_DATE - order_date AS days_since_order
FROM orders;

10. Null Handling

Deal with missing values.

SQL:

SELECT
    customer_id,
    COALESCE(phone, email, 'no contact') AS primary_contact,
    COALESCE(discount_rate, 0) AS effective_discount,
    NULLIF(status, '') AS clean_status  -- Convert empty string to NULL
FROM customers;

Data Manipulation Tools

Tool	Language	Best For	Scale
SQL	SQL	Database-resident data, production analytics	Billions of rows
pandas	Python	Exploratory analysis, complex transformations	Millions of rows
Polars	Python/Rust	Fast pandas alternative, larger datasets	Tens of millions
dbt	SQL	Automated transformations in warehouses	Billions of rows
Excel/Sheets	Point-and-click	Quick manipulation of small datasets	Thousands of rows
Spark	Python/Scala/SQL	Distributed processing of massive data	Billions+ rows
Skopx	Natural language	Non-technical users who need data answers	Any scale

Common Data Manipulation Patterns

Pattern 1: Customer 360

Combine data from multiple sources into a unified customer view:

SELECT
    c.customer_id,
    c.name,
    c.segment,
    o.total_orders,
    o.total_revenue,
    o.last_order_date,
    s.open_tickets,
    s.avg_resolution_hours,
    p.feature_adoption_score,
    p.last_login
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, MAX(order_date) AS last_order_date
    FROM orders GROUP BY 1
) o ON c.customer_id = o.customer_id
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS open_tickets, AVG(resolution_hours) AS avg_resolution_hours
    FROM support_tickets WHERE status = 'open' GROUP BY 1
) s ON c.customer_id = s.customer_id
LEFT JOIN (
    SELECT user_id AS customer_id, feature_score AS feature_adoption_score, last_login
    FROM product_usage
) p ON c.customer_id = p.customer_id;

Pattern 2: Period-over-Period Comparison

Compare current period to previous period:

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(revenue) AS revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND((revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0 /
        LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly;

Pattern 3: Funnel Analysis

Track conversion through sequential steps:

SELECT
    COUNT(DISTINCT CASE WHEN visited THEN user_id END) AS visitors,
    COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) AS signups,
    COUNT(DISTINCT CASE WHEN activated THEN user_id END) AS activated,
    COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) AS subscribers,
    ROUND(COUNT(DISTINCT CASE WHEN signed_up THEN user_id END) * 100.0 /
        NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS visit_to_signup_pct,
    ROUND(COUNT(DISTINCT CASE WHEN subscribed THEN user_id END) * 100.0 /
        NULLIF(COUNT(DISTINCT CASE WHEN visited THEN user_id END), 0), 1) AS overall_conversion_pct
FROM user_journey
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

Best Practices

Manipulate at the source. Do transformations in SQL (where the data lives) rather than pulling raw data and manipulating in application code.
Be explicit about NULLs. Always decide how to handle missing values rather than letting them propagate silently.
Document transformations. Others (including future you) need to understand what and why.
Test edge cases. Empty sets, NULLs, duplicates, and boundary values should all be handled.
Optimize for readability. Use CTEs, meaningful aliases, and consistent formatting.
Validate results. After manipulation, spot-check that outputs make sense (row counts, value ranges, known answers).

Summary

Data manipulation is the essential skill that connects raw data to actionable analysis. Whether you use SQL, Python, or AI-powered tools like Skopx (which generates the manipulation logic from natural language), the core operations remain the same: filter, sort, aggregate, join, transform, and reshape. Master these patterns and you can answer any business question your data can support.

Share this article

Saad Selim

The Skopx engineering and product team

Data Manipulation: Techniques, Tools, and Best Practices

Core Data Manipulation Operations

1. Filtering (Selecting Rows)

2. Sorting (Ordering Rows)

3. Aggregating (Summarizing)

4. Joining (Combining Tables)

5. Transforming (Creating New Columns)

6. Pivoting (Reshaping)

7. Deduplication (Removing Duplicates)

8. String Manipulation

9. Date Manipulation

10. Null Handling

Data Manipulation Tools

Common Data Manipulation Patterns

Pattern 1: Customer 360

Pattern 2: Period-over-Period Comparison

Pattern 3: Funnel Analysis

Best Practices

Summary

Share this article

Saad Selim

Related Articles

Data Modeling Tools: Top Options for 2026 Compared

Reverse ETL: What It Is, Why You Need It, and How It Works

Data Preprocessing: Steps, Techniques, and Why It Matters

Slowly Changing Dimensions: Types 1-6 Explained with Examples

Modern Data Stack: What It Includes and Where It Is Heading in 2026

Accounting Automation Software: What to Automate First

Stay Updated