
Data Preprocessing: Steps, Techniques, and Why It Matters

Saad Selim
May 4, 2026
13 min read

Data preprocessing is the set of transformations applied to raw data before analysis or model training. It includes cleaning errors, handling missing values, normalizing scales, encoding categories, and engineering features. Preprocessing typically consumes 60-80% of a data science project's time and often has more impact on results than algorithm selection.

Raw data from real-world sources is messy. Databases contain nulls, duplicates, and inconsistencies. Sensor data has noise and outliers. Text data has typos and varying formats. Machine learning algorithms assume clean, well-structured numeric inputs. The gap between raw reality and algorithmic expectations is bridged by preprocessing.

Why Preprocessing Matters

Garbage in, garbage out. A sophisticated model trained on poorly preprocessed data will produce unreliable predictions. A simple model on well-prepared data often outperforms a complex model on raw data.

Algorithm requirements. Most ML algorithms cannot handle missing values, non-numeric features, or vastly different scales. Without preprocessing, models either fail to train or produce biased results.

Signal-to-noise ratio. Preprocessing removes noise (errors, irrelevant variation) and amplifies signal (meaningful patterns). This directly improves model accuracy and interpretation.

Computation efficiency. Reducing dimensionality, removing irrelevant features, and efficient encoding reduce training time and memory usage, especially important with large datasets.

Step 1: Data Cleaning

Handling Missing Values

Missing data occurs in virtually every real-world dataset. The approach depends on why data is missing and how much is missing.

Types of missingness:

  • MCAR (Missing Completely at Random): Missingness is unrelated to any variable. Safe to drop or impute simply.
  • MAR (Missing at Random): Missingness depends on observed variables. Need conditional imputation.
  • MNAR (Missing Not at Random): Missingness depends on the missing value itself. Hardest to handle; may need domain expertise.

Techniques:

| Method | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Drop rows | <5% missing, MCAR | Simple, no introduced bias | Loses data |
| Drop columns | >40% missing | Removes unreliable feature entirely | Loses information |
| Mean/median imputation | Numeric, MCAR, few missing | Simple, preserves sample size | Reduces variance, distorts correlations |
| Mode imputation | Categorical, few missing | Simple | Can amplify majority class |
| KNN imputation | Complex patterns, MAR | Captures local structure | Computationally expensive |
| Multiple imputation (MICE) | Rigorous analysis needed, MAR | Proper uncertainty quantification | Complex, slow |
| Indicator variable | Missingness might be informative | Preserves missingness information | Adds features |

Python example:

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Simple median imputation
imputer = SimpleImputer(strategy='median')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# KNN imputation (considers neighboring rows)
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])

Removing Duplicates

Duplicate records inflate frequencies, bias models, and distort aggregations.

# Exact duplicates
df = df.drop_duplicates()

# Near-duplicates (same customer, slightly different spelling)
# Requires fuzzy matching or domain-specific logic
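
As one hedged sketch of such fuzzy matching, the standard library's difflib can map slightly different spellings onto a curated list of canonical names; the customer_name column and canonical list below are hypothetical:

from difflib import get_close_matches

# Hypothetical curated list of canonical spellings
canonical = ['acme corporation', 'globex inc', 'initech llc']

def to_canonical(name, cutoff=0.85):
    # Return the closest canonical spelling, or the original name if nothing is close enough
    matches = get_close_matches(name.strip().lower(), canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df['customer_name'] = df['customer_name'].apply(to_canonical)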

Fixing Inconsistencies

Common issues:

  • Mixed formats: "USA", "US", "United States", "U.S.A."
  • Case inconsistencies: "new york", "New York", "NEW YORK"
  • Unit mismatches: some rows in meters, others in feet
  • Date format variations: "03/04/2026" vs "2026-04-03"

# Standardize text
df['country'] = df['country'].str.strip().str.upper()
df['country'] = df['country'].replace({'USA': 'US', 'UNITED STATES': 'US'})

# Standardize dates
df['date'] = pd.to_datetime(df['date'], format='mixed')

Handling Outliers

Outliers can be legitimate extreme values or data errors. Treatment depends on context.

Detection methods:

  • Z-score: flag values more than 3 standard deviations from the mean
  • IQR method: flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
  • Domain knowledge: values outside physically possible ranges

Treatment options:

  • Remove: if confirmed as errors
  • Cap (winsorize): replace extremes with boundary values (e.g., 1st and 99th percentile)
  • Transform: log or square root to reduce the influence of extreme values
  • Keep: if they represent genuine phenomena important to your analysis
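
A minimal sketch of IQR-based detection and capping, assuming a hypothetical numeric 'price' column:

# IQR-based outlier bounds
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag for inspection before deciding on treatment
outliers = df[(df['price'] < lower) | (df['price'] > upper)]

# Cap (winsorize) rather than drop
df['price'] = df['price'].clip(lower=lower, upper=upper)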

Step 2: Data Transformation

Normalization (Min-Max Scaling)

Scales features to a fixed range, typically [0, 1].

Formula: x_scaled = (x - x_min) / (x_max - x_min)

Use when: Algorithms are sensitive to magnitude (neural networks, KNN, SVM). You want bounded values.

Risk: Sensitive to outliers (a single extreme value compresses all other values into a narrow range).

Standardization (Z-Score Scaling)

Centers data at mean 0 with standard deviation 1.

Formula: z = (x - mean) / std

Use when: You assume normally distributed features. Algorithms that use distance metrics or gradient descent (linear regression, logistic regression, PCA, SVM).

Advantage over normalization: Less affected by outliers.
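
A minimal sketch of both scalers in scikit-learn, reusing the numeric_cols list from the imputation example; the two calls are alternatives, not a sequence:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max normalization to [0, 1]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Or: z-score standardization (mean 0, standard deviation 1)
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])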

Log Transformation

Compresses right-skewed distributions and stabilizes variance.

Formula: x_transformed = log(x + 1) (the +1 handles zeros)

Use when: Data spans several orders of magnitude (income, population, web traffic). When multiplicative relationships are expected.
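
A one-line sketch with NumPy, assuming a hypothetical right-skewed 'income' column:

import numpy as np

# log1p computes log(x + 1), so zero values are handled safely
df['income_log'] = np.log1p(df['income'])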

Power Transforms (Box-Cox, Yeo-Johnson)

Parametric families that find the optimal transformation to make data more Gaussian.

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')  # handles negative values
df[numeric_cols] = pt.fit_transform(df[numeric_cols])

Step 3: Encoding Categorical Variables

Machine learning algorithms need numeric inputs. Categorical variables must be encoded.

One-Hot Encoding

Creates binary columns for each category value.

Example: Color {red, blue, green} becomes three columns: is_red, is_blue, is_green.

Use when: Categories have no natural ordering. Fewer than 15-20 unique values.

Risk: High-cardinality features (thousands of unique values) create enormous sparse matrices.

df_encoded = pd.get_dummies(df, columns=['color', 'region'], drop_first=True)

Label Encoding

Maps categories to integers: red=0, blue=1, green=2.

Use when: Categories have a natural order (low/medium/high, grade A/B/C/D). Or when using tree-based algorithms (which handle arbitrary integer encodings well).

Risk: Implies ordering where none exists (the model might think green > blue > red).
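
A minimal sketch for an ordered category, assuming a hypothetical 'priority' column; scikit-learn's OrdinalEncoder offers the same mapping as a reusable transformer:

# Explicit mapping preserves the intended order: low < medium < high
order = {'low': 0, 'medium': 1, 'high': 2}
df['priority_encoded'] = df['priority'].map(order)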

Target Encoding

Replaces categories with the mean of the target variable for that category.

Use when: High-cardinality categories (city names, product IDs). Want to capture the relationship between category and target.

Risk: Overfitting on small groups. Must use cross-validation or smoothing.
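
A hedged sketch of smoothed target encoding, assuming hypothetical 'city' and binary 'churned' columns; in practice the encoding should be computed inside cross-validation folds to avoid leakage:

global_mean = df['churned'].mean()
stats = df.groupby('city')['churned'].agg(['mean', 'count'])

# Shrink small groups toward the global mean; m is a tunable smoothing strength
m = 50
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['city_encoded'] = df['city'].map(smoothed)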

Frequency Encoding

Replaces categories with their frequency (count or percentage) in the dataset.

Use when: The frequency of a category carries information (popular products behave differently from rare ones).
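
A minimal sketch, assuming a hypothetical high-cardinality 'product_id' column:

# Replace each category with its relative frequency in the dataset
freq = df['product_id'].value_counts(normalize=True)
df['product_id_freq'] = df['product_id'].map(freq)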

Step 4: Feature Engineering

Feature engineering creates new input variables from existing data to improve model performance.

Common Techniques

Date/time decomposition: Extract year, month, day of week, hour, is_weekend, days_since_event from datetime fields.

Mathematical combinations: Create interaction terms (price * quantity = revenue), ratios (revenue / employees = productivity), differences (current - previous = change).

Aggregations: Group-level statistics (average order value per customer, total purchases last 30 days, days since last login).

Text features: Word count, character length, presence of specific keywords, TF-IDF vectors, embedding representations.

Binning: Convert continuous variables into categories (age groups, income brackets). Useful when the relationship with the target is non-linear with clear breakpoints.
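
A few of these techniques sketched in pandas; the 'order_date', 'revenue', 'employees', and 'age' columns are hypothetical:

# Date/time decomposition
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.month
df['order_dow'] = df['order_date'].dt.dayofweek
df['is_weekend'] = df['order_dow'].isin([5, 6]).astype(int)

# Ratio feature
df['revenue_per_employee'] = df['revenue'] / df['employees']

# Binning a continuous variable into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 60, 120],
                         labels=['<25', '25-40', '40-60', '60+'])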

Example: Customer Churn Features

From raw transaction data, engineer features like:

  • recency: days since last purchase
  • frequency: number of purchases in last 90 days
  • monetary: average order value
  • trend: is purchase frequency increasing or decreasing?
  • engagement: days since last login / support ticket
  • tenure: days since account creation
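
A hedged sketch of the recency/frequency/monetary features, assuming a transactions DataFrame with hypothetical 'customer_id', 'order_date', and 'amount' columns:

# Reference point for recency: the most recent date in the data
snapshot = transactions['order_date'].max()

rfm = transactions.groupby('customer_id').agg(
    recency=('order_date', lambda d: (snapshot - d.max()).days),
    frequency=('order_date', 'count'),
    monetary=('amount', 'mean'),
)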

Step 5: Feature Selection

Not all features help. Irrelevant or redundant features add noise and increase overfitting risk.

Filter methods: Statistical tests to rank features by relevance.

  • Correlation with target (for numeric targets)
  • Chi-squared test (for categorical features vs categorical target)
  • Mutual information (captures non-linear relationships)

Wrapper methods: Train models with different feature subsets and keep the best.

  • Forward selection: add features one at a time
  • Backward elimination: start with all, remove one at a time
  • Recursive feature elimination (RFE)

Embedded methods: Feature selection built into the model.

  • L1 regularization (Lasso): automatically zeroes out unimportant features
  • Tree-based importance: Random Forest and XGBoost report feature importance scores

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

Step 6: Handling Imbalanced Data

When one class vastly outnumbers others (e.g., 2% fraud vs 98% legitimate), models tend to always predict the majority class, achieving high accuracy while being useless in practice.

Techniques:

  • Oversampling minority class: SMOTE (Synthetic Minority Oversampling Technique) creates synthetic examples
  • Undersampling majority class: Randomly remove majority examples (loses information)
  • Class weights: Tell the algorithm to penalize minority class errors more heavily
  • Threshold adjustment: Lower the classification threshold to catch more positive cases
  • Ensemble methods: Balanced Random Forest, EasyEnsemble
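
A minimal sketch of the class-weight and threshold-adjustment approaches with scikit-learn; SMOTE itself lives in the separate imbalanced-learn package, and the 0.3 threshold below is an illustrative assumption:

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' penalizes errors on the rare class more heavily
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Lower the decision threshold from the default 0.5 to catch more positive cases
probs = model.predict_proba(X_test)[:, 1]
predictions = (probs >= 0.3).astype(int)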

Preprocessing Pipeline Best Practices

1. Fit on Training Data Only

Never use test data statistics (mean, std, min, max) for scaling or imputation. Fit transformers on training data, then apply to test data. This prevents data leakage.

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)  # fits all steps on training data
predictions = pipeline.predict(X_test)  # applies learned transforms to test

2. Document Every Step

Record what transformations you applied, in what order, with what parameters. Reproducibility requires knowing the exact preprocessing pipeline.

3. Validate Impact

Compare model performance with and without each preprocessing step. Sometimes "raw" features work fine, and preprocessing adds unnecessary complexity.

4. Automate

Build preprocessing as code (pipelines), not manual spreadsheet manipulations. This ensures consistency between training and production.

Preprocessing in Analytics Platforms

Modern analytics platforms like Skopx handle much of the preprocessing automatically when users query data. When you ask a natural-language question, the platform cleans and joins data, resolves type mismatches, and normalizes formats behind the scenes. For teams building custom models, understanding these preprocessing steps helps them evaluate whether automated pipelines suit their specific data characteristics or whether manual intervention is needed.

Summary

Data preprocessing transforms raw data into a form suitable for analysis and modeling. The core steps are: clean (handle missing values, duplicates, inconsistencies, outliers), transform (normalize, standardize, apply power transforms), encode (convert categories to numbers), engineer (create informative new features), and select (keep only relevant features). Always fit preprocessing on training data only, document every step, and validate that each transformation actually improves results. Preprocessing is often the highest-leverage work in a data science project because no algorithm can compensate for poorly prepared data.
