Univariate Analysis: Methods, Examples, and When to Use It
Univariate analysis examines one variable at a time. It is the simplest form of statistical analysis and the essential first step before exploring relationships between variables. Before asking "does X affect Y?" you need to understand X on its own: its distribution, central tendency, spread, and outliers.
Why Start with Univariate Analysis
Every thorough analysis begins here because univariate analysis lets you:
- Detect data quality issues. Impossible values, unexpected distributions, and missing-data patterns become visible.
- Understand each variable independently. Know the range, typical values, and shape before combining variables.
- Identify outliers. Extreme values can distort later analysis if they go unnoticed.
- Choose appropriate methods. The distribution of a variable determines which statistical tests are valid.
Measures of Central Tendency
These describe the "typical" value:
Mean (Average)
Sum of all values divided by the number of values.
When to use: Data is roughly symmetric without extreme outliers.
When NOT to use: Skewed data or data with outliers (the mean is pulled toward extremes).
Example: Salaries in a department: $60K, $65K, $70K, $72K, $75K, $500K (CEO).
Mean ≈ $140K (misleading because one extreme value distorts it)
Median
The middle value when data is sorted. Half the values are above, half below. With an even number of values, the median is the average of the two middle values.
When to use: Skewed data, data with outliers, ordinal data.
Advantage: Not affected by extreme values.
Example: Same salaries: Median = $71K (the average of $70K and $72K, and much more representative of "typical")
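The salary example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries from the example above, in thousands of dollars.
salaries_k = [60, 65, 70, 72, 75, 500]

mean_k = mean(salaries_k)      # pulled upward by the single 500 value
median_k = median(salaries_k)  # average of the two middle values, 70 and 72

print(f"mean = {mean_k:.1f}K, median = {median_k:.1f}K")
# prints: mean = 140.3K, median = 71.0K
```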
Mode
The most frequently occurring value.
When to use: Categorical data, finding the most common category.
Example: Shirt sizes sold: S(20), M(45), L(38), XL(15). Mode = M (most popular).
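For categorical counts like these, the mode is simply the category with the highest count. A small sketch, assuming the counts are already aggregated as in the example:

```python
from collections import Counter

# Units sold per shirt size, from the example above.
sales = Counter({"S": 20, "M": 45, "L": 38, "XL": 15})

# most_common(1) returns a list with the single (category, count) pair
# that has the highest count.
mode_size, mode_count = sales.most_common(1)[0]

print(mode_size, mode_count)  # prints: M 45
```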
Measures of Spread (Dispersion)
These describe how spread out the data is:
Range
Maximum minus minimum value.
Formula: Range = Max - Min
Limitation: Extremely sensitive to outliers (one extreme value makes the range huge).
Interquartile Range (IQR)
The range of the middle 50% of data (Q3 - Q1).
Formula: IQR = 75th percentile - 25th percentile
Advantage: Largely unaffected by outliers.
Use: Often used to define outliers (values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR).
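The 1.5 x IQR rule can be sketched as follows, reusing the salary data from the mean example. The helper name `iqr_outliers` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different cut points.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return (q1, q3, outliers) using the 1.5 x IQR rule."""
    # "inclusive" treats the data as the full population; this choice
    # is an assumption, not the only valid quartile definition.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return q1, q3, outliers

salaries_k = [60, 65, 70, 72, 75, 500]
q1, q3, outliers = iqr_outliers(salaries_k)

print(q1, q3, outliers)  # prints: 66.25 74.25 [500]
```

Here the IQR is 8, so anything below 54.25 or above 86.25 is flagged, which isolates the $500K salary.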
Standard Deviation
A measure of the typical distance of values from the mean. Formally, it is the square root of the average squared deviation from the mean.
Interpretation:
- Small SD: Data points cluster tightly around the mean
- Large SD: Data points are spread widely
Rule of thumb (normal distribution only):
- About 68% of values fall within 1 SD of the mean
- About 95% within 2 SD
- About 99.7% within 3 SD
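These percentages follow directly from the normal distribution's cumulative distribution function. A short sketch using the standard library's `NormalDist`; the helper `share_within` is a hypothetical name:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k: float) -> float:
    """Probability that a normal value lies within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.2%}")
```

The rule applies only to data that is approximately normal; for skewed data the shares can differ substantially.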
Variance
Standard deviation squared. Used in formulas but harder to interpret directly (units are squared).
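The relationship between the two can be seen in a brief sketch on the salary data from the mean example. Note that the population versions divide by n, while the sample versions divide by n - 1, so it is worth checking which one a given tool reports.

```python
import math
from statistics import pstdev, pvariance, stdev

salaries_k = [60, 65, 70, 72, 75, 500]

pop_sd = pstdev(salaries_k)      # population SD: divides by n
pop_var = pvariance(salaries_k)  # population variance, in squared units
sample_sd = stdev(salaries_k)    # sample SD: divides by n - 1

# Variance is the SD squared (up to floating-point rounding).
print(math.isclose(pop_sd ** 2, pop_var))  # prints: True
print(sample_sd > pop_sd)                  # prints: True
```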
Coefficient of Variation (CV)
Standard deviation divided by the mean, expressed as a percentage. Allows comparison of spread across variables with different scales. It is only meaningful for variables with a true zero and a mean well away from zero.
Example: Comparing variability of revenue ($1M average, $200K SD) vs. orders (500 average, 100 SD).
- Revenue CV = 20%
- Orders CV = 20%
- Same relative variability despite different scales.
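The comparison above works out as follows. The function name `coefficient_of_variation` is illustrative:

```python
def coefficient_of_variation(sd: float, mean: float) -> float:
    """Return the CV as a percentage; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return sd / mean * 100

revenue_cv = coefficient_of_variation(sd=200_000, mean=1_000_000)
orders_cv = coefficient_of_variation(sd=100, mean=500)

print(f"revenue CV = {revenue_cv:.0f}%, orders CV = {orders_cv:.0f}%")
# prints: revenue CV = 20%, orders CV = 20%
```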
Distribution Shape
Skewness
Measures asymmetry of the distribution:
- Right-skewed (positive): Tail extends right. Mean is typically greater than the median. Examples: income, house prices, website session duration.
- Left-skewed (negative): Tail extends left. Mean is typically less than the median. Examples: age at retirement, exam scores (with ceiling effect).
- Symmetric: Mean ≈ Median. Examples: height, IQ scores.
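One common way to put a number on asymmetry is the moment-based skewness coefficient: the third central moment divided by the variance raised to the power 1.5. The sketch below uses that definition, which is one of several in use; statistical packages often apply small-sample corrections, so their results can differ slightly.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness: > 0 is right-skewed, < 0 is left-skewed."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Salary data from the mean example: one large value creates a right tail.
print(skewness([60, 65, 70, 72, 75, 500]) > 0)  # prints: True
# Perfectly symmetric data has zero skewness.
print(skewness([1, 2, 3, 4, 5]))                # prints: 0.0
```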
Kurtosis
Measures the "tailedness" of the distribution:
- High kurtosis: Heavy tails, more extreme values than a normal distribution
- Low kurtosis: Light tails, fewer extreme values
- Normal kurtosis: Tails similar to a bell curve (kurtosis of 3, or excess kurtosis of 0)
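Kurtosis is usually reported as excess kurtosis, meaning the fourth standardized moment minus 3, so that a normal distribution scores 0. A minimal sketch under that definition, without the small-sample corrections some packages apply:

```python
from statistics import fmean

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values have light tails, so the result is negative.
print(round(excess_kurtosis([1, 2, 3, 4, 5]), 2))  # prints: -1.3
# One extreme value produces a heavy tail, so the result is positive.
print(excess_kurtosis([60, 65, 70, 72, 75, 500]) > 0)  # prints: True
```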
Visualization Methods for Univariate Analysis
| Method | Best For | Shows |
|---|---|---|
| Histogram | Continuous data distribution shape | Frequency distribution, skewness, modes |
| Box plot | Summary statistics and outliers | Median, IQR, range, outliers |
| Bar chart | Categorical frequency | Count or proportion per category |
| Density plot | Smooth distribution estimate | Shape without bin sensitivity |
| Dot plot | Small datasets | Individual values |
| QQ plot | Checking normality | How closely data follows normal distribution |
Reading a Histogram
A histogram divides continuous data into bins and counts values in each bin:
- Bell-shaped: Data is approximately normal
- Right-skewed: Long tail on right (most values on left)
- Bimodal: Two peaks (possibly two subgroups in the data)
- Uniform: All bins roughly equal height (no preferred value)
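The binning step itself is simple to sketch. The function below splits the data range into equal-width bins and counts the values in each; the name `histogram_counts` and the sample values are hypothetical, chosen only to illustrate the mechanics.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to bin")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample values, for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts = histogram_counts(sample, bins=4)

for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
# counts is [1, 2, 3, 3]
```

Bin count matters: too few bins hide features such as a second peak, and too many make the shape noisy.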
Reading a Box Plot
| Component | Meaning |
|---|---|
| Box bottom (Q1) | 25th percentile |
| Line in box | Median (50th percentile) |
| Box top (Q3) | 75th percentile |
| Whiskers | Extend to the most extreme values within 1.5 x IQR of the box edges |
| Dots beyond whiskers | Outliers |
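The components in the table can be computed directly. This sketch reuses the salary data from the mean example and assumes the common convention in which whiskers stop at the most extreme data points inside the 1.5 x IQR fences; the helper name `box_plot_stats` and the inclusive quartile method are assumptions, and plotting libraries may use slightly different quartile rules.

```python
from statistics import quantiles

def box_plot_stats(values):
    """Compute the values a box plot draws, using 1.5 x IQR fences."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at actual data points, not at the fences themselves.
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < lo_fence or x > hi_fence],
    }

stats = box_plot_stats([60, 65, 70, 72, 75, 500])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
# prints: 60 75 [500]
```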
Univariate Analysis in Practice
Example: Analyzing Response Times
Data: 10,000 API response times from the past week.
Step 1: Summary statistics
- Mean: 245ms
- Median: 180ms
- SD: 320ms
- Min: 15ms, Max: 8,500ms
Step 2: Interpretation
- Mean > Median indicates right skew (confirmed by histogram)
- Large SD relative to mean suggests high variability
- Max of 8.5s is an extreme outlier worth investigating
Step 3: Distribution analysis
- 90% of requests complete under 400ms (acceptable)
- 5% take 400-1000ms (slow)
- 5% take over 1 second (problematic)
- The long tail distorts the mean; median is more representative of typical experience
Step 4: Action
- Report P50 (180ms) and P95 (950ms) rather than mean
- Investigate the 5% of requests that take over 1 second (likely a specific endpoint or condition)
- Set SLO at P99 < 2000ms
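Percentiles such as P50, P95, and P99 can be computed as sketched below. The data here is synthetic: log-normal latencies generated with assumed parameters to give a right-skewed shape. It is not the dataset described above, so the printed numbers will not match the figures in this example.

```python
import math
import random
from statistics import fmean, quantiles

random.seed(42)  # fixed seed so the sketch is repeatable

# SYNTHETIC data: 10,000 log-normal latencies in ms. The parameters are
# assumptions chosen for illustration, not measurements.
latencies_ms = [random.lognormvariate(math.log(180), 0.8) for _ in range(10_000)]

cuts = quantiles(latencies_ms, n=100)  # 99 cut points; cuts[k - 1] is Pk
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean_ms = fmean(latencies_ms)

print(f"mean={mean_ms:.0f}ms  P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```

On right-skewed data like this, the mean sits above P50, which is why percentile reporting gives a clearer picture of typical and worst-case experience.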
Example: Analyzing Deal Sizes
Data: 500 closed deals from the past year.
Step 1: Summary statistics
- Mean: $42K
- Median: $28K
- SD: $38K
- Min: $2K, Max: $350K
Step 2: Distribution shape
- Strongly right-skewed (a few large enterprise deals pull the mean up)
- Bimodal: peaks around $15K (SMB) and $60K (enterprise)
Step 3: Insight
The bimodal distribution suggests two distinct segments purchasing differently. Analyzing them separately would yield better insights than treating all deals as one population.
Univariate Analysis for Different Data Types
Continuous Data (numbers with any value)
- Summary statistics: mean, median, SD, IQR
- Visualization: histogram, box plot, density plot
- Shape: skewness, kurtosis, modality
Discrete Data (countable numbers)
- Summary statistics: mean, median, mode
- Visualization: bar chart, frequency table
- Special consideration: zero-inflation (many zeros)
Categorical Data (groups/labels)
- Summary: frequency counts, proportions, mode
- Visualization: bar chart, pie chart (2-5 categories only)
- Analysis: chi-square goodness-of-fit test
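As an illustration of the goodness-of-fit test, the sketch below computes the chi-square statistic for the shirt-size counts from the mode example. The null hypothesis that all four sizes are equally popular is an assumption made for illustration. The statistic is compared against 7.815, the standard critical value for 3 degrees of freedom at the 5% level.

```python
# Observed counts from the shirt-size example.
observed = {"S": 20, "M": 45, "L": 38, "XL": 15}

# Assumed null hypothesis: every size is equally popular.
total = sum(observed.values())
expected = total / len(observed)  # 118 / 4 = 29.5 per size

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
reject_uniform = chi_square > CRITICAL_5PCT_DF3

print(f"chi-square = {chi_square:.2f}, reject uniform: {reject_uniform}")
# prints: chi-square = 20.78, reject uniform: True
```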
Ordinal Data (ordered categories)
- Summary: median, mode, percentiles (not mean)
- Visualization: bar chart (ordered), cumulative frequency
- Special consideration: do not treat as continuous (intervals may be unequal)
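A median for ordinal data can be found by ranking the categories and taking the middle response, without assuming equal spacing between levels. In this sketch the level names and survey responses are hypothetical.

```python
from statistics import median_low

# Ordered levels; only the order is meaningful, not the spacing.
LEVELS = ["poor", "fair", "good", "excellent"]
rank = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical survey responses, for illustration only.
responses = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# median_low always returns an actual data point, so the result maps
# back to a real category instead of a value between two levels.
ranks = [rank[r] for r in responses]
median_level = LEVELS[median_low(ranks)]

print(median_level)  # prints: good
```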
Tools for Univariate Analysis
SQL:
    -- PostgreSQL-style syntax; PERCENTILE_CONT and STDDEV support varies by database.
    SELECT
        COUNT(*) AS n,
        AVG(amount) AS mean,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
        STDDEV(amount) AS std_dev,  -- sample standard deviation in PostgreSQL
        MIN(amount) AS min_val,
        MAX(amount) AS max_val,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS q3
    FROM orders;
AI-powered tools: Platforms like Skopx let you ask "describe the distribution of order amounts" or "show me a histogram of response times" in natural language and get the analysis instantly.
Summary
Univariate analysis is the foundation of all statistical work. Before exploring relationships between variables, understand each variable independently: its center, spread, shape, and outliers. This step catches data quality issues, informs method selection, and often reveals insights on its own. Never skip it.
Saad Selim
The Skopx engineering and product team