what is exploratory data analysis (EDA)? a beginner's guide

exploratory data analysis is what you do with a dataset before you know what the story is.

it is not statistical modeling. it is not machine learning. it is the practice of loading a dataset and asking: what is this, what does it look like, what is weird about it, and what might be interesting?

the skill matters because most data analysis failures happen at this step, not in the modeling or visualization. they happen because someone ran an analysis without first checking whether the data was clean, complete, and representative.

what EDA answers

before running any formal analysis, EDA answers five questions:

  1. how big is the dataset? (rows, columns)
  2. how much data is missing? (null values in key columns)
  3. what does each column look like? (data types, min/max, typical values)
  4. are there outliers? (values that will skew averages and mislead analysis)
  5. what relationships exist between columns? (are any columns correlated? do any move together?)

these five questions take 30-60 minutes on any dataset. skipping them and going straight to analysis produces conclusions that do not hold up.

step-by-step EDA in Google Sheets

step 1: check the dimensions

open your CSV in Google Sheets. note the row count (scroll to the last row, or run =COUNTA on a key column) and the column count (count the headers in row 1).

do you have the expected number of rows? if you expected 12 months of data and have 13 data rows (not counting the header), there is a duplicate or an error.

step 2: check for blank cells

go to a blank column. add a COUNTA formula for each data column:
=COUNTA(A:A) — counts non-blank cells in column A

compare to the total row count. if column A has 500 rows but COUNTA returns 420, you have 80 blank values. for a customer ID or date column, 80 blanks is a serious data quality issue.

step 3: check distributions for numeric columns

select a numeric column by clicking its header. the summary box in the bottom-right corner shows SUM by default; click it to switch between AVERAGE, MIN, MAX, and COUNT.

more informative: Sheets has no single describe-style summary function, but you can get the same picture with five formulas:

=MIN(E:E)
=MAX(E:E)
=AVERAGE(E:E)
=MEDIAN(E:E)
=STDEV(E:E)

compare MIN, MAX, and MEDIAN. if the AVERAGE is much higher than the MEDIAN, outliers are pulling the average up. if MIN or MAX is extreme (negative revenue, or zero quantity in an orders table), investigate those records.
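
the same mean-vs-median comparison is easy to script. a minimal pandas sketch with a toy column (the values and the 2x threshold are illustrative, not a rule from the article):

```python
import pandas as pd

# hypothetical revenue column: mostly small values plus one extreme outlier
revenue = pd.Series([100, 120, 95, 110, 105, 9000])

mean, median = revenue.mean(), revenue.median()
print(f"mean={mean:.0f}, median={median:.0f}")

# a mean far above the median signals right skew / outliers pulling it up
if mean > 2 * median:
    print("mean is more than 2x the median -- check for outliers")
```

here the single 9000-unit record drags the mean to roughly 15x the median, exactly the pattern the comparison is meant to catch.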

step 4: check categorical column values

for text columns like “region” or “product category”: build a COUNTIF distribution.
– list unique values with =UNIQUE(B2:B) (skipping the header) in a helper column, say column C
– count each value's frequency with =COUNTIF(B:B, C2), copied down the helper column

look for: unexpected categories (“N/A”, “Unknown”, or typos), disproportionate distributions (one category with 90% of records), and categories that should not exist.

step 5: check for obvious relationships

create a pivot table with two categorical variables and see if the distribution looks right. create a scatter plot of two numeric variables and eyeball whether they move together.

in Sheets: Insert → Chart → Scatter chart → set X and Y axis to two numeric columns. if the points trend upward or downward, there may be a correlation worth investigating.

EDA in Python pandas (five steps)

if your dataset is larger or you want faster EDA:

import pandas as pd

df = pd.read_csv('your_file.csv')

# step 1: dimensions
print(df.shape)

# step 2: missing values
print(df.isnull().sum())

# step 3: distributions for numeric columns
print(df.describe())

# step 4: categorical columns
for col in df.select_dtypes(include='object').columns:
    print(col, df[col].value_counts().head(10))

# step 5: correlations (numeric columns only; recent pandas versions
# raise an error if text columns are included)
print(df.corr(numeric_only=True))

df.describe() returns count, mean, std, min, 25th percentile, median, 75th percentile, and max for every numeric column in one output. this is the single most efficient EDA step for numeric data.

for the full pandas tutorial: Python pandas for non-programmers.

common EDA findings and what to do with them

finding: a key column has 30% null values

action: investigate why. are nulls random (data entry gaps) or systematic (all from one time period or region)? if systematic, the dataset is biased. decide whether to filter out nulls, fill them with a proxy value, or flag them in the analysis as a limitation.
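
one way to test random vs systematic is to compute the null rate per group. a minimal sketch with a fabricated table (column names `region` and `customer_id` are illustrative):

```python
import pandas as pd
import numpy as np

# toy orders table: customer_id is missing only in the "west" region
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west", "north"],
    "customer_id": [101, 102, np.nan, np.nan, np.nan, 103],
})

# null rate per region: a roughly flat rate suggests random gaps,
# a spike in one group suggests a systematic collection problem
null_rate = df["customer_id"].isnull().groupby(df["region"]).mean()
print(null_rate)
```

in this toy case the west region has a 100% null rate while the others have none, which is the systematic pattern that biases an analysis.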

finding: the average is 3x higher than the median

action: you have outliers. filter for values above the 99th percentile and look at those records. outliers are either data errors (a transaction recorded with 10,000 units instead of 10) or real extreme values (one whale customer). decide whether to include, exclude, or report separately.
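
the 99th-percentile filter is one line in pandas. a sketch with made-up values (the column is hypothetical):

```python
import pandas as pd

# 100 ordinary transactions of 10 units plus one extreme record
units = pd.Series([10] * 100 + [10_000])

cutoff = units.quantile(0.99)     # 99th-percentile threshold
outliers = units[units > cutoff]  # the records to inspect by hand
print(f"cutoff={cutoff}, outliers={list(outliers)}")
```

inspecting the flagged rows, rather than silently dropping them, is what tells you whether each one is a data error or a real whale.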

finding: one category has 85% of records

action: your categorical variable may not be useful for analysis if it does not discriminate meaningfully. or: this is an important finding (85% of orders come from one region). investigate whether this is expected or surprising.
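
the share of each category is a one-liner with value_counts. a sketch with fabricated region data (the 80% alert threshold is an assumption, not a fixed rule):

```python
import pandas as pd

# toy categorical column dominated by one value
region = pd.Series(["north"] * 85 + ["south"] * 10 + ["east"] * 5)

# normalize=True returns proportions instead of raw counts,
# sorted largest first
shares = region.value_counts(normalize=True)
print(shares)

top_share = shares.iloc[0]
if top_share > 0.8:
    print(f"'{shares.index[0]}' holds {top_share:.0%} of records")
```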

finding: two columns are highly correlated (>0.8)

action: do not use both columns in the same statistical model — they are measuring the same thing. understand the relationship: is one column causing the other, or are both driven by a third factor?
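
a quick way to surface such pairs is to scan the absolute correlation matrix for values above the threshold. a sketch with synthetic data (column names are made up for illustration):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "revenue": base,
    "units": base * 3 + rng.normal(scale=0.1, size=200),  # near-duplicate of revenue
    "tenure": rng.normal(size=200),                        # unrelated column
})

corr = df.corr(numeric_only=True).abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.8])  # the pairs worth investigating
```

here revenue and units are nearly the same signal, so only that pair clears the 0.8 cutoff.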

EDA is the difference between valid and misleading analysis

most impressive-looking data analysis fails not because the model is wrong but because the data was not understood before the model ran.

a regression model that explains 80% of revenue variance sounds compelling until you discover that 40% of the data was duplicated, one outlier transaction accounts for 15% of total revenue, and the date column was stored as text so the time-series analysis covered the wrong periods.

EDA catches all three of these in 30 minutes.
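
all three checks take a few lines of pandas. a toy sketch (the table is fabricated to show each failure mode; column names are illustrative):

```python
import pandas as pd

# toy table with a duplicated row, an outlier, and dates stored as text
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-02-01", "2024-02-01"],
    "revenue": [100.0, 50.0, 50.0],
})

# 1. duplicated rows
print("duplicate rows:", df.duplicated().sum())

# 2. share of total revenue from the single largest transaction
print("top transaction share:", df["revenue"].max() / df["revenue"].sum())

# 3. is the date column actually a datetime?
print("date dtype:", df["date"].dtype)   # 'object' means it is still text
df["date"] = pd.to_datetime(df["date"])  # convert before time-series work
```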

for the datasets to practice EDA on: best free datasets for research 2026.

for applying EDA findings in Excel: how to analyze data in Excel without being a data scientist.