Python pandas tutorial for non-programmers (real examples)

pandas is a Python library that lets you do spreadsheet-style data analysis in code.

that matters because code is repeatable. once you have written an analysis, you can rerun it on new data with a single command. you can analyze 1 million rows as easily as 1,000. you can combine tables, reshape data, and create reproducible reports — all without the manual steps that break in Excel or Google Sheets.

this tutorial uses Google Colab — a free browser-based Python environment. no installation, no setup. just open the link and start writing code.

what pandas is and why non-programmers use it

pandas is the most-used Python library for data analysis. it provides two main data structures:

  • DataFrame: a table with rows and columns — like a spreadsheet inside Python
  • Series: a single column of data

everything you do in a spreadsheet — filter rows, sum a column, group data by category, merge two tables — pandas does in code. the advantage is speed, scale, and reproducibility.

non-programmers use pandas for:
– analyzing sales data too large for Google Sheets
– automating monthly reports that currently take hours to update manually
– combining data from multiple sources (a CSV from your CRM + a CSV from your payment processor)
– cleaning messy data consistently across many files

how to run pandas without installing anything (Google Colab)

go to colab.research.google.com. sign in with your Google account. click “New Notebook.”

you will see a code cell — a gray box where you type Python code. pandas is pre-installed in Colab. you do not need to install anything.

to check that pandas is available:

import pandas as pd
print(pd.__version__)

click the play button (or press Shift+Enter) to run the cell. Colab prints the pandas version number and you are ready to go.

loading your first CSV file into pandas

option 1: upload a file from your computer

from google.colab import files
uploaded = files.upload()

run this cell. Colab shows a file picker. upload your CSV. then load it:

import pandas as pd
df = pd.read_csv('your_file.csv')

option 2: load a file from a URL

import pandas as pd
url = 'https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv'
df = pd.read_csv(url)

replace the URL with any direct link to a CSV file (Kaggle datasets, Data.gov, GitHub raw file URLs).

the first things to check after loading

df.shape        # (rows, columns) — how big is the dataset
df.head()       # first 5 rows — what does the data look like
df.dtypes       # data types of each column
df.isnull().sum()  # blank cells in each column

run these four lines on every new dataset before doing anything else. together they give you a quick snapshot of what you are working with.

the pandas operations that replace 80% of Excel work

filtering rows (like WHERE in SQL, like a filter in Excel)

# keep only rows where region is 'North America'
df_na = df[df['region'] == 'North America']

# keep rows where revenue is above 1000
df_high = df[df['revenue'] > 1000]

# combine conditions with & (AND) and | (OR)
df_filtered = df[(df['region'] == 'North America') & (df['revenue'] > 1000)]

selecting columns

# single column
df['product']

# multiple columns
df[['product', 'revenue', 'date']]

summarizing with groupby (like GROUP BY in SQL, like a pivot table in Excel)

# total revenue by product
df.groupby('product')['revenue'].sum()

# multiple aggregations at once
df.groupby('product').agg({
    'revenue': 'sum',
    'orders': 'count',
    'order_value': 'mean'
})

sorting

# sort by revenue, highest to lowest
df.sort_values('revenue', ascending=False)

# sort by multiple columns
df.sort_values(['product', 'date'])

adding new columns

# calculate profit margin
df['margin'] = (df['revenue'] - df['cost']) / df['revenue']

# flag high-value orders
df['tier'] = df['revenue'].apply(lambda x: 'High' if x > 1000 else 'Standard')

quick statistics

df.describe()           # count, mean, std, min, quartiles, max for all numeric columns
df['revenue'].mean()    # average revenue
df['revenue'].median()  # median revenue
df['product'].value_counts()  # count of each unique product

a real example: analyzing a sales spreadsheet with pandas

imagine a CSV with these columns: Date, Product, Region, Salesperson, Units, Revenue.

step 1: load and inspect

import pandas as pd
df = pd.read_csv('sales_2026.csv')
print(df.shape)
df.head()

step 2: check for missing data and data types

df.isnull().sum()
df.dtypes

fix the date column if it loaded as text:

df['Date'] = pd.to_datetime(df['Date'])

step 3: revenue by product (the pivot table equivalent)

by_product = df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)
print(by_product)

step 4: monthly revenue trend

df['Month'] = df['Date'].dt.to_period('M')
monthly = df.groupby('Month')['Revenue'].sum()
print(monthly)

step 5: who is the top salesperson by region?

top_by_region = df.groupby(['Region', 'Salesperson'])['Revenue'].sum()
top_by_region = top_by_region.reset_index()
top_by_region = top_by_region.sort_values(['Region', 'Revenue'], ascending=[True, False])
print(top_by_region.groupby('Region').first())

step 6: plot the monthly trend

import matplotlib.pyplot as plt
monthly.plot(kind='bar', figsize=(12, 5), title='Monthly Revenue 2026')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.tight_layout()
plt.show()

Colab displays the chart inline below the code cell.

the pandas operations that clean data

data cleaning in pandas is often faster than in Google Sheets for large or recurring datasets.

# remove duplicate rows
df = df.drop_duplicates()

# fill blank cells with a value
df['region'] = df['region'].fillna('Unknown')

# remove rows where a key column is blank
df = df.dropna(subset=['customer_id'])

# standardize text: strip spaces and convert to lowercase
df['product'] = df['product'].str.strip().str.lower()

# fix a column that should be numeric but loaded as text
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

what to learn next

visualization: matplotlib (used in this tutorial) handles basic charts. seaborn produces more polished statistical charts. plotly creates interactive charts.

import seaborn as sns
sns.barplot(data=df, x='Region', y='Revenue')

merging tables: when you need to combine two DataFrames (like JOIN in SQL):

merged = df_orders.merge(df_customers, on='customer_id', how='left')

reading from Google Sheets: pandas can read a Google Sheet as a DataFrame using its CSV export URL.

next resources:
Python for business data analysis — beginner guide
best free datasets for practice — Kaggle has hundreds of real datasets to practice on