postgresql for analysts: practical 2026 tutorial (no DBA needed)
most SQL tutorials for beginners stop at SELECT, JOIN, and GROUP BY. that is enough to get a job interview but not enough to do real analysis. once you start asking real business questions, you immediately need window functions, common table expressions, date functions, and the JSON-handling features that PostgreSQL has and other databases either lack or implement badly.
PostgreSQL is the analyst’s database in 2026. it is free, open-source, and powers a huge share of modern data infrastructure. its analytical features are best-in-class among general-purpose relational databases. and you do not need to be a DBA to use it well; you just need the small slice of features that covers most real analysis work.
this tutorial is for analysts and solopreneurs who already know basic SQL and want to graduate to the PostgreSQL features that show up in real analysis. you will learn window functions, CTEs, date manipulation, JSON handling, and the small operational concepts that separate analysts who make Postgres do real work from analysts who treat it like a fancy spreadsheet. by the end you will be able to write the kinds of queries that previously required exporting to Excel or Python. it pairs naturally with Supabase for data analysis if you want the hosted-Postgres workflow.
what makes PostgreSQL the analyst’s choice
PostgreSQL has decades of analytical features baked in: window functions, recursive queries, full-text search, JSON support, geographic data types, and a planner sophisticated enough to handle complex analytical queries on millions of rows.
PostgreSQL is the open-source SQL database with the deepest analytical features built in. for analysts in 2026, the load-bearing features are window functions for running totals and ranks, common table expressions (CTEs) for readable multi-step queries, date functions for time-based analysis, and JSON operators for semi-structured data. learning these four features turns Postgres from “a database” into “a portable analysis environment” that scales beyond spreadsheet limits without learning Python or a new tool.
it runs on every cloud, every operating system, and is the underlying engine for many analyst-friendly products like Supabase, Neon, AWS RDS Postgres, and Aiven.
prerequisites
- comfort with SELECT, WHERE, GROUP BY, JOIN
- access to a PostgreSQL database (local install, Supabase, Docker, or RDS)
- a SQL client (DBeaver free, TablePlus, pgAdmin, or the Supabase SQL editor)
if you do not have a database, the easiest free path: sign up for Supabase (covered in Supabase for data analysis) and use its built-in SQL editor.
step 1: window functions (the biggest level-up)
window functions calculate values across a set of rows related to the current row, without collapsing the result like GROUP BY does.
running total
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM sales
ORDER BY order_date;
result: each row keeps its original detail and gets a running total alongside.
rank within a group
SELECT
customer_id,
order_id,
amount,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM sales;
result: each customer’s orders ranked by amount.
| function | use |
|---|---|
| ROW_NUMBER() | unique sequential number |
| RANK() | rank with gaps for ties |
| DENSE_RANK() | rank without gaps |
| LAG(col, n) | value from N rows before |
| LEAD(col, n) | value from N rows after |
| FIRST_VALUE(col) | first value in window |
| LAST_VALUE(col) | last value in window |
| NTILE(n) | divide into N buckets |
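period-over-period change
LAG and LEAD from the table above are what you reach for in period-over-period comparisons. a minimal sketch of a day-over-day change, assuming a daily_sales table with one row per day (the same shape as the moving-average example below):
SELECT
order_date,
amount,
LAG(amount, 1) OVER (ORDER BY order_date) AS previous_day,
amount - LAG(amount, 1) OVER (ORDER BY order_date) AS change_vs_previous_day
FROM daily_sales
ORDER BY order_date;
result: each day sits next to the previous day’s value, so the change is a simple subtraction.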
moving average
SELECT
order_date,
amount,
AVG(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_sales;
frame clauses (ROWS BETWEEN) let you control exactly which rows are included.
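besides ROWS, recent PostgreSQL versions (11 and later) also accept RANGE frames with an interval offset, which define the window by value rather than by row count. a sketch of a trailing 30-day average, assuming daily_sales.order_date is a date or timestamp:
SELECT
order_date,
amount,
AVG(amount) OVER (
ORDER BY order_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS trailing_30day_avg
FROM daily_sales;
unlike the ROWS version, this stays correct even when some days are missing from the table.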
window functions replace dozens of subqueries and CTEs you would otherwise write. learning them is the single biggest jump from beginner to intermediate SQL.
step 2: common table expressions (CTEs)
CTEs let you write multi-step queries that read top-down instead of inside-out.
bad (deeply nested):
SELECT customer_id, COUNT(*)
FROM (
SELECT * FROM (
SELECT * FROM orders WHERE order_date >= '2026-01-01'
) recent
WHERE recent.amount > 100
) big_recent
GROUP BY customer_id;
good (CTE):
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date >= '2026-01-01'
),
big_orders AS (
SELECT * FROM recent_orders WHERE amount > 100
)
SELECT customer_id, COUNT(*) AS big_order_count
FROM big_orders
GROUP BY customer_id;
CTEs are easier to read, easier to debug (run each step independently), and Postgres optimizes them well in modern versions.
multiple CTEs and reuse
you can define multiple CTEs and reference them in the final SELECT:
WITH
monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM sales
GROUP BY 1
),
monthly_orders AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
COUNT(*) AS orders
FROM sales
GROUP BY 1
)
SELECT
r.month,
r.revenue,
o.orders,
r.revenue / NULLIF(o.orders, 0) AS avg_order_value
FROM monthly_revenue r
JOIN monthly_orders o ON r.month = o.month
ORDER BY r.month;
readable, debuggable, and easy to extend.
recursive CTEs
for hierarchical data (org charts, category trees) Postgres supports recursive CTEs. less common in solopreneur work, but worth knowing they exist.
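a minimal sketch of one, assuming a categories table with id, parent_id, and name columns (hypothetical names), walking the tree from the root categories down:
WITH RECURSIVE category_tree AS (
SELECT id, parent_id, name, 1 AS depth
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.parent_id, c.name, t.depth + 1
FROM categories c
JOIN category_tree t ON c.parent_id = t.id
)
SELECT * FROM category_tree
ORDER BY depth, name;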
step 3: date and time functions
dates are at the heart of most analysis. PostgreSQL’s date functions are excellent.
truncating to month or day
DATE_TRUNC('month', order_date)
DATE_TRUNC('day', order_date)
DATE_TRUNC('week', order_date)
date arithmetic
order_date + INTERVAL '7 days'
order_date - INTERVAL '1 month'
NOW() - INTERVAL '24 hours'
extracting parts
EXTRACT(YEAR FROM order_date)
EXTRACT(MONTH FROM order_date)
EXTRACT(DOW FROM order_date) -- day of week, 0 (Sunday) to 6 (Saturday)
date difference
order_date - signup_date -- interval for timestamps, whole days (integer) for dates
EXTRACT(EPOCH FROM (a - b))/86400 -- gives days as a number
age()
AGE(NOW(), signup_date) -- returns interval like "2 years 3 months 5 days"
example: cohort retention by signup month:
SELECT
DATE_TRUNC('month', u.signup_date) AS cohort_month,
DATE_TRUNC('month', a.activity_date) AS activity_month,
COUNT(DISTINCT u.user_id) AS active_users
FROM users u
JOIN activity a ON u.user_id = a.user_id
GROUP BY 1, 2
ORDER BY 1, 2;
[SCREENSHOT: query result showing cohort retention by signup month and activity month]
step 4: JSON and JSONB
modern data often arrives as JSON. PostgreSQL handles it natively.
accessing fields
SELECT data->'user'->>'name' AS user_name
FROM events;
-> returns JSON, ->> returns text.
filtering by JSON content
SELECT * FROM events
WHERE data->>'event_type' = 'signup';
casting
SELECT (data->>'amount')::numeric AS amount
FROM events;
indexing JSON
for performance on large datasets:
CREATE INDEX idx_events_data_event_type
ON events ((data->>'event_type'));
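the expression index above only helps queries on that one key. if you filter on many different keys, a GIN index over the whole column supports containment queries with the @> operator (a sketch, assuming data is a jsonb column):
CREATE INDEX idx_events_data_gin
ON events USING GIN (data);
-- containment query that can use the GIN index
SELECT * FROM events
WHERE data @> '{"event_type": "signup"}';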
JSON support means you can store webhook payloads, API responses, or product event streams directly in Postgres without first flattening to columns.
step 5: aggregation tricks
FILTER clause (cleaner than CASE WHEN)
bad:
SELECT
COUNT(*) AS total,
COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed
FROM orders;
good:
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'completed') AS completed,
COUNT(*) FILTER (WHERE status = 'pending') AS pending,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled
FROM orders;
readable, concise, and just as fast.
percentiles
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
FROM sales;
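percentile aggregates work with GROUP BY like any other aggregate, so a monthly median is one query (a sketch using the same sales table):
SELECT
DATE_TRUNC('month', order_date) AS month,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_order_value
FROM sales
GROUP BY 1
ORDER BY 1;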
array aggregation
SELECT
customer_id,
ARRAY_AGG(product_name) AS purchased_products,
STRING_AGG(product_name, ', ') AS products_csv
FROM sales
GROUP BY customer_id;
useful for collapsing many rows into a single output for reporting.
step 6: indexes for analysis
without indexes, queries on big tables slow to a crawl.
B-tree (default, most common)
CREATE INDEX idx_sales_order_date ON sales(order_date);
good for filtering and sorting.
multi-column indexes
CREATE INDEX idx_sales_customer_date ON sales(customer_id, order_date);
useful when filtering on both columns.
partial indexes
CREATE INDEX idx_sales_high_value
ON sales(amount)
WHERE amount > 1000;
smaller, faster index that only covers the rows you actually filter to.
checking index usage
EXPLAIN ANALYZE
SELECT * FROM sales WHERE order_date >= '2026-01-01';
EXPLAIN ANALYZE shows whether the planner used your index. learning to read EXPLAIN output is what turns “queries are slow” into “I know exactly why”.
step 7: materialized views for slow queries
views are recomputed every time they are queried. materialized views are computed once and stored, like a cached table.
CREATE MATERIALIZED VIEW mv_monthly_summary AS
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(*) AS orders,
COUNT(DISTINCT customer_id) AS customers
FROM sales
GROUP BY 1;
refresh on demand or on a schedule:
REFRESH MATERIALIZED VIEW mv_monthly_summary;
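a plain REFRESH locks the view against reads while it rebuilds. if dashboards query the view constantly, refreshing with CONCURRENTLY avoids the lock, at the cost of requiring a unique index on the view (a sketch):
CREATE UNIQUE INDEX idx_mv_monthly_summary_month
ON mv_monthly_summary (month);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_summary;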
great for expensive aggregations that you query repeatedly.
step 8: connecting from analysis tools
PostgreSQL connects to virtually every analysis tool out there.
| tool | connection notes |
|---|---|
| Looker Studio | use the PostgreSQL connector |
| Tableau | native Postgres connector |
| Power BI | native Postgres connector |
| Excel via Power Query | use Get Data → From Database → From PostgreSQL |
| Python | psycopg2 or sqlalchemy |
| R | RPostgreSQL or DBI |
| Jupyter notebooks | psycopg2 + pandas read_sql |
for the BI side see Looker Studio complete tutorial 2026 and Power Query in Excel tutorial 2026.
comparing PostgreSQL to alternatives
| database | cost | best for | learning curve |
|---|---|---|---|
| PostgreSQL | free open-source | general analytics, OLTP+OLAP | medium |
| MySQL | free open-source | simpler workloads, web apps | low-medium |
| SQLite | free | local, embedded | low |
| DuckDB | free | local analytics, columnar | low-medium |
| Snowflake | $$$ | warehouse-scale analytics | medium |
| BigQuery | pay per query | warehouse-scale, Google ecosystem | medium |
PostgreSQL hits the sweet spot for most analyst needs up to billions of rows. above that, columnar warehouses (Snowflake, BigQuery, Redshift) start to win on query speed.
common mistakes
1. SELECT * on large tables
pulls every column even when you only need three. always specify columns.
2. ignoring NULL handling
COUNT(column) skips NULLs while COUNT(*) counts every row. any arithmetic involving NULL (for example NULL + 1) evaluates to NULL, and aggregates silently ignore NULL inputs. use COALESCE or NULLIF where that matters (see the sketch after this list).
3. forgetting to qualify columns in joins
ambiguous column names create silent bugs. always alias tables (FROM sales s) and qualify (s.customer_id).
4. mixing SQL standards
PostgreSQL is mostly SQL-standard with extensions. queries that work in MySQL or SQLite may not always work identically. when copying examples, check Postgres-specific syntax.
5. not using EXPLAIN
slow queries usually have one root cause that EXPLAIN ANALYZE will reveal. learn to read it.
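to make mistake 2 concrete, a sketch of NULL-aware aggregation on the orders table (the discount column is hypothetical):
SELECT
COUNT(*) AS all_rows, -- counts every row
COUNT(discount) AS rows_with_discount, -- NULL discounts are skipped
SUM(COALESCE(discount, 0)) AS total_discount, -- treat missing discounts as zero
AVG(amount) AS avg_amount -- NULLs are ignored, not counted as zero
FROM orders;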
advanced patterns: cohort analysis from scratch
cohort analysis is the canonical “advanced SQL” exercise. with PostgreSQL window functions and CTEs, it is straightforward.
building a weekly retention cohort
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('week', signup_date) AS cohort_week
FROM users
),
activity AS (
SELECT
user_id,
DATE_TRUNC('week', activity_date) AS activity_week
FROM events
),
joined AS (
SELECT
c.cohort_week,
a.activity_week,
(a.activity_week::date - c.cohort_week::date) / 7 AS weeks_since_signup,
COUNT(DISTINCT a.user_id) AS active_users
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id
GROUP BY 1, 2, 3
)
SELECT
cohort_week,
weeks_since_signup,
active_users
FROM joined
WHERE weeks_since_signup BETWEEN 0 AND 12
ORDER BY cohort_week, weeks_since_signup;
result: one row per cohort per week-since-signup. pivot in your BI tool for the classic triangular retention chart.
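if you would rather compute the retention percentage in SQL instead of the BI tool, one more window function does it. a sketch, assuming the query above has been saved as a view named weekly_cohorts (a hypothetical name):
SELECT
cohort_week,
weeks_since_signup,
active_users,
ROUND(
100.0 * active_users / NULLIF(
FIRST_VALUE(active_users) OVER (
PARTITION BY cohort_week
ORDER BY weeks_since_signup
), 0),
1
) AS pct_of_week_0
FROM weekly_cohorts
ORDER BY cohort_week, weeks_since_signup;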
advanced patterns: funnel analysis
funnel analysis with conditional aggregation (BOOL_OR) and FILTER:
WITH user_events AS (
SELECT
user_id,
BOOL_OR(event_name = 'signup') AS did_signup,
BOOL_OR(event_name = 'first_action') AS did_first_action,
BOOL_OR(event_name = 'purchase') AS did_purchase
FROM events
GROUP BY user_id
)
SELECT
COUNT(*) FILTER (WHERE did_signup) AS signups,
COUNT(*) FILTER (WHERE did_signup AND did_first_action) AS activated,
COUNT(*) FILTER (WHERE did_signup AND did_first_action AND did_purchase) AS purchased,
ROUND(100.0 * COUNT(*) FILTER (WHERE did_signup AND did_first_action) / NULLIF(COUNT(*) FILTER (WHERE did_signup), 0), 2) AS activation_rate,
ROUND(100.0 * COUNT(*) FILTER (WHERE did_signup AND did_first_action AND did_purchase) / NULLIF(COUNT(*) FILTER (WHERE did_signup AND did_first_action), 0), 2) AS purchase_rate
FROM user_events;
one query, full funnel summary, no spreadsheet export needed.
query performance fundamentals
three habits that distinguish slow queries from fast ones:
habit 1: filter early
push WHERE clauses as early in your CTE chain as possible. filtering 10 million rows down to 100 thousand in the first step means every later step touches 1 percent of the data.
habit 2: avoid SELECT * in subqueries
every column you select must be materialized. select only what you need.
habit 3: use EXPLAIN before assuming
if a query feels slow, EXPLAIN ANALYZE first. the planner usually tells you exactly what is slow (sequential scan instead of index scan, large hash join, missing statistics).
learning to read EXPLAIN output is the highest-leverage SQL performance skill.
connecting Postgres to your wider stack
PostgreSQL is the engine. above and around:
- hosted Postgres for solopreneurs: Supabase for data analysis
- transformation layer: dbt for analysts works directly with Postgres
- visualization: Looker Studio complete tutorial 2026, how to build a business dashboard
- import via Power Query: Power Query in Excel tutorial 2026
- AI for analysis on top: chatgpt vs claude for data analysis
- broader tool landscape: best free data analysis tools 2026, SQL for beginners
a typical analyst stack: source data loaded into Postgres → dbt models → Looker Studio dashboards. PostgreSQL is the engine that makes the rest of it possible.
extensions worth knowing
PostgreSQL’s extensibility is one of its hidden superpowers.
postgis
geographic data: latitude/longitude, distances, polygons, intersections. essential for any location-based analysis. works in Supabase out of the box.
pg_stat_statements
tracks query execution stats across your database. useful for identifying which queries are most-run and most-slow.
pg_trgm
trigram-based fuzzy text matching. useful for “similar to” queries on text fields when you need fuzzy joins.
vector / pgvector
vector similarity search for embeddings. lets Postgres serve as a simple vector database for AI-powered search and retrieval.
most managed Postgres services (Supabase, Neon, RDS) support these out of the box.
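enabling an extension is a single statement. as an illustration, a sketch of pg_trgm for fuzzy matching (the customers table and name column are hypothetical):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- similarity() returns 0 to 1; % means "similar enough" under the default threshold
SELECT name, similarity(name, 'Acme Corp') AS score
FROM customers
WHERE name % 'Acme Corp'
ORDER BY score DESC;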
working with raw timestamps
a small but useful pattern: timestamps with timezone vs without.
- timestamptz (timestamp with time zone): stores UTC, displays in the client timezone. use this for almost everything.
- timestamp (timestamp without time zone): stores the literal value, no timezone awareness. use only for true “wall clock” times that have no timezone meaning.
most date confusion in analyst work comes from mixing these. always use timestamptz for event data.
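a quick sketch of the difference, assuming events.created_at is stored as timestamptz (a hypothetical column):
SELECT
created_at, -- rendered in your session's timezone
created_at AT TIME ZONE 'UTC' AS utc_wall_clock,
created_at AT TIME ZONE 'America/New_York' AS new_york_wall_clock
FROM events
LIMIT 5;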
conclusion
PostgreSQL is the analyst’s general-purpose database in 2026. its analytical features (window functions, CTEs, date functions, JSON support) are deep enough to handle real business analysis without exporting to Excel or Python.
the seven feature areas above (window functions, CTEs, dates, JSON, aggregation tricks, indexes, materialized views) are the 20 percent of PostgreSQL that handles 80 percent of analyst work. learning each one moves you a level up. window functions alone replace whole categories of subqueries you previously had to write.
start with one painful query you currently solve by exporting to Excel. rewrite it in Postgres using window functions or a CTE. measure the time saved. add an index if it is slow. wrap it in a view if you run it often. by the third query, the workflow becomes automatic.
if you write SQL daily and want to leave the SELECT-and-GROUP-BY plateau, PostgreSQL is the database that rewards investment most. it is free, hosted options like Supabase make it trivial to spin up, and the skills carry across every modern data platform.