postgresql for analysts: practical 2026 tutorial (no DBA needed)

most SQL tutorials for beginners stop at SELECT, JOIN, and GROUP BY. that is enough to get a job interview but not enough to do real analysis. once you start asking real business questions, you immediately need window functions, common table expressions, date functions, and the JSON-handling features that PostgreSQL has and other databases either lack or implement badly.

PostgreSQL is the analyst’s database in 2026. it is free, open-source, and powers a huge share of modern data infrastructure. its analytical features are best-in-class among general-purpose relational databases. and you do not need to be a DBA to use it well; you just need the 80 percent of features that show up in real analysis work.

this tutorial is for analysts and solopreneurs who already know basic SQL and want to graduate to the PostgreSQL features that show up in real analysis. you will learn window functions, CTEs, date manipulation, JSON handling, and the small operational concepts that separate analysts who make Postgres do real work from analysts who treat it like a fancy spreadsheet. by the end you will be able to write the kinds of queries that previously required exporting to Excel or Python. for the hosted-Postgres workflow, this pairs naturally with Supabase for data analysis.

what makes PostgreSQL the analyst’s choice

PostgreSQL has decades of analytical features baked in: window functions, recursive queries, full-text search, JSON support, geographic data types, and a planner sophisticated enough to handle complex analytical queries on millions of rows.

PostgreSQL is the open-source SQL database with the deepest analytical features built in. for analysts in 2026, the load-bearing features are window functions for running totals and ranks, common table expressions (CTEs) for readable multi-step queries, date functions for time-based analysis, and JSON operators for semi-structured data. learning these four features turns Postgres from “a database” into “a portable analysis environment” that scales beyond spreadsheet limits without learning Python or a new tool.

it runs on every cloud, every operating system, and is the underlying engine for many analyst-friendly products like Supabase, Neon, AWS RDS Postgres, and Aiven.

prerequisites

  • comfort with SELECT, WHERE, GROUP BY, JOIN
  • access to a PostgreSQL database (local install, Supabase, Docker, or RDS)
  • a SQL client (DBeaver free, TablePlus, pgAdmin, or the Supabase SQL editor)

if you do not have a database, the easiest free path: sign up for Supabase (covered in Supabase for data analysis) and use its built-in SQL editor.

step 1: window functions (the biggest level-up)

window functions calculate values across a set of rows related to the current row, without collapsing the result like GROUP BY does.

running total

SELECT
  order_date,
  amount,
  SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM sales
ORDER BY order_date;

result: each row keeps its original detail and gets a running total alongside.

rank within a group

SELECT
  customer_id,
  order_id,
  amount,
  RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM sales;

result: each customer’s orders ranked by amount.

function | use
ROW_NUMBER() | unique sequential number per row
RANK() | rank with gaps for ties
DENSE_RANK() | rank without gaps
LAG(col, n) | value from n rows before
LEAD(col, n) | value from n rows after
FIRST_VALUE(col) | first value in the window frame
LAST_VALUE(col) | last value in the frame (the default frame stops at the current row, so pair it with an explicit frame clause)
NTILE(n) | divide rows into n equal buckets
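LAG is the workhorse for period-over-period comparisons. a minimal sketch, assuming a hypothetical monthly_sales table with month and revenue columns:

```sql
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM monthly_sales
ORDER BY month;
```

the first row's prev_revenue is NULL because there is no prior row; wrap it in COALESCE if you need a zero instead.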

moving average

SELECT
  order_date,
  amount,
  AVG(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily_sales;

frame clauses (ROWS BETWEEN) let you control exactly which rows are included.

window functions replace dozens of subqueries and CTEs you would otherwise write. learning them is the single biggest jump from beginner to intermediate SQL.

step 2: common table expressions (CTEs)

CTEs let you write multi-step queries that read top-down instead of inside-out.

bad (deeply nested):

SELECT customer_id, COUNT(*)
FROM (
  SELECT * FROM (
    SELECT * FROM orders WHERE order_date >= '2026-01-01'
  ) recent
  WHERE recent.amount > 100
) big_recent
GROUP BY customer_id;

good (CTE):

WITH recent_orders AS (
  SELECT * FROM orders WHERE order_date >= '2026-01-01'
),
big_orders AS (
  SELECT * FROM recent_orders WHERE amount > 100
)
SELECT customer_id, COUNT(*) AS big_order_count
FROM big_orders
GROUP BY customer_id;

CTEs are easier to read and easier to debug (run each step independently). since PostgreSQL 12, non-recursive CTEs are inlined into the main query by default, so they no longer act as an optimization fence.

multiple CTEs and reuse

you can define multiple CTEs and reference them in the final SELECT:

WITH
monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS revenue
  FROM sales
  GROUP BY 1
),
monthly_orders AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS orders
  FROM sales
  GROUP BY 1
)
SELECT
  r.month,
  r.revenue,
  o.orders,
  r.revenue / NULLIF(o.orders, 0) AS avg_order_value
FROM monthly_revenue r
JOIN monthly_orders o ON r.month = o.month
ORDER BY r.month;

readable, debuggable, and easy to extend.

recursive CTEs

for hierarchical data (org charts, category trees) Postgres supports recursive CTEs. less common in solopreneur work but worth knowing exists.
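a minimal sketch of walking a category tree, assuming a hypothetical categories table with id, name, and parent_id columns:

```sql
WITH RECURSIVE category_tree AS (
  -- anchor: the roots of the tree
  SELECT id, name, parent_id, 1 AS depth
  FROM categories
  WHERE parent_id IS NULL
  UNION ALL
  -- recursive step: attach children to rows found so far
  SELECT c.id, c.name, c.parent_id, ct.depth + 1
  FROM categories c
  JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree
ORDER BY depth, name;
```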

step 3: date and time functions

dates are at the heart of most analysis. PostgreSQL’s date functions are excellent.

truncating to month or day

DATE_TRUNC('month', order_date)
DATE_TRUNC('day', order_date)
DATE_TRUNC('week', order_date)

date arithmetic

order_date + INTERVAL '7 days'
order_date - INTERVAL '1 month'
NOW() - INTERVAL '24 hours'

extracting parts

EXTRACT(YEAR FROM order_date)
EXTRACT(MONTH FROM order_date)
EXTRACT(DOW FROM order_date)  -- day of week, 0 (Sunday) to 6 (Saturday)

date difference

order_date - signup_date           -- interval for timestamps, integer days for dates
EXTRACT(EPOCH FROM (a - b))/86400  -- interval difference as a number of days

age()

AGE(NOW(), signup_date)  -- returns interval like "2 years 3 months 5 days"

example: cohort retention by signup month:

SELECT
  DATE_TRUNC('month', u.signup_date) AS cohort_month,
  DATE_TRUNC('month', a.activity_date) AS activity_month,
  COUNT(DISTINCT u.user_id) AS active_users
FROM users u
JOIN activity a ON u.user_id = a.user_id
GROUP BY 1, 2
ORDER BY 1, 2;

[SCREENSHOT: query result showing cohort retention by signup month and activity month]

step 4: JSON and JSONB

modern data often arrives as JSON. PostgreSQL handles it natively.

accessing fields

SELECT data->'user'->>'name' AS user_name
FROM events;

-> returns JSON, ->> returns text.

filtering by JSON content

SELECT * FROM events
WHERE data->>'event_type' = 'signup';

casting

SELECT (data->>'amount')::numeric AS amount
FROM events;

indexing JSON

for performance on large datasets:

CREATE INDEX idx_events_data_event_type
ON events ((data->>'event_type'));

JSON support means you can store webhook payloads, API responses, or product event streams directly in Postgres without first flattening to columns.
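one function worth knowing once payloads land: jsonb_array_elements expands a JSON array into one row per element. a sketch, assuming data is jsonb and each event carries a hypothetical items array of objects:

```sql
SELECT
  e.id,
  item->>'sku'        AS sku,
  (item->>'qty')::int AS qty
FROM events e
CROSS JOIN LATERAL jsonb_array_elements(e.data->'items') AS item;
```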

step 5: aggregation tricks

FILTER clause (cleaner than CASE WHEN)

bad:

SELECT
  COUNT(*) AS total,
  COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed
FROM orders;

good:

SELECT
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE status = 'completed') AS completed,
  COUNT(*) FILTER (WHERE status = 'pending') AS pending,
  COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled
FROM orders;

readable, concise, and just as fast.

percentiles

SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
FROM sales;

array aggregation

SELECT
  customer_id,
  ARRAY_AGG(product_name) AS purchased_products,
  STRING_AGG(product_name, ', ') AS products_csv
FROM sales
GROUP BY customer_id;

useful for collapsing many rows into a single output for reporting.

step 6: indexes for analysis

without indexes, queries on big tables slow to a crawl.

B-tree (default, most common)

CREATE INDEX idx_sales_order_date ON sales(order_date);

good for filtering and sorting.

multi-column indexes

CREATE INDEX idx_sales_customer_date ON sales(customer_id, order_date);

useful when filtering on both columns.

partial indexes

CREATE INDEX idx_sales_high_value
ON sales(amount)
WHERE amount > 1000;

smaller, faster index that only covers the rows you actually filter to.

checking index usage

EXPLAIN ANALYZE
SELECT * FROM sales WHERE order_date >= '2026-01-01';

EXPLAIN ANALYZE shows whether the planner used your index. learning to read EXPLAIN output is what turns “queries are slow” into “I know exactly why”.
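the two plan nodes you will meet most often look roughly like this (the cost and row numbers here are illustrative, not from a real run):

```text
-- without an index on order_date: the planner reads the whole table
Seq Scan on sales  (cost=0.00..18334.00 rows=48210 width=24)
  Filter: (order_date >= '2026-01-01'::date)

-- with idx_sales_order_date in place:
Index Scan using idx_sales_order_date on sales  (cost=0.42..1712.90 rows=48210 width=24)
  Index Cond: (order_date >= '2026-01-01'::date)
```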

step 7: materialized views for slow queries

views are recomputed every time they are queried. materialized views are computed once and stored, like a cached table.

CREATE MATERIALIZED VIEW mv_monthly_summary AS
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(amount) AS revenue,
  COUNT(*) AS orders,
  COUNT(DISTINCT customer_id) AS customers
FROM sales
GROUP BY 1;

refresh on demand or on a schedule:

REFRESH MATERIALIZED VIEW mv_monthly_summary;

great for expensive aggregations that you query repeatedly.
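Postgres has no built-in scheduler, but the pg_cron extension (available on Supabase, RDS, and most managed services) can run the refresh for you. a sketch, assuming pg_cron is enabled:

```sql
-- refresh the materialized view every day at 03:00 (server time)
SELECT cron.schedule(
  'refresh-monthly-summary',
  '0 3 * * *',
  'REFRESH MATERIALIZED VIEW mv_monthly_summary'
);
```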

step 8: connection from analysis tools

PostgreSQL connects to virtually every analysis tool out there.

tool | connection notes
Looker Studio | use the PostgreSQL connector
Tableau | native Postgres connector
Power BI | native Postgres connector
Excel (via Power Query) | Get Data → From Database → From PostgreSQL
Python | psycopg2 or SQLAlchemy
R | RPostgreSQL or DBI
Jupyter notebooks | psycopg2 + pandas read_sql

for the BI side see Looker Studio complete tutorial 2026 and Power Query in Excel tutorial 2026.

comparing PostgreSQL to alternatives

database | cost | best for | learning curve
PostgreSQL | free, open-source | general analytics, OLTP + OLAP | medium
MySQL | free, open-source | simpler workloads, web apps | low-medium
SQLite | free | local, embedded | low
DuckDB | free | local analytics, columnar | low-medium
Snowflake | $$$ | warehouse-scale analytics | medium
BigQuery | pay per query | warehouse-scale, Google ecosystem | medium

PostgreSQL hits the sweet spot for most analyst needs up to billions of rows. above that, columnar warehouses (Snowflake, BigQuery, Redshift) start to win on query speed.

common mistakes

1. SELECT * on large tables

pulls every column even when you only need three. always specify columns.

2. ignoring NULL handling

COUNT(column) excludes NULLs; COUNT(*) counts every row. any arithmetic involving NULL yields NULL (NULL + 1 is NULL), and aggregates like SUM skip NULL inputs entirely. use COALESCE or NULLIF where that matters.

3. forgetting to qualify columns in joins

ambiguous column names create silent bugs. always alias tables (FROM sales s) and qualify (s.customer_id).

4. mixing SQL standards

PostgreSQL is mostly SQL-standard with extensions. queries that work in MySQL or SQLite may not always work identically. when copying examples, check Postgres-specific syntax.

5. not using EXPLAIN

slow queries usually have one root cause that EXPLAIN ANALYZE will reveal. learn to read it.

advanced patterns: cohort analysis from scratch

cohort analysis is the canonical “advanced SQL” exercise. with PostgreSQL window functions and CTEs, it is straightforward.

building a weekly retention cohort

WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('week', signup_date) AS cohort_week
  FROM users
),
activity AS (
  SELECT
    user_id,
    DATE_TRUNC('week', activity_date) AS activity_week
  FROM events
),
joined AS (
  SELECT
    c.cohort_week,
    a.activity_week,
    (EXTRACT(EPOCH FROM (a.activity_week - c.cohort_week)) / 604800)::int AS weeks_since_signup,  -- 604800 seconds per week
    COUNT(DISTINCT a.user_id) AS active_users
  FROM cohorts c
  LEFT JOIN activity a ON c.user_id = a.user_id
  GROUP BY 1, 2, 3
)
SELECT
  cohort_week,
  weeks_since_signup,
  active_users
FROM joined
WHERE weeks_since_signup BETWEEN 0 AND 12
ORDER BY cohort_week, weeks_since_signup;

result: one row per cohort per week-since-signup. pivot in your BI tool for the classic triangular retention chart.

advanced patterns: funnel analysis

funnel analysis with conditional aggregation (BOOL_OR) and FILTER:

WITH user_events AS (
  SELECT
    user_id,
    BOOL_OR(event_name = 'signup') AS did_signup,
    BOOL_OR(event_name = 'first_action') AS did_first_action,
    BOOL_OR(event_name = 'purchase') AS did_purchase
  FROM events
  GROUP BY user_id
)
SELECT
  COUNT(*) FILTER (WHERE did_signup) AS signups,
  COUNT(*) FILTER (WHERE did_signup AND did_first_action) AS activated,
  COUNT(*) FILTER (WHERE did_signup AND did_first_action AND did_purchase) AS purchased,
  ROUND(100.0 * COUNT(*) FILTER (WHERE did_signup AND did_first_action) / NULLIF(COUNT(*) FILTER (WHERE did_signup), 0), 2) AS activation_rate,
  ROUND(100.0 * COUNT(*) FILTER (WHERE did_signup AND did_first_action AND did_purchase) / NULLIF(COUNT(*) FILTER (WHERE did_signup AND did_first_action), 0), 2) AS purchase_rate
FROM user_events;

one query, full funnel summary, no spreadsheet export needed.

query performance fundamentals

three habits that distinguish slow queries from fast ones:

habit 1: filter early

push WHERE clauses as early in your CTE chain as possible. filtering 10 million rows down to 100 thousand in step 1 means every subsequent step touches 1 percent of the data.
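the pattern, using the sales table from earlier:

```sql
WITH recent AS (
  SELECT customer_id, amount
  FROM sales
  WHERE order_date >= '2026-01-01'  -- cut the data down in the first step
)
SELECT customer_id, SUM(amount) AS revenue
FROM recent
GROUP BY customer_id;
```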

habit 2: avoid SELECT * in subqueries

every column you select must be materialized. select only what you need.

habit 3: use EXPLAIN before assuming

if a query feels slow, EXPLAIN ANALYZE first. the planner usually tells you exactly what is slow (sequential scan instead of index scan, large hash join, missing statistics).

learning to read EXPLAIN output is the highest-leverage SQL performance skill.

connecting Postgres to your wider stack

PostgreSQL is the engine; the rest of the stack sits above and around it.

a typical analyst stack: source data loaded into Postgres → dbt models → Looker Studio dashboards. PostgreSQL is what makes the rest of it possible.

extensions worth knowing

PostgreSQL’s extensibility is one of its hidden superpowers.

postgis

geographic data: latitude/longitude, distances, polygons, intersections. essential for any location-based analysis. works in Supabase out of the box.

pg_stat_statements

tracks query execution stats across your database. useful for identifying which queries are most-run and most-slow.

pg_trgm

trigram-based fuzzy text matching. useful for “similar to” queries on text fields when you need fuzzy joins.
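a sketch of a fuzzy match, assuming the extension is enabled and a hypothetical companies table:

```sql
-- CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT name, similarity(name, 'acme corp') AS score
FROM companies
WHERE name % 'acme corp'  -- % matches above the similarity threshold (default 0.3)
ORDER BY score DESC;
```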

pgvector

vector similarity search for embeddings. lets Postgres serve as a simple vector database for AI-powered search and retrieval.

most managed Postgres services (Supabase, Neon, RDS) support these out of the box.

working with raw timestamps

a small but useful pattern: timestamps with timezone vs without.

  • timestamptz (timestamp with time zone): stores UTC, displays in client timezone. use this for almost everything.
  • timestamp (timestamp without time zone): stores literal value, no timezone awareness. use only for true “wall clock” times that have no timezone meaning.

most date confusion in analyst work comes from mixing these. always use timestamptz for event data.
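to report a stored timestamptz in a specific zone, use AT TIME ZONE. a sketch, assuming an events table whose created_at column is timestamptz:

```sql
SELECT
  created_at,                                      -- rendered in the session timezone
  created_at AT TIME ZONE 'America/New_York' AS created_at_et,
  DATE_TRUNC('day', created_at AT TIME ZONE 'America/New_York') AS local_day
FROM events;
```

grouping by local_day instead of the raw timestamp is what keeps "daily" charts aligned with your business day.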

conclusion

PostgreSQL is the analyst’s general-purpose database in 2026. its analytical features (window functions, CTEs, date functions, JSON support) are deep enough to handle real business analysis without exporting to Excel or Python.

the seven feature areas above (window functions, CTEs, dates, JSON, aggregation tricks, indexes, materialized views) are the 20 percent of PostgreSQL that handles 80 percent of analyst work. learning each one moves you a level up. window functions alone replace whole categories of subqueries you previously had to write.

start with one painful query you currently solve by exporting to Excel. rewrite it in Postgres using window functions or a CTE. measure the time saved. add an index if it is slow. wrap it in a view if you run it often. by the third query, the workflow becomes automatic.

if you write SQL daily and want to leave the SELECT-and-GROUP-BY plateau, PostgreSQL is the database that rewards investment most. it is free, hosted options like Supabase make it trivial to spin up, and the skills carry across every modern data platform.