How to set up dbt from scratch as a solo data person - Data Research Analysis Collection

TL;DR

You can have dbt Core running against a real warehouse in about an hour. You need Python 3.11 or 3.12, a database to connect to, and basic comfort with the command line. The outcome is a working project where SQL transforms are version-controlled, tested, and documented without any paid tooling.

What You Need Before You Start

Python 3.11 or 3.12 installed and accessible via python --version in your terminal
pip bundled with Python; verify with pip --version
A virtual environment tool: the built-in venv module is enough
A warehouse or database to connect to, ranked by cost for solo setups:
  DuckDB: runs locally, completely free, no account needed, ideal for starting out
  BigQuery: free tier gives you 10 GB storage and 1 TB of queries per month
  Snowflake: free 30-day trial with $400 in compute credits
Git installed (check: git --version) plus a GitHub account for version control
A text editor that handles YAML cleanly, such as VS Code
A real CSV or table you care about, because transforms on your own data will stick better than tutorial data

Optional: the dbt Power User VS Code extension gives you lineage graphs and model previews right inside the editor.

Step 1: Create a Dedicated Python Virtual Environment

Never install dbt into your system Python. It pulls in many dependencies and version conflicts will eventually cause real pain. Create an isolated environment first.

Navigate to where you want your dbt project to live and run:

mkdir my-dbt-project
cd my-dbt-project
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

Your terminal prompt should now show (.venv) at the start. Every package you install from here goes into this environment, not into your global Python.

Keep the virtual environment inside the project folder. Whenever you open a new terminal tab or restart your machine, you need to run source .venv/bin/activate again before using dbt. Forgetting this step is the most common reason people see command not found: dbt.
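A quick way to check for a missing activation is to look at the VIRTUAL_ENV variable, which the venv activate script sets. The function name venv_status below is made up for this sketch; it only reports the state and changes nothing.

```shell
# Report whether a virtual environment is active in the current shell.
# VIRTUAL_ENV is exported by the venv "activate" script.
venv_status() {
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "active: ${VIRTUAL_ENV}"
  else
    echo "inactive: run 'source .venv/bin/activate' first"
  fi
}

venv_status
```

If the function prints "inactive", activate the environment and try the dbt command again.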

You should now see: (.venv) in your shell prompt. Run which python on Mac/Linux or where python on Windows and confirm the path points inside your .venv folder.

Step 2: Install dbt Core with the Right Adapter

dbt Core is the open-source CLI. You install it alongside a database-specific adapter that matches your warehouse.

For DuckDB (the best starting point for solo setups):

pip install dbt-core dbt-duckdb

For BigQuery:

pip install dbt-core dbt-bigquery

For Snowflake:

pip install dbt-core dbt-snowflake

After installation, confirm everything landed correctly:

dbt --version

You will see output showing your Core version and adapter version, for example Core: 1.9.2 and Plugins: duckdb: 1.9.0. If you see command not found, your virtual environment is not activated.

Pin your versions right now. Create a requirements.txt:

dbt-core==1.9.2
dbt-duckdb==1.9.0

This prevents a future situation where installing on a new machine pulls a newer version and breaks your project.
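As an alternative to typing the pins by hand, you can capture whatever is installed right now. This is a minimal sketch that assumes the virtual environment is active and that python resolves to its interpreter; it overwrites any existing requirements.txt in the current folder.

```shell
# Write the exact versions of every installed dbt package to requirements.txt.
# "|| true" keeps the line from failing when grep finds no matching packages.
python -m pip freeze | grep -i '^dbt' > requirements.txt || true
cat requirements.txt

# On another machine, inside a fresh virtual environment, reinstall with:
#   python -m pip install -r requirements.txt
```

Filtering on the dbt prefix pins dbt Core, the adapter, and dbt's own helper packages while leaving unrelated dependencies unpinned. Drop the grep if you would rather freeze everything.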

You should now see: version output from dbt --version with no errors.

Step 3: Initialize Your dbt Project

dbt has a scaffolding command that creates the folder structure you need.

dbt init my_project

You will be prompted to pick your database adapter. Select the one you installed. dbt creates a folder called my_project with this layout:

my_project/
  models/
  tests/
  seeds/
  snapshots/
  macros/
  analyses/
  dbt_project.yml
  README.md

Move into that folder:

cd my_project

Open dbt_project.yml. Check that the name field matches what you typed and note the value of the profile field, which dbt init sets to the same name by default. The profile value must match the top-level key of the entry you configure in profiles.yml in the next step; the name field does not have to. A mismatched profile name is a common source of confusing errors later.
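The match between the project's profile setting and profiles.yml can be checked with a few lines of shell. This is a rough sketch: check_profile is a made-up helper, and it assumes a simple one-line profile: entry like the one dbt init generates.

```shell
# Hypothetical helper: check_profile <dbt_project.yml> <profiles.yml>
# Succeeds when the project's profile name is a top-level key in profiles.yml.
check_profile() {
  project_file="$1"
  profiles_file="$2"
  # Take the value of the top-level "profile:" key, dropping comments and quotes.
  profile_name=$(grep -E '^profile:' "$project_file" | head -n 1 |
    sed -E -e 's/^profile:[[:space:]]*//' -e 's/[[:space:]]*(#.*)?$//' \
           -e "s/^[\"']//" -e "s/[\"']\$//")
  if grep -qE "^${profile_name}:" "$profiles_file"; then
    echo "ok: profile '${profile_name}' found"
  else
    echo "missing: profile '${profile_name}' not found"
    return 1
  fi
}
```

From inside the project folder, a typical call would be check_profile dbt_project.yml ~/.dbt/profiles.yml once the profile exists. The dbt debug command covered in the next step performs a more thorough version of the same check.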

You should now see: the folder structure above when you run ls inside your project directory.

Step 4: Configure Your Connection in profiles.yml

dbt looks for connection credentials in ~/.dbt/profiles.yml by default. This file lives outside your project folder so it stays off GitHub and keeps your credentials out of version control.

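Create or open ~/.dbt/profiles.yml. dbt init may already have written a my_project entry there; if so, edit that entry rather than adding a second one. For DuckDB it should look like this: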

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "dev.duckdb"
      threads: 4

For BigQuery, the profile looks like this:

my_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your-gcp-project-id
      dataset: dbt_dev
      threads: 4
      job_execution_timeout_seconds: 300

Now test the connection from inside your project folder:

dbt debug

This command checks that your credentials are valid, the adapter is installed, and your project YAML is well-formed. Read and fix any errors before moving on. Trying to skip a failing debug check costs more time than fixing it immediately.

You should now see: All checks passed! at the bottom of the dbt debug output.

Step 5: Load Your First Seed File

Seeds are CSV files that dbt loads directly into your warehouse. They are ideal for small reference tables like country codes, channel names, or product categories.

Drop a CSV into the seeds/ folder. For example, seeds/channels.csv:

channel_id,channel_name,channel_type
1,Google Ads,paid
2,Meta Ads,paid
3,Newsletter,organic
4,Direct,organic

Run:

dbt seed

dbt reads the CSV, infers column types, and creates a table in your warehouse. For DuckDB it writes into dev.duckdb. For BigQuery it creates a table inside your dbt_dev dataset. If you update the CSV later, re-run dbt seed and the table updates.

You should now see: a channels table you can query directly with SELECT * FROM channels LIMIT 5.
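If you have no SQL client handy, dbt can preview the seed itself. The sketch below recreates the example CSV from this step, loads only that seed, and prints its rows with dbt show. It assumes dbt 1.5 or later, where dbt show and its --inline flag are available, and it must be run from inside the project folder.

```shell
# Recreate the example seed file from this step.
mkdir -p seeds
cat > seeds/channels.csv <<'EOF'
channel_id,channel_name,channel_type
1,Google Ads,paid
2,Meta Ads,paid
3,Newsletter,organic
4,Direct,organic
EOF

# Load just this seed, then preview it through dbt.
if command -v dbt >/dev/null 2>&1; then
  dbt seed --select channels
  dbt show --inline "select * from {{ ref('channels') }}" --limit 5
else
  echo "dbt not found; activate the virtual environment first"
fi
```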

Step 6: Write Your First Model

Models are the core of dbt. Each .sql file in the models/ folder becomes a view or table in your warehouse.

Create models/staging/stg_orders.sql:

with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        order_id,
        customer_id,
        cast(order_date as date)    as order_date,
        total_amount_cents / 100.0  as total_amount_usd,
        status
    from source
)

select * from renamed

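The {{ source('raw', 'orders') }} call points dbt at a table that already exists in your warehouse, here an orders table in a schema named raw. dbt does not create that table for you. Declare the source in a YAML file. Create models/staging/sources.yml: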

version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
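The source declaration assumes raw.orders already exists. If you are on DuckDB and have nothing loaded yet, one way to get a practice table is sketched below. The three rows are invented sample data, create_raw_orders.sql is a hypothetical file name, and the duckdb command-line client is a separate install that this guide does not otherwise require.

```shell
# Write SQL that creates a raw schema and a small, made-up orders table.
cat > create_raw_orders.sql <<'EOF'
create schema if not exists raw;
create or replace table raw.orders as
select * from (values
    (1, 101, '2024-01-05', 2599, 'completed'),
    (2, 102, '2024-01-06', 1250, 'pending'),
    (3, 101, '2024-01-07', 4000, 'cancelled')
) as t(order_id, customer_id, order_date, total_amount_cents, status);
EOF

# Apply it to the file the dev profile points at, if the DuckDB CLI is installed.
if command -v duckdb >/dev/null 2>&1; then
  duckdb dev.duckdb < create_raw_orders.sql
else
  echo "duckdb CLI not found; run create_raw_orders.sql with any DuckDB client"
fi
```

Run it from the same folder where you run dbt so that the relative path dev.duckdb resolves to the same database file.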

Run the model:

dbt run

You should now see: a stg_orders view in your warehouse. Query it and confirm the types and row counts look right.

Step 7: Add Tests to Your Models

This is the step most solo analysts skip and regret later. dbt has four built-in generic tests (unique, not_null, accepted_values, and relationships), and adding them takes a couple of minutes.

Extend your models/staging/sources.yml:

version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['pending', 'completed', 'cancelled']

Run tests:

dbt test

dbt runs a SQL assertion behind each test and reports failures. A unique test on order_id will surface duplicate rows. An accepted_values test will flag unexpected status strings before they corrupt a downstream metric.
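To make the idea of a SQL assertion concrete, here is roughly the kind of query a unique test runs, written as a singular test. dbt treats any .sql file in the tests/ folder as a test that fails when the query returns rows. The file name is made up, and the query is an approximation of the built-in test, not a copy of dbt's compiled SQL.

```shell
# Create a singular test that looks for duplicate order_id values.
mkdir -p tests
cat > tests/assert_order_id_is_unique.sql <<'EOF'
-- Singular test: dbt marks the test as failed if this query returns any rows.
select
    order_id,
    count(*) as n_records
from {{ source('raw', 'orders') }}
where order_id is not null
group by order_id
having count(*) > 1
EOF
```

This duplicates the generic unique test, so you would not normally keep both. Singular tests earn their place when a rule is too specific for a generic test, for example a check that spans several columns.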

Use dbt build to run models and tests together in dependency order:

dbt build

You should now see: a test results summary showing how many passed and how many failed.

Step 8: Generate and Browse Your Documentation

dbt builds a browsable documentation site from your project files. This becomes useful three months from now when you cannot remember what a model actually does.

Add descriptions to your model in models/staging/schema.yml:

version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from the raw source. Amounts converted to USD."
    columns:
      - name: order_id
        description: "Primary key. One row per order."

Generate and serve the docs:

dbt docs generate
dbt docs serve

A browser tab opens at http://localhost:8080 showing your full lineage graph, descriptions, the tests attached to each model and source, and source connections. No additional tool required.

You should now see: your stg_orders model in the docs site with the description you wrote and a lineage graph connecting it to the raw source.

Step 9: Commit the Project to Git

Your dbt project is code. Version control it from day one.

Inside your project folder:

git init
echo "dev.duckdb" >> .gitignore
echo ".venv/" >> .gitignore
git add .
git commit -m "initial dbt project setup"

Push to GitHub using the GitHub CLI:

gh repo create my-dbt-project --private --source=. --push

Never commit ~/.dbt/profiles.yml. It lives outside your project for exactly this reason. If you work across two machines, use environment variables for credentials rather than hardcoding values in the profile.
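A profile that reads its secrets from the environment might look like the sketch below, using dbt's env_var function. The variable names, the role, database, and warehouse values, and the file name profiles.example.yml are all placeholders chosen for illustration. Because the file contains no secrets, it is safe to commit as a template.

```shell
# Write a Snowflake profile template that pulls credentials from the environment.
cat > profiles.example.yml <<'EOF'
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_SNOWFLAKE_USER') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PASSWORD') }}"
      role: transformer        # placeholder
      database: analytics      # placeholder
      warehouse: transforming  # placeholder
      schema: dbt_dev
      threads: 4
EOF
```

On each machine, copy the contents into ~/.dbt/profiles.yml and export the three variables in your shell before running dbt. dbt stops with an error when a referenced variable is not set, which is easier to diagnose than a failed login.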

You should now see: your project in GitHub with the .duckdb file absent from the repository.

Common Mistakes To Avoid

Installing dbt into system Python. It causes version conflicts that are painful to untangle. Virtual environments are not optional.
Committing credentials. The profiles.yml file lives outside the project folder for a reason. Verify your .gitignore every time before pushing.
Running dbt run without dbt test. Bad data propagates silently when you skip tests. Use dbt build to handle both steps together.
Naming models with hyphens. dbt uses filenames as table names in most warehouses. stg-orders.sql will throw an error. Underscores only.
Hard-coding table names instead of using ref(). Writing a raw table name breaks the lineage graph and causes incorrect build ordering. Always use {{ ref('model_name') }} to reference another model.
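Leaving heavy models as views. By default, models are views, so the underlying query reruns every time someone selects from them. A model joining millions of rows that takes 40 seconds per query should be materialized as a table or incremental model, either in dbt_project.yml or in a config block at the top of the model file.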
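The last two points can be illustrated with one downstream model. The sketch below creates a hypothetical fct_daily_revenue model that references stg_orders through ref() and sets its own materialization in a config block. The model name, the marts folder, and the aggregation are invented for the example.

```shell
# Create a downstream model that uses ref() and is materialized as a table.
mkdir -p models/marts
cat > models/marts/fct_daily_revenue.sql <<'EOF'
{{ config(materialized='table') }}

select
    order_date,
    count(*)              as order_count,
    sum(total_amount_usd) as revenue_usd
from {{ ref('stg_orders') }}
where status = 'completed'
group by order_date
EOF
```

Because the model uses ref(), dbt knows to build stg_orders first and draws the dependency in the lineage graph. The config block overrides the default view materialization for this model only.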

When To Level Up

The solo dbt Core setup described here handles a surprising amount of work. You can run hundreds of models, test thousands of assertions, and document everything without paying for anything.

The cracks appear when you need scheduled runs. dbt Core has no scheduler. You will end up running dbt build manually or wiring it to a cron job, which works until it does not. The moment you have a stakeholder who needs fresh data every morning without your involvement, you need an orchestration layer.
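For reference, the cron approach usually amounts to a small wrapper script plus one crontab line. The sketch below assumes the folder layout from this guide, with the project at ~/my-dbt-project/my_project and the virtual environment one level above it. The script name and log path are placeholders.

```shell
# Create a wrapper script that cron can call.
cat > run_dbt.sh <<'EOF'
#!/usr/bin/env bash
# Stop at the first failing command so a broken build is not reported as success.
set -eo pipefail

# Assumed layout from this guide: the venv sits one level above the dbt project.
cd "$HOME/my-dbt-project/my_project"
source ../.venv/bin/activate

# Append output to a log so failures can be inspected later.
dbt build >> "$HOME/dbt_build.log" 2>&1
EOF
chmod +x run_dbt.sh

# Example crontab entry (add it with "crontab -e") to run at 06:00 every day:
#   0 6 * * * /absolute/path/to/run_dbt.sh
```

Cron runs with a minimal environment, which is why the script activates the virtual environment itself and the crontab entry uses an absolute path. Nothing here alerts you when a run fails, and that gap is the usual reason to move to an orchestrator.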

At that point your options are dbt Cloud, which has a free Developer tier that includes a scheduler, browser-based IDE, and CI/CD built in, or pairing dbt Core with a lightweight orchestrator like Prefect or Airflow. The second option gives you more control but requires more setup time.

Team size is the other trigger. Once two people edit models regularly, you need pull request review workflows and separate dev environments. dbt Cloud handles both without extra configuration.

For more on choosing the right data stack as your needs grow, browse /category/data-skills/. The comparison in dbt Cloud vs dbt Core for solopreneurs is worth reading before you make that call.

Frequently Asked Questions

Do I need to know Python to use dbt?
Not for most of what dbt does. You write SQL for models and YAML for config and tests. Macros and more complex logic are written in Jinja, a templating language with Python-like syntax, so Python familiarity helps there, but the majority of solo workflows never need to go that far.

Can I use dbt with Google Sheets as a source?
Not directly. dbt connects to a warehouse, not to spreadsheets. You need to load your Sheets data into BigQuery or another warehouse first. Tools like Airbyte or Fivetran have connectors that sync Sheets automatically.

Is dbt Core actually free?
Yes. dbt Core is open-source under the Apache 2.0 license and costs nothing. You may pay for the warehouse you connect it to depending on data volume, but on DuckDB or BigQuery’s free tier you can run a full project at zero cost.

How long before dbt saves me time instead of costing it?
Most people get a working model in the first hour. The compounding benefit, where tests catch problems early and docs help you remember past decisions, becomes noticeable after two to three weeks of regular use.

What is the difference between a seed and a model?
A seed is a CSV file you load into the warehouse as a static reference table. A model is a SQL transformation that runs against data already in the warehouse. Seeds feed into models, not the other way around.

Bottom Line

Getting dbt Core running as a solo analyst takes about an hour and costs nothing if you use DuckDB locally or BigQuery’s free tier. You create an isolated Python environment, install the right adapter, configure a credential file outside your project, write SQL models, add tests, and generate docs. Everything lives in Git from the first commit. The workflow scales further than you expect before you need scheduling or collaboration features, and the upgrade path when you do hit those limits is clear. Start simple, test everything, and document as you build rather than after. For more tools and workflows built around solo data work, browse the full guide collection at /category/data-skills/ and check out the best SQL tools for solopreneurs as your next read.