How to set up a data warehouse at an early-stage startup

TL;DR

You can have a working data warehouse in a weekend using BigQuery, Airbyte, and dbt Core. the whole setup costs near zero until you hit serious data volume, and the free tiers are genuinely useful. you need a Google Cloud account, a data source, and about four to six hours of focused time.

What You Need Before You Start

a Google account (for Google Cloud and BigQuery free tier)
access to at least one data source: your Postgres database, Stripe, Shopify, or a CSV export will do
Airbyte Cloud account (free tier: up to 1,000 monthly active rows per connection) or Docker installed locally for self-hosted Airbyte
dbt Core installed locally (Python 3.9+ required) or a dbt Cloud free developer account
Python 3.9+ and pip on your local machine
a BI tool account: Metabase Cloud (free up to 5 users) or Google Looker Studio (free)
a text editor you’re comfortable with (VS Code works well)
optional: a GitHub account for version-controlling your dbt project

You do not need a data engineer on staff. this guide is designed for a technical founder or a marketer who can run commands in a terminal.

Step 1: Choose Your Warehouse

BigQuery is the right default for an early startup. the free tier gives you 10 GB of storage and 1 TB of query processing per month, which is more than enough to get started. it also has no idle compute cost, meaning you pay nothing when nobody is querying.

The main alternatives are Snowflake (better for multi-cloud setups, but the $25/month minimum adds up) and DuckDB (excellent for local analytics on files, but not a hosted warehouse). if your data already lives in AWS, Redshift Serverless is worth a look. for 90% of early-stage startups, BigQuery wins on price and simplicity.

Go to console.cloud.google.com and create a new project. name it something like startup-analytics. BigQuery is enabled automatically in new projects.

You should now see: your new project listed in the Google Cloud console with BigQuery visible in the left nav under “Big Data.”

Step 2: Set Up Your BigQuery Project

Inside BigQuery, click your project name in the left panel, then click Create dataset. a dataset in BigQuery is roughly equivalent to a schema in Postgres. create three datasets to start:

raw — untransformed data coming from your sources
staging — lightly cleaned, one-to-one with raw tables
marts — business-logic tables your BI tool will query

Set the data location to the region closest to your team. for US-based teams, pick US (multiple regions). click Create dataset for each one.

-- you can also do this via SQL in the BigQuery console
CREATE SCHEMA IF NOT EXISTS `startup-analytics.raw`
  OPTIONS (location = 'US');

CREATE SCHEMA IF NOT EXISTS `startup-analytics.staging`
  OPTIONS (location = 'US');

CREATE SCHEMA IF NOT EXISTS `startup-analytics.marts`
  OPTIONS (location = 'US');

You should now see: three datasets listed under your project in the BigQuery Explorer panel on the left.

Step 3: Create a Service Account for Connections

Every tool that connects to BigQuery needs credentials. go to IAM and Admin > Service Accounts in Google Cloud Console, then click Create Service Account. name it airbyte-dbt-connector.

Assign two roles: BigQuery Data Editor and BigQuery Job User. after creating it, go to the Keys tab, click Add Key > Create new key, choose JSON, and download the file. store this file somewhere safe, not in a Git repo.

This single service account can be reused by both Airbyte and dbt. that keeps permissions tidy.

You should now see: the JSON key file downloaded to your machine with a filename like startup-analytics-a1b2c3d4.json.

Step 4: Connect Your First Data Source With Airbyte

Airbyte is the easiest way to pull data from SaaS tools and databases into BigQuery without writing custom scripts. it has 350+ pre-built connectors.

Sign in to Airbyte Cloud and click New connection. pick your source. if you’re connecting Stripe, select the Stripe connector, paste your Stripe API key (find it in Stripe Dashboard > Developers > API keys), and click Test and save.

Next, select BigQuery as your destination. paste your service account JSON key content into the credentials field, enter your Google Cloud project ID, and set the default dataset to raw. click Test and save.

Back on the connection screen, choose Full refresh | Overwrite for your first sync to keep things simple. select the streams you want (for Stripe: charges, customers, subscriptions). click Sync now.

Source: Stripe → Destination: BigQuery (raw dataset)
Sync mode: Full refresh | Overwrite
Selected streams: charges, customers, subscriptions

You should now see: tables appearing in your raw dataset in BigQuery within a few minutes, with names like raw.charges.

Step 5: Install and Configure dbt Core

dbt Core is where you write the SQL logic that turns raw data into useful tables. install it with the BigQuery adapter:

pip install dbt-bigquery
dbt init startup_analytics

When prompted, choose BigQuery as your adapter. dbt will ask for your project ID, dataset (use staging for now), and the path to your service account JSON key. it writes these settings to ~/.dbt/profiles.yml.

Navigate into your new project folder and run:

cd startup_analytics
dbt debug

This checks that your credentials and connection work.

You should now see: All checks passed! printed in your terminal. if you see an authentication error, double-check the path to your JSON key in profiles.yml.

Step 6: Write Your First dbt Model

Inside the models/ folder, create a file called stg_stripe_charges.sql:

-- models/stg_stripe_charges.sql
with source as (
    select * from {{ source('raw', 'charges') }}
),

renamed as (
    select
        id                  as charge_id,
        amount / 100.0      as amount_usd,
        currency,
        status,
        customer            as customer_id,
        timestamp_seconds(created) as created_at
    from source
    where status = 'succeeded'
)

select * from renamed

Also create a sources.yml file in the same folder to tell dbt about your raw tables:

version: 2
sources:
  - name: raw
    database: startup-analytics
    schema: raw
    tables:
      - name: charges

Then run the model:

dbt run --select stg_stripe_charges

You should now see: a new table called stg_stripe_charges appear in your staging dataset in BigQuery.

Step 7: Build a Mart Table for Reporting

Now that you have a clean staging layer, build a mart table your BI tool will actually query. create models/marts/mrr_by_month.sql:

-- models/marts/mrr_by_month.sql
with charges as (
    select * from {{ ref('stg_stripe_charges') }}
),

monthly as (
    select
        date_trunc(created_at, month) as month,
        sum(amount_usd)               as total_revenue_usd,
        count(distinct customer_id)   as paying_customers
    from charges
    group by 1
)

select * from monthly
order by month desc

Run it:

dbt run --select mrr_by_month

Create a schema.yml file alongside your models to add descriptions and tests:

version: 2
models:
  - name: mrr_by_month
    description: "Monthly revenue and customer counts from Stripe charges"
    columns:
      - name: month
        tests:
          - not_null
          - unique

Run dbt test to confirm data quality passes.

You should now see: mrr_by_month in your marts dataset, and all tests passing green in the terminal.

Step 8: Connect a BI Tool

Metabase is the easiest self-serve BI tool for non-technical teammates. the Cloud free tier supports up to five users, which covers most founding teams.

Sign up at metabase.com, go to Admin > Databases > Add database, and pick BigQuery. enter your project ID and upload your service account JSON key. Metabase will scan your datasets and surface the tables.

Create your first question by clicking New > Question, selecting marts as the database, and picking mrr_by_month. click on the month column and choose Line chart. save it to a dashboard called “Revenue Overview.”

If you want a free zero-setup alternative, Looker Studio connects to BigQuery natively via the BigQuery connector. go to lookerstudio.google.com, click Create > Report, and add BigQuery as a data source. authenticate with your Google account and pick startup-analytics.marts.mrr_by_month.

You should now see: a line chart of monthly revenue rendered in your BI tool, pulling live from BigQuery.

Step 9: Schedule Syncs and dbt Runs

Raw data sitting in BigQuery only gets stale. set Airbyte to sync on a schedule. in Airbyte Cloud, go to your connection settings and set the Replication frequency to every 6 hours or daily, depending on how fresh you need the data.

For dbt, the simplest free scheduler is a GitHub Actions workflow. create .github/workflows/dbt_run.yml:

name: dbt daily run
on:
  schedule:
    - cron: '0 6 * * *'   # 6am UTC every day
  workflow_dispatch:

jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install dbt-bigquery
      - run: dbt run --profiles-dir .
        env:
          DBT_PROFILES_DIR: .

Store your BigQuery credentials as a GitHub Actions secret and reference them in profiles.yml via environment variables.

You should now see: your dbt models refreshing automatically each morning, and Metabase showing updated numbers when teammates open it.

Common Mistakes To Avoid

Skipping the staging layer. writing marts directly on top of raw tables means one schema change in your source breaks everything downstream. always land raw, clean in staging, aggregate in marts.
Giving your service account Owner permissions. it’s tempting to make the error go away fast. use the minimum roles needed (BigQuery Data Editor + Job User). owner access on a service account key stored in a JSON file is a serious security risk.
Syncing every table from every source on day one. Airbyte will happily sync 200 Stripe tables. you only need five to start. more tables means more storage cost and more dbt models to maintain. pick what you actually need.
Not version-controlling your dbt project. put it in GitHub on day one. when a model breaks a week later, you’ll want git blame.
Querying raw tables in your BI tool. raw tables have duplicate rows, null IDs, and epoch timestamps. always point your BI tool at marts.
Ignoring query costs. BigQuery charges $5 per TB scanned after the free tier. add LIMIT clauses when exploring, use date partition filters, and check the query validator in the BigQuery console before running a big scan.

When To Level Up

This stack (BigQuery + Airbyte + dbt Core + Metabase) handles a startup well up to roughly 50 GB of data and five to ten active analysts or data consumers. past that point, a few things start to break.

dbt Core running on GitHub Actions gets flaky for long-running models. you’ll want dbt Cloud for a proper scheduler, job history, and a visual IDE. Airbyte’s free tier row limits will push you to a paid plan or a self-hosted instance on your own server.

When multiple teams are building their own mart tables, you’ll also need a data catalog so people can find what exists. tools like Atlan or DataHub become worth evaluating. and when your finance team wants a metric to match what the growth team is reporting, a semantic layer (dbt Metrics, Cube) starts earning its keep.

For a full comparison of what comes next, browse the /category/data-skills/ hub where we compare intermediate-level data stack options. the article on free data stack tools for growing startups covers the upgrade paths in detail.

Frequently Asked Questions

Do I need a data engineer to set this up?
No. a technical co-founder or a marketer comfortable with SQL and the command line can complete this setup. the hardest part is usually getting the right service account permissions in Google Cloud, which this guide covers directly.

How much will this cost per month?
For a typical early-stage startup, close to zero. BigQuery’s free tier (10 GB storage, 1 TB queries/month) and Airbyte Cloud’s free tier cover most pre-Series A data volumes. budget $20-50/month once you start syncing more than a handful of sources daily.

Can I use this with a non-Stripe data source?
Yes. Airbyte supports Shopify, HubSpot, PostgreSQL, MySQL, Google Analytics, and 350+ other connectors. the dbt setup is identical regardless of source. just update your sources.yml to point at the correct raw tables.

What if I don’t know SQL?
You need at least basic SQL to write dbt models. if you can write SELECT, FROM, WHERE, GROUP BY, and JOIN, you have enough to start. Metabase also has a no-code question builder that lets non-SQL users query mart tables without writing anything.

Is BigQuery GDPR compliant?
Yes, when configured correctly. set your dataset region to EU instead of US if your users are in Europe, and review Google Cloud’s data processing addendum. avoid storing raw PII in your warehouse where possible, and use dbt to hash or drop sensitive columns in the staging layer.

Bottom Line

The fastest path to a working startup data warehouse is BigQuery for storage, Airbyte for ingestion, dbt Core for transformation, and Metabase for reporting. you can have all four connected and running automated daily refreshes within a weekend. the total cost at early-stage volumes is effectively zero. start with one data source, two or three dbt models, and one dashboard. get your team actually using the data before you add more complexity. the architecture scales cleanly as you grow, and each tool can be swapped or upgraded independently. for the next step once you’ve outgrown this setup, head to /category/data-skills/ to compare more advanced pipeline and transformation tools.