What is a lakehouse architecture? - Data Research Analysis Collection

Quick Definition

A lakehouse is a data architecture that stores raw data cheaply in object storage (like Amazon S3 or Google Cloud Storage) while letting you query it with the speed and reliability you would normally only get from a data warehouse. In other words, it is one system that tries to do what two separate systems used to do, giving you the flexibility of a data lake and the performance of a warehouse without running both in parallel.

Why It Matters In 2026

For most of the past decade, data teams ran two systems side by side. The data lake held raw files: event logs, API responses, clickstream data, images, whatever arrived. The data warehouse held clean, structured tables ready for analysts to query. Data moved from the lake to the warehouse through ETL pipelines, and those pipelines were where a lot of pain lived. They were slow, brittle, and expensive to maintain.

Two things happened around 2023 and 2024 that shifted the conversation. First, open table formats matured. Delta Lake and Apache Iceberg added ACID transactions, schema evolution, and time travel directly on top of object storage. That meant your S3 bucket could behave more like a database table, not just a pile of Parquet files. Second, compute costs for querying cloud storage dropped significantly as engines got smarter about pushing filters down to the storage layer before reading data.

By 2026, the pattern has become a common default for companies that outgrow a single Postgres database but do not want to manage two separate data platforms. The business case is straightforward: you pay for one storage tier (object storage is cheap, often under $0.025 per GB per month), and you bring compute to the data only when you need it. You are not copying data from a lake into a warehouse and paying twice.

The trend also matters because real-time and near-real-time analytics became table stakes. Stakeholders stopped accepting dashboards that refresh overnight. The lakehouse pattern, especially with streaming ingestion via tools like Apache Kafka, lets companies query data that arrived minutes ago without separate infrastructure.

A Concrete Example

Imagine a SaaS company called Formly that makes form-building software. They have around 8,000 paying customers. Every time a user submits a form, fills a field, or abandons a flow, an event fires. That is roughly 40 million events per day landing in Amazon S3 as JSON files.

Before they adopted a lakehouse, their stack looked like this: events hit S3, a nightly Fivetran job copied the cleaned events into Snowflake, and analysts queried Snowflake. The total monthly bill was around $4,200. Worse, if a product manager asked “which form templates had the highest abandonment rate yesterday between 2pm and 4pm,” the answer was not available until the next morning.

After adopting Databricks with Delta Lake as the table format, Formly kept the same S3 bucket. They changed how data was written into it. Events now land as Delta tables directly, with a streaming pipeline that flushes every five minutes. dbt runs transformation jobs against those Delta tables, producing analytics-ready models. Analysts query the same S3-backed data using Databricks SQL.

The product manager can now ask that abandonment question and get an answer in under 30 seconds. The monthly cost dropped from around $4,200 to around $2,800, a saving of roughly a third, because they no longer paid for Snowflake. The trade-off was two weeks of migration work and a steeper learning curve for one engineer who had never used Apache Spark before.
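
The product manager's question maps to a short SQL aggregation. The sketch below runs it with Python's built-in sqlite3 module purely as a stand-in for a lakehouse engine such as Databricks SQL. The table name form_events, its columns, and the sample rows are hypothetical and are not Formly's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per form session outcome.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE form_events (template TEXT, event_type TEXT, event_time TEXT)"
)
rows = [
    ("contact", "submitted", "2026-03-01 14:05:00"),
    ("contact", "abandoned", "2026-03-01 14:20:00"),
    ("survey", "abandoned", "2026-03-01 15:10:00"),
    ("survey", "abandoned", "2026-03-01 15:45:00"),
    ("survey", "submitted", "2026-03-01 15:50:00"),
    ("survey", "abandoned", "2026-03-01 18:00:00"),  # outside the 2pm-4pm window
]
conn.executemany("INSERT INTO form_events VALUES (?, ?, ?)", rows)

# Abandonment rate per template inside a half-open time window.
QUERY = """
SELECT template,
       AVG(CASE WHEN event_type = 'abandoned' THEN 1.0 ELSE 0.0 END) AS abandonment_rate
FROM form_events
WHERE event_time >= ? AND event_time < ?
GROUP BY template
ORDER BY abandonment_rate DESC
"""

result = conn.execute(
    QUERY, ("2026-03-01 14:00:00", "2026-03-01 16:00:00")
).fetchall()

for template, rate in result:
    print(f"{template}: {rate:.0%}")
```

On a real lakehouse the query text would look much the same. What changes is that the table is backed by files in object storage and the data is at most a few minutes old.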

That is the lakehouse pattern in practice: one storage layer, multiple use cases, and compute that scales to zero when you are not querying.

How It Works (Without The Jargon)

The storage layer is just cheap object storage

Everything lives in S3, Azure Data Lake Storage, or Google Cloud Storage. These are not databases. They are object stores optimized for holding billions of files cheaply. Your raw data, your cleaned data, and your aggregated tables all sit here as files, typically in Parquet or ORC format.

Table formats add database-like behavior on top of those files

This is the piece that makes lakehouses work. Delta Lake and Apache Iceberg wrap your Parquet files in a metadata layer that tracks which files belong to which table version, which rows were deleted, and what the schema looks like. When you run a SQL query, the engine reads this metadata first so it only touches the files it needs. Without a table format, the engine has to list the folder and, unless something like partition directories narrows the search, open every file every time.
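
A toy model can make the metadata idea concrete. The sketch below is not the Delta Lake or Iceberg on-disk format. It only imitates two ideas they share: an ordered log of commits that add and remove data files, and per-file min/max statistics that let the engine skip files. The names DataFile, LOG, snapshot, and files_to_scan are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_ts: int  # smallest event timestamp stored in the file
    max_ts: int  # largest event timestamp stored in the file


# Each commit is one table version: files added and file paths removed.
LOG = [
    {"add": [DataFile("part-000.parquet", 0, 99),
             DataFile("part-001.parquet", 100, 199)], "remove": []},
    {"add": [DataFile("part-002.parquet", 200, 299)], "remove": []},
    # A compaction job rewrites two small files into one larger file.
    {"add": [DataFile("part-003.parquet", 0, 199)],
     "remove": ["part-000.parquet", "part-001.parquet"]},
]


def snapshot(log, version):
    """Replay commits 0..version to find the files that make up that version."""
    live = {}
    for commit in log[: version + 1]:
        for path in commit["remove"]:
            live.pop(path, None)
        for data_file in commit["add"]:
            live[data_file.path] = data_file
    return list(live.values())


def files_to_scan(log, version, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the filter [lo, hi]."""
    return [
        f.path
        for f in snapshot(log, version)
        if f.max_ts >= lo and f.min_ts <= hi
    ]


latest = len(LOG) - 1
print(files_to_scan(LOG, latest, 210, 250))  # ['part-002.parquet']
print(files_to_scan(LOG, 1, 50, 120))  # time travel: ['part-000.parquet', 'part-001.parquet']
```

The second call shows why time travel comes almost for free in this design: an older version is just the same log replayed up to an earlier commit.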

A catalog keeps track of everything

The metadata catalog is a registry of all your tables: their names, locations, schemas, and partitions. Options such as the open source Apache Hive Metastore, Unity Catalog (Databricks), and AWS Glue handle this. Think of it as a phone book that tells your query engine where to find each table’s metadata before it starts reading files.
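
The phone book idea can be sketched as an in-memory registry. This is a simplified illustration, not the API of Hive Metastore, Unity Catalog, or AWS Glue. The classes TableEntry and Catalog, the table name, and the bucket path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    location: str     # where the table's metadata and data files live
    schema: dict      # column name -> type
    partitions: list  # partition column names


class Catalog:
    """Maps a table name to the information an engine needs to start reading."""

    def __init__(self):
        self._tables = {}

    def register(self, name, entry):
        if name in self._tables:
            raise ValueError(f"table already registered: {name}")
        self._tables[name] = entry

    def lookup(self, name):
        try:
            return self._tables[name]
        except KeyError:
            raise LookupError(f"unknown table: {name}") from None


catalog = Catalog()
catalog.register(
    "analytics.form_events",
    TableEntry(
        location="s3://example-bucket/warehouse/form_events",  # made-up path
        schema={"template": "string", "event_type": "string", "event_time": "timestamp"},
        partitions=["event_date"],
    ),
)

entry = catalog.lookup("analytics.form_events")
print(entry.location)
```

A query engine would take the location returned by the lookup and read the table format's metadata from there before touching any data file.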

Compute engines run queries on demand

The query engine, whether that is Spark, Trino, DuckDB, or a cloud-native service, reads the catalog, uses the table format metadata to find the right files, and executes your SQL. Because compute is separate from storage, you can spin up a large cluster for a heavy transformation job and shut it down when it finishes. You pay for compute only while it runs.

Governance and access control sit across all layers

Row-level security, column masking, and audit logging all need to work across every tool that touches the data. That is still the hardest part of a lakehouse. Unity Catalog and Apache Ranger are common choices. Without a governance layer, you end up with a lake where everyone can read everything, which creates compliance problems fast.
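
As a rough illustration of what a governance layer enforces, the sketch below applies a row filter and a column mask per role and records each read in an audit log. It is not how Unity Catalog or Apache Ranger are configured. The POLICIES structure, the role names, and the sample rows are assumptions made for the example.

```python
# Hypothetical policies keyed by role.
POLICIES = {
    "analyst": {
        "masked_columns": {"email"},
        "row_filter": lambda row: row["region"] == "EU",
    },
    "admin": {
        "masked_columns": set(),
        "row_filter": lambda row: True,
    },
}

AUDIT_LOG = []


def read_table(rows, role):
    """Return only the rows and column values the role is allowed to see."""
    policy = POLICIES[role]
    AUDIT_LOG.append({"role": role, "action": "read"})
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level security: drop rows the role may not read
        visible.append(
            {
                col: ("***" if col in policy["masked_columns"] else value)
                for col, value in row.items()
            }
        )
    return visible


CUSTOMERS = [
    {"name": "Ana", "email": "ana@example.com", "region": "EU"},
    {"name": "Bo", "email": "bo@example.com", "region": "US"},
]

analyst_view = read_table(CUSTOMERS, "analyst")
print(analyst_view)
```

The hard part in practice is that this check has to live in one shared place. If each query engine applies its own version of the policy, the tools can drift apart.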

Streaming and batch use the same tables

One underrated benefit is that streaming data and batch data can write to the same Delta or Iceberg table. A Kafka consumer writes events every five minutes. A nightly dbt job transforms them. An analyst runs an ad-hoc query. All three workflows touch the same table without stepping on each other, because the table format handles concurrent writes through optimistic concurrency control.
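
Optimistic concurrency control can be sketched in a few lines: each writer remembers the table version it read, and a commit succeeds only if nobody else committed in the meantime. The sketch is deliberately stricter than real table formats, which typically let non-conflicting commits such as plain appends through after a check. TableLog, CommitConflict, and commit_with_retry are illustrative names, and the lock stands in for an atomic write to the log.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed after this writer read the table."""


class TableLog:
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for an atomic log write
        self.commits = []

    @property
    def version(self):
        return len(self.commits)

    def try_commit(self, read_version, files):
        """Append a commit only if the table is still at read_version."""
        with self._lock:
            if read_version != len(self.commits):
                raise CommitConflict(
                    f"read version {read_version}, table is at {len(self.commits)}"
                )
            self.commits.append(list(files))
            return len(self.commits)


def commit_with_retry(log, make_files, max_attempts=5):
    """Re-read the current version and retry when a conflict is detected."""
    for _ in range(max_attempts):
        read_version = log.version
        files = make_files(read_version)
        try:
            return log.try_commit(read_version, files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")


log = TableLog()
stale_version = log.version  # the batch job reads the table at version 0
log.try_commit(log.version, ["stream-0001.parquet"])  # the streaming writer commits first

try:
    log.try_commit(stale_version, ["batch-0001.parquet"])  # stale, so this is rejected
except CommitConflict:
    commit_with_retry(log, lambda version: ["batch-0001.parquet"])

print(log.version, log.commits)
```

Neither writer ever blocks the other while it is producing files. The only contested moment is the commit, which is why this approach suits workloads where conflicts are rare.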

Common Misconceptions

“It is just a data lake with a SQL engine bolted on.” Without ACID transactions and a table format, you get query results that can reflect a half-written batch job. The table format is what makes it reliable, not just adding a query tool.
“You need a large data team to run one.” A two-person team can run a lakehouse on managed services like Databricks or AWS Lake Formation. The complexity scales with your customization requirements, not with the architecture itself.
“It replaces your operational database.” Your Postgres or MySQL database still handles your application’s transactional reads and writes. The lakehouse is for analytics workloads, not for powering your app’s API responses.
“Open source means no cost.” The table formats (Delta Lake, Iceberg) are free. The compute to run Spark is not. Managed services add licensing on top. Budget for compute hours, not just storage.
“It is only relevant at petabyte scale.” You can run a perfectly useful lakehouse on 50 GB of data. The architecture makes sense when you have multiple tools that need to read the same data, regardless of size.
“It eliminates the need for data transformation.” Raw event data in S3 is still raw. You still need dbt or Spark jobs to clean it, model it, and test it. The lakehouse just removes the step where you copy data into a warehouse first.

When You Actually Need This (And When You Do Not)

You do not need a lakehouse if all your data fits in a Postgres database and your analysts are happy with SQL queries against it. Most companies with fewer than 50 employees and a single product are in this category. A managed warehouse like Snowflake or BigQuery is a perfectly good answer if you are comfortable paying for it and your data volumes are predictable.

You start to benefit from a lakehouse when at least two of these are true: your raw data is too large or too varied to load into a warehouse directly, you are paying to move data between a lake and a warehouse and those pipelines keep breaking, you need sub-hour latency for analytics, or you have data scientists who need the raw files alongside analysts who need clean tables.

The honest answer for most small businesses and early-stage startups: wait. Build on a managed warehouse first. Migrate to a lakehouse when you feel the pain of your current architecture, not before. You can read more about which tool fits your stage at /category/data-skills/.

Frequently Asked Questions

What is the difference between a data lake and a lakehouse?
A data lake is raw object storage, usually full of files in various formats with no transactional guarantees. A lakehouse adds a table format layer (like Delta Lake or Iceberg) that gives you ACID transactions, schema enforcement, and reliable queries on top of that same storage. The lake becomes queryable in a warehouse-like way without moving the data.

Do I have to use Databricks to build a lakehouse?
No. Databricks popularized the term, but the architecture works with many tools. You can use Apache Spark on Amazon EMR, Trino on self-managed clusters, or Amazon Athena with Iceberg tables. The key components are object storage, a table format, a catalog, and a query engine. The specific products are largely interchangeable.

How does a lakehouse handle schema changes?
Both Delta Lake and Apache Iceberg support schema evolution. You can add columns, rename columns, or widen certain column types (for example, an integer to a larger integer) without rewriting existing data files, although the exact set of supported changes depends on the format and its version. The table format stores schema history in its metadata layer, so older queries can still read older data correctly.
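
One way to see why a rename or an added column does not force a rewrite is to track columns by a stable ID rather than by name, loosely modeled on the field IDs Iceberg uses. The sketch below is a simplified illustration with made-up schemas and rows, not either format's actual reader.

```python
# Schema versions map a stable field ID to the current column name.
SCHEMA_V1 = {1: "user_id", 2: "form"}
SCHEMA_V2 = {1: "user_id", 2: "template", 3: "device"}  # field 2 renamed, field 3 added

# Data files store values keyed by field ID, so they never mention column names.
OLD_FILE = [{1: 42, 2: "contact"}]                # written under SCHEMA_V1
NEW_FILE = [{1: 43, 2: "survey", 3: "mobile"}]    # written under SCHEMA_V2


def read(file_rows, schema):
    """Project stored rows onto a schema; fields missing from the file become None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in file_rows
    ]


print(read(OLD_FILE, SCHEMA_V2))
print(read(NEW_FILE, SCHEMA_V2))
```

The old file is untouched, yet it reads correctly under the new schema: the renamed column resolves through its ID, and the added column comes back empty.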

Is a lakehouse suitable for real-time analytics?
It depends on your definition of real-time. With a streaming pipeline writing to Delta or Iceberg tables every one to five minutes and an efficient query engine, you can achieve near-real-time latency. True millisecond latency still needs a dedicated OLAP database like Apache Druid or ClickHouse alongside the lakehouse.

What does it cost to run a lakehouse?
Storage costs are low, roughly $0.02 to $0.03 per GB per month on AWS or GCP. Compute costs vary widely based on how often you run queries and how large your jobs are. A small team running a managed lakehouse on Databricks might spend $500 to $2,000 per month. Self-managed on open source tools can be cheaper in licensing but more expensive in engineering time.
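
A back-of-the-envelope estimate shows how storage and compute combine. The inputs below (5 TB stored, 120 compute hours a month, $6 per compute hour) are illustrative assumptions, not quotes from any vendor. Only the per-GB storage price is taken from the range above.

```python
def monthly_cost(storage_gb, price_per_gb, compute_hours, price_per_compute_hour):
    """Split a monthly bill into its storage and compute parts."""
    storage = storage_gb * price_per_gb
    compute = compute_hours * price_per_compute_hour
    return {"storage": storage, "compute": compute, "total": storage + compute}


# Assumed workload: 5,000 GB stored, 120 cluster hours per month.
estimate = monthly_cost(
    storage_gb=5_000,
    price_per_gb=0.025,          # midpoint of the $0.02 to $0.03 range
    compute_hours=120,           # hypothetical usage
    price_per_compute_hour=6.0,  # hypothetical blended rate
)

for part, dollars in estimate.items():
    print(f"{part}: ${dollars:,.2f}")
```

With these inputs compute is several times larger than storage, which is the usual shape of a lakehouse bill and the reason to budget for compute hours first.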

Bottom Line

A lakehouse is a practical answer to a real problem: you have more data than a single database can hold, you need analysts and data scientists to work from the same source, and you do not want to maintain two separate systems or pay for data to live in two places. By sitting a table format like Delta Lake or Apache Iceberg on top of cheap object storage, you get reliable SQL queries, schema management, and time travel without copying your data into a warehouse. It is not magic, and it is not the right choice for every team, but for companies that have hit the limits of a warehouse-only setup, it is worth understanding before you commit to a more complex architecture. Start by exploring the data skills resources on this site, including the tool round-ups for modern data stacks and the beginner guide to data warehousing.