Data Lakes, Warehouses & Lakehouses
Over the past decade, data architectures have moved from warehouse-only designs towards data lakes and, more recently, lakehouses. Understanding how the three differ is essential for designing systems that scale.
Data Warehouse
A data warehouse stores structured, processed data optimised for analytics and reporting. Think Snowflake, BigQuery, Amazon Redshift, or Azure Synapse Analytics. Warehouses use a schema-on-write approach: data must be modelled, with tables, columns, and types defined, before it's loaded. Because every record is validated against that model at load time, queries are fast and results are consistent, but the model is rigid, and frequent schema changes mean reworking both the tables and the pipelines that feed them.
Best for: Business intelligence dashboards, structured reporting, scenarios where data quality and consistency are paramount.
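To make schema-on-write concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The `orders` table, its columns, and the `load_orders` helper are illustrative assumptions, not any vendor's API; the point is only that rows which don't fit the predefined model are rejected at load time.

```python
import sqlite3


def load_orders(rows):
    """Load rows into a table whose schema is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    # The model comes first: column names, types, and constraints.
    conn.execute(
        """
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer TEXT NOT NULL,
            amount   REAL NOT NULL CHECK (amount >= 0)
        )
        """
    )
    accepted, rejected = 0, []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
                (row.get("order_id"), row.get("customer"), row.get("amount")),
            )
            accepted += 1
        except sqlite3.IntegrityError as exc:
            # The row violates the schema, so it never reaches the table.
            rejected.append((row, str(exc)))
    conn.commit()
    return conn, accepted, rejected


# Hypothetical sample data: one valid row and two that break the model.
sample = [
    {"order_id": 1, "customer": "acme", "amount": 10.0},
    {"order_id": 2, "customer": None, "amount": 5.0},   # missing customer
    {"order_id": 3, "customer": "beta", "amount": -1},  # negative amount
]
conn, accepted, rejected = load_orders(sample)
print(accepted, len(rejected))  # 1 2
```

Everything that does make it into the table is guaranteed to match the model, which is what lets downstream dashboards trust the data without re-checking it.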
Data Lake
A data lake stores raw data in its native format, whether structured, semi-structured, or unstructured. Built on object storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage, lakes are cheap and highly scalable. Schema-on-read means you load data first and apply structure only when you read it, so different consumers can interpret the same raw files in different ways.
Best for: Machine learning, data science exploration, storing massive volumes of raw log data, scenarios where schema flexibility is important.
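Schema-on-read can be sketched in a few lines of plain Python. The raw events below stand in for files sitting in object storage; the field names (`user`, `event`, `ms`) and the `read_clicks` helper are assumptions made up for this illustration. Nothing is validated when the data lands, and the reader decides which fields it needs and how to coerce them.

```python
import json

# Raw events are stored exactly as they arrived, inconsistencies included.
RAW_EVENTS = [
    '{"user": "ana", "event": "click", "ms": 120}',
    '{"user": "ben", "event": "view"}',
    '{"user": "cy", "event": "click", "ms": "95", "extra": {"ab": "B"}}',
]


def read_clicks(raw_lines):
    """Apply a schema at read time: pick fields, coerce types, skip misfits."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("event") != "click":
            continue
        try:
            yield {"user": str(record["user"]), "ms": int(record["ms"])}
        except (KeyError, ValueError, TypeError):
            # This reader cannot interpret the record; another reader might.
            continue


clicks = list(read_clicks(RAW_EVENTS))
print(clicks)  # [{'user': 'ana', 'ms': 120}, {'user': 'cy', 'ms': 95}]
```

The trade-off is visible here: loading never fails, but every reader carries its own parsing and validation logic.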
Data Lakehouse
The lakehouse combines the flexibility of a data lake with the reliability and performance of a warehouse. It adds ACID transactions, schema enforcement, and indexing on top of object storage, typically through a metadata layer that tracks which data files make up each table. Databricks pioneered this architecture, and open-source table formats like Apache Iceberg, Delta Lake, and Apache Hudi make it broadly accessible.
Best for: Organisations that want a single platform for BI, ML, and data engineering without maintaining separate systems.
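The idea of layering transactions and schema enforcement over plain files can be illustrated with a toy model. This is a deliberately simplified sketch and not the actual protocol of Delta Lake, Iceberg, or Hudi: the `ToyTable` class, the `_log` directory, and the `SCHEMA` dictionary are all hypothetical, and a local directory stands in for object storage. Data files are written first, and a batch only becomes visible once a numbered entry is committed to an append-only log.

```python
import json
import os
import tempfile
import uuid

SCHEMA = {"user": str, "ms": int}  # hypothetical table schema


class ToyTable:
    """Toy table: data files plus an append-only commit log in one directory."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def append(self, rows):
        # Schema enforcement: reject the whole batch before anything is written.
        for row in rows:
            if set(row) != set(SCHEMA) or not all(
                isinstance(row[key], typ) for key, typ in SCHEMA.items()
            ):
                raise ValueError(f"row does not match schema: {row!r}")
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        # Commit: mode "x" fails if another writer already claimed this version,
        # so two writers can never both commit the same log entry.
        version = len(os.listdir(self.log_dir))
        with open(os.path.join(self.log_dir, f"{version:05d}.json"), "x") as f:
            json.dump({"add": data_name}, f)
        return version

    def read(self):
        # Readers only see data files referenced by a committed log entry.
        rows = []
        for entry in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, entry)) as f:
                data_name = json.load(f)["add"]
            with open(os.path.join(self.root, data_name)) as f:
                rows.extend(json.load(f))
        return rows


with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    table.append([{"user": "ana", "ms": 120}])
    try:
        table.append([{"user": "ben", "ms": "slow"}])  # wrong type for "ms"
    except ValueError:
        pass  # the bad batch never becomes part of the table
    snapshot = table.read()
print(snapshot)  # [{'user': 'ana', 'ms': 120}]
```

Real table formats do far more (snapshots, conflict resolution, file statistics for pruning), and the atomic "create if absent" step depends on what the underlying object store supports, but the log-over-files shape is the part this sketch tries to convey.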
How to Choose
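There's no one-size-fits-all answer. As a rule of thumb, a warehouse suits structured reporting where consistency is paramount, a lake suits large volumes of raw data and exploratory work, and a lakehouse suits teams that want both on a single platform. Many organisations run a medallion architecture (bronze → silver → gold) inside a lakehouse, where raw data is ingested into a bronze layer, cleaned in silver, and aggregated for analytics in gold.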
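The bronze → silver → gold flow can be sketched with plain Python lists standing in for tables. The record fields (`country`, `amount`) and the `to_silver` and `to_gold` functions are assumptions invented for this example; in practice each layer would be a table in the lakehouse and each step a scheduled job.

```python
from collections import defaultdict


def to_silver(bronze):
    """Clean raw records: drop unusable rows, normalise casing and types."""
    silver = []
    for rec in bronze:
        try:
            silver.append(
                {
                    "country": rec["country"].strip().upper(),
                    "amount": float(rec["amount"]),
                }
            )
        except (KeyError, ValueError, AttributeError, TypeError):
            # Missing or malformed fields: the row stays in bronze only.
            continue
    return silver


def to_gold(silver):
    """Aggregate cleaned records into an analytics-ready summary."""
    totals = defaultdict(float)
    for rec in silver:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)


# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"country": " gb", "amount": "10.5"},
    {"country": "GB", "amount": 4.5},
    {"country": "de", "amount": "n/a"},  # unparseable amount
    {"amount": 3},                       # missing country
]
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver), gold)  # 2 {'GB': 15.0}
```

Keeping bronze untouched means a cleaning rule can be changed later and silver and gold rebuilt from the original records.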

