The Modern Data Stack: From Data Lake to Data Warehouse
The modern data stack (MDS) is a cloud-native architecture that enables organizations to collect, process, store, transform, and analyze data at scale. It relies heavily on modular, scalable, managed components—each specializing in one stage of the data lifecycle.
1. Data Sources (Operational Systems)
Data begins in operational or external systems such as:
Application databases (PostgreSQL, MySQL, MongoDB)
SaaS tools (Salesforce, Stripe, Shopify, HubSpot)
Logs and events (website events, IoT sensor data)
Streaming sources (Kafka, Kinesis, Pub/Sub)
These systems generate raw, fragmented data that may be structured, semi-structured, or unstructured.
2. Ingestion Layer (ETL/ELT Pipelines)
The ingestion layer loads raw data from sources into the storage tier.
Modern Tools
Batch ingestion: Fivetran, Stitch, Hevo
Streaming ingestion: Kafka, Kinesis, Pub/Sub
Open-source and custom ingestion: Airbyte, Apache NiFi, custom API pipelines
Modern Pattern: ELT vs ETL
ETL (Extract → Transform → Load): Transform data before loading.
ELT (Extract → Load → Transform): Load raw data into cloud storage or the warehouse first, then transform it there. ELT dominates in the MDS because cloud warehouses offer cheap, scalable compute, so raw data can be loaded as-is and reshaped later with SQL.
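As a rough illustration of the ELT pattern, the sketch below loads records untouched and only then cleans them with SQL. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse; the table names and sample records are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as an ingestion tool might deliver them.
raw_events = [
    {"order_id": 1, "amount": "19.99", "status": "PAID"},
    {"order_id": 2, "amount": "5.00", "status": "refunded"},
    {"order_id": 3, "amount": "42.50", "status": "paid"},
]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Extract + Load: copy the data in as-is, everything stored as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(str(e["order_id"]), e["amount"], e["status"]) for e in raw_events],
)

# Transform: runs inside the "warehouse" after loading.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           LOWER(status)             AS status
    FROM raw_orders
    """
)

paid_total = conn.execute(
    "SELECT ROUND(SUM(amount), 2) FROM orders WHERE status = 'paid'"
).fetchone()[0]
print(paid_total)
```

In an ETL pipeline the casting and lowercasing would happen in application code before the insert, and the raw table would not exist in the warehouse at all.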
3. Data Lake (Raw Storage Layer)
A data lake stores all raw data—structured, semi-structured, unstructured—cheaply and at massive scale.
Common Storage
Amazon S3
Azure Data Lake Storage (ADLS)
Google Cloud Storage (GCS)
File formats
Parquet
ORC
JSON
Avro
CSV (less preferred)
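One likely reason CSV ranks lower is that it carries no type information, while formats such as Parquet and Avro store a schema alongside the data. The sketch below shows the effect using only the Python standard library, with JSON standing in for a type-preserving format; the sample record is hypothetical.

```python
import csv
import io
import json

records = [{"sensor_id": 7, "reading": 21.5, "ok": True}]

# Round-trip through CSV: every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sensor_id", "reading", "ok"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
from_csv = list(csv.DictReader(buf))

# Round-trip through JSON: numbers and booleans keep their types.
from_json = json.loads(json.dumps(records))

print(from_csv[0])
print(from_json[0])
```

Every consumer of the CSV file has to guess or be told that "7" is an integer and "True" is a boolean, which is a common source of pipeline bugs.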
Purpose of the Data Lake
Central source of truth for all raw data
Long-term archive
Feeding machine learning pipelines
Staging area before warehousing
4. Lakehouse Evolution (Combining Lake + Warehouse)
Modern systems like Databricks, Snowflake, and BigQuery blur the line between lakes and warehouses.
A lakehouse uses the data lake for storage but adds:
ACID transactions
Schema enforcement
Query optimization
Table versioning
Governance
Technologies that enable this:
Delta Lake
Apache Iceberg
Apache Hudi
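To make schema enforcement, atomic commits, and table versioning more concrete, here is a toy in-memory sketch. The ToyTable class is hypothetical and is not the API of Delta Lake, Iceberg, or Hudi; those formats provide comparable guarantees over files in object storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory table; not the API of any real table format."""

    schema: dict  # column name -> expected Python type
    versions: list = field(default_factory=lambda: [[]])  # one snapshot per commit

    def append(self, rows):
        # Schema enforcement: validate every row before committing anything.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col!r} must be {typ.__name__}")
        # Atomic commit: the new snapshot appears all at once or not at all.
        self.versions.append(self.versions[-1] + list(rows))

    def read(self, version=-1):
        # Table versioning: earlier snapshots stay readable.
        return self.versions[version]


table = ToyTable(schema={"id": int, "name": str})
table.append([{"id": 1, "name": "Ada"}])
table.append([{"id": 2, "name": "Grace"}])

try:
    # The second row has a string id, so the whole batch is rejected.
    table.append([{"id": 3, "name": "Linus"}, {"id": "4", "name": "bad"}])
except TypeError as err:
    print("rejected:", err)

print(len(table.read()))  # current snapshot
print(table.read(1))      # snapshot after the first commit
```

The failed batch leaves no partial data behind, and the snapshot from the first commit can still be read after later writes.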
5. Orchestration Layer
Orchestration defines when and how data pipelines run.
Common Tools
Airflow
Dagster
Prefect
These tools coordinate ingestion, transformations, and downstream tasks.
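At their core, orchestrators resolve a dependency graph and run tasks in a valid order. The sketch below shows that idea with Python's standard graphlib module (Python 3.9+). The task names are hypothetical, and real tools add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

run_log = []


def make_task(name):
    """Return a placeholder task that only records that it ran."""
    def task():
        run_log.append(name)
    return task


task_names = ["ingest_orders", "ingest_customers", "transform", "refresh_dashboard"]
tasks = {name: make_task(name) for name in task_names}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "transform": {"ingest_orders", "ingest_customers"},
    "refresh_dashboard": {"transform"},
}

# static_order() yields every task after all of its prerequisites.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

The two ingestion tasks may run in either order, but the transformation always waits for both, and the dashboard refresh always runs last.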
6. Transformation Layer (Data Modeling)
Transformation converts raw data into well-structured, analytics-ready tables.
Modern Transformation Tools
dbt (most widely used)
Dataform
SQL-based transformations within warehouse engines
Layers in a Transform Model (Typical dbt Layout)
Staging layer (raw → cleaned): standardized names, type cleanup, deduplication
Core models (business entities): customers, orders, transactions
Marts (subject-area analytics): marketing models, finance models, product analytics models
These produce curated, trustworthy datasets.
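The layering can be sketched as a chain of SQL views, each reading only from the layer beneath it. This example assumes SQLite as a stand-in warehouse, and all table, view, and column names are hypothetical. In dbt each SELECT would typically live in its own model file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw data as loaded: inconsistent names, text types, one duplicate row.
    CREATE TABLE raw_orders (ID TEXT, CustomerEmail TEXT, Amt TEXT);
    INSERT INTO raw_orders VALUES
        ('1', 'A@Example.com', '10.00'),
        ('1', 'A@Example.com', '10.00'),
        ('2', 'b@example.com', '25.50'),
        ('3', 'a@example.com', '4.50');

    -- Staging: standardized names, type cleanup, deduplication.
    CREATE VIEW stg_orders AS
    SELECT DISTINCT
        CAST(ID AS INTEGER)  AS order_id,
        LOWER(CustomerEmail) AS customer_email,
        CAST(Amt AS REAL)    AS amount
    FROM raw_orders;

    -- Core: one row per business entity, in this case a customer.
    CREATE VIEW customers AS
    SELECT customer_email,
           COUNT(*)    AS order_count,
           SUM(amount) AS lifetime_value
    FROM stg_orders
    GROUP BY customer_email;

    -- Mart: a subject-area summary for finance reporting.
    CREATE VIEW finance_revenue AS
    SELECT COUNT(*) AS customers, SUM(lifetime_value) AS total_revenue
    FROM customers;
    """
)

customers = conn.execute(
    "SELECT * FROM customers ORDER BY customer_email"
).fetchall()
revenue = conn.execute("SELECT * FROM finance_revenue").fetchone()
print(customers)
print(revenue)
```

Because the mart never touches the raw table directly, a fix in the staging view (for example, a new deduplication rule) flows through to every downstream model.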
7. Data Warehouse (Analytics Storage and Compute Layer)
The data warehouse is where transformed, structured, query-optimized data lives.
Common Cloud Warehouses
Snowflake
Google BigQuery
Amazon Redshift
Databricks SQL Warehouse
Key Features
Columnar storage
High-performance SQL queries
Automatic scaling
Separation of storage and compute
BI-friendly architecture
The warehouse is the source of truth for analytics.
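Columnar storage is the feature that most directly speeds up analytical queries. The toy sketch below contrasts the two layouts with plain Python structures and hypothetical sample data; real warehouses apply the same idea to compressed files on disk.

```python
# Row layout: the fields of each record are stored together.
rows = [
    {"order_id": 1, "country": "DE", "amount": 10.0},
    {"order_id": 2, "country": "US", "amount": 25.5},
    {"order_id": 3, "country": "DE", "amount": 4.5},
]

# Column layout: the values of each column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: summing amounts means visiting every record.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the same sum reads one list and ignores the other columns.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Analytical queries usually aggregate a few columns over many rows, so an engine that can skip the unused columns reads far less data.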
8. Semantic Layer (Optional but Growing Trend)
The semantic layer defines:
Business metrics
Dimensions
Aggregations
Definitions shared across tools
Tools:
LookML (Looker)
MetricFlow
dbt Semantic Layer (successor to the dbt Metrics package, built on MetricFlow)
Cube
This prevents different teams and tools from ending up with multiple definitions of the same metric.
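The core idea can be sketched as a single metric registry that every consumer compiles its queries from. The registry format and the compile_metric helper below are hypothetical; LookML, MetricFlow, and Cube each use their own configuration formats.

```python
# Hypothetical metric registry: each metric is defined exactly once.
METRICS = {
    "revenue": {"table": "orders", "expression": "SUM(amount)"},
    "order_count": {"table": "orders", "expression": "COUNT(*)"},
}


def compile_metric(name, dimensions=()):
    """Build a SQL query for a metric, optionally grouped by dimensions."""
    metric = METRICS[name]
    select_cols = list(dimensions) + [f"{metric['expression']} AS {name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric['table']}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql


# A dashboard and a notebook asking for the same metric get identical SQL.
dashboard_sql = compile_metric("revenue", dimensions=["country"])
notebook_sql = compile_metric("revenue", dimensions=["country"])
print(dashboard_sql)
```

If the definition of revenue changes, for example to exclude refunds, only the registry entry is edited and every consumer picks up the new logic.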
9. BI & Analytics Layer
Dashboards, analytics, and operational insights come from the warehouse or lakehouse.
Popular BI Tools
Looker
Tableau
Power BI
Mode Analytics
Metabase
This layer supports:
Executive dashboards
Product analytics
Finance reporting
Marketing performance
Ad hoc analysis
10. Data Science, ML, and AI Layer
Data scientists build predictive models using warehouse/lake data.
Tools:
Databricks
SageMaker
Vertex AI
Snowpark
Python ecosystem (pandas, PySpark, scikit-learn)
Models consume clean, high-quality data curated from the warehouse or lakehouse.