
Day 9: Data Lakehouse and Cloud Data Warehouse

Today is a bit of a departure from the previous two days, which I spent on the dbt Fundamentals course (Day 7 and Day 8). It's a reading day!

I was kind of surprised to learn that a data lakehouse and a cloud data warehouse are not really the same concept!

The key takeaway of the article Don’t Let a Cloud Data Warehouse Bottleneck your Machine Learning by Jason Pohl is:

Cloud data warehouses were not designed to perform machine learning [...] as slow, expensive, and require multiple governance layers

  1. Cloud data warehouses (CDW) export large amounts of data directly to cloud object storage (trying to mitigate the bottleneck of accessing large datasets, e.g. while training an ML model)
    • leverage many nodes of a CDW to copy data in parallel from its proprietary internal format to an open format, like Apache Parquet, onto cloud storage
    • data can then be ingested in parallel to perform model training
  2. A data lakehouse stores data once and directly on cloud storage in an open format
    • the data is then ingested directly from there to perform model training
    • the data is written once and can be read many times without the need to copy it.
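
To make the difference concrete for myself, here is a toy sketch of the two flows above. It's not from the article and it's nowhere near how a real warehouse or lakehouse works: everything runs on local temp directories, CSV stands in for Apache Parquet, pickle stands in for the warehouse's proprietary internal format, and all the names (`warehouse_flow`, `lakehouse_flow`, `compare`) are made up for illustration. The only thing it shows is how many copies of the data end up stored before training can read them.

```python
import csv
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Toy dataset, pre-split into partitions (hypothetical stand-in for warehouse nodes).
ROWS = [{"id": i, "feature": i * 0.5} for i in range(8)]
PARTITIONS = [ROWS[:4], ROWS[4:]]


def write_open(path, rows):
    """Write rows in an open format (CSV here, as a stand-in for Parquet)."""
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "feature"])
        writer.writeheader()
        writer.writerows(rows)


def read_open(path):
    """Read one open-format file back, as a training job would."""
    with path.open(newline="") as f:
        return [{"id": int(r["id"]), "feature": float(r["feature"])}
                for r in csv.DictReader(f)]


def read_all(paths):
    """Ingest all files in parallel and flatten them into one list of rows."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(read_open, paths))
    return [row for part in parts for row in part]


def count_files(root):
    """Count every stored file under root, i.e. every physical copy of a partition."""
    return sum(1 for p in root.rglob("*") if p.is_file())


def warehouse_flow(root):
    """Approach 1: data sits in an internal format and is exported before training."""
    internal, export = root / "internal", root / "export"
    internal.mkdir()
    export.mkdir()
    for i, part in enumerate(PARTITIONS):
        # pickle is only a stand-in for a proprietary internal format
        (internal / f"part-{i}.bin").write_bytes(pickle.dumps(part))

    def export_partition(i):
        # copy one partition from the internal format to the open format
        rows = pickle.loads((internal / f"part-{i}.bin").read_bytes())
        out = export / f"part-{i}.csv"
        write_open(out, rows)
        return out

    with ThreadPoolExecutor() as pool:  # partitions are exported in parallel
        exported = list(pool.map(export_partition, range(len(PARTITIONS))))
    return read_all(exported), count_files(root)


def lakehouse_flow(root):
    """Approach 2: data is written once in the open format and read in place."""
    paths = []
    for i, part in enumerate(PARTITIONS):
        path = root / f"part-{i}.csv"
        write_open(path, part)
        paths.append(path)
    read_all(paths)         # first training run
    rows = read_all(paths)  # second training run, still no extra copy
    return rows, count_files(root)


def compare():
    """Run both flows in throwaway directories and report rows and stored files."""
    with tempfile.TemporaryDirectory() as wh, tempfile.TemporaryDirectory() as lake:
        wh_rows, wh_files = warehouse_flow(Path(wh))
        lake_rows, lake_files = lakehouse_flow(Path(lake))
    return {"warehouse_rows": wh_rows, "warehouse_files": wh_files,
            "lakehouse_rows": lake_rows, "lakehouse_files": lake_files}


if __name__ == "__main__":
    print(compare())
```

Both flows hand the same rows to "training", but the warehouse flow stores every partition twice (internal format plus exported copy), while the lakehouse flow stores it once and just reads it again for each run.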

There's more, but I didn't mean to copy the whole article (and claim it's mine 😏).