Day 9: Data Lakehouse and Cloud Data Warehouse
Today is a bit of a departure from the previous two days, which were spent on the dbt Fundamentals course (Day 7 and Day 8). It's a reading day!
I'm kind of surprised to learn that a Data Lakehouse and a Cloud Data Warehouse are not really the same concept!
The key takeaway of the article "Don’t Let a Cloud Data Warehouse Bottleneck your Machine Learning" by Jason Pohl is:
Cloud data warehouses were not designed to perform machine learning [...] as slow, expensive, and require multiple governance layers
- Cloud data warehouses (CDWs) export large amounts of data directly to cloud object storage, trying to mitigate the bottleneck of accessing large datasets (e.g. while training an ML model):
  - many nodes of the CDW are used to copy data in parallel from its proprietary internal format to an open format, like Apache Parquet, onto cloud storage
  - the data can then be ingested in parallel to perform model training
- A data lakehouse stores data once, directly on cloud storage, in an open format:
  - model training then ingests the data directly from there
  - the data is written once and can be read many times without the need to copy it.
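
To make the difference between the two flows a bit more concrete for myself, here is a minimal sketch (mine, not from the article). It uses only the Python standard library, and several things are assumptions for illustration: a temporary directory stands in for cloud object storage, JSON Lines files stand in for an open format like Apache Parquet, and the function names (`parallel_export`, `parallel_ingest`, `demo`) and the toy table are hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def export_partition(rows, path):
    """Write one partition of rows to a line-delimited open format (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def read_partition(path):
    """Read one partition file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def parallel_export(table, storage_dir, n_partitions=4):
    """Warehouse-style step: copy the table out in parallel, one file per partition."""
    partitions = [table[i::n_partitions] for i in range(n_partitions)]
    paths = [Path(storage_dir) / f"part-{i:05d}.jsonl" for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return list(pool.map(export_partition, partitions, paths))


def parallel_ingest(paths):
    """Training-side step: read all partition files in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        chunks = list(pool.map(read_partition, paths))
    return [row for chunk in chunks for row in chunk]


def demo():
    # Hypothetical "warehouse table"; in a real CDW it would sit in a proprietary format.
    table = [{"id": i, "feature": i * 0.5, "label": i % 2} for i in range(1000)]
    with tempfile.TemporaryDirectory() as storage:
        # Warehouse flow: an extra copy is written before training can read anything.
        paths = parallel_export(table, storage)
        first_run = parallel_ingest(paths)
        # Lakehouse flow: the files already live on storage, so a later run only reads.
        second_run = parallel_ingest(paths)
    return len(paths), len(first_run), len(second_run)


if __name__ == "__main__":
    print(demo())
```

The export step is the part a lakehouse skips: if the data is already stored in the open format, every training run starts at `parallel_ingest`.
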
There's more but I didn't mean to copy the whole article (and claim it's mine 😏).