Day 7: dbt Fundamentals Course (part 1)
The following are notes from the dbt Fundamentals course.
ETL vs ELT
From IBM Cloud Learn Hub:
- ETL (Extract, Transform, Load) is a process that extracts, transforms, and loads data from multiple sources to a data warehouse (or other unified data repository).
- ETL provides the foundation for data analytics and machine learning workstreams.
- ETL is often used by an organization to:
  - Extract data from legacy systems
  - Cleanse (transform) the data to improve data quality and establish consistency
  - Load data into a target database
- ELT copies or exports the data from the source locations, but instead of loading it to a staging area for transformation, it loads the raw data directly to the target data store to be transformed as needed.
- ELT is particularly useful for high-volume, unstructured datasets as loading can occur directly from the source.
- ELT can be better suited to big data management since it doesn’t need much upfront planning for data extraction and storage.
- ELT has become increasingly popular with the adoption of cloud databases.
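To make the difference concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The sample rows, the table names and the `clean` helper are made up for illustration. In the ETL half the transformation happens in application code before loading; in the ELT half the raw rows are loaded as-is and transformed with SQL inside the database.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
raw_orders = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "3.00"),
    ("3", "alice", "7.25"),
]


def clean(row):
    """Transform step (illustrative): cast types and normalize customer names."""
    order_id, customer, amount = row
    return int(order_id), customer.strip().lower(), float(amount)


# ETL: transform in application code first, then load only the cleaned rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
etl_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [clean(r) for r in raw_orders]
)

# ELT: load the raw rows untouched, then transform inside the database with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)
elt_db.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER)   AS id,
           LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_orders
    """
)

# Both routes end with the same cleaned table; only the place where the
# transformation runs differs.
query = "SELECT id, customer, amount FROM orders ORDER BY id"
etl_rows = etl_db.execute(query).fetchall()
elt_rows = elt_db.execute(query).fetchall()
print(etl_rows == elt_rows)
```

In the ELT variant the raw table stays available in the database, so a different transformation can be run later without going back to the source.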
New Terms
- analytics engineer and analytics engineering
  - Someone whose role sits between data engineer and data analyst
- modern data stack and modern data team (I'm still suspicious about their goal and reason to exist)
- Cloud Data Warehouse
  - Combines a database and a supercomputer for transforming data
  - No need for extensive administration
  - A game-changer for the analytics workflow (since all the raw data is already in the data lake)
  - Scalable compute and storage
dbt
- Focuses on T(ransformation)
- The data is already in a warehouse / data lake and you just transform it (between stages)
- That looks very similar to Databricks / Delta Lake's Medallion Architecture
- Does this mean that dbt aims at replacing Spark SQL (which in turn mostly targets developers who know Scala, Python, or Java, leaving SQL as an option)?
- It makes a lot of sense for Databricks to support dbt (via dbt-databricks) since Databricks (based on Spark SQL) can execute SQL just fine
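The "transform between stages" idea can be sketched without dbt itself. A dbt model is essentially a SELECT statement that gets materialized in the warehouse (for example as a view or a table). The toy runner below imitates that with Python's built-in `sqlite3`: it is not dbt's API, and the model names (`stg_orders`, `customer_totals`), the `raw_orders` table and the `run_models` helper are all hypothetical.

```python
import sqlite3

# Hypothetical "models": each one is just a named SELECT statement.
# A later model reads from an earlier one (raw -> staging -> aggregate),
# loosely mirroring layered stages.
models = {
    "stg_orders": """
        SELECT CAST(id AS INTEGER)   AS order_id,
               LOWER(TRIM(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
    """,
    "customer_totals": """
        SELECT customer, SUM(amount) AS total_amount
        FROM stg_orders
        GROUP BY customer
    """,
}


def run_models(conn, models):
    """Materialize each SELECT as a view.

    This toy runner relies on the dict's insertion order being a valid
    dependency order; it does not resolve dependencies itself.
    """
    for name, select_sql in models.items():
        conn.execute(f"CREATE VIEW {name} AS {select_sql}")


# The raw data is assumed to be loaded already (the E and L of ELT).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", " alice ", "10.50"), ("2", "BOB", "3.00"), ("3", "alice", "7.25")],
)

run_models(conn, models)
totals = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```

Everything after the initial load is plain SQL executed by the database, which is why any engine that runs SQL well could in principle sit underneath such a tool.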