Skip to content

Day 7: dbt Fundamentals Course (part 1)

The following is the notes from the dbt Fundamentals course.

ETL vs ELT

From IBM Cloud Learn Hub:

  1. ETL (Extract, Transform, Load) is a process that extracts, transforms, and loads data from multiple sources to a data warehouse (or other unified data repository).
  2. ETL provides the foundation for data analytics and machine learning workstreams.
  3. ETL is often used by an organization to:
    • Extract data from legacy systems
    • Cleanse (transform) the data to improve data quality and establish consistency
    • Load data into a target database
  4. ELT copies or exports the data from the source locations, but instead of loading it to a staging area for transformation, it loads the raw data directly to the target data store to be transformed as needed.
    1. ELT is particularly useful for high-volume, unstructured datasets as loading can occur directly from the source.
    2. ELT can be more ideal for big data management since it doesn’t need much upfront planning for data extraction and storage.
  5. ELT has become increasingly more popular with the adoption of cloud databases

New Terms

  1. analytics engineer and analytics engineering
    1. Someone who is between data engineer and data analyst roles
  2. modern data stack and modern data team (I'm still suspicious about their goal and reason to exist)
  3. Cloud Data Warehouse
    1. Combine a database and super computer for transforming data
    2. No need for an extensive administration
    3. A game-changer for analytics workflow (since all the raw data is already in data lake)
    4. Scalable compute and storage

dbt

  1. Focuses on T(ransformation)
    1. Data is already in a warehouse / data lake and you just transform data (between stages)
    2. That looks so similar to Databricks / Delta Lake's Medallion Architecture
    3. Does this mean that dbt aims at replacing Spark SQL (that in turn aims mostly at developers who know Scala, Python, Java leaving SQL as an option)?
    4. It makes a lot of sense for Databricks to support dbt (via dbt-databricks) since Databricks (based on Spark SQL) can execute SQL just fine

dbt on Slack

dbt Community