Day 5: dbt Models¶

The very first sentence in dbt Models reminds me the days of Hibernate and ORMs in general where SELECTs were mapped to Java bean classes (for programmers to work with relational entities as if they were Java types) to build...a model. In dbt, it's no longer needed since it's all about pure SQL (!)

A model is a select statement.

Set Up dbt Project¶

$ pyenv shell dbt

$ dbt init
Enter a name for your project (letters, digits, underscore): hello_dbt_spark
Which database would you like to use?
[1] databricks
[2] postgres
[3] spark

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 3
...

$ tree hello_dbt_spark
hello_dbt_spark
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

7 directories, 5 files

Start Spark Thrift Server¶

$ cd $SPARK_HOME

$ ./sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-japila-new.local.out

$ tail -f logs/spark-jacek-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-japila-new.local.out
...
INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads
INFO AbstractService: Service:HiveServer2 is started.
INFO HiveThriftServer2: HiveThriftServer2 started

pyenv local dbt¶

Let's switch to the directory and make sure that dbt virtualenv is there forever.

$ cd hello_dbt_spark
$ pyenv local dbt

Just as a friendly reminder (to myself), the above pyenv command creates .python-version file with the name of the virtualenv.

$ cat .python-version
dbt

dbt debug¶

$ dbt debug
dbt version: 1.1.1
python version: 3.9.13
python path: /Users/jacek/.pyenv/versions/dbt/bin/python3.9
os info: macOS-11.6.6-x86_64-i386-64bit
Using profiles.yml file at /Users/jacek/.dbt/profiles.yml
Using dbt_project.yml file at /Users/jacek/dev/sandbox/hello_dbt_spark/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: localhost
  port: 10000
  cluster: None
  endpoint: None
  schema: analytics
  organization: 0
  Connection test: [OK connection ok]

All checks passed!

dbt run¶

From dbt Models:

When you execute the dbt run command, dbt will build this model in your data warehouse by wrapping it in a create view as or create table as statement.

$ dbt run
Running with dbt=1.1.1
Partial parse save file not found. Starting full parse.
Found 2 models, 4 tests, 0 snapshots, 0 analyses, 199 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics

Concurrency: 1 threads (target='dev')

1 of 2 START table model analytics.my_first_dbt_model .......................... [RUN]
1 of 2 OK created table model analytics.my_first_dbt_model ..................... [OK in 1.45s]
2 of 2 START view model analytics.my_second_dbt_model .......................... [RUN]
2 of 2 OK created view model analytics.my_second_dbt_model ..................... [OK in 0.19s]

Finished running 1 table model, 1 view model in 2.54s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

beeline¶

$ cd $SPARK_HOME

$ ./bin/beeline -u jdbc:hive2://localhost:10000/analytics
Connecting to jdbc:hive2://localhost:10000/analytics
Connected to: Spark SQL (version 3.3.0)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://localhost:10000/analytics> show tables;
+------------+----------------------+--------------+
| namespace  |      tableName       | isTemporary  |
+------------+----------------------+--------------+
| analytics  | my_first_dbt_model   | false        |
| analytics  | my_second_dbt_model  | false        |
+------------+----------------------+--------------+
2 rows selected (0.119 seconds)
0: jdbc:hive2://localhost:10000/analytics> show views;
+------------+----------------------+--------------+
| namespace  |       viewName       | isTemporary  |
+------------+----------------------+--------------+
| analytics  | my_second_dbt_model  | false        |
+------------+----------------------+--------------+
1 row selected (0.048 seconds)

How can I see the SQL that dbt is running?¶

From FAQs:

The target/compiled/ directory for compiled select statements
The target/run/ directory for compiled create statements
The logs/dbt.log file for verbose logging.

Stop Spark Thrift Server¶

$ cd $SPARK_HOME

$ ./sbin/stop-thriftserver.sh
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2