# Day 5: dbt Models
The very first sentence in dbt Models reminds me of the days of Hibernate and ORMs in general, where SELECTs were mapped to Java bean classes (so programmers could work with relational entities as if they were Java types) to build...a model. In dbt, no such mapping is needed since it's all about pure SQL (!)

> A model is a `select` statement.
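That's really all there is to it: a model is a `.sql` file whose body is a single `select`. A minimal sketch (hypothetical file and table names, not part of this project):

```sql
-- models/orders.sql (hypothetical model; dbt derives the model name from the file name)
select
    order_id,
    amount
from raw_orders  -- assumed source table
```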
## Set Up dbt Project
```console
$ pyenv shell dbt
$ dbt init
Enter a name for your project (letters, digits, underscore): hello_dbt_spark
Which database would you like to use?
[1] databricks
[2] postgres
[3] spark
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 3
...
```
```console
$ tree hello_dbt_spark
hello_dbt_spark
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

7 directories, 5 files
```
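The two files under `models/example/` are the stock models that `dbt init` scaffolds. From memory, `my_first_dbt_model.sql` looks roughly like this, so treat it as a sketch rather than a verbatim copy:

```sql
-- models/example/my_first_dbt_model.sql (sketch of the dbt init scaffold)
{{ config(materialized='table') }}

with source_data as (

    select 1 as id
    union all
    select null as id

)

select *
from source_data
```

The `config` block at the top overrides the project-level materialization, which is why this one model becomes a table while the other stays a view.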
## Start Spark Thrift Server
```console
$ cd $SPARK_HOME
$ ./sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-japila-new.local.out
$ tail -f logs/spark-jacek-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-japila-new.local.out
...
INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads
INFO AbstractService: Service:HiveServer2 is started.
INFO HiveThriftServer2: HiveThriftServer2 started
```
## pyenv local dbt
Let's switch to the project directory and pin the `dbt` virtualenv to it for good.
```console
$ cd hello_dbt_spark
$ pyenv local dbt
```
Just as a friendly reminder (to myself), the above `pyenv` command creates a `.python-version` file with the name of the virtualenv.
```console
$ cat .python-version
dbt
```
## dbt debug
```console
$ dbt debug
dbt version: 1.1.1
python version: 3.9.13
python path: /Users/jacek/.pyenv/versions/dbt/bin/python3.9
os info: macOS-11.6.6-x86_64-i386-64bit
Using profiles.yml file at /Users/jacek/.dbt/profiles.yml
Using dbt_project.yml file at /Users/jacek/dev/sandbox/hello_dbt_spark/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: localhost
  port: 10000
  cluster: None
  endpoint: None
  schema: analytics
  organization: 0
  Connection test: [OK connection ok]

All checks passed!
```
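The connection values `dbt debug` reports come from the `dbt` profile in `~/.dbt/profiles.yml` (the path shown above). A sketch of what that profile could look like for this setup, assuming the `thrift` connection method of the dbt-spark adapter and a profile name matching the project:

```yaml
# ~/.dbt/profiles.yml (sketch; host/port/schema match the dbt debug output above)
hello_dbt_spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift       # talk to the Spark Thrift Server started earlier (assumption)
      host: localhost
      port: 10000
      schema: analytics    # dbt creates models in this schema
      threads: 1
```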
## dbt run
From dbt Models:

> When you execute the `dbt run` command, dbt will build this model in your data warehouse by wrapping it in a `create view as` or `create table as` statement.
```console
$ dbt run
Running with dbt=1.1.1
Partial parse save file not found. Starting full parse.
Found 2 models, 4 tests, 0 snapshots, 0 analyses, 199 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics

Concurrency: 1 threads (target='dev')

1 of 2 START table model analytics.my_first_dbt_model .......................... [RUN]
1 of 2 OK created table model analytics.my_first_dbt_model ..................... [OK in 1.45s]
2 of 2 START view model analytics.my_second_dbt_model .......................... [RUN]
2 of 2 OK created view model analytics.my_second_dbt_model ..................... [OK in 0.19s]

Finished running 1 table model, 1 view model in 2.54s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
```
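Per the quote above, dbt wrapped each model's `select` in DDL before sending it to Spark. Roughly (a sketch, not the verbatim statements; the real ones land in `target/run/`):

```sql
-- sketch: dbt wraps a table-materialized model in create table as
create table analytics.my_first_dbt_model
as
select ...  -- the compiled select of my_first_dbt_model.sql

-- and a view-materialized model in create view as
create view analytics.my_second_dbt_model
as
select ...  -- the compiled select of my_second_dbt_model.sql
```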
## beeline
```console
$ cd $SPARK_HOME
$ ./bin/beeline -u jdbc:hive2://localhost:10000/analytics
Connecting to jdbc:hive2://localhost:10000/analytics
Connected to: Spark SQL (version 3.3.0)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://localhost:10000/analytics> show tables;
+------------+----------------------+--------------+
| namespace  | tableName            | isTemporary  |
+------------+----------------------+--------------+
| analytics  | my_first_dbt_model   | false        |
| analytics  | my_second_dbt_model  | false        |
+------------+----------------------+--------------+
2 rows selected (0.119 seconds)
0: jdbc:hive2://localhost:10000/analytics> show views;
+------------+----------------------+--------------+
| namespace  | viewName             | isTemporary  |
+------------+----------------------+--------------+
| analytics  | my_second_dbt_model  | false        |
+------------+----------------------+--------------+
1 row selected (0.048 seconds)
```
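From here on the models behave like any other table or view, e.g. (run inside the beeline session above):

```sql
-- query the freshly-built model from beeline
SELECT * FROM analytics.my_first_dbt_model;
```

With the stock example model, this should return two rows (ids `1` and `null`), since that is what the scaffolded `select` produces.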
## How can I see the SQL that dbt is running?
From FAQs:

- The `target/compiled/` directory for compiled `select` statements
- The `target/run/` directory for compiled `create` statements
- The `logs/dbt.log` file for verbose logging
## Stop Spark Thrift Server
```console
$ cd $SPARK_HOME
$ ./sbin/stop-thriftserver.sh
stopping org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
```