Delta Lake

Delta Lake is an open-source storage layer for Apache Spark with ACID transactions and time travel.

As it has been aptly said: "Delta is a storage format while Spark is an execution engine...to separate storage from compute."

Delta Tables

Delta tables are parquet tables with a transactional log.

Changes to (the state of) a delta table are reflected as actions and persisted to the transactional log (in JSON format).
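
For illustration only, here is what a delta file could look like, with made-up values (the actual actions and fields depend on the operation):

  $ cat /tmp/delta/users/_delta_log/00000000000000000000.json
  {"commitInfo":{"timestamp":1623672000000,"operation":"WRITE","operationParameters":{"mode":"Append"},"isBlindAppend":true}}
  {"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
  {"metaData":{"id":"...","format":{"provider":"parquet"},"schemaString":"...","partitionColumns":[]}}
  {"add":{"path":"part-00000-....snappy.parquet","size":1024,"modificationTime":1623672000000,"dataChange":true}}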

OptimisticTransaction

Delta Lake uses OptimisticTransaction for transactional writes. A commit is successful when the transaction can write its actions to a delta file (in the transactional log). If the delta file for the commit version already exists, the transaction is retried (with a newer commit version).
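
A minimal sketch of the flow, using the internal DeltaLog and OptimisticTransaction APIs (the table path is made up):

  import org.apache.spark.sql.delta.DeltaLog

  // Load (or create) the transactional log of a delta table
  val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/users")

  // Start an optimistic transaction over the current table state
  val txn = deltaLog.startTransaction()

  // commit writes the actions to the next delta file in the transactional log;
  // if that commit version was taken by a concurrent writer, the commit is retried
  // txn.commit(actions, DeltaOperations.ManualUpdate)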

Structured queries can write (transactionally) to a delta table using the following interfaces:

  • WriteIntoDelta command for batch queries (Spark SQL)

  • DeltaSink for streaming queries (Spark Structured Streaming)

More importantly, multiple queries can write to the same delta table concurrently.
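
A minimal sketch of both write paths (df, streamingDF and the paths are made up):

  // Batch write (WriteIntoDelta under the covers)
  df.write
    .format("delta")
    .mode("append")
    .save("/tmp/delta/users")

  // Streaming write (DeltaSink under the covers)
  streamingDF.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/delta/users/_checkpoint")
    .start("/tmp/delta/users")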

DeltaTable API

Delta Lake provides the DeltaTable API to access delta tables programmatically.
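
For example (the table path is made up):

  import io.delta.tables.DeltaTable

  val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/users")

  deltaTable.history.show(truncate = false) // commit history
  deltaTable.toDF.show()                    // the table as a DataFrame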

Structured Queries

Delta Lake supports batch and streaming queries (Spark SQL and Structured Streaming, respectively) using the delta format.

Use options to fine-tune queries over data in Delta Lake.
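
For example, time travel with the versionAsOf read option (the path is made up; timestampAsOf works alike):

  // Read the table as of version 0
  val v0 = spark.read
    .format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/users")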

Batch Queries

Delta Lake supports reading from and writing to delta tables in batch queries.
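
A minimal batch read and write (the paths and the column name are made up):

  // Batch read
  val users = spark.read.format("delta").load("/tmp/delta/users")
  users.where("city = 'Warsaw'").show()

  // Batch write
  users.write.format("delta").mode("overwrite").save("/tmp/delta/users-copy")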

Streaming Queries

Delta Lake supports reading from and writing to delta tables in streaming queries.
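
A minimal streaming read (the paths are made up):

  // Streaming read from a delta table
  val stream = spark.readStream.format("delta").load("/tmp/delta/users")

  stream.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/users-console")
    .start()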

LogStore

Delta Lake uses the LogStore abstraction to read and write physical log files and checkpoints (using the Hadoop FileSystem API).
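
The LogStore implementation can be configured per storage system. A sketch, assuming Delta Lake 1.x where S3SingleDriverLogStore ships with delta-core:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("logstore-demo")
    .config("spark.delta.logStore.class",
      "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .getOrCreate()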

Delta Tables in Logical Query Plans

Delta Lake defines the DeltaTable Scala extractor to find delta tables in a logical query plan. The extractor matches LogicalRelations (Spark SQL) with a HadoopFsRelation (Spark SQL) and a TahoeFileIndex.

Put simply, in logical query plans delta tables are LogicalRelations over a HadoopFsRelation with a TahoeFileIndex.
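
A minimal sketch of the extractor in use (internal APIs; the function name is made up):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.delta.DeltaTable

  // Is there a delta table scan anywhere in the plan?
  def isDeltaTableScan(plan: LogicalPlan): Boolean =
    plan.collectFirst { case DeltaTable(fileIndex) => fileIndex }.isDefined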

Concurrent Blind Append Transactions

A transaction is a blind append when it simply appends new data to a table, with no reliance on existing data (neither reading nor modifying it).

Blind append transactions are marked in the commit info to distinguish them from read-modify-append transactions (deletes, merges, or updates) and are assumed not to conflict with other concurrent transactions.

Blind append transactions allow for concurrent updates:

  df.write.format("delta").mode("append").save(...)

Generated Columns

Delta Lake supports Generated Columns.
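
A minimal sketch with the DeltaTableBuilder API, assuming Delta Lake 1.0 or later (the table and column names are made up):

  import io.delta.tables.DeltaTable
  import org.apache.spark.sql.types.{DateType, TimestampType}

  DeltaTable.create(spark)
    .tableName("events")
    .addColumn("eventTime", TimestampType)
    .addColumn(
      DeltaTable.columnBuilder("eventDate")
        .dataType(DateType)
        .generatedAlwaysAs("CAST(eventTime AS DATE)") // computed on write
        .build())
    .execute()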

Table Constraints

Delta Lake introduces table constraints to ensure data quality and integrity (during writes).
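
For example, a CHECK constraint via SQL (the table and constraint names are made up):

  spark.sql("ALTER TABLE events ADD CONSTRAINT validEventTime CHECK (eventTime IS NOT NULL)")

  // Appending rows that violate the constraint makes the write fail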

