Delta Lake

Delta Lake is an open-source table format (storage layer) for cloud data lakes with ACID transactions, time travel and many other features.

Delta Lake allows you to store data in storage systems like HDFS, S3, Azure Data Lake Storage and GCS, query it from many processing engines including Apache Spark, Trino, Apache Hive and Apache Flink, and provides APIs for SQL, Scala, Java, Python and Rust (to name a few).

As it has been aptly put: "Delta is a storage format while Spark is an execution engine...to separate storage from compute." Delta Lake can also run with other execution engines like Trino and Apache Flink.

Delta tables can be registered in a table catalog. Delta Lake creates a transaction log at the root directory of a table, and the catalog contains no information other than the table format and the location of the table. All table properties, the schema and partitioning information live in the transaction log to avoid a "split brain" situation (Wikipedia).
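
As a sketch, registering an existing delta table in the session catalog therefore takes little more than the format and the location; the table name and path below are made up for illustration, and spark is assumed to be a configured SparkSession:

// Register an existing delta table in the catalog; schema, table properties
// and partitioning are read from the transaction log, not from the catalog.
spark.sql("""
  CREATE TABLE demo_events
  USING delta
  LOCATION '/tmp/delta/events'
""")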

Delta Lake 3.1.0 supports Apache Spark 3.5.0 (cf. build.sbt).

Delta Tables

Delta tables are Parquet tables with a transactional log.

Changes to (the state of) a delta table are reflected as actions and persisted to the transactional log (in JSON format).
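
As an illustration, the commit (delta) files under a table's _delta_log directory can be inspected directly; the table path below is hypothetical:

// Each commit is a JSON file under _delta_log (e.g. 00000000000000000000.json)
// with one action per line: commitInfo, protocol, metaData, add, remove, ...
val firstCommit = spark.read.json("/tmp/delta/events/_delta_log/00000000000000000000.json")
firstCommit.show(truncate = false)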

OptimisticTransaction

Delta Lake uses OptimisticTransaction for transactional writes. A commit is successful when the transaction can write its actions to a delta file (in the transactional log). If the delta file for the commit version already exists (another transaction won that version), the commit is retried.

More importantly, multiple queries can write to the same delta table simultaneously.
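
A minimal sketch of that write path using Delta's internal APIs (DeltaLog, OptimisticTransaction, TransactionalWrite); these classes are internal, so packages and signatures may differ between releases, and df is assumed to be a DataFrame in scope:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.delta.{DeltaLog, DeltaOperations}

// Start an optimistic transaction, write the data files out and try to commit
// the resulting actions as the next delta file in the transactional log.
// If that commit version is already taken by a concurrent writer, Delta retries.
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/events")
deltaLog.withNewTransaction { txn =>
  val actions = txn.writeFiles(df)
  txn.commit(actions, DeltaOperations.Write(SaveMode.Append))
}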

Transactional Writers

TransactionalWrite is an interface for writing out data to a delta table.

The following commands and interfaces can transactionally write new data files out to a data directory of a delta table:

Developer APIs

Delta Lake provides the following Developer APIs to interact with (and even extend) Delta Lake from a supported programming language:
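
For example, the Scala Developer API exposes io.delta.tables.DeltaTable; the table path below is made up:

import io.delta.tables.DeltaTable

// Load a delta table by path, inspect its recent commit history
// and read its current state as a DataFrame.
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
deltaTable.history(5).show(truncate = false)
deltaTable.toDF.show()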

Structured Queries

Delta Lake supports batch and streaming queries (Spark SQL and Structured Streaming, respectively) using the delta format.

Use options to fine-tune queries over data in Delta Lake.
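
For example, the time-travel options versionAsOf and timestampAsOf select an older snapshot of a table on a batch read (path made up for illustration):

// Read version 0 of a delta table using the versionAsOf read option.
val snapshot = spark.read
  .format("delta")
  .option("versionAsOf", 0L)
  .load("/tmp/delta/events")
snapshot.show()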

Structured queries can write (transactionally) to a delta table using the following interfaces:

  • WriteIntoDelta command for batch queries (Spark SQL)
  • DeltaSink for streaming queries (Spark Structured Streaming)

Batch Queries

Delta Lake supports reading and writing in batch queries:
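
A minimal batch round trip with the delta format (the table path is illustrative):

// Write a DataFrame out as a delta table and read it back in a batch query.
spark.range(10).toDF("id")
  .write
  .format("delta")
  .mode("overwrite")
  .save("/tmp/delta/numbers")

spark.read.format("delta").load("/tmp/delta/numbers").show()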

Streaming Queries

Delta Lake supports reading and writing in streaming queries:
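
A sketch of a streaming pipeline that uses one delta table as the source and another as the sink (DeltaSink); the paths are made up, and a checkpoint location is required by Structured Streaming:

// Stream changes from one delta table into another.
val query = spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
  .start("/tmp/delta/events_copy")

query.awaitTermination()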

LogStore

Delta Lake uses the LogStore abstraction for reading and writing physical log files and checkpoints (using the Hadoop FileSystem API).
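
A hedged example of plugging in a LogStore implementation via Spark configuration; the configuration key and class name below come from earlier Delta releases and may differ in current ones:

import org.apache.spark.sql.SparkSession

// Configure the LogStore implementation used for reading and writing log files.
val spark = SparkSession.builder()
  .appName("delta-logstore-demo")
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.HDFSLogStore")
  .getOrCreate()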

Delta Tables in Logical Query Plans

Delta Lake defines the DeltaTable Scala extractor to find delta tables in a logical query plan. The extractor matches LogicalRelations (Spark SQL) with a HadoopFsRelation (Spark SQL) and a TahoeFileIndex.

Put simply, in logical query plans, delta tables are LogicalRelations over a HadoopFsRelation backed by a TahoeFileIndex.
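
A rough sketch of that pattern match over an analyzed logical plan; these are internal Spark and Delta classes, so the exact packages and constructor arities may vary across versions:

import org.apache.spark.sql.delta.files.TahoeFileIndex
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Collect the TahoeFileIndex of every delta table referenced by a query plan.
val plan = spark.table("demo_events").queryExecution.analyzed
val deltaFileIndexes = plan.collect {
  case LogicalRelation(HadoopFsRelation(index: TahoeFileIndex, _, _, _, _, _), _, _, _) =>
    index
}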

Concurrent Blind Append Transactions

A transaction is a blind append when it simply appends new data to a table with no reliance on existing data (without reading or modifying it).

Blind append transactions are marked in the commit info to distinguish them from read-modify-appends (deletes, merges or updates) and are assumed not to conflict with concurrent transactions.

Blind append transactions therefore allow for concurrent updates:

df.write
  .format("delta")
  .mode("append")
  .save(...)

Exception Public API

Delta Lake exposes the exceptions thrown on conflicts between concurrent operations as a public API.
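
For instance, a writer can catch these exceptions (in the io.delta.exceptions package) and retry; the retry helper below is hypothetical and df is assumed to be a DataFrame in scope:

import io.delta.exceptions.DeltaConcurrentModificationException

// Retry an append a few times when it conflicts with a concurrent transaction.
def appendWithRetry(path: String, retries: Int = 3): Unit =
  try df.write.format("delta").mode("append").save(path)
  catch {
    case _: DeltaConcurrentModificationException if retries > 0 =>
      appendWithRetry(path, retries - 1)
  }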

Learn More

  1. What's New in Delta Lake 2.3.0 by Will Girten