Skip to content

Delta Lake

Delta Lake is an open-source storage management system (storage layer) that brings ACID transactions and time travel to Apache Spark and big data workloads.

As it was well said: "Delta is a storage format while Spark is an execution engine...to separate storage from compute."

Danger

As of 0.7.0 Delta Lake requires Spark 3. Please note that Spark 3.1.1 is not yet supported. Use Spark 3.0.2 instead.

Delta Lake is a table format. It introduces DeltaTable abstraction that is simply a parquet table with a transactional log.

Changes to (the state of) a delta table are reflected as actions and persisted to the transactional log (in JSON format).

Delta Lake uses OptimisticTransaction for transactional writes. A commit is successful when the transaction can write the actions to a delta file (in the transactional log). In case the delta file for the commit version already exists, the transaction is retried.

Structured queries can write (transactionally) to a delta table using the following interfaces:

  • WriteIntoDelta command for batch queries (Spark SQL)

  • DeltaSink for streaming queries (Spark Structured Streaming)

More importantly, multiple queries can write to the same delta table simultaneously (at exactly the same time).

Delta Lake provides DeltaTable API to programmatically access Delta tables. A delta table can be created based on a parquet table or from scratch.

Delta Lake supports batch and streaming queries (Spark SQL and Structured Streaming, respectively) using delta format.

In order to fine tune queries over data in Delta Lake use options. Among the options path option is mandatory.

Delta Lake supports reading and writing in batch queries:

Delta Lake supports reading and writing in streaming queries:

Delta Lake uses LogStore abstraction to read and write physical log files and checkpoints (using Hadoop FileSystem API).

Delta Tables in Logical Query Plans

Delta Table defines DeltaTable Scala extractor to find delta tables in a logical query plan. The extractor finds LogicalRelations (Spark SQL) with HadoopFsRelation (Spark SQL) and TahoeFileIndex.

Put simply, delta tables are LogicalRelations with HadoopFsRelation with TahoeFileIndex in logical query plans.


Last update: 2021-03-25