The Internals of Delta Lake 0.4.0

Delta Lake is an open-source storage management system (storage layer) that brings ACID transactions and time travel to Apache Spark and big data workloads.

Delta Lake introduces the concept of a delta table, which is simply a parquet table with a transactional log.

Changes to (the state of) a Delta table are reflected as actions and persisted to the transactional log (in JSON format).
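Every commit to a delta table ends up as a numbered JSON file under the table's _delta_log directory, with one action per line (e.g. commitInfo, add, remove, metaData, protocol), so the log can be inspected with a plain JSON reader. A minimal sketch, assuming a delta table already exists at the illustrative /tmp/delta/events path:

// Read the per-commit JSON files of the transactional log directly
// (/tmp/delta/events is an illustrative path of an existing delta table)
val log = spark
  .read
  .json("/tmp/delta/events/_delta_log/*.json")
log.printSchema
log.show(truncate = false)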

Delta Lake provides the DeltaTable API to programmatically access delta tables. A DeltaTable instance can be created for an existing delta table (DeltaTable.forPath) or by converting an existing parquet table in place (DeltaTable.convertToDelta).
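A minimal sketch of both entry points (the parquet and delta table paths are illustrative):

import io.delta.tables.DeltaTable

// Convert an existing parquet directory to a delta table in place (illustrative path)
val converted = DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/events`")

// Access an existing delta table by path
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
deltaTable.toDF.show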

Delta Lake supports Spark SQL and Spark Structured Streaming using the delta format.

Delta Lake supports reading from and writing to delta tables in batch queries.
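For example, a batch query can save a DataFrame in delta format and load it back (a minimal sketch; /tmp/delta/events is an illustrative path):

// Write a batch DataFrame as a delta table (illustrative path)
spark.range(5)
  .write
  .format("delta")
  .mode("overwrite")
  .save("/tmp/delta/events")

// Load the delta table back as a batch DataFrame
val events = spark
  .read
  .format("delta")
  .load("/tmp/delta/events")
events.show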

Delta Lake supports reading from and writing to delta tables in streaming queries.
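For example, a delta table can be used both as a streaming source and as a streaming sink (a minimal sketch; the paths are illustrative, and the sink requires a checkpoint location):

// Use a delta table as a streaming source (illustrative path)
val eventsStream = spark
  .readStream
  .format("delta")
  .load("/tmp/delta/events")

// ...and another delta table as the streaming sink
val query = eventsStream
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/events_copy/_checkpoint")
  .start("/tmp/delta/events_copy")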

Delta Lake uses the LogStore abstraction to read and write physical log files and checkpoints (using the Hadoop FileSystem API).
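The LogStore implementation in use can be changed with the spark.delta.logStore.class configuration property. A sketch, assuming the S3-specific store that ships with Delta Lake should replace the default HDFS-backed one:

/*
./bin/spark-shell \
  --packages io.delta:delta-core_2.12:0.4.0 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
*/
// Verify the configured LogStore implementation
assert(spark.sparkContext.getConf.get("spark.delta.logStore.class").endsWith("S3SingleDriverLogStore"))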

Installing Delta Lake

In order to "install" and use Delta Lake in a Spark application (e.g. spark-shell), use the --packages command-line option.

/*
./bin/spark-shell \
  --packages io.delta:delta-core_2.12:0.4.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
*/
// Sanity checks: Delta Lake 0.4.0 requires Spark 2.4.2 or later in the 2.4.x line
assert(spark.isInstanceOf[org.apache.spark.sql.SparkSession])
assert(spark.version.matches("2.4.[2-4]"), "Delta Lake supports Spark 2.4.2+")

val input = spark
  .read
  .format("delta")
  .option("path", "delta")
  .load

The delta data source requires the path option to be specified (either via option("path", ...) or as the argument of load).
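Equivalently, the path can be given directly to load (a minimal sketch; the "delta" directory is the same illustrative path as above):

// Same as option("path", "delta") followed by load
val input = spark
  .read
  .format("delta")
  .load("delta")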