DataSource V2¶
DataSource V2 (DataSource API V2 or Data Source V2) is a new API for data sources in Spark SQL, built around read and write abstractions such as DataSourceReader and DataSourceWriter.
DataSource V2 was tracked under SPARK-15689 DataSource V2 and was marked as fixed in Spark 2.3.0. The API was subsequently revamped alongside the Catalog Plugin API and Multi-Catalog Support and released in Spark 3.0.0 (which could safely have been referred to as DataSource V3).
Query Planning and Execution¶
DataSource V2 relies on the DataSourceV2Strategy execution planning strategy for query planning.
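A quick way to see what this planning produces is to inspect a query's plans. The sketch below is only illustrative: com.example.ExampleSource is a hypothetical V2 implementation, while explain and queryExecution are the standard Spark SQL APIs.

```scala
// Load a DataFrame from a (hypothetical) DataSource V2 implementation
// and inspect how it was planned.
val q = spark.read
  .format("com.example.ExampleSource") // hypothetical V2 source
  .load()

// Logical and physical plans (look for DataSourceV2Relation and DataSourceV2ScanExec)
q.explain(extended = true)

// Or access the physical plan programmatically
println(q.queryExecution.executedPlan.treeString)
```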
Data Reading¶
DataSource V2 uses DataSourceV2Relation logical operator to represent data reading (aka data scan).
DataSourceV2Relation is planned (translated) to a ProjectExec with a DataSourceV2ScanExec physical operator (possibly under a FilterExec operator) when the DataSourceV2Strategy execution planning strategy is executed.
When executed, the DataSourceV2ScanExec physical operator creates a DataSourceRDD (or a ContinuousReader for Spark Structured Streaming).
DataSourceRDD uses InputPartitions for partitions, preferred locations, and computing partitions.
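For reference, a minimal read-side sketch follows, assuming the Spark 2.4 shape of the API (ReadSupport, DataSourceReader, InputPartition, InputPartitionReader). ExampleSource, ExampleReader and RangePartition are hypothetical names used for illustration only.

```scala
import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical source that produces a single `id` column split into two partitions.
class ExampleSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader = new ExampleReader
}

class ExampleReader extends DataSourceReader {
  override def readSchema(): StructType = StructType(Seq(StructField("id", IntegerType)))

  // One InputPartition per partition of the resulting DataSourceRDD
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Seq[InputPartition[InternalRow]](
      new RangePartition(0, 5),
      new RangePartition(5, 10)).asJava
}

class RangePartition(start: Int, end: Int) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
}
```

With such a source on the classpath, `spark.read.format(classOf[ExampleSource].getName).load()` should produce a DataFrame whose scan is backed by a DataSourceRDD with two partitions.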
Data Writing¶
DataSource V2 uses the WriteToDataSourceV2 and AppendData logical operators to represent data writing (over a DataSourceV2Relation logical operator). As of Spark SQL 2.4.0, the WriteToDataSourceV2 operator was deprecated in favour of the more specific AppendData operator (compare "data writing" with "data append", which is certainly more specific).
NOTE: One of the differences between the WriteToDataSourceV2 and AppendData logical operators is that the former (WriteToDataSourceV2) uses a DataSourceWriter directly, while the latter (AppendData) gets the DataSourceWriter from a DataSourceV2Relation.
WriteToDataSourceV2 and AppendData (with DataSourceV2Relation) logical operators are planned as (translated to) a WriteToDataSourceV2Exec physical operator.
When executed, the WriteToDataSourceV2Exec physical operator...FIXME
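To make the write path above more concrete, here is a minimal sketch assuming the Spark 2.4 shape of the write-side API (WriteSupport, DataSourceWriter, DataWriterFactory, DataWriter). ExampleSink and the other names are hypothetical; the sink simply counts the records written per partition.

```scala
import java.util.Optional

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriter, DataWriterFactory, WriterCommitMessage}
import org.apache.spark.sql.types.StructType

// Hypothetical sink that merely counts the records written per partition.
class ExampleSink extends DataSourceV2 with WriteSupport {
  override def createWriter(
      writeUUID: String,
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[DataSourceWriter] =
    Optional.of(new ExampleWriter)
}

class ExampleWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[InternalRow] = new ExampleWriterFactory

  // Called on the driver once every partition has committed
  override def commit(messages: Array[WriterCommitMessage]): Unit =
    println(s"Committed ${messages.length} partition writes")

  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}

class ExampleWriterFactory extends DataWriterFactory[InternalRow] {
  override def createDataWriter(partitionId: Int, taskId: Long, epochId: Long): DataWriter[InternalRow] =
    new DataWriter[InternalRow] {
      private var count = 0L
      override def write(record: InternalRow): Unit = count += 1
      override def commit(): WriterCommitMessage = new WriterCommitMessage {}
      override def abort(): Unit = ()
    }
}
```

A write such as `spark.range(10).write.format(classOf[ExampleSink].getName).mode("append").save()` should then go through the logical operators above and end up in WriteToDataSourceV2Exec.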
Filter Pushdown Performance Optimization¶
DataSource V2 supports filter pushdown performance optimization for...FIXME
From Parquet Filter Pushdown in Apache Drill's documentation:
Filter pushdown is a performance optimization that prunes extraneous data while reading from a data source to reduce the amount of data to scan and read for queries with supported filter expressions. Pruning data reduces the I/O, CPU, and network overhead to optimize query performance.
Tip
Enable INFO logging level for the DataSourceV2Strategy logger to see what filters were pushed down.
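A reader opts into filter pushdown by also implementing SupportsPushDownFilters. The following is a minimal sketch, again assuming the Spark 2.4 shape of the API; ExamplePushDownReader is a hypothetical name.

```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

// Hypothetical reader mix-in that claims it can evaluate every filter it is offered.
trait ExamplePushDownReader extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Return the filters the source cannot handle; an empty array means Spark
  // will not re-evaluate any of them (only safe if the source really applies them).
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty
  }

  // Reported back to Spark for the scan (this is what the logging tip above refers to)
  override def pushedFilters(): Array[Filter] = pushed
}
```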
References¶
Videos¶
- Apache Spark DataSource V2 by Wenchen Fan and Gengliang Wang (Databricks)
- DataSource V2 and Cassandra – A Whole New World by Russell Spitzer (Datastax)