Parquet Data Source

Apache Parquet is a columnar storage format for the Apache Hadoop ecosystem with support for efficient storage and encoding of data.

Spark SQL supports Parquet-encoded data using ParquetDataSourceV2. The older ParquetFileFormat remains available as the fallbackFileFormat, used for backward compatibility and Hive, among other use cases.

Parquet is the default data source format, as determined by the spark.sql.sources.default configuration property.
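As a minimal sketch of what the default means in practice, the following writes and reads a dataset without naming a format, so the value of spark.sql.sources.default decides. The object name `ParquetDefaultDemo`, the local master, and the temporary directory are assumptions made for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; not part of Spark itself.
object ParquetDefaultDemo {
  // Local session for illustration only.
  def session(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("parquet-default-demo")
      .getOrCreate()

  // Writes and reads back without calling format(...),
  // so spark.sql.sources.default picks the data source.
  def roundTrip(spark: SparkSession): Long = {
    val path = Files.createTempDirectory("parquet-demo").resolve("numbers").toString
    spark.range(5).write.save(path)
    spark.read.load(path).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = session()
    // Shows the format currently used when none is given explicitly.
    println(spark.conf.get("spark.sql.sources.default"))
    println(roundTrip(spark))
    spark.stop()
  }
}
```

Setting spark.sql.sources.default to another format (for example with `spark.conf.set`) would change what `save` and `load` use here, which is why production code often names the format explicitly.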

The Parquet data source uses the spark.sql.parquet prefix for Parquet-specific configuration properties.
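One way to explore these properties is to filter the output of the `SET -v` SQL command by the prefix. The sketch below assumes a local session; the object name `ParquetConfDemo` and the choice of `gzip` are illustrative, while spark.sql.parquet.compression.codec is used as one example of a property under the prefix.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical demo object; not part of Spark itself.
object ParquetConfDemo {
  val Prefix = "spark.sql.parquet."

  // Collects the names of the SQL configuration properties that start with the prefix.
  def parquetKeys(spark: SparkSession): Seq[String] =
    spark.sql("SET -v")
      .select("key")
      .where(col("key").startsWith(Prefix))
      .collect()
      .map(_.getString(0))
      .sorted
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-conf-demo")
      .getOrCreate()

    parquetKeys(spark).foreach(println)

    // Example of overriding one Parquet-specific property for this session.
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    println(spark.conf.get("spark.sql.parquet.compression.codec"))

    spark.stop()
  }
}
```

The exact list of properties printed depends on the Spark version in use.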