# ParquetUtils
## Inferring Schema (Schema Discovery)
```scala
inferSchema(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]
```
inferSchema decides which file(s) to touch in order to determine the schema:
- With mergeSchema enabled, inferSchema merges data part-files, `_metadata` and `_common_metadata` files. Data part-files are skipped when spark.sql.parquet.respectSummaryFiles is enabled.
- With mergeSchema disabled, inferSchema prefers summary files, with `_common_metadata` files preferable over `_metadata` files as they contain no extra row-group information and hence are smaller for large Parquet files with lots of row groups. inferSchema falls back to a data part-file otherwise. In detail, inferSchema takes the very first parquet file (ordered by path) from the following (until a file is found):
    - `_common_metadata` files
    - `_metadata` files
    - Data part-files
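For illustration, here is how the mergeSchema toggle (and spark.sql.parquet.respectSummaryFiles) can be exercised from user code; the path is hypothetical:

```scala
// Per-read toggle: with mergeSchema enabled, inferSchema touches
// data part-files as well as the _metadata and _common_metadata summary files.
val merged = spark.read
  .option("mergeSchema", true)
  .parquet("/tmp/parquet-table") // hypothetical path
merged.printSchema()

// Prefer summary files and skip data part-files while merging schemas
spark.conf.set("spark.sql.parquet.respectSummaryFiles", true)
```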
inferSchema creates a ParquetOptions (with the input parameters and the SparkSession's SQLConf) to read the value of the mergeSchema option.
inferSchema reads the value of the spark.sql.parquet.respectSummaryFiles configuration property.
inferSchema organizes parquet files by type for the given FileStatuses (Apache Hadoop).
In the end, inferSchema merges the schemas of the files to touch in parallel (mergeSchemasInParallel).
inferSchema is used when:
- ParquetFileFormat is requested to infer schema
- ParquetTable is requested to infer schema
## Organizing Parquet Files by Type
```scala
splitFiles(
  allFiles: Seq[FileStatus]): FileTypes
```
splitFiles sorts the given FileStatuses (Apache Hadoop) by path.
splitFiles creates a FileTypes with the following:
- Data files (i.e., files that are not summary files, so neither `_common_metadata` nor `_metadata`)
- Metadata files (`_metadata`)
- Common metadata files (`_common_metadata`)
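A minimal sketch of the grouping, assuming summary files are recognized purely by file name; the FileTypes case class below mirrors the three groups and is illustrative, not the actual Spark source:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical FileTypes shape mirroring the three groups above
case class FileTypes(
  data: Seq[FileStatus],
  metadata: Seq[FileStatus],
  commonMetadata: Seq[FileStatus])

def isSummaryFile(path: Path): Boolean =
  path.getName == "_metadata" || path.getName == "_common_metadata"

def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
  // Sort by path first, then group by file name
  val sorted = allFiles.sortBy(_.getPath.toString)
  FileTypes(
    data = sorted.filterNot(f => isSummaryFile(f.getPath)),
    metadata = sorted.filter(_.getPath.getName == "_metadata"),
    commonMetadata = sorted.filter(_.getPath.getName == "_common_metadata"))
}
```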
## isBatchReadSupportedForSchema
```scala
isBatchReadSupportedForSchema(
  sqlConf: SQLConf,
  schema: StructType): Boolean
```
isBatchReadSupportedForSchema...FIXME
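Pending the FIXME, a plausible sketch (an assumption, not confirmed here): batch reads are supported for a schema when the vectorized Parquet reader is enabled and every top-level field passes isBatchReadSupported (sketched in the next section):

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

// Sketch (assumption): vectorized reads must be on and every
// top-level field must itself support batch reads.
def isBatchReadSupportedForSchema(sqlConf: SQLConf, schema: StructType): Boolean =
  sqlConf.parquetVectorizedReaderEnabled &&
    schema.forall(field => isBatchReadSupported(sqlConf, field.dataType))
```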
isBatchReadSupportedForSchema is used when:
- ParquetFileFormat is requested to supportBatch and buildReaderWithPartitionValues
- ParquetPartitionReaderFactory is created
## isBatchReadSupported
```scala
isBatchReadSupported(
  sqlConf: SQLConf,
  dt: DataType): Boolean
```
isBatchReadSupported...FIXME
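Likewise a sketch only, under the assumption that atomic types are always batch-readable while nested types additionally require spark.sql.parquet.enableNestedColumnVectorizedReader:

```scala
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._

// Sketch of the internal logic (assumption): recurse into nested types,
// gating them on the nested-column vectorized reader flag.
def isBatchReadSupported(sqlConf: SQLConf, dt: DataType): Boolean = dt match {
  case _: AtomicType =>
    true
  case at: ArrayType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      isBatchReadSupported(sqlConf, at.elementType)
  case mt: MapType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      isBatchReadSupported(sqlConf, mt.keyType) &&
      isBatchReadSupported(sqlConf, mt.valueType)
  case st: StructType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      st.fields.forall(f => isBatchReadSupported(sqlConf, f.dataType))
  case _ =>
    false
}
```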
## prepareWrite
```scala
prepareWrite(
  sqlConf: SQLConf,
  job: Job,
  dataSchema: StructType,
  parquetOptions: ParquetOptions): OutputWriterFactory
```
prepareWrite...FIXME
prepareWrite is used when:
- ParquetFileFormat is requested to prepareWrite
- ParquetWrite is requested to prepareWrite
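Although the internals are left as FIXME, the write path that reaches prepareWrite is easy to trigger from user code; the compression option below is resolved through ParquetOptions (output path is hypothetical):

```scala
// Writing parquet goes through prepareWrite; the compression option
// is among the settings carried by ParquetOptions.
val df = spark.range(5).toDF("id")
df.write
  .option("compression", "snappy")
  .parquet("/tmp/out") // hypothetical path
```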
## Logging
Enable ALL logging level for the org.apache.spark.sql.execution.datasources.parquet.ParquetUtils logger to see what happens inside.
Add the following line to conf/log4j2.properties:
```text
logger.ParquetUtils.name = org.apache.spark.sql.execution.datasources.parquet.ParquetUtils
logger.ParquetUtils.level = all
```
Refer to Logging.