ParquetUtils

Inferring Schema (Schema Discovery)

inferSchema(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

inferSchema determines which file(s) to touch in order to infer the schema:

  • With mergeSchema enabled, inferSchema merges the schemas of part-files, metadata and common metadata files.

    Data part-files are skipped with spark.sql.parquet.respectSummaryFiles enabled.

  • With mergeSchema disabled, inferSchema prefers summary files, with _common_metadata preferred over _metadata since it contains no row group information and is hence smaller for large Parquet files with many row groups.

    With no summary files available, inferSchema falls back to a data part-file.

    inferSchema takes the very first Parquet file (ordered by path) from the following groups, until a file is found (see the sketch right after this list):

    1. _common_metadata files
    2. _metadata files
    3. data part-files
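
The mergeSchema-disabled selection can be sketched as follows (a minimal, hypothetical rendition; filesByType stands for the FileTypes that splitFiles produces below, and the field names are assumptions):

// Hypothetical sketch of the file-selection order (not the verified implementation):
// prefer _common_metadata, then _metadata, then fall back to the first data part-file.
val filesToTouch =
  filesByType.commonMetadata.headOption
    .orElse(filesByType.metadata.headOption)
    .orElse(filesByType.data.headOption)
    .toSeq

Since splitFiles sorts files by path, headOption gives the first file by path within each group.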

inferSchema creates a ParquetOptions (with the input parameters and the SparkSession's SQLConf) to read the value of the mergeSchema option.

inferSchema reads the value of spark.sql.parquet.respectSummaryFiles configuration property.
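
Both knobs can be exercised from user code, for example (a usage sketch; the path is made up):

// Skip data part-files when merging schemas, if summary files are available
spark.conf.set("spark.sql.parquet.respectSummaryFiles", true)

// The mergeSchema option that inferSchema reads via ParquetOptions
val users = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/parquet-table")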

inferSchema organizes the Parquet files by type for the given FileStatuses (Apache Hadoop) (see Organizing Parquet Files by Type below).

In the end, inferSchema mergeSchemasInParallel for the files to touch.


inferSchema is used when:

  • ParquetFileFormat is requested to infer schema
  • ParquetTable is requested to infer schema

Organizing Parquet Files by Type

splitFiles(
  allFiles: Seq[FileStatus]): FileTypes

splitFiles sorts the given FileStatuses (Apache Hadoop) by path.

splitFiles creates a FileTypes with the following:

  • Data files (files that are not summary files, i.e., neither _common_metadata nor _metadata)
  • Metadata files (_metadata)
  • Common metadata files (_common_metadata)
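
A minimal sketch of the grouping (an illustration under assumed FileTypes field names, not the verified implementation):

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.parquet.hadoop.ParquetFileWriter

// Assumed container for the three groups of files
case class FileTypes(
  data: Seq[FileStatus],
  metadata: Seq[FileStatus],
  commonMetadata: Seq[FileStatus])

// A summary file is either _common_metadata or _metadata
def isSummaryFile(path: Path): Boolean =
  path.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE ||
    path.getName == ParquetFileWriter.PARQUET_METADATA_FILE

def splitFiles(allFiles: Seq[FileStatus]): FileTypes = {
  // Sort by path so "the very first file" is deterministic
  val sorted = allFiles.sortBy(_.getPath.toString)
  FileTypes(
    data = sorted.filterNot(f => isSummaryFile(f.getPath)),
    metadata =
      sorted.filter(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE),
    commonMetadata =
      sorted.filter(_.getPath.getName == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE))
}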

isBatchReadSupportedForSchema

isBatchReadSupportedForSchema(
  sqlConf: SQLConf,
  schema: StructType): Boolean

isBatchReadSupportedForSchema...FIXME
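
A likely shape of the check (a sketch under assumptions, not the verified implementation): the vectorized Parquet reader must be enabled and every field of the schema must be batch-read supported.

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

// Sketch (assumption): delegate to isBatchReadSupported per field
def isBatchReadSupportedForSchema(sqlConf: SQLConf, schema: StructType): Boolean =
  sqlConf.parquetVectorizedReaderEnabled &&
    schema.forall(f => isBatchReadSupported(sqlConf, f.dataType))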


isBatchReadSupportedForSchema is used when:

  • ParquetFileFormat is requested to supportBatch
  • ParquetPartitionReaderFactory is created

isBatchReadSupported

isBatchReadSupported(
  sqlConf: SQLConf,
  dt: DataType): Boolean

isBatchReadSupported...FIXME
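
A hedged sketch of the type check (an assumption: atomic types are always supported, while nested types additionally require the nested-column vectorized reader to be enabled):

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._

// Sketch (assumption, not the verified implementation)
def isBatchReadSupported(sqlConf: SQLConf, dt: DataType): Boolean = dt match {
  case _: AtomicType => true
  case at: ArrayType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      isBatchReadSupported(sqlConf, at.elementType)
  case mt: MapType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      isBatchReadSupported(sqlConf, mt.keyType) &&
      isBatchReadSupported(sqlConf, mt.valueType)
  case st: StructType =>
    sqlConf.parquetVectorizedReaderNestedColumnEnabled &&
      st.fields.forall(f => isBatchReadSupported(sqlConf, f.dataType))
  case _ => false
}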

prepareWrite

prepareWrite(
  sqlConf: SQLConf,
  job: Job,
  dataSchema: StructType,
  parquetOptions: ParquetOptions): OutputWriterFactory

prepareWrite...FIXME
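
prepareWrite sits on the write path: it prepares the given Hadoop Job for Parquet output and returns the OutputWriterFactory to write out the rows. For example (a usage sketch; the path and codec are arbitrary):

// Writing a Dataset to Parquet goes through prepareWrite
spark.range(5)
  .write
  .option("compression", "snappy") // picked up via ParquetOptions
  .parquet("/tmp/parquet-demo")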


prepareWrite is used when:

  • ParquetFileFormat is requested to prepare a write job
  • ParquetWrite is requested to prepare a write job

Logging

Enable ALL logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetUtils logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.ParquetUtils.name = org.apache.spark.sql.execution.datasources.parquet.ParquetUtils
logger.ParquetUtils.level = all

Refer to Logging.