# ParquetUtils
## Inferring Schema (Schema Discovery)
```scala
inferSchema(
  sparkSession: SparkSession,
  parameters: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]
```
`inferSchema` determines which file(s) to touch in order to determine the schema:

- With mergeSchema enabled, `inferSchema` merges the part-files, metadata and common metadata files (see the demo after this list). Data part-files are skipped when spark.sql.parquet.respectSummaryFiles is enabled.
- With mergeSchema disabled, `inferSchema` prefers summary files (with `_common_metadata` files preferable over `_metadata` files as they contain no extra row group information and hence are smaller for large Parquet files with lots of row groups) and falls back to a random data part-file. `inferSchema` takes the very first parquet file (ordered by path) from the following (until a file is found):
    1. `_common_metadata` files
    1. `_metadata` files
    1. Data part-files
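As a demo, the following sketch (the paths and column names are made up for illustration) writes two part-files with different yet compatible schemas into one directory and reads them back with the mergeSchema option:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two writes with different (yet compatible) schemas into one table directory
Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/merge-demo/key=1")
Seq((2, 0.5)).toDF("id", "score").write.parquet("/tmp/merge-demo/key=2")

// With mergeSchema enabled, schema inference merges the schemas
// of the files to touch
val df = spark.read.option("mergeSchema", "true").parquet("/tmp/merge-demo")
df.printSchema // id, name, score (plus the key partition column)
```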
`inferSchema` creates a ParquetOptions (with the input `parameters` and the `SparkSession`'s SQLConf) to read the value of the mergeSchema option.
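When the per-read mergeSchema option is absent, its value falls back to the spark.sql.parquet.mergeSchema session configuration. A minimal sketch (reusing the demo directory from above):

```scala
// Session-wide default used when the mergeSchema option is not given
spark.conf.set("spark.sql.parquet.mergeSchema", true)

// No per-read option: the session default above turns schema merging on
val merged = spark.read.parquet("/tmp/merge-demo")
```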
`inferSchema` reads the value of the spark.sql.parquet.respectSummaryFiles configuration property.
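The property defaults to false. Enabling it assumes the summary files are consistent with all the data part-files (a sketch):

```scala
// When enabled, schema merging may skip data part-files and rely on
// the _common_metadata / _metadata summary files alone
spark.conf.set("spark.sql.parquet.respectSummaryFiles", true)
```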
`inferSchema` organizes the parquet files by type for the given `FileStatus`es (Apache Hadoop).
In the end, `inferSchema` mergeSchemasInParallel for the files to touch.
`inferSchema` is used when:

- `ParquetFileFormat` is requested to infer schema
- `ParquetTable` is requested to infer schema
## Organizing Parquet Files by Type
```scala
splitFiles(
  allFiles: Seq[FileStatus]): FileTypes
```
`splitFiles` sorts the given `FileStatus`es (Apache Hadoop) by path.
`splitFiles` creates a FileTypes with the following:

- Data files (i.e., files that are not summary files, so neither `_common_metadata` nor `_metadata`)
- Metadata files (`_metadata`)
- Common metadata files (`_common_metadata`)
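A minimal sketch of such a split (FileTypes is modeled here as a plain case class for illustration; the real container lives inside ParquetUtils):

```scala
import org.apache.hadoop.fs.FileStatus

// Stand-in for the FileTypes container (illustration only)
case class FileTypes(
  data: Seq[FileStatus],
  metadata: Seq[FileStatus],
  commonMetadata: Seq[FileStatus])

def splitFilesSketch(allFiles: Seq[FileStatus]): FileTypes = {
  // Sort by path so "the very first parquet file (ordered by path)" is stable
  val sorted = allFiles.sortBy(_.getPath.toString)
  val isSummary = (name: String) =>
    name == "_metadata" || name == "_common_metadata"
  FileTypes(
    data = sorted.filterNot(f => isSummary(f.getPath.getName)),
    metadata = sorted.filter(_.getPath.getName == "_metadata"),
    commonMetadata = sorted.filter(_.getPath.getName == "_common_metadata"))
}
```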
## isBatchReadSupportedForSchema
```scala
isBatchReadSupportedForSchema(
  sqlConf: SQLConf,
  schema: StructType): Boolean
```
`isBatchReadSupportedForSchema`...FIXME
`isBatchReadSupportedForSchema` is used when:

- `ParquetFileFormat` is requested to supportBatch and buildReaderWithPartitionValues
- `ParquetPartitionReaderFactory` is created
## isBatchReadSupported
```scala
isBatchReadSupported(
  sqlConf: SQLConf,
  dt: DataType): Boolean
```
`isBatchReadSupported`...FIXME
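As a rough intuition only (an assumption, not Spark's actual logic), batch (vectorized) reads favor flat schemas of simple types. A toy predicate along those lines:

```scala
import org.apache.spark.sql.types._

// Toy check (illustration only): treat nested types as unsupported
// for batch reads and everything else as supported
def isSimpleType(dt: DataType): Boolean = dt match {
  case _: StructType | _: ArrayType | _: MapType => false
  case _ => true
}

def supportsBatchSketch(schema: StructType): Boolean =
  schema.fields.forall(f => isSimpleType(f.dataType))
```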
## prepareWrite
```scala
prepareWrite(
  sqlConf: SQLConf,
  job: Job,
  dataSchema: StructType,
  parquetOptions: ParquetOptions): OutputWriterFactory
```
`prepareWrite`...FIXME
`prepareWrite` is used when:

- `ParquetFileFormat` is requested to prepareWrite
- `ParquetWrite` is requested to prepareWrite
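At the user level, the write-side configuration that `prepareWrite` consumes surfaces as Parquet data source options, e.g. the compression codec:

```scala
// compression is one of the Parquet write options (a ParquetOptions option)
spark.range(5).write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet("/tmp/parquet-write-demo")
```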
## Logging
Enable `ALL` logging level for `org.apache.spark.sql.execution.datasources.parquet.ParquetUtils` logger to see what happens inside.

Add the following lines to `conf/log4j2.properties`:
```text
logger.ParquetUtils.name = org.apache.spark.sql.execution.datasources.parquet.ParquetUtils
logger.ParquetUtils.level = all
```
Refer to Logging.