FileFormat¶
FileFormat is an abstraction of connectors that can read and write data stored in files.
Contract¶
Schema Inference¶
inferSchema(
sparkSession: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
Infers the schema of the given files (as Hadoop FileStatuses), if supported. Otherwise, None should be returned.
Used when:
- HiveMetastoreCatalog is requested to inferIfNeeded
- DataSource is requested to getOrInferFileFormatSchema and resolveRelation
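For illustration, a custom FileFormat with a fixed, well-known file layout could implement inferSchema without ever looking inside the files. The following sketch is hypothetical (the class and column names are made up); only inferSchema and a stubbed prepareWrite, the two abstract methods of the contract, are shown:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical connector whose files always follow the same layout
class FixedSchemaFileFormat extends FileFormat {

  // The schema is known up front, so the files are not inspected at all
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    Some(StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("payload", StringType))))

  // Stub: the write path is not part of this sketch
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("write is not supported")
}
```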
Preparing Write¶
prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
Prepares a write job and returns an OutputWriterFactory
Used when:
- FileFormatWriter utility is used to write out a query result
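As a sketch of what prepareWrite could return, the hypothetical factory below writes every InternalRow as one comma-separated text line. The file extension and the rendering of internal values are illustrative only:

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Hypothetical factory: one text line per row
class LineOutputWriterFactory extends OutputWriterFactory {

  override def getFileExtension(context: TaskAttemptContext): String = ".txt"

  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter = {
    val filePath = path
    new OutputWriter {
      private val hadoopPath = new Path(filePath)
      private val out = hadoopPath.getFileSystem(context.getConfiguration).create(hadoopPath)

      // Renders the internal values with toString, which is fine for a sketch only
      override def write(row: InternalRow): Unit =
        out.write((row.toSeq(dataSchema).mkString(",") + "\n").getBytes(StandardCharsets.UTF_8))

      override def close(): Unit = out.close()

      // Abstract in newer Spark versions; a harmless extra member on older ones
      def path(): String = filePath
    }
  }
}
```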
Implementations¶
- AvroFileFormat
- BinaryFileFormat
- HiveFileFormat
- ImageFileFormat
- OrcFileFormat
- ParquetFileFormat
- TextBasedFileFormat
Building Data Reader With Partition Values¶
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues builds a data reader with partition column values appended.
Note
buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function.
buildReaderWithPartitionValues uses the input parameters to build a data reader function (from a PartitionedFile to an Iterator[InternalRow]) that does the following:
- Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema
- Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended
buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.
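A simplified Scala sketch of the two steps above follows (the real method additionally short-circuits when partitionSchema is empty and makes the returned function serializable). withPartitionValues is a made-up helper name:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

def withPartitionValues(
    dataReader: PartitionedFile => Iterator[InternalRow],
    requiredSchema: StructType,
    partitionSchema: StructType): PartitionedFile => Iterator[InternalRow] = {
  // Attributes of requiredSchema followed by partitionSchema
  val fullSchema: Seq[Attribute] = (requiredSchema ++ partitionSchema).map { f =>
    AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
  }
  (file: PartitionedFile) => {
    // Step 1: converter (UnsafeProjection) over the joined attributes
    val converter = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
    val joinedRow = new JoinedRow()
    // Step 2: read the file and append the partition column values
    dataReader(file).map { dataRow =>
      converter(joinedRow(dataRow, file.partitionValues))
    }
  }
}
```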
Creating FileFormat Metadata Column¶
createFileMetadataCol(): AttributeReference
createFileMetadataCol cleans up the internal metadata of the metadataSchemaFields.
In the end, createFileMetadataCol creates an AttributeReference for the _metadata column (with the file internal metadata).
createFileMetadataCol is used when:
- LogicalRelation logical operator is requested for the metadataOutput (of a HadoopFsRelation)
- StreamingRelation (Spark Structured Streaming) logical operator is requested for the metadataOutput (of a FileFormat)
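The resulting _metadata column is hidden: it does not show up in a DataFrame's schema, yet it can be selected explicitly for file-based sources (in Spark versions that support the hidden file metadata column). A small usage sketch; the path and the id column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical directory of parquet files with an id column
val df = spark.read.parquet("/tmp/example")

df.printSchema()   // _metadata is not listed here

// ...but it resolves fine when selected explicitly
df.select($"id", $"_metadata.file_name", $"_metadata.file_size").show()
```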
Building Data Reader¶
buildReader(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
Builds a Catalyst data reader (a function that reads a single PartitionedFile to produce InternalRows).
buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work):
buildReader is not supported for [this]
Used when:
- FileFormat is requested to buildReaderWithPartitionValues
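A sketch of what a concrete buildReader could look like for a hypothetical "whole file as one row" format follows. It is written as a standalone function with buildReader's signature (a single string column per file, filters ignored); a real implementation would also avoid capturing the non-serializable hadoopConf directly (e.g. by broadcasting a serializable wrapper):

```scala
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical reader: the whole file becomes one InternalRow with a single string column
def wholeFileReader(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] =
  (file: PartitionedFile) => {
    // file.filePath is a String up to Spark 3.3 and a SparkPath since 3.4; toString covers both
    val path = new Path(file.filePath.toString)
    val in = path.getFileSystem(hadoopConf).open(path)
    val bytes = try IOUtils.toByteArray(in) finally in.close()
    Iterator.single(InternalRow(UTF8String.fromBytes(bytes)))
  }
```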
isSplitable¶
isSplitable(
sparkSession: SparkSession,
options: Map[String, String],
path: Path): Boolean
Controls whether this format (under the given Hadoop Path and the options) is splittable or not
Default: false
Always splittable:
- AvroFileFormat
- OrcFileFormat
- ParquetFileFormat
Never splittable:
- BinaryFileFormat
Used when:
- FileSourceScanExec physical operator is requested to create an RDD for a non-bucketed read (when requested for the inputRDD)
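Text-based formats typically answer this based on the compression codec of the file. The override below, meant to live inside a FileFormat implementation, is a sketch of that usual pattern (not the exact TextBasedFileFormat code):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
import org.apache.spark.sql.SparkSession

// Splittable when uncompressed or when the codec itself supports splitting (e.g. bzip2)
override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  val codec = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
    .getCodec(path)
  codec == null || codec.isInstanceOf[SplittableCompressionCodec]
}
```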
supportBatch¶
supportBatch(
sparkSession: SparkSession,
dataSchema: StructType): Boolean
Whether this format supports vectorized decoding or not
Default: false
Used when:
- FileSourceScanExec physical operator is requested for the supportsBatch flag
- OrcFileFormat is requested to buildReaderWithPartitionValues
- ParquetFileFormat is requested to buildReaderWithPartitionValues
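For instance, a format could offer columnar decoding only for flat schemas of primitive types, along the lines of the simplified override below (inside a FileFormat implementation; OrcFileFormat and ParquetFileFormat additionally consult configuration, e.g. whether the vectorized reader is enabled):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

// Vectorized decoding only when no column is nested
override def supportBatch(
    sparkSession: SparkSession,
    dataSchema: StructType): Boolean =
  dataSchema.fields.forall { field =>
    field.dataType match {
      case _: StructType | _: ArrayType | _: MapType => false
      case _ => true
    }
  }
```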
supportDataType¶
supportDataType(
dataType: DataType): Boolean
Controls whether this format supports the given DataType in read or write paths
Default: true (all data types are supported)
Used when:
- DataSourceUtils is used to verifySchema
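An override usually pattern-matches on the DataType and recurses into nested types. As a sketch, a hypothetical format that cannot store interval values:

```scala
import org.apache.spark.sql.types._

// Hypothetical format that cannot store interval values; nested types are checked recursively
override def supportDataType(dataType: DataType): Boolean = dataType match {
  case _: CalendarIntervalType => false
  case ArrayType(elementType, _) => supportDataType(elementType)
  case MapType(keyType, valueType, _) => supportDataType(keyType) && supportDataType(valueType)
  case StructType(fields) => fields.forall(field => supportDataType(field.dataType))
  case _ => true
}
```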
supportFieldName¶
supportFieldName(
name: String): Boolean
supportFieldName controls whether this format supports the given field name in read or write paths.
supportFieldName is true (all field names are supported) by default.
See:
- DeltaParquetFileFormat (Delta Lake)
supportFieldName is used when:
- DataSourceUtils is requested to checkFieldNames
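A format that cannot store certain characters in column names (Parquet is the classic example) could override it along these lines (the character set below is illustrative only):

```scala
// Reject names containing characters the on-disk format cannot represent
override def supportFieldName(name: String): Boolean =
  !name.matches(".*[ ,;{}()\n\t=].*")
```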
Vector Types¶
vectorTypes(
requiredSchema: StructType,
partitionSchema: StructType,
sqlConf: SQLConf): Option[Seq[String]]
Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode)
Default: None (undefined)
Used when:
- FileSourceScanExec physical operator is requested for the vectorTypes
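A columnar format would typically report one ColumnVector class per column of requiredSchema and partitionSchema, switching between on-heap and off-heap vectors based on configuration. A sketch (close in spirit to, but not exactly, what ParquetFileFormat does):

```scala
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

// One ColumnVector class name per column, picked by the configured memory mode
override def vectorTypes(
    requiredSchema: StructType,
    partitionSchema: StructType,
    sqlConf: SQLConf): Option[Seq[String]] = {
  val vectorClass =
    if (sqlConf.offHeapColumnVectorEnabled) classOf[OffHeapColumnVector].getName
    else classOf[OnHeapColumnVector].getName
  Some(Seq.fill(requiredSchema.length + partitionSchema.length)(vectorClass))
}
```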
Metadata Columns¶
metadataSchemaFields: Seq[StructField]
metadataSchemaFields defines the following non-nullable hidden file metadata columns:
| Name | Data Type |
|---|---|
| file_path | StringType |
| file_name | StringType |
| file_size | LongType |
| file_block_start | LongType |
| file_block_length | LongType |
| file_modification_time | TimestampType |
See:
- ParquetFileFormat
- DeltaParquetFileFormat (Delta Lake)
metadataSchemaFields is used when:
- FileFormat is requested to createFileMetadataCol
- FileSourceStrategy execution planning strategy is executed
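Connectors can extend the default fields; this is the kind of extension DeltaParquetFileFormat relies on for its extra columns. A hypothetical sketch (the extra field name is made up, and the format's readers would still have to actually produce values for it):

```scala
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.types.{LongType, StructField}

// Hypothetical format exposing one extra hidden metadata field on top of the defaults
class ExtendedMetadataFileFormat extends ParquetFileFormat {
  override def metadataSchemaFields: Seq[StructField] =
    super.metadataSchemaFields :+ StructField("custom_tag", LongType, nullable = false)
}
```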