FileFormat¶
FileFormat is an abstraction of data sources that can read and write data stored in files.
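The contract below is what a file-based data source implements. As a minimal, hypothetical sketch (the EchoFileFormat name and the fixed single-column schema are assumptions, not Spark code), a read-only format only has to provide inferSchema and prepareWrite; every other method of the contract has a default:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical sketch: a read-only FileFormat that always reports
// a single `value: string` column and rejects writes.
class EchoFileFormat extends FileFormat {

  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    Some(StructType(StructField("value", StringType) :: Nil))

  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("EchoFileFormat is read-only")
}
```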
Contract¶
Building Data Reader¶
buildReader(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
Builds a Catalyst data reader (a function that reads a single PartitionedFile to produce InternalRows).

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work):

buildReader is not supported for [this]
Used when FileFormat is requested to buildReaderWithPartitionValues.
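Continuing the hypothetical EchoFileFormat from the introduction, the sketch below shows one way a format might override buildReader. It is a simplified, assumption-laden example: it reads each PartitionedFile line by line with HadoopFileLinesReader and ignores requiredSchema pruning and filters; a production reader would also broadcast a serializable copy of the Hadoop configuration instead of capturing it directly.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{HadoopFileLinesReader, PartitionedFile}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

// Extends the hypothetical EchoFileFormat sketched in the introduction.
class EchoLineFormat extends EchoFileFormat {

  override def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
    // The returned function runs on executors, once per PartitionedFile.
    // Simplifications: every line becomes a single-column row; requiredSchema
    // pruning and filters are ignored.
    (file: PartitionedFile) =>
      new HadoopFileLinesReader(file, hadoopConf).map { line =>
        InternalRow(UTF8String.fromString(line.toString))
      }
  }
}
```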
Schema Inference¶
inferSchema(
sparkSession: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
Infers the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.
Used when:
- HiveMetastoreCatalog is requested to inferIfNeeded
- DataSource is requested to getOrInferFileFormatSchema and resolveRelation
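As a hedged illustration (the header-peeking logic, the comma delimiter, and the SchemaInference wrapper are assumptions, not part of Spark), an inferSchema implementation usually opens one of the given FileStatuses, derives a StructType from it, and returns None when there is nothing to infer from:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.io.Source

object SchemaInference {
  // Hypothetical sketch: infer a schema from the header line of the first file.
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    files.headOption.map { status =>
      val conf = sparkSession.sessionState.newHadoopConf()
      val fs = status.getPath.getFileSystem(conf)
      val in = fs.open(status.getPath)
      try {
        val header = Source.fromInputStream(in).getLines().next()
        // One nullable string column per (assumed comma-separated) header field
        StructType(header.split(",").map(name => StructField(name.trim, StringType)))
      } finally in.close()
    }
}
```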
isSplitable¶
isSplitable(
sparkSession: SparkSession,
options: Map[String, String],
path: Path): Boolean
Controls whether this format (under the given Hadoop Path and options) is splittable or not
Default: false
Always splittable:
- AvroFileFormat
- OrcFileFormat
- ParquetFileFormat

Never splittable:
- BinaryFileFormat
Used when:
- FileSourceScanExec physical operator is requested to create an RDD for a non-bucketed read (when requested for the inputRDD)
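For a hedged illustration of the splittability decision, the sketch below mirrors the common pattern of text-based formats: a file is splittable unless it is compressed with a non-splittable codec (the Splittability object and isSplittable helper are assumptions, not Spark code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

object Splittability {
  // A file with no compression codec is splittable; a compressed file is
  // splittable only when its codec is a SplittableCompressionCodec (e.g. bzip2).
  def isSplittable(path: Path, hadoopConf: Configuration): Boolean = {
    val codec = new CompressionCodecFactory(hadoopConf).getCodec(path)
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }
}
```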
Preparing Write¶
prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
Prepares a write job and returns an OutputWriterFactory
Used when FileFormatWriter utility is used to write out a query result
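As a hedged sketch (the names are assumptions, and the concrete OutputWriter is deliberately left out because its exact contract differs slightly across Spark versions), prepareWrite typically resolves write options on the driver and returns an OutputWriterFactory that Spark later uses on executors to create one writer per output file:

```scala
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

object EchoWriteSupport {
  def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    // Driver side: validate/resolve options before any task runs
    val extension = options.getOrElse("extension", ".echo")

    new OutputWriterFactory {
      override def getFileExtension(context: TaskAttemptContext): String = extension

      // Executor side: called once per output file; creating the concrete
      // OutputWriter (write/close of InternalRows) is elided in this sketch.
      override def newInstance(
          path: String,
          dataSchema: StructType,
          context: TaskAttemptContext): OutputWriter = ???
    }
  }
}
```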
supportBatch¶
supportBatch(
sparkSession: SparkSession,
dataSchema: StructType): Boolean
Whether this format supports vectorized decoding or not
Default: false
Used when:
- FileSourceScanExec physical operator is requested for the supportsBatch flag
- OrcFileFormat is requested to buildReaderWithPartitionValues
- ParquetFileFormat is requested to buildReaderWithPartitionValues
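A hedged sketch of a typical supportBatch decision follows (the BatchSupport wrapper is an assumption): vectorized formats generally answer true only when every column has a simple, non-nested type, since complex types are read row by row.

```scala
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

object BatchSupport {
  // Vectorized (columnar) decoding usually requires a flat schema:
  // nested struct, array and map columns fall back to row-based reading.
  def supportBatch(dataSchema: StructType): Boolean =
    dataSchema.fields.forall { field =>
      field.dataType match {
        case _: StructType | _: ArrayType | _: MapType => false
        case _ => true
      }
    }
}
```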
supportDataType¶
supportDataType(
dataType: DataType): Boolean
Controls whether this format supports the given DataType in read or write paths
Default: true (all data types are supported)
Used when DataSourceUtils is used to verifySchema
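As a hedged illustration of supportDataType (the TypeSupport wrapper is an assumption), a format can recursively accept only the types it can physically encode; the sketch below accepts everything except interval columns, wherever they appear in the schema:

```scala
import org.apache.spark.sql.types._

object TypeSupport {
  // Recursively checks a DataType, rejecting interval columns anywhere
  // in the schema (including inside structs, arrays and maps).
  def supportDataType(dataType: DataType): Boolean = dataType match {
    case CalendarIntervalType => false
    case st: StructType => st.fields.forall(f => supportDataType(f.dataType))
    case ArrayType(elementType, _) => supportDataType(elementType)
    case MapType(keyType, valueType, _) =>
      supportDataType(keyType) && supportDataType(valueType)
    case _ => true
  }
}
```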
Vector Types¶
vectorTypes(
requiredSchema: StructType,
partitionSchema: StructType,
sqlConf: SQLConf): Option[Seq[String]]
Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode)

Default: None (undefined)
Used when:
- FileSourceScanExec physical operator is requested for the vectorTypes
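For a hedged illustration (the VectorTypes wrapper is an assumption, and whether on-heap or off-heap vectors are used really depends on configuration), a columnar format typically announces one ColumnVector class name per data column followed by one per partition column:

```scala
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.StructType

object VectorTypes {
  // One concrete ColumnVector class name per column of requiredSchema
  // followed by partitionSchema (the order Spark expects).
  def vectorTypes(
      requiredSchema: StructType,
      partitionSchema: StructType): Option[Seq[String]] =
    Some(Seq.fill(requiredSchema.fields.length + partitionSchema.fields.length)(
      classOf[OnHeapColumnVector].getName))
}
```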
Implementations¶
- AvroFileFormat
- BinaryFileFormat
- HiveFileFormat
- ImageFileFormat
- OrcFileFormat
- ParquetFileFormat
- TextBasedFileFormat
Building Data Reader With Partition Values¶
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues builds a data reader with partition column values appended.

Note: buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function.

buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:
- Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema
- Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended
buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.
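The two steps above can be condensed into the following hedged sketch (not the exact Spark source; the AppendPartitionValues wrapper is an assumption and attribute resolution is simplified): the converter is an UnsafeProjection over the data attributes followed by the partition attributes, and every data row is joined with the file's partitionValues before conversion.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, JoinedRow}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile

object AppendPartitionValues {
  // Wraps a plain data reader so that partition column values are appended
  // to every InternalRow it produces.
  def withPartitionValues(
      dataReader: PartitionedFile => Iterator[InternalRow],
      requiredAttrs: Seq[Attribute],
      partitionAttrs: Seq[Attribute]): PartitionedFile => Iterator[InternalRow] = {
    val fullOutput = requiredAttrs ++ partitionAttrs
    (file: PartitionedFile) => {
      // Step 1: an UnsafeProjection over data columns followed by partition columns
      val converter = GenerateUnsafeProjection.generate(fullOutput, fullOutput)
      val joinedRow = new JoinedRow()
      // Step 2: append the file's partition values to every data row
      dataReader(file).map { dataRow =>
        converter(joinedRow(dataRow, file.partitionValues))
      }
    }
  }
}
```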
createFileMetadataCol¶
createFileMetadataCol(
fileFormat: FileFormat): AttributeReference
createFileMetadataCol...FIXME
createFileMetadataCol is used when:
- LogicalRelation logical operator is requested for the metadataOutput
- StreamingRelation (Spark Structured Streaming) logical operator is requested for the metadataOutput