FileFormat¶
FileFormat is an abstraction of data sources that can read and write data stored in files.
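A minimal sketch of a custom implementation (the SimpleFileFormat class and its hard-coded schema are hypothetical) that fills in the two abstract methods of the contract, inferSchema and prepareWrite (both described below):

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Hypothetical format with a fixed schema; only the two abstract methods
// of the FileFormat contract are implemented.
class SimpleFileFormat extends FileFormat {

  // Schema "inference" is trivial since the schema is hard-coded.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    Some(StructType.fromDDL("id LONG, name STRING"))

  // Writing is out of scope for this sketch.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("write is not supported")
}
```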
Contract¶
Building Data Reader¶
buildReader(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
Builds a Catalyst data reader (a function that reads a single PartitionedFile to produce InternalRows).
buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work):
buildReader is not supported for [this]
Used when FileFormat is requested to buildReaderWithPartitionValues.
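A sketch of an override (building on the hypothetical SimpleFileFormat above; the placeholder reader yields one all-null row per file instead of decoding real records):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class ReadableFileFormat extends SimpleFileFormat {

  // The returned function is executed on executors, once per PartitionedFile.
  override protected def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
    (file: PartitionedFile) =>
      // A real reader would open the file and decode records;
      // this placeholder yields a single all-null row per file.
      Iterator.single(InternalRow.fromSeq(requiredSchema.map(_ => null)))
  }
}
```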
Schema Inference¶
inferSchema(
sparkSession: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType]
Infers the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.
Used when:
- HiveMetastoreCatalog is requested to inferIfNeeded
- DataSource is requested to getOrInferFileFormatSchema and resolveRelation
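Inside a FileFormat implementation, an override could look as follows (the inferred schema here is made up for illustration):

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Returns None when there is nothing to infer from, so that resolution
// fails with a clear "unable to infer schema" error.
override def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    files: Seq[FileStatus]): Option[StructType] =
  if (files.isEmpty) None
  else {
    // A real format would open (a sample of) the files and merge their schemas.
    Some(StructType(Seq(
      StructField("path", StringType, nullable = false),
      StructField("length", LongType, nullable = false))))
  }
```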
isSplitable¶
isSplitable(
sparkSession: SparkSession,
options: Map[String, String],
path: Path): Boolean
Controls whether this format (under the given Hadoop Path and options) is splittable or not.
Default: false
Always splittable:
- AvroFileFormat
- OrcFileFormat
- ParquetFileFormat
Never splittable:
- BinaryFileFormat
Used when FileSourceScanExec physical operator is requested to create an RDD for a non-bucketed read (when requested for the inputRDD).
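For illustration, an override in the spirit of TextBasedFileFormat's codec check: uncompressed files and files written with a splittable compression codec can be split:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}
import org.apache.spark.sql.SparkSession

override def isSplitable(
    sparkSession: SparkSession,
    options: Map[String, String],
    path: Path): Boolean = {
  // Look up the compression codec (if any) by the file name suffix.
  val codec = new CompressionCodecFactory(sparkSession.sessionState.newHadoopConf())
    .getCodec(path)
  // No codec means an uncompressed file; otherwise the codec must support splitting.
  codec == null || codec.isInstanceOf[SplittableCompressionCodec]
}
```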
Preparing Write¶
prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory
Prepares a write job and returns an OutputWriterFactory.
Used when FileFormatWriter utility is used to write out a query result.
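A sketch of an override returning an anonymous OutputWriterFactory (assuming the Spark 3 OutputWriter contract with its path() method; the .simple extension and the no-op writer bodies are made up):

```scala
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

override def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType): OutputWriterFactory =
  new OutputWriterFactory {
    // Extension appended to the generated file names.
    override def getFileExtension(context: TaskAttemptContext): String = ".simple"

    // Called once per write task to create the per-file writer.
    override def newInstance(
        filePath: String,
        dataSchema: StructType,
        context: TaskAttemptContext): OutputWriter =
      new OutputWriter {
        override def path(): String = filePath
        override def write(row: InternalRow): Unit = { /* encode and persist the row */ }
        override def close(): Unit = { /* flush and close the underlying stream */ }
      }
  }
```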
supportBatch¶
supportBatch(
sparkSession: SparkSession,
dataSchema: StructType): Boolean
Whether this format supports vectorized decoding or not
Default: false
Used when:
- FileSourceScanExec physical operator is requested for the supportsBatch flag
- OrcFileFormat is requested to buildReaderWithPartitionValues
- ParquetFileFormat is requested to buildReaderWithPartitionValues
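A simplified override in the spirit of ParquetFileFormat.supportBatch (the real check also consults format-specific settings such as the vectorized reader flag): batch decoding is only claimed for flat schemas with whole-stage codegen enabled:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

override def supportBatch(
    sparkSession: SparkSession,
    dataSchema: StructType): Boolean =
  sparkSession.sessionState.conf.wholeStageEnabled &&
    dataSchema.forall { field =>
      field.dataType match {
        case _: StructType | _: ArrayType | _: MapType => false // nested: row-based
        case _ => true // flat columns can be decoded into ColumnVectors
      }
    }
```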
supportDataType¶
supportDataType(
dataType: DataType): Boolean
Controls whether this format supports the given DataType in read or write paths
Default: true (all data types are supported)
Used when DataSourceUtils is used to verifySchema.
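As an illustration, a hypothetical flat-only format (in the spirit of CSVFileFormat) could reject nested types:

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Accept atomic types only; nested types cannot be (de)serialized by this format.
override def supportDataType(dataType: DataType): Boolean = dataType match {
  case _: ArrayType | _: MapType | _: StructType => false
  case _ => true
}
```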
Vector Types¶
vectorTypes(
requiredSchema: StructType,
partitionSchema: StructType,
sqlConf: SQLConf): Option[Seq[String]]
Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode)
Default: None (undefined)
Used when FileSourceScanExec physical operator is requested for the vectorTypes.
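A sketch modeled on ParquetFileFormat.vectorTypes: one on-heap or off-heap ColumnVector class name per required and partition column, depending on the offHeapColumnVectorEnabled setting:

```scala
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

override def vectorTypes(
    requiredSchema: StructType,
    partitionSchema: StructType,
    sqlConf: SQLConf): Option[Seq[String]] = {
  // One ColumnVector class name per column (data and partition columns alike).
  val vectorClass =
    if (sqlConf.offHeapColumnVectorEnabled) classOf[OffHeapColumnVector].getName
    else classOf[OnHeapColumnVector].getName
  Some(Seq.fill(requiredSchema.length + partitionSchema.length)(vectorClass))
}
```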
Implementations¶
- AvroFileFormat
- BinaryFileFormat
- HiveFileFormat
- ImageFileFormat
- OrcFileFormat
- ParquetFileFormat
- TextBasedFileFormat
Building Data Reader With Partition Values¶
buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
buildReaderWithPartitionValues builds a data reader with partition column values appended.
Note
buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function.
buildReaderWithPartitionValues builds a data reader with the input parameters and gives a data reader function (of a PartitionedFile to an Iterator[InternalRow]) that does the following:
- Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema
- Applies the data reader to a PartitionedFile and converts the result using the converter on the joined row with the partition column values appended.
buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.
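The gist of the default implementation can be sketched as follows (simplified; it relies on internal Catalyst APIs such as StructType.toAttributes and GenerateUnsafeProjection, so it is meant as a reading aid rather than standalone code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Wraps a plain data reader so that every row it produces gets the
// partition column values appended (as an UnsafeRow).
def withPartitionValues(
    dataReader: PartitionedFile => Iterator[InternalRow],
    requiredSchema: StructType,
    partitionSchema: StructType): PartitionedFile => Iterator[InternalRow] = {
  (file: PartitionedFile) => {
    val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes
    // Generated on the executor: code-generated classes are not serializable.
    val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
    val joinedRow = new JoinedRow()
    dataReader(file).map { dataRow =>
      // Data columns first, then the partition column values of this file.
      appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
    }
  }
}
```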