FileFormat

FileFormat is an abstraction of data sources that can read and write data stored in files.

Contract

Building Data Reader

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

Builds a Catalyst data reader (a function that reads a single PartitionedFile to produce InternalRows).

buildReader throws an UnsupportedOperationException by default (and should therefore be overridden to work):

buildReader is not supported for [this]

Used when FileFormat is requested to buildReaderWithPartitionValues.
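The essence of the contract is that buildReader is a factory for reader functions. The following sketch illustrates that shape with simplified stand-in types (the real PartitionedFile and InternalRow carry considerably more, e.g. offsets, locations and unsafe row encoding; the delimiter option and the fabricated row are purely illustrative):

```scala
// Simplified stand-ins for Spark's PartitionedFile and InternalRow,
// only to illustrate the shape of the buildReader contract.
case class PartitionedFile(filePath: String, start: Long, length: Long)
case class InternalRow(values: Seq[Any])

// A buildReader-style factory: given per-format options, it returns a
// function that reads one file split and produces rows.
def buildReader(options: Map[String, String]): PartitionedFile => Iterator[InternalRow] = {
  val delimiter = options.getOrElse("delimiter", ",")
  (file: PartitionedFile) =>
    // A real implementation would open file.filePath and decode records;
    // here we fabricate a single row from the path itself.
    Iterator(InternalRow(file.filePath.split(delimiter).toSeq))
}

val reader = buildReader(Map("delimiter" -> "/"))
val rows = reader(PartitionedFile("data/part-0000", 0, 128)).toList
```

The two-step shape matters: the outer call runs once on the driver (resolving options and configuration), while the returned function is shipped to executors and applied to every file split.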

Schema Inference

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

Infers the schema of the given files (as Hadoop FileStatuses) if supported. Otherwise, None should be returned.
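A minimal sketch of that Option-based protocol, with plain strings standing in for FileStatus and StructType (the header lookup is a made-up stand-in for actually opening and sampling the files):

```scala
// inferSchema-style sketch: inspect the available files and infer a schema,
// returning None when nothing can be inferred. The "schema" is just a list
// of column names parsed from a header string, standing in for StructType.
def inferSchema(
    files: Seq[String],
    headers: Map[String, String]): Option[Seq[String]] =
  files.headOption
    .flatMap(headers.get)        // read the header of the first file
    .map(_.split(",").toSeq)     // one column name per field
```

Returning None (rather than throwing) lets the caller fall back to a user-specified schema or fail with a proper analysis error.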


isSplitable

isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Controls whether the format (under the given path as Hadoop Path) is splittable or not

Default: false

Used when FileSourceScanExec physical operator is requested to create an RDD for non-bucketed reads (when requested for the inputRDD)
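As an illustration of a typical override: text-based formats are splittable unless the file is compressed with a non-splittable codec. The suffix list below is a simplification; Spark consults Hadoop's CompressionCodecFactory for the given path rather than matching file extensions directly:

```scala
// isSplitable-style sketch: a text format is typically splittable unless
// the file is compressed with a non-splittable codec.
// Illustrative suffix list, not Spark's actual codec lookup.
val nonSplittableSuffixes = Seq(".gz", ".snappy", ".lz4")

def isSplitable(path: String): Boolean =
  !nonSplittableSuffixes.exists(suffix => path.endsWith(suffix))
```

When isSplitable is false, the whole file becomes a single partition regardless of its size, which is why the conservative default matters for correctness (a gzip stream cannot be decoded from the middle).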

Preparing Write

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory

Used when FileFormatWriter utility is used to write out a query result
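The factory indirection exists because prepareWrite runs once on the driver while writers are opened per task and per output file. A sketch with stand-in types (the names mirror Spark's, but the signatures are simplified: Spark's newInstance also receives a TaskAttemptContext and the data schema, and real writers target Hadoop output streams):

```scala
import scala.collection.mutable.ArrayBuffer

// prepareWrite-style sketch: configure the write once, return a factory
// that tasks use to open one writer per output file.
trait OutputWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

trait OutputWriterFactory {
  def newInstance(path: String): OutputWriter
}

// Collects formatted lines in memory instead of writing a real file.
class InMemoryWriter(delimiter: String) extends OutputWriter {
  val lines = ArrayBuffer.empty[String]
  def write(row: Seq[Any]): Unit = lines += row.mkString(delimiter)
  def close(): Unit = () // a real writer would flush and close its stream
}

def prepareWrite(options: Map[String, String]): OutputWriterFactory = {
  val delimiter = options.getOrElse("delimiter", ",")
  new OutputWriterFactory {
    def newInstance(path: String): OutputWriter = new InMemoryWriter(delimiter)
  }
}
```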

supportBatch

supportBatch(
  sparkSession: SparkSession,
  dataSchema: StructType): Boolean

Controls whether the format supports vectorized decoding (aka columnar batch) or not

Default: false
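Formats that do override supportBatch (e.g. Parquet and ORC) typically allow columnar decoding only when every column has an atomic (non-nested) type. A sketch of that style of check, with a tiny stand-in type tree in place of Spark's DataType hierarchy:

```scala
// supportBatch-style sketch: offer columnar (batch) decoding only when
// every field is atomic. Stand-in types, not Spark's DataType hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType extends DataType
case class ArrayType(elementType: DataType) extends DataType

def supportBatch(schema: Seq[DataType]): Boolean =
  schema.forall {
    case _: ArrayType => false // nested types disable vectorized decoding
    case _            => true
  }
```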


supportDataType

supportDataType(
  dataType: DataType): Boolean

Controls whether this format supports the given DataType in read or write paths

Default: true (all data types are supported)

Used when DataSourceUtils is used to verifySchema
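Overrides usually reject a handful of unsupported types while recursing into nested ones. A sketch of that pattern, using a small stand-in type tree (a format that, hypothetically, cannot store interval columns):

```scala
// supportDataType-style sketch over a stand-in type hierarchy
// (Spark's real DataType tree is much richer).
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object CalendarIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Reject interval columns, recurse into nested types, accept the rest.
def supportDataType(dataType: DataType): Boolean = dataType match {
  case CalendarIntervalType => false
  case ArrayType(et)        => supportDataType(et)
  case StructType(fields)   => fields.forall(f => supportDataType(f.dataType))
  case _                    => true
}
```

The recursion is the important part: a schema is only acceptable if every leaf type, however deeply nested, is supported.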

Vector Types

vectorTypes(
  requiredSchema: StructType,
  partitionSchema: StructType,
  sqlConf: SQLConf): Option[Seq[String]]

Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode)

Default: None (undefined)

Used when FileSourceScanExec physical operator is requested for the vectorTypes
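A sketch of what an override returns, with the schemas simplified to column counts (the class names are the ones the Parquet format reports; everything else here is illustrative):

```scala
// vectorTypes-style sketch: one ColumnVector class name per column of the
// required and partition schemas, chosen by memory mode.
def vectorTypes(
    requiredCols: Int,
    partitionCols: Int,
    offHeap: Boolean): Option[Seq[String]] = {
  val vectorClass =
    if (offHeap) "org.apache.spark.sql.execution.vectorized.OffHeapColumnVector"
    else "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"
  Some(Seq.fill(requiredCols + partitionCols)(vectorClass))
}
```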

Implementations

Building Data Reader With Partition Values

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues builds a data reader with partition column values appended.

Note

buildReaderWithPartitionValues is simply an enhanced buildReader that appends partition column values to the internal rows produced by the reader function.

buildReaderWithPartitionValues builds a data reader (buildReader) with the input parameters and returns a reader function (from a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema

  2. Applies the data reader to a PartitionedFile and converts every produced row with the converter, applied to a joined row of the data row and the partition column values

buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.
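The wrapping described above can be sketched with stand-in types. Spark appends the partition values through a JoinedRow fed into a generated UnsafeProjection over requiredSchema and partitionSchema; plain sequence concatenation stands in for both here:

```scala
// Stand-ins to illustrate how buildReaderWithPartitionValues wraps a plain
// reader: every data row gets the file's partition column values appended.
case class PartitionedFile(filePath: String, partitionValues: Seq[Any])
type InternalRow = Seq[Any]

def withPartitionValues(
    baseReader: PartitionedFile => Iterator[InternalRow])
  : PartitionedFile => Iterator[InternalRow] =
  file => baseReader(file).map(dataRow => dataRow ++ file.partitionValues)

// A toy base reader producing two data rows per file.
val base: PartitionedFile => Iterator[InternalRow] =
  _ => Iterator(Seq("a", 1), Seq("b", 2))

val reader = withPartitionValues(base)
val rows = reader(PartitionedFile("year=2024/part-0000", Seq(2024))).toList
```

This is why concrete formats only need to implement buildReader without worrying about partition columns: the values come from the directory layout, not the file contents, so they are attached uniformly here.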