

FileFormat is an abstraction of connectors that can read and write data stored in files.


Schema Inference

inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType]

Infers the schema of the given files (as Hadoop FileStatuses), if supported. Otherwise, None should be returned.
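A minimal sketch of a custom format, assuming Spark 3.x on the classpath. The class name FixedSchemaFileFormat and its single value column are hypothetical; a real format would inspect the files instead of reporting a fixed schema. The write path is stubbed out so that the class compiles:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical format for illustration only (not part of Spark).
class FixedSchemaFileFormat extends FileFormat {

  // Reports a fixed single-column schema when there is at least one file,
  // and None ("cannot infer") otherwise.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    if (files.isEmpty) None
    else Some(StructType(Seq(StructField("value", StringType))))

  // Writing is out of scope for this sketch.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("prepareWrite is not supported in this sketch")
}
```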


Used when:

Preparing Write

prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory.


Used when:


Building Data Reader With Partition Values

buildReaderWithPartitionValues(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

buildReaderWithPartitionValues is an enhanced buildReader: it builds a data reader and appends the partition column values to the internal rows that the reader function produces.

buildReaderWithPartitionValues builds a data reader (buildReader) with the input parameters and returns a function (from a PartitionedFile to an Iterator[InternalRow]) that does the following:

  1. Creates a converter by requesting GenerateUnsafeProjection to generate an UnsafeProjection for the attributes of the input requiredSchema and partitionSchema.

  2. Applies the data reader to a PartitionedFile and, for every row it produces, runs the converter on the joined row (the data row with the partition column values appended).
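The two steps can be sketched outside of a FileFormat as follows. This is a simplified illustration, not Spark's implementation: it uses UnsafeProjection.create on the combined schema in place of GenerateUnsafeProjection over attributes, and the schemas, data rows and partition value are made up.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

// Made-up schemas: one data column and one partition column.
val requiredSchema = StructType(Seq(StructField("id", LongType)))
val partitionSchema = StructType(Seq(StructField("year", IntegerType)))

// Step 1: a converter over the data columns followed by the partition columns.
val converter =
  UnsafeProjection.create(StructType(requiredSchema.fields ++ partitionSchema.fields))

// Stand-ins for the rows a data reader yields and for a file's partition values.
val dataRows = Iterator(InternalRow(1L), InternalRow(2L))
val partitionValues = InternalRow(2024)

// Step 2: join every data row with the partition values, then convert.
// copy() is needed here because the projection reuses its output row.
val joinedRow = new JoinedRow()
val rows = dataRows
  .map(dataRow => converter(joinedRow(dataRow, partitionValues)).copy())
  .toList
```

Every resulting row carries the data column at ordinal 0 and the partition column at ordinal 1.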

buildReaderWithPartitionValues is used when FileSourceScanExec physical operator is requested for the inputRDD.

Creating FileFormat Metadata Column

createFileMetadataCol(): AttributeReference

createFileMetadataCol cleans up (strips) the field-level metadata of the metadata fields.

In the end, createFileMetadataCol creates an AttributeReference for the _metadata column (with the file internal metadata).

createFileMetadataCol is used when:

Building Data Reader

buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

Builds a Catalyst data reader (a function that reads a single PartitionedFile and produces InternalRows).

buildReader throws an UnsupportedOperationException by default (and must therefore be overridden by any format that supports reads):

buildReader is not supported for [this]

Used when:


isSplitable(
  sparkSession: SparkSession,
  options: Map[String, String],
  path: Path): Boolean

Controls whether this format (under the given Hadoop Path and the options) is splittable or not.

Default: false

Always splittable:

Never splittable:

  • BinaryFileFormat
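One way a format could decide splittability per path is to look at the compression codec implied by the file extension. The helper below is a hypothetical sketch (the name isSplittableByCodec is not a Spark API) that relies only on Hadoop's CompressionCodecFactory; an isSplitable override could delegate to something like it.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical helper: a file is treated as splittable when it is not
// compressed (no codec matches its extension) or when its codec supports splitting.
def isSplittableByCodec(conf: Configuration, path: Path): Boolean = {
  val codec = new CompressionCodecFactory(conf).getCodec(path)
  codec == null || codec.isInstanceOf[SplittableCompressionCodec]
}
```

With Hadoop's default codecs, an uncompressed or bzip2 file would be reported as splittable and a gzip file would not.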

Used when:


supportBatch(
  sparkSession: SparkSession,
  dataSchema: StructType): Boolean

Controls whether this format supports vectorized decoding or not.

Default: false

Used when:


supportDataType(
  dataType: DataType): Boolean

Controls whether this format supports the given DataType in read or write paths.

Default: true (all data types are supported)
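A format with narrower support would override the default. The function below is a hypothetical rule (the name supportsFlatTypesOnly and the chosen set of types are assumptions for illustration) that accepts a handful of scalar types and rejects everything else, including nested types:

```scala
import org.apache.spark.sql.types._

// Hypothetical rule: a few scalar types only; arrays, maps, structs
// and any other type are rejected.
def supportsFlatTypesOnly(dataType: DataType): Boolean = dataType match {
  case BooleanType | IntegerType | LongType | DoubleType | StringType => true
  case _ => false
}
```

An override of supportDataType could use such a match expression as its body.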

Used when:

  • DataSourceUtils is used to verifySchema


supportFieldName(
  name: String): Boolean

supportFieldName controls whether this format supports the given field name in read or write paths.

supportFieldName is true (all field names are supported) by default.


supportFieldName is used when:

Vector Types

vectorTypes(
  requiredSchema: StructType,
  partitionSchema: StructType,
  sqlConf: SQLConf): Option[Seq[String]]

Defines the fully-qualified class names (types) of the concrete ColumnVectors for every column in the input requiredSchema and partitionSchema schemas (to use in columnar processing mode).

Default: None (undefined)
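As a rough sketch of what a non-default answer could look like, the function below returns one class name per column of the two schemas. It is an assumption-laden simplification: the name vectorTypesSketch is hypothetical, and a plain offHeap flag stands in for whatever a real format would read from the SQLConf parameter.

```scala
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

// Hypothetical helper: the same ColumnVector class for every required
// and partition column, chosen by a simple on-heap/off-heap switch.
def vectorTypesSketch(
    requiredSchema: StructType,
    partitionSchema: StructType,
    offHeap: Boolean): Option[Seq[String]] = {
  val vectorClass =
    if (offHeap) classOf[OffHeapColumnVector].getName
    else classOf[OnHeapColumnVector].getName
  Some(Seq.fill(requiredSchema.length + partitionSchema.length)(vectorClass))
}

// Made-up schemas: one data column and one partition column.
val vectorClassNames = vectorTypesSketch(
  StructType(Seq(StructField("id", LongType))),
  StructType(Seq(StructField("year", IntegerType))),
  offHeap = false)
```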

Used when:

  • FileSourceScanExec physical operator is requested for the vectorTypes

Metadata Columns

metadataSchemaFields: Seq[StructField]

metadataSchemaFields defines the following non-nullable hidden file metadata columns:

Name                      Data Type
file_path                 StringType
file_name                 StringType
file_size                 LongType
file_block_start          LongType
file_block_length         LongType
file_modification_time    TimestampType
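From the user's side, these fields are reachable through the hidden _metadata column of a file-based source. The following is a small sketch, assuming a Spark version that supports the _metadata column (3.3 or later) and a local session; the temporary directory and dataset are made up for the demonstration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("metadata-demo")
  .getOrCreate()

// Write a tiny Parquet dataset to a throwaway directory.
val dir = Files.createTempDirectory("metadata-demo").resolve("data").toString
spark.range(3).write.parquet(dir)

// _metadata is hidden: it is not part of the regular schema and
// only shows up when selected explicitly.
val withMetadata = spark.read.parquet(dir)
  .select("id", "_metadata.file_name", "_metadata.file_size")

withMetadata.show(truncate = false)
```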


metadataSchemaFields is used when: