
FileScan

FileScan is an extension of the Scan abstraction for scans in Batch queries.

SupportsReportStatistics

FileScan is a SupportsReportStatistics.

Contract

DataFilters

dataFilters: Seq[Expression]

Expressions

Used when:

FileIndex

fileIndex: PartitioningAwareFileIndex

PartitioningAwareFileIndex

getFileUnSplittableReason

getFileUnSplittableReason(
  path: Path): String

Partition Filters

partitionFilters: Seq[Expression]

Expressions

Read Data Schema

readDataSchema: StructType

StructType

Three Schemas

Besides the read data schema of a FileScan, there are two other schemas:

  1. readPartitionSchema
  2. readSchema

Read Partition Schema

readPartitionSchema: StructType

seqToString

seqToString(
  seq: Seq[Any]): String

sparkSession

sparkSession: SparkSession

SparkSession associated with this FileScan

withFilters

withFilters(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): FileScan

Implementations

description

description(): String

description is part of the Scan abstraction.


description...FIXME

Planning Input Partitions

Signature
planInputPartitions(): Array[InputPartition]

planInputPartitions is part of the Batch abstraction.

planInputPartitions returns the file partitions (as an array).

File Partitions

partitions: Seq[FilePartition]

partitions requests the PartitioningAwareFileIndex for the partition directories (selectedPartitions).

For every selected partition directory, partitions takes the Hadoop FileStatuses, splits every file (if isSplitable) into chunks of at most maxSplitBytes, and sorts the resulting splits by size (in descending order).

In the end, partitions returns the FilePartitions.
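The split-and-sort step can be sketched in plain Scala. This is a minimal sketch and not Spark's code: FileInfo, FileSplit, SplitSketch and the splittable predicate are hypothetical stand-ins for Hadoop's FileStatus, Spark's file splits and isSplitable, and how maxSplitBytes is computed is left out.

```scala
// Hypothetical stand-ins (not Spark or Hadoop classes).
final case class FileInfo(path: String, length: Long)
final case class FileSplit(path: String, start: Long, length: Long)

object SplitSketch {
  // Cut one file into chunks of at most maxSplitBytes.
  // A file that is not splittable (or is empty) becomes a single split.
  def splitFile(file: FileInfo, splittable: Boolean, maxSplitBytes: Long): Seq[FileSplit] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    if (splittable && file.length > 0) {
      (0L until file.length by maxSplitBytes).map { offset =>
        val size = math.min(maxSplitBytes, file.length - offset)
        FileSplit(file.path, offset, size)
      }
    } else {
      Seq(FileSplit(file.path, 0L, file.length))
    }
  }

  // Split all files of one partition directory and order the splits largest first.
  def splitsOf(
      files: Seq[FileInfo],
      splittable: String => Boolean,
      maxSplitBytes: Long): Seq[FileSplit] =
    files
      .flatMap(f => splitFile(f, splittable(f.path), maxSplitBytes))
      .sortBy(_.length)(Ordering[Long].reverse)
}
```

Sorting the splits largest first is what lets a later packing step fill partitions with big splits before topping them up with small ones.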

estimateStatistics

estimateStatistics(): Statistics

estimateStatistics is part of the SupportsReportStatistics abstraction.


estimateStatistics...FIXME

Converting to Batch

toBatch: Batch

toBatch is part of the Scan abstraction.


toBatch is this FileScan.

Read Schema

readSchema(): StructType

readSchema is part of the Scan abstraction.


readSchema is the fields of the readDataSchema followed by the fields of the readPartitionSchema.
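How the two schemas combine can be illustrated without Spark on the classpath. In this minimal sketch, Field, Schema and ReadSchemaSketch are hypothetical, simplified stand-ins for StructField and StructType, and the order (data columns first, partition columns after) is an assumption of the sketch.

```scala
// Hypothetical, simplified stand-ins for Spark's StructField and StructType.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field]) {
  def fieldNames: Seq[String] = fields.map(_.name)
}

object ReadSchemaSketch {
  // Assumed order: data columns first, partition columns appended after them.
  def readSchema(readDataSchema: Schema, readPartitionSchema: Schema): Schema =
    Schema(readDataSchema.fields ++ readPartitionSchema.fields)
}

object ReadSchemaDemo {
  val dataSchema = Schema(Seq(Field("id", "bigint"), Field("name", "string")))
  val partitionSchema = Schema(Seq(Field("year", "int")))
  val combined = ReadSchemaSketch.readSchema(dataSchema, partitionSchema)
}
```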

isSplitable

isSplitable(
  path: Path): Boolean

isSplitable is disabled by default (false).

| FileScan | isSplitable |
|----------|-------------|
| AvroScan | true |
| ParquetScan | isSplitable |
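The default-and-override pattern can be shown with a small sketch. ScanSketch and the two objects are hypothetical names that only mirror the listing above; the path is a plain String here rather than a Hadoop Path.

```scala
// Hypothetical, simplified scan hierarchy (not Spark's classes).
trait ScanSketch {
  // Default: files are not splittable unless an implementation says otherwise.
  def isSplitable(path: String): Boolean = false
}

// Keeps the default and so never splits files.
object DefaultScanSketch extends ScanSketch

// Mirrors an implementation that always allows splitting.
object AvroLikeScanSketch extends ScanSketch {
  override def isSplitable(path: String): Boolean = true
}
```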

Used when:

SupportsMetadata

FileScan is a SupportsMetadata.

Metadata

getMetaData(): Map[String, String]

getMetaData is part of the SupportsMetadata abstraction.


getMetaData returns the following metadata:

| Name | Description |
|------|-------------|
| Format | The lower-case name of this FileScan (with Scan removed) |
| ReadSchema | catalogString of the Read Data Schema |
| PartitionFilters | Partition Filters |
| DataFilters | Data Filters |
| Location | PartitioningAwareFileIndex followed by the root paths (as many as fit within spark.sql.maxMetadataStringLength characters) |
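As an illustration of how such a metadata map could be put together, here is a minimal sketch. MetadataSketch and its parameters are hypothetical, the bracketed rendering of the filters is an assumption (the text does not say what seqToString produces), and the location string is passed in ready-made rather than built from a file index.

```scala
import java.util.Locale

object MetadataSketch {
  // "ParquetScan" -> "parquet": drop "Scan" from the class name and lower-case it.
  def formatName(scanClassName: String): String =
    scanClassName.replace("Scan", "").toLowerCase(Locale.ROOT)

  // Assumed rendering of a sequence of filters: comma-separated, in brackets.
  def render(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")

  // Builds the five entries listed above from already-prepared pieces.
  def metadata(
      scanClassName: String,
      readDataSchemaCatalogString: String,
      partitionFilters: Seq[Any],
      dataFilters: Seq[Any],
      location: String): Map[String, String] =
    Map(
      "Format" -> formatName(scanClassName),
      "ReadSchema" -> readDataSchemaCatalogString,
      "PartitionFilters" -> render(partitionFilters),
      "DataFilters" -> render(dataFilters),
      "Location" -> location)
}
```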