FileScan¶

FileScan is an extension of the Scan abstraction for scans in Batch queries.

SupportsReportStatistics¶

FileScan is a SupportsReportStatistics.

Contract¶

DataFilters¶

dataFilters: Seq[Expression]

Expressions

Used when:

FileScan is requested for the normalized DataFilters, metadata and partitions

FileIndex¶

fileIndex: PartitioningAwareFileIndex

PartitioningAwareFileIndex

getFileUnSplittableReason¶

getFileUnSplittableReason(
  path: Path): String

Partition Filters¶

partitionFilters: Seq[Expression]

Expressions

Read Data Schema¶

readDataSchema: StructType

StructType

Three Schemas

Besides the read data schema of a FileScan, there are two other schemas:

readPartitionSchema
readSchema

Read Partition Schema¶

readPartitionSchema: StructType

seqToString¶

seqToString(
  seq: Seq[Any]): String

sparkSession¶

sparkSession: SparkSession

SparkSession associated with this FileScan

withFilters¶

withFilters(
  partitionFilters: Seq[Expression],
  dataFilters: Seq[Expression]): FileScan

Implementations¶

ParquetScan
others

description¶

description(): String

description is part of the Scan abstraction.

description...FIXME

Planning Input Partitions¶

Signature

planInputPartitions(): Array[InputPartition]

planInputPartitions is part of the Batch abstraction.

planInputPartitions returns the file partitions (as an array of InputPartitions).

File Partitions¶

partitions: Seq[FilePartition]

partitions requests the PartitioningAwareFileIndex for the partition directories (selectedPartitions).

For every selected partition directory, partitions takes the Hadoop FileStatuses and splits every file (only if isSplitable) into chunks of at most maxSplitBytes. The resulting file splits are then sorted by size in descending order (largest first).

In the end, partitions returns the FilePartitions.
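The split-and-sort steps above can be sketched in plain Scala. This is a simplified model, not Spark's implementation: `FileChunk`, `FilePartitionsSketch`, `split` and `pack` are hypothetical names, and the greedy packing in `pack` is an assumption about how sorted splits could be grouped into partitions (the text above only says that FilePartitions are returned).

```scala
// Simplified stand-in for a slice of a file; NOT Spark's PartitionedFile.
final case class FileChunk(path: String, start: Long, length: Long)

object FilePartitionsSketch {

  // Cut a file into chunks of at most maxSplitBytes, but only if it is splittable.
  // maxSplitBytes is assumed to be positive.
  def split(path: String, size: Long, splittable: Boolean, maxSplitBytes: Long): Seq[FileChunk] =
    if (splittable && size > maxSplitBytes)
      (0L until size by maxSplitBytes).map { offset =>
        FileChunk(path, offset, math.min(maxSplitBytes, size - offset))
      }
    else Seq(FileChunk(path, 0L, size))

  // Sort chunks largest-first, then pack them greedily (assumed strategy):
  // the open partition is closed when the next chunk would push it past maxSplitBytes.
  def pack(chunks: Seq[FileChunk], maxSplitBytes: Long): Seq[Seq[FileChunk]] = {
    val largestFirst = chunks.sortBy(_.length)(Ordering[Long].reverse)
    val start = (Vector.empty[Vector[FileChunk]], Vector.empty[FileChunk], 0L)
    val (closed, open, _) = largestFirst.foldLeft(start) {
      case ((done, current, bytes), chunk) =>
        if (current.nonEmpty && bytes + chunk.length > maxSplitBytes)
          (done :+ current, Vector(chunk), chunk.length)
        else
          (done, current :+ chunk, bytes + chunk.length)
    }
    if (open.nonEmpty) closed :+ open else closed
  }
}
```

With a 10-byte splittable file and a 3-byte unsplittable file at `maxSplitBytes = 4`, `split` yields chunks of 4, 4 and 2 bytes for the first file and a single 3-byte chunk for the second. Sorting largest-first before packing keeps the big chunks from piling up in the last partitions.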

estimateStatistics¶

estimateStatistics(): Statistics

estimateStatistics is part of the SupportsReportStatistics abstraction.

estimateStatistics...FIXME

Converting to Batch¶

toBatch: Batch

toBatch is part of the Scan abstraction.

toBatch is this FileScan.

Read Schema¶

readSchema(): StructType

readSchema is part of the Scan abstraction.

readSchema is a StructType with the fields of the readDataSchema followed by the fields of the readPartitionSchema.
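As an illustration of how the three schemas relate, the following sketch combines a data schema and a partition schema using Spark's `StructType`. It assumes `spark-sql` is on the classpath; the object name `ReadSchemaSketch` and the column names (`id`, `name`, `year`) are made up for the example.

```scala
import org.apache.spark.sql.types._

object ReadSchemaSketch {
  // Columns read from the data files themselves (hypothetical columns).
  val readDataSchema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // Columns derived from the partition directories (hypothetical column).
  val readPartitionSchema: StructType = StructType(Seq(
    StructField("year", IntegerType)))

  // Data fields first, partition fields last.
  val readSchema: StructType =
    StructType(readDataSchema.fields ++ readPartitionSchema.fields)
}
```

Here `ReadSchemaSketch.readSchema.fieldNames` lists `id`, `name` and `year`, in that order.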

isSplitable¶

isSplitable(
  path: Path): Boolean

isSplitable is disabled by default (false), i.e. files are not split unless an implementation overrides it.

FileScan	isSplitable
AvroScan	`true`
ParquetScan	See isSplitable of ParquetScan

Used when:

FileScan is requested to getFileUnSplittableReason and for the partitions

SupportsMetadata¶

FileScan is a SupportsMetadata.

Metadata¶

getMetaData(): Map[String, String]

getMetaData is part of the SupportsMetadata abstraction.

getMetaData returns the following metadata:

Name	Description
`Format`	The lower-case name of this FileScan (with `Scan` removed)
`ReadSchema`	catalogString of the Read Data Schema
`PartitionFilters`	Partition Filters
`DataFilters`	Data Filters
`Location`	The name of the PartitioningAwareFileIndex followed by its root paths (as many of them as fit within spark.sql.maxMetadataStringLength characters)
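The `Format` entry can be illustrated with a small plain-Scala sketch. It only models the rule described in the table (drop `Scan` from the class name and lower-case the rest); `FormatNameSketch` and `formatName` are hypothetical names, not part of Spark.

```scala
import java.util.Locale

object FormatNameSketch {
  // Derive a format name from the simple class name of a scan,
  // e.g. the simple name of a ParquetScan instance.
  def formatName(scanSimpleName: String): String =
    scanSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)
}
```

For example, `FormatNameSketch.formatName("ParquetScan")` gives `parquet`. Using `Locale.ROOT` keeps the result independent of the JVM's default locale.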