FileScan¶
FileScan is an extension of the Scan abstraction for scans in Batch queries.
SupportsReportStatistics¶
FileScan is a SupportsReportStatistics.
Contract¶
DataFilters¶
dataFilters: Seq[Expression]
Used when:
FileScan is requested for normalized DataFilters, metadata, partitions
FileIndex¶
fileIndex: PartitioningAwareFileIndex
getFileUnSplittableReason¶
getFileUnSplittableReason(
path: Path): String
Partition Filters¶
partitionFilters: Seq[Expression]
Read Data Schema¶
readDataSchema: StructType
Three Schemas
Besides the read data schema of a FileScan, there are two others:
Read Partition Schema¶
readPartitionSchema: StructType
seqToString¶
seqToString(
seq: Seq[Any]): String
sparkSession¶
sparkSession: SparkSession
SparkSession associated with this FileScan
withFilters¶
withFilters(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): FileScan
Implementations¶
- ParquetScan
- others
description¶
description(): String
description is part of the Scan abstraction.
description...FIXME
Planning Input Partitions¶
Signature
planInputPartitions(): Array[InputPartition]
planInputPartitions is part of the Batch abstraction.
planInputPartitions returns the file partitions (as an array).
File Partitions¶
partitions: Seq[FilePartition]
partitions requests the PartitioningAwareFileIndex for the partition directories (selectedPartitions).
partitions splits the Hadoop FileStatuses of every selected partition directory into chunks of at most maxSplitBytes (only if isSplitable) and sorts the resulting splits by size in descending order.
In the end, partitions returns the FilePartitions.
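The splitting and ordering steps above can be sketched in plain Scala, without Spark on the classpath. FileSlice, splitFile and planSlices are hypothetical names used for illustration only; they are not Spark classes, and the final packing of the splits into FilePartitions is not modelled here.

```scala
// Hypothetical model of a byte range of a file (not a Spark class).
final case class FileSlice(path: String, start: Long, length: Long)

// Cut one file into slices of at most maxSplitBytes bytes.
// A non-splittable file is always kept as a single slice.
def splitFile(
    path: String,
    size: Long,
    splittable: Boolean,
    maxSplitBytes: Long): Seq[FileSlice] = {
  require(maxSplitBytes > 0, "maxSplitBytes must be positive")
  if (splittable) {
    (0L until size by maxSplitBytes).map { offset =>
      FileSlice(path, offset, math.min(maxSplitBytes, size - offset))
    }
  } else {
    Seq(FileSlice(path, 0L, size))
  }
}

// Split every file, then order the slices from largest to smallest.
def planSlices(
    files: Seq[(String, Long)],
    isSplitable: String => Boolean,
    maxSplitBytes: Long): Seq[FileSlice] =
  files
    .flatMap { case (path, size) =>
      splitFile(path, size, isSplitable(path), maxSplitBytes)
    }
    .sortBy(_.length)(Ordering[Long].reverse)

// Example: a 250-byte splittable file and a 300-byte non-splittable one.
val slices = planSlices(
  Seq("a.parquet" -> 250L, "b.gz" -> 300L),
  (path: String) => path.endsWith(".parquet"),
  100L)

// The slice lengths come out as 300, 100, 100, 50.
println(slices.map(_.length).mkString(", "))
```

Sorting the largest slices first is what lets a later packing step fill partitions more evenly, since small slices can top up partitions that already hold a large one.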
estimateStatistics¶
estimateStatistics(): Statistics
estimateStatistics is part of the SupportsReportStatistics abstraction.
estimateStatistics...FIXME
Converting to Batch¶
toBatch: Batch
toBatch is part of the Scan abstraction.
toBatch returns this FileScan (a FileScan is its own Batch).
Read Schema¶
readSchema(): StructType
readSchema is part of the Scan abstraction.
readSchema is the fields of the readDataSchema followed by the fields of the readPartitionSchema.
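A minimal sketch of how the two schemas combine, assuming spark-sql is on the classpath (e.g. in spark-shell) and that the data columns come before the partition columns. The field names are made up for illustration.

```scala
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructType}

// Hypothetical schemas: columns read from the files themselves...
val readDataSchema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
// ...and columns derived from the partition directories.
val readPartitionSchema = new StructType()
  .add("year", IntegerType)

// Data columns first, partition columns appended after them.
val readSchema = StructType(readDataSchema.fields ++ readPartitionSchema.fields)

// prints "id, name, year"
println(readSchema.fieldNames.mkString(", "))
```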
isSplitable¶
isSplitable(
path: Path): Boolean
isSplitable is disabled by default (false).
| FileScan | isSplitable |
|---|---|
| AvroScan | true |
| ParquetScan | isSplitable |
Used when:
FileScan is requested to getFileUnSplittableReason and partitions
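The "off by default, opt in per format" pattern can be sketched as follows. SketchFileScan, AlwaysSplittableScan and DefaultScan are hypothetical stand-ins, not Spark's own types, and a plain String replaces Hadoop's Path to keep the sketch dependency-free.

```scala
// Hypothetical stand-in for the FileScan contract (not Spark's trait).
trait SketchFileScan {
  // Files are not split unless a format opts in.
  def isSplitable(path: String): Boolean = false
}

// A format whose files can be read from arbitrary offsets opts in.
object AlwaysSplittableScan extends SketchFileScan {
  override def isSplitable(path: String): Boolean = true
}

// A format that keeps the default and is read one whole file at a time.
object DefaultScan extends SketchFileScan
```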
SupportsMetadata¶
FileScan is a SupportsMetadata.
Metadata¶
getMetaData(): Map[String, String]
getMetaData is part of the SupportsMetadata abstraction.
getMetaData returns the following metadata:
| Name | Description |
|---|---|
| Format | The lower-case name of this FileScan (with Scan removed) |
| ReadSchema | catalogString of the Read Data Schema |
| PartitionFilters | Partition Filters |
| DataFilters | Data Filters |
| Location | PartitioningAwareFileIndex followed by root paths (with their number in the file listing up to spark.sql.maxMetadataStringLength) |
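How the Format entry is derived can be illustrated with a small helper. formatName is a hypothetical function that takes the simple class name as a String; it is a sketch of the rule in the table, not Spark's own code.

```scala
import java.util.Locale

// Derive the Format entry from a scan's simple class name:
// drop "Scan" and lower-case the rest (locale-independent).
def formatName(scanClassName: String): String =
  scanClassName.replace("Scan", "").toLowerCase(Locale.ROOT)

// prints "parquet"
println(formatName("ParquetScan"))
```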