FileScan¶
FileScan is an extension of the Scan abstraction for scans in Batch queries.
SupportsReportStatistics¶
FileScan is a SupportsReportStatistics.
Contract¶
DataFilters¶
dataFilters: Seq[Expression]
Used when:
- FileScan is requested for the normalized DataFilters, metadata and partitions
FileIndex¶
fileIndex: PartitioningAwareFileIndex
getFileUnSplittableReason¶
getFileUnSplittableReason(
path: Path): String
Partition Filters¶
partitionFilters: Seq[Expression]
Read Data Schema¶
readDataSchema: StructType
Three Schemas
Besides the read data schema, a FileScan has two other schemas:
- Read Partition Schema
- Read Schema
Read Partition Schema¶
readPartitionSchema: StructType
seqToString¶
seqToString(
seq: Seq[Any]): String
sparkSession¶
sparkSession: SparkSession
SparkSession associated with this FileScan
withFilters¶
withFilters(
partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): FileScan
Implementations¶
- ParquetScan
- others
description¶
description(): String
description is part of the Scan abstraction.
description ...FIXME
Planning Input Partitions¶
planInputPartitions(): Array[InputPartition]
planInputPartitions is part of the Batch abstraction.
planInputPartitions returns the file partitions (as an array of InputPartitions).
File Partitions¶
partitions: Seq[FilePartition]
partitions requests the PartitioningAwareFileIndex for the partition directories (selectedPartitions).
For every selected partition directory, partitions takes the Hadoop FileStatuses, splits every file (if isSplitable) into chunks of up to maxSplitBytes, and sorts the resulting splits by size (largest first).
In the end, partitions returns the FilePartitions.
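The split-and-sort step can be illustrated with a minimal sketch that does not depend on Spark or Hadoop. FileInfo, FileSplit and FileSplitSketch are hypothetical stand-ins (not Spark classes), and the final packing of the splits into FilePartitions is left out.

```scala
// Hypothetical, simplified stand-ins for a Hadoop FileStatus and a file split.
final case class FileInfo(path: String, length: Long)
final case class FileSplit(path: String, start: Long, length: Long)

object FileSplitSketch {
  // A splittable file is cut into chunks of at most maxSplitBytes;
  // a non-splittable file becomes a single split covering the whole file.
  def splitFile(file: FileInfo, isSplitable: Boolean, maxSplitBytes: Long): Seq[FileSplit] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    if (isSplitable) {
      (0L until file.length by maxSplitBytes).map { offset =>
        val size = math.min(maxSplitBytes, file.length - offset)
        FileSplit(file.path, offset, size)
      }
    } else {
      Seq(FileSplit(file.path, 0L, file.length))
    }
  }

  // Splits every file and orders all the splits by size, largest first.
  def plan(
      files: Seq[FileInfo],
      isSplitable: String => Boolean,
      maxSplitBytes: Long): Seq[FileSplit] =
    files
      .flatMap(f => splitFile(f, isSplitable(f.path), maxSplitBytes))
      .sortBy(_.length)(Ordering[Long].reverse)
}
```

With this sketch, a 25-byte splittable file and maxSplitBytes of 10 yield three splits of 10, 10 and 5 bytes.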
estimateStatistics¶
estimateStatistics(): Statistics
estimateStatistics is part of the SupportsReportStatistics abstraction.
estimateStatistics ...FIXME
Converting to Batch¶
toBatch: Batch
toBatch is part of the Scan abstraction.
toBatch returns this FileScan (a FileScan is itself a Batch).
Read Schema¶
readSchema(): StructType
readSchema is part of the Scan abstraction.
readSchema is the fields of the readDataSchema followed by the fields of the readPartitionSchema.
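How the two schemas combine into the read schema can be sketched with plain case classes. Field, Schema and ReadSchemaSketch are hypothetical simplifications used only for illustration (Spark itself uses StructType and StructField).

```scala
// Hypothetical, simplified stand-ins for StructField and StructType.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

object ReadSchemaSketch {
  // Data columns come first, partition columns are appended after them.
  def readSchema(readDataSchema: Schema, readPartitionSchema: Schema): Schema =
    Schema(readDataSchema.fields ++ readPartitionSchema.fields)
}
```

For example, a data schema of (id, name) and a partition schema of (year) give a read schema of (id, name, year).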
isSplitable¶
isSplitable(
path: Path): Boolean
isSplitable is disabled (false) by default, so files are not split unless an implementation overrides it.
FileScan | isSplitable |
---|---|
AvroScan | true |
ParquetScan | isSplitable |
Used when:
- FileScan is requested to getFileUnSplittableReason and partitions
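The default-and-override behaviour can be modelled with a minimal sketch. ScanSketch, DefaultScan and AlwaysSplittableScan are hypothetical names, and a String path stands in for a Hadoop Path.

```scala
object IsSplitableSketch {
  // Hypothetical model of the contract: files are not splittable by default.
  trait ScanSketch {
    def isSplitable(path: String): Boolean = false
  }

  // Keeps the default (no file is ever split).
  object DefaultScan extends ScanSketch

  // Overrides the default so that every file is considered splittable.
  object AlwaysSplittableScan extends ScanSketch {
    override def isSplitable(path: String): Boolean = true
  }
}
```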
SupportsMetadata¶
FileScan is a SupportsMetadata.
Metadata¶
getMetaData(): Map[String, String]
getMetaData is part of the SupportsMetadata abstraction.
getMetaData returns the following metadata:
Name | Description |
---|---|
Format | The lower-case name of this FileScan (with Scan removed) |
ReadSchema | catalogString of the Read Data Schema |
PartitionFilters | Partition Filters |
DataFilters | Data Filters |
Location | Name of the PartitioningAwareFileIndex followed by the root paths (listed up to spark.sql.maxMetadataStringLength characters) |
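The metadata entries above can be sketched without Spark as follows. MetadataSketch and its parameters are hypothetical, and rendering the filters with mkString is an assumption made for this example only.

```scala
import java.util.Locale

object MetadataSketch {
  // Format: the scan's simple class name with "Scan" removed, in lower case.
  def formatName(scanSimpleName: String): String =
    scanSimpleName.replace("Scan", "").toLowerCase(Locale.ROOT)

  // Builds the metadata map; the bracketed rendering of the filters is an assumption.
  def metadata(
      scanSimpleName: String,
      readDataSchema: String,
      partitionFilters: Seq[String],
      dataFilters: Seq[String],
      location: String): Map[String, String] =
    Map(
      "Format" -> formatName(scanSimpleName),
      "ReadSchema" -> readDataSchema,
      "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
      "DataFilters" -> dataFilters.mkString("[", ", ", "]"),
      "Location" -> location)
}
```

For example, formatName("ParquetScan") gives parquet.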