FilePartition¶
FilePartition is a Partition (Apache Spark).
FilePartition is an InputPartition with a collection of file blocks (that should be read by a single task).
FilePartition is used in the following:
- FileSourceScanExec physical operator is requested to createBucketedReadRDD and createReadRDD
- FileScan is requested to plan input partitions (for BatchScanExec physical operator)
Creating Instance¶
FilePartition takes the following to be created:
- Partition Index
- PartitionedFiles
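
The two constructor arguments can be pictured with a small, simplified model. This is only a sketch: `FileBlock` and `FilePartitionModel` are hypothetical stand-ins (not Spark's own `PartitionedFile` and `FilePartition` classes), and the paths and sizes are made up for illustration.

```scala
object FilePartitionSketch {
  // Hypothetical stand-in for a PartitionedFile: a block of a single file
  final case class FileBlock(path: String, start: Long, length: Long)

  // Simplified model: a partition index and the file blocks a single task reads
  final case class FilePartitionModel(index: Int, files: Seq[FileBlock]) {
    // Number of bytes the task is expected to read for this partition
    def totalBytes: Long = files.map(_.length).sum
  }

  // Illustrative values only
  val example: FilePartitionModel = FilePartitionModel(
    index = 0,
    files = Seq(
      FileBlock("/data/part-00000.parquet", start = 0L, length = 1024L),
      FileBlock("/data/part-00001.parquet", start = 0L, length = 2048L)))
}
```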
FilePartition is created when:
- FileSourceScanExec physical operator is requested to createBucketedReadRDD
- FilePartition is requested to getFilePartitions
getFilePartitions¶
getFilePartitions(
sparkSession: SparkSession,
partitionedFiles: Seq[PartitionedFile],
maxSplitBytes: Long): Seq[FilePartition]
getFilePartitions ...FIXME
getFilePartitions is used when:
- FileSourceScanExec physical operator is requested to createReadRDD
- FileScan is requested for the partitions
preferredLocations¶
Signature
preferredLocations(): Array[String]
preferredLocations is part of the InputPartition abstraction.
preferredLocations ...FIXME
maxSplitBytes¶
maxSplitBytes(
sparkSession: SparkSession,
selectedPartitions: Seq[PartitionDirectory]): Long
maxSplitBytes can be adjusted based on the following configuration properties:
- spark.sql.files.maxPartitionBytes
- spark.sql.files.openCostInBytes
- spark.sql.files.minPartitionNum (default: Default Parallelism of Leaf Nodes)
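
As a rough illustration, the three properties above could be set when building a SparkSession. This is a sketch that assumes Spark SQL is on the classpath; the object name, the application name and the chosen values are arbitrary examples, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object FileSplitConfigExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("file-split-config")
      // Upper bound on the number of bytes packed into a single partition
      .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
      // Estimated cost (in bytes) of opening a file, added to every file's size
      .config("spark.sql.files.openCostInBytes", 2L * 1024 * 1024)
      // Suggested minimum number of split file partitions
      .config("spark.sql.files.minPartitionNum", 16L)
      .getOrCreate()

    // Read one of the values back to confirm the session picked it up
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    spark.stop()
  }
}
```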
maxSplitBytes calculates the total size of all the files (in the given PartitionDirectory instances) with spark.sql.files.openCostInBytes overhead added (to the size of every file).
PartitionDirectory is a collection of FileStatus objects (Apache Hadoop) along with partition values (if there are any).
maxSplitBytes calculates how many bytes to allow per partition (bytesPerCore), that is, the total size of all the files divided by the value of the spark.sql.files.minPartitionNum configuration property.
In the end, maxSplitBytes is spark.sql.files.maxPartitionBytes unless the maximum of spark.sql.files.openCostInBytes and bytesPerCore is even smaller.
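
The calculation described above can be sketched in plain Scala without a Spark dependency. `FilesConf` and the `fileSizes` parameter are hypothetical simplifications: they stand in for the three configuration properties and for the file lengths found in the given PartitionDirectory instances.

```scala
object MaxSplitBytesSketch {
  // Hypothetical holder for the three configuration properties
  final case class FilesConf(
      maxPartitionBytes: Long,
      openCostInBytes: Long,
      minPartitionNum: Int)

  // fileSizes: lengths (in bytes) of all the files to scan
  def maxSplitBytes(conf: FilesConf, fileSizes: Seq[Long]): Long = {
    // Total size of all the files, with the open cost added to every file
    val totalBytes = fileSizes.map(_ + conf.openCostInBytes).sum
    // How many bytes to allow per partition
    val bytesPerCore = totalBytes / conf.minPartitionNum
    // maxPartitionBytes, unless max(openCostInBytes, bytesPerCore) is smaller
    math.min(conf.maxPartitionBytes, math.max(conf.openCostInBytes, bytesPerCore))
  }
}
```

For example, with a 128 MiB maxPartitionBytes, a 4 MiB openCostInBytes and a minPartitionNum of 8, three files of 10 MiB each give a total of 42 MiB and a bytesPerCore of 5505024 bytes, which is also the result since it lies between the open cost and the maximum.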
maxSplitBytes is used when:
- FileSourceScanExec physical operator is requested to create an RDD for scanning (and creates a FileScanRDD)
- FileScan is requested for partitions