FilePartition is a Partition (Apache Spark).
FilePartition is an InputPartition with a collection of file blocks (that should be read by a single task).
FilePartition is used in the following:

- FileSourceScanExec physical operator is requested to createBucketedReadRDD and createReadRDD
- FileScan is requested to plan input partitions (for BatchScanExec physical operator)
FilePartition takes the following to be created:

- Partition Index
- PartitionedFiles (the file blocks of the partition)
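As a sketch (based on the two properties above), FilePartition can be thought of as a simple case class that pairs a partition index with the file blocks a single task reads:

```scala
// Sketch: a partition index plus the file blocks one task should read.
// Partition, InputPartition and PartitionedFile are Spark types.
case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition
  with InputPartition
```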
FilePartition is created when:

- FileSourceScanExec physical operator is requested to createBucketedReadRDD
- FilePartition is requested to getFilePartitions
`getFilePartitions(sparkSession: SparkSession, partitionedFiles: Seq[PartitionedFile], maxSplitBytes: Long): Seq[FilePartition]`
getFilePartitions is used when:

- FileSourceScanExec physical operator is requested to createReadRDD
- FileScan is requested for the partitions
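The packing itself can be sketched as a greedy loop: files are appended to the current partition until adding the next file would exceed maxSplitBytes, at which point the partition is closed and a new one is started. The following is a simplified, hedged sketch (the real implementation also charges the spark.sql.files.openCostInBytes overhead per file when accumulating the size):

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch (simplified): greedy bin-packing of file blocks into FilePartitions.
def getFilePartitions(
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  // Close the current partition (if non-empty) and start a fresh one
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      partitions += FilePartition(partitions.size, currentFiles.toArray)
    }
    currentFiles.clear()
    currentSize = 0
  }

  partitionedFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    currentSize += file.length // real code also adds openCostInBytes here
    currentFiles += file
  }
  closePartition()
  partitions.toSeq
}
```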
preferredLocations is part of the InputPartition abstraction.
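As a hedged sketch, preferredLocations can be read as aggregating how many bytes of this partition's files each host stores and preferring the hosts with the most data:

```scala
import scala.collection.mutable

// Sketch: sum up per-host bytes across the partition's files and
// report the hosts holding the most data as preferred locations.
override def preferredLocations(): Array[String] = {
  val hostToNumBytes = mutable.HashMap.empty[String, Long]
  files.foreach { file =>
    file.locations.filter(_ != "localhost").foreach { host =>
      hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
    }
  }
  // Take the first 3 hosts with the most data to be retrieved
  hostToNumBytes.toSeq.sortBy(-_._2).take(3).map(_._1).toArray
}
```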
`maxSplitBytes(sparkSession: SparkSession, selectedPartitions: Seq[PartitionDirectory]): Long`
maxSplitBytes can be adjusted based on the following configuration properties:

- spark.sql.files.maxPartitionBytes (default: 128MB)
- spark.sql.files.openCostInBytes (default: 4MB)
- spark.sql.files.minPartitionNum (default: Default Parallelism of Leaf Nodes)
maxSplitBytes calculates the total size of all the files (in the given PartitionDirectories) with the spark.sql.files.openCostInBytes overhead added to the size of every file.
PartitionDirectory is a collection of FileStatuses (Apache Hadoop) along with partition values (if there are any).
maxSplitBytes then calculates how many bytes to allow per partition (bytesPerCore): the total size of all the files divided by the spark.sql.files.minPartitionNum configuration property.
In the end, maxSplitBytes is spark.sql.files.maxPartitionBytes, unless the maximum of spark.sql.files.openCostInBytes and bytesPerCore is smaller.
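Put together, the calculation described above can be sketched as follows (variable names are illustrative, not the actual ones):

```scala
// Sketch of the maxSplitBytes calculation described above.
// fileSizes stands for the sizes of all files in the selected
// PartitionDirectories; the config values are read from SQLConf.
def maxSplitBytesSketch(
    fileSizes: Seq[Long],
    maxPartitionBytes: Long, // spark.sql.files.maxPartitionBytes
    openCostInBytes: Long,   // spark.sql.files.openCostInBytes
    minPartitionNum: Long    // spark.sql.files.minPartitionNum
): Long = {
  // Total size with the per-file open-cost overhead added
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / minPartitionNum
  // maxPartitionBytes, unless max(openCostInBytes, bytesPerCore) is smaller
  Math.min(maxPartitionBytes, Math.max(openCostInBytes, bytesPerCore))
}
```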
maxSplitBytes is used when:

- FileSourceScanExec physical operator is requested to create an RDD for scanning (and creates a FileScanRDD)
- FileScan is requested for partitions