FilePartition¶

FilePartition is a Partition (Apache Spark).

FilePartition is an InputPartition with a collection of file blocks (that should be read by a single task).

FilePartition is used in the following:

FileSourceScanExec physical operator is requested to createBucketedReadRDD and createReadRDD
FileScan is requested to plan input partitions (for BatchScanExec physical operator)

Creating Instance¶

FilePartition takes the following to be created:

Partition Index
PartitionedFiles

FilePartition is created when:

FileSourceScanExec physical operator is requested to createBucketedReadRDD
FilePartition is requested to getFilePartitions

getFilePartitions¶

getFilePartitions(
  sparkSession: SparkSession,
  partitionedFiles: Seq[PartitionedFile],
  maxSplitBytes: Long): Seq[FilePartition]

getFilePartitions...FIXME

getFilePartitions is used when:

FileSourceScanExec physical operator is requested to createReadRDD
FileScan is requested for the partitions
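
The packing of PartitionedFiles into FilePartitions is not described on this page yet. The following is only a minimal sketch, assuming a greedy "next fit" strategy in which a partition is closed as soon as the next file would push it past maxSplitBytes, and in which every file also counts an open-cost padding towards the partition size. SimpleFile, SimplePartition, and packFiles are hypothetical names used for illustration and are not Spark classes.

```scala
import scala.collection.mutable.ArrayBuffer

object FilePackingSketch {
  // Hypothetical, simplified stand-ins (not Spark's PartitionedFile or FilePartition)
  final case class SimpleFile(path: String, length: Long)
  final case class SimplePartition(index: Int, files: Seq[SimpleFile])

  // Assumption: files are packed in the order given, one partition at a time
  def packFiles(
      files: Seq[SimpleFile],
      maxSplitBytes: Long,
      openCostInBytes: Long): Seq[SimplePartition] = {
    val partitions = ArrayBuffer.empty[SimplePartition]
    val currentFiles = ArrayBuffer.empty[SimpleFile]
    var currentSize = 0L

    // Emit the files collected so far as one partition and start a new one
    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) {
        partitions += SimplePartition(partitions.size, currentFiles.toList)
      }
      currentFiles.clear()
      currentSize = 0L
    }

    files.foreach { file =>
      // Close the current partition if this file would not fit into it
      if (currentSize + file.length > maxSplitBytes) closePartition()
      // Every file is charged its length plus the open-cost padding
      currentSize += file.length + openCostInBytes
      currentFiles += file
    }
    closePartition() // flush the last, possibly partial, partition
    partitions.toList
  }
}
```

With this sketch, a file that is larger than maxSplitBytes on its own still ends up in a partition (alone), and the partition indices are assigned in creation order starting at 0.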

preferredLocations¶

Signature

preferredLocations(): Array[String]

preferredLocations is part of the InputPartition abstraction.

preferredLocations...FIXME

maxSplitBytes¶

maxSplitBytes(
  sparkSession: SparkSession,
  selectedPartitions: Seq[PartitionDirectory]): Long

maxSplitBytes can be adjusted based on the following configuration properties:

spark.sql.files.maxPartitionBytes
spark.sql.files.openCostInBytes
spark.sql.files.minPartitionNum

maxSplitBytes calculates the total size of all the files (in the given PartitionDirectory instances) with the spark.sql.files.openCostInBytes overhead added to the size of every file.

PartitionDirectory

PartitionDirectory is a collection of FileStatuses (Apache Hadoop) along with partition values (if there are any).

maxSplitBytes calculates how many bytes to allow per partition (bytesPerCore) as the total size of all the files (including the per-file overhead) divided by the value of the spark.sql.files.minPartitionNum configuration property.

In the end, maxSplitBytes is the smaller of spark.sql.files.maxPartitionBytes and the maximum of spark.sql.files.openCostInBytes and bytesPerCore.
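
The calculation above can be sketched as a pure function. This is an illustration only: it takes plain file sizes and configuration values as parameters (an assumption made to keep the example self-contained) instead of a SparkSession and PartitionDirectory instances, and MaxSplitBytesSketch is a hypothetical name.

```scala
object MaxSplitBytesSketch {
  def maxSplitBytes(
      fileSizes: Seq[Long],       // sizes of all the files in the selected partitions
      maxPartitionBytes: Long,    // value of spark.sql.files.maxPartitionBytes
      openCostInBytes: Long,      // value of spark.sql.files.openCostInBytes
      minPartitionNum: Int): Long = { // value of spark.sql.files.minPartitionNum
    // Total size with the open-cost overhead added to every file
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    // How many bytes to allow per partition
    val bytesPerCore = totalBytes / minPartitionNum
    // Capped by maxPartitionBytes, but never below the open cost
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

For example, ten 10 MB files with a 4 MB open cost, a 128 MB maximum, and minPartitionNum of 8 give a total of 140 MB and a bytesPerCore of 17.5 MB, which becomes the result since it lies between the open cost and the maximum. Very small inputs fall back to the open cost, and very large inputs are capped at the maximum.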

maxSplitBytes is used when:

FileSourceScanExec physical operator is requested to create an RDD for scanning (and creates a FileScanRDD)
FileScan is requested for partitions