# PartitionedFileUtil
When requested for the split files of a file (an Apache Hadoop `FileStatus`), `PartitionedFileUtil` uses the `isSplitable` property of a `FileFormat` to create one or more `PartitionedFile`s.

Only when splitable is a file divided into as many `PartitionedFile`s as there are parts of `maxSplitBytes` size; otherwise the whole file becomes a single `PartitionedFile`.
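As a quick, made-up illustration (the sizes below are assumptions, not from the source), the split count for a splitable file is the file length divided by `maxSplitBytes`, rounded up:

```scala
// Illustrative arithmetic only (hypothetical sizes, not Spark source code):
// a splitable file yields ceil(fileLength / maxSplitBytes) PartitionedFiles,
// while a non-splitable file always yields exactly one.
object SplitCountSketch extends App {
  val fileLength    = 350L * 1024 * 1024 // hypothetical 350 MB file
  val maxSplitBytes = 128L * 1024 * 1024 // hypothetical 128 MB split size

  val numSplits = (fileLength + maxSplitBytes - 1) / maxSplitBytes
  println(s"splitable:     $numSplits PartitionedFiles") // 3
  println(s"non-splitable: 1 PartitionedFile")
}
```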
## Split Files
```scala
splitFiles(
  sparkSession: SparkSession,
  file: FileStatus,
  filePath: Path,
  isSplitable: Boolean,
  maxSplitBytes: Long,
  partitionValues: InternalRow): Seq[PartitionedFile]
```
`splitFiles` branches off based on the given `isSplitable` flag.

If splitable, `splitFiles` uses the given `maxSplitBytes` to split the given `file` into `PartitionedFile`s, one for every part of at most `maxSplitBytes` bytes.

Otherwise, `splitFiles` creates a single `PartitionedFile` for the whole `file` (with the given `filePath` and `partitionValues`).
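Below is a minimal sketch of the splitting idea, not `PartitionedFileUtil`'s actual source: it walks a hypothetical file length in steps of `maxSplitBytes` and produces `(offset, length)` ranges like the ones a `PartitionedFile` describes (the last part carries whatever remains).

```scala
// A minimal sketch (not Spark's implementation): enumerate the (offset, length)
// ranges a splitable file would be cut into, stepping by maxSplitBytes.
object SplitRangesSketch extends App {
  val fileLength    = 350L * 1024 * 1024 // hypothetical 350 MB file
  val maxSplitBytes = 128L * 1024 * 1024 // hypothetical 128 MB split size

  val ranges: Seq[(Long, Long)] =
    (0L until fileLength by maxSplitBytes).map { offset =>
      (offset, math.min(maxSplitBytes, fileLength - offset))
    }

  ranges.foreach { case (offset, length) =>
    println(s"part: start=$offset length=$length")
  }
  // Three parts: two full 128 MB parts and a final 94 MB remainder.
}
```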
`splitFiles` is used when:

* `FileSourceScanExec` is requested to `createReadRDD`
* `FileScan` is requested for the partitions
## getPartitionedFile
```scala
getPartitionedFile(
  file: FileStatus,
  filePath: Path,
  partitionValues: InternalRow): PartitionedFile
```
`getPartitionedFile` finds the `BlockLocation`s of the given `FileStatus` (Apache Hadoop).

`getPartitionedFile` then finds the block hosts for those `BlockLocation`s.

In the end, `getPartitionedFile` creates a `PartitionedFile` with the following:
| Argument | Value |
|---|---|
| `partitionValues` | The given `partitionValues` |
| `filePath` | The URI of the given `filePath` |
| `start` | `0` |
| `length` | The length of the file |
| `locations` | Block hosts |
| `modificationTime` | The modification time of the file |
| `fileSize` | The size of the file |
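The following sketch is illustrative only (the file path and object name are assumptions): it shows the kind of Hadoop `FileSystem` calls involved in looking up a file's `BlockLocation`s and their hosts, which is the information recorded as the `locations` of the resulting `PartitionedFile`.

```scala
// Illustrative only (hypothetical path; not PartitionedFileUtil's source):
// look up the block locations of a file and collect the hosts they live on.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockHostsSketch extends App {
  val filePath = new Path("/tmp/some-file.parquet") // hypothetical file
  val fs: FileSystem = filePath.getFileSystem(new Configuration())
  val status = fs.getFileStatus(filePath)

  // Block locations covering the whole file (from offset 0 to its length)
  val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
  val blockHosts = blockLocations.flatMap(_.getHosts).distinct

  println(s"length=${status.getLen} modificationTime=${status.getModificationTime}")
  println(s"block hosts: ${blockHosts.mkString(", ")}")
}
```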
`getPartitionedFile` is used when:

* `FileSourceScanExec` is requested to create a FileScanRDD with Bucketing Support
* `PartitionedFileUtil` is requested to `splitFiles`