PartitionedFile¶
PartitionedFile is a part (block) of a file (similar to a Parquet block or an HDFS split).
PartitionedFile represents a chunk of a file to be read in a partition, along with the partition column values to be appended to each row.
Creating Instance¶
PartitionedFile takes the following to be created:
- Partition Column Values
- Path of the file to read
- Beginning offset (in bytes)
- Length of this file part (number of bytes to read)
- Block Hosts
- Modification time
- File size
PartitionedFile is created when:
PartitionedFileUtil is requested to split files (splitFiles) and for a single file (getPartitionedFile)
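As a sketch of how the start and length arguments can carve one file into several PartitionedFiles (the 96 MB file, its path and the 32 MB split size below are made-up values, not anything prescribed by Spark):

```scala
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sizes (in bytes): a 96 MB file cut into 32 MB parts
val fileSize  = 96L * 1024 * 1024
val splitSize = 32L * 1024 * 1024

val parts = (0L until fileSize by splitSize).map { offset =>
  PartitionedFile(
    InternalRow.empty,                       // no partition column values
    "file.parquet",                          // hypothetical path
    offset,                                  // beginning offset (in bytes)
    math.min(splitSize, fileSize - offset))  // length of this part
}
// Three parts, starting at offsets 0, 32 MB and 64 MB
```

Each part covers a disjoint byte range, so the three PartitionedFiles together span the whole file.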
Partition Column Values¶
partitionValues: InternalRow
PartitionedFile is given an InternalRow with the partition column values to be appended to each row.
The partition column values are the values of the partition columns, which are encoded in the directory structure of a partitioned dataset rather than in the partitioned files themselves.
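For example, in a dataset partitioned by a hypothetical country column, the partition value of a given PartitionedFile is derived from the directory the file sits in, not from the file's contents (made-up paths):

```text
dataset/country=US/part-00000.parquet   <- rows read from this file get partition value "US"
dataset/country=PL/part-00000.parquet   <- rows read from this file get partition value "PL"
```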
Block Hosts¶
locations: Array[String]
PartitionedFile is given a collection of nodes (host names) with data blocks.
Default: (empty)
String Representation¶
toString: String
toString is part of the Object (Java) abstraction.
toString is the following text:
```text
path: [filePath], range: [start]-[start+length], partition values: [partitionValues]
```
Demo¶
```scala
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))
println(partFile)
```

```text
path: fakePath0, range: 0-10, partition values: [empty row]
```