PartitionedFile¶
PartitionedFile is a part (block) of a file (similar to a Parquet block or an HDFS split).
PartitionedFile represents a chunk of a file to be read in a partition, along with the partition column values to be appended to each row.
Creating Instance¶
PartitionedFile takes the following to be created:
- Partition Column Values
- Path of the file to read
- Beginning offset (in bytes)
- Length of this file part (number of bytes to read)
- Block Hosts
- Modification time
- File size
PartitionedFile is created when:
PartitionedFileUtil is requested to split files (splitFiles) and for a single file (getPartitionedFile)
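As a sketch of how the start and length arguments can carve one file into several PartitionedFiles (the 96 MB file, its path and the 32 MB split size below are made-up values, not anything prescribed by Spark):

```scala
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sizes (in bytes): a 96 MB file cut into 32 MB parts
val fileSize  = 96L * 1024 * 1024
val splitSize = 32L * 1024 * 1024

val parts = (0L until fileSize by splitSize).map { offset =>
  PartitionedFile(
    InternalRow.empty,                       // no partition column values
    "file.parquet",                          // hypothetical path
    offset,                                  // beginning offset (in bytes)
    math.min(splitSize, fileSize - offset))  // length of this part
}
// Three parts, starting at offsets 0, 32 MB and 64 MB
```

Each part covers a disjoint byte range, so the three PartitionedFiles together span the whole file.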
Partition Column Values¶
partitionValues: InternalRow
PartitionedFile is given an InternalRow with the partition column values to be appended to each row.
The partition column values are the values of the partition columns, which are encoded in the directory structure of a partitioned dataset rather than in the partitioned files themselves.
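For example, in a dataset partitioned by a hypothetical country column, the partition value of a given PartitionedFile is derived from the directory the file sits in, not from the file's contents (made-up paths):

```text
dataset/country=US/part-00000.parquet   <- rows read from this file get partition value "US"
dataset/country=PL/part-00000.parquet   <- rows read from this file get partition value "PL"
```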
Block Hosts¶
locations: Array[String]
PartitionedFile is given a collection of nodes (host names) with data blocks.
Default: (empty)
String Representation¶
toString: String
toString is part of the Object (Java) abstraction.
toString is the following text:
```text
path: [filePath], range: [start]-[start+length], partition values: [partitionValues]
```
Demo¶
```scala
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow

val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))
println(partFile)
```

```text
path: fakePath0, range: 0-10, partition values: [empty row]
```