ReliableCheckpointRDD

ReliableCheckpointRDD is an CheckpointRDD…​FIXME

Creating Instance

ReliableCheckpointRDD takes the following to be created:

ReliableCheckpointRDD is created when:

Checkpointed Partitioner File

ReliableCheckpointRDD uses _partitioner as the name of the file in the checkpoint directory with the Partitioner serialized to.

Partitioner

ReliableCheckpointRDD can be given a Partitioner to be created.

When requested for the Partitioner (as an RDD), ReliableCheckpointRDD returns the one it was created with or reads the partitioner from the given RDD checkpoint directory, if exists.

Writing RDD to Checkpoint Directory

writeRDDToCheckpointDirectory[T: ClassTag](
  originalRDD: RDD[T],
  checkpointDir: String,
  blockSize: Int = -1): ReliableCheckpointRDD[T]

writeRDDToCheckpointDirectory…​FIXME

writeRDDToCheckpointDirectory is used when ReliableRDDCheckpointData is requested to doCheckpoint.

Writing Partitioner to Checkpoint Directory

writePartitionerToCheckpointDir(
  sc: SparkContext,
  partitioner: Partitioner,
  checkpointDirPath: Path): Unit

writePartitionerToCheckpointDir creates the partitioner file with the buffer size based on spark.buffer.size configuration property.

writePartitionerToCheckpointDir requests the default Serializer for a new SerializerInstance.

writePartitionerToCheckpointDir requests the SerializerInstance to serialize the output stream and writes the given Partitioner.

In the end, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

Written partitioner to [partitionerFilePath]

In case of any non-fatal exception, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

Error writing partitioner [partitioner] to [checkpointDirPath]

writePartitionerToCheckpointDir is used when ReliableCheckpointRDD is requested to write the RDD to the checkpoint directory.

Reading Partitioner from Checkpointed Directory

readCheckpointedPartitionerFile(
  sc: SparkContext,
  checkpointDirPath: String): Option[Partitioner]

readCheckpointedPartitionerFile opens the partitioner file with the buffer size based on spark.buffer.size configuration property.

readCheckpointedPartitionerFile requests the default Serializer for a new SerializerInstance.

readCheckpointedPartitionerFile requests the SerializerInstance to deserialize the input stream and read the Partitioner from the partitioner file.

readCheckpointedPartitionerFile prints out the following DEBUG message to the logs and returns the partitioner.

Read partitioner from [partitionerFilePath]

In case of FileNotFoundException or any non-fatal exceptions, readCheckpointedPartitionerFile prints out a corresponding message to the logs and returns None.

readCheckpointedPartitionerFile is used when ReliableCheckpointRDD is requested for the Partitioner.

Logging

Enable ALL logging level for org.apache.spark.rdd.ReliableCheckpointRDD$ logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.rdd.ReliableCheckpointRDD$=ALL

Refer to Logging.