Skip to content

== HadoopRDD[HadoopRDD] is an RDD that provides core functionality for reading data stored in HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI using the older MapReduce API ([org.apache.hadoop.mapred]).

HadoopRDD is created as a result of calling the following methods in[]:

  • hadoopFile
  • textFile (the most often used in examples!)
  • sequenceFile

Partitions are of type HadoopPartition.

When an HadoopRDD is computed, i.e. an action is called, you should see the INFO message Input split: in the logs.

scala> sc.textFile("").count
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/

The following properties are set upon partition execution:

  • - task id of this task's attempt
  • - task attempt's id
  • as true
  • mapred.task.partition - split id

Spark settings for HadoopRDD:

  • spark.hadoop.cloneConf (default: false) - shouldCloneJobConf - should a Hadoop job configuration JobConf object be cloned before spawning a Hadoop job. Refer to[[SPARK-2546] Configuration object thread safety issue]. When true, you should see a DEBUG message Cloning Hadoop Configuration.

You can register callbacks on[TaskContext].

HadoopRDDs are not checkpointed. They do nothing when checkpoint() is called.



  • What are InputMetrics?
  • What is JobConf?
  • What are the InputSplits: FileSplit and CombineFileSplit? * What are InputFormat and Configurable subtypes?
  • What's InputFormat's RecordReader? It creates a key and a value. What are they?
  • What's Hadoop Split? input splits for Hadoop reads? See InputFormat.getSplits

=== [[getPreferredLocations]] getPreferredLocations Method


=== [[getPartitions]] getPartitions Method

The number of partition for HadoopRDD, i.e. the return value of getPartitions, is calculated using InputFormat.getSplits(jobConf, minPartitions) where minPartitions is only a hint of how many partitions one may want at minimum. As a hint it does not mean the number of partitions will be exactly the number given.

For SparkContext.textFile the input format class is[org.apache.hadoop.mapred.TextInputFormat].

The[javadoc of org.apache.hadoop.mapred.FileInputFormat] says:

FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.

TIP: You may find[the sources of org.apache.hadoop.mapred.FileInputFormat.getSplits] enlightening.

Last update: 2020-10-06