ExternalSorter

ExternalSorter is a Spillable of WritablePartitionedPairCollection of K-key / C-value pairs.

When created ExternalSorter expects three different types of data defined, i.e. K, V, C, for keys, values, and combiner (partial) values, respectively.

Enable INFO or WARN logging levels for org.apache.spark.util.collection.ExternalSorter logger to see what happens in ExternalSorter.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.util.collection.ExternalSorter=INFO

Refer to Logging.

stop Method

FIXME

writePartitionedFile Method

FIXME

Creating ExternalSorter Instance

ExternalSorter takes the following:

  1. TaskContext

  2. Optional Aggregator

  3. Optional Partitioner

  4. Optional Scala’s Ordering

  5. Optional Serializer

spillMemoryIteratorToDisk Internal Method

spillMemoryIteratorToDisk(inMemoryIterator: WritablePartitionedIterator): SpilledFile
FIXME

spill Method

spill(collection: WritablePartitionedPairCollection[K, C]): Unit
spill is part of Spillable contract.
FIXME

maybeSpillCollection Internal Method

maybeSpillCollection(usingMap: Boolean): Unit
FIXME

insertAll Method

insertAll(records: Iterator[Product2[K, V]]): Unit
FIXME

Settings

Table 1. Spark Properties
Spark Property Default Value Description

spark.shuffle.file.buffer

32k

Size of the in-memory buffer for each shuffle file output stream. In bytes unless the unit is specified.

These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.

Used in ExternalSorter, BypassMergeSortShuffleWriter and ExternalAppendOnlyMap (for fileBufferSize) and in ShuffleExternalSorter (for fileBufferSizeBytes).

NOTE: spark.shuffle.file.buffer was previously known as spark.shuffle.file.buffer.kb.

spark.shuffle.spill.batchSize

10000

Size of object batches when reading/writing from serializers.