BypassMergeSortShuffleWriter is a ShuffleWriter that ShuffleMapTasks use to write records into a single shuffle block data file.
BypassMergeSortShuffleWriter takes the following to be created:
BypassMergeSortShuffleWriter uses DiskBlockObjectWriter when…FIXME
void write( Iterator<Product2<K, V>> records)
write creates a new instance of the Serializer.
write requests the BlockManager for the DiskBlockManager and, for every partition, requests it for a shuffle block ID and its file. write creates a DiskBlockObjectWriter for every shuffle block (using the BlockManager) and stores the references to the DiskBlockObjectWriters in the partitionWriters internal registry.
For every record (a key-value pair), write requests the Partitioner for the partition ID for the key. The partition ID is then used as an index of the partition writer (among the DiskBlockObjectWriters) to write the current record out to a block file.
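The per-record routing can be sketched as follows. The names below (partitionFor, numPartitions, the list-backed buffers standing in for DiskBlockObjectWriters) are illustrative assumptions, not Spark's API; partitionFor mimics a hash partitioner's non-negative modulo:

```java
import java.util.*;

// Sketch of the routing step: each record's key is mapped to a partition ID,
// and that ID indexes into the array of per-partition writers.
public class PartitionRouting {
    static final int numPartitions = 4;

    // Hypothetical stand-in for Partitioner.getPartition (hash-based):
    // non-negative modulo of the key's hash code.
    static int partitionFor(Object key) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        // Stand-ins for the per-partition DiskBlockObjectWriters.
        List<List<String>> partitionBuffers = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitionBuffers.add(new ArrayList<>());

        List<Map.Entry<String, Integer>> records = List.of(
            Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));

        // For every record, ask the partitioner for the key's partition ID and
        // write the record through the writer at that index.
        for (Map.Entry<String, Integer> rec : records) {
            partitionBuffers.get(partitionFor(rec.getKey())).add(rec.toString());
        }

        // Both "a" records land in the same partition buffer.
        assert partitionBuffers.get(partitionFor("a")).size() == 2;
        System.out.println(partitionBuffers);
    }
}
```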
Once all records have been written out to their respective block files, write does the following for every DiskBlockObjectWriter:
Requests the DiskBlockObjectWriter to commit and return a corresponding FileSegment of the shuffle block
Saves the (reference to) FileSegments in the partitionWriterSegments internal registry
Requests the DiskBlockObjectWriter to close
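The three steps above can be sketched as a loop over the writers. FileSegment and DiskBlockObjectWriter are the real Spark types; the minimal stand-ins below are assumptions for illustration only:

```java
import java.util.*;

// Sketch of the commit step: every per-partition writer commits its output,
// the resulting segment (file name + length here) is saved, and the writer
// is closed.
public class CommitWriters {
    // Minimal stand-in for Spark's FileSegment.
    record Segment(String file, long length) {}

    // Minimal stand-in for a DiskBlockObjectWriter.
    static class Writer {
        final String file; final long written; boolean closed;
        Writer(String file, long written) { this.file = file; this.written = written; }
        Segment commitAndGet() { return new Segment(file, written); }
        void close() { closed = true; }
    }

    static Segment[] commitAll(Writer[] writers) {
        Segment[] segments = new Segment[writers.length];
        for (int i = 0; i < writers.length; i++) {
            segments[i] = writers[i].commitAndGet(); // commit, keep the segment
            writers[i].close();                      // then close the writer
        }
        return segments;
    }

    public static void main(String[] args) {
        Writer[] writers = { new Writer("p0", 10), new Writer("p1", 0) };
        Segment[] segments = commitAll(writers);
        assert segments[0].length() == 10 && writers[1].closed;
        System.out.println(Arrays.toString(segments));
    }
}
```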
|At this point, all the records are in shuffle block files on a local disk. The records are split across the block files by the partition their keys map to.|
write creates a temporary file (based on the name of the shuffle file) and writes all the per-partition shuffle files to it. The size of every per-partition shuffle file is saved in the partitionLengths internal registry.
|At this point, all the per-partition shuffle block files have been merged into a single map shuffle data file.|
write is part of ShuffleWriter abstraction.
write requires that there are no DiskBlockObjectWriters (in the partitionWriters internal registry) when it starts.
long writePartitionedFile( File outputFile)
writePartitionedFile creates a file output stream for the input outputFile in append mode.

writePartitionedFile starts tracking the write time.

writePartitionedFile then copies the raw bytes from each partition segment input stream to outputFile (possibly using Java New I/O, per the transferToEnabled flag set when BypassMergeSortShuffleWriter was created) and records the length of every per-partition shuffle data file (in the lengths internal array).
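Assuming each partition segment is a plain file on disk, the copy-and-record-lengths behavior can be sketched as follows; the file names and the helper name are illustrative, not Spark's code:

```java
import java.io.*;
import java.nio.file.*;
import java.util.Arrays;

// Sketch of the concatenation step: open the output file in append mode,
// copy each per-partition file into it in partition order, and record each
// file's size in a lengths array (partitionLengths in Spark itself).
public class ConcatPartitions {
    static long[] writePartitionedFile(Path[] segments, Path outputFile) throws IOException {
        long[] lengths = new long[segments.length];
        // Append mode, as described above.
        try (OutputStream out = new FileOutputStream(outputFile.toFile(), true)) {
            for (int i = 0; i < segments.length; i++) {
                lengths[i] = Files.copy(segments[i], out); // bytes copied = segment size
            }
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle");
        Path p0 = Files.write(dir.resolve("p0"), "aaaa".getBytes());
        Path p1 = Files.write(dir.resolve("p1"), "bb".getBytes());
        Path out = dir.resolve("data");
        long[] lengths = writePartitionedFile(new Path[]{p0, p1}, out);
        assert lengths[0] == 4 && lengths[1] == 2;
        assert Files.size(out) == 6; // one single map shuffle data file
        System.out.println(Arrays.toString(lengths));
    }
}
```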
writePartitionedFile is used when BypassMergeSortShuffleWriter is requested to write records.
copyStream( in: InputStream, out: OutputStream, closeStreams: Boolean = false, transferToEnabled: Boolean = false): Long
copyStream branches off depending on the types of the in and out streams, i.e. whether they are both file streams (FileInputStream and FileOutputStream) and the transferToEnabled input flag is enabled.

If they are both file streams and transferToEnabled is enabled, copyStream gets their FileChannels and transfers bytes from the input file to the output file, counting the number of bytes, possibly zero, that were actually transferred.

|copyStream uses Java's java.nio.channels.FileChannel to manage file channels.|

When the in and out streams are not both file streams or the transferToEnabled flag is disabled (default), copyStream reads data from in to write to out and counts the number of bytes written.

copyStream can optionally close the in and out streams (depending on the closeStreams input flag; disabled by default).
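The branching can be sketched as follows. This mirrors the description above but is a hedged sketch, not Spark's actual copyStream code (stream closing is left out for brevity):

```java
import java.io.*;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class CopyStreamSketch {
    static long copyStream(InputStream in, OutputStream out, boolean transferToEnabled)
            throws IOException {
        if (transferToEnabled && in instanceof FileInputStream fin
                && out instanceof FileOutputStream fout) {
            // NIO path: transfer bytes between FileChannels, counting them.
            FileChannel inCh = fin.getChannel();
            FileChannel outCh = fout.getChannel();
            long size = inCh.size();
            long count = 0;
            while (count < size) {
                count += inCh.transferTo(count, size - count, outCh);
            }
            return count;
        } else {
            // Default path: buffered read/write loop, counting bytes written.
            byte[] buf = new byte[8192];
            long count = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                count += n;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", null);
        Files.write(src, "hello".getBytes());
        Path dst = Files.createTempFile("dst", null);
        try (FileInputStream in = new FileInputStream(src.toFile());
             FileOutputStream out = new FileOutputStream(dst.toFile())) {
            long copied = copyStream(in, out, true); // takes the NIO path
            assert copied == 5;
        }
        assert new String(Files.readAllBytes(dst)).equals("hello");
        System.out.println("copied ok");
    }
}
```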
Enable ALL logging level for the org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter logger to see what happens inside.

Add the following line to your logging configuration:

log4j.logger.org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter=ALL

Refer to Logging.
Initialized every time BypassMergeSortShuffleWriter writes records.
Temporary array of partition lengths after records are written to the shuffle system.

Initialized every time BypassMergeSortShuffleWriter writes records (before passing it in to IndexShuffleBlockResolver). After IndexShuffleBlockResolver finishes, it is used to initialize the mapStatus internal property.