CompressionCodec

CompressionCodec is an abstraction of IO compression codecs.

The default compression codec is configured using spark.io.compression.codec configuration property.

Available CompressionCodecs

CompressionCodec Alias Description

LZ4CompressionCodec

lz4

LZFCompressionCodec

lzf

LZF compression

SnappyCompressionCodec

snappy

ZStdCompressionCodec

zstd

Compressing Output Stream

compressedOutputStream(
  s: OutputStream): OutputStream

compressedOutputStream is used when:

Compressing Input Stream

compressedInputStream(
  s: InputStream): InputStream

compressedInputStream is used when:

Creating CompressionCodec

createCodec(
  conf: SparkConf): CompressionCodec
createCodec(
  conf: SparkConf,
  codecName: String): CompressionCodec

createCodec creates an instance of the compression codec by the given name (using a constructor that accepts a SparkConf).

createCodec uses getCodecName utility to find the codec name unless specified explicitly.

createCodec finds the class name in the shortCompressionCodecNames internal lookup table or assumes that the codec name is already a fully-qualified class name.

createCodec throws an IllegalArgumentException exception if a compression codec could not be found:

Codec [codecName] is not available. Consider setting spark.io.compression.codec=snappy

createCodec is used when:

  • TorrentBroadcast is requested to setConf

  • ReliableCheckpointRDD is requested to writePartitionToCheckpointFile and readCheckpointFile

  • EventLoggingListener is created and requested to openEventLog

  • GenericAvroSerializer is created

  • SerializerManager is created

  • UnsafeShuffleWriter is requested to merge spills

Finding Compression Codec Name

getCodecName(
  conf: SparkConf): String

getCodecName takes the name of a compression codec based on spark.io.compression.codec configuration property (using the SparkConf) if available or defaults to lz4.

getCodecName is used when:

supportsConcatenationOfSerializedStreams Method

supportsConcatenationOfSerializedStreams(
  codec: CompressionCodec): Boolean

supportsConcatenationOfSerializedStreams returns true when the given CompressionCodec is one of the build-in ones.

supportsConcatenationOfSerializedStreams is used when UnsafeShuffleWriter is requested to merge spills.