TextBasedFileFormat¶
TextBasedFileFormat
is an extension of the FileFormat contract for <
[[implementations]] .TextBasedFileFormats [cols="1,2",options="header",width="100%"] |=== | TextBasedFileFormat | Description
| CSVFileFormat | [[CSVFileFormat]]
| JsonFileFormat | [[JsonFileFormat]]
| LibSVMFileFormat
| [[LibSVMFileFormat]] Used in Spark MLlib
| TextFileFormat | [[TextFileFormat]] |===
[[codecFactory]] TextBasedFileFormat
uses Hadoop's https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html[CompressionCodecFactory] to <
=== [[isSplitable]] isSplitable
Method
[source, scala]¶
isSplitable( sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean
isSplitable
is part of the FileFormat abstraction.
isSplitable
requests the <path
) based on its filename suffix.
isSplitable
returns true
when the compression codec is not used (i.e. null
) or is a Hadoop https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/SplittableCompressionCodec.html[SplittableCompressionCodec] (e.g. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/BZip2Codec.html[BZip2Codec]).
If the <isSplitable
creates a https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html[CompressionCodecFactory] (with a Hadoop Configuration
by requesting the SessionState
for a SessionState.md#newHadoopConfWithOptions[new Hadoop Configuration with extra options]).
NOTE: isSplitable
uses the input sparkSession
to access SparkSession.md#sessionState[SessionState].
[NOTE]¶
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/SplittableCompressionCodec.html[SplittableCompressionCodec] interface is for compression codecs that are capable to compress and de-compress a stream starting at any arbitrary position.
Such codecs are highly valuable, especially in the context of Hadoop, because an input compressed file can be split and hence can be worked on by multiple machines in parallel.