CSVFileFormat

[[shortName]] CSVFileFormat is a TextBasedFileFormat for the csv format (i.e. it registers itself under the csv short name to handle files in CSV format and converts them to Spark SQL rows).

[source, scala]
----
spark.read.format("csv").load("csv-datasets")

// or the same as above using a shortcut
spark.read.csv("csv-datasets")
----

CSVFileFormat uses <<options, options>> that are in turn used to configure the underlying CSV parser from the https://github.com/uniVocity/univocity-parsers[uniVocity-parsers] project.
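As an illustration of passing options on the read path, here is a minimal sketch that writes a tiny semicolon-separated file and loads it back with the header, sep and inferSchema options. The object name `CsvReadOptionsDemo`, the method `readSample`, the sample data and the local master are assumptions made for the example only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvReadOptionsDemo {
  // Writes a tiny semicolon-separated file (made-up sample data) and reads it back.
  def readSample(spark: SparkSession): DataFrame = {
    val dir = Files.createTempDirectory("csv-datasets")
    Files.write(
      dir.resolve("people.csv"),
      "id;name\n1;Alice\n2;Bob\n".getBytes(StandardCharsets.UTF_8))

    spark.read
      .format("csv")
      .option("header", "true")      // treat the first line as column names
      .option("sep", ";")            // field separator used by the sample file
      .option("inferSchema", "true") // let Spark guess column types from the data
      .load(dir.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: a local session is enough for the demo
      .appName("csv-read-options")
      .getOrCreate()
    val people = readSample(spark)
    people.printSchema()
    people.show()
    spark.stop()
  }
}
```

Every option is given as a string key and value, which is also how the options reach CSVFileFormat (as a `Map[String, String]`).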

[[options]]
[[CSVOptions]]
.CSVFileFormat's Options
[cols="1,1,3",options="header",width="100%"]
|===
| Option
| Default Value
| Description

| [[charset]] charset
| UTF-8
| Alias of <<encoding, encoding>>

| [[charToEscapeQuoteEscaping]] charToEscapeQuoteEscaping
| \\
| One character to...FIXME

| [[codec]] codec
|
a| Compression codec that can be either one of the known aliases or a fully-qualified class name.

Alias of <<compression, compression>>

| [[columnNameOfCorruptRecord]] columnNameOfCorruptRecord
|
|

| [[comment]] comment
| \u0000
|

| [[compression]] compression
|
a| Compression codec that can be either one of the known aliases or a fully-qualified class name.

Alias of <<codec, codec>>

| [[dateFormat]] dateFormat
| yyyy-MM-dd
| Uses en_US locale

| [[delimiter]] delimiter
| , (comma)
| Alias of <<sep, sep>>

| [[encoding]] encoding
| UTF-8
| Alias of <<charset, charset>>

| [[escape]] escape
| \\
|

| [[escapeQuotes]] escapeQuotes
| true
|

| [[header_]] header
|
|

| [[ignoreLeadingWhiteSpace]] ignoreLeadingWhiteSpace
a|
* false (for reading)
* true (for writing)
|

| [[ignoreTrailingWhiteSpace]] ignoreTrailingWhiteSpace
a|
* false (for reading)
* true (for writing)
|

| [[inferSchema]] inferSchema
|
|

| [[maxCharsPerColumn]] maxCharsPerColumn
| -1
|

| [[maxColumns]] maxColumns
| 20480
|

| [[mode]] mode
| PERMISSIVE
a| Possible values:

* DROPMALFORMED
* PERMISSIVE (default)
* FAILFAST

| [[multiLine]] multiLine
| false
|

| [[nanValue]] nanValue
| NaN
|

| [[negativeInf]] negativeInf
| -Inf
|

| [[nullValue]] nullValue
| (empty string)
|

| [[positiveInf]] positiveInf
| Inf
|

| [[sep]] sep
| , (comma)
| Alias of <<delimiter, delimiter>>

| [[timestampFormat]] timestampFormat
| yyyy-MM-dd'T'HH:mm:ss.SSSXXX
| Uses <<timeZone, timeZone>> and en_US locale

| [[timeZone]] timeZone
| spark.sql.session.timeZone
|

| [[quote]] quote
| \"
|

| [[quoteAll]] quoteAll
| false
|
|===
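Several options in the table come in alias pairs (sep and delimiter, encoding and charset, compression and codec). The following plain-Scala sketch shows one way such pairs could be resolved from an options map. It assumes that the first name of each pair wins when both are given; `CsvAliasSketch`, `ResolvedOptions` and `resolve` are hypothetical names for illustration and are not part of Spark's API.

```scala
object CsvAliasSketch {
  // Hypothetical holder for the three aliased settings from the table above.
  final case class ResolvedOptions(
      sep: String,
      encoding: String,
      compression: Option[String])

  // Assumption: sep, encoding and compression take precedence over their aliases.
  def resolve(options: Map[String, String]): ResolvedOptions =
    ResolvedOptions(
      sep = options.getOrElse("sep", options.getOrElse("delimiter", ",")),
      encoding = options.getOrElse("encoding", options.getOrElse("charset", "UTF-8")),
      compression = options.get("compression").orElse(options.get("codec"))
    )

  def main(args: Array[String]): Unit = {
    // Only the aliases are given, so they are used.
    println(resolve(Map("delimiter" -> ";", "codec" -> "gzip")))
    // Both names of a pair are given, so the first name is used.
    println(resolve(Map("sep" -> "|", "delimiter" -> ";")))
  }
}
```

The nested `getOrElse` calls keep the defaults from the table in one place, so a missing option and a missing alias fall back to the same value.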

=== [[prepareWrite]] Preparing Write Job -- prepareWrite Method

[source, scala]
----
prepareWrite(
  sparkSession: SparkSession,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory
----


prepareWrite...FIXME

prepareWrite is part of the FileFormat abstraction.
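From the user's side, the write path is reached through `DataFrameWriter`. The sketch below writes a two-row DataFrame as CSV with the header and quoteAll options; it is a usage example only and does not show what prepareWrite does internally. The object name `CsvWriteDemo`, the method `writeSample`, the sample data and the local master are assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvWriteDemo {
  // Writes a two-row DataFrame as CSV and returns the output directory.
  def writeSample(spark: SparkSession): String = {
    import spark.implicits._

    // The output directory must not exist yet (default save mode),
    // so use a fresh child of a temporary directory.
    val out = Files.createTempDirectory("csv-out").resolve("people").toString

    Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
      .write
      .option("header", "true")   // emit column names as the first line
      .option("quoteAll", "true") // quote every value, not only those that need it
      .csv(out)
    out
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: a local session is enough for the demo
      .appName("csv-write-demo")
      .getOrCreate()
    println(s"CSV files written to ${writeSample(spark)}")
    spark.stop()
  }
}
```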

=== [[buildReader]] Building Partitioned Data Reader -- buildReader Method

[source, scala]
----
buildReader(
  sparkSession: SparkSession,
  dataSchema: StructType,
  partitionSchema: StructType,
  requiredSchema: StructType,
  filters: Seq[Filter],
  options: Map[String, String],
  hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow]
----


buildReader...FIXME

buildReader is part of the FileFormat abstraction.
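The signature above returns a function from a file to an iterator of rows rather than the rows themselves. The shape of that pattern can be sketched in plain Scala as follows. This is a toy model under stated assumptions, not Spark's implementation: `FakeFile` and `FakeRow` are hypothetical stand-ins for PartitionedFile and InternalRow, column indices stand in for requiredSchema, and lines are split naively without any quote handling.

```scala
object BuildReaderSketch {
  // Hypothetical stand-ins for PartitionedFile and InternalRow.
  final case class FakeFile(path: String, lines: Seq[String])
  type FakeRow = Seq[String]

  // Resolves the options once, then returns a function applied to each file.
  def buildReader(
      requiredColumns: Seq[Int],
      options: Map[String, String]): FakeFile => Iterator[FakeRow] = {
    val sep = options.getOrElse("sep", ",")
    val hasHeader = options.getOrElse("header", "false").toBoolean

    (file: FakeFile) => {
      // Skip the header line when the header option is enabled.
      val data = if (hasHeader) file.lines.drop(1) else file.lines
      data.iterator.map { line =>
        // Naive split: real CSV parsing also has to handle quotes and escapes.
        val fields = line.split(java.util.regex.Pattern.quote(sep), -1)
        // Keep only the requested columns, in the requested order.
        requiredColumns.map(i => fields(i))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val read = buildReader(Seq(1), Map("header" -> "true", "sep" -> ";"))
    val file = FakeFile("people.csv", Seq("id;name", "1;Alice", "2;Bob"))
    read(file).foreach(println)
  }
}
```

Returning a function lets the option lookup happen once while the per-file work is deferred until each file is actually read.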