HadoopFileLinesReader

HadoopFileLinesReader is a Scala Iterator of Apache Hadoop's org.apache.hadoop.io.Text.

HadoopFileLinesReader is created to access datasets in the following data sources:

  • SimpleTextSource
  • LibSVMFileFormat
  • TextInputCSVDataSource
  • TextInputJsonDataSource
  • TextFileFormat

HadoopFileLinesReader uses the internal iterator that handles accessing files using Hadoop's FileSystem API.

Creating Instance

HadoopFileLinesReader takes the following when created:

iterator Internal Property

[source, scala]
----
iterator: RecordReaderIterator[Text]
----

When created, HadoopFileLinesReader creates an internal iterator that uses Hadoop's https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html[org.apache.hadoop.mapreduce.lib.input.FileSplit] with Hadoop's https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/Path.html[org.apache.hadoop.fs.Path] for the file to read.

iterator creates Hadoop's TaskAttemptID, TaskAttemptContextImpl and LineRecordReader.

iterator initializes LineRecordReader and passes it on to a RecordReaderIterator.
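The steps above can be sketched with Hadoop's public API alone. This is a minimal illustration, not Spark's actual code: the object name `LineReaderSketch` and the method `readLines` are hypothetical, the split is assumed to cover a whole local file, and the lines are collected eagerly with a plain loop instead of being handed to Spark's `RecordReaderIterator`. It assumes the Hadoop client libraries are on the classpath.

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, for illustration only.
object LineReaderSketch {

  // Reads every line of a local file through Hadoop's LineRecordReader.
  def readLines(localPath: String, conf: Configuration = new Configuration()): List[String] = {
    val file = new File(localPath)

    // A split that covers the whole file, with no locality hints.
    val split = new FileSplit(new Path(file.toURI), 0L, file.length(), Array.empty[String])

    // LineRecordReader needs a task attempt context, so fabricate one.
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val context = new TaskAttemptContextImpl(conf, attemptId)

    val reader = new LineRecordReader()
    reader.initialize(split, context)

    val lines = ArrayBuffer.empty[String]
    try {
      while (reader.nextKeyValue()) {
        // The reader reuses one Text instance, so copy the value out.
        lines += reader.getCurrentValue.toString
      }
    } finally {
      reader.close()
    }
    lines.toList
  }
}
```

Copying each value with `toString` matters because the same `Text` object is overwritten on every call to `nextKeyValue`.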

NOTE: iterator is used for Iterator-specific methods, i.e. hasNext, next and close.
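To make the delegation concrete, here is a minimal sketch of how a pull-style record reader can be adapted to Scala's `Iterator` with a `close` method. It is loosely modeled on the idea behind a record-reader iterator and is not Spark's implementation: `SimpleRecordReader`, `ReaderIterator` and `StringLinesReader` are hypothetical names, and the trait only stands in for the two `RecordReader` methods used here.

```scala
import java.io.{BufferedReader, Closeable, StringReader}

// Hypothetical stand-in for the part of Hadoop's RecordReader contract used below.
trait SimpleRecordReader[T] extends Closeable {
  def nextKeyValue(): Boolean
  def getCurrentValue: T
}

// Adapts the pull-style reader to Iterator semantics (hasNext, next) plus close.
class ReaderIterator[T](private var reader: SimpleRecordReader[T])
    extends Iterator[T] with Closeable {

  private var havePair = false // a record has been fetched but not yet returned
  private var finished = false // the reader is exhausted or closed

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) close() // release the underlying reader as soon as it is drained
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    havePair = false
    reader.getCurrentValue
  }

  // Idempotent: closing twice, or closing before exhaustion, is safe.
  override def close(): Unit = {
    finished = true
    havePair = false
    if (reader != null) {
      try reader.close() finally reader = null
    }
  }
}

// Hypothetical reader over an in-memory string, used only to exercise the adapter.
class StringLinesReader(text: String) extends SimpleRecordReader[String] {
  private val in = new BufferedReader(new StringReader(text))
  private var current: String = _
  var closed = false

  override def nextKeyValue(): Boolean = { current = in.readLine(); current != null }
  override def getCurrentValue: String = current
  override def close(): Unit = { closed = true; in.close() }
}
```

A wrapper class in the style of HadoopFileLinesReader would then forward its own `hasNext`, `next` and `close` to such an iterator.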