BaseRelation — Collection of Tuples with Schema¶
BaseRelation is an abstraction of relations, i.e. collections of tuples (rows) with a known schema.
BaseRelation represents an external data source that datasets can be loaded from or written to.
BaseRelation is "created" when DataSource is requested to resolve a relation.
BaseRelation is then transformed into a DataFrame when SparkSession is requested to create a DataFrame.
Note
"Relation" and "table" used to be synonyms, but the Connector API in Spark 3 changed that with the Table abstraction.
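A custom BaseRelation is typically paired with a RelationProvider so a data source can be used with the DataFrame reader. A minimal sketch (the `Hello*` names and the fixed schema are illustrative, not part of Spark SQL):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation with a fixed two-column schema
class HelloRelation(override val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = false),
    StructField("value", StringType, nullable = true)))
}

// A RelationProvider creates the relation when DataSource resolves it
class HelloRelationProvider extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new HelloRelation(sqlContext)
}
```

With the provider on the classpath, `spark.read.format(classOf[HelloRelationProvider].getName).load` would resolve the relation and wrap it in a DataFrame.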
Contract¶
SQLContext¶
sqlContext: SQLContext
The SQLContext the relation is associated with
Schema¶
schema: StructType
Schema of the tuples of the relation
Size¶
sizeInBytes: Long
Estimated size of the relation (in bytes)
Default: spark.sql.defaultSizeInBytes configuration property
sizeInBytes is used when LogicalRelation is requested for statistics (and the statistics are not available in a catalog).
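A custom relation can report its estimated size to help the optimizer (e.g. when deciding whether a side of a join is small enough to broadcast). A sketch of an override inside a custom BaseRelation (the file path is illustrative):

```scala
// Inside a custom BaseRelation: estimate the relation's size from the
// length of the backing file (path is illustrative)
override def sizeInBytes: Long =
  new java.io.File("/data/hello.csv").length
```

Returning an overly small estimate risks broadcasting a large relation, so it is safer to over-estimate (which is why the default is the large spark.sql.defaultSizeInBytes).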
Needs Conversion¶
needConversion: Boolean
Controls type conversion, i.e. whether or not JVM objects inside Rows need to be converted to Catalyst types (e.g. java.lang.String to UTF8String)
Default: true
Note
It is recommended to leave needConversion enabled (as is) for custom data sources (outside Spark SQL).
Used when DataSourceStrategy execution planning strategy is executed (to convert an RDD[Row] to RDD[InternalRow]).
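For a custom data source the override can simply restate the default; a sketch (the comments describe the trade-off, the value itself is the recommended default):

```scala
// Inside a custom BaseRelation: keep the default (true) when buildScan
// produces plain JVM objects (java.lang.String, java.sql.Timestamp, ...)
// so Spark converts them to Catalyst types (e.g. UTF8String).
// Returning false is only safe when rows already hold Catalyst values,
// which is typical only for Spark's own built-in relations.
override val needConversion: Boolean = true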
Unhandled Filters¶
unhandledFilters(
filters: Array[Filter]): Array[Filter]
Filter predicates that the relation does not support (handle) natively
Default: the input filters (as it is considered safe to double-evaluate filters regardless of whether they could be supported or not)
Used when DataSourceStrategy execution planning strategy is executed (and selectFilters).
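A relation that can only push down equality predicates could report everything else as unhandled, leaving Spark to re-evaluate those filters itself. An illustrative sketch:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Inside a custom BaseRelation: claim that only EqualTo filters are
// handled natively; all other filters are returned as unhandled and
// Spark evaluates them on top of the scan.
override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot(_.isInstanceOf[EqualTo])
```

Since double evaluation is safe, under-reporting native support only costs performance, never correctness.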
Implementations¶
- ConsoleRelation (Spark Structured Streaming)
- HadoopFsRelation
- JDBCRelation
- KafkaRelation
- KafkaSourceProvider