BaseRelation — Collection of Tuples with Schema¶
BaseRelation is an abstraction of relations that are collections of tuples (rows) with a known schema. BaseRelation represents an external data source with data to load datasets from or write to.

BaseRelation is "created" when DataSource is requested to resolve a relation, and is then transformed into a DataFrame when SparkSession is requested to create a DataFrame.
Note
"Relation" and "table" used to be synonyms, but the Connector API in Spark 3 separated the two with the Table abstraction.
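The contract below can be illustrated with a simplified sketch. The traits and types here (SimpleBaseRelation, SimpleField, SimpleStruct, DemoRelation) are stand-ins invented for this example, not the real Spark SQL classes; SQLContext is elided and SimpleStruct plays the role of StructType:

```scala
// Simplified sketch of the BaseRelation contract (hypothetical types,
// not org.apache.spark.sql.sources.BaseRelation).
case class SimpleField(name: String, dataType: String)
case class SimpleStruct(fields: Seq[SimpleField])

trait SimpleBaseRelation {
  // Schema of the tuples of the relation
  def schema: SimpleStruct
  // Estimated size; a pessimistic maximum mirrors the defaultSizeInBytes fallback
  def sizeInBytes: Long = Long.MaxValue
  // Whether JVM objects in rows need conversion to Catalyst types
  def needConversion: Boolean = true
}

// A hypothetical relation over a two-column dataset
object DemoRelation extends SimpleBaseRelation {
  override def schema: SimpleStruct =
    SimpleStruct(Seq(SimpleField("id", "long"), SimpleField("name", "string")))
}
```

A custom data source would implement the real contract analogously: only schema is abstract, while sizeInBytes and needConversion come with defaults.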
Contract¶
SQLContext¶
sqlContext: SQLContext
The SQLContext the relation is associated with
Schema¶
schema: StructType
Schema of the tuples of the relation
Size¶
sizeInBytes: Long
Estimated size of the relation (in bytes)
Default: spark.sql.defaultSizeInBytes configuration property
sizeInBytes is used when LogicalRelation is requested for statistics (and they are not available in a catalog).
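A source that knows how big its data actually is should report it, so the statistics are better than the pessimistic default. A simplified sketch (hypothetical SizedRelation and FileBackedRelation types, not the Spark API) of a relation reporting the on-disk size of its backing file:

```scala
// Simplified sketch: report the real byte size of the backing file,
// falling back to a pessimistic maximum when it cannot be determined.
import java.nio.file.{Files, Path, Paths}

trait SizedRelation {
  def sizeInBytes: Long
}

class FileBackedRelation(path: Path) extends SizedRelation {
  override def sizeInBytes: Long =
    if (Files.exists(path)) Files.size(path) else Long.MaxValue
}
```

Accurate sizes matter because they feed the statistics the optimizer uses, e.g. when deciding whether a relation is small enough to broadcast in a join.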
Needs Conversion¶
needConversion: Boolean
Controls type conversion, i.e. whether JVM objects inside Rows need to be converted to Catalyst types (e.g. java.lang.String to UTF8String)
Default: true
Note
It is recommended to leave needConversion enabled (as is) for custom data sources (outside Spark SQL).
Used when the DataSourceStrategy execution planning strategy is executed (to convert an RDD[Row] to an RDD[InternalRow]).
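The kind of conversion needConversion enables can be illustrated with a hypothetical helper (ConversionDemo and toInternal are invented for this sketch; a plain byte array stands in for Catalyst's UTF8String):

```scala
// Hypothetical illustration of external-to-internal type conversion:
// java.lang.String becomes a UTF-8 encoded byte representation,
// standing in here for Catalyst's UTF8String.
import java.nio.charset.StandardCharsets

object ConversionDemo {
  def toInternal(value: Any): Any = value match {
    case s: String => s.getBytes(StandardCharsets.UTF_8) // stand-in for UTF8String
    case other     => other // types that need no conversion pass through
  }
}
```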
Unhandled Filters¶
unhandledFilters(
  filters: Array[Filter]): Array[Filter]
Filter predicates that the relation does not support (handle) natively
Default: the input filters (as it is considered safe to double-evaluate filters regardless of whether they could be handled or not)
Used when the DataSourceStrategy execution planning strategy is executed (and selectFilters).
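A sketch of the idea, assuming a relation that can push down equality predicates only (the SimpleFilter ADT and PushdownDemo below are invented stand-ins for Spark's org.apache.spark.sql.sources.Filter hierarchy):

```scala
// Simplified model of filter pushdown (hypothetical types, not Spark's).
sealed trait SimpleFilter
case class EqualTo(attribute: String, value: Any) extends SimpleFilter
case class GreaterThan(attribute: String, value: Any) extends SimpleFilter

object PushdownDemo {
  // The relation handles EqualTo natively; every other filter is returned
  // unhandled, so Spark re-evaluates it on the rows after the scan.
  def unhandledFilters(filters: Array[SimpleFilter]): Array[SimpleFilter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
```

Returning a filter as unhandled is always safe: Spark simply evaluates it again on top of the scan, which is why the default implementation returns all input filters.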
Implementations¶
- ConsoleRelation (Spark Structured Streaming)
- HadoopFsRelation
- JDBCRelation
- KafkaRelation
- KafkaSourceProvider