SparkSession¶
SparkSession is a Python class in the pyspark.sql.session module.
from pyspark.sql.session import SparkSession
SparkConversionMixin¶
SparkSession extends SparkConversionMixin (for pandas-to-Spark conversion).
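As a quick sketch of that conversion path (assuming a local PySpark installation with pandas available; the master URL and application name are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

# createDataFrame with a pandas.DataFrame goes through SparkConversionMixin
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
spark.createDataFrame(pdf).show()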
Creating Instance¶
SparkSession takes the following to be created:
- SparkContext
- Java SparkSession (Optional[JavaObject])
- Options
While being created, SparkSession gets access to _jsc and _jvm using the given SparkContext.
Note
It is expected that _jvm is defined (or an AssertionError is thrown).
Unless the given Java SparkSession is defined, SparkSession requests the _jvm for one.
In the end, SparkSession _monkey_patch_RDDs itself and installs an exception handler (install_exception_handler).
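The following is a minimal sketch of creating a SparkSession directly from a SparkContext (for demonstration only; SparkSession.Builder is the intended entry point):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(conf=SparkConf().setMaster("local[1]").setAppName("demo"))

# No Java SparkSession given, so SparkSession requests the _jvm for one
spark = SparkSession(sc)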
SparkSession is created when:
- SparkSession.Builder is requested to get or create one
- SparkSession is requested to get an active SparkSession
Java SparkContext¶
_jsc: JavaObject
_jsc is a Java SparkContext (Spark Core) that is created through Py4J.
JavaObject
JavaObject (Py4J) represents a Java object from which you can call methods or access fields.
_jsc is initialized when SparkSession is created to be the _jsc of the given SparkContext.
_jsc is used (among other internal uses) when:
- SCCallSiteSync is requested to __enter__ and __exit__
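In a PySpark shell (with the active spark session), _jsc can be inspected as follows:

>>> type(spark._jsc)
<class 'py4j.java_gateway.JavaObject'>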
py4j JVMView¶
_jvm: ClassVar[Optional[JVMView]]
JVMView
JVMView (Py4J) that allows access to the Java Virtual Machine of a JavaGateway.
JVMView can be used to reference static members (fields and methods) and to call constructors.
From the py4j.JVMView javadoc:
A JVM view keeps track of imports and import searches. A Python client can have multiple JVM views (e.g., one for each module) so that imports in one view do not conflict with imports from other views.
_jvm is initialized when SparkSession is created to be the _jvm of the given SparkContext.
_jvm must be defined when SparkSession is created or an AssertionError is thrown.
_jvm is "cleared" (stopped) in stop.
_jvm is used (among other internal uses) when:
- ChannelBuilder is requested to default_port
- InternalFrame is requested to attach_distributed_column
- DataFrameReader is requested to csv and json
- pyspark.pandas.spark.functions.py module is requested to _call_udf and _make_arguments
- SparkConversionMixin is requested to _create_from_pandas_with_arrow
- SparkSession is requested to _create_dataframe
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>
>>> type(spark._jvm)
<class 'py4j.java_gateway.JVMView'>
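As a demonstration of referencing static members through _jvm (a Py4J call into java.lang.System; Py4J converts the returned Java long to a Python int):

>>> millis = spark._jvm.java.lang.System.currentTimeMillis()
>>> type(millis)
<class 'int'>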
Creating Builder¶
@classproperty
builder(
cls) -> Builder
@classproperty Decorator
builder is a PySpark-specific @classproperty that mimics how @classmethod and @property would work together.
builder creates a new SparkSession.Builder.
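For example (the master URL and application name below are illustrative):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>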
__enter__¶
__enter__(
self) -> "SparkSession"
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__enter__ returns self.
__exit__¶
__exit__(
self,
exc_type: Optional[Type[BaseException]],
exc_val: Optional[BaseException],
exc_tb: Optional[TracebackType],
) -> None
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__exit__ stops this SparkSession (which is exactly what __exit__ is supposed to do with a resource manager once it goes out of scope and its resources should be released).
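A minimal sketch of the context-manager protocol in action (the session is stopped automatically when the with block ends):

from pyspark.sql import SparkSession

with SparkSession.builder.master("local[1]").appName("demo").getOrCreate() as session:
    session.range(3).show()  # the session is usable inside the block
# at this point __exit__ has stopped the session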
_create_shell_session¶
@staticmethod
_create_shell_session() -> "SparkSession"
@staticmethod
Learn more in Python Documentation.
_create_shell_session...FIXME
_create_shell_session is used when:
- pyspark/shell.py module is imported
Executing SQL Statement¶
sql(
self,
sqlQuery: str,
args: Optional[Dict[str, Any]] = None,
**kwargs: Any) -> DataFrame
sql creates a DataFrame with the result of executing the given sqlQuery.
sql uses SQLStringFormatter to format the given sqlQuery with the kwargs, if defined.
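For example, kwargs are string-formatted into the query by SQLStringFormatter ({df} below is substituted with a reference to the df DataFrame):

>>> df = spark.range(10)
>>> spark.sql("SELECT * FROM {df} WHERE id > 5", df=df).count()
4

The optional args mapping enables parameterized SQL, with the parameter values bound by Spark rather than formatted into the query string.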