SparkSession

SparkSession is a Python class in the pyspark.sql.session module.

from pyspark.sql.session import SparkSession

SparkConversionMixin

SparkSession uses SparkConversionMixin (for pandas to Spark conversion).
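
A minimal sketch of the conversion, assuming an active SparkSession named spark (createDataFrame is where the mixin's pandas support surfaces):

>>> import pandas as pd
>>> pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
>>> spark.createDataFrame(pdf).show()
+---+----+
| id|name|
+---+----+
|  1|   a|
|  2|   b|
+---+----+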

Creating Instance

SparkSession takes the following to be created:

  • SparkContext
  • Java SparkSession (JavaObject, optional)

While being created, SparkSession gets access to _jsc and _jvm using the given SparkContext.

Note

It is expected that _jvm is defined (or an AssertionError is thrown).

Unless the given Java SparkSession is defined, SparkSession gets one from the _jvm.

In the end, SparkSession monkey-patches RDDs (_monkey_patch_RDD) and installs an exception handler (install_exception_handler).


SparkSession is created when:

  • SparkSession.Builder is requested to getOrCreate
  • _create_shell_session is executed

Java SparkContext

_jsc: JavaObject

_jsc is a Java SparkContext (Spark Core) that is created through Py4J.

JavaObject

JavaObject (Py4J) represents a Java object from which you can call methods or access fields.

_jsc is initialized when SparkSession is created to be the _jsc of the given SparkContext.

_jsc is used (among other internal uses) when:

  • SCCallSiteSync is requested to __enter__ and __exit__
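
A quick check in a PySpark shell (assuming the active session is spark; the app name is what the shell sets):

>>> type(spark._jsc)
<class 'py4j.java_gateway.JavaObject'>
>>> spark._jsc.sc().appName()
'PySparkShell'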

py4j JVMView

_jvm: ClassVar[Optional[JVMView]]
JVMView

JVMView (Py4J) allows access to the Java Virtual Machine of a JavaGateway.

JVMView can be used to reference static members (fields and methods) and to call constructors.
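
For example, in a PySpark shell _jvm gives access to static Java methods:

>>> spark._jvm.java.lang.Integer.parseInt("42")
42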

From py4j.JVMView javadoc:

A JVM view keeps track of imports and import searches. A Python client can have multiple JVM views (e.g., one for each module) so that imports in one view do not conflict with imports from other views.

_jvm is initialized when SparkSession is created to be the _jvm of the given SparkContext.

_jvm must be defined when SparkSession is created or an AssertionError is thrown.

_jvm is "cleared" (stopped) in stop.

_jvm is used (among other internal uses) when:

  • ChannelBuilder is requested to default_port
  • InternalFrame is requested to attach_distributed_column
  • DataFrameReader is requested to csv and json
  • pyspark.pandas.spark.functions module is requested to _call_udf and _make_arguments
  • SparkConversionMixin is requested to _create_from_pandas_with_arrow
  • SparkSession is requested to _create_dataframe

Demo

>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>

>>> type(spark._jvm)
<class 'py4j.java_gateway.JVMView'>

Creating Builder

@classproperty
builder(
  cls) -> Builder
@classproperty Decorator

builder is a PySpark-specific @classproperty that mimics how @classmethod and @property work together.

builder creates a new SparkSession.Builder.
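
For example (master and appName below are illustrative):

>>> from pyspark.sql import SparkSession
>>> spark = (SparkSession.builder
...     .master("local[*]")
...     .appName("demo")
...     .getOrCreate())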

__enter__

__enter__(
  self) -> "SparkSession"
Special Method

Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.

Learn more:

  1. PEP 343 – The "with" Statement
  2. 3.3.9. With Statement Context Managers
  3. Context Managers and Python's with Statement

__enter__ returns self.

__exit__

__exit__(
  self,
  exc_type: Optional[Type[BaseException]],
  exc_val: Optional[BaseException],
  exc_tb: Optional[TracebackType],
) -> None
Special Method

Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.

Learn more:

  1. PEP 343 – The "with" Statement
  2. 3.3.9. With Statement Context Managers
  3. Context Managers and Python's with Statement

__exit__ stops this SparkSession (which is exactly what __exit__ is supposed to do for a resource manager: release the resources once they are no longer needed).
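
A minimal sketch of the syntax (a local master for illustration):

from pyspark.sql import SparkSession

with SparkSession.builder.master("local[*]").appName("demo").getOrCreate() as session:
    session.range(3).show()
# the session is stopped here (__exit__ has been called)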

_create_shell_session

@staticmethod
_create_shell_session() -> "SparkSession"
@staticmethod

Learn more in Python Documentation.

_create_shell_session...FIXME


_create_shell_session is used when:

  • pyspark.shell module is executed (at the launch of the PySpark shell)

Executing SQL Statement

sql(
  self,
  sqlQuery: str,
  args: Optional[Dict[str, Any]] = None,
  **kwargs: Any) -> DataFrame

sql creates a DataFrame by executing the given sqlQuery.

sql uses SQLStringFormatter to format the given sqlQuery with the kwargs, if defined.
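
A minimal sketch of both features (the {mydf} placeholder goes through SQLStringFormatter, while the :threshold named parameter relies on the args support of more recent Spark releases):

>>> mydf = spark.range(10)
>>> spark.sql(
...     "SELECT id FROM {mydf} WHERE id > :threshold",
...     args={"threshold": 6},
...     mydf=mydf).show()
+---+
| id|
+---+
|  7|
|  8|
|  9|
+---+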