SparkSession¶
SparkSession is a Python class in the pyspark.sql.session module.
from pyspark.sql.session import SparkSession
SparkConversionMixin¶
SparkSession extends SparkConversionMixin (for pandas-to-Spark conversion).
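As a quick sketch of that conversion path (assuming a local PySpark installation with pandas available; the master URL and application name are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

# createDataFrame with a pandas.DataFrame goes through SparkConversionMixin
pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
spark.createDataFrame(pdf).show()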
Creating Instance¶
SparkSession takes the following to be created:
- SparkContext
- Java SparkSession (Optional[JavaObject])
- Options
While being created, SparkSession gets access to _jsc and _jvm using the given SparkContext.
Note
It is expected that _jvm is defined (or an AssertionError is thrown).
Unless the given Java SparkSession is defined, SparkSession requests the _jvm for one.
In the end, SparkSession _monkey_patch_RDDs itself and installs an exception handler (install_exception_handler).
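The following is a minimal sketch of creating a SparkSession directly from a SparkContext (for demonstration only; SparkSession.Builder is the intended entry point):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(conf=SparkConf().setMaster("local[1]").setAppName("demo"))

# No Java SparkSession given, so SparkSession requests the _jvm for one
spark = SparkSession(sc)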
SparkSession is created when:
- SparkSession.Builder is requested to get or create one
- SparkSession is requested to get an active SparkSession
Java SparkContext¶
_jsc: JavaObject
_jsc is a Java SparkContext (Spark Core) that is created through Py4J.
JavaObject
JavaObject (Py4J) represents a Java object from which you can call methods or access fields.
_jsc is initialized when SparkSession is created to be the _jsc of the given SparkContext.
_jsc is used (among other internal uses) when:
- SCCallSiteSync is requested to __enter__ and __exit__
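In a PySpark shell (with the active spark session), _jsc can be inspected as follows:

>>> type(spark._jsc)
<class 'py4j.java_gateway.JavaObject'>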
py4j JVMView¶
_jvm: ClassVar[Optional[JVMView]]
JVMView
JVMView (Py4J) that allows access to the Java Virtual Machine of a JavaGateway.
JVMView can be used to reference static members (fields and methods) and to call constructors.
From the py4j.JVMView javadoc:
A JVM view keeps track of imports and import searches. A Python client can have multiple JVM views (e.g., one for each module) so that imports in one view do not conflict with imports from other views.
_jvm is initialized when SparkSession is created to be the _jvm of the given SparkContext.
_jvm must be defined when SparkSession is created or an AssertionError is thrown.
_jvm is "cleared" (stopped) in stop.
_jvm is used (among other internal uses) when:
- ChannelBuilder is requested to default_port
- InternalFrame is requested to attach_distributed_column
- DataFrameReader is requested to csv and json
- pyspark.pandas.spark.functions.py module is requested to _call_udf and _make_arguments
- SparkConversionMixin is requested to _create_from_pandas_with_arrow
- SparkSession is requested to _create_dataframe
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>
>>> type(spark._jvm)
<class 'py4j.java_gateway.JVMView'>
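As a demonstration of referencing static members through _jvm (a Py4J call into java.lang.System; Py4J converts the returned Java long to a Python int):

>>> millis = spark._jvm.java.lang.System.currentTimeMillis()
>>> type(millis)
<class 'int'>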
Creating Builder¶
@classproperty
builder(
cls) -> Builder
@classproperty Decorator
builder is a PySpark-specific @classproperty that mimics how @classmethod and @property would work together.
builder creates a new SparkSession.Builder.
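For example (the master URL and application name below are illustrative):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>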
__enter__¶
__enter__(
self) -> "SparkSession"
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__enter__ returns self.
__exit__¶
__exit__(
self,
exc_type: Optional[Type[BaseException]],
exc_val: Optional[BaseException],
exc_tb: Optional[TracebackType],
) -> None
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__exit__ stops this SparkSession (which is exactly what __exit__ is supposed to do with a resource manager once it goes out of scope and its resources should be released).
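A minimal sketch of the context-manager protocol in action (the session is stopped automatically when the with block ends):

from pyspark.sql import SparkSession

with SparkSession.builder.master("local[1]").appName("demo").getOrCreate() as session:
    session.range(3).show()  # the session is usable inside the block
# at this point __exit__ has stopped the session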
_create_shell_session¶
@staticmethod
_create_shell_session() -> "SparkSession"
@staticmethod
Learn more in Python Documentation.
_create_shell_session...FIXME
_create_shell_session is used when:
- pyspark/shell.py module is imported
Executing SQL Statement¶
sql(
self,
sqlQuery: str,
args: Optional[Dict[str, Any]] = None,
**kwargs: Any) -> DataFrame
sql creates a DataFrame with the result of executing the given sqlQuery.
sql uses SQLStringFormatter to format the given sqlQuery with the kwargs, if defined.
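For example, kwargs are string-formatted into the query by SQLStringFormatter ({df} below is substituted with a reference to the df DataFrame):

>>> df = spark.range(10)
>>> spark.sql("SELECT * FROM {df} WHERE id > 5", df=df).count()
4

The optional args mapping enables parameterized SQL, with the parameter values bound by Spark rather than formatted into the query string.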