SparkSession¶
SparkSession is a Python class in the pyspark.sql.session module.
from pyspark.sql.session import SparkSession
SparkConversionMixin¶
SparkSession uses SparkConversionMixin (for pandas to Spark conversion).
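For example, this mixin is what backs createDataFrame over a pandas DataFrame (a minimal sketch, assuming pandas is installed and spark is an active SparkSession):

import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df = spark.createDataFrame(pdf)  # pandas-to-Spark conversion via SparkConversionMixin
df.show()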
Creating Instance¶
SparkSession takes the following to be created:

- SparkContext
- SparkSession (Optional[JavaObject])
- Options
While being created, SparkSession gets access to _jsc and _jvm using the given SparkContext.
Note
It is expected that _jvm is defined (or an exception is thrown).
Unless the given SparkSession is defined, SparkSession gets one from the _jvm.
In the end, SparkSession monkey-patches RDDs (_monkey_patch_RDD) and installs an exception handler (install_exception_handler).
SparkSession is created when:

- SparkSession.Builder is requested to get or create one
- SparkSession is requested to get an active SparkSession
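The usual path is the first one: Builder.getOrCreate creates (or reuses) a SparkContext and instantiates SparkSession with it. A minimal sketch of doing the same by hand (master and app name are arbitrary; the builder is the recommended entry point):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# normally done for you by SparkSession.Builder.getOrCreate
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("demo"))
spark = SparkSession(sc)  # jsparkSession defaults to None, so one is obtained via _jvm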
Java SparkContext¶
_jsc: JavaObject
_jsc is a Java SparkContext (Spark Core) that is created through Py4J.
JavaObject (Py4J) represents a Java object from which you can call methods or access fields.
_jsc is initialized when SparkSession is created to be the _jsc of the given SparkContext.
_jsc is used (among the other internal uses) when:

- SCCallSiteSync is requested to __enter__ and __exit__
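As an illustration (not from the source; _jsc is an internal Py4J handle, so this is for exploration only; assumes an active SparkSession named spark):

jsc = spark._jsc                 # py4j.java_gateway.JavaObject
print(jsc.getClass().getName())  # org.apache.spark.api.java.JavaSparkContext
print(jsc.sc().appName())        # the underlying (Scala) SparkContext's app name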
py4j JVMView¶
_jvm: ClassVar[Optional[JVMView]]
JVMView (Py4J) allows access to the Java Virtual Machine of a JavaGateway. JVMView can be used to reference static members (fields and methods) and to call constructors.
From py4j.JVMView javadoc:
A JVM view keeps track of imports and import searches. A Python client can have multiple JVM views (e.g., one for each module) so that imports in one view do not conflict with imports from other views.
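For example (an exploration sketch, not from the source; assumes an active SparkSession named spark):

jvm = spark._jvm
print(jvm.java.lang.System.getProperty("java.version"))  # call a static method
sb = jvm.java.lang.StringBuilder()                        # call a constructor
sb.append("hello from the JVM")
print(sb.toString())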
_jvm is initialized when SparkSession is created to be the _jvm of the given SparkContext.
_jvm must be defined when SparkSession is created or an AssertionError is thrown.
_jvm is "cleared" (stopped) in stop.
_jvm is used (among the other internal uses) when:

- ChannelBuilder is requested to default_port
- InternalFrame is requested to attach_distributed_column
- DataFrameReader is requested to csv and json
- pyspark.pandas.spark.functions module is requested to _call_udf and _make_arguments
- SparkConversionMixin is requested to _create_from_pandas_with_arrow
- SparkSession is requested to _create_dataframe
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>
>>> type(spark._jvm)
<class 'py4j.java_gateway.JVMView'>
Creating Builder¶
@classproperty
builder(cls) -> Builder
builder is a @classproperty, a PySpark-specific decorator that mimics how @classmethod and @property should work together.
builder creates a new SparkSession.Builder.
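For example (a typical usage sketch; master and app name are arbitrary):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")       # arbitrary example master
    .appName("builder-demo")  # arbitrary example app name
    .getOrCreate()
)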
__enter__¶
__enter__(self) -> "SparkSession"
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__enter__ returns self.
__exit__¶
__exit__(
  self,
  exc_type: Optional[Type[BaseException]],
  exc_val: Optional[BaseException],
  exc_tb: Optional[TracebackType],
) -> None
Special Method
Enables with SparkSession.builder.(...).getOrCreate() as session: syntax.
__exit__ stops this SparkSession (which is exactly what __exit__ is supposed to do with a resource manager once it is out of scope and its resources should be released).
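A usage sketch (master and app name are arbitrary); the session is stopped by __exit__ when the with block ends:

from pyspark.sql import SparkSession

with SparkSession.builder.master("local[*]").appName("ctx-demo").getOrCreate() as session:
    session.range(3).show()
# at this point __exit__ has already stopped the session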
_create_shell_session¶
@staticmethod
_create_shell_session() -> "SparkSession"
@staticmethod
Learn more in Python Documentation.
_create_shell_session
...FIXME
_create_shell_session is used when:
- pyspark/shell.py module is imported
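As an illustration (not from the source; the exact application name may differ across versions), launching the pyspark shell imports pyspark/shell.py, which calls _create_shell_session and exposes the resulting session as spark:

>>> # in the pyspark shell
>>> type(spark)
<class 'pyspark.sql.session.SparkSession'>
>>> spark.sparkContext.appName
'PySparkShell'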
Executing SQL Statement¶
sql(
  self,
  sqlQuery: str,
  args: Optional[Dict[str, Any]] = None,
  **kwargs: Any) -> DataFrame
sql creates a DataFrame with the result of executing the given sqlQuery.
sql uses SQLStringFormatter to format the given sqlQuery with the kwargs, if defined.
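A minimal sketch of both substitution styles (assumes an active SparkSession named spark; the table and parameter names are made up):

df = spark.range(10)

# kwargs go through SQLStringFormatter: Python objects (e.g. DataFrames)
# are formatted into the query text
spark.sql("SELECT * FROM {src} WHERE id > {threshold}", src=df, threshold=5).show()

# args binds named parameters of a parameterized query (Spark 3.4+)
spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 5}).show()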