Demo: SDP Python API¶

Activate Virtual Environment

Follow Demo: Create Virtual Environment for Python Client before getting started with this demo.

In a terminal, start a Spark Connect Server.

./sbin/start-connect-server.sh

It will listen on port 15002.

Monitor Logs

tail -f logs/*org.apache.spark.sql.connect.service.SparkConnectServer*.out

Start PySpark Shell¶

Start a Spark Connect-enabled PySpark shell.

$SPARK_HOME/bin/pyspark --remote sc://localhost:15002

from pyspark.pipelines.spark_connect_pipeline import create_dataflow_graph
dataflow_graph_id = create_dataflow_graph(
  spark,
  default_catalog=None,
  default_database=None,
  sql_conf=None,
)

# >>> print(dataflow_graph_id)
# 3cb66d5a-0621-4f15-9920-e99020e30e48

from pyspark.pipelines.spark_connect_graph_element_registry import SparkConnectGraphElementRegistry
registry = SparkConnectGraphElementRegistry(spark, dataflow_graph_id)

from pyspark import pipelines as dp

from pyspark.pipelines.graph_element_registry import graph_element_registration_context
with graph_element_registration_context(registry):
  dp.create_streaming_table("demo_streaming_table")

You should see the following INFO message in the logs of the Spark Connect Server:

INFO PipelinesHandler: Define pipelines dataset cmd received: define_dataset {
  dataflow_graph_id: "3cb66d5a-0621-4f15-9920-e99020e30e48"
  dataset_name: "demo_streaming_table"
  dataset_type: TABLE
}