DataFrame¶
DataFrame
is a Python class with PandasMapOpsMixin and PandasConversionMixin mixins.
DataFrame
is defined in pyspark.sql.dataframe module.
from pyspark.sql.dataframe import DataFrame
Creating Instance¶
DataFrame
takes the following to be created:
- jdf
- SQLContext
groupBy¶
groupBy(self, *cols)
groupBy
requests the _jdf to groupBy
and creates a GroupedData with it.
observe¶
observe(
self,
observation: Union["Observation", str],
*exprs: Column,
) -> "DataFrame"
observe
accepts an Observation or a name as the observation
:
-
For an Observation,
observe
requests it to _on (with thisDataFrame
and theexprs
columns). -
For a name,
observe
creates a newDataFrame
after requesting _jdf toobserve
(with the name).
Demo¶
QueryExecutionListener
You should install QueryExecutionListener
(Spark SQL) to intercept QueryExecution
on a successful query execution (to access observedMetrics
).
import pandas as pd
pandas_df = pd.DataFrame({
'name': ['jacek', 'agata', 'iweta', 'patryk', 'maksym'],
'age': [50, 49, 29, 26, 11]
})
df = spark.createDataFrame(pandas_df)
from pyspark.sql.functions import *
row_count_metric = count(lit(1)).alias("count")
observed_df = df.observe("observe_demo", row_count_metric)
observed_df.count()