Observation¶
Observation
is a Python class to observe (named) metrics on a DataFrame.
from pyspark.sql.observation import Observation
pyspark.sql
Observation
is imported using *
import from pyspark.sql
as well as pyspark.sql.observation
(as is included in __all__
of the modules).
from pyspark.sql import *
Creating Instance¶
Observation
takes the following to be created:
- Name (optional)
_jo¶
_jo: Optional[JavaObject]
get¶
get(
self) -> Dict[str, Any]
get
requests the _jo to getAsJava
and converts the py4j JavaMap
to a Python dict.
Demo¶
from pyspark.sql.observation import Observation
observation = Observation("demo")
import pandas as pd
pandas_df = pd.DataFrame({
'name': ['jacek', 'agata', 'iweta', 'patryk', 'maksym'],
'age': [50, 49, 29, 26, 11]
})
df = spark.createDataFrame(pandas_df)
from pyspark.sql.functions import *
row_count_metric = count(lit(1)).alias("count")
observed_df = df.observe(observation, row_count_metric)
observed_df.count()
observation.get()
{'count': 5}