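Before we begin: I'm Octopus; the name comes from my Chinese name, 章鱼 (octopus). I love programming, algorithms, and open source. All source code is on my personal GitHub. This blog records what I learn along the way; if you are interested in Python, Java, AI, or algorithms, feel free to follow my updates so we can learn and improve together.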
This article aims to help you write robust PySpark code.
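For how to write PySpark unit tests, see the documentation for the PySpark test utilities; the code for the built-in PySpark test utilities lives in the Spark repository, and the tickets for the PySpark test framework are tracked on the JIRA board.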
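Here is an example of how to build a PySpark application. If you already have an application that is ready to test, feel free to skip ahead to the part on testing your PySpark application.
First, start your Spark session.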
- from pyspark.sql import SparkSession
- from pyspark.sql.functions import col
-
- # Create a SparkSession
- spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()
Next, create a DataFrame.
- sample_data = [{"name": "John    D.", "age": 30},
-                {"name": "Alice   G.", "age": 25},
-                {"name": "Bob  T.", "age": 35},
-                {"name": "Eve   A.", "age": 28}]
-
- df = spark.createDataFrame(sample_data)
Now, let's define a transformation function and apply it to our DataFrame.
- from pyspark.sql.functions import col, regexp_replace
-
- # Remove additional spaces in name
- def remove_extra_spaces(df, column_name):
-     # Collapse each run of whitespace in the specified column into a single space
-     df_transformed = df.withColumn(column_name, regexp_replace(col(column_name), "\\s+", " "))
-
-     return df_transformed
-
- transformed_df = remove_extra_spaces(df, "name")
-
- transformed_df.show()
+---+--------+
|age|    name|
+---+--------+
| 30| John D.|
| 25|Alice G.|
| 35|  Bob T.|
| 28|  Eve A.|
+---+--------+
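To see what the pattern `\s+` does without starting Spark, here is a minimal plain-Python sketch using the standard `re` module. It assumes that Python's `re` and Spark's `regexp_replace` treat `\s+` the same way for ordinary ASCII whitespace; the helper name `collapse_spaces` and the sample strings with extra spaces are illustrative and not part of the original example.

```python
import re


def collapse_spaces(text):
    """Replace every run of whitespace with a single space (hypothetical helper)."""
    return re.sub(r"\s+", " ", text)


# Illustrative inputs with extra spaces between the first name and the initial
names = ["John    D.", "Alice   G.", "Bob  T.", "Eve   A."]
cleaned = [collapse_spaces(name) for name in names]
print(cleaned)
```

Note that the pattern collapses runs of whitespace but does not trim a leading or trailing space; a run at the edge of the string is reduced to one space, not removed.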
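Now let's test our PySpark transformation function. One option is to simply eyeball the resulting DataFrame, but that becomes impractical for large DataFrames or inputs. A better way is to write tests. Here are some examples of how we can test our code; they apply to Spark 3.5 and above. Note that these examples are not exhaustive: there are many other test frameworks you can use instead of unittest or pytest. The built-in PySpark testing utility functions are standalone, meaning they are compatible with any test framework or CI test pipeline.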
- import pyspark.testing
- from pyspark.testing.utils import assertDataFrameEqual
-
- # Example 1
- df1 = spark.createDataFrame(data=[("1", 1000), ("2", 3000)], schema=["id", "amount"])
- df2 = spark.createDataFrame(data=[("1", 1000), ("2", 3000)], schema=["id", "amount"])
- assertDataFrameEqual(df1, df2) # pass, DataFrames are identical
- # Example 2
- df1 = spark.createDataFrame(data=[("1", 0.1), ("2", 3.23)], schema=["id", "amount"])
- df2 = spark.createDataFrame(data=[("1", 0.109), ("2", 3.23)], schema=["id", "amount"])
- assertDataFrameEqual(df1, df2, rtol=1e-1) # pass, DataFrames are approx equal by rtol
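Why Example 2 passes can be checked by hand. The sketch below assumes the common tolerance rule `|actual - expected| <= atol + rtol * |expected|` with defaults `rtol=1e-5` and `atol=1e-8`; this is an assumption about how the comparison is done, so check the `assertDataFrameEqual` documentation for the exact behavior. The function name `approx_equal` is hypothetical.

```python
def approx_equal(actual, expected, rtol=1e-5, atol=1e-8):
    """True when |actual - expected| <= atol + rtol * |expected| (assumed rule)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The values from Example 2: they differ by about 0.009
loose = approx_equal(0.1, 0.109, rtol=1e-1)   # allowed gap is about 0.0109
strict = approx_equal(0.1, 0.109)             # allowed gap is about 0.0000011
print(loose, strict)
```

With `rtol=1e-1` the gap of 0.009 falls inside the allowed band, so the values count as equal; with the assumed defaults it does not.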
You can also simply compare two DataFrame schemas:
- from pyspark.testing.utils import assertSchemaEqual
- from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType
-
- s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
- s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
-
- assertSchemaEqual(s1, s2) # pass, schemas are identical
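For more complex testing scenarios, you may want to use a testing framework.
One of the most popular testing framework options is unittest. Let's walk through how to use the built-in Python unittest library to write PySpark tests. For more information about the unittest library, see https://docs.python.org/3/library/unittest.html.
First, you need a Spark session. The setUpClass and tearDownClass hooks of unittest.TestCase, each marked with Python's built-in @classmethod decorator, can take care of setting up and tearing down the Spark session.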
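- import unittest
-
- from pyspark.sql import SparkSession
-
- class PySparkTestCase(unittest.TestCase):
-     @classmethod
-     def setUpClass(cls):
-         # Runs once before the tests in the class: start the shared session
-         cls.spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()
-
-     @classmethod
-     def tearDownClass(cls):
-         # Runs once after the tests in the class: stop the session
-         cls.spark.stop()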
Now let's write a unittest class.
- from pyspark.testing.utils import assertDataFrameEqual
-
- class TestTranformation(PySparkTestCase):
-     def test_single_space(self):
-         sample_data = [{"name": "John    D.", "age": 30},
-                        {"name": "Alice   G.", "age": 25},
-                        {"name": "Bob  T.", "age": 35},
-                        {"name": "Eve   A.", "age": 28}]
-
-         # Create a Spark DataFrame using the session from setUpClass
-         original_df = self.spark.createDataFrame(sample_data)
-
-         # Apply the transformation function from before
-         transformed_df = remove_extra_spaces(original_df, "name")
-
-         expected_data = [{"name": "John D.", "age": 30},
-                          {"name": "Alice G.", "age": 25},
-                          {"name": "Bob T.", "age": 35},
-                          {"name": "Eve A.", "age": 28}]
-
-         expected_df = self.spark.createDataFrame(expected_data)
-
-         assertDataFrameEqual(transformed_df, expected_df)
When run, unittest will pick up all test case methods whose names begin with "test".
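This naming rule can be checked directly with the standard library's `unittest.TestLoader.getTestCaseNames`, which lists the methods the loader would collect from a test case class. The class `NamingExample` and its methods are hypothetical and exist only for this sketch.

```python
import unittest


class NamingExample(unittest.TestCase):
    def test_collected(self):
        self.assertEqual(1 + 1, 2)

    def helper_not_collected(self):
        return "ignored"

    def check_not_collected(self):
        return "ignored"


# Ask the default loader which method names it treats as tests
collected = list(unittest.TestLoader().getTestCaseNames(NamingExample))
print(collected)
```

Only the method whose name starts with `test` is listed, so helper methods can live in the same class without being run as tests.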
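We can also write our tests with pytest, which is one of the most popular Python testing frameworks. For more information about pytest, see the docs here: https://docs.pytest.org/en/7.1.x/contents.html.
Using a pytest fixture allows us to share a Spark session across tests and tear it down when the tests are complete.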
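- import pytest
- from pyspark.sql import SparkSession
-
- @pytest.fixture(scope="session")
- def spark_fixture():
-     # Session scope: one Spark session is shared by every test in the run
-     spark = SparkSession.builder.appName("Testing PySpark Example").getOrCreate()
-     yield spark
-     # Runs after the last test that uses the fixture: tear the session down
-     spark.stop()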
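A fixture written with `yield` behaves much like a generator: the code before `yield` is setup, the yielded value is handed to the test, and any code after `yield` runs as teardown. The plain-Python sketch below imitates that flow by driving a generator by hand. It is only an analogy for how pytest drives the fixture, and `fake_session_fixture`, `log`, and the dictionary standing in for a SparkSession are hypothetical names.

```python
def fake_session_fixture(log):
    """Generator shaped like a pytest yield fixture (plain Python, no pytest)."""
    log.append("setup")
    session = {"open": True}  # stand-in for a SparkSession
    yield session
    # Everything after the yield plays the role of teardown
    session["open"] = False
    log.append("teardown")


log = []
gen = fake_session_fixture(log)
session = next(gen)       # setup runs; the "test" receives the session
log.append("test body")   # the test would use the session here
next(gen, None)           # resume after the yield so teardown runs
print(log)
```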
We can then define our tests like this:
- import pytest
- from pyspark.testing.utils import assertDataFrameEqual
-
- def test_single_space(spark_fixture):
-     sample_data = [{"name": "John    D.", "age": 30},
-                    {"name": "Alice   G.", "age": 25},
-                    {"name": "Bob  T.", "age": 35},
-                    {"name": "Eve   A.", "age": 28}]
-
-     # Create a Spark DataFrame using the session provided by the fixture
-     original_df = spark_fixture.createDataFrame(sample_data)
-
-     # Apply the transformation function from before
-     transformed_df = remove_extra_spaces(original_df, "name")
-
-     expected_data = [{"name": "John D.", "age": 30},
-                      {"name": "Alice G.", "age": 25},
-                      {"name": "Bob T.", "age": 35},
-                      {"name": "Eve A.", "age": 28}]
-
-     expected_df = spark_fixture.createDataFrame(expected_data)
-
-     assertDataFrameEqual(transformed_df, expected_df)
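When you run your test file with the pytest command, it will pick up all functions whose names begin with "test".
Let's see all the steps together in a unit test example.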
- # pkg/etl.py
- import unittest
-
- from pyspark.sql import SparkSession
- from pyspark.sql.functions import col
- from pyspark.sql.functions import regexp_replace
- from pyspark.testing.utils import assertDataFrameEqual
-
- # Create a SparkSession
- spark = SparkSession.builder.appName("Sample PySpark ETL").getOrCreate()
-
- sample_data = [{"name": "John    D.", "age": 30},
-                {"name": "Alice   G.", "age": 25},
-                {"name": "Bob  T.", "age": 35},
-                {"name": "Eve   A.", "age": 28}]
-
- df = spark.createDataFrame(sample_data)
-
- # Define DataFrame transformation function
- def remove_extra_spaces(df, column_name):
-     # Remove extra spaces from the specified column using regexp_replace
-     df_transformed = df.withColumn(column_name, regexp_replace(col(column_name), "\\s+", " "))
-
-     return df_transformed
- # pkg/test_etl.py
- import unittest
-
- from pyspark.sql import SparkSession
- from pyspark.testing.utils import assertDataFrameEqual
-
- # Assumes the tests are run from the directory that contains pkg/
- from pkg.etl import remove_extra_spaces
-
- # Define unit test base class
- class PySparkTestCase(unittest.TestCase):
-     @classmethod
-     def setUpClass(cls):
-         cls.spark = SparkSession.builder.appName("Sample PySpark ETL").getOrCreate()
-
-     @classmethod
-     def tearDownClass(cls):
-         cls.spark.stop()
-
- # Define unit test
- class TestTranformation(PySparkTestCase):
-     def test_single_space(self):
-         sample_data = [{"name": "John    D.", "age": 30},
-                        {"name": "Alice   G.", "age": 25},
-                        {"name": "Bob  T.", "age": 35},
-                        {"name": "Eve   A.", "age": 28}]
-
-         # Create a Spark DataFrame using the session from setUpClass
-         original_df = self.spark.createDataFrame(sample_data)
-
-         # Apply the transformation function under test
-         transformed_df = remove_extra_spaces(original_df, "name")
-
-         expected_data = [{"name": "John D.", "age": 30},
-                          {"name": "Alice G.", "age": 25},
-                          {"name": "Bob T.", "age": 35},
-                          {"name": "Eve A.", "age": 28}]
-
-         expected_df = self.spark.createDataFrame(expected_data)
-
-         assertDataFrameEqual(transformed_df, expected_df)
- unittest.main(argv=[''], verbosity=0, exit=False)
Ran 1 test in 1.734s