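Download and install JDK 17, then set JAVA_HOME.
Download and install hadoop-3.3.5, replace its bin directory entirely with the Windows binaries from the winutils repository linked below, then set HADOOP_HOME.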
Index of /hadoop/common/hadoop-3.3.5
GitHub - cdarlint/winutils: winutils.exe hadoop.dll and hdfs.dll binaries for hadoop windows
Download Spark and set SPARK_HOME.
Install pyspark.
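The setup steps above can be sanity-checked before starting Spark. The following is a minimal sketch, assuming the three environment variables named above and that winutils.exe sits in the bin directory under HADOOP_HOME; the helper name check_spark_env is hypothetical and is not part of Spark or Hadoop.

```python
import os
from pathlib import Path

# Environment variables the setup steps ask for.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means the basic setup looks fine.

    `env` is any mapping of variable names to values (defaults to os.environ).
    """
    env = os.environ if env is None else env
    problems = []

    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")

    # Assumption: the replaced bin directory contains winutils.exe.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and Path(hadoop_home).is_dir():
        if not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
            problems.append("winutils.exe not found in HADOOP_HOME/bin")

    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

Running the script prints one line per problem and nothing if all three variables point to existing directories and winutils.exe is in place.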
Demo
Error encountered
org.apache.spark.SparkException: Python worker failed to connect back.
Note: the path to the Python interpreter must be specified explicitly.
- from pyspark.sql import SparkSession
- import time
-
- # Create the SparkSession
- spark = SparkSession.builder.appName("CSV to DataFrame").getOrCreate()
-
- # Read the CSV file into a DataFrame
- csv_file_path = "../large_test_file.csv"  # replace with the path to your CSV file
- df = spark.read.csv(csv_file_path, header=True, inferSchema=True)
-
- # Register a temporary view so the data can be queried with SQL
- df.createOrReplaceTempView("csv_table")
- start_time = time.time()
- # Query the data with Spark SQL
- sql_query = """
- SELECT max(col_18) as final FROM csv_table
- """
- result_df = spark.sql(sql_query)
-
- # Show the query result
- result_df.show()
- # Elapsed time is measured with the time module and covers the query and show() only
- print(f"Elapsed time (time module): {time.time() - start_time}")
- # Elapsed time (time module): 0.9699978828430176
- # Stop the SparkSession
- spark.stop()
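The "Python worker failed to connect back" error noted above is commonly resolved by telling Spark which Python interpreter to use. The following is a minimal sketch, assuming the interpreter running the script is the one that has pyspark installed. PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are environment variables that Spark reads; the helper name use_current_python is hypothetical.

```python
import os
import sys


def use_current_python(env=None):
    """Point Spark's driver and workers at the interpreter running this script.

    `env` is a mutable mapping (defaults to os.environ). Returns the path that was set.
    """
    env = os.environ if env is None else env
    # Assumption: sys.executable is the Python that has pyspark installed.
    env["PYSPARK_PYTHON"] = sys.executable
    env["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return env["PYSPARK_PYTHON"]


# Call this before the SparkSession is created, so the workers are
# launched with the same interpreter as the driver.
use_current_python()
```

With these variables set at the top of the demo script, the SparkSession can then be created as in the demo. Setting them as system environment variables instead should have the same effect.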
Environment
Python 3.10
- annotated-types==0.7.0
- anyio==4.4.0
- certifi==2024.2.2
- click==8.1.7
- cloudpickle==3.0.0
- colorama==0.4.6
- dask==2024.1.1
- dask_sql==2024.3.0
- distributed==2024.1.1
- dnspython==2.6.1
- email_validator==2.1.1
- exceptiongroup==1.2.1
- fastapi==0.111.0
- fastapi-cli==0.0.4
- fsspec==2024.5.0
- h11==0.14.0
- httpcore==1.0.5
- httptools==0.6.1
- httpx==0.27.0
- idna==3.7
- importlib_metadata==7.1.0
- Jinja2==3.1.4
- locket==1.0.0
- markdown-it-py==3.0.0
- MarkupSafe==2.1.5
- mdurl==0.1.2
- msgpack==1.0.8
- numpy==1.26.4
- orjson==3.10.3
- packaging==24.0
- pandas==2.2.2
- partd==1.4.2
- prompt_toolkit==3.0.45
- psutil==5.9.8
- py4j==0.10.9.7
- pydantic==2.7.1
- pydantic_core==2.18.2
- Pygments==2.18.0
- pyspark==3.5.1
- python-dateutil==2.9.0.post0
- python-dotenv==1.0.1
- python-multipart==0.0.9
- pytz==2024.1
- PyYAML==6.0.1
- rich==13.7.1
- shellingham==1.5.4
- six==1.16.0
- sniffio==1.3.1
- sortedcontainers==2.4.0
- starlette==0.37.2
- tabulate==0.9.0
- tblib==3.0.0
- toolz==0.12.1
- tornado==6.4
- typer==0.12.3
- typing_extensions==4.12.0
- tzdata==2024.1
- tzlocal==5.2
- ujson==5.10.0
- urllib3==2.2.1
- uvicorn==0.30.0
- watchfiles==0.22.0
- wcwidth==0.2.13
- websockets==12.0
- zict==3.0.0
- zipp==3.19.0