• Summary of Errors Encountered While Debugging a PySpark Program in Jupyter Notebook


    Project scenario:

    Environment: Ubuntu 16.04, Hadoop 2.6.0, Spark 2.3.1.
    A summary of the errors I ran into while debugging a simple PySpark program (it turns out that matching versions and getting the basic configuration right really matter).

    Note: the following are assumed to be installed and configured in advance:
            hadoop, hive, anaconda, jupyter notebook, spark, zookeeper

    (I may put together a tutorial for these at some point.)


    Problem:

    Running pyspark, the Spark banner does not appear

    cuihaipeng01@hadoop1:~$ pyspark
    Python 3.7.6 (default, Jan 8 2020, 19:59:22)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/apps/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/apps/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2023-11-17 14:14:21 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Traceback (most recent call last):
      File "/apps/spark/python/pyspark/shell.py", line 45, in <module>
        spark = SparkSession.builder\
      File "/apps/spark/python/pyspark/sql/session.py", line 173, in getOrCreate
        sc = SparkContext.getOrCreate(sparkConf)
      File "/apps/spark/python/pyspark/context.py", line 343, in getOrCreate
        SparkContext(conf=conf or SparkConf())
      File "/apps/spark/python/pyspark/context.py", line 118, in __init__
        conf, jsc, profiler_cls)
      File "/apps/spark/python/pyspark/context.py", line 186, in _do_init
        self._accumulatorServer = accumulators._start_update_server()
      File "/apps/spark/python/pyspark/accumulators.py", line 259, in _start_update_server
        server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
      File "/apps/anaconda3/lib/python3.7/socketserver.py", line 452, in __init__
        self.server_bind()
      File "/apps/anaconda3/lib/python3.7/socketserver.py", line 466, in server_bind
        self.socket.bind(self.server_address)
    socket.gaierror: [Errno -2] Name or service not known
    >>>
     
    

    Cause analysis:

    Note this line:

    socket.gaierror: [Errno -2] Name or service not known

    Possible causes of this error:
    1. SPARK_MASTER_IP is not set

    2. The pyspark library is not installed

    1. Check SPARK_MASTER_IP

    Edit the spark-env.sh configuration file:

    vim /apps/spark/conf/spark-env.sh

    export SPARK_DIST_CLASSPATH=$(/apps/hadoop/bin/hadoop classpath)
    export HADOOP_CONF_DIR=/apps/hadoop/etc/hadoop
    export JAVA_HOME=/apps/java
    export SPARK_MASTER_IP=cuihaipeng01

      My configuration turned out to be fine: cuihaipeng01 is a hostname mapping I set up, and it resolves to the correct IP, so I ruled this cause out (a quick resolution check is sketched below).
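
    As a quick sanity check for this kind of socket.gaierror (my own addition, not part of the original troubleshooting), a short Python sketch can confirm that both localhost and the master hostname resolve; the traceback shows the accumulator server binding to ("localhost", 0), so "localhost" in particular must resolve:

    import socket

    # Hedged diagnostic sketch: verify that the names Spark needs can be resolved.
    # "cuihaipeng01" is the mapped master hostname from spark-env.sh above.
    for name in ("localhost", "cuihaipeng01"):
        try:
            print(name, "->", socket.gethostbyname(name))
        except socket.gaierror as e:
            print(name, "does not resolve:", e)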

    2. Check whether the pyspark library is installed

    List the installed Python packages with the pip list command (a quick importability check is also sketched below)
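
    As an alternative to eyeballing the pip list output, here is a minimal sketch using only the standard library (my addition; the original post just uses pip list):

    import importlib.util

    # find_spec() returns None when the package is not importable from this interpreter.
    if importlib.util.find_spec("pyspark") is None:
        print("pyspark is NOT installed in this Python environment")
    else:
        import pyspark
        print("pyspark found, version:", pyspark.__version__)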

    Solution:

    pyspark was not in the list, which revealed the root of the problem and led to the steps below.

    (Be sure to pin the version here so it matches your Spark version: for Spark 2.3.1, install pyspark 2.3.1.)

    Running pip install pyspark reported an error, and it still failed even with a mirror specified. After looking it up, the cause was that the default package index is hosted overseas.

    Fix: use a domestic (Chinese) mirror

    pip install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com pyspark==2.3.1

    That still did not work; a different error came up.

    Fix: also install the findspark and requests libraries

    pip install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com findspark
    pip install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com requests
    

    Finally, run the pyspark install again:

     pip install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com pyspark==2.3.1

    The installation succeeded.

    Run pyspark again.

    A warning appeared this time, but checking ~/.bashrc showed that the configuration there was fine:

    #hadoop
    export HADOOP_HOME=/apps/hadoop
    export PATH=$HADOOP_HOME/bin:$PATH
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

    So I tested by running PySpark directly on Ubuntu.

    Create test.py:

    import findspark
    findspark.init()

    from pyspark import SparkConf, SparkContext

    # Connect to the standalone master and move the Spark web UI to port 4050
    conf = SparkConf().setMaster("spark://cuihaipeng01:7077").setAppName("My App").set("spark.ui.port", "4050")
    sc = SparkContext(conf=conf)

    # Count the lines in Spark's README that contain the letters 'a' and 'b'
    logFile = "file:///apps/spark/README.md"
    logData = sc.textFile(logFile, 2).cache()
    numAs = logData.filter(lambda line: 'a' in line).count()
    numBs = logData.filter(lambda line: 'b' in line).count()
    print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))

    Here "spark://cuihaipeng01:7077" is the Spark master I specified; my cluster has four hosts, one master and three slaves.

    For a purely local setup the master can simply be local (see the sketch below).
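
    For reference, here is a minimal local-mode variant of the same setup (my own sketch, not from the original post); local[*] runs Spark inside the Python process using all available cores, so no standalone master is needed:

    import findspark
    findspark.init()

    from pyspark import SparkConf, SparkContext

    # Local mode: no cluster required; useful for quickly checking that pyspark itself works.
    conf = SparkConf().setMaster("local[*]").setAppName("My Local App")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())  # quick smoke test: prints 45
    sc.stop()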

    After starting pyspark, open another terminal and run the file:

    python3 ~/test.py


    Running the PySpark program in Jupyter Notebook:

    Open another terminal and run:

    jupyter notebook

    Then run the program from a notebook cell (a sketch of such a cell follows below).

    It executed successfully.
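
    For completeness, a sketch of the same test as a single notebook cell (my own addition, assuming the standalone master spark://cuihaipeng01:7077 used earlier); findspark.init() lets the notebook's Python kernel locate the Spark installation:

    # Run inside a Jupyter notebook cell
    import findspark
    findspark.init()

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("spark://cuihaipeng01:7077").setAppName("Notebook App").set("spark.ui.port", "4050")
    sc = SparkContext(conf=conf)

    # Same letter count as test.py, against Spark's bundled README
    logData = sc.textFile("file:///apps/spark/README.md", 2).cache()
    numAs = logData.filter(lambda line: 'a' in line).count()
    numBs = logData.filter(lambda line: 'b' in line).count()
    print('Lines with a: %s, Lines with b: %s' % (numAs, numBs))
    sc.stop()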

  • Original article: https://blog.csdn.net/qq_44741467/article/details/134463574