Fixing the error TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'


    1. Problem description

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    
    @udf(returnType=StringType())
    def bad_funify(s):
        return s + " is fun!"
    
    countries2 = spark.createDataFrame([("Thailand", 3), (None, 4)], ["country", "id"])
    countries2.withColumn("fun_country", bad_funify("country")).show()
    

    The goal is to use a UDF on a DataFrame with two columns, country and id, to add a new column fun_country containing the string "<country> is fun!". Some rows, however, have no value in the country column (note that the missing value reaches the Python UDF as None, not as a SQL null), and the job fails with the traceback below:

    PythonException: 
      An exception was thrown from the Python worker. Please see the stack trace below.
    Traceback (most recent call last):
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
        process()
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
        serializer.dump_stream(out_iter, outfile)
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
        for obj in iterator:
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 200, in _batched
        for item in iterator:
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 452, in mapper
        result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 452, in <genexpr>
        result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 87, in <lambda>
        return lambda *a: f(*a)
      File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
        return f(*args, **kwargs)
      File "", line 5, in bad_funify
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
    
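    The root cause is that a SQL null in the country column is handed to the Python worker as None, and plain Python cannot concatenate None with a str. The same error reproduces outside Spark entirely (a minimal sketch, independent of the DataFrame above):
    
    # Concatenating None with a str raises the same TypeError as inside the UDF
    s = None
    s + " is fun!"  # TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'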

    2. Solution

    This turns out to be a trivial oversight: when country is null, fun_country should simply be null as well, so the UDF only needs an extra null check. After rewriting the UDF as good_funify:

    @udf(returnType=StringType())
    def good_funify(s):
        return None if s is None else s + " is fun!"
    
    countries2.withColumn("fun_country", good_funify("country")).show()
    
    +--------+---+----------------+
    | country| id|     fun_country|
    +--------+---+----------------+
    |Thailand|  3|Thailand is fun!|
    |    null|  4|            null|
    +--------+---+----------------+
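
    A UDF is not strictly needed for this transformation. Spark's built-in string functions already propagate nulls, so the same result can be produced without any explicit None check (a sketch, assuming the same countries2 DataFrame as above; concat returns null whenever any of its inputs is null):
    
    from pyspark.sql import functions as F
    
    # concat yields null for the null country row, matching good_funify's behavior
    countries2.withColumn(
        "fun_country", F.concat(F.col("country"), F.lit(" is fun!"))
    ).show()
    
    Built-in column expressions also avoid the Python serialization overhead that comes with a UDF.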
    

    Reference

    [1] Navigating None and null in PySpark

  • Original article: https://blog.csdn.net/qq_35812205/article/details/126077463