Spark Errors and Exceptions, and How to Fix Them

org.apache.spark.shuffle.FetchFailedException
Container killed by YARN for exceeding memory limits.
Container on host: xxx was preempted
org.apache.spark.SparkException: Task not serializable

org.apache.spark.shuffle.FetchFailedException

org.apache.spark.shuffle.FetchFailedException: Connecting to xxx timed out (120000 ms)

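Cause: too much data is read during the shuffle fetch, so the connection to the remote node times out.
Result: the stage fails and is then retried.
Fix: none found yet.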
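No confirmed fix is recorded for this error. The 120000 ms in the message matches the 120s default of `spark.network.timeout`, so one mitigation that may be worth trying is to raise that timeout and the shuffle fetch retry settings. The sketch below only shows where those settings go; the object name `ShuffleFetchConf` and the chosen values are assumptions for illustration, not tested recommendations.

```scala
import org.apache.spark.SparkConf

// Hypothetical holder object; the values are illustrative assumptions.
object ShuffleFetchConf {
  val conf: SparkConf = new SparkConf()
    // Default is 120s, which matches the 120000 ms in the error message.
    .set("spark.network.timeout", "300s")
    // Number of retries for a failed shuffle fetch (default 3).
    .set("spark.shuffle.io.maxRetries", "6")
    // Wait between fetch retries (default 5s).
    .set("spark.shuffle.io.retryWait", "10s")
}
```

The same keys can be passed on the command line with `--conf key=value`. Longer timeouts only hide the symptom if the underlying cause is an overloaded or dying executor, so this is a workaround to evaluate rather than a guaranteed fix.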

Container killed by YARN for exceeding memory limits.

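Cause: the amount of data processed in memory is too large and the executor runs out of memory.
Result: the current task fails.
Fix:

Raise the executor memory quota.

Increase spark.yarn.executor.memoryOverhead: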

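--conf spark.yarn.executor.memoryOverhead=4096
	(Note: the value is in MB. On Spark 2.3 and later the setting is named spark.executor.memoryOverhead, and the spark.yarn. name is deprecated.)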
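To see why raising the overhead helps, it can be useful to work out how much memory YARN is asked for per executor. The following is a minimal sketch, assuming the documented default rule that the overhead is 10% of executor memory with a floor of 384 MB when it is not set explicitly. The object and method names are hypothetical, and the 8 GB executor size is an assumed example.

```scala
// Hypothetical helper: estimates the per-executor memory requested from YARN.
object ContainerSize {
  val MinOverheadMb = 384L   // documented minimum overhead
  val OverheadFactor = 0.10  // documented default factor

  // Overhead Spark applies when memoryOverhead is not set explicitly.
  def defaultOverheadMb(executorMemoryMb: Long): Long =
    math.max((executorMemoryMb * OverheadFactor).toLong, MinOverheadMb)

  // Requested container memory = executor heap + overhead.
  def containerRequestMb(executorMemoryMb: Long,
                         overheadMb: Option[Long] = None): Long =
    executorMemoryMb + overheadMb.getOrElse(defaultOverheadMb(executorMemoryMb))

  def main(args: Array[String]): Unit = {
    val executorMemoryMb = 8192L // assumed: --executor-memory 8g
    println(containerRequestMb(executorMemoryMb))              // default overhead
    println(containerRequestMb(executorMemoryMb, Some(4096L))) // overhead set to 4096
  }
}
```

With the default rule the request is 8192 + 819 = 9011 MB; with the overhead set to 4096 it becomes 12288 MB. YARN kills the container when the process exceeds this limit, which is why off-heap usage beyond the overhead triggers the error even when the heap itself is not full.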

Container on host: xxx was preempted

https://blog.csdn.net/weixin_39750084/article/details/107637667
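Cause: some tasks use too much memory, and our YARN cluster uses the fair scheduler, so when new jobs arrive the containers running my tasks are preempted by those jobs.
Result: the task fails and is retried; after many failures a FetchFailedException is raised and the stage fails.
Fix (maybe): avoid running when resources are tight; drop unnecessary data or unnecessary joins/operations so that no single task handles too much data; change the YARN parameters to raise the preemption threshold.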

org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: com.google.gson.Gson
Serialization stack:......

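java.io.NotSerializableException is thrown when you try to serialize an object that is not serializable. In this example, the com.google.gson.Gson object is not serializable.
Locating the problem: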

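def executeDiffLayer(dataDf: DataFrame): DataFrame = {
    // Problem: gson is created on the driver and captured by the map closure
    val gson = new Gson()
    val changeDf = dataDf
      .map(row => {
        // "xxx" stands for the column name, which is missing in the original
        val value = row.getAs[mutable.WrappedArray[String]]("xxx")
          .map(gson.fromJson(_, classOf[xxx])).toList
        (value)
      }).toDF()
    changeDf
  }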

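In Spark, the function passed to map, together with every outside object it references, must be serialized so it can be sent over the network to the executors. The Gson object is not serializable, so a NotSerializableException is thrown at runtime.
Fix:
Create the Gson object inside the map function. Each time the map function runs on an executor it creates a new Gson object, so no Gson object has to be serialized and shipped.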
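The fix described above could look roughly like the following. This is a sketch under stated assumptions: the case class `Item` is a hypothetical stand-in for the elided `xxx` type, the JSON strings are assumed to be in column 0, and a `SparkSession` parameter is added only so that the implicit encoders can be imported.

```scala
import com.google.gson.Gson
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical target type standing in for the elided `xxx` class.
case class Item(id: String, name: String)

object DiffLayer {
  def executeDiffLayer(spark: SparkSession, dataDf: DataFrame): DataFrame = {
    import spark.implicits._ // encoders for the mapped result

    dataDf.map { row =>
      // Created on the executor, so it is never part of the serialized closure.
      val gson = new Gson()
      row.getSeq[String](0) // assumption: the JSON strings live in column 0
        .map(json => gson.fromJson(json, classOf[Item]))
        .toList
    }.toDF()
  }
}
```

This creates one Gson instance per row. If that proves costly, `mapPartitions` allows creating one instance per partition instead, while still keeping the object out of the closure.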


Original article: https://blog.csdn.net/Hanhahahahah/article/details/133161370