• Hive error: Too many bytes before newline: 2147483648


    The error

    Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
            at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:156)
            at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:82)
            at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
            at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
            at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
            at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
            at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:543)
            at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:189)
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
            ... 16 more
    Caused by: java.io.IOException: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
            at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
            at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:433)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
            ... 27 more
    Caused by: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
            at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
            at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
            at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:149)
            at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
            at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
            at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:430)
            ... 28 more
    ], TaskAttempt 3 failed, info=[Error: Error while running task ( failure ) : attempt_1667789273844_0742_1_00_000028_3:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:298)
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252)
            at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
            at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
            at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
            at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
            at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
            at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
            at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
            at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
            at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
            at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:156)
            at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:82)
            at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
            at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
            at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
            at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
            at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:543)
            at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:189)
            at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
            ... 16 more
    Caused by: java.io.IOException: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
            at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
            at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:433)
            at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
            ... 27 more
    Caused by: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
            at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
            at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
            at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:149)
            at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
            at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
            at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:430)
            ... 28 more

    Background

    A colleague was using Hive's OpenCSV SerDe to load CSV data (a setup I use all the time without any problems). He had also used it successfully a few times, and then one day he suddenly hit this error.
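
    For context, tables like this are usually declared with the OpenCSV SerDe. Here is a minimal sketch of such a DDL; the column list, SerDe properties, and location are my assumptions, not taken from the post:

        # hypothetical DDL for a table like the one in this post
        hive -e "
        CREATE EXTERNAL TABLE odsctdata.ods_ct_order_list_csv (order_id string, amount string)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '\"')
        STORED AS TEXTFILE
        LOCATION '/warehouse/odsctdata/ods_ct_order_list_csv';
        "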

    Symptoms

    select * from odsctdata.ods_ct_order_list_csv limit 10    -- OK

    select count(1), max(), min() from odsctdata.ods_ct_order_list_csv    -- fails

    The error, specifically: Too many bytes before newline: 2147483648

    When you see this error, what's your first instinct for where to start? I'm curious about your approaches; feel free to leave a comment.

    1. Could 2147483648 simply be too small a limit? As with memory, some Hive parameters default to fairly conservative values, and bumping them up avoids a lot of errors. But set -v turns up no property with this value, so it's probably not a configuration issue (a quick way to check is sketched below).
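
    A minimal way to run that check from the shell; the grep pipeline is my own addition:

        # dump every Hive/Hadoop property in effect and search for the value
        hive -e 'set -v' | grep -i '2147483648'
        # no matches -> the limit is not coming from a configurable property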

    2. "Too many bytes before newline" might be an error message defined in the Hive source. Let's find exactly where it's thrown; perhaps some property needs extra configuration, or something else is going on. Reading the source is always worthwhile.

    I downloaded the Hive source code and searched: no match.

    3. Read the stack trace itself:

    Caused by: java.io.IOException: Too many bytes before newline: 2147483648
            at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
            at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)

    It points at org.apache.hadoop.util.LineReader, so the message comes from Hadoop rather than Hive. Checking that source: it matches.

    Let's look at the method in detail.

    Roughly, it reads one line of data, tracks how many bytes it consumed, and checks that count against a limit.

    private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
        str.clear();
        int txtLength = 0;       // bytes appended to str so far
        int newlineLength = 0;   // stays 0 until a line terminator is seen
        boolean prevCharCR = false;
        long bytesConsumed = 0L; // total bytes read for this one line
        do {
            int startPosn = this.bufferPosn;
            if (this.bufferPosn >= this.bufferLength) {
                // buffer exhausted: refill from the underlying stream
                startPosn = this.bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed;
                }
                this.bufferLength = this.fillBuffer(this.in, this.buffer, prevCharCR);
                if (this.bufferLength <= 0) {
                    break; // EOF
                }
            }
            // scan the buffer for LF (10); a preceding CR (13) makes it "\r\n"
            while (this.bufferPosn < this.bufferLength) {
                if (this.buffer[this.bufferPosn] == 10) {
                    newlineLength = prevCharCR ? 2 : 1;
                    ++this.bufferPosn;
                    break;
                }
                if (prevCharCR) { // a lone CR also terminates the line
                    newlineLength = 1;
                    break;
                }
                prevCharCR = this.buffer[this.bufferPosn] == 13;
                ++this.bufferPosn;
            }
            int readLength = this.bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength;
            }
            bytesConsumed += (long)readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength; // cap at maxLineLength
            }
            if (appendLength > 0) {
                str.append(this.buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while (newlineLength == 0 && bytesConsumed < (long)maxBytesToConsume);
        // the check that fires here: one line must fit in Integer.MAX_VALUE bytes
        if (bytesConsumed > 2147483647L) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        } else {
            return (int)bytesConsumed;
        }
    }

    Next, let's look at the CSV file itself.

    Two commands tell the story here; experienced readers will already sense that something is off.
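
    They were presumably along these lines; the file name is a stand-in of my own:

        # line count vs. on-disk size
        wc -l ods_ct_order_list.csv     # ~410,000 lines
        du -sh ods_ct_order_list.csv    # 3.2G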

    About 410 thousand rows, yet a full 3.2 GB. Strange.

    Looking further, it gets even stranger: no single row holds much data, and the line count isn't large either, so how can the file be this big?

    With about 410k rows in total, I used split to cut the file into 5 chunks of 100k lines each (sketched below), suspecting that one particular row was enormous.
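
    A minimal version of that split, again with a stand-in file name:

        # cut the CSV into 100,000-line chunks and compare their sizes
        split -l 100000 ods_ct_order_list.csv part_
        ls -lh part_*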

    The last chunk was by far the largest. Something's wrong there. By now I suspected the final row, but I still needed proof.

    The total line count is 409256, so splitting into two files, 409255 lines plus 1, isolates the suspect (see the sketch below).
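
    Something like this does the trick (file names are stand-ins):

        # peel the last line off into its own file
        head -n 409255 ods_ct_order_list.csv > first_409255.csv
        tail -n 1      ods_ct_order_list.csv > last_line.csv
        ls -l last_line.csv    # one "line" of 3233377302 bytes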

    And there it is: the problem row is the last one.

    3233377302 > 2147483647L

    Now, about that number. At first glance 2147483647 looks like nothing special, seemingly unrelated to a programmer's favorite 1024.

    But an old-timer programmer recognizes it at a glance. Like me, haha.

    2147483647 + 1 = 2147483648 = 1024 * 1024 * 1024 * 2, i.e. 2 GiB; 2147483647 is Java's Integer.MAX_VALUE. So a single line of Hive text data cannot exceed 2 GB. That also explains the symptom: select * ... limit 10 only touches the first few rows and never reaches the broken final line, while count/max/min must scan every split.
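
    A quick sanity check of the arithmetic (getconf INT_MAX is available on typical Linux systems):

        echo $(( 1024 * 1024 * 1024 * 2 ))    # 2147483648
        getconf INT_MAX                       # 2147483647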

     

    And with that, the problem was solved. Just as I finished, my colleague mentioned that when he exported the CSV, a count(1) showed about 9 million rows... yet the file has only 409256 lines, so the export itself had mangled the newlines all along. What was I even busy with...

     

    If this post helped you, a like is the best support you can give.

  • Original post: https://blog.csdn.net/cclovezbf/article/details/128092934