• llama factory: training fails with large datasets


    06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
    06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
    Traceback (most recent call last):
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
        for _, table in generator:
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 122, in _generate_tables
        io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
      File "pyarrow/_json.pyx", line 52, in pyarrow._json.ReadOptions.__init__
      File "pyarrow/_json.pyx", line 77, in pyarrow._json.ReadOptions.block_size.__set__
    OverflowError: value too large to convert to int32_t
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 14, in <module>
        main()
      File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 5, in main
        run_exp()
      File "/home/ca2/LLaMA-Factory/src/llmtuner/train/tuner.py", line 29, in run_exp
        run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
      File "/home/ca2/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 28, in run_pt
        dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="pt")
      File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 162, in get_dataset
        all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args))
      File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 92, in load_single_dataset
        dataset = load_dataset(
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
        builder_instance.download_and_prepare(
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
        self._download_and_prepare(
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
        self._prepare_split(split_generator, **prepare_split_kwargs)
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
        for job_id, done, content in self._prepare_split_single(
      File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
        raise DatasetGenerationError("An error occurred while generating the dataset") from e
    datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
    

    Judging from your error messages, the failure happens while loading and preparing the dataset, and involves two main errors:

    1. Checksum failed:

      06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.

      This warning says that dataset_info.json is missing the SHA-1 hash value for the dataset file. It is only a warning, but it hints that the dataset's metadata file may be incomplete or has been tampered with.

    2. OverflowError:

      OverflowError: value too large to convert to int32_t

      This error means that the block of the dataset file being read is larger than the maximum value an int32_t can represent. It usually happens when the block size used to read the dataset file is set, or grows, too large.

    Together, these errors cause dataset generation to fail and raise DatasetGenerationError. The overflow itself comes from pyarrow, which stores the JSON reader's block_size as a 32-bit integer; it can be reproduced in isolation, as sketched below.
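
    As a quick illustration (this snippet is not from the original post): pyarrow rejects any block_size above the int32_t maximum of 2**31 - 1, which matches the last frame of the traceback.

    import pyarrow.json as paj

    # block_size is an int32_t inside pyarrow; 2**31 is one past the largest
    # representable value, so this line raises the same error as the traceback:
    # OverflowError: value too large to convert to int32_t
    paj.ReadOptions(block_size=2**31)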

    Solutions:

    1. Check the dataset files:

      • Make sure dataset_info.json is complete and contains the SHA-1 hash value of your dataset file (a sketch for computing it follows this list).
      • If the dataset was downloaded, try downloading it again or obtaining it from a reliable source.
    2. Adjust the dataset loading options:

      • Change the block-size setting used when loading the data. You may need to locate and adjust the pyarrow read option block_size in your code.

      • Try a smaller block_size to avoid the overflow error. For example:

        import pyarrow.json as paj

        read_options = paj.ReadOptions(block_size=65536)  # 64 KB; adjust as needed


        Then use these options in the code that loads the dataset.

    3. Update your libraries:

      • Make sure the datasets and pyarrow libraries are up to date; some bugs may already be fixed in newer versions.
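
    For step 1, the missing hash can be computed with hashlib. This is a sketch under the assumption that, in this version of LLaMA-Factory, the value belongs in a "file_sha1" field of the dataset's entry in dataset_info.json:

    import hashlib

    # Compute the SHA-1 digest of the dataset file and paste it into
    # dataset_info.json (assumed key name: "file_sha1") to silence the warning.
    with open("data/my_dataset.json", "rb") as f:  # hypothetical path
        print(hashlib.sha1(f.read()).hexdigest())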

    Code example:

    If you are using the datasets library, note that load_dataset() does not accept a pyarrow read_options argument; for JSON files, the closest supported knob is the json builder's chunksize option, which controls how many bytes are read and parsed per batch:

    from datasets import load_dataset

    # Read the file in smaller batches; the json builder derives the pyarrow
    # block_size from chunksize internally.
    dataset = load_dataset(
        "json",
        data_files="your_dataset.json",  # placeholder path
        split="train",
        chunksize=10 << 20,  # 10 MB per batch; lower it if needed
    )


    If the error persists, verify the integrity of the dataset files, or contact the dataset provider for help.

    Final solution: converting the data from JSON format to JSONL format fixed the problem. A conversion sketch follows.
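
    A minimal conversion sketch (file names are placeholders; it assumes the source file holds a single JSON array of records that fits in memory). With JSONL, the json builder reads the file line by line, so no single batch can outgrow the int32_t limit:

    import json

    # Read the original file: one big JSON array of records.
    with open("data.json", "r", encoding="utf-8") as fin:
        records = json.load(fin)

    # Write one record per line (JSONL). ensure_ascii=False keeps
    # non-ASCII text (e.g. Chinese) readable in the output file.
    with open("data.jsonl", "w", encoding="utf-8") as fout:
        for record in records:
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")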

  • Original post: https://blog.csdn.net/weixin_41046245/article/details/139600468