06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
Traceback (most recent call last):
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 122, in _generate_tables
    io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
  File "pyarrow/_json.pyx", line 52, in pyarrow._json.ReadOptions.__init__
  File "pyarrow/_json.pyx", line 77, in pyarrow._json.ReadOptions.block_size.__set__
OverflowError: value too large to convert to int32_t

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/ca2/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/tuner.py", line 29, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/home/ca2/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 28, in run_pt
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="pt")
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 162, in get_dataset
    all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args))
  File "/home/ca2/LLaMA-Factory/src/llmtuner/data/loader.py", line 92, in load_single_dataset
    dataset = load_dataset(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ca2/anaconda3/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Judging from the error output, the failure happens while loading and preparing the dataset, and involves two main errors:

Checksum failed:

06/11/2024 07:09:50 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.

This warning says that dataset_info.json is missing the SHA-1 hash value for the dataset file. It is only a warning, but it suggests that the dataset's metadata file is incomplete or that the data file was changed after the metadata was written.

OverflowError:

OverflowError: value too large to convert to int32_t

This error means the block size used to read the dataset file exceeded the largest value an int32_t can hold. It usually comes from an unsuitable block-size setting, or from records too large to fit in any block (for example, a whole dataset written as a single JSON array on one line), which makes the reader keep growing its block size until it overflows.

These two problems together cause dataset generation to fail with a DatasetGenerationError.
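For illustration, the overflow itself is easy to reproduce: pyarrow stores block_size as a 32-bit integer, so anything above 2**31 - 1 is rejected:

import pyarrow.json as paj

# block_size is stored as an int32 internally, so 2**31 overflows;
# this raises the same OverflowError shown in the traceback above.
paj.ReadOptions(block_size=2**31)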
Check the dataset files: make sure dataset_info.json is complete and contains the SHA-1 hash value for each data file (a short hashing sketch appears further below).

Adjust the dataset loading options: change the block size used when the data is read. You may need to locate and tune pyarrow's read option block_size in your code.
Try setting block_size to a smaller value to avoid the overflow error, for example:

import pyarrow.json as paj

read_options = paj.ReadOptions(block_size=65536)  # 64 KB; adjust as needed

and pass these options to the code that reads the dataset file.
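A ReadOptions object does nothing by itself; it has to be handed to the reader. Here is a minimal sketch of reading a file directly with pyarrow, assuming a hypothetical file name (note that pyarrow's JSON reader only accepts newline-delimited JSON):

import pyarrow.json as paj

read_options = paj.ReadOptions(block_size=65536)  # 64 KB blocks
# read_json expects newline-delimited JSON: one record per line.
table = paj.read_json("my_dataset.jsonl", read_options=read_options)
print(table.num_rows)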
Update the library versions: make sure the datasets and pyarrow libraries are up to date, since some bugs may already be fixed in newer releases. Also note that load_dataset does not accept a pyarrow ReadOptions object; if you load the file through the datasets library, its JSON builder takes a chunksize keyword (bytes read per batch) that you can lower instead:
from datasets import load_dataset

# `chunksize` is forwarded to the JSON builder config; lowering it reduces
# how much of the file is read per batch.
dataset = load_dataset(
    "json",
    data_files="my_dataset.json",  # replace with your data file
    split="train",
    chunksize=10 << 20,  # 10 MB (the builder's default); reduce as needed
)
If the error persists, verify the integrity of the dataset files or contact the dataset provider for help.
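For the checksum warning specifically, here is a minimal hashing sketch. It assumes your LLaMA-Factory version reads a file_sha1 field from dataset_info.json (check the schema your version expects) and uses a hypothetical file path:

import hashlib

data_path = "data/my_dataset.json"  # hypothetical path to your data file

# Hash the file in 1 MB chunks to avoid loading it all into memory.
sha1 = hashlib.sha1()
with open(data_path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha1.update(chunk)

print(sha1.hexdigest())
# Paste the digest into dataset_info.json, e.g.:
# "my_dataset": {"file_name": "my_dataset.json", "file_sha1": "<hexdigest>"}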
The final fix: convert the data from JSON format to JSONL format. With one record per line, pyarrow never needs to grow its read block to span the whole file, so the overflow goes away.
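A minimal conversion sketch, assuming the original file is a single top-level JSON array (file names are hypothetical):

import json

src, dst = "my_dataset.json", "my_dataset.jsonl"  # hypothetical paths

with open(src, encoding="utf-8") as f:
    records = json.load(f)  # expects a top-level JSON array

with open(dst, "w", encoding="utf-8") as f:
    for record in records:
        # One compact JSON object per line: the JSONL format.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

Remember to point the file_name entry in dataset_info.json at the new .jsonl file (and update its hash) afterwards.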