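Preface: I have recently been working on a text generation task and needed evaluation metrics such as BLEU. Many other research projects use the nlg-eval library from GitHub, so I wanted to use it too. The installation did not go smoothly, however, so I am recording the process in this post in the hope that it saves later readers a few detours.
Note: this post is for users who ① are on Linux and ② cannot download the required files with the command nlg-eval --setup because of network restrictions.
1.1 Download the Java SE package from the Oracle website. Note that the GitHub README requires version 1.8.0 or later. The package I downloaded is jdk-17_linux-x64_bin.tar.gz.
My Linux system is Ubuntu, and I followed a tutorial on installing Java on Ubuntu to install Java and configure the environment variables. On CentOS or other distributions, look up the corresponding installation method.
1.2 First copy the package to the /opt directory, then change into that directory, create a java directory, and change its ownership (replace user with your own user name):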
cd /opt
sudo mkdir java
sudo chown user java
sudo chgrp user java
1.3 Extract the jdk-17_linux-x64_bin.tar.gz package you just downloaded into the newly created /opt/java/ directory:
tar -zxvf jdk-17_linux-x64_bin.tar.gz -C /opt/java/
1.4 Configure the environment variables by appending the following lines to /etc/profile:
#set java environment
export JAVA_HOME=/opt/java/jdk-17.0.5
export PATH=${JAVA_HOME}/bin:${PATH}
1.5 Activate the Java environment:
source /etc/profile
1.6 Verify that Java was installed successfully:
java -version
If the Java steps are skipped, nlg-eval fails with:
FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
or with:
AttributeError: 'Meteor' object has no attribute 'meteor_p'
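For anyone who prefers to run the Java check from Python, here is a minimal sketch. The function names java_major_version and check_java are my own and are not part of nlg-eval, and the sketch assumes that java -version prints a line containing version "..." (it treats the legacy 1.8 numbering as major version 8).

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_251" -> 8) and the modern one
    ("17.0.5" -> 17). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy releases are reported as 1.x, where x is the major version.
    if first == 1 and second is not None:
        return int(second)
    return first


def check_java(minimum=8):
    """Return True if a `java` executable on PATH reports version >= minimum."""
    java = shutil.which("java")
    if java is None:
        return False
    # `java -version` writes to stderr, so merge stderr into stdout.
    result = subprocess.run([java, "-version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    major = java_major_version(result.stdout)
    return major is not None and major >= minimum
```

Calling check_java() from the same environment that will run nlg-eval shows whether that environment can actually see Java, which is not always the same as what an interactive terminal sees.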
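Addendum: when I later ran nlg-eval from a new script, it still failed because the Java environment could not be found. Following another blog post, I found that the problem lies in which file steps 1.4 and 1.5 write the Java environment variables to and activate. We wrote the variables to /etc/profile and sourced that file, but /etc/profile is only read once, at login, so a shell that was not started through a fresh login (and any script launched from it) does not see the Java settings. The variables therefore need to go into a file that is read every time a new shell starts. The once-and-for-all solution is:
1.4 Append the following lines to /root/.bashrc:
export JAVA_HOME=/opt/java/jdk-17.0.5/
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export JAVA_PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin
export PATH=$PATH:${JAVA_PATH}
1.5 Activate the Java environment:
source /root/.bashrc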
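If a script still cannot find Java (for example because it is launched from a shell that reads neither file), one workaround is to set the variables inside the Python process itself. The sketch below is only an illustration: with_java_env is a hypothetical helper of my own, and the JDK path is the one used in this post.

```python
import os


def with_java_env(env, java_home):
    """Return a copy of env with JAVA_HOME set and its bin directory first on PATH."""
    new_env = dict(env)
    java_bin = os.path.join(java_home, "bin")
    new_env["JAVA_HOME"] = java_home
    old_path = new_env.get("PATH", "")
    # Drop empty entries and any earlier copy of java_bin to avoid duplicates.
    parts = [p for p in old_path.split(os.pathsep) if p and p != java_bin]
    new_env["PATH"] = os.pathsep.join([java_bin] + parts)
    return new_env


# Usage inside a script, before nlgeval is imported (path is an assumption):
# os.environ.update(with_java_env(os.environ, "/opt/java/jdk-17.0.5"))
```

Child processes started from Python inherit os.environ, so a java process launched afterwards should be resolved through the updated PATH.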
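/etc/profile and /root/.bashrc are both shell startup files that can be used to customize the environment. The difference is that /etc/profile is a system-wide file that is read only at login, which explains why a newly opened shell did not recognize the Java environment, whereas /root/.bashrc belongs to the root user, can set paths, command aliases and so on, and is read each time a new interactive bash shell starts.
2.1 Because of network problems, the command pip install git+https://github.com/Maluuba/nlg-eval.git@master could not download the package. I therefore downloaded the nlg-eval-master archive from GitHub manually and copied it to the Linux server. My path is /root/nlg-eval/ (I renamed nlg-eval-master to nlg-eval). This path is used again later, so take note of where you put your own copy.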
2.2 After downloading the package, change into the nlg-eval folder with cd /root/nlg-eval/ and run python setup.py install.
This command may fail with the following error:
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/nlg-eval to 755
copying nlg_eval.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlg_eval.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlg_eval.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlg_eval.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying nlg_eval.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
Traceback (most recent call last):
File "setup.py", line 29, in <module>
setup(name='nlg-eval',
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/dist.py", line 1244, in run_command
super().run_command(command)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/install.py", line 80, in run
self.do_egg_install()
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/install.py", line 129, in do_egg_install
self.run_command('bdist_egg')
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/dist.py", line 1244, in run_command
super().run_command(command)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 212, in run
os.path.join(archive_root, 'EGG-INFO'), self.zip_safe()
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 265, in zip_safe
return analyze_egg(self.bdist_dir, self.stubs)
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 339, in analyze_egg
safe = scan_module(egg_dir, base, name, stubs) and safe
File "/root/anaconda3/envs/torchNew/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 376, in scan_module
code = marshal.load(f)
ValueError: bad marshal data (unknown type code)
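Following another article, the fix is to run find . -name "*.pyc" -exec rm -f {} \; in the terminal to delete the stale .pyc files, and then run python setup.py install again.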
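The same cleanup can also be done from Python, which is handy when find is not available. This is a minimal sketch of the idea; remove_pyc is my own helper name, and it deletes every *.pyc file below the given directory, so point it only at the nlg-eval source folder.

```python
from pathlib import Path


def remove_pyc(root):
    """Delete every *.pyc file under root and return the paths that were removed."""
    removed = []
    for pyc in sorted(Path(root).rglob("*.pyc")):
        if pyc.is_file():
            pyc.unlink()
            removed.append(pyc)
    return removed


# Example (path taken from this post):
# remove_pyc("/root/nlg-eval")
```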
Note ⚠️: if this step is skipped (python setup.py install is never run), the following error occurs:
nlg-eval command not found
2.3 Run conda list nlg-eval for a first check that nlg-eval is now present in the current environment:
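A similar check can be made from inside Python. The sketch below assumes Python 3.8 or later (importlib.metadata is not available in older versions) and uses the distribution name nlg-eval from setup.py; installed_version is my own helper name, and depending on how the package was installed the lookup may not see every layout.

```python
from importlib import metadata  # available from Python 3.8


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Example: prints a version string if nlg-eval is visible to this interpreter.
# print(installed_version("nlg-eval"))
```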
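3.1 Because of the network problem, download all the required files manually to a local machine. The list of files can be found in /root/nlg-eval/bin/nlg-eval inside the package just downloaded. There are 10 files in total, about 7 GB, and I confirmed that all of them can be downloaded manually. The download URLs are:
word2vec (2 files)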
https://raw.githubusercontent.com/robmsmt/glove-gensim/4c2224bccd61627b76c50a5e1d6afd1c82699d22/glove2word2vec.py
http://nlp.stanford.edu/data/glove.6B.zip
Skip-thoughts data (7 files)
http://www.cs.toronto.edu/~rkiros/models/dictionary.txt
http://www.cs.toronto.edu/~rkiros/models/utable.npy
http://www.cs.toronto.edu/~rkiros/models/btable.npy
http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz
http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz.pkl
http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz
http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz.pkl
multi-bleu.perl (1 file)
https://raw.githubusercontent.com/moses-smt/mosesdecoder/b199e654df2a26ea58f234cbb642e89d9c1f269d/scripts/generic/multi-bleu.perl
3.2 Upload the downloaded files to the server. The target path for each file is as follows (replace /root/ with the corresponding path on your own server):
glove2word2vec.py: /root/nlg-eval/nlgeval/word2vec/
multi-bleu.perl: /root/nlg-eval/nlgeval/multibleu/
glove.6B.zip and the 7 Skip-thoughts files: /root/.cache/nlgeval/
Running nlg-eval --setup directly in /root/nlg-eval/ to download these files fails because of the network, for example:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /moses-smt/mosesdecoder/b199e654df2a26ea58f234cbb642e89d9c/scripts/generic/multi-bleu.perl (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
If this step is skipped or the files are saved to the wrong path, errors such as the following occur:
FileNotFoundError: [Errno 2] No such file or directory: '/root/nlg-eval/uni_skip.npz.pkl'
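Before moving on, it can save time to check that all 10 files landed where they are expected. The sketch below is a rough check under my own assumptions: the file names are taken from the download URLs above, the script and Perl file are assumed to live under nlgeval/word2vec/ and nlgeval/multibleu/ in the source folder, and everything else is assumed to sit in the data directory. The names expected_files and missing_files are mine.

```python
import os

# File names taken from the Skip-thoughts download URLs listed in step 3.1.
SKIP_THOUGHTS = [
    "dictionary.txt", "utable.npy", "btable.npy",
    "uni_skip.npz", "uni_skip.npz.pkl",
    "bi_skip.npz", "bi_skip.npz.pkl",
]


def expected_files(code_dir, data_dir):
    """List the 10 files that were downloaded by hand (layout is an assumption)."""
    files = [
        os.path.join(code_dir, "nlgeval", "word2vec", "glove2word2vec.py"),
        os.path.join(code_dir, "nlgeval", "multibleu", "multi-bleu.perl"),
        os.path.join(data_dir, "glove.6B.zip"),
    ]
    files += [os.path.join(data_dir, name) for name in SKIP_THOUGHTS]
    return files


def missing_files(code_dir, data_dir):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_files(code_dir, data_dir) if not os.path.isfile(p)]


# Example with the paths used in this post:
# print(missing_files("/root/nlg-eval", "/root/.cache/nlgeval"))
```

An empty list means every file is in place; anything printed is a file that still has to be uploaded or moved.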
At this point all the downloaded files are on the server, but three more word2vec-related files still have to be generated. The relevant part of /root/nlg-eval/bin/nlg-eval is:
from nlgeval.word2vec.generate_w2v_files import generate
with ZipFile(os.path.join(data_path, 'glove.6B.zip')) as z:
    z.extract('glove.6B.300d.txt', data_path)
generate(data_path)
Here data_path is the directory containing glove.6B.zip, which is /root/.cache/nlgeval/ in my case. Alternatively, extract glove.6B.zip beforehand with the unzip command and run only the following:
from nlgeval.word2vec.generate_w2v_files import generate
data_path = "/root/.cache/nlgeval/"  # directory containing the extracted glove.6B.300d.txt
generate(data_path)
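Running this generates the three word2vec-related files in data_path (they were marked with a red box in the original screenshot).
If this step is skipped, the following error occurs:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/nlgeval/glove.6B.300d.model.bin'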
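That completes the preparation of all the files.
To save download time, here is a link (Quark netdisk) to all the files needed for steps 3 and 4. Take what you need.
Link: https://pan.quark.cn/s/449a7fc79f17
Extraction code: gKJ9
After downloading, just put the files in the corresponding paths.
Because the downloaded files are stored in /root/.cache/nlgeval/, the directory that nlg-eval uses when loading its data has to be changed. To do so, find nlg_eval-2.3-py3.6.egg/nlgeval/utils.py in your current environment. My full path is /root/anaconda3/envs/torch/lib/python3.6/site-packages/nlg_eval-2.3-py3.6.egg/nlgeval/utils.py. The file contains the following code: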
def get_data_dir():
    if os.environ.get('NLGEVAL_DATA'):
        if not os.path.exists(os.environ.get('NLGEVAL_DATA')):
            click.secho("NLGEVAL_DATA variable is set but points to non-existent path.", fg='red', err=True)
            raise InvalidDataDirException()
        return os.environ.get('NLGEVAL_DATA')
    else:
        try:
            cfg_file = os.path.join(XDG_CONFIG_HOME, 'nlgeval', 'rc.json')
            with open(cfg_file, 'rt') as f:
                rc = json.load(f)
                if not os.path.exists(rc['data_path']):
                    click.secho("Data path found in {} does not exist: {} " % (cfg_file, rc['data_path']), fg='red', err=True)
                    click.secho("Run `nlg-eval --setup DATA_DIR' to download or set $NLGEVAL_DATA to an existing location",
                                fg='red', err=True)
                    raise InvalidDataDirException()
                return rc['data_path']
        except:
            click.secho("Could not determine location of data.", fg='red', err=True)
            click.secho("Run `nlg-eval --setup DATA_DIR' to download or set $NLGEVAL_DATA to an existing location", fg='red',
                        err=True)
            raise InvalidDataDirException()
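You can simply rewrite get_data_dir(): replace the code above with the following and save the file.
def get_data_dir():
    data_path = "/root/.cache/nlgeval/"
    return data_path
(Another solution is to set the NLGEVAL_DATA environment variable. For that approach, see "Linux安装NLG-Eval" in the references at the bottom.)
If this step is skipped, the following error occurs:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.config/nlgeval/rc.json'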
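Judging from the get_data_dir() code quoted above, there are two ways to avoid editing the installed utils.py: set NLGEVAL_DATA, or create the rc.json file with a data_path key. The sketch below illustrates both. The helper names use_data_dir and write_rc_json are my own, and the config directory is passed in explicitly because I am only assuming that XDG_CONFIG_HOME resolves to /root/.config, as the error message suggests.

```python
import json
import os


def use_data_dir(data_path, env=os.environ):
    """Point NLGEVAL_DATA at data_path, failing early if the directory is missing."""
    if not os.path.isdir(data_path):
        raise FileNotFoundError(data_path)
    env["NLGEVAL_DATA"] = data_path
    return data_path


def write_rc_json(data_path, config_home):
    """Write <config_home>/nlgeval/rc.json with the data_path key that get_data_dir() reads."""
    cfg_dir = os.path.join(config_home, "nlgeval")
    os.makedirs(cfg_dir, exist_ok=True)
    cfg_file = os.path.join(cfg_dir, "rc.json")
    with open(cfg_file, "wt") as f:
        json.dump({"data_path": data_path}, f)
    return cfg_file


# Example, run before nlgeval is imported (paths are the ones from this post):
# use_data_dir("/root/.cache/nlgeval/")
# write_rc_json("/root/.cache/nlgeval/", "/root/.config")
```

Either call on its own should be enough. The environment variable only lasts for the current process, while rc.json persists across runs.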
Run the following code:
from nlgeval import NLGEval

hyp = ['this puppy is so cute!']  # ,'He is such a cutie!','I also want one dog like this!'
ref1 = ['It is such a cutie!', 'What breed is this dog?', 'Where can I get one puppy like this?', 'What kind of dog it is?', 'Where can I get such a cutie?', 'Look at his adorable tail!']
# Wrap each reference in its own list: six references for the single hypothesis.
lis = [[r] for r in ref1]
nlgeval_ = NLGEval()
ans = nlgeval_.compute_metrics(hyp_list=hyp, ref_list=lis)
print(ans)
The result comes back without errors:
{'Bleu_1': 0.5999999997600003, 'Bleu_2': 1.224744870871073e-08, 'Bleu_3': 3.684031496941645e-11, 'Bleu_4': 2.2360679763351715e-12, 'METEOR': 0.1333333333333333, 'ROUGE_L': 0.2, 'CIDEr': 0.0, 'SkipThoughtCS': 0.8338306, 'EmbeddingAverageCosineSimilarity': 0.859627, 'EmbeddingAverageCosineSimilairty': 0.859627, 'VectorExtremaCosineSimilarity': 0.704425, 'GreedyMatchingScore': 0.690544}
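In the test above, each inner list passed as ref_list holds one reference for the single hypothesis. When there are several hypotheses, it is often more natural to keep the references grouped per hypothesis and transpose them afterwards. The helper below is a small sketch of that transposition. to_ref_streams is my own name, and the sketch assumes every hypothesis has the same number of references and that ref_list[k][i] is meant to be the k-th reference of hypothesis i, as in the example above.

```python
def to_ref_streams(refs_per_hyp):
    """Transpose per-hypothesis reference lists into aligned reference streams.

    refs_per_hyp[i] holds all references for hypothesis i; the result's
    k-th inner list holds the k-th reference of every hypothesis.
    """
    counts = {len(refs) for refs in refs_per_hyp}
    if len(counts) > 1:
        raise ValueError("every hypothesis needs the same number of references")
    return [list(stream) for stream in zip(*refs_per_hyp)]


# For the single hypothesis in this post, to_ref_streams([ref1]) gives the
# same structure as [[r] for r in ref1].
```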
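Note: running the test program may print the following warning:
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
To get rid of this warning, run the following command as root:
apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran
Final words: some pitfalls only stick once you have stepped in them yourself. A classmate installed this for me once before, but he has since graduated, so this time I had to do it on my own. I hope everyone gets it installed on the first try.
References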