Testing word segmentation with HIT's LTP and adding a custom dictionary

1. Download the source from GitHub

https://github.com/HIT-SCIR/ltp
Then install it.

2. Copy the sample code and run it

Copy the sample code from the quick-start guide:
https://github.com/HIT-SCIR/ltp/blob/master/docs/quickstart.rst

from ltp import LTP

ltp = LTP()

segment, _ = ltp.seg(["他叫汤姆去拿外衣。"])
# [['他', '叫', '汤姆', '去', '拿', '外衣', '。']]

Running it as-is raised an error.

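Modify the code slightly by adding two lines at the top. They put the parent of the script's directory on sys.path, so that the ltp package from the downloaded source can be found: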

import sys,os
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP

ltp = LTP()

segment, _ = ltp.seg(["他叫汤姆去拿外衣。"])
# [['他', '叫', '汤姆', '去', '拿', '外衣', '。']]

print(segment)
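
To see what the path expression in those two lines evaluates to, here is a small sketch. The directory layout (a script at ltp-master/tests/demo.py) is a hypothetical example and is not taken from the post.

```python
import os

def parent_of_script_dir(script_path):
    """Return the absolute path one level above the folder that holds the script."""
    return os.path.abspath(os.path.dirname(script_path) + '/' + '..')

# Hypothetical layout: the script lives in <root>/work/ltp-master/tests/demo.py
example_script = os.path.join(os.sep, "work", "ltp-master", "tests", "demo.py")

# The result is the ltp-master folder, i.e. the directory that would be added to sys.path
print(parent_of_script_dir(example_script))
```

Inside a real script, `__file__` takes the place of `example_script`.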


While it runs, install whichever packages it reports as missing. In my case these were missing:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple packaging
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers --user
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy==1.17.3
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pygtrie
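
Instead of finding missing packages one error at a time, they can be checked up front. This is a minimal sketch using the standard library; the helper name `missing_modules` is made up for illustration, and the list is just the four packages named above.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The four packages that were missing in this post
required = ["packaging", "transformers", "numpy", "pygtrie"]
print(missing_modules(required))
```

Anything printed in the list still needs a `pip install`.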

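It then ran successfully. The first run waits for the pretrained model to download, after which the test produced the expected result: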

Ignored unknown kwarg option direction
[['他', '叫', '汤姆', '去', '拿', '外衣', '。']]

3. Next step: adding a custom dictionary

(1) Adding words in the script

The custom word added here is arbitrary and only serves as a test.

import sys,os
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP

ltp = LTP()

# Custom words can also be added directly in code
ltp.add_words(words=["叫汤姆去"], max_window=4)

segment, _ = ltp.seg(["他叫汤姆去拿外衣。"])
print(segment)
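
The max_window parameter is the window of a forward maximum matching pass. The toy function below illustrates that idea only. It is not LTP's implementation: it has no model, so anything not in the dictionary falls back to single characters. The function name is made up for illustration.

```python
def forward_max_match(text, words, max_window=4):
    """Greedy left-to-right matching: at each position take the longest dictionary word
    of at most max_window characters, otherwise emit one character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking the window down to 2 characters
        for size in range(min(max_window, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in words:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here, so emit a single character
            tokens.append(text[i])
            i += 1
    return tokens

print(forward_max_match("他叫汤姆去拿外衣。", {"叫汤姆去"}, max_window=4))
```

A word longer than max_window can never match in this sketch, which is why the window should be at least as long as the longest custom word.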


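(2) Adding words from a file

I put the 'user_dict.txt' file in a dict folder under the project root, which is the path the script builds with os.path.join(root_path,"dict",'user_dict.txt').

Remember to save the text file with UTF-8 encoding.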

import sys,os
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP

ltp = LTP()
root_path=os.path.abspath(os.path.dirname(__file__) + '/' + '..')
# user_dict.txt is the dictionary file; max_window is the maximum forward-matching window
ltp.init_dict(path=os.path.join(root_path,"dict",'user_dict.txt'), max_window=4)
segment, _ = ltp.seg(["他叫汤姆去拿外衣。"])
print(segment)


When the UTF-8 encoded file is read, the first entry comes out garbled because it is prefixed with \ufeff, the byte order mark.

Reference: "Causes of the \ufeff problem and how to fix it"
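
The effect is easy to reproduce outside LTP. The sketch below writes a one-word dictionary file with a byte order mark, then reads it back with two encodings. The one-word-per-line layout and the `load_words` helper are assumptions for illustration.

```python
import os
import tempfile

def load_words(path, encoding):
    """Read one word per line, skipping empty lines (assumed file layout)."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user_dict.txt")
    # "utf-8-sig" writes a BOM at the start of the file, as some Windows editors do
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write("叫汤姆去\n")
    plain = load_words(path, "utf-8")      # the BOM stays glued to the first word
    clean = load_words(path, "utf-8-sig")  # the BOM is stripped on read

print(repr(plain[0]))
print(repr(clean[0]))
```

Reading with "utf-8-sig" is safe either way, since it also accepts files that have no BOM.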

I made a small change in one place in the code (".\ltp-master\ltp\algorithms\maximum_forward_matching.py"), and after that it worked.
The test then succeeded:

Ignored unknown kwarg option direction
[['他', '叫汤姆去', '拿', '外衣', '。']]

4. Changing the model

The main change is the path argument:
ltp = LTP(path = "base")
The downloaded model is cached at roughly
"C:\Users\LYF\.cache\torch\ltp\8909177e47aa4daf900c569b86053ac68838d09da28c7bbeb42b8efcb08f56aa-edb9303f86310d4bcfd1ac0fa20a744c9a7e13ee515fe3cf88ad31921ed616b2-extracted\ltp.model"

import sys,os,time

sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP
root_path=os.path.abspath(os.path.dirname(__file__) + '/' + '..')

ltp = LTP(path = "base")

# user_dict.txt is the dictionary file; max_window is the maximum forward-matching window
# ltp.init_dict(path=os.path.join(root_path,"dict",'user_dict.txt'), max_window=4)
# ltp.add_words(words=["\n"], max_window=4)

# Custom words can also be added directly in code
# ltp.add_words(words=["叫汤姆去"], max_window=4)

url = "tests/zrbzdz.txt"
t1 = time.time()

with open(url,"r",encoding='utf-8-sig') as f:
    contents = f.read()
segment, _ = ltp.seg([contents])

output="/ ".join(segment[0])
# print(segment)
t2 = time.time()-t1

# Path of the output file for the segmented text
with open("tests/output/1_LTP.txt","wb") as LTP_f:
    LTP_f.write(output.encode('utf-8'))

print('time ' + str(t2))
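
When comparing models, the timing can be wrapped in a small helper so every run is measured the same way. This is a sketch only: the `timed` helper is made up for illustration, and it uses time.perf_counter, which is intended for measuring elapsed intervals, in place of time.time.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; with LTP this could be timed(ltp.seg, [contents])
total, elapsed = timed(sum, range(1000))
print(total, 'time ' + str(elapsed))
```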



Testing on my own data, model "base1" gave better results than "base", and model "base2" gave better results than "base1".
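
The post does not say how "better" was judged. If a hand-segmented reference is available (an assumption, as none is shown here), one common way to put a number on it is word-level precision, recall and F1 over character spans. The sketch below uses the two segmentations from earlier in the post purely as sample input; the function names are made up for illustration.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans

def segmentation_f1(gold_tokens, pred_tokens):
    """Return (precision, recall, f1), counting a word as correct only if both boundaries match."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Sample input only: the first list is treated as the reference here
gold = ['他', '叫', '汤姆', '去', '拿', '外衣', '。']
pred = ['他', '叫汤姆去', '拿', '外衣', '。']
print(segmentation_f1(gold, pred))
```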

5. Batch-processing file data

import sys,os,time
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP
root_path=os.path.abspath(os.path.dirname(__file__) + '/' + '..')

ltp = LTP(path = "base")

url = "tests/zrbzdz.txt"
t1 = time.time()

lines = []
count=0
output=[]
with open(url,"r",encoding='utf-8-sig') as f:
    for line in f:
        line = line.strip()
        lines.append(line)
        count+=1
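        if count%2000==0:
            output.extend(ltp.seg(lines)[0])
            lines = []
# Segment the final partial batch (fewer than 2000 lines), which would otherwise be dropped
if lines:
    output.extend(ltp.seg(lines)[0])
# Path of the output file for the segmented text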
LTP_f = open("tests/output/1_LTP.txt","w",encoding='utf-8-sig')

str1='/ '
for out in output:
    LTP_f.write(str1.join(out)+'\n')
LTP_f.close()

tt =  time.time()-t1
print('time ' + str(tt))
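
The batching logic can also be pulled into a small generator that always yields the last, shorter batch. This is a sketch under assumptions: `batched`, `segment_in_batches` and the stand-in `fake_seg` are illustrative names, and `fake_seg` simply splits into characters so the example runs without a model.

```python
def batched(items, size):
    """Yield consecutive lists of at most `size` items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def segment_in_batches(lines, seg_fn, batch_size=2000):
    """Apply seg_fn to each batch and collect one token list per input line."""
    output = []
    for batch in batched(lines, batch_size):
        output.extend(seg_fn(batch))
    return output

def fake_seg(sentences):
    # Stand-in for the real segmenter: one token per character
    return [list(s) for s in sentences]

lines = ["他叫汤姆", "去拿", "外衣"]
print(segment_in_batches(lines, fake_seg, batch_size=2))
```

With LTP, the stand-in would be replaced by something like `lambda batch: ltp.seg(batch)[0]`, matching the call used in the script above.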

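6. Changing the model path to a custom path

Copy the model from the default cache location (see section 4) to a custom directory. The folder also has to be renamed to match the name used in the script, and the .json file has to be renamed to config.json.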

import sys,os,time
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))

from ltp import LTP
root_path=os.path.abspath(os.path.dirname(__file__) + '/' + '..')

ltp = LTP(path = "tests/model/base2.model")

url = "tests/zrbzdz.txt"
t1 = time.time()

lines = []
count=0
output=[]
with open(url,"r",encoding='utf-8-sig') as f:
    for line in f:
        line = line.strip()
        lines.append(line)
        count+=1
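        if count%2000==0:
            output.extend(ltp.seg(lines)[0])
            lines = []
# Segment the final partial batch (fewer than 2000 lines), which would otherwise be dropped
if lines:
    output.extend(ltp.seg(lines)[0])
# Path of the output file for the segmented text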
LTP_f = open("tests/output/base22_LTP.txt","w",encoding='utf-8-sig')

str1='/ '
for out in output:
    LTP_f.write(str1.join(out)+'\n')
LTP_f.close()

tt =  time.time()-t1
print('time ' + str(tt))


# with open(url,"r",encoding='utf-8-sig') as f:
#     lines=f.readlines()
#     for line in lines:
#         segment, _ = ltp.seg([line])
#         output+="/ ".join(segment[0])+'\n'

# def split_l(l, n=8):
#     return [l[i:i + n] for i in range(0, len(l), n)]

# for batch_data in split_l(lines, 32):


Original post: https://blog.csdn.net/yilvyangguang520/article/details/126232312