• Parsing text with SyntaxNet on Matpool (矩池云)


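    Step 1: Run the SyntaxNet/Parsey McParseface server

    First, you need a SyntaxNet/Parsey McParseface server running at a known URL.
    For details, see the project page: parsey-universal-server.
    It is a simple Python Flask app that serves Parsey McParseface as an API over HTTP.
    Start by renting a Docker-capable machine on Matpool (矩池云).
    Then connect to the machine from VS Code and, once connected, open a terminal.
    To fetch the source needed for the build, run the following command in the terminal: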

    git clone https://github.com/andersrye/parsey-universal-server.git
    

    When the command finishes, a new folder for the repository appears in the Explorer panel.
    Change into that directory:

    cd parsey-universal-server
    

    Next, build the image from the Dockerfile:

    docker build -t parseyserver .
    

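    The build may time out. If it does, do not panic; just run the command again. Once the build output starts scrolling, wait for it to finish.
    Then create and start a new container (-d runs the container in the background and prints the container ID).
    Note that the command below uses the published andersrye/parsey-universal-server image; to run the image built above, use parseyserver as the image name instead.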

    docker run -d -it -p 7777:80 -e PARSEY_MODELS=English -e PARSEY_BATCH_SIZE=100 --name=syntaxnet --restart unless-stopped andersrye/parsey-universal-server
    

    The default model is English. Set the PARSEY_MODELS environment variable to choose other models:

    docker run -it --rm -p 7777:80 -e PARSEY_MODELS=Latin,English,Greek andersrye/parsey-universal-server
    
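    If you launch the container from a script, the same options can be assembled programmatically. The sketch below is only an illustration: the function name build_docker_run and its parameters are hypothetical, and it uses nothing beyond the flags and environment variables already shown in the commands above.

```python
import shlex


def build_docker_run(models=("English",), batch_size=100, host_port=7777,
                     image="andersrye/parsey-universal-server"):
    """Assemble the `docker run` argument list used in this post (hypothetical helper)."""
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{host_port}:80",                    # host port -> container port 80
        "-e", "PARSEY_MODELS=" + ",".join(models),  # comma-separated model names
        "-e", f"PARSEY_BATCH_SIZE={batch_size}",
        image,
    ]


cmd = build_docker_run(models=["Latin", "English", "Greek"])
# Print the command so it can be inspected or pasted into a terminal.
print(shlex.join(cmd))
```

    The list form can be passed directly to subprocess.run if you prefer to start the container from Python.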

    Step 2: Preprocess the text

    First, convert the text into a form that SyntaxNet can process:

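    def prepareTextForSyntaxnet(text=""):  # text: the raw input text
        # Assumes a module-level instance: tokenizer = Tokenizer()
        tmp = text.replace("\n", "")  # remove line breaks
        result = ""
        for sentence in tokenizer.split_sentences(tmp):  # iterate over each sentence
            # Tokenize the sentence and join the tokens with single spaces.
            tokenized_sentence = " ".join(tokenizer.tokenize(sentence))
            # Keep only sentences with more than 8 non-space characters, one per line.
            if len(tokenized_sentence.replace(" ", "")) > 8:
                result = result + tokenized_sentence + "\n"
        return result  # text ready to be sent to SyntaxNet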
    
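    To see what the preprocessed text looks like without installing spaCy, here is a minimal sketch of the same filtering logic. RegexTokenizer is a hypothetical stand-in built on regular expressions (it is not the tokenizer used in this post), and prepare_text and min_chars are names made up for the illustration.

```python
import re


class RegexTokenizer:
    """Hypothetical stand-in for the spaCy-based tokenizer; for illustration only."""

    def tokenize(self, text):
        # Words and individual punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)

    def split_sentences(self, text):
        # Naive split after '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare_text(text, tokenizer, min_chars=8):
    """Return one space-separated tokenized sentence per line, dropping short ones."""
    lines = []
    for sentence in tokenizer.split_sentences(text.replace("\n", "")):
        tokenized = " ".join(tokenizer.tokenize(sentence))
        if len(tokenized.replace(" ", "")) > min_chars:
            lines.append(tokenized)
    return "".join(line + "\n" for line in lines)


sample = "A Christmas Carol is a short novel. Yes! It was first published in 1843."
prepared = prepare_text(sample, RegexTokenizer())
print(prepared)
```

    The short sentence "Yes!" has fewer than nine non-space characters, so it is dropped; the other two sentences come out one per line with punctuation split off as separate tokens.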

    Here we can write a Tokenizer class that handles tokenization and sentence splitting:

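    import spacy

    class Tokenizer:
        def __init__(self, lang="en"):
            # Load the spaCy model (the lang argument is currently unused).
            self.spacy_nlp = spacy.load('en_core_web_sm')

        def tokenize(self, input):
            # Run only the tokenizer and drop tokens that are a single space.
            return [x.text for x in self.spacy_nlp.tokenizer(input) if x.text != " "]

        def split_sentences(self, input):
            # Run the full pipeline to obtain sentence boundaries.
            return [x.text for x in self.spacy_nlp(input).sents if x.text != " "]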
    

    For reasons I have not figured out, the public-beta machine on Matpool can only install spacy>=3.0.0:

    pip install spacy==3.0.0
    

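    After installing spacy==3.0.0, you need the en_core_web_sm model. Downloading it directly timed out for me, so I installed it offline.
    First download the archive and save it to a path of your choice, for example /mnt/package/en_core_web_sm-3.0.0.tar.gz.
    Then install it: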

    pip install /mnt/package/en_core_web_sm-3.0.0.tar.gz
    
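    Because the model is installed as an ordinary Python package, a quick way to confirm that the offline install worked is to check that the package can be found. This is a small sketch using only the standard library; the helper name model_installed is hypothetical.

```python
import importlib.util


def model_installed(package_name="en_core_web_sm"):
    """Return True if the package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if model_installed():
    print("en_core_web_sm is installed")
else:
    print("en_core_web_sm is missing; install the archive with pip first")
```

    Run it with the same interpreter (or virtual environment) that you used for pip, otherwise the check may look in the wrong place.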

    Step 3: Process the text with SyntaxNet and return the result

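    import json
    from urllib.request import Request, urlopen

    def processText(text="", chunkSize=50):
        url = 'http://0.0.0.0:7777'  # host port mapped to the container in Step 1
        result = []
        sentences = text.split("\n")
        # Send the sentences to the server in chunks of chunkSize lines.
        for i in range(0, len(sentences), chunkSize):
            print(i, "/", len(sentences))  # progress
            chunk = "\n".join(sentences[i:i + chunkSize])
            # Skip chunks that contain only spaces and line breaks.
            if len(chunk.replace("\n", "").replace(" ", "")) > 0:
                request = Request(url, chunk.encode(), headers={"Content-Type": "text/plain"})
                jsonresponse = urlopen(request, timeout=240).read().decode()
                tmp = json.loads(jsonresponse)  # a list of parsed sentences
                result = result + tmp
        return result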
    
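    The chunking step can be tried on its own, without a running server. The sketch below isolates it in a generator; iter_chunks is a hypothetical name, and the placeholder sentences exist only to show how the lines are grouped.

```python
def iter_chunks(text, chunk_size=50):
    """Yield newline-joined groups of at most chunk_size lines, skipping empty groups."""
    sentences = text.split("\n")
    for start in range(0, len(sentences), chunk_size):
        chunk = "\n".join(sentences[start:start + chunk_size])
        # Same emptiness test as processText: ignore spaces and line breaks.
        if len(chunk.replace("\n", "").replace(" ", "")) > 0:
            yield chunk


# Five placeholder sentences, one per line, as produced by the preprocessing step.
prepared = "s1 .\ns2 .\ns3 .\ns4 .\ns5 .\n"
chunks = list(iter_chunks(prepared, chunk_size=2))
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

    Each yielded chunk corresponds to one HTTP request, so a larger chunk size means fewer but longer requests; the 240-second timeout in processText gives the server room to finish a full chunk.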

    Save the response to a file:

    # docs, key, and outFilename are defined by the surrounding script.
    jsonResult = processText(docs[key])
    with open(outFilename, 'w') as outfile:
        json.dump(jsonResult, outfile, indent=4)
        print("CREATED:" + outFilename)
    

    Sample output:

    [
        [
            {
                "id": 1,
                "form": "A",
                "upostag": "DET",
                "xpostag": "DT",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 3,
                "deprel": "det"
            },
            {
                "id": 2,
                "form": "Christmas",
                "upostag": "PROPN",
                "xpostag": "NNP",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 3,
                "deprel": "compound"
            },
            {
                "id": 3,
                "form": "Carol",
                "upostag": "PROPN",
                "xpostag": "NNP",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 7,
                "deprel": "nsubj"
            },
            {
                "id": 4,
                "form": "is",
                "upostag": "VERB",
                "xpostag": "VBZ",
                "feats": {
                    "Mood": "Ind",
                    "fPOS": "VERB++VBZ",
                    "Number": "Sing",
                    "Person": "3",
                    "Tense": "Pres",
                    "VerbForm": "Fin"
                },
                "head": 7,
                "deprel": "cop"
            },
            {
                "id": 5,
                "form": "a",
                "upostag": "DET",
                "xpostag": "DT",
                "feats": {
                    "Definite": "Ind",
                    "fPOS": "DET++DT",
                    "PronType": "Art"
                },
                "head": 7,
                "deprel": "det"
            },
            {
                "id": 6,
                "form": "short",
                "upostag": "ADJ",
                "xpostag": "JJ",
                "feats": {
                    "fPOS": "ADJ++JJ",
                    "Degree": "Pos"
                },
                "head": 7,
                "deprel": "amod"
            },
            {
                "id": 7,
                "form": "novel",
                "upostag": "NOUN",
                "xpostag": "NN",
                "feats": {
                    "fPOS": "NOUN++NN",
                    "Number": "Sing"
                },
                "head": 0,
                "deprel": "ROOT"
            },
            {
                "id": 8,
                "form": "by",
                "upostag": "ADP",
                "xpostag": "IN",
                "feats": {
                    "fPOS": "ADP++IN"
                },
                "head": 10,
                "deprel": "case"
            },
            {
                "id": 9,
                "form": "Charles",
                "upostag": "PROPN",
                "xpostag": "NNP",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 10,
                "deprel": "name"
            },
            {
                "id": 10,
                "form": "Dickens",
                "upostag": "PROPN",
                "xpostag": "NNP",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 7,
                "deprel": "nmod"
            },
            {
                "id": 11,
                "form": ".",
                "upostag": "PUNCT",
                "xpostag": ".",
                "feats": {
                    "fPOS": "PUNCT++."
                },
                "head": 7,
                "deprel": "punct"
            }
        ],
        [
            {
                "id": 1,
                "form": "It",
                "upostag": "PRON",
                "xpostag": "PRP",
                "feats": {
                    "Case": "Nom",
                    "PronType": "Prs",
                    "Gender": "Neut",
                    "fPOS": "PRON++PRP",
                    "Number": "Sing",
                    "Person": "3"
                },
                "head": 4,
                "deprel": "nsubjpass"
            },
            {
                "id": 2,
                "form": "was",
                "upostag": "AUX",
                "xpostag": "VBD",
                "feats": {
                    "Mood": "Ind",
                    "fPOS": "AUX++VBD",
                    "Number": "Sing",
                    "Person": "3",
                    "Tense": "Past",
                    "VerbForm": "Fin"
                },
                "head": 4,
                "deprel": "auxpass"
            },
            {
                "id": 3,
                "form": "first",
                "upostag": "ADV",
                "xpostag": "RB",
                "feats": {
                    "fPOS": "ADV++RB"
                },
                "head": 4,
                "deprel": "advmod"
            },
            {
                "id": 4,
                "form": "published",
                "upostag": "VERB",
                "xpostag": "VBN",
                "feats": {
                    "Tense": "Past",
                    "Voice": "Pass",
                    "fPOS": "VERB++VBN",
                    "VerbForm": "Part"
                },
                "head": 0,
                "deprel": "ROOT"
            },
            {
                "id": 5,
                "form": "on",
                "upostag": "ADP",
                "xpostag": "IN",
                "feats": {
                    "fPOS": "ADP++IN"
                },
                "head": 7,
                "deprel": "case"
            },
            {
                "id": 6,
                "form": "19",
                "upostag": "NUM",
                "xpostag": "CD",
                "feats": {
                    "fPOS": "NUM++CD",
                    "NumType": "Card"
                },
                "head": 7,
                "deprel": "nummod"
            },
            {
                "id": 7,
                "form": "December",
                "upostag": "PROPN",
                "xpostag": "NNP",
                "feats": {
                    "fPOS": "PROPN++NNP",
                    "Number": "Sing"
                },
                "head": 4,
                "deprel": "nmod"
            },
            {
                "id": 8,
                "form": "1843",
                "upostag": "NUM",
                "xpostag": "CD",
                "feats": {
                    "fPOS": "NUM++CD",
                    "NumType": "Card"
                },
                "head": 7,
                "deprel": "nummod"
            },
            {
                "id": 9,
                "form": ".",
                "upostag": "PUNCT",
                "xpostag": ".",
                "feats": {
                    "fPOS": "PUNCT++."
                },
                "head": 4,
                "deprel": "punct"
            }
        ]
    ]
    
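    The result is a list of sentences, each a list of token dictionaries in which head holds the id of the governing token (0 for the root). As a sketch of how this structure might be consumed, the code below turns one sentence into (head, relation, dependent) triples. The function name dependency_edges is hypothetical, and the sample tokens are copied from the second sentence above with feats and the date phrase left out for brevity.

```python
def dependency_edges(sentence):
    """Return (head_form, deprel, form) triples for one parsed sentence."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    edges = []
    for tok in sentence:
        # head == 0 marks the root of the sentence.
        head = "ROOT" if tok["head"] == 0 else forms[tok["head"]]
        edges.append((head, tok["deprel"], tok["form"]))
    return edges


# Tokens taken from the sample output, trimmed to the fields used here.
sentence = [
    {"id": 1, "form": "It", "head": 4, "deprel": "nsubjpass"},
    {"id": 2, "form": "was", "head": 4, "deprel": "auxpass"},
    {"id": 3, "form": "first", "head": 4, "deprel": "advmod"},
    {"id": 4, "form": "published", "head": 0, "deprel": "ROOT"},
    {"id": 9, "form": ".", "head": 4, "deprel": "punct"},
]

for head, rel, dep in dependency_edges(sentence):
    print(f"{head} --{rel}--> {dep}")
```

    Looking tokens up by id rather than by list position keeps the sketch working even when, as here, some tokens of the sentence have been left out.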
  • Original article: https://blog.csdn.net/GW_Krystal/article/details/125477028