Natural Language Processing: Data Cleaning





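    1. Real-world data is rarely perfect; it has to be cleaned before any downstream analysis.
    2. Data cleaning is usually the most time-consuming step of a data analysis project.
    3. Data quality ultimately determines how accurate the analysis can be.
    4. Cleaning is the most direct way to improve data quality, which in turn makes the analysis results more reliable.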






    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the first seven columns of the Google Play Store dataset
    df = pd.read_csv("./dataset/googleplaystore.csv", usecols=[0, 1, 2, 3, 4, 5, 6])

                                                  App        Category  Rating  \
    0  Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
      Reviews Size Installs  Type  
    0     159  19M  10,000+  Free

    (10841, 7)


    App         10841
    Category    10841
    Rating       9367
    Reviews     10841
    Size        10841
    Installs    10841
    Type        10840
    dtype: int64


    count  9367.000000
    mean      4.193338
    std       0.537431
    min       1.000000
    25%       4.000000
    50%       4.300000
    75%       4.500000
    max      19.000000
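
The summaries above hint at two problems. Rating has only 9367 non-null values out of 10841 rows, and its maximum is 19 even though Google Play ratings run from 1 to 5. The following is a minimal sketch of how such checks could be written. It uses a small made-up list rather than the real dataset, and the `find_problems` helper and its sample values are illustrative assumptions, not part of the original code.

```python
def find_problems(ratings, low=1.0, high=5.0):
    """Return (indices of missing values, indices of out-of-range values)."""
    missing = [i for i, r in enumerate(ratings) if r is None]
    out_of_range = [i for i, r in enumerate(ratings)
                    if r is not None and not (low <= r <= high)]
    return missing, out_of_range

# Toy data: None marks a missing rating, 19.0 is outside the 1-5 scale
ratings = [4.1, None, 4.3, 19.0, 4.5]
missing, out_of_range = find_problems(ratings)
print("missing:", missing)
print("out of range:", out_of_range)
```

On the DataFrame loaded earlier, comparable checks would be `df["Rating"].isna().sum()` and a boolean mask such as `df[df["Rating"] > 5]`.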





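    • Fill with a fixed value, typically 0

    • Fill with a statistic such as the mean, median, or mode

    • Fill with the previous or next value

    • Fill using the SimpleImputer class

    • Fill using the KNN algorithm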
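
The first three strategies in the list can be sketched in plain Python as follows. This is only an illustration on a toy list where `None` marks a missing value; the helper names `fill_constant`, `fill_statistic` and `fill_forward` are made up for this example.

```python
from statistics import mean, median

def fill_constant(values, fill=0):
    """Replace every missing value with a fixed constant."""
    return [fill if v is None else v for v in values]

def fill_statistic(values, stat=mean):
    """Replace missing values with a statistic of the known values."""
    known = [v for v in values if v is not None]
    s = stat(known)
    return [s if v is None else v for v in values]

def fill_forward(values):
    """Replace each missing value with the most recent known value.
    A missing value at the very start has nothing to copy and stays None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

ratings = [4.0, None, 5.0, None, 3.0]
print(fill_constant(ratings))
print(fill_statistic(ratings, mean))
print(fill_statistic(ratings, median))
print(fill_forward(ratings))
```

In pandas, the same three strategies correspond to `Series.fillna(0)`, `Series.fillna(series.mean())` and `Series.ffill()`.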



    # Date format conversion
    import datetime

    date_str = '2023-09-11'
    date_obj = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    formatted_date_str = date_obj.strftime('%m/%d/%Y')
    print("Converted: " + formatted_date_str)


    # Number format conversion: parse the string, then format with two decimals
    num_str = '123.4567'
    num_float = float(num_str)
    formatted_num_str = "{:.2f}".format(num_float)
    print("Converted: " + formatted_num_str)






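    Some analysts like to deduplicate first, but I strongly recommend deduplicating only after the format and content have been cleaned. The reason was given earlier: stray spaces make the tool treat “陈丹奕” and “陈 丹奕” as two different people, so deduplication fails. Moreover, not every duplicate can be removed this easily...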
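
The ordering argument above can be sketched as follows: deduplicating the raw values keeps the variants that differ only in whitespace, while normalizing first lets them collapse into one entry. The helper names and the sample list are illustrative assumptions. Removing all whitespace suits Chinese names like the one in the example; for text where spaces are meaningful, collapsing runs of spaces into one would be the safer choice.

```python
import re

def normalize_name(name):
    """Remove every whitespace character (including full-width spaces)."""
    return re.sub(r"\s+", "", name)

def dedupe(names):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(names))

raw = ["陈丹奕", "陈 丹奕", " 陈丹奕 ", "张三"]

# Deduplicating first: the whitespace variants survive as separate entries
print(dedupe(raw))

# Cleaning first, then deduplicating: the variants collapse into one
cleaned = [normalize_name(n) for n in raw]
print(dedupe(cleaned))
```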


    text = "TangRengui is a StuDeNt"
    print("After lower():", text.lower())


    After lower(): tangrengui is a student



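    import re

    sentence = "+蚂=蚁!花!呗/期?免,息★.---《平凡的世界》:了*解一(#@)个“普通人”波涛汹涌的内心世界!"
    sentenceClean = []
    # Characters to remove. re.escape keeps [, ], ^, - and \ literal inside the
    # character class, so the class is not closed early by an unescaped ']'.
    remove_chars = '·’!"#$%&\'()()*+,-./:;<=>?@¥★、….【】[]《》“”‘’\\^_`{|}~'
    pattern = re.compile('[' + re.escape(remove_chars) + ']+')
    string = pattern.sub("", sentence)
    sentenceClean.append(string)
    print(sentenceClean)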
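
Hand-maintained character lists tend to miss symbols. One possible alternative, sketched here as an assumption rather than the author's method, is to drop every character whose Unicode general category is punctuation (`P*`) or symbol (`S*`), using the standard `unicodedata` module. The `strip_punctuation` name is made up for this example.

```python
import unicodedata

def strip_punctuation(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "S"))

sentence = "+蚂=蚁!花!呗/期?免,息★.---《平凡的世界》:了*解一(#@)个“普通人”波涛汹涌的内心世界!"
print(strip_punctuation(sentence))
```

This approach also removes currency signs, math operators and emoji, so it fits only when none of those carry meaning for the task.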




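    Removing stop words requires a stop-word list. Common choices are the HIT (Harbin Institute of Technology) list and the Baidu list; either can be downloaded online.

    Before filtering, load the stop-word list with load_stopword(). Then, as shown earlier, load the custom dictionary and segment the sentence. Finally, check each token against the stop-word list: tokens that are not in the list are appended to outstr, separated by spaces.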

    import jieba

    # Load the stop-word list
    def load_stopword():
        # Your own Chinese stop-word file, one word per line
        with open('stopword.txt', encoding='utf-8') as f_stop:
            # strip() removes leading and trailing whitespace by default
            return {line.strip() for line in f_stop}

    # Segment a Chinese sentence and drop stop words
    def seg_word(sentence):
        file_userDict = 'dict.txt'  # custom dictionary
        jieba.load_userdict(file_userDict)
        sentence_seged = jieba.cut(sentence.strip())
        stopwords = load_stopword()  # a set, so membership tests are fast
        outstr = ''
        for word in sentence_seged:
            # Skip stop words and tab characters
            if word not in stopwords and word != '\t':
                outstr += word
                outstr += " "
        print(outstr)
        return outstr

    if __name__ == '__main__':
        sentence = "人们宁愿去关心一个蹩脚电影演员的吃喝拉撒和鸡毛蒜皮,而不愿了解一个普通人波涛汹涌的内心世界"
        seg_word(sentence)


    人们 去 关心 蹩脚 电影演员 吃喝拉撒 鸡毛蒜皮 不愿 了解 普通人 波涛汹涌 内心世界



    # Extract high-frequency words with jieba
    import glob
    import jieba

    # Load a text file into a single string
    def get_content(path):
        with open(path, 'r', encoding='gbk', errors='ignore') as f:
            return ''.join(l.strip() for l in f)

    # Load stop words
    def stop_words(path):
        with open(path, 'r', encoding='gbk', errors='ignore') as f:
            return {l.strip() for l in f}

    # Count term frequencies and return the topK most frequent words
    def get_TF(words, topK=10):
        stops = stop_words('stopword.txt')  # load once, not once per word
        tf_dic = {}
        for w in words:
            if w not in stops:
                tf_dic[w] = tf_dic.get(w, 0) + 1
        return sorted(tf_dic.items(), key=lambda x: x[1], reverse=True)[:topK]

    files = glob.glob('./news*.txt')
    corpus = [get_content(x) for x in files]
    split_words = list(jieba.cut(corpus[0]))
    print('Sample:', corpus[0])
    print('Segmented sample:', ','.join(split_words))
    print('Top 10 words in the sample:', str(get_TF(split_words)))


    Sample: 先天性心脏病“几岁可根治,十几岁变难治,几十岁成不治”,中国著名心血管学术领袖胡大一今天在此间表示救治心脏病应从儿童抓起,他呼吁社会各界关心贫困地区的先天性心脏病儿童。据了解,今年五月一日到五月三日,胡大一及其“爱心工程”专家组将联合北京军区总医院在安徽太和县举办第三届先心病义诊活动。安徽太和县是国家重点贫困县,同时又是先天性心脏病的高发区。由于受贫苦地区医疗技术条件限制,当地很多孩子由于就医太晚而失去了治疗时机,当地群众也因此陷入“生病—贫困—无力医治—病情加重—更加贫困”的恶性循环中。胡大一表示,由于中国经济发展的不平衡与医疗水平的严重差异化,目前中国有这种情况的绝不止一个太和县。但按照现行医疗体制,目前医院、医生为社会提供的服务模式和力度都远远不能适应社会需求。他希望,发达地区的医院、医生能积极走出来,到患者需要的地方去。据悉,胡大一于二00二发起了面向全国先天性心脏病儿童的“胡大一爱心工程”,旨在呼吁社会对于先心病儿童的关注,同时通过组织大城市专家走进贫困地区开展义诊活动,对贫困地区贫困家庭优先实施免费手术,并对其他先心病儿童给予适当资助。 (钟啸灵)专家简介:胡大一、男、1946年7月生于河南开封,主任医师、教授、博士生导师,国家突出贡献专家、享受政府专家津贴。现任同济大学医学院院长、首都医科大学心脏病学系主任、北京大学人民医院心研所所长、心内科主任,首都医科大学心血管疾病研究所所长,首都医科大学北京同仁医院心血管疾病诊疗中心主任。任中华医学会心血管病分会副主任委员、中华医学会北京心血管病分会主任委员、中国生物医学工程学会心脏起搏与电生理分会主任委员、中国医师学会循证医学专业委员会主任委员、北京市健康协会理事长、北京医师协会副会长及美国心脏病学院会员。(来源:北大人民医院网站)
    Segmented sample: 先天性,心脏病,“,几岁,可,根治,,,十几岁,变,难治,,,几十岁,成不治,”,,,中国,著名,心血管,学术,领袖,胡大一,今天,在,此间,表示,救治,心脏病,应从,儿童,抓起,,,他,呼吁,社会各界,关心,贫困地区,的,先天性,心脏病,儿童,。,据,了解,,,今年,五月,一日,到,五月,三日,,,胡大一,及其,“,爱心,工程,”,专家组,将,联合,北京军区总医院,在,安徽,太和县,举办,第三届,先心病,义诊,活动,。,安徽,太和县,是,国家,重点,贫困县,,,同时,又,是,先天性,心脏病,的,高发区,。,由于,受,贫苦,地区,医疗,技术,条件,限制,,,当地,很多,孩子,由于,就医,太晚,而,失去,了,治疗,时机,,,当地,群众,也,因此,陷入,“,生病,—,贫困,—,无力,医治,—,病情,加重,—,更加,贫困,”,的,恶性循环,中,。,胡大一,表示,,,由于,中国,经济,发展,的,不,平衡,与,医疗,水平,的,严重,差异化,,,目前,中国,有,这种,情况,的,绝,不止,一个,太和县,。,但,按照,现行,医疗,体制,,,目前,医院,、,医生,为,社会,提供,的,服务,模式,和,力度,都,远远,不能,适应,社会,需求,。,他,希望,,,发达,地区,的,医院,、,医生,能,积极,走,出来,,,到,患者,需要,的,地方,去,。,据悉,,,胡大一,于,二,00,二,发起,了,面向全国,先天性,心脏病,儿童,的,“,胡大一,爱心,工程,”,,,旨在,呼吁,社会,对于,先心病,儿童,的,关注,,,同时,通过,组织,大城市,专家,走进,贫困地区,开展,义诊,活动,,,对,贫困地区,贫困家庭,优先,实施,免费,手术,,,并,对,其他,先心病,儿童,给予,适当,资助,。, ,(,钟啸灵,),专家,简介,:,胡大一,、,男,、,1946,年,7,月,生于,河南,开封,,,主任医师,、,教授,、,博士生,导师,,,国家,突出贡献,专家,、,享受,政府,专家,津贴,。,现任,同济大学,医学院,院长,、,首都医科大学,心脏病学,系主任,、,北京大学人民医院,心研,所,所长,、,心内科,主任,,,首都医科大学,心血管,疾病,研究所,所长,,,首都医科大学,北京同仁医院,心血管,疾病,诊疗,中心,主任,。,任,中华医学会,心血管病,分会,副,主任委员,、,中华医学会,北京,心血管病,分会,主任委员,、,中国,生物医学,工程,学会,心脏,起搏,与,电,生理,分会,主任委员,、,中国,医师,学会,循证,医学专业,委员会,主任委员,、,北京市,健康,协会,理事长,、,北京,医师,协会,副会长,及,美国,心脏病,学院,会员,。,(,来源,:,北大人民医院,网站,)
    Top 10 words in the sample: [(',', 23), ('、', 15), ('的', 11), ('。', 11), ('心脏病', 6), ('胡大一', 6), ('中国', 5), ('儿童', 5), ('先天性', 4), ('“', 4)]
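
In the output above, punctuation marks and the particle '的' take several of the top 10 slots, which suggests the stop-word list in use does not cover them. One possible remedy is to drop tokens made up entirely of punctuation or symbols before counting. The sketch below uses `collections.Counter` on a short hand-written token list that stands in for the jieba output; `is_punct` and `top_words` are illustrative helper names.

```python
import unicodedata
from collections import Counter

def is_punct(token):
    """True if the token is blank or consists only of punctuation and symbols."""
    stripped = token.strip()
    return not stripped or all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in stripped)

def top_words(tokens, stop_words=frozenset(), top_k=10):
    """Count tokens, skipping punctuation and stop words, and return the top_k."""
    counts = Counter(t for t in tokens
                     if not is_punct(t) and t not in stop_words)
    return counts.most_common(top_k)

# Toy token list standing in for the segmented sample
tokens = ["先天性", "心脏病", "“", "几岁", ",", "心脏病", "的",
          "儿童", "。", "心脏病", "儿童", "、"]
print(top_words(tokens, stop_words={"的"}, top_k=2))
```

Adding the punctuation marks to the stop-word file would have the same effect; filtering by Unicode category simply avoids listing them one by one.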


    # coding:utf-8
    import re
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import LancasterStemmer
    from nltk.probability import FreqDist

    # Remove punctuation (ASCII plus common Chinese marks)
    def filter_punctuation(words):
        new_words = []
        illegal_char = string.punctuation + '【·!…()—:“”?《》、;】'
        pattern = re.compile('[%s]' % re.escape(illegal_char))
        for word in words:
            new_word = pattern.sub('', word)
            if new_word != '':
                new_words.append(new_word)
        return new_words

    # Remove English stop words
    def filter_stop_words(words):
        stops = set(stopwords.words('english'))
        return [word for word in words if word.lower() not in stops]

    # Tokenize and stem
    def Word_segmentation_and_extraction(text):
        words = nltk.word_tokenize(text)
        stemmerlan = LancasterStemmer()
        return [stemmerlan.stem(word) for word in words]

    # Count word frequencies; the threshold itself is applied by the caller
    def filter_low_frequency_words(words):
        return FreqDist(words)

    # Write one item per line and close the file afterwards
    def write_lines(path, lines):
        with open(path, "w", encoding='utf-8') as f:
            for line in lines:
                f.write(line + '\n')

    with open('./data/text_en.txt', encoding='utf-8', errors='ignore') as f:
        text_en = f.read()

    # Tokenize and stem
    words_seg = Word_segmentation_and_extraction(text_en)
    write_lines("1.txt", words_seg)

    # Remove stop words
    words_no_stop = filter_stop_words(words_seg)
    write_lines("2.txt", words_no_stop)

    # Remove punctuation
    words_no_punc = filter_punctuation(words_no_stop)
    write_lines("3.txt", words_no_punc)

    # Keep only words that occur more than fre (20) times
    fre = 20
    fdist_no_low_fre = filter_low_frequency_words(words_no_punc)
    write_lines("4.txt", ['%s %d' % (key, count)
                          for key, count in fdist_no_low_fre.items() if count > fre])

    # Build a Text object, then print concordances and draw a dispersion plot
    # showing where the given names (Elizabeth, Darcy, Wickham, Bingley, Jane) occur
    my_text = nltk.text.Text(nltk.word_tokenize(text_en))
    name = ['Elizabeth', 'Darcy', 'Wickham', 'Bingley', 'Jane']
    for n in name:
        my_text.concordance(n)
    my_text.dispersion_plot(name[:])

    # Plot the frequency distribution of the 20 most frequent remaining words
    n = 20
    fdist = FreqDist(words_no_punc)
    fdist.plot(n)


    ssion to introduce his friend , Mr. Wickham , who had returned with him the day
    looked white , the other red . Mr. Wickham , after a few moments , touched his
    with his friend . Mr. Denny and Mr. Wickham walked with the young ladies to the
    p and down the street , and had Mr. Wickham appeared , Kitty and Lydia would ce
    sed to make her husband call on Mr. Wickham , and give him an invitation also ,
    entered the drawing-room , that Mr. Wickham had accepted their uncle 's invitat
    ntlemen did approach , and when Mr. Wickham walked into the room , Elizabeth fe
    were of the present party ; but Mr. Wickham was as far beyond them all in perso
    o followed them into the room . Mr. Wickham was the happy man towards whom almo
    s for the notice of the fair as Mr. Wickham and the officers , Mr. Collins seem
    could not wait for his reason . Mr. Wickham did not play at whist , and with re
    he common demands of the game , Mr. Wickham was therefore at leisure to talk to
    r , was unexpectedly relieved . Mr. Wickham began the subject himself . He inqu
    rstand . '' `` Yes , '' replied Mr. Wickham ; `` his estate there is a noble on
    right to give my opinion , '' said Wickham , `` as to his being agreeable or o
    n not pretend to be sorry , '' said Wickham , after a short interruption , `` t
    ce , to be an ill-tempered man . '' Wickham only shook his head . `` I wonder ,
    it prevented further inquiry . Mr. Wickham began to speak on more general topi
    myself on the subject , '' replied Wickham ; `` I can hardly be just to him .
    '' `` It is wonderful , '' replied Wickham , `` for almost all his actions may
    f regarding little matters . '' Mr. Wickham 's attention was caught ; and after
    both in a great degree , '' replied Wickham ; `` I have not seen her for many y
    st of the ladies their share of Mr. Wickham 's attentions . There could be no c
    e could think of nothing but of Mr. Wickham , and of what he had told her , all
    ext day what had passed between Mr. Wickham and herself . Jane listened with as
    Displaying 25 of 305 matches:
    ek . '' `` What is his name ? '' `` Bingley . '' `` Is he married or single ? '
    re as handsome as any of them , Mr. Bingley may like you the best of the party
    ar , you must indeed go and see Mr. Bingley when he comes into the neighbourhoo
    crupulous , surely . I dare say Mr. Bingley will be very glad to see you ; and
    earliest of those who waited on Mr. Bingley . He had always intended to visit h
    addressed her with : '' I hope Mr. Bingley will like it , Lizzy . '' `` We are
    e are not in a way to know what Mr. Bingley likes , '' said her mother resentfu
    of your friend , and introduce Mr. Bingley to her . '' `` Impossible , Mr. Ben
    ontinued , `` let us return to Mr . Bingley . '' `` I am sick of Mr. Bingley ,
    . Bingley . '' `` I am sick of Mr. Bingley , '' cried his wife . `` I am sorry
    u are the youngest , I dare say Mr. Bingley will dance with you at the next bal
    any satisfactory description of Mr. Bingley . They attacked him in various ways
    love ; and very lively hopes of Mr. Bingley 's heart were entertained . `` If I
    to wish for . '' In a few days Mr. Bingley returned Mr. Bennet 's visit , and
    arrived which deferred it all . Mr. Bingley was obliged to be in town the follo
    and a report soon followed that Mr. Bingley was to bring twelve ladies and seve
    sisted of only five altogether—Mr . Bingley , his two sisters , the husband of
    ldest , and another young man . Mr. Bingley was good-looking and gentlemanlike
    ared he was much handsomer than Mr. Bingley , and he was looked at with great a
    o be compared with his friend . Mr. Bingley had soon made himself acquainted wi
    with Mrs. Hurst and once with Miss Bingley , declined being introduced to any
    a conversation between him and Mr. Bingley , who came from the dance for a few
    astidious as you are , '' cried Mr. Bingley , `` for a kingdom ! Upon my honour
    wasting your time with me . '' Mr. Bingley followed his advice . Mr. Darcy wal
    ired by the Netherfield party . Mr. Bingley had danced with her twice , and she
    Displaying 25 of 288 matches:
    rg EBook of Pride and Prejudice , by Jane Austen Chapter 1 It is a truth unive
    sure she is not half so handsome as Jane , nor half so good-humoured as Lydia
    been distinguished by his sisters . Jane was as much gratified by this as her
    gh in a quieter way . Elizabeth felt Jane 's pleasure . Mary had heard herself
    t ball . I wish you had been there . Jane was so admired , nothing could be li
    ow ; and he seemed quite struck with Jane as she was going down the dance . So
    Maria Lucas , and the two fifth with Jane again , and the two sixth with Lizzy
    e detest the man . '' Chapter 4 When Jane and Elizabeth were alone , the forme
    second better . '' `` Oh ! you mean Jane , I suppose , because he danced with
    not there a little mistake ? '' said Jane . `` I certainly saw Mr. Darcy speak
    '' `` Miss Bingley told me , '' said Jane , `` that he never speaks much , unl
    xpressed towards the two eldest . By Jane , this attention was received with t
    like them ; though their kindness to Jane , such as it was , had a value as ar
    d to her it was equally evident that Jane was yielding to the preference which
    ered by the world in general , since Jane united , with great strength of feel
    mber , Eliza , that he does not know Jane 's disposition as you do . '' `` But
    gh of her . But , though Bingley and Jane meet tolerably often , it is never f
    be employed in conversing together . Jane should therefore make the most of ev
    should adopt it . But these are not Jane 's feelings ; she is not acting by d
    Well , '' said Charlotte , `` I wish Jane success with all my heart ; and if s
    while her daughter read , '' Well , Jane , who is it from ? What is it about
    it about ? What does he say ? Well , Jane , make haste and tell us ; make hast
    `` It is from Miss Bingley , '' said Jane , and then read it aloud . `` MY DEA
    `` Can I have the carriage ? '' said Jane . `` No , my dear , you had better g
    gment that the horses were engaged . Jane was therefore obliged to go on horse
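
The pipeline above names a low-frequency filtering step, but `filter_low_frequency_words` only builds the frequency table, and the threshold is applied later when `4.txt` is written. As a hedged sketch of what a self-contained filter could look like, the following uses the standard library's `collections.Counter` instead of NLTK's `FreqDist`. The `drop_low_frequency` name and the toy word list are assumptions made for illustration.

```python
from collections import Counter

def drop_low_frequency(words, min_count):
    """Keep only the tokens whose total count in `words` exceeds min_count."""
    counts = Counter(words)
    return [w for w in words if counts[w] > min_count]

# Toy token list: 'wickham' occurs once and is filtered out
words = ["darcy", "jane", "darcy", "wickham", "darcy", "jane"]
print(drop_low_frequency(words, min_count=1))
```

Returning the filtered token list, rather than the counts, lets the later frequency plot run on the filtered data directly.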



    import numpy as np
    from nltk.corpus import reuters

    # Vocabulary
    # A set gives O(1) average-time membership tests; a list would be O(n)
    with open('vocab.txt') as f:
        vocab = set(line.rstrip() for line in f)
    print(vocab)

    # Generate all candidate corrections
    def generate_candidates(word):
        """
        word: the given (misspelled) input
        Returns all valid candidates, i.e. those present in the vocabulary
        """
        # Generate words at edit distance 1
        # 1. insert 2. delete 3. replace
        # appl: replace: bppl, cppl, aapl, abpl...
        #       insert: bappl, cappl, abppl, acppl....
        #       delete: ppl, apl, app
        # Assume an alphabet of 26 lowercase letters
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
        # insert
        inserts = [L+c+R for L, R in splits for c in letters]
        # delete
        deletes = [L+R[1:] for L, R in splits if R]
        # replace
        replaces = [L+c+R[1:] for L, R in splits if R for c in letters]
        candidates = set(inserts+deletes+replaces)
        # Filter out words that are not in the vocabulary
        return [word for word in candidates if word in vocab]

    print(generate_candidates("apple"))

    # Read the corpus
    categories = reuters.categories()
    corpus = reuters.sents(categories=categories)
    print(corpus)

    # Build the language model: bigram
    term_count = {}
    bigram_count = {}
    for doc in corpus:
        doc = ['<s>'] + doc  # sentence-start marker
        for i in range(0, len(doc)-1):
            # bigram: [i, i+1]
            term = doc[i]
            bigram = ' '.join(doc[i:i+2])
            term_count[term] = term_count.get(term, 0) + 1
            bigram_count[bigram] = bigram_count.get(bigram, 0) + 1
    print(term_count)
    print(bigram_count)

    # Channel model: P(mistake | correct), uniform over the listed misspellings
    channel_prob = {}
    with open('spell-errors.txt') as f:
        for line in f:
            items = line.split(":")
            correct = items[0].strip()
            mistakes = [item.strip() for item in items[1].strip().split(",")]
            channel_prob[correct] = {}
            for mis in mistakes:
                channel_prob[correct][mis] = 1.0/len(mistakes)
    print(channel_prob)

    V = len(term_count.keys())
    with open("testdata.txt", "r") as file:
        for row in file:
            items = row.rstrip().split('\t')
            line = items[2].split()
            # enumerate keeps j in step with word even when a word is skipped
            for j, word in enumerate(line):
                if word in vocab:
                    continue
                # The word must be replaced with a correct one
                # Step 1: generate all valid candidates
                candidates = generate_candidates(word)
                # One option: if candidates is empty, generate more,
                # e.g. everything within edit distance 2
                # TODO: generate more candidates under such conditions
                if len(candidates) < 1:
                    continue
                probs = []
                # For each candidate compute
                #   log score = log p(correct) + log p(mistake|correct)
                # and pick the candidate with the highest score
                prev = line[j-1] if j > 0 else '<s>'
                for candi in candidates:
                    prob = 0
                    # Channel probability
                    if candi in channel_prob and word in channel_prob[candi]:
                        prob += np.log(channel_prob[candi][word])
                    else:
                        prob += np.log(0.0001)
                    # Language-model probability (add-one smoothed bigrams)
                    pre_word = prev + " " + candi
                    if pre_word in bigram_count and prev in term_count:
                        prob += np.log((bigram_count[pre_word]+1.0)/(term_count[prev]+V))
                    else:
                        prob += np.log(1.0/V)
                    if j+1 < len(line):
                        pos_word = candi + " " + line[j+1]
                        if pos_word in bigram_count and candi in term_count:
                            prob += np.log((bigram_count[pos_word] + 1.0)/(term_count[candi]+V))
                        else:
                            prob += np.log(1.0/V)
                    probs.append(prob)
                max_idx = probs.index(max(probs))
                print(word, candidates[max_idx])


    protectionst protectionist
    products. products
    long-run, long-run
    gain. gains
    17, 17
    retaiation retaliation
    cost. costs
    busines, business
    ltMC.T. ltMC.T
    U.S., U.S.
    Murtha, Murtha
    ....
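
The TODO in the code above suggests generating more candidates when nothing is found at edit distance 1. One way this could be done is to apply the same single-edit generator twice and keep only the results found in the vocabulary. The sketch below is an assumption about how that fallback might look, not the author's implementation; the `edits1` helper and the toy vocabulary are made up for the example, and this `generate_candidates` takes the vocabulary as a parameter.

```python
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings reachable with at most one insert, delete or replace."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    return set(inserts + deletes + replaces)

def generate_candidates(word, vocab):
    """Try one edit first; fall back to two edits only if nothing is found."""
    candidates = {w for w in edits1(word) if w in vocab}
    if not candidates:
        candidates = {w2 for w1 in edits1(word)
                      for w2 in edits1(w1) if w2 in vocab}
    return sorted(candidates)

vocab = {"apple", "apply", "business", "gain"}  # toy vocabulary, illustrative only
print(generate_candidates("appl", vocab))    # found with a single edit
print(generate_candidates("busnes", vocab))  # needs the two-edit fallback
```

The two-edit set grows roughly with the square of the one-edit set, so filtering against the vocabulary inside the comprehension keeps memory use small.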




    import jieba

    sentence = "你需要羽毛球拍吗?"
    seg_list = jieba.cut(sentence, cut_all=True)
    print("Full mode:", "/".join(seg_list))
    seg_list = jieba.cut(sentence, cut_all=False)
    print("Accurate mode:", "/".join(seg_list))
    seg_list = jieba.cut_for_search(sentence)
    print("Search engine mode:", "/".join(seg_list))
    seg_list = jieba.cut(sentence)  # accurate mode is the default
    print("Default mode:", "/".join(seg_list))


    Full mode: 你/需要/羽毛/羽毛球/羽毛球拍/球拍/吗/?
    Accurate mode: 你/需要/羽毛球拍/吗/?
    Search engine mode: 你/需要/羽毛/球拍/羽毛球/羽毛球拍/吗/?
    Default mode: 你/需要/羽毛球拍/吗/?


    """
    Stemmers
    """
    import nltk.stem.porter as pt
    import nltk.stem.lancaster as lc
    import nltk.stem.snowball as sb

    words = ['table', 'probably', 'wolves',
             'playing', 'is', 'the', 'beaches',
             'grouded', 'dreamt', 'envision']
    pt_stemmer = pt.PorterStemmer()
    lc_stemmer = lc.LancasterStemmer()
    sb_stemmer = sb.SnowballStemmer('english')
    for word in words:
        pt_stem = pt_stemmer.stem(word)
        lc_stem = lc_stemmer.stem(word)
        sb_stem = sb_stemmer.stem(word)
        # Columns: original word, Porter, Lancaster, Snowball
        print('%8s %8s %8s %8s' % (word, pt_stem, lc_stem, sb_stem))


    table tabl tabl tabl
    probably probabl prob probabl
    wolves wolv wolv wolv
    playing play play play
    is is is is
    the the the the
    beaches beach beach beach
    grouded groud groud groud
    dreamt dreamt dreamt dreamt
    envision envis envid envis


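    Lemmatization takes an English word after tokenization and, based on its part of speech, reduces it to its dictionary base form (the lemma). Put simply, it maps inflected forms back to the base word, and the result is normally a real dictionary entry. This differs from stemming, whose output is not guaranteed to appear in a dictionary. For example, “cars” is lemmatized to “car”, and “ate” to “eat”.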

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    # Map a Penn Treebank POS tag to the corresponding WordNet POS
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
    tokens = word_tokenize(sentence)  # tokenize
    tagged_sent = pos_tag(tokens)     # POS-tag each token
    wnl = WordNetLemmatizer()
    lemmas_sent = []
    for word, tag in tagged_sent:
        # Fall back to NOUN when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize
    print(lemmas_sent)


    ['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
