I've been writing on Douban lately, just personal reflections. In today's online environment a lot of things simply can't be said: certain words won't make it past review, so I had been replacing the sensitive keywords by hand. It occurred to me to just write a simple Python script for it.
- import os
- #
- wk_dir = "2022——社会科学研究方法/test_替换敏感词"
- data_dir = "2022——社会科学研究方法/test_替换敏感词/data_dir_politic_senti_replace"
-
- #---------------------------------------------------------#
- #---- * * ----#
- #---------------------------------------------------------#
-
- # Read the replacement dictionary: one "word substitute" pair per line
- with open(os.path.join(data_dir, "dct_politic_senti.txt"), encoding="UTF-8") as f:
-     dct_code = f.readlines()
-
- # Strip newlines and split each line into (word, substitute)
- dct_code = [x.strip() for x in dct_code]
- dct_code = [x.split(" ") for x in dct_code]
- hanzi = [x[0] for x in dct_code]
- yingwen = [x[1] for x in dct_code]
-
- # Build the replacement mapping {sensitive word: substitute}
- dct_repl = dict(zip(hanzi, yingwen))
-
- # Read the article to be cleaned
- with open(os.path.join(data_dir, "artical1.txt"), encoding="utf-8") as f:
-     txt = f.read()
-
-
- # Replace every sensitive word that appears in the text
- for key, value in dct_repl.items():
-     if key in txt:
-         txt = txt.replace(key, value)
-
- print(txt)
You need a dictionary file, for example one that maps each of these words to its substitute.
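As a sketch of the assumed file format: each line of dct_politic_senti.txt holds one space-separated pair, the word followed by its substitute. The pairs below are made-up placeholders, not the real dictionary:

```python
# Hypothetical dictionary lines; the real ones come from dct_politic_senti.txt
lines = ["苹果 apple\n", "香蕉 banana\n"]

# Same parsing as the script: strip the newline, split on the single space
pairs = [line.strip().split(" ") for line in lines]
dct_repl = {word: sub for word, sub in pairs}
print(dct_repl)  # {'苹果': 'apple', '香蕉': 'banana'}
```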

The result comes out like this; I don't know whether it will pass review when I post it.

The key part of the code is this section:
- for key, value in dct_repl.items():
-     if key in txt:
-         txt = txt.replace(key, value)
This section scans for the words and replaces them one pass at a time, which is somewhat inefficient, but I haven't thought of a better, faster solution yet.
I'd appreciate pointers from anyone more experienced.
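One possible speedup, offered as a sketch rather than a drop-in fix: build a single regular expression that matches any of the sensitive words, and do one pass over the text with `re.sub`. This scans the text once instead of once per dictionary entry. The sample words below are placeholders:

```python
import re

# Placeholder dictionary; in the script this would be the dct_repl built from the file
dct_repl = {"苹果": "apple", "香蕉": "banana"}

# Longest keys first, so a longer word wins over any word it contains;
# re.escape guards against regex metacharacters inside the words.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(dct_repl, key=len, reverse=True))
)

# One pass: each match is looked up in the dictionary and swapped in
txt = "吃苹果,吃香蕉。"
txt = pattern.sub(lambda m: dct_repl[m.group(0)], txt)
print(txt)  # 吃apple,吃banana。
```

The callable passed to `pattern.sub` receives each match object, so one compiled pattern covers the whole dictionary.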