A Brief Look at keras.preprocessing.text

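Keras is an open-source neural network library written in Python and the high-level API of TensorFlow 2; since version 2.6 (August 2021) its code has been distributed as the separate keras package. It ships rich data utilities and implementations of many advanced models, which saves you from reinventing the wheel.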

I recently ran into Keras's Embedding layer, which led me to study keras.preprocessing.text. That module is now deprecated and has been superseded by:

tf.keras.utils.text_dataset_from_directory
tf.keras.layers.TextVectorization

Still, plenty of existing code uses keras.preprocessing.text, so a summary is worthwhile.

I. Main APIs

The APIs are listed here from the top down. tokenizer_from_json builds a Tokenizer, and one_hot calls hashing_trick. Both Tokenizer and hashing_trick ultimately rely on text_to_word_sequence for tokenization.

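Text preprocessing generally means tokenizing and then encoding the tokens as vectors. text_to_word_sequence does the tokenizing, while hashing_trick and one_hot do the encoding. Tokenizer is a higher-level class that vectorizes a text corpus in one of two ways: each text becomes either a sequence of integer word indices, or a fixed-size vector with one coefficient per token (binary, count, or tf-idf). tokenizer_from_json, as the name suggests, rebuilds a Tokenizer from a JSON string holding its configuration.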

1. tokenizer_from_json and Tokenizer


def tokenizer_from_json(json_string):
    """Parses a JSON tokenizer configuration and returns a tokenizer instance.
    Args:
        json_string: JSON string encoding a tokenizer configuration.
    Returns:
        A Keras Tokenizer instance
    """


keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)
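To make the idea behind Tokenizer concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing. It is not the Keras implementation: `build_word_index` and `to_sequences` are hypothetical helper names, tokenization is reduced to lowercasing and splitting on whitespace, and the JSON round trip only illustrates the idea behind tokenizer_from_json (the real configuration carries more than the vocabulary).

```python
import json
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank by descending frequency (0 stays unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Encode each text as a list of word indices, skipping unknown words."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


corpus = ["a bit of butter", "a bit bitter"]
word_index = build_word_index(corpus)
sequences = to_sequences(corpus, word_index)
print(word_index)
print(sequences)

# Serialize the vocabulary to JSON and load it back unchanged.
restored = json.loads(json.dumps(word_index))
```

Unlike the hashing-based functions in the next subsection, an explicit word index never maps two different words to the same integer, at the cost of having to store the vocabulary.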

2. one_hot and hashing_trick


keras.preprocessing.text.one_hot(text, n, 
                                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                 lower=True, 
                                 split=' ')


keras.preprocessing.text.hashing_trick(text, n,
                                       hash_function=None, 
                                       filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                       lower=True, 
                                       split=' ')

3. text_to_word_sequence


keras.preprocessing.text.text_to_word_sequence(text, 
                                               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                               lower=True, 
                                               split=' ')
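The tokenizing step can be approximated in a few lines of plain Python. The sketch below is a simplified stand-in written for illustration, under the assumption that the function lowercases the text, replaces each filtered character with the separator, and splits; `simple_word_sequence` and `DEFAULT_FILTERS` are hypothetical names, not part of Keras.

```python
# Default filter characters as listed in the signature above.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Split text into words after removing the filtered characters."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator, then split.
    table = str.maketrans({ch: split for ch in filters})
    tokens = text.translate(table).split(split)
    # Consecutive separators produce empty strings; drop them.
    return [tok for tok in tokens if tok]


words = simple_word_sequence("Bitty bought a bit of butter!")
print(words)  # ['bitty', 'bought', 'a', 'bit', 'of', 'butter']
```

Note that the apostrophe is not among the default filter characters, so a word such as "don't" stays in one piece.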

II. Exploring hashing_trick


def one_hot(
    input_text,
    n,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ",
    analyzer=None,
):
    return hashing_trick(
        input_text,
        n,
        hash_function=hash,
        filters=filters,
        lower=lower,
        split=split,
        analyzer=analyzer,
    )


def hashing_trick(
    text,
    n,
    hash_function=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ",
    analyzer=None,
):
    if hash_function is None:
        hash_function = hash
    elif hash_function == "md5":
        hash_function = lambda w: int(hashlib.md5(w.encode()).hexdigest(), 16)
 
    if analyzer is None:
        seq = text_to_word_sequence(
            text, filters=filters, lower=lower, split=split
        )
    else:
        seq = analyzer(text)
 
    return [(hash_function(w) % (n - 1) + 1) for w in seq]

Example usage:


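# Requires a TensorFlow/Keras version that still ships keras.preprocessing.text.
from tensorflow.keras.preprocessing.text import one_hot

sample_text_1 = "bitty bought a bit of butter"
sample_text_2 = "but the bit of butter was a bit bitter"
sample_text_3 = "so she bought some better butter to make the bitter butter better"

corp = [sample_text_1, sample_text_2, sample_text_3]

vocab_size = 50  # indices fall in [1, vocab_size - 1]
encod_corp = []
for i, doc in enumerate(corp):
    print(doc)
    encoded = one_hot(doc, vocab_size)  # encode once and reuse the result
    encod_corp.append(encoded)
    print("The encoding for document", i + 1, "is:", encoded)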

As the source shows, one_hot simply calls hashing_trick with hash_function=hash. hashing_trick in turn tokenizes the text with text_to_word_sequence and then applies Python's built-in hash() to each word.
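The source above also accepts hash_function="md5". The following sketch isolates that branch to show how a word becomes an index in the range 1 to n - 1; `md5_index` is a hypothetical helper name, and tokenization is reduced to a plain split() for brevity.

```python
import hashlib


def md5_index(word, n):
    """Map a word to an index in [1, n - 1] using its md5 digest."""
    digest = int(hashlib.md5(word.encode()).hexdigest(), 16)
    # The modulo keeps the value in [0, n - 2]; adding 1 leaves 0 unused.
    return digest % (n - 1) + 1


n = 50
doc = "but the bit of butter was a bit bitter"
encoded = [md5_index(w, n) for w in doc.split()]
print(encoded)
```

Because md5 depends only on the bytes of the word, this encoding is the same on every run and every machine, which is not guaranteed for the built-in hash().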

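When I first read the example code, I wondered: how can one_hot encode different sentences in separate calls? Couldn't the same word end up with different values? Couldn't different words end up with the same value? Answering that requires a closer look at Python's built-in hash() function.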

III. Python's hash() function

1. About hash()

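Python's hash(object) returns the hash value of the object passed in, if it has one. Hash values are integers.

Numbers that compare equal have the same hash value, even when their types differ (for example, 1 and 1.0).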

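Python's built-in immutable objects have hash values, while its built-in mutable containers do not; the latter are called unhashable. So hash() works on numbers, strings, and (by default) instances of user-defined classes, but it cannot be applied directly to a list, set, or dict.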
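A quick check of these rules, as a small sketch; `Point` and `is_hashable` are hypothetical names introduced only for this illustration.

```python
class Point:
    """Plain user-defined class; its instances are hashable by default."""

    def __init__(self, x):
        self.x = x


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Equal numbers hash equally, regardless of type.
equal_numbers = hash(1) == hash(1.0)
print(equal_numbers)  # True

samples = (1, "butter", (1, 2), Point(0), [1, 2], {1, 2}, {"a": 1})
results = {type(obj).__name__: is_hashable(obj) for obj in samples}
print(results)
```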

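The elements of a set and the keys of a dict must be hashable. For strings, hash() is guaranteed to be consistent within a single interpreter process, but the same string may hash to a different value in another process, because string hashing is randomized per process by default (controlled by the PYTHONHASHSEED environment variable).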
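The per-process behavior can be observed by launching fresh interpreters with an explicit PYTHONHASHSEED. This is a sketch that assumes a Python executable is available through sys.executable; `string_hash_in_child` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def string_hash_in_child(word, seed):
    """Run hash(word) in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# The same seed reproduces the same hash across separate processes.
same_seed = string_hash_in_child("butter", 1) == string_hash_in_child("butter", 1)
print(same_seed)

# Different seeds will, in practice, give different hashes for the same string.
print(string_hash_in_child("butter", 1), string_hash_in_child("butter", 2))
```

For one_hot this means the integer assigned to a word can change between runs unless PYTHONHASHSEED is fixed, so encodings produced in one process should not be assumed to match those from another.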

2. About hash collisions

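Within a single Python process, the same word (more generally, the same value) always maps to the same value under hash(), so encoding several sentences in separate one_hot calls is consistent. The reverse does not hold: hashing_trick reduces the hash modulo n - 1, so different words can share an index, and collisions become more likely as n shrinks relative to the vocabulary.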
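A short experiment makes both points visible. It applies the same formula as hashing_trick, hash(w) % (n - 1) + 1, to 100 made-up words with n = 50; the word list is synthetic and chosen only for illustration.

```python
n = 50
words = [f"word{i}" for i in range(100)]  # 100 distinct words

# Same formula as hashing_trick with the built-in hash().
indices = [hash(w) % (n - 1) + 1 for w in words]

# Within this process, re-encoding a word always gives the same index.
same_word_stable = all(hash(w) % (n - 1) + 1 == i for w, i in zip(words, indices))
print(same_word_stable)  # True

# Only n - 1 = 49 indices exist, so 100 words cannot all be distinct.
distinct = len(set(indices))
print(distinct, "distinct indices for", len(words), "words")
```

The exact number of distinct indices varies from run to run, but it can never exceed n - 1, so choosing n comfortably larger than the expected vocabulary reduces collisions without eliminating them.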

IV. About Embedding

To be continued.

Original article: https://blog.csdn.net/Tanqy1997/article/details/128069906