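• [Data Processing] Python: Implementing the Conditional Distribution Function | Mean, Variance, and Covariance | Expected Value of a Function | Probability Theory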



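    💭 Preface: In this chapter we use Python to implement, by hand, the computation of a conditional distribution, functions for the mean, variance, and covariance, and a function for the expected value of a function. The test code is provided at the end of the article. Required environment: python version >= 3.6, numpy >= 1.15, nltk >= 3.4, tqdm >= 4.24.0, scikit-learn >= 0.22.

    🔗 Related link: [Probability Theory] Python: Implementing the Joint Distribution Function | Marginal Distribution Function

    📜 Contents:

    0x00 Implementing the conditional distribution function (Conditional distribution)

    0x01 Implementing the mean, variance, and covariance functions (Mean, Variance, Covariance)

    0x02 Implementing a function for the expected value of a function (Expected Value of a Function)

    0x04 Test cases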


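    0x00 Implementing the conditional distribution function (Conditional distribution)

    Implement the conditional_distribution_of_word_counts function, which takes Pjoint and Pmarginal and computes the result.

    Complete the code below to compute the conditional distribution, store the result in Pcond, and return it: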

    def conditional_distribution_of_word_counts(Pjoint, Pmarginal):
        """
        Parameters:
        Pjoint (numpy array) - Pjoint[m,n] = P(X0=m,X1=n), where
          X0 is the number of times that word0 occurs in a given text,
          X1 is the number of times that word1 occurs in the same text.
        Pmarginal (numpy array) - Pmarginal[m] = P(X0=m)

        Outputs:
        Pcond (numpy array) - Pcond[m,n] = P(X1=n|X0=m)
        """
        raise RuntimeError("You need to write this part!")
        return Pcond

    🚩 Sample output:

    Problem3. Conditional distribution:
    [[0.97177419 0.02419355 0.00201613 0.         0.00201613]
     [1.         0.         0.         0.         0.        ]
     [       nan        nan        nan        nan        nan]
     [       nan        nan        nan        nan        nan]
     [1.         0.         0.         0.         0.        ]]

    💭 Hint: the formula for the conditional distribution is:

    $$ P(X_1 = x_1 \mid X_0 = x_0) = \frac{P(X_0 = x_0, X_1 = x_1)}{P(X_0 = x_0)} $$

    💬 Code demo: implementation of conditional_distribution_of_word_counts

    import numpy as np

    def conditional_distribution_of_word_counts(Pjoint, Pmarginal):
        # Divide row m of Pjoint by P(X0=m); np.newaxis turns Pmarginal into a column
        Pcond = Pjoint / Pmarginal[:, np.newaxis]
        return Pcond

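    Note that when some entries of the denominator Pmarginal are zero, NumPy does not raise an exception: it emits a RuntimeWarning and the corresponding 0/0 results become NaN (Not a Number). That is why rows 2 and 3 of the sample output are NaN: P(X0=2) and P(X0=3) are both zero. When a marginal probability is zero, the conditional probability has no sensible definition. One way around this is to replace every zero entry of Pmarginal with a very small nonzero number, for example 1e-10, before dividing; the affected rows then come out as 0 instead of NaN.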
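As an alternative to substituting 1e-10, the division can be restricted to rows whose marginal probability is positive. The following is a minimal sketch rather than the original post's solution; the function name conditional_distribution_safe and the fill_value parameter are made up for illustration. It relies on the out= and where= arguments of np.divide.

```python
import numpy as np

def conditional_distribution_safe(Pjoint, Pmarginal, fill_value=np.nan):
    """Hypothetical helper: Pcond[m, n] = Pjoint[m, n] / Pmarginal[m],
    with rows where Pmarginal[m] == 0 set to fill_value."""
    Pjoint = np.asarray(Pjoint, dtype=float)
    denom = np.asarray(Pmarginal, dtype=float)[:, np.newaxis]  # column vector
    # Pre-fill the output; entries skipped by `where` keep this value.
    Pcond = np.full(Pjoint.shape, fill_value, dtype=float)
    # Divide only where the marginal is positive, so no 0/0 is ever evaluated.
    np.divide(Pjoint, denom, out=Pcond, where=denom > 0)
    return Pcond

# Tiny example: the middle row has zero marginal probability.
Pjoint_demo = np.array([[0.50, 0.25],
                        [0.00, 0.00],
                        [0.25, 0.00]])
Pmarginal_demo = Pjoint_demo.sum(axis=1)
print(conditional_distribution_safe(Pjoint_demo, Pmarginal_demo))
```

Keeping NaN as the default fill value matches the sample output above; passing fill_value=0.0 gives the same result as the 1e-10 substitution without changing any well-defined row.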

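    0x01 Implementing the mean, variance, and covariance functions (Mean, Variance, Covariance)

    Using words that occur very frequently in English text, such as "a" and "the", we obtain their joint distribution (Pathe) and a marginal distribution (Pthe).

    Pathe and Pthe are already defined in reader.py, so we do not need to implement them; the full code is listed at the end of the article.

    What we need to do here is write functions that compute the mean, variance, and covariance from a probability distribution: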

    • The functions mean_from_distribution and variance_from_distribution take a probability distribution P (Pthe) and return the mean and variance of the random variable X. Rounding the mean and variance to three decimal places is sufficient.
    • The function covariance_from_distribution computes and returns the covariance of the random variables X0 and X1 under the probability distribution P (Pathe), likewise rounded to three decimal places.

    def mean_from_distribution(P):
        """
        Parameters:
        P (numpy array) - P[n] = P(X=n)

        Outputs:
        mu (float) - the mean of X
        """
        raise RuntimeError("You need to write this part!")
        return mu

    def variance_from_distribution(P):
        """
        Parameters:
        P (numpy array) - P[n] = P(X=n)

        Outputs:
        var (float) - the variance of X
        """
        raise RuntimeError("You need to write this part!")
        return var

    def covariance_from_distribution(P):
        """
        Parameters:
        P (numpy array) - P[m,n] = P(X0=m,X1=n)

        Outputs:
        covar (float) - the covariance of X0 and X1
        """
        raise RuntimeError("You need to write this part!")
        return covar

    🚩 Sample output:

    Problem4-1. Mean from distribution:
    4.432
    Problem4-2. Variance from distribution:
    41.601
    Problem4-3. Covariance from distribution:
    9.235

    💭 Hint: the formulas for the mean, variance, and covariance are:

    $$ \mu = \sum_{x} x \, P(X = x) $$

    $$ \sigma^2 = \sum_{x} (x - \mu)^2 \, P(X = x) $$

    $$ \mathrm{Cov}(X_0, X_1) = \sum_{x_0, x_1} (x_0 - \mu_{X_0})(x_1 - \mu_{X_1}) \, P(X_0 = x_0, X_1 = x_1) $$

    💬 Code demo:

    import numpy as np

    def mean_from_distribution(P):
        mu = np.sum(  # Σ over x of x * P(X=x); the index of P is the value x
            np.arange(len(P)) * P
        )
        return round(mu, 3)  # keep three decimal places

    def variance_from_distribution(P):
        mu = mean_from_distribution(P)  # note: this mean is already rounded
        var = np.sum(  # Σ over x of (x - mu)^2 * P(X=x)
            (np.arange(len(P)) - mu) ** 2 * P
        )
        return round(var, 3)  # keep three decimal places

    def covariance_from_distribution(P):
        m, n = P.shape
        mu_X0 = mean_from_distribution(np.sum(P, axis=1))  # mean of the X0 marginal
        mu_X1 = mean_from_distribution(np.sum(P, axis=0))  # mean of the X1 marginal
        covar = np.sum(  # Σ over (x0, x1); the column vector broadcasts against the row
            (np.arange(m)[:, np.newaxis] - mu_X0) * (np.arange(n) - mu_X1) * P
        )
        return round(covar, 3)
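A quick way to gain confidence in these formulas is to evaluate them on a distribution small enough to check by hand. The sketch below is illustrative only: the helper name moments_from_joint and the 2x2 distribution are invented for this example, and it keeps the intermediate means unrounded. It also computes the covariance a second way, through the identity Cov(X0, X1) = E[X0 X1] - E[X0] E[X1], as a cross-check.

```python
import numpy as np

def moments_from_joint(P):
    """Hypothetical helper: unrounded moments of a joint distribution
    P[m, n] = P(X0=m, X1=n)."""
    m, n = P.shape
    x0 = np.arange(m)
    x1 = np.arange(n)
    P0 = P.sum(axis=1)  # marginal distribution of X0
    P1 = P.sum(axis=0)  # marginal distribution of X1
    mu0 = np.sum(x0 * P0)
    mu1 = np.sum(x1 * P1)
    var0 = np.sum((x0 - mu0) ** 2 * P0)
    # Covariance straight from the definition.
    cov_def = np.sum((x0[:, np.newaxis] - mu0) * (x1 - mu1) * P)
    # Covariance via E[X0 X1] - E[X0] E[X1].
    cov_alt = np.sum(np.outer(x0, x1) * P) - mu0 * mu1
    return mu0, mu1, var0, cov_def, cov_alt

# By hand: E[X0] = 0.7, E[X1] = 0.6, Var(X0) = 0.7 * 0.3 = 0.21,
# E[X0 X1] = 0.4, so Cov = 0.4 - 0.7 * 0.6 = -0.02.
P_demo = np.array([[0.1, 0.2],
                   [0.3, 0.4]])
print(moments_from_joint(P_demo))
```

Both covariance expressions agree up to floating-point error, which is a useful sanity check before running on the real word-count data.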

    0x02 Implementing a function for the expected value of a function (Expected Value of a Function)

    Implement the expectation_of_a_function function, which computes E[f(X0, X1)] for the random variables X0 and X1.

    Here P is the joint distribution, and f is a function that takes two real-valued inputs and returns its result as f(x0, x1).

    The function f is already defined in reader.py; you only need to compute the value of E[f(X0, X1)] and return it rounded to three decimal places.
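Unlike the previous two sections, this one gives no formula hint. For reference, the quantity being asked for is the standard expectation of a function under a discrete joint distribution, written here in the same notation as the earlier hints (this is the textbook definition, not a formula quoted from the original post):

```latex
E[f(X_0, X_1)] = \sum_{x_0, x_1} f(x_0, x_1) \, P(X_0 = x_0, X_1 = x_1)
```

The sum runs over every index pair (x0, x1) of the array P, so pairs with zero probability contribute nothing.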

    def expectation_of_a_function(P, f):
        """
        Parameters:
        P (numpy array) - joint distribution, P[m,n] = P(X0=m,X1=n)
        f (function) - f should be a function that takes two
          real-valued inputs, x0 and x1. The output, z=f(x0,x1),
          must be a real number for all values of (x0,x1)
          such that P(X0=x0,X1=x1) is nonzero.

        Output:
        expected (float) - the expected value, E[f(X0,X1)]
        """
        raise RuntimeError("You need to write this part!")
        return expected

    🚩 Sample output:

    Problem5. Expectation of a function:
    1.772

    💬 Code demo: implementation of expectation_of_a_function

    def expectation_of_a_function(P, f):
        """
        Parameters:
        P (numpy array) - joint distribution, P[m,n] = P(X0=m,X1=n)
        f (function) - f should be a function that takes two
          real-valued inputs, x0 and x1. The output, z=f(x0,x1),
          must be a real number for all values of (x0,x1)
          such that P(X0=x0,X1=x1) is nonzero.

        Output:
        expected (float) - the expected value, E[f(X0,X1)]
        """
        m, n = P.shape
        E = 0.0
        # Accumulate f(x0, x1) * P(X0=x0, X1=x1) over every index pair
        for x0 in range(m):
            for x1 in range(n):
                E += f(x0, x1) * P[x0, x1]
        return round(E, 3)  # keep three decimal places
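The docstring only promises that f is real-valued where P is nonzero, so a variant that skips zero-probability cells can be slightly more robust. The following is a sketch under two assumptions that the original post does not state: that f accepts NumPy arrays and works elementwise (true for the np.log based f in reader.py), and that an unrounded return value is acceptable. The name expectation_of_a_function_masked is invented for this example.

```python
import numpy as np

def expectation_of_a_function_masked(P, f):
    """Hypothetical vectorized variant: sums f(x0, x1) * P[x0, x1]
    over the cells where P is nonzero only."""
    P = np.asarray(P, dtype=float)
    x0, x1 = np.indices(P.shape)  # x0[m, n] = m and x1[m, n] = n
    mask = P > 0                  # cells that actually carry probability
    # Assumption: f works elementwise on 1-D integer arrays.
    return float(np.sum(f(x0[mask], x1[mask]) * P[mask]))

def f_demo(x0, x1):
    # Same form as the f defined in reader.py
    return np.log(x0 + 1) + np.log(x1 + 1)

P_demo = np.array([[0.1, 0.2],
                   [0.3, 0.4]])
# By hand: (0.2 + 0.3 + 2 * 0.4) * log(2) = 1.3 * log(2)
print(expectation_of_a_function_masked(P_demo, f_demo))
```

Wrapping the result in round(..., 3) would bring it in line with the three-decimal convention used by the loop version above.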

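    0x04 Test cases

    This is a project that processes text data; the test data consists of 500 emails (txt files):

    🔨 Required environment:

    - python version >= 3.6
    - numpy >= 1.15
    - nltk >= 3.4
    - tqdm >= 4.24.0
    - scikit-learn >= 0.22

    nltk is short for Natural Language Toolkit, a Python library for working with human language data (text). It provides many tools and resources for text processing and NLP. PorterStemmer extracts word stems, reducing words to their base form, usually by stripping suffixes. RegexpTokenizer is a regular-expression-based tokenizer used to split text into words.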

    💬 data_load.py: loads the text data

    import os
    import numpy as np
    from nltk.stem.porter import PorterStemmer
    from nltk.tokenize import RegexpTokenizer
    from tqdm import tqdm

    porter_stemmer = PorterStemmer()
    tokenizer = RegexpTokenizer(r"\w+")
    bad_words = {"aed", "oed", "eed"}  # these words fail in nltk stemmer algorithm

    def loadFile(filename, stemming, lower_case):
        """
        Load a file, and returns a list of words.

        Parameters:
        filename (str): path of the file to load
        stemming (bool): if True, use NLTK's stemmer to remove suffixes
        lower_case (bool): if True, convert letters to lowercase

        Output:
        x (list): x[n] is the n'th word in the file
        """
        text = []
        with open(filename, "rb") as f:
            for line in f:
                if lower_case:
                    line = line.decode(errors="ignore").lower()
                    text += tokenizer.tokenize(line)
                else:
                    text += tokenizer.tokenize(line.decode(errors="ignore"))
        if stemming:
            for i in range(len(text)):
                if text[i] in bad_words:
                    continue
                text[i] = porter_stemmer.stem(text[i])
        return text

    def loadDir(dirname, stemming, lower_case, use_tqdm=True):
        """
        Loads the files in the folder and returns a
        list of lists of words from the text in each file.

        Parameters:
        dirname (str): the directory containing the data
        stemming (bool): if True, use NLTK's stemmer to remove suffixes
        lower_case (bool): if True, convert letters to lowercase
        use_tqdm (bool, default:True): if True, use tqdm to show status bar

        Output:
        texts (list of lists): texts[m][n] is the n'th word in the m'th email
        count (int): number of files loaded
        """
        texts = []
        count = 0
        if use_tqdm:
            for f in tqdm(sorted(os.listdir(dirname))):
                texts.append(loadFile(os.path.join(dirname, f), stemming, lower_case))
                count = count + 1
        else:
            for f in sorted(os.listdir(dirname)):
                texts.append(loadFile(os.path.join(dirname, f), stemming, lower_case))
                count = count + 1
        return texts, count

    💬 reader.py: reads the data and prints the results (it imports the functions implemented above from a module named hw4)

    import data_load, hw4, importlib
    import numpy as np

    if __name__ == "__main__":
        texts, count = data_load.loadDir("data", False, False)
        importlib.reload(hw4)

        Pjoint = hw4.joint_distribution_of_word_counts(texts, "mr", "company")
        print("Problem1. Joint distribution:")
        print(Pjoint)
        print("---------------------------------------------")

        P0 = hw4.marginal_distribution_of_word_counts(Pjoint, 0)
        P1 = hw4.marginal_distribution_of_word_counts(Pjoint, 1)
        print("Problem2. Marginal distribution:")
        print("P0:", P0)
        print("P1:", P1)
        print("---------------------------------------------")

        Pcond = hw4.conditional_distribution_of_word_counts(Pjoint, P0)
        print("Problem3. Conditional distribution:")
        print(Pcond)
        print("---------------------------------------------")

        Pathe = hw4.joint_distribution_of_word_counts(texts, "a", "the")
        Pthe = hw4.marginal_distribution_of_word_counts(Pathe, 1)
        mu_the = hw4.mean_from_distribution(Pthe)
        print("Problem4-1. Mean from distribution:")
        print(mu_the)
        var_the = hw4.variance_from_distribution(Pthe)
        print("Problem4-2. Variance from distribution:")
        print(var_the)
        covar_a_the = hw4.covariance_from_distribution(Pathe)
        print("Problem4-3. Covariance from distribution:")
        print(covar_a_the)
        print("---------------------------------------------")

        def f(x0, x1):
            return np.log(x0 + 1) + np.log(x1 + 1)

        expected = hw4.expectation_of_a_function(Pathe, f)
        print("Problem5. Expectation of a function:")
        print(expected)
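reader.py also calls hw4.joint_distribution_of_word_counts and hw4.marginal_distribution_of_word_counts, which belong to the linked article on joint and marginal distributions rather than to this one. To run reader.py end to end, hw4.py needs some version of them. The following is only a guess at their behaviour, inferred from the docstrings and call sites above (Pjoint[m,n] = P(X0=m,X1=n), and index 0 or 1 selecting the marginal of X0 or X1); it is not the linked article's code, and it assumes a non-empty list of texts.

```python
import numpy as np

def joint_distribution_of_word_counts(texts, word0, word1):
    """Assumed behaviour: Pjoint[m, n] is the fraction of texts in which
    word0 occurs exactly m times and word1 occurs exactly n times."""
    counts0 = [text.count(word0) for text in texts]
    counts1 = [text.count(word1) for text in texts]
    Pjoint = np.zeros((max(counts0) + 1, max(counts1) + 1))
    for c0, c1 in zip(counts0, counts1):
        Pjoint[c0, c1] += 1
    return Pjoint / len(texts)

def marginal_distribution_of_word_counts(Pjoint, index):
    """Assumed behaviour: index 0 returns P(X0=m), index 1 returns P(X1=n)."""
    # Summing out the *other* axis leaves the distribution of X_index.
    return Pjoint.sum(axis=1 - index)

# Tiny stand-in for the email data: each text is a list of words.
texts_demo = [["a", "the", "the"], ["the"], ["a", "a"], [], ["the"]]
Pjoint_demo = joint_distribution_of_word_counts(texts_demo, "a", "the")
print(Pjoint_demo)
print(marginal_distribution_of_word_counts(Pjoint_demo, 0))
print(marginal_distribution_of_word_counts(Pjoint_demo, 1))
```

With these two stand-ins and the functions from the sections above placed in hw4.py, every call made by reader.py has a definition to resolve to.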

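    📌 [ Author ]   王亦优
    📃 [ Updated ]   2023.11.15
    ❌ [ Errata ]   /* none yet */
    📜 [ Disclaimer ]   Given the author's limited expertise, errors and inaccuracies in this article are unavoidable.
    I would very much like to know about them and sincerely welcome readers' criticism and corrections!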


  • Original article: https://blog.csdn.net/weixin_50502862/article/details/134426225