• 协同过滤算法


    1.1 你听说过推荐算法

    假如我是豆瓣的CEO,很多豆瓣的用户在豆瓣电影上都会对电影进行评分。那么根据这个评分数据,我们有可能知道这些用户除了自己评过分的电影之外还喜欢或讨厌哪些电影吗?这就是一个典型的推荐问题,解决这一类问题的算法被称为推荐算法。

    1.2 什么是协同过滤

    协同过滤的英文全称是Collaborative Filtering,简称CF。注意,这不是一款游戏!从字面上分析,协同就是寻找共同点,过滤就是筛选出优质的内容。

    1.3 协同过滤的分类

    一般来说,协同过滤推荐分为三种类型:

    1. 基于用户(user-based)的协同过滤,通过计算用户和用户的相似度找到跟用户A相似的用户B, C, D...再把这些用户喜欢的内容推荐给A;

    2. 基于物品(item-based)的协同过滤,通过计算物品和物品的相似度找到跟物品1相似的物品2, 3, 4...再把这些物品推荐给看过物品1的用户们;

    3. 基于模型(model based)的协同过滤。主流的方法可以分为:矩阵分解,关联算法,聚类算法,分类算法,回归算法,神经网络。

    1.4 矩阵分解

    矩阵分解 (decomposition, factorization)是将矩阵拆解为数个矩阵的乘积。比如豆瓣电影有m个用户,n个电影。那么用户对电影的评分可以形成一个m行n列的矩阵R,我们可以找到一个m行k列的矩阵U,和一个k行n列的矩阵I,通过U * I来得到矩阵R。

    1.5 ALS

    如果想通过矩阵分解的方法实现基于模型的协同过滤,ALS是一个不错的选择,其英文全称是Alternating Least Square,翻译过来是交替最小二乘法。假设用户为a,物品为b,评分矩阵为R(m, n),可分解为用户矩阵U(k, m)和物品矩阵I(k, n),其中m, n, k代表矩阵的维度。前方小段数学公式低能预警:

    1.6 求解用户矩阵U和物品矩阵I

    矩阵R是已知的,我们随机生成用户矩阵U,

    1. 利用1.5中的式5、R和U求出I

    2. 利用1.5中的式6、R和I求出U

    如此交替地执行步骤1和步骤2,直到算法收敛或者迭代次数超过了最大限制,最终我们用RMSE来评价模型的好坏。

    2. 实现篇

    本人用全宇宙最简单的编程语言——Python实现了ALS算法,没有依赖任何第三方库,便于学习和使用。

    2.1 创建ALS类

    初始化,存储用户ID、物品ID、用户ID与用户矩阵列号的对应关系、物品ID与物品矩阵列号的对应关系、用户已经看过哪些物品、评分矩阵的Shape以及RMSE。

    1. class ALS(object):
    2. def __init__(self):
    3. self.user_ids = None
    4. self.item_ids = None
    5. self.user_ids_dict = None
    6. self.item_ids_dict = None
    7. self.user_matrix = None
    8. self.item_matrix = None
    9. self.user_items = None
    10. self.shape = None
    11. self.rmse = None

    2.2 数据预处理

    对训练数据进行处理,得到用户ID、物品ID、用户ID与用户矩阵列号的对应关系、物品ID与物品矩阵列号的对应关系、评分矩阵的Shape、评分矩阵及评分矩阵的转置。

    1. def _process_data(self, X):
    2. self.user_ids = tuple((set(map(lambda x: x[0], X))))
    3. self.user_ids_dict = dict(map(lambda x: x[::-1],
    4. enumerate(self.user_ids)))
    5. self.item_ids = tuple((set(map(lambda x: x[1], X))))
    6. self.item_ids_dict = dict(map(lambda x: x[::-1],
    7. enumerate(self.item_ids)))
    8. self.shape = (len(self.user_ids), len(self.item_ids))
    9. ratings = defaultdict(lambda: defaultdict(int))
    10. ratings_T = defaultdict(lambda: defaultdict(int))
    11. for row in X:
    12. user_id, item_id, rating = row
    13. ratings[user_id][item_id] = rating
    14. ratings_T[item_id][user_id] = rating
    15. err_msg = "Length of user_ids %d and ratings %d not match!" % (
    16. len(self.user_ids), len(ratings))
    17. assert len(self.user_ids) == len(ratings), err_msg
    18. err_msg = "Length of item_ids %d and ratings_T %d not match!" % (
    19. len(self.item_ids), len(ratings_T))
    20. assert len(self.item_ids) == len(ratings_T), err_msg
    21. return ratings, ratings_T

    2.3 用户矩阵乘以评分矩阵

    实现稠密矩阵与稀疏矩阵的矩阵乘法,得到用户矩阵与评分矩阵的乘积。

    1. def _users_mul_ratings(self, users, ratings_T):
    2. def f(users_row, item_id):
    3. user_ids = iter(ratings_T[item_id].keys())
    4. scores = iter(ratings_T[item_id].values())
    5. col_nos = map(lambda x: self.user_ids_dict[x], user_ids)
    6. _users_row = map(lambda x: users_row[x], col_nos)
    7. return sum(a * b for a, b in zip(_users_row, scores))
    8. ret = [[f(users_row, item_id) for item_id in self.item_ids]
    9. for users_row in users.data]
    10. return Matrix(ret)

    2.4 物品矩阵乘以评分矩阵

    实现稠密矩阵与稀疏矩阵的矩阵乘法,得到物品矩阵与评分矩阵的乘积。

    1. def _items_mul_ratings(self, items, ratings):
    2. def f(items_row, user_id):
    3. item_ids = iter(ratings[user_id].keys())
    4. scores = iter(ratings[user_id].values())
    5. col_nos = map(lambda x: self.item_ids_dict[x], item_ids)
    6. _items_row = map(lambda x: items_row[x], col_nos)
    7. return sum(a * b for a, b in zip(_items_row, scores))
    8. ret = [[f(items_row, user_id) for user_id in self.user_ids]
    9. for items_row in items.data]
    10. return Matrix(ret)

    2.5 生成随机矩阵

    1. def _gen_random_matrix(self, n_rows, n_colums):
    2. data = [[random() for _ in range(n_colums)] for _ in range(n_rows)]
    3. return Matrix(data)

    2.6 计算RMSE

    1. def _get_rmse(self, ratings):
    2. m, n = self.shape
    3. mse = 0.0
    4. n_elements = sum(map(len, ratings.values()))
    5. for i in range(m):
    6. for j in range(n):
    7. user_id = self.user_ids[i]
    8. item_id = self.item_ids[j]
    9. rating = ratings[user_id][item_id]
    10. if rating > 0:
    11. user_row = self.user_matrix.col(i).transpose
    12. item_col = self.item_matrix.col(j)
    13. rating_hat = user_row.mat_mul(item_col).data[0][0]
    14. square_error = (rating - rating_hat) ** 2
    15. mse += square_error / n_elements
    16. return mse ** 0.5

    2.7 训练模型

    1. 数据预处理
    2. 变量k合法性检查
    3. 生成随机矩阵U
    4. 交替计算矩阵U和矩阵I,并打印RMSE信息,直到迭代次数达到max_iter
    5. 保存最终的RMSE
    1. def fit(self, X, k, max_iter=10):
    2. ratings, ratings_T = self._process_data(X)
    3. self.user_items = {k: set(v.keys()) for k, v in ratings.items()}
    4. m, n = self.shape
    5. error_msg = "Parameter k must be less than the rank of original matrix"
    6. assert k < min(m, n), error_msg
    7. self.user_matrix = self._gen_random_matrix(k, m)
    8. for i in range(max_iter):
    9. if i % 2:
    10. items = self.item_matrix
    11. self.user_matrix = self._items_mul_ratings(
    12. items.mat_mul(items.transpose).inverse.mat_mul(items),
    13. ratings
    14. )
    15. else:
    16. users = self.user_matrix
    17. self.item_matrix = self._users_mul_ratings(
    18. users.mat_mul(users.transpose).inverse.mat_mul(users),
    19. ratings_T
    20. )
    21. rmse = self._get_rmse(ratings)
    22. print("Iterations: %d, RMSE: %.6f" % (i + 1, rmse))
    23. self.rmse = rmse

    2.8 预测一个用户

    预测一个用户感兴趣的内容,剔除用户已看过的内容。然后按感兴趣分值排序,取出前n_items个内容。

    1. def _predict(self, user_id, n_items):
    2. users_col = self.user_matrix.col(self.user_ids_dict[user_id])
    3. users_col = users_col.transpose
    4. items_col = enumerate(users_col.mat_mul(self.item_matrix).data[0])
    5. items_scores = map(lambda x: (self.item_ids[x[0]], x[1]), items_col)
    6. viewed_items = self.user_items[user_id]
    7. items_scores = filter(lambda x: x[0] not in viewed_items, items_scores)
    8. return sorted(items_scores, key=lambda x: x[1], reverse=True)[:n_items]

    2.9 预测多个用户

    循环调用2.8,预测多个用户感兴趣的内容。

    1. def predict(self, user_ids, n_items=10):
    2. return [self._predict(user_id, n_items) for user_id in user_ids]

    3 效果评估

    3.1 main函数

    使用电影评分数据集,训练模型并统计RMSE。

    1. @run_time
    2. def main():
    3. print("Tesing the accuracy of ALS...")
    4. X = load_movie_ratings()
    5. model = ALS()
    6. model.fit(X, k=3, max_iter=5)
    7. print()
    8. print("Showing the predictions of users...")
    9. user_ids = range(1, 5)
    10. predictions = model.predict(user_ids, n_items=2)
    11. for user_id, prediction in zip(user_ids, predictions):
    12. _prediction = [format_prediction(item_id, score)
    13. for item_id, score in prediction]
    14. print("User id:%d recommedation: %s" % (user_id, _prediction))

    3.2 效果展示

    设置k=3,迭代5次,并展示了前4个用户的推荐内容,最终RMSE为0.370,运行时间46.5秒,效果还算不错~

     

     数据资源链接:

    recommend-CF/ml-1m at master · iamccme/recommend-CF · GitHub

    基于物品的协同过滤算法,方法为ItemCF

    1. import sys
    2. import random
    3. import math
    4. from operator import itemgetter
    5. def ReadData(file,data):
    6. ''' 读取评分数据
    7. @param file 评分数据文件
    8. @param data 储存评分数据的List
    9. '''
    10. for line in file :
    11. line = line.strip('\n')
    12. linelist = line.split("::")
    13. data.append([linelist[0],linelist[1]])
    14. def SplitData(data, M, key, seed):
    15. ''' 将数据分为训练集和测试集
    16. @param data 储存训练和测试数据的List
    17. @param M 将数据分为M份
    18. @param key 选取第key份数据做为测试数据
    19. @param seed 随机种子
    20. @return train 训练数据集Dict
    21. @return test 测试数据集Dict
    22. '''
    23. test = dict ()
    24. train = dict ()
    25. random.seed(seed)
    26. for user,item in data:
    27. if random.randint(0,M) == key:
    28. if user in test:
    29. test[user].append(item)
    30. else:
    31. test[user] = []
    32. else:
    33. if user in train:
    34. train[user].append(item)
    35. else:
    36. train[user] = []
    37. return train, test
    38. def UserSimilarityOld(train):
    39. W = dict()
    40. for u in train.keys():
    41. W[u] = dict()
    42. for v in train.keys():
    43. if u == v:
    44. continue
    45. W[u][v] = len(list(set(train[u]) & set(train[v])))
    46. W[u][v] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)
    47. return W
    48. def ItemSimilarity(train):
    49. ''' 计算物品相似度
    50. @param train 训练数据集Dict
    51. @return W 记录用户相似度的二维矩阵
    52. '''
    53. C = dict()
    54. N = dict()
    55. #计算没两个item共有的user数目
    56. for u, items in train.items():
    57. for i in items:
    58. if i not in N:
    59. N[i] = 0
    60. N[i] += 1
    61. for j in items:
    62. if i == j:
    63. continue
    64. if i not in C :
    65. C[i] = dict()
    66. if j not in C[i]:
    67. C[i][j] = 0
    68. C[i][j] += 1
    69. W = dict()
    70. for i, related_items in C.items():
    71. for j, cij in related_items.items():
    72. if i not in W :
    73. W[i] = dict()
    74. W[i][j] = cij / math.sqrt(N[i] * N[j])
    75. return W
    76. def Coverage(train, test, W, N, K):
    77. ''' 获取推荐结果
    78. @param user 输入的用户
    79. @param train 训练数据集Dict
    80. @param W 记录用户相似度的二维矩阵
    81. @param N 推荐结果的数目
    82. @param K 选取近邻的数目
    83. '''
    84. recommned_items = set()
    85. all_items = set()
    86. for user in train.keys():
    87. for item in train[user]:
    88. all_items.add(item)
    89. rank = GetRecommendation(user, train, W, N, K)
    90. for item, pui in rank:
    91. recommned_items.add(item)
    92. print 'len: ',len(recommned_items),'\n'
    93. return len(recommned_items) / (len(all_items) * 1.0)
    94. def GetRecommendation(user, train ,W, N, K):
    95. ''' 获取推荐结果
    96. @param user 输入的用户
    97. @param train 训练数据集Dict
    98. @param W 记录用户相似度的二维矩阵
    99. @param N 推荐结果的数目
    100. @param K 选取近邻的数目
    101. '''
    102. rank = dict()
    103. ru = train[user]
    104. for i in ru:
    105. for j,wj in sorted(W[i].items(), key=itemgetter(1),\
    106. reverse = True)[0:K]:
    107. if j in ru:
    108. continue
    109. if j in rank:
    110. rank[j] += wj
    111. else:
    112. rank[j] = 0
    113. rank = sorted(rank.items(), key=itemgetter(1), reverse = True)[0:N]
    114. return rank
    115. def Recall(train, test, W, N, K):
    116. ''' 计算推荐结果的召回率
    117. @param train 训练数据集Dict
    118. @param test 测试数据集Dict
    119. @param W 记录用户相似度的二维矩阵
    120. @param N 推荐结果的数目
    121. @param K 选取近邻的数目
    122. '''
    123. hit = 0
    124. all = 0
    125. for user in train.keys():
    126. if user in test:
    127. tu = test[user]
    128. rank = GetRecommendation(user, train, W, N, K)
    129. for item, pui in rank:
    130. if item in tu:
    131. hit+= 1
    132. all += len(tu)
    133. print(hit)
    134. print(all)
    135. return hit/(all * 1.0)
    136. def Precision(train, test, W, N, K):
    137. ''' 计算推荐结果的准确率
    138. @param train 训练数据集Dict
    139. @param test 测试数据集Dict
    140. @param W 记录用户相似度的二维矩阵
    141. @param N 推荐结果的数目
    142. @param K 选取近邻的数目
    143. '''
    144. hit = 0
    145. all = 0
    146. for user in train.keys():
    147. if user in test:
    148. tu = test[user]
    149. rank = GetRecommendation(user, train, W, N, K)
    150. for item, pui in rank:
    151. if item in tu:
    152. hit+= 1
    153. all += N
    154. print(hit)
    155. print(all)
    156. return hit/(all * 1.0)
    157. def Popularity(train, test, W, N, K):
    158. ''' 计算推荐结果的流行度
    159. @param train 训练数据集Dict
    160. @param test 测试数据集Dict
    161. @param W 记录用户相似度的二维矩阵
    162. @param N 推荐结果的数目
    163. @param K 选取近邻的数目
    164. '''
    165. item_popularity = dict()
    166. for user, items in train.items():
    167. for item in items:
    168. if item not in item_popularity:
    169. item_popularity[item] = 0
    170. item_popularity[item] += 1
    171. ret = 0
    172. n = 0
    173. for user in train.keys():
    174. rank = GetRecommendation(user, train, W, N, K)
    175. for item, pui in rank:
    176. ret += math.log(1+ item_popularity[item])
    177. n += 1
    178. ret /= n * 1.0
    179. return ret
    180. if __name__ == '__main__':
    181. data = []
    182. M = 7
    183. key = 1
    184. seed = 1
    185. N = 10
    186. K = 10
    187. W = dict()
    188. rank = dict()
    189. print("this is the main function")
    190. file = open('./ml-1m/rating.dat')
    191. ReadData(file,data)
    192. train,test = SplitData(data, M, key, seed)
    193. W = ItemSimilarity(train)
    194. recall = Recall(train, test, W, N, K)
    195. precision = Precision(train, test, W, N, K)
    196. popularity = Popularity(train, test, W, N, K)
    197. coverage = Coverage(train, test, W, N, K)
    198. print 'recall: ',recall,'\n'
    199. print 'precision: ',precision,'\n'
    200. print 'Popularity: ',popularity,'\n'
    201. print 'coverage: ', coverage,'\n'
    202. else :
    203. print("this is not the main function")

    基于用户的协同过滤算法UserCF

    1. import sys
    2. import random
    3. import math
    4. from operator import itemgetter
    5. def ReadData(file,data):
    6. ''' 读取评分数据
    7. @param file 评分数据文件
    8. @param data 储存评分数据的List
    9. '''
    10. for line in file:
    11. line = line.strip('\n')
    12. linelist = line.split("::")
    13. data.append([linelist[0],linelist[1]])
    14. def SplitData(data, M, key, seed):
    15. ''' 将数据分为训练集和测试集
    16. @param data 储存训练和测试数据的List
    17. @param M 将数据分为M份
    18. @param key 选取第key份数据做为测试数据
    19. @param seed 随机种子
    20. @return train 训练数据集Dict
    21. @return test 测试数据集Dict
    22. '''
    23. test = dict ()
    24. train = dict ()
    25. random.seed(seed)
    26. for user,item in data:
    27. if random.randint(0,M) == key:
    28. if user in test:
    29. test[user].append(item)
    30. else:
    31. test[user] = []
    32. else:
    33. if user in train:
    34. train[user].append(item)
    35. else:
    36. train[user] = []
    37. return train, test
    38. def UserSimilarity(train):
    39. ''' 计算用户相似度
    40. @param train 训练数据集Dict
    41. @return W 记录用户相似度的二维矩阵
    42. '''
    43. #建立物品到用户之间的倒查表,降低计算用户相似度的时间复杂性
    44. item_users = dict()
    45. for u, items in train.items():
    46. for i in items:
    47. if(i not in item_users):
    48. item_users[i] = set()
    49. item_users[i].add(u)
    50. C = dict()
    51. N = dict()
    52. #计算用户之间共有的item的数目
    53. for i, users in item_users.items():
    54. for u in users:
    55. if(u not in N):
    56. N[u] = 1
    57. N[u] += 1
    58. for v in users:
    59. if u == v:
    60. continue
    61. if(u not in C):
    62. C[u] = dict()
    63. if(v not in C[u]):
    64. C[u][v] = 0
    65. #对热门物品进行了惩罚,采用这种方法被称做UserCF-IIF
    66. C[u][v] += (1 / math.log(1+len(users)))
    67. W = dict()
    68. for u, related_users in C.items():
    69. for v, cuv in related_users.items():
    70. if(u not in W):
    71. W[u] = dict()
    72. #利用余弦相似度计算用户之间的相似度
    73. W[u][v] = cuv / math.sqrt(N[u] * N[v])
    74. return W
    75. def Coverage(train, test, W, N, K):
    76. ''' 获取推荐结果
    77. @param user 输入的用户
    78. @param train 训练数据集Dict
    79. @param W 记录用户相似度的二维矩阵
    80. @param N 推荐结果的数目
    81. @param K 选取近邻的数目
    82. '''
    83. recommned_items = set()
    84. all_items = set()
    85. for user in train.keys():
    86. for item in train[user]:
    87. all_items.add(item)
    88. rank = GetRecommendation(user, train, W, N, K)
    89. for item, pui in rank:
    90. recommned_items.add(item)
    91. #print 'len: ',len(recommned_items),'\n'
    92. return len(recommned_items) / (len(all_items) * 1.0)
    93. def GetRecommendation(user, train ,W, N, K):
    94. ''' 获取推荐结果
    95. @param user 输入的用户
    96. @param train 训练数据集Dict
    97. @param W 记录用户相似度的二维矩阵
    98. @param N 推荐结果的数目
    99. @param K 选取近邻的数目
    100. '''
    101. rank = dict()
    102. interacted_items = train[user]
    103. #选取K个近邻计算得分
    104. for v,wuv in sorted(W[user].items(), key=itemgetter(1),\
    105. reverse = True)[0:K]:
    106. for i in train[v]:
    107. if i in interacted_items:
    108. continue
    109. if i in rank:
    110. rank[i] += wuv
    111. else:
    112. rank[i] = 0
    113. #取得分最高的N个item作为推荐结果
    114. rank = sorted(rank.items(), key=itemgetter(1), reverse = True)[0:N]
    115. return rank
    116. def Recall(train, test, W, N, K):
    117. ''' 计算推荐结果的召回率
    118. @param train 训练数据集Dict
    119. @param test 测试数据集Dict
    120. @param W 记录用户相似度的二维矩阵
    121. @param N 推荐结果的数目
    122. @param K 选取近邻的数目
    123. '''
    124. hit = 0
    125. all = 0
    126. for user in train.keys():
    127. if user in test:
    128. tu = test[user]
    129. rank = GetRecommendation(user, train, W, N, K)
    130. for item, pui in rank:
    131. if item in tu:
    132. hit+= 1
    133. all += len(tu)
    134. #print(hit)
    135. #print(all)
    136. return hit/(all * 1.0)
    137. def Precision(train, test, W, N, K):
    138. ''' 计算推荐结果的准确率
    139. @param train 训练数据集Dict
    140. @param test 测试数据集Dict
    141. @param W 记录用户相似度的二维矩阵
    142. @param N 推荐结果的数目
    143. @param K 选取近邻的数目
    144. '''
    145. hit = 0
    146. all = 0
    147. for user in train.keys():
    148. if user in test:
    149. tu = test[user]
    150. rank = GetRecommendation(user, train, W, N, K)
    151. for item, pui in rank:
    152. if item in tu:
    153. hit+= 1
    154. all += N
    155. #print(hit)
    156. #print(all)
    157. return hit/(all * 1.0)
    158. def Popularity(train, test, W, N, K):
    159. ''' 计算推荐结果的流行度
    160. @param train 训练数据集Dict
    161. @param test 测试数据集Dict
    162. @param W 记录用户相似度的二维矩阵
    163. @param N 推荐结果的数目
    164. @param K 选取近邻的数目
    165. '''
    166. item_popularity = dict()
    167. for user, items in train.items():
    168. for item in items:
    169. if item not in item_popularity:
    170. item_popularity[item] = 0
    171. item_popularity[item] += 1
    172. ret = 0
    173. n = 0
    174. for user in train.keys():
    175. rank = GetRecommendation(user, train, W, N, K)
    176. for item, pui in rank:
    177. ret += math.log(1+ item_popularity[item])
    178. n += 1
    179. ret /= n * 1.0
    180. return ret
    181. if __name__ == '__main__':
    182. data = []
    183. M = 8
    184. key = 1
    185. seed = 1
    186. N = 10
    187. K = 80
    188. W = dict()
    189. rank = dict()
    190. print("this is the main function")
    191. file = open('./ml-1m/rat.dat')
    192. ReadData(file,data)
    193. train,test = SplitData(data, M, key, seed)
    194. W = UserSimilarity(train)
    195. recall = Recall(train, test, W, N, K)
    196. precision = Precision(train, test, W, N, K)
    197. popularity = Popularity(train, test, W, N, K)
    198. coverage = Coverage(train, test, W, N, K)
    199. print 'recall: ',recall,'\n'
    200. print 'precision: ',precision,'\n'
    201. print 'Popularity: ',popularity,'\n'
    202. print 'coverage: ', coverage,'\n'
    203. else :
    204. print("this is not the main function")

  • 相关阅读:
    Functional Programming in Java venkat(14) Being Lazy
    OpenCV 4.x 版本的新特性都有哪些?
    【SpringCloud-学习笔记】Ribbon负载均衡
    Linux服务器安装MariaDB数据库,配置linux防火墙策略
    论文翻译:2022_DeepFilterNet2: Towards Real-Time Speech Enhancement On Embedded Devices For Fullband Audio
    学术报告系列(六) - Autonomous Driving on the journey to full autonomy
    Hive矢量化
    [笔记] 线性筛质数(素数)
    电脑不小心删除的文件怎么恢复?
    JDK:Font.canDisplay()是如何生效的
  • 原文地址:https://blog.csdn.net/sikh_0529/article/details/126861978