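Prof. Min's article: "日撸 Java 三百行(总述)" (Three Hundred Lines of Java a Day, Overview), on minfanphd's CSDN blog
I also keep my hand-typed code on GitHub: https://github.com/fulisha-ok/sampledata
For example, in the KNN of days 51-53 we predict the class of an object: we use the distance between the test sample and the training samples to find the k most similar neighbors, and let these k neighbors vote to predict the class of the test sample. M-distance, by contrast, computes the distance between two users (or items) from their average ratings.
The figure shows an example of item-based rating prediction. Suppose we want to predict user u0's rating of m2; how do we find the neighbors? We compute the average rating of every item, take the items that u0 has rated and whose average lies within a given radius of m2's average, and predict the rating as the mean of u0's ratings on those items.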
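To make the neighbor rule concrete, here is a minimal sketch of a single item-based prediction. The class name `MDistanceItemSketch`, the method `predict`, and all numbers are made up for illustration; they are not taken from the MovieLens data or from the original code.

```java
class MDistanceItemSketch {
    /**
     * Predicts one user's rating of targetItem: the mean of that user's ratings
     * on items whose average rating lies within radius of the target item's
     * average. Falls back to defaultRating when no such item exists.
     */
    static double predict(double[] itemAverages, int[] ratedItems, int[] ratings,
            int targetItem, double radius, double defaultRating) {
        double total = 0;
        int neighbors = 0;
        for (int k = 0; k < ratedItems.length; k++) {
            int item = ratedItems[k];
            if (item == targetItem) {
                continue; // Never use the target item itself.
            }
            if (Math.abs(itemAverages[targetItem] - itemAverages[item]) < radius) {
                total += ratings[k];
                neighbors++;
            }
        }
        return neighbors > 0 ? total / neighbors : defaultRating;
    }

    public static void main(String[] args) {
        // Hypothetical average ratings of items m0..m4.
        double[] itemAverages = {3.9, 2.5, 4.0, 4.2, 3.0};
        // Hypothetical ratings of user u0 on m0, m1, m3, m4 (m2 is unknown).
        int[] ratedItems = {0, 1, 3, 4};
        int[] ratings = {4, 2, 5, 3};
        // With radius 0.3 only m0 and m3 are neighbors of m2, so the result is (4 + 5) / 2.
        System.out.println(predict(itemAverages, ratedItems, ratings, 2, 0.3, 3.0));
    }
}
```

Only m0 and m3 have averages close enough to m2's average, so the items with very different averages (m1, m4) do not influence the prediction.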

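The data file contains 100,000 records; part of the data is shown below:

Each line is a (user, item, rating) triple. For example, the first line means that user 0 gave item 0 a rating of 5. There are 943 users, 1682 movies, and 100,000 ratings. Some of the variables are explained below.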
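- numItems, numUsers, numRatings (the number of items, users, and non-zero ratings)
private int numItems;
private int numUsers;
private int numRatings;
- compressedRatingMatrix (the compressed rating matrix, storing user-item-rating triples)
private int[][] compressedRatingMatrix;
- userDegrees (how many items each user has rated)
private int[] userDegrees;
- userStartingIndices (where each user's ratings start in compressedRatingMatrix)
private int[] userStartingIndices;
- userAverageRatings (the average of the ratings each user has given)
private double[] userAverageRatings;
- itemDegrees (how many times each item has been rated)
private int[] itemDegrees;
- itemAverageRatings (the average rating each item has received)
private double[] itemAverageRatings;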

The MBR constructor assigns and initializes the variables above.
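As a rough illustration of the compressed layout, the sketch below derives the per-user starting indices from a handful of triples that are sorted by user. Note that it uses prefix sums over the user degrees, which is my own simplification and not how the constructor in this post does it (the constructor detects the line where the user id changes). The class name and the data are hypothetical.

```java
import java.util.Arrays;

class CompressedMatrixSketch {
    /**
     * Returns an array of length numUsers + 1 where entry u is the index of
     * user u's first triple and the last entry is the total number of triples.
     * Assumes the triples {user, item, rating} are sorted by user id.
     */
    static int[] userStartingIndices(int[][] triples, int numUsers) {
        int[] degrees = new int[numUsers];
        for (int[] triple : triples) {
            degrees[triple[0]]++;
        }
        int[] starts = new int[numUsers + 1];
        for (int u = 0; u < numUsers; u++) {
            starts[u + 1] = starts[u] + degrees[u];
        }
        return starts;
    }

    public static void main(String[] args) {
        // Made-up data: user 0 has 2 ratings, user 1 has 1, user 2 has 2.
        int[][] triples = {{0, 0, 5}, {0, 1, 3}, {1, 0, 4}, {2, 1, 2}, {2, 2, 5}};
        // Prints [0, 2, 3, 5]
        System.out.println(Arrays.toString(userStartingIndices(triples, 3)));
    }
}
```

With these offsets, the ratings of user u occupy the rows from starts[u] (inclusive) to starts[u + 1] (exclusive), which is exactly the range the inner loop of the prediction walks over.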
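Leave-one-out was already used with KNN: each sample in the data set is taken in turn as the test set, and all the remaining samples serve as the training set.
Take one test sample as an example. Suppose the first line (0,0,5) is the test sample: we know that user 0 rated item 0 with 5 points, and we now predict this rating from the other items that user 0 has rated. The steps are as follows: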


public void leaveOneOutPrediction() {
double tempItemAverageRating;
// Make each line of the code shorter.
int tempUser, tempItem, tempRating;
System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
numNonNeighbors = 0;
for (int i = 0; i < numRatings; i++) {
tempUser = compressedRatingMatrix[i][0];
tempItem = compressedRatingMatrix[i][1];
tempRating = compressedRatingMatrix[i][2];
// Step 1. Recompute average rating of the current item.
tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
/ (itemDegrees[tempItem] - 1);
// Step 2. Recompute neighbors, at the same time obtain the ratings
// Of neighbors.
int tempNeighbors = 0;
double tempTotal = 0;
int tempComparedItem;
for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
tempComparedItem = compressedRatingMatrix[j][1];
if (tempItem == tempComparedItem) {
continue;// Ignore itself.
} // Of if
if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
tempTotal += compressedRatingMatrix[j][2];
tempNeighbors++;
}
}
// Step 3. Predict as the average value of neighbors.
if (tempNeighbors > 0) {
predictions[i] = tempTotal / tempNeighbors;
} else {
predictions[i] = DEFAULT_RATING;
numNonNeighbors++;
}
}
}
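Step 1 of the method above removes one rating from an average without rescanning the data. The small sketch below isolates that formula; the class and method names are hypothetical and the ratings are made up.

```java
class LeaveOneOutAverageSketch {
    /**
     * Average of the remaining values after removing one value from a set
     * whose average and size are known: (average * count - removed) / (count - 1).
     * For count == 1 the result is NaN (0.0 / 0), so no neighbor test can pass.
     */
    static double averageWithout(double average, int count, double removed) {
        return (average * count - removed) / (count - 1);
    }

    public static void main(String[] args) {
        // An item rated 5, 3 and 4 has average 4.0; leaving out the 5 gives (3 + 4) / 2.
        System.out.println(averageWithout(4.0, 3, 5)); // Prints 3.5
    }
}
```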
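MAE (mean absolute error) measures the average absolute deviation between the predicted and the actual values. The smaller the MAE, the smaller the deviation between predictions and actual values, and the more accurate the model.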
public double computeMAE() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
}
return tempTotalError / predictions.length;
}
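RMSE (root mean squared error) measures the deviation between the predicted and the actual values using squared differences. The smaller the RMSE, the smaller the mean squared error between predictions and actual values, and the more accurate the model. Note that the method below is named computeRSME.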
public double computeRSME() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
* (predictions[i] - compressedRatingMatrix[i][2]);
}
double tempAverage = tempTotalError / predictions.length;
return Math.sqrt(tempAverage);
}
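To see how the two measures differ, here is a tiny standalone sketch on made-up numbers. The class `ErrorMetricsSketch` and its methods are hypothetical helpers that mirror the two formulas, not part of the MBR class.

```java
class ErrorMetricsSketch {
    /** Mean absolute error: the average of |predicted - actual|. */
    static double mae(double[] predicted, double[] actual) {
        double total = 0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    /** Root mean squared error: the square root of the average squared difference. */
    static double rmse(double[] predicted, double[] actual) {
        double total = 0;
        for (int i = 0; i < predicted.length; i++) {
            double difference = predicted[i] - actual[i];
            total += difference * difference;
        }
        return Math.sqrt(total / predicted.length);
    }

    public static void main(String[] args) {
        // Made-up predictions and true ratings; the errors are 0.5, 0 and 2.
        double[] predicted = {4.5, 3.0, 2.0};
        double[] actual = {5, 3, 4};
        System.out.println("MAE = " + mae(predicted, actual));
        System.out.println("RMSE = " + rmse(predicted, actual));
    }
}
```

Because the differences are squared before averaging, the single large error of 2 weighs more heavily in the RMSE than in the MAE, so the RMSE comes out larger here.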



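For the user-based version, my first idea was also to refill compressedRatingMatrix so that the roles of users and items in the array are swapped. In the end, though, I decided to code it with lists. My rough approach is as follows:
Abstract the three fields of each record in the file into an object Text, where userNum is the user's number, itemNum is the item's number, and score is the rating the user gave the item.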
class Text{
private Integer userNum;
private Integer itemNum;
private Integer score;
public Integer getUserNum() {
return userNum;
}
public void setUserNum(Integer userNum) {
this.userNum = userNum;
}
public Integer getItemNum() {
return itemNum;
}
public void setItemNum(Integer itemNum) {
this.itemNum = itemNum;
}
public Integer getScore() {
return score;
}
public void setScore(Integer score) {
this.score = score;
}
public Text(Integer userNum, Integer itemNum, Integer score) {
this.userNum = userNum;
this.itemNum = itemNum;
this.score = score;
}
}
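The Stream API (used together with lambda expressions) is well documented online; it supports quite complex search, filter, and selection operations on collections.
My rough approach is to put the records of the file into a list: List<Text>
Using streams:
// Group by movie number
textGroupByItem = textList.stream().collect(Collectors.groupingBy(Text::getItemNum));
// Group by user number
textGroupByUser = textList.stream().collect(Collectors.groupingBy(Text::getUserNum));
// Sum one property over a list
tempUserTotalScore[i] = textsByUser.stream().mapToDouble(Text::getScore).sum();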
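As a small illustration of the grouping and aggregation calls above, the sketch below groups a few made-up ratings by item and averages them in one pass. It uses Collectors.averagingInt as the downstream collector, which is a variation of my own; the code in this post groups first and sums afterwards. The class `GroupingSketch` and its nested `Rating` class are hypothetical stand-ins for Text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupingSketch {
    /** Hypothetical stand-in for the Text class: one (user, item, score) record. */
    static final class Rating {
        private final int user;
        private final int item;
        private final int score;

        Rating(int user, int item, int score) {
            this.user = user;
            this.item = item;
            this.score = score;
        }

        int getUser() { return user; }
        int getItem() { return item; }
        int getScore() { return score; }
    }

    /** Maps each item number to the average of the scores it received. */
    static Map<Integer, Double> averageByItem(List<Rating> ratings) {
        return ratings.stream().collect(
                Collectors.groupingBy(Rating::getItem, Collectors.averagingInt(Rating::getScore)));
    }

    public static void main(String[] args) {
        List<Rating> ratings = Arrays.asList(
                new Rating(0, 0, 5), new Rating(0, 1, 3),
                new Rating(1, 0, 4), new Rating(2, 1, 2));
        Map<Integer, Double> averages = averageByItem(ratings);
        // Item 0 was rated 5 and 4, item 1 was rated 3 and 2.
        System.out.println(averages.get(0) + " " + averages.get(1)); // Prints 4.5 2.5
    }
}
```

Items that nobody rated simply do not appear as keys in the resulting map, so a lookup for them returns null and has to be handled by the caller.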
Code:
public MBR(String paraFileName, int paraNumUsers, int paraNumItems, int paraNumRatings, boolean basedUser) throws Exception {
if (basedUser){
// User-based computation
//step1. initialize these arrays
numItems = paraNumItems;
numUsers = paraNumUsers;
numRatings = paraNumRatings;
userDegrees = new int[numUsers];
userAverageRatings = new double[numUsers];
itemDegrees = new int[numItems];
itemAverageRatings = new double[numItems];
predictions = new double[numRatings];
System.out.println("Reading " + paraFileName);
//step2. Read the data file
File tempFile = new File(paraFileName);
if (!tempFile.exists()) {
System.out.println("File " + paraFileName + " does not exist.");
System.exit(0);
}
BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
String tempString;
String[] tempStrArray;
while ((tempString = tempBufReader.readLine()) != null) {
// Each line has three values
tempStrArray = tempString.split(",");
// Read the record into the textList list
Text text = new Text(Integer.parseInt(tempStrArray[0]), Integer.parseInt(tempStrArray[1]), Integer.parseInt(tempStrArray[2]));
textList.add(text);
}
tempBufReader.close();
// Group by movie number, then by user number
textGroupByItem = textList.stream().collect(Collectors.groupingBy(Text::getItemNum));
textGroupByUser = textList.stream().collect(Collectors.groupingBy(Text::getUserNum));
double[] tempUserTotalScore = new double[numUsers];
double[] tempItemTotalScore = new double[numItems];
for (int i = 0; i < numUsers; i++) {
// Total score of the user
List<Text> textsByUser = textGroupByUser.get(i);
tempUserTotalScore[i] = textsByUser.stream().mapToDouble(Text::getScore).sum();
userDegrees[i] = textsByUser.size();
userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
}
for (int i = 0; i < numItems; i++) {
try {
// Total score of the movie
List<Text> textsByItem = textGroupByItem.get(i);
tempItemTotalScore[i] = textsByItem.stream().mapToDouble(Text::getScore).sum();
itemDegrees[i] = textsByItem.size();
itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
}
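In the leave-one-out test, take the first record in the file (0,0,5) as an example: we want to predict user 0's rating of item 0.
(1) Step 1: exclude user 0's rating of item 0 (user 0 rated 272 items) and recompute user 0's average rating (over the remaining 271 items).
(2) Step 2: look at the users who rated item 0 (452 of them). For each of them, check whether the difference between that user's average rating and the recomputed average is within the radius, and accumulate the number of neighbors and the sum of their ratings of item 0.
(3) Step 3: predict user 0's rating of item 0 as the mean of the neighbors' ratings, or as the default rating if there is no neighbor.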
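The three steps above can be sketched on a tiny made-up data set. This is only an illustration under my own assumptions: the class `UserBasedSketch`, its `predict` method, and the triples are hypothetical, and plain arrays are used instead of the grouped lists so that the example stays short.

```java
class UserBasedSketch {
    /**
     * Leave-one-out, user-based prediction for the triple at position index.
     * Each triple is {user, item, score}.
     */
    static double predict(int[][] triples, int numUsers, int index,
            double radius, double defaultRating) {
        int user = triples[index][0];
        int item = triples[index][1];
        int score = triples[index][2];

        // Per-user rating sums and counts over the whole data set.
        double[] sums = new double[numUsers];
        int[] counts = new int[numUsers];
        for (int[] triple : triples) {
            sums[triple[0]] += triple[2];
            counts[triple[0]]++;
        }

        // Step 1: the target user's average without the held-out rating.
        // If this was the user's only rating the result is NaN and no neighbor matches.
        double targetAverage = (sums[user] - score) / (counts[user] - 1);

        // Step 2: other users who rated the same item and have a close average.
        double total = 0;
        int neighbors = 0;
        for (int[] triple : triples) {
            if (triple[1] != item || triple[0] == user) {
                continue; // Different item, or the held-out rating itself.
            }
            double otherAverage = sums[triple[0]] / counts[triple[0]];
            if (Math.abs(targetAverage - otherAverage) < radius) {
                total += triple[2];
                neighbors++;
            }
        }

        // Step 3: mean of the neighbors' ratings, or the default rating.
        return neighbors > 0 ? total / neighbors : defaultRating;
    }

    public static void main(String[] args) {
        int[][] triples = {
                {0, 0, 5}, {0, 1, 3}, {0, 2, 4},
                {1, 0, 4}, {1, 1, 3},
                {2, 0, 1}, {2, 2, 2}};
        // User 0 without (0,0,5) averages 3.5; user 1 averages 3.5, user 2 averages 1.5.
        // Only user 1 is a neighbor, so the prediction is user 1's rating of item 0.
        System.out.println(predict(triples, 3, 0, 0.5, 3.0)); // Prints 4.0
    }
}
```

Note that the held-out user is skipped by comparing user ids, which is the check that matters when walking over the ratings of one item.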
Using streams:
// Filter a list
textsByUser = textsByUser.stream().filter(e -> !e.getItemNum().equals(outItem)).collect(Collectors.toList());
Code:
public void leaveOneOutPredictionByUser() {
double tempItemAverageRating;
// Make each line of the code shorter.
int tempUser, tempItem, tempRating;
System.out.println("\r\nLeaveOneOutPredictionUser for radius " + radius);
numNonNeighbors = 0;
for (int i = 0; i < numRatings; i++) {
Text text = textList.get(i);
tempUser = text.getUserNum();
tempItem = text.getItemNum();
tempRating = text.getScore();
// Step 1. Recompute average rating of the current user.
List<Text> textsByUser = textGroupByUser.get(tempUser);
Integer outItem = tempItem;
textsByUser = textsByUser.stream().filter(e -> !e.getItemNum().equals(outItem)).collect(Collectors.toList());
tempItemAverageRating = textsByUser.stream().mapToDouble(Text::getScore).sum() / textsByUser.size();
// Step 2. Recompute neighbors, at the same time obtain the ratings
// Of neighbors.
int tempNeighbors = 0;
double tempTotal = 0;
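List<Text> texts = textGroupByItem.get(tempItem);
for (int j = 0; j < texts.size(); j++) {
Text userText = texts.get(j);
// Compare user numbers, not the list index j, to skip the held-out rating.
if (tempUser == userText.getUserNum()) {
continue;// Ignore itself.
}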
if (Math.abs(tempItemAverageRating - userAverageRatings[userText.getUserNum()]) < radius) {
tempTotal += userText.getScore();
tempNeighbors++;
}
}
// Step 3. Predict as the average value of neighbors.
if (tempNeighbors > 0) {
predictions[i] = tempTotal / tempNeighbors;
} else {
predictions[i] = DEFAULT_RATING;
numNonNeighbors++;
}
}
}

package machinelearing.knn;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class MBR {
/**
* Default rating for 1-5 points.
*/
public static final double DEFAULT_RATING = 3.0;
/**
* The total number of users who gave ratings.
*/
private int numUsers;
/**
* The total number of rated items.
*/
private int numItems;
/**
* The total number of ratings (non-zero values).
*/
private int numRatings;
/**
* The predictions.
*/
private double[] predictions;
/**
* Compressed rating matrix, stored as user-item-rating triples.
*/
private int[][] compressedRatingMatrix;
/**
* The degree of users (how many items each user has rated).
*/
private int[] userDegrees;
/**
* The average rating of each user.
*/
private double[] userAverageRatings;
/**
* The degree of items (how many times each item has been rated).
*/
private int[] itemDegrees;
/**
* The average rating of each item.
*/
private double[] itemAverageRatings;
/**
* Starting index of each user's ratings in compressedRatingMatrix. The first user starts at 0;
* if the first user has x ratings, the second user starts at x.
*/
private int[] userStartingIndices;
/**
* Number of predictions for which no neighbor was found within the radius.
*/
private int numNonNeighbors;
/**
* The radius (delta) for determining the neighborhood. Objects within this radius are treated as neighbors.
*/
private double radius;
List<Text> textList = new ArrayList<>();
private Map<Integer, List<Text>> textGroupByItem = new HashMap<>();
private Map<Integer, List<Text>> textGroupByUser = new HashMap<>();
class Text{
private Integer userNum;
private Integer itemNum;
private Integer score;
public Integer getUserNum() {
return userNum;
}
public void setUserNum(Integer userNum) {
this.userNum = userNum;
}
public Integer getItemNum() {
return itemNum;
}
public void setItemNum(Integer itemNum) {
this.itemNum = itemNum;
}
public Integer getScore() {
return score;
}
public void setScore(Integer score) {
this.score = score;
}
public Text(Integer userNum, Integer itemNum, Integer score) {
this.userNum = userNum;
this.itemNum = itemNum;
this.score = score;
}
}
public MBR(String paraFileName, int paraNumUsers, int paraNumItems, int paraNumRatings) throws Exception{
//step1. initialize these arrays
numItems = paraNumItems;
numUsers = paraNumUsers;
numRatings = paraNumRatings;
userDegrees = new int[numUsers];
userStartingIndices = new int[numUsers + 1];
userAverageRatings = new double[numUsers];
itemDegrees = new int[numItems];
compressedRatingMatrix = new int[numRatings][3];
itemAverageRatings = new double[numItems];
predictions = new double[numRatings];
System.out.println("Reading " + paraFileName);
//step2. Read the data file
File tempFile = new File(paraFileName);
if (!tempFile.exists()) {
System.out.println("File " + paraFileName + " does not exist.");
System.exit(0);
}
BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
String tempString;
String[] tempStrArray;
int tempIndex = 0;
userStartingIndices[0] = 0;
userStartingIndices[numUsers] = numRatings;
while ((tempString = tempBufReader.readLine()) != null) {
// Each line has three values
tempStrArray = tempString.split(",");
compressedRatingMatrix[tempIndex][0] = Integer.parseInt(tempStrArray[0]);
compressedRatingMatrix[tempIndex][1] = Integer.parseInt(tempStrArray[1]);
compressedRatingMatrix[tempIndex][2] = Integer.parseInt(tempStrArray[2]);
userDegrees[compressedRatingMatrix[tempIndex][0]]++;
itemDegrees[compressedRatingMatrix[tempIndex][1]]++;
if (tempIndex > 0) {
// Starting to read the data of a new user.
if (compressedRatingMatrix[tempIndex][0] != compressedRatingMatrix[tempIndex - 1][0]) {
userStartingIndices[compressedRatingMatrix[tempIndex][0]] = tempIndex;
}
}
tempIndex++;
}
tempBufReader.close();
double[] tempUserTotalScore = new double[numUsers];
double[] tempItemTotalScore = new double[numItems];
for (int i = 0; i < numRatings; i++) {
tempUserTotalScore[compressedRatingMatrix[i][0]] += compressedRatingMatrix[i][2];
tempItemTotalScore[compressedRatingMatrix[i][1]] += compressedRatingMatrix[i][2];
}
for (int i = 0; i < numUsers; i++) {
userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
}
for (int i = 0; i < numItems; i++) {
itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
}
}
public MBR(String paraFileName, int paraNumUsers, int paraNumItems, int paraNumRatings, boolean basedUser) throws Exception {
if (basedUser){
// User-based computation
//step1. initialize these arrays
numItems = paraNumItems;
numUsers = paraNumUsers;
numRatings = paraNumRatings;
userDegrees = new int[numUsers];
userAverageRatings = new double[numUsers];
itemDegrees = new int[numItems];
itemAverageRatings = new double[numItems];
predictions = new double[numRatings];
System.out.println("Reading " + paraFileName);
//step2. Read the data file
File tempFile = new File(paraFileName);
if (!tempFile.exists()) {
System.out.println("File " + paraFileName + " does not exist.");
System.exit(0);
}
BufferedReader tempBufReader = new BufferedReader(new FileReader(tempFile));
String tempString;
String[] tempStrArray;
while ((tempString = tempBufReader.readLine()) != null) {
// Each line has three values
tempStrArray = tempString.split(",");
// Read the record into the textList list
Text text = new Text(Integer.parseInt(tempStrArray[0]), Integer.parseInt(tempStrArray[1]), Integer.parseInt(tempStrArray[2]));
textList.add(text);
}
tempBufReader.close();
// Group by movie number, then by user number
textGroupByItem = textList.stream().collect(Collectors.groupingBy(Text::getItemNum));
textGroupByUser = textList.stream().collect(Collectors.groupingBy(Text::getUserNum));
double[] tempUserTotalScore = new double[numUsers];
double[] tempItemTotalScore = new double[numItems];
for (int i = 0; i < numUsers; i++) {
// Total score of the user
List<Text> textsByUser = textGroupByUser.get(i);
tempUserTotalScore[i] = textsByUser.stream().mapToDouble(Text::getScore).sum();
userDegrees[i] = textsByUser.size();
userAverageRatings[i] = tempUserTotalScore[i] / userDegrees[i];
}
for (int i = 0; i < numItems; i++) {
try {
// Total score of the movie
List<Text> textsByItem = textGroupByItem.get(i);
tempItemTotalScore[i] = textsByItem.stream().mapToDouble(Text::getScore).sum();
itemDegrees[i] = textsByItem.size();
itemAverageRatings[i] = tempItemTotalScore[i] / itemDegrees[i];
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
}
public void setRadius(double paraRadius) {
if (paraRadius > 0) {
radius = paraRadius;
} else {
radius = 0.1;
}
}
public void leaveOneOutPrediction() {
double tempItemAverageRating;
// Make each line of the code shorter.
int tempUser, tempItem, tempRating;
// System.out.println("\r\nLeaveOneOutPrediction for radius " + radius);
numNonNeighbors = 0;
for (int i = 0; i < numRatings; i++) {
tempUser = compressedRatingMatrix[i][0];
tempItem = compressedRatingMatrix[i][1];
tempRating = compressedRatingMatrix[i][2];
// Step 1. Recompute average rating of the current item.
tempItemAverageRating = (itemAverageRatings[tempItem] * itemDegrees[tempItem] - tempRating)
/ (itemDegrees[tempItem] - 1);
// Step 2. Recompute neighbors, at the same time obtain the ratings
// Of neighbors.
int tempNeighbors = 0;
double tempTotal = 0;
int tempComparedItem;
for (int j = userStartingIndices[tempUser]; j < userStartingIndices[tempUser + 1]; j++) {
tempComparedItem = compressedRatingMatrix[j][1];
if (tempItem == tempComparedItem) {
continue;// Ignore itself.
} // Of if
if (Math.abs(tempItemAverageRating - itemAverageRatings[tempComparedItem]) < radius) {
tempTotal += compressedRatingMatrix[j][2];
tempNeighbors++;
}
}
// Step 3. Predict as the average value of neighbors.
if (tempNeighbors > 0) {
predictions[i] = tempTotal / tempNeighbors;
} else {
predictions[i] = DEFAULT_RATING;
numNonNeighbors++;
}
}
}
public void leaveOneOutPredictionByUser() {
double tempItemAverageRating;
// Make each line of the code shorter.
int tempUser, tempItem, tempRating;
// System.out.println("\r\nLeaveOneOutPredictionUser for radius " + radius);
numNonNeighbors = 0;
for (int i = 0; i < numRatings; i++) {
Text text = textList.get(i);
tempUser = text.getUserNum();
tempItem = text.getItemNum();
tempRating = text.getScore();
// Step 1. Recompute average rating of the current user.
List<Text> textsByUser = textGroupByUser.get(tempUser);
Integer outItem = tempItem;
textsByUser = textsByUser.stream().filter(e -> !e.getItemNum().equals(outItem)).collect(Collectors.toList());
tempItemAverageRating = textsByUser.stream().mapToDouble(Text::getScore).sum() / textsByUser.size();
// Step 2. Recompute neighbors, at the same time obtain the ratings
// Of neighbors.
int tempNeighbors = 0;
double tempTotal = 0;
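List<Text> texts = textGroupByItem.get(tempItem);
for (int j = 0; j < texts.size(); j++) {
Text userText = texts.get(j);
// Compare user numbers, not the list index j, to skip the held-out rating.
if (tempUser == userText.getUserNum()) {
continue;// Ignore itself.
}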
if (Math.abs(tempItemAverageRating - userAverageRatings[userText.getUserNum()]) < radius) {
tempTotal += userText.getScore();
tempNeighbors++;
}
}
// Step 3. Predict as the average value of neighbors.
if (tempNeighbors > 0) {
predictions[i] = tempTotal / tempNeighbors;
} else {
predictions[i] = DEFAULT_RATING;
numNonNeighbors++;
}
}
}
public double computeMAE() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += Math.abs(predictions[i] - compressedRatingMatrix[i][2]);
} // Of for i
return tempTotalError / predictions.length;
}
public double computeMAE_User() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += Math.abs(predictions[i] - textList.get(i).getScore());
} // Of for i
return tempTotalError / predictions.length;
}
public double computeRSME() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += (predictions[i] - compressedRatingMatrix[i][2])
* (predictions[i] - compressedRatingMatrix[i][2]);
} // Of for i
double tempAverage = tempTotalError / predictions.length;
return Math.sqrt(tempAverage);
}
public double computeRSME_User() throws Exception {
double tempTotalError = 0;
for (int i = 0; i < predictions.length; i++) {
tempTotalError += (predictions[i] - textList.get(i).getScore())
* (predictions[i] - textList.get(i).getScore());
} // Of for i
double tempAverage = tempTotalError / predictions.length;
return Math.sqrt(tempAverage);
}
public static void main(String[] args) {
try {
MBR tempRecommender = new MBR("C:/Users/Desktop/sampledata/movielens-943u1682m.txt", 943, 1682, 100000);
MBR tempRecommender1 = new MBR("C:/Users/Desktop/sampledata/movielens-943u1682m.txt", 943, 1682, 100000, true);
for (double tempRadius = 0.2; tempRadius < 0.6; tempRadius += 0.1) {
tempRecommender.setRadius(tempRadius);
tempRecommender1.setRadius(tempRadius);
tempRecommender.leaveOneOutPrediction();
double tempMAE = tempRecommender.computeMAE();
double tempRSME = tempRecommender.computeRSME();
tempRecommender1.leaveOneOutPredictionByUser();
double tempMAE1 = tempRecommender1.computeMAE_User();
double tempRSME1 = tempRecommender1.computeRSME_User();
System.out.println("Radius_item = " + tempRadius + ", MAE_item = " + tempMAE + ", RSME_item = " + tempRSME
+ ", numNonNeighbors_item = " + tempRecommender.numNonNeighbors);
System.out.println("Radius_user = " + tempRadius + ", MAE_user = " + tempMAE1 + ", RSME_user = " + tempRSME1
+ ", numNonNeighbors_user = " + tempRecommender1.numNonNeighbors);
}
} catch (Exception ee) {
System.out.println(ee);
}
}
}