Freshwater is one of our most important and scarcest natural resources, making up just 3% of the Earth's total water. It touches nearly every aspect of our daily lives, from drinking, swimming, and bathing to producing food, electricity, and the products we use every day. Access to a safe, sanitary water supply is essential not only for human life but also for the survival of surrounding ecosystems already suffering from drought, pollution, and rising temperatures.
The dataset contains 5,956,842 rows; each row records 22 features plus one target value, Target. The data contains both missing values and duplicate records.
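As a first step, a quick sketch for loading the data and counting duplicates and missing values; the training-file path `./train_data.csv` is an assumption, mirroring the `./test_data.csv` used later:
- import pandas as pd
-
- # Assumed path, by analogy with './test_data.csv' used in the test section below
- data = pd.read_csv('./train_data.csv')
-
- print(data.shape)               # per the notebook output: 5,956,842 rows x 24 columns (index, 22 features, Target)
- print(data.duplicated().sum())  # number of fully duplicated rows
- print(data.isnull().sum())      # missing-value count per column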

The 22 features are:
pH: acidity/alkalinity of the water sample
Iron: iron content
Nitrate: nitrate content
Chloride: chloride (chloride ion) content
Lead: lead content
Zinc: zinc content
Color: color of the water
Turbidity: turbidity
Fluoride: fluoride content
Copper: copper content
Odor: odor level or description
Sulfate: sulfate content
Conductivity: electrical conductivity, an indicator of the total ion concentration in the water
Chlorine: chlorine content
Manganese: manganese content
Total Dissolved Solids (TDS): total dissolved solids
Source: water source
Water Temperature: water temperature
Air Temperature: air temperature
Month: sampling month
Day: sampling day
Time of Day: sampling time

- import matplotlib.pyplot as plt
-
- # Bar-chart helper
- def plot_histogram(values, labels, title="bar"):
-     """
-     Create and display a bar chart of the given counts.
-
-     Parameters:
-         values (list): counts to plot, e.g. [zeros_count, ones_count].
-         labels (list): bar labels, e.g. ['0', '1'].
-         title (str): chart title.
-     """
-     # Create the bar chart
-     fig, ax = plt.subplots()
-     ax.bar(labels, values)
-
-     plt.xlabel('Target Values')
-     plt.ylabel('Counts')
-     plt.title(title)
-
-     # Rotate the x-axis tick labels by 45 degrees
-     plt.xticks(rotation=45)
-
-     # Show the figure
-     plt.show()
-
-
- # data.isnull().sum()
- # data has 5,956,842 rows x 24 columns in total
- # Plot per-column missing-value counts, skipping the index and Target columns
- plot_histogram(data.isnull().sum()[1:-1], data.columns[1:-1], "index bar")
The number of missing values per feature is shown in the figure:


There are roughly twice as many Target=0 rows as Target=1 rows, so the classes are fairly imbalanced.
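A quick way to verify this, reusing the `plot_histogram` helper defined above:
- # Count rows per target class; expect roughly a 2:1 ratio of 0s to 1s
- counts = data['Target'].value_counts().sort_index()
- print(counts)
-
- plot_histogram(counts.tolist(), [str(i) for i in counts.index], "target bar")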
All rows containing missing values are dropped, and the resulting subset `part_data` is used to estimate feature importance.
- # All rows that contain at least one missing value
- missing_data_rows = data[data.isnull().any(axis=1)]
-
- # Subset with every incomplete row removed, used below for feature importance
- part_data = data.dropna()

When Color is missing in the test set, it can reasonably be imputed as Colorless or Near Colorless.


When using the random forest algorithm to inspect feature importance, all features must be numeric, so the categorical columns are encoded first.
- features_data = part_data.iloc[:, 1:-1]
- # features_data
-
-
- # 1. Convert Color
- color_order = ["Colorless", "Near Colorless", "Faint Yellow", "Light Yellow", "Yellow"]
-
- # Convert the Color column to an ordered Categorical type
- features_data['Color'] = pd.Categorical(features_data['Color'], categories=color_order, ordered=True)
-
- # Then convert it to integer codes
- features_data['Color'] = features_data['Color'].cat.codes
-
-
- # 2. Convert Source
- source_order = ['Reservoir', 'Stream', 'Aquifer', 'Ground', 'Well', 'River', 'Lake', 'Spring']
-
- # Convert the Source column to a Categorical type (the order is arbitrary; cat.codes is just a label encoding here)
- features_data['Source'] = pd.Categorical(features_data['Source'], categories=source_order, ordered=True)
- # Then convert it to integer codes
- features_data['Source'] = features_data['Source'].cat.codes
-
-
- # 3. Convert Month
- month_to_number = {
-     "January": 1,
-     "February": 2,
-     "March": 3,
-     "April": 4,
-     "May": 5,
-     "June": 6,
-     "July": 7,
-     "August": 8,
-     "September": 9,
-     "October": 10,
-     "November": 11,
-     "December": 12,
- }
-
- # Map the month names in the 'Month' column to numbers
- features_data['Month'] = features_data['Month'].map(month_to_number)
-
- features_data

- from sklearn.ensemble import RandomForestClassifier  # classification task
- import numpy as np
-
- # Build the feature matrix and target vector
- X = features_data           # features
- y = part_data['Target']     # target variable
-
- # Create a random forest classifier; n_jobs=-1 uses all available CPU cores
- clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
- clf.fit(X, y)
-
- # Inspect the feature importances
- importances = clf.feature_importances_
- indices = np.argsort(importances)[::-1]
-
- # Print the features sorted by importance
- for f in range(X.shape[1]):
-     print(f"{X.columns[indices[f]]}: {importances[indices[f]]}")
Manganese: 0.12104377858773402
pH: 0.09760652848301078
Turbidity: 0.08681655744693856
Chloride: 0.07780548330395846
Copper: 0.07709669613065845
Odor: 0.07409887220818064
Nitrate: 0.0562709169089539
Chlorine: 0.048526178211342814
Fluoride: 0.04761459070452634
Iron: 0.04695381874680622
Total Dissolved Solids: 0.040240749739134815
color_encoded: 0.04003079773604966
Sulfate: 0.034990423757373615
Zinc: 0.027754942735884806
Air Temperature: 0.02196066652621346
Conductivity: 0.021933299982896146
Water Temperature: 0.02192034022343676
Day: 0.015746555919919757
Time of Day: 0.014685836285175339
Month: 0.01186842121596981
source_encoded: 0.00989443943832384
Lead: 0.00514010570751199
From this we can see that the most important features (Manganese, pH, Turbidity, Chloride, Copper, and Odor) have the highest importance scores and contribute the most to predicting the target.
The next tier, Nitrate, Chlorine, Fluoride, Iron, Total Dissolved Solids, and color_encoded (the encoded color feature), also carries some importance.
The remaining features, Sulfate, Zinc, Air Temperature, Conductivity, Water Temperature, Day, Time of Day, Month, source_encoded (the encoded source feature), and Lead, are comparatively unimportant.
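As an aside, the same grouping could be derived programmatically from the fitted forest; a small sketch, where the 0.04 cutoff is an assumed threshold chosen to match the split described above:
- # Keep features whose importance clears an assumed threshold of 0.04
- threshold = 0.04
- top_features = [X.columns[i] for i in indices if importances[i] >= threshold]
- print(top_features)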
- # Correlations between the features
- import seaborn as sns
- import matplotlib.pyplot as plt
- import numpy as np
-
- # Compute the correlation matrix, rounded to three decimals
- rounded_corr_matrix = np.round(features_data.corr(), decimals=3)
-
- # Convert to a DataFrame if it is not one already
- if not isinstance(rounded_corr_matrix, pd.DataFrame):
-     rounded_corr_matrix = pd.DataFrame(rounded_corr_matrix)
-
- # Plot a heatmap of the correlation matrix
- plt.figure(figsize=(20, 10))
- sns.heatmap(rounded_corr_matrix, annot=True, fmt=".3f", cmap='coolwarm', center=0)
- plt.title('Feature Correlation Heatmap')
- plt.show()

Correlation of each feature with the target:
- # Correlation of each feature with Target
- import numpy as np
-
- # features_data excludes Target, so append it from part_data before computing correlations
- corr_data = features_data.assign(Target=part_data['Target'])
- bar = corr_data.corr()['Target'].abs().sort_values(ascending=False)[1:]
-
- # Set the figure size before plotting
- plt.figure(figsize=(40, 10))
- plt.bar(bar.index, bar, width=0.5)
- plt.title("Relevance")
- plt.xticks(bar.index, bar.index, rotation=-60, fontsize=10)
- plt.show()
- bar

Color                     0.316406
Turbidity                 0.243901
Copper                    0.233780
Chloride                  0.223704
Manganese                 0.202161
Fluoride                  0.184519
Nitrate                   0.183518
Iron                      0.180652
Odor                      0.175216
Chlorine                  0.159747
Sulfate                   0.139351
Total Dissolved Solids    0.100813
Zinc                      0.088218
Lead                      0.041630
pH                        0.035067
Day                       0.000494
Air Temperature           0.000462
Source                    0.000392
Conductivity              0.000181
Water Temperature         0.000143
Month                     0.000058
Time of Day               0.000038
The Color feature has the strongest correlation with the target, followed by Turbidity, Copper, and Chloride. Features such as pH, Day, Air Temperature, Source, Conductivity, Water Temperature, Month, and Time of Day correlate only weakly with the target; the correlations of Month and Time of Day in particular are essentially zero.
Select the required features:
- # Columns selected for the analysis
- selected = ['Manganese', 'pH', 'Turbidity', 'Chloride', 'Copper', 'Odor', 'Nitrate', 'Color',
-             'Chlorine', 'Fluoride', 'Iron', 'Total Dissolved Solids', 'Target']  # alternative: one-hot encoding / 'color_encoded'
When preparing the training set, rows with missing 'Color' or 'Total Dissolved Solids' values can simply be dropped.
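A minimal sketch of that step, building the subset from the selected columns (`selected_data` is the name the following cells use):
- # Build the training subset from the selected columns
- selected_data = data[selected].copy()
-
- # Drop rows where Color or Total Dissolved Solids is missing
- selected_data = selected_data.dropna(subset=['Color', 'Total Dissolved Solids'])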

- # Convert Color
- color_order = ["Colorless", "Near Colorless", "Faint Yellow", "Light Yellow", "Yellow"]
- # Convert the Color column to an ordered Categorical type
- selected_data['Color'] = pd.Categorical(selected_data['Color'], categories=color_order, ordered=True)
-
- # Then convert it to integer codes
- # NOTE: do not run this cell twice, or the category mapping breaks (the codes would be re-encoded)
- selected_data['Color'] = selected_data['Color'].cat.codes
-
- # selected_data

Fill the remaining missing values with the corresponding column medians:
- # Fill missing values with each column's median
- for column in selected_data.columns[selected_data.isnull().any()]:
-     median_value = selected_data[column].median()
-     selected_data[column] = selected_data[column].fillna(median_value)
-
- # Verify that no missing values remain
- selected_data.isnull().sum()


- # Heatmap of the correlation matrix (selected_data is fully numeric after the encoding and median fill above)
- correlation_matrix = selected_data.corr()
-
- plt.figure(figsize=(12, 8))  # figure size
- sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # 'coolwarm' colormap with annotated values
- plt.title('Correlation Matrix')  # chart title
- plt.show()  # show the figure

- import matplotlib.pyplot as plt
- import seaborn as sns
-
- # The twelve selected feature columns
- features = ['Manganese', 'pH', 'Turbidity', 'Chloride', 'Copper', 'Odor', 'Nitrate', 'Color',
-             'Chlorine', 'Fluoride', 'Iron', 'Total Dissolved Solids']
-
- # Draw a boxplot per feature in a 6x2 grid of subplots
- fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(12, 8))
-
- for i, feature in enumerate(features):
-     row, col = divmod(i, 2)
-     selected_data.boxplot(column=feature, ax=axes[row, col])
-     axes[row, col].set_title(feature)
-
- plt.tight_layout()  # automatically adjust subplot spacing
- plt.show()

- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler
- from imblearn.under_sampling import RandomUnderSampler
- import datetime
-
- # Start training: split the prepared subset into features and target
- X = selected_data.drop('Target', axis=1)
- y = selected_data['Target']
-
- # Undersample the majority class so both classes are balanced
- under_sampler = RandomUnderSampler(random_state=21)
- X, y = under_sampler.fit_resample(X, y)
-
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
- scaler = StandardScaler()
- X_train_scaled = scaler.fit_transform(X_train)
- X_test_scaled = scaler.transform(X_test)
- print("Train Shape: {}".format(X_train_scaled.shape))
- print("Test Shape: {}".format(X_test_scaled.shape))
-
- X_train, X_test = X_train_scaled, X_test_scaled
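A quick sanity check that the undersampling balanced the labels (y is the resampled target vector from the cell above):
- import numpy as np
-
- # After RandomUnderSampler, both classes should have equal counts
- print(np.bincount(y))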

Define the hyperparameter search space.
Build the scorers.
- from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
-
- param_grid = {
-     'max_depth': [10, 15, 20],
-     "gamma": [0, 1, 2],                 # -> 0
-     "subsample": [0.9, 1],              # -> 1
-     "colsample_bytree": [0.3, 0.5, 1],  # -> 1
-     'min_child_weight': [4, 6, 8],      # -> 6
-     "n_estimators": [10, 50, 80, 100],  # -> 80
-     "alpha": [3, 4, 5]                  # -> 4
- }
-
- scorers = {
-     'precision_score': make_scorer(precision_score),
-     'recall_score': make_scorer(recall_score),
-     'accuracy_score': make_scorer(accuracy_score),
-     'f1_score': make_scorer(f1_score),
-     'roc_auc_score': make_scorer(roc_auc_score),
- }
Set the base hyperparameters.
- from xgboost import XGBClassifier
-
- xgb = XGBClassifier(
-     learning_rate=0.1,
-     n_estimators=15,
-     max_depth=12,
-     min_child_weight=6,
-     gamma=0,
-     subsample=1,
-     colsample_bytree=1,
-     objective='binary:logistic',  # logistic regression for binary classification; outputs probabilities
-     nthread=4,
-     alpha=4,
-     scale_pos_weight=1,
-     seed=27)
- from sklearn.model_selection import RandomizedSearchCV
-
- refit_score = "f1_score"
-
- start_time = datetime.datetime.now()
- print(start_time)
-
- # Randomized search over param_grid, refitting on the best f1_score
- rd_search = RandomizedSearchCV(xgb, param_grid, n_iter=10, cv=3, refit=refit_score, scoring=scorers, verbose=10, return_train_score=True)
- rd_search.fit(X_train, y_train)
- print(rd_search.best_params_)
- print(rd_search.best_score_)
- print(rd_search.best_estimator_)
- print(datetime.datetime.now() - start_time)
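The inference cells below reference a `loaded_model` that is never defined in this section; a plausible sketch, assuming the best estimator from the search is persisted with joblib (the file name `xgb_best_model.pkl` is an assumption):
- import joblib
-
- # Save the best model found by the randomized search (assumed file name)
- joblib.dump(rd_search.best_estimator_, 'xgb_best_model.pkl')
-
- # ...later, reload it for inference
- loaded_model = joblib.load('xgb_best_model.pkl')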


- test_data = pd.read_csv('./test_data.csv')
- # test_data
- # 1191369 rows
-
- # Take the subset of selected columns
- Test_data = test_data[selected].copy()
- import numpy as np
-
- # Treat missing Color values as Colorless
- Test_data['Color'] = Test_data['Color'].replace({np.nan: 'Colorless'})
- # Test_data.isnull().sum()
-
- # Encode Color with the same ordered categories used for training
- Test_data['Color'] = pd.Categorical(Test_data['Color'], categories=color_order, ordered=True).cat.codes
-
- # Fill remaining missing values with each column's median
- for column in Test_data.columns[Test_data.isnull().any()]:
-     median_value = Test_data[column].median()
-     Test_data[column] = Test_data[column].fillna(median_value)
-
- # Test_data.isnull().sum()
- from datetime import datetime
- from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score, f1_score
-
- # Split the prepared test subset into features and target
- X_eval = Test_data.drop('Target', axis=1)
- y_eval = Test_data['Target']
-
- # Apply the StandardScaler fitted on the training data (assumes `scaler` is still in scope)
- X_eval_scaled = scaler.transform(X_eval)
-
- # Record the start time
- inference_start_time = datetime.now()
-
- pred = loaded_model.predict(X_eval_scaled)
- # len(pred)  # 1191369
-
- # Measure model inference time
- inference_time = datetime.now() - inference_start_time
- print("Model inference time:", inference_time)
-
- # Compute and print the model's F1 score on the test set
- f1 = f1_score(y_eval, pred)
- print("F1 score on the test set:", f1)
