Freshwater is one of our most important and scarcest natural resources, making up just 3% of the Earth's total water. It touches nearly every aspect of our daily lives, from drinking, swimming, and bathing to producing food, electricity, and the products we use every day. Access to a safe, sanitary water supply is essential not only for human life but also for the survival of surrounding ecosystems already suffering from drought, pollution, and rising temperatures.
The dataset contains 5,956,842 rows; each row records 22 features plus one target value, Target. The data contains both missing values and duplicate records.
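As a first step, a quick sketch for loading the data and counting duplicates and missing values; the training-file path `./train_data.csv` is an assumption, mirroring the `./test_data.csv` used later:
- import pandas as pd
-
- # Assumed path, by analogy with './test_data.csv' used in the test section below
- data = pd.read_csv('./train_data.csv')
-
- print(data.shape)               # per the notebook output: 5,956,842 rows x 24 columns (index, 22 features, Target)
- print(data.duplicated().sum())  # number of fully duplicated rows
- print(data.isnull().sum())      # missing-value count per column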

The 22 features are:
pH: acidity/alkalinity of the water sample
Iron: iron content
Nitrate: nitrate content
Chloride: chloride (chloride ion) content
Lead: lead content
Zinc: zinc content
Color: color of the water
Turbidity: turbidity
Fluoride: fluoride content
Copper: copper content
Odor: odor level or description
Sulfate: sulfate content
Conductivity: electrical conductivity, an indicator of the total ion concentration in the water
Chlorine: chlorine content
Manganese: manganese content
Total Dissolved Solids (TDS): total dissolved solids
Source: water source
Water Temperature: water temperature
Air Temperature: air temperature
Month: sampling month
Day: sampling day
Time of Day: sampling time

- import matplotlib.pyplot as plt
-
- # Bar-chart helper
- def plot_histogram(values, labels, title="bar"):
-     """
-     Create and display a bar chart of the given counts.
-
-     Parameters:
-         values (list): counts to plot, e.g. [zeros_count, ones_count].
-         labels (list): bar labels, e.g. ['0', '1'].
-         title (str): chart title.
-     """
-     # Create the bar chart
-     fig, ax = plt.subplots()
-     ax.bar(labels, values)
-
-     plt.xlabel('Target Values')
-     plt.ylabel('Counts')
-     plt.title(title)
-
-     # Rotate the x-axis tick labels by 45 degrees
-     plt.xticks(rotation=45)
-
-     # Show the figure
-     plt.show()
-
-
- # data.isnull().sum()
- # data has 5,956,842 rows x 24 columns in total
- # Plot per-column missing-value counts, skipping the index and Target columns
- plot_histogram(data.isnull().sum()[1:-1], data.columns[1:-1], "index bar")
The number of missing values per feature is shown in the figure:


There are roughly twice as many Target=0 rows as Target=1 rows, so the classes are fairly imbalanced.
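A quick way to verify this, reusing the `plot_histogram` helper defined above:
- # Count rows per target class; expect roughly a 2:1 ratio of 0s to 1s
- counts = data['Target'].value_counts().sort_index()
- print(counts)
-
- plot_histogram(counts.tolist(), [str(i) for i in counts.index], "target bar")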
All rows containing missing values are dropped, and the resulting subset `part_data` is used to estimate feature importance.
- # All rows that contain at least one missing value
- missing_data_rows = data[data.isnull().any(axis=1)]
-
- # Subset with every incomplete row removed, used below for feature importance
- part_data = data.dropna()

When Color is missing in the test set, it can reasonably be imputed as Colorless or Near Colorless.


When using the random forest algorithm to inspect feature importance, all features must be numeric, so the categorical columns are encoded first.
- features_data = part_data.iloc[:, 1:-1]
- # features_data
-
-
- # 1. Convert Color
- color_order = ["Colorless", "Near Colorless", "Faint Yellow", "Light Yellow", "Yellow"]
-
- # Convert the Color column to an ordered Categorical type
- features_data['Color'] = pd.Categorical(features_data['Color'], categories=color_order, ordered=True)
-
- # Then convert it to integer codes
- features_data['Color'] = features_data['Color'].cat.codes
-
-
- # 2. Convert Source
- source_order = ['Reservoir', 'Stream', 'Aquifer', 'Ground', 'Well', 'River', 'Lake', 'Spring']
-
- # Convert the Source column to a Categorical type (the order is arbitrary; cat.codes is just a label encoding here)
- features_data['Source'] = pd.Categorical(features_data['Source'], categories=source_order, ordered=True)
- # Then convert it to integer codes
- features_data['Source'] = features_data['Source'].cat.codes
-
-
- # 3. Convert Month
- month_to_number = {
-     "January": 1,
-     "February": 2,
-     "March": 3,
-     "April": 4,
-     "May": 5,
-     "June": 6,
-     "July": 7,
-     "August": 8,
-     "September": 9,
-     "October": 10,
-     "November": 11,
-     "December": 12,
- }
-
- # Map the month names in the 'Month' column to numbers
- features_data['Month'] = features_data['Month'].map(month_to_number)
-
- features_data

- from sklearn.ensemble import RandomForestClassifier  # classification task
- import numpy as np
-
- # Build the feature matrix and target vector
- X = features_data           # features
- y = part_data['Target']     # target variable
-
- # Create a random forest classifier; n_jobs=-1 uses all available CPU cores
- clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
- clf.fit(X, y)
-
- # Inspect the feature importances
- importances = clf.feature_importances_
- indices = np.argsort(importances)[::-1]
-
- # Print the features sorted by importance
- for f in range(X.shape[1]):
-     print(f"{X.columns[indices[f]]}: {importances[indices[f]]}")
Manganese: 0.12104377858773402
pH: 0.09760652848301078
Turbidity: 0.08681655744693856
Chloride: 0.07780548330395846
Copper: 0.07709669613065845
Odor: 0.07409887220818064
Nitrate: 0.0562709169089539
Chlorine: 0.048526178211342814
Fluoride: 0.04761459070452634
Iron: 0.04695381874680622
Total Dissolved Solids: 0.040240749739134815
color_encoded: 0.04003079773604966
Sulfate: 0.034990423757373615
Zinc: 0.027754942735884806
Air Temperature: 0.02196066652621346
Conductivity: 0.021933299982896146
Water Temperature: 0.02192034022343676
Day: 0.015746555919919757
Time of Day: 0.014685836285175339
Month: 0.01186842121596981
source_encoded: 0.00989443943832384
Lead: 0.00514010570751199
From this we can see that the most important features (Manganese, pH, Turbidity, Chloride, Copper, and Odor) have the highest importance scores and contribute the most to predicting the target.
The next tier, Nitrate, Chlorine, Fluoride, Iron, Total Dissolved Solids, and color_encoded (the encoded color feature), also carries some importance.
The remaining features, Sulfate, Zinc, Air Temperature, Conductivity, Water Temperature, Day, Time of Day, Month, source_encoded (the encoded source feature), and Lead, are comparatively unimportant.
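As an aside, the same grouping could be derived programmatically from the fitted forest; a small sketch, where the 0.04 cutoff is an assumed threshold chosen to match the split described above:
- # Keep features whose importance clears an assumed threshold of 0.04
- threshold = 0.04
- top_features = [X.columns[i] for i in indices if importances[i] >= threshold]
- print(top_features)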
- # Correlations between the features
- import seaborn as sns
- import matplotlib.pyplot as plt
- import numpy as np
-
- # Compute the correlation matrix, rounded to three decimals
- rounded_corr_matrix = np.round(features_data.corr(), decimals=3)
-
- # Convert to a DataFrame if it is not one already
- if not isinstance(rounded_corr_matrix, pd.DataFrame):
-     rounded_corr_matrix = pd.DataFrame(rounded_corr_matrix)
-
- # Plot a heatmap of the correlation matrix
- plt.figure(figsize=(20, 10))
- sns.heatmap(rounded_corr_matrix, annot=True, fmt=".3f", cmap='coolwarm', center=0)
- plt.title('Feature Correlation Heatmap')
- plt.show()

Correlation of each feature with the target:
- # Correlation of each feature with Target
- import numpy as np
-
- # features_data excludes Target, so append it from part_data before computing correlations
- corr_data = features_data.assign(Target=part_data['Target'])
- bar = corr_data.corr()['Target'].abs().sort_values(ascending=False)[1:]
-
- # Set the figure size before plotting
- plt.figure(figsize=(40, 10))
- plt.bar(bar.index, bar, width=0.5)
- plt.title("Relevance")
- plt.xticks(bar.index, bar.index, rotation=-60, fontsize=10)
- plt.show()
- bar

Color                     0.316406
Turbidity                 0.243901
Copper                    0.233780
Chloride                  0.223704
Manganese                 0.202161
Fluoride                  0.184519
Nitrate                   0.183518
Iron                      0.180652
Odor                      0.175216
Chlorine                  0.159747
Sulfate                   0.139351
Total Dissolved Solids    0.100813
Zinc                      0.088218
Lead                      0.041630
pH                        0.035067
Day                       0.000494
Air Temperature           0.000462
Source                    0.000392
Conductivity              0.000181
Water Temperature         0.000143
Month                     0.000058
Time of Day               0.000038
The Color feature has the strongest correlation with the target, followed by Turbidity, Copper, and Chloride. Features such as pH, Day, Air Temperature, Source, Conductivity, Water Temperature, Month, and Time of Day correlate only weakly with the target; the correlations of Month and Time of Day in particular are essentially zero.
Select the required features:
- # Columns selected for the analysis
- selected = ['Manganese', 'pH', 'Turbidity', 'Chloride', 'Copper', 'Odor', 'Nitrate', 'Color',
-             'Chlorine', 'Fluoride', 'Iron', 'Total Dissolved Solids', 'Target']  # alternative: one-hot encoding / 'color_encoded'
When preparing the training set, rows with missing 'Color' or 'Total Dissolved Solids' values can simply be dropped.
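A minimal sketch of that step, building the subset from the selected columns (`selected_data` is the name the following cells use):
- # Build the training subset from the selected columns
- selected_data = data[selected].copy()
-
- # Drop rows where Color or Total Dissolved Solids is missing
- selected_data = selected_data.dropna(subset=['Color', 'Total Dissolved Solids'])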

- # Convert Color
- color_order = ["Colorless", "Near Colorless", "Faint Yellow", "Light Yellow", "Yellow"]
- # Convert the Color column to an ordered Categorical type
- selected_data['Color'] = pd.Categorical(selected_data['Color'], categories=color_order, ordered=True)
-
- # Then convert it to integer codes
- # NOTE: do not run this cell twice, or the category mapping breaks (the codes would be re-encoded)
- selected_data['Color'] = selected_data['Color'].cat.codes
-
- # selected_data

Fill the remaining missing values with the corresponding column medians:
- # Fill missing values with each column's median
- for column in selected_data.columns[selected_data.isnull().any()]:
-     median_value = selected_data[column].median()
-     selected_data[column] = selected_data[column].fillna(median_value)
-
- # Verify that no missing values remain
- selected_data.isnull().sum()


- # Heatmap of the correlation matrix (selected_data is fully numeric after the encoding and median fill above)
- correlation_matrix = selected_data.corr()
-
- plt.figure(figsize=(12, 8))  # figure size
- sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # 'coolwarm' colormap with annotated values
- plt.title('Correlation Matrix')  # chart title
- plt.show()  # show the figure

- import matplotlib.pyplot as plt
- import seaborn as sns
-
- # The twelve selected feature columns
- features = ['Manganese', 'pH', 'Turbidity', 'Chloride', 'Copper', 'Odor', 'Nitrate', 'Color',
-             'Chlorine', 'Fluoride', 'Iron', 'Total Dissolved Solids']
-
- # Draw a boxplot per feature in a 6x2 grid of subplots
- fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(12, 8))
-
- for i, feature in enumerate(features):
-     row, col = divmod(i, 2)
-     selected_data.boxplot(column=feature, ax=axes[row, col])
-     axes[row, col].set_title(feature)
-
- plt.tight_layout()  # automatically adjust subplot spacing
- plt.show()

- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler
- from imblearn.under_sampling import RandomUnderSampler
- import datetime
-
- # Start training: split the prepared subset into features and target
- X = selected_data.drop('Target', axis=1)
- y = selected_data['Target']
-
- # Undersample the majority class so both classes are balanced
- under_sampler = RandomUnderSampler(random_state=21)
- X, y = under_sampler.fit_resample(X, y)
-
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
- scaler = StandardScaler()
- X_train_scaled = scaler.fit_transform(X_train)
- X_test_scaled = scaler.transform(X_test)
- print("Train Shape: {}".format(X_train_scaled.shape))
- print("Test Shape: {}".format(X_test_scaled.shape))
-
- X_train, X_test = X_train_scaled, X_test_scaled
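A quick sanity check that the undersampling balanced the labels (y is the resampled target vector from the cell above):
- import numpy as np
-
- # After RandomUnderSampler, both classes should have equal counts
- print(np.bincount(y))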

Define the hyperparameter search space.
Build the scorers.
- from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
-
- param_grid = {
-     'max_depth': [10, 15, 20],
-     "gamma": [0, 1, 2],                 # -> 0
-     "subsample": [0.9, 1],              # -> 1
-     "colsample_bytree": [0.3, 0.5, 1],  # -> 1
-     'min_child_weight': [4, 6, 8],      # -> 6
-     "n_estimators": [10, 50, 80, 100],  # -> 80
-     "alpha": [3, 4, 5]                  # -> 4
- }
-
- scorers = {
-     'precision_score': make_scorer(precision_score),
-     'recall_score': make_scorer(recall_score),
-     'accuracy_score': make_scorer(accuracy_score),
-     'f1_score': make_scorer(f1_score),
-     'roc_auc_score': make_scorer(roc_auc_score),
- }
Set the base hyperparameters.
- from xgboost import XGBClassifier
-
- xgb = XGBClassifier(
-     learning_rate=0.1,
-     n_estimators=15,
-     max_depth=12,
-     min_child_weight=6,
-     gamma=0,
-     subsample=1,
-     colsample_bytree=1,
-     objective='binary:logistic',  # logistic regression for binary classification; outputs probabilities
-     nthread=4,
-     alpha=4,
-     scale_pos_weight=1,
-     seed=27)
- from sklearn.model_selection import RandomizedSearchCV
-
- refit_score = "f1_score"
-
- start_time = datetime.datetime.now()
- print(start_time)
-
- # Randomized search over param_grid, refitting on the best f1_score
- rd_search = RandomizedSearchCV(xgb, param_grid, n_iter=10, cv=3, refit=refit_score, scoring=scorers, verbose=10, return_train_score=True)
- rd_search.fit(X_train, y_train)
- print(rd_search.best_params_)
- print(rd_search.best_score_)
- print(rd_search.best_estimator_)
- print(datetime.datetime.now() - start_time)
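The inference cells below reference a `loaded_model` that is never defined in this section; a plausible sketch, assuming the best estimator from the search is persisted with joblib (the file name `xgb_best_model.pkl` is an assumption):
- import joblib
-
- # Save the best model found by the randomized search (assumed file name)
- joblib.dump(rd_search.best_estimator_, 'xgb_best_model.pkl')
-
- # ...later, reload it for inference
- loaded_model = joblib.load('xgb_best_model.pkl')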


- test_data = pd.read_csv('./test_data.csv')
- # test_data
- # 1191369 rows
-
- # Take the subset of selected columns
- Test_data = test_data[selected].copy()
- import numpy as np
-
- # Treat missing Color values as Colorless
- Test_data['Color'] = Test_data['Color'].replace({np.nan: 'Colorless'})
- # Test_data.isnull().sum()
-
- # Encode Color with the same ordered categories used for training
- Test_data['Color'] = pd.Categorical(Test_data['Color'], categories=color_order, ordered=True).cat.codes
-
- # Fill remaining missing values with each column's median
- for column in Test_data.columns[Test_data.isnull().any()]:
-     median_value = Test_data[column].median()
-     Test_data[column] = Test_data[column].fillna(median_value)
-
- # Test_data.isnull().sum()
- from datetime import datetime
- from sklearn.metrics import roc_auc_score, roc_curve, auc, accuracy_score, f1_score
-
- # Split the prepared test subset into features and target
- X_eval = Test_data.drop('Target', axis=1)
- y_eval = Test_data['Target']
-
- # Apply the StandardScaler fitted on the training data (assumes `scaler` is still in scope)
- X_eval_scaled = scaler.transform(X_eval)
-
- # Record the start time
- inference_start_time = datetime.now()
-
- pred = loaded_model.predict(X_eval_scaled)
- # len(pred)  # 1191369
-
- # Measure model inference time
- inference_time = datetime.now() - inference_start_time
- print("Model inference time:", inference_time)
-
- # Compute and print the model's F1 score on the test set
- f1 = f1_score(y_eval, pred)
- print("F1 score on the test set:", f1)
