Contents
The sklearn.preprocessing.Binarizer class
The sklearn.preprocessing.KBinsDiscretizer class
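Binarizer binarizes data against a threshold (each feature value becomes 0 or 1) and is used on continuous variables. Values strictly greater than the threshold map to 1; values less than or equal to the threshold map to 0.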
sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)
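from sklearn.preprocessing import Binarizer
X = [[-3., 5., 15.],
     [0., 6., 14.],
     [6., 3., 11.]]
print(type(X))  # a plain Python list; Binarizer accepts array-like input
# Values > 6 become 1, values <= 6 become 0
test = Binarizer(threshold=6).fit_transform(X)
print(test)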
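The thresholding rule described above can be reproduced with a plain NumPy comparison. This is a minimal sketch rather than scikit-learn's implementation; the helper name `binarize` is made up for illustration. It shows that a value equal to the threshold (6 here) maps to 0.

```python
import numpy as np

def binarize(values, threshold=0.0):
    """Return 1.0 where value > threshold, else 0.0 (strict comparison)."""
    arr = np.asarray(values, dtype=float)
    return (arr > threshold).astype(float)

X = [[-3., 5., 15.],
     [0., 6., 14.],
     [6., 3., 11.]]

result = binarize(X, threshold=6)
print(result)
```

Only the last column exceeds 6 in every row, so only that column is set to 1.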
class sklearn.preprocessing.KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
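import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
X = np.array([[-3., 5., 15.],
              [0., 6., 14.],
              [6., 3., 11.]])
X = X.reshape(-1, 1)  # treat the 9 values as one feature column
print(X)
# strategy defaults to 'quantile': each bin gets roughly the same number of samples
est = KBinsDiscretizer(n_bins=3, encode='onehot').fit(X)
print(est.transform(X))            # one-hot result as a sparse matrix
print(est.transform(X).toarray())  # the same result as a dense array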
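To make the quantile strategy with one-hot encoding more concrete, here is a NumPy-only sketch of the idea. It assumes the bin edges are linearly interpolated percentiles and that a value sitting exactly on an interior edge goes to the bin on its right; the helper name `quantile_bin` is hypothetical, and scikit-learn's exact quantile method may differ between versions.

```python
import numpy as np

def quantile_bin(values, n_bins=3):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    x = np.asarray(values, dtype=float).ravel()
    # Bin edges at evenly spaced percentiles (0%, 33.3%, 66.7%, 100% for 3 bins).
    edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
    # Compare against the interior edges only; ties go to the right-hand bin.
    idx = np.searchsorted(edges[1:-1], x, side="right")
    # One-hot encode: row i has a single 1 in column idx[i].
    onehot = np.eye(n_bins)[idx]
    return edges, idx, onehot

# The same nine values as in the example above, already flattened.
x = [-3., 5., 15., 0., 6., 14., 6., 3., 11.]
edges, idx, onehot = quantile_bin(x, n_bins=3)
print(edges)
print(idx)
print(onehot)
```

Each of the three bins ends up with three of the nine samples, which is the point of the quantile strategy.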
The read_excel function
import pandas as pd
data = pd.read_excel('./input/diabetes_missing_value.xlsx')
print(data.isnull().sum())  # number of missing values per column
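Missing values are handled in two main ways.
When a feature has only a few missing values, fill them directly with its median: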
data["plas"] = data["plas"].fillna(data["plas"].median())
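When a feature has many missing values, use a heatmap to find the features most strongly correlated with it, then fill each gap with the median taken over the rows that match on those features. If no matching rows exist, fall back to the median of the whole column.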
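The correlations that a heatmap visualizes can also be ranked numerically. The sketch below uses a small made-up DataFrame whose column names mirror the ones in this post; the values are invented for illustration and are not taken from the real dataset. The resulting matrix could equally be handed to a plotting function such as seaborn's `heatmap`.

```python
import numpy as np
import pandas as pd

# Made-up toy data, for illustration only.
toy = pd.DataFrame({
    "plas": [148, 85, 183, 89, 137, 116],
    "pres": [72, 66, 64, 66, 40, 74],
    "skin": [35, 29, np.nan, 23, 35, np.nan],
    "mass": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

# Pairwise Pearson correlations; rows with NaN are skipped pair by pair.
corr = toy.corr()

# Rank the other features by absolute correlation with skin.
skin_corr = corr["skin"].drop("skin").abs().sort_values(ascending=False)
print(skin_corr)
```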
# skin has many missing values; the heatmap shows it correlates most with mass, pres and plas
# Fill the missing values of skin
index_NaN_skin = list(data["skin"][data["skin"].isnull()].index)
# Note: data["skin"][data["skin"].isnull()].index selects the missing entries
# and returns their row labels.
for i in index_NaN_skin:
    skin_med = data["skin"].median()  # median of the non-null skin values
    # Median of skin over all rows whose mass, pres and plas equal those of row i.
    # For example, if 5 rows match, the median of their skin values is used.
    skin_pred = data["skin"][(data["mass"] == data.loc[i, "mass"])
                             & (data["pres"] == data.loc[i, "pres"])
                             & (data["plas"] == data.loc[i, "plas"])].median()
    # Use .loc for assignment to avoid chained-indexing problems
    if not np.isnan(skin_pred):
        data.loc[i, "skin"] = skin_pred
    else:
        data.loc[i, "skin"] = skin_med  # no matching rows: fall back to the overall median
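The same "matching median first, overall median as fallback" idea can be written without an explicit loop. The sketch below is a different technique from the loop above: it groups on a single coarse category instead of matching exact values of three features. The column `mass_band` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data, for illustration only; mass_band is a hypothetical
# coarse grouping, not a column of the real dataset.
toy = pd.DataFrame({
    "mass_band": ["low", "low", "low", "high", "high", "mid"],
    "skin": [20.0, np.nan, 24.0, 35.0, np.nan, np.nan],
})

# Fallback value: median of all non-null skin values.
overall_median = toy["skin"].median()

# Median of skin within each group, aligned row by row with the frame.
# A group with no observed values yields NaN.
group_median = toy.groupby("mass_band")["skin"].transform("median")

# Fill with the group median first, then with the overall median.
toy["skin"] = toy["skin"].fillna(group_median).fillna(overall_median)
print(toy)
```

The "mid" group has no observed skin value, so its row receives the overall median instead of a group median.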
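Label conversion: in the class column, b'tested_positive' becomes 1 and the negative label becomes 0.
from sklearn.preprocessing import LabelEncoder
data["class"] = LabelEncoder().fit_transform(data["class"])  # class is the last column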
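LabelEncoder numbers the classes in sorted order, which is why the negative label ends up as 0 and the positive one as 1. A NumPy sketch of the same mapping follows; the label strings are assumed to look like the ones mentioned above, and the sample array is made up.

```python
import numpy as np

# Made-up sample of labels, for illustration only.
labels = np.array([
    "tested_positive",
    "tested_negative",
    "tested_negative",
    "tested_positive",
])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each original label within that sorted array.
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted distinct labels
print(codes)    # integer code per sample
```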
preg: number of pregnancies
X = data.iloc[:, 0].values.reshape(-1, 1)  # first column as a 2-D array with one feature
Using KBinsDiscretizer
test = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='uniform').fit_transform(X).toarray()
newdata = pd.concat([pd.DataFrame(test), data], axis=1)
print(newdata.head())
newdata.drop(["preg"], axis=1, inplace=True)
newdata.columns = ["preg_small", "preg_middle", "preg_large", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
print(newdata.head())
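For reference, the uniform strategy used for preg cuts the value range into equal-width bins. The NumPy sketch below shows which of the three one-hot columns a value would fall into, assuming equal-width edges between the minimum and maximum and ties going to the right-hand bin. The helper name `uniform_bin` and the sample counts are made up for illustration.

```python
import numpy as np

def uniform_bin(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins."""
    x = np.asarray(values, dtype=float).ravel()
    # Equal-width edges from the smallest to the largest value.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Interior edges decide the bin; the maximum stays in the last bin.
    idx = np.searchsorted(edges[1:-1], x, side="right")
    return edges, idx

# Made-up pregnancy counts, for illustration only.
preg = [0, 1, 3, 5, 6, 9, 12, 15]
edges, idx = uniform_bin(preg, n_bins=3)

names = np.array(["preg_small", "preg_middle", "preg_large"])
print(edges)
print(list(zip(preg, names[idx])))
```

Unlike the quantile strategy, the bins here have equal width but may hold very different numbers of samples.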