Other pages listing many popular open data repositories
Wikipedia’s list of Machine Learning datasets
Quora.com question
Datasets subreddit
2.3 Take a Quick Look at the Data Structure
head(), info(), ["key"].value_counts() (counts how often each value occurs), describe() (shows a summary of the numerical attributes), hist(). In Jupyter, run the magic command %matplotlib inline first so that plots render inside the notebook.
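The inspection calls above can be sketched on a small frame. The toy values and the column selection below are assumptions for illustration, not the real housing data, and hist() is left out because it needs a plotting backend.

```python
import pandas as pd

# Hypothetical toy data standing in for the housing dataset.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0],
    "total_bedrooms": [129.0, 1106.0, float("nan"), 235.0, 280.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

print(housing.head(3))   # first three rows
housing.info()           # column dtypes and non-null counts

# How often each category occurs in one column.
category_counts = housing["ocean_proximity"].value_counts()
print(category_counts)

# Summary statistics; by default only numerical columns are included.
summary = housing.describe()
print(summary)
```

Note that info() and the count row of describe() both reveal the missing total_bedrooms entry, which is what motivates the cleaning step later on.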
2.4 Create a Test Set
Pick 20% of the dataset at random and set it aside.
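One way to hold out a random 20% is scikit-learn's train_test_split. This is a minimal sketch: the toy frame and the random_state value are assumptions, and fixing the seed is only there so the same rows land in the test set on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data: 100 rows, one column.
housing = pd.DataFrame({"median_income": np.arange(100, dtype=float)})

# test_size=0.2 sets aside 20% of the rows; random_state makes the split repeatable.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

print(len(train_set), "train +", len(test_set), "test")
```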
3. Discover and visualize the data to gain insights.
Visualizing Geographical Data
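Looking for Correlations: corr() shows how each attribute correlates with a given attribute; pandas.plotting.scatter_matrix() (pandas.tools.plotting.scatter_matrix() in old pandas versions) plots every pair of the n selected attributes against each other.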
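A small sketch of the corr() step, assuming toy values and median_house_value as the attribute of interest. The scatter matrix is omitted here because it needs matplotlib.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows and columns.
housing = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "median_house_value": [100.0, 210.0, 290.0, 405.0, 500.0],
    "housing_median_age": [40.0, 35.0, 42.0, 30.0, 38.0],
})

# Pairwise Pearson correlation between all numerical columns.
corr_matrix = housing.corr()

# Rank every attribute by its correlation with the target attribute.
ranked = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranked)
```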
Experimenting with Attribute Combinations: a combined attribute such as bedrooms_per_room can be more informative than the total number of rooms.
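A combined attribute is just a ratio of existing columns. In this sketch the toy values are assumptions, and rooms_per_household is an extra illustrative ratio that these notes do not mention.

```python
import pandas as pd

# Hypothetical toy rows.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "households": [126.0, 1138.0, 177.0],
})

# Share of rooms that are bedrooms in each district.
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]

# Illustrative extra ratio (an assumption, not taken from the notes).
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]

print(housing[["bedrooms_per_room", "rooms_per_household"]])
```

Rerunning corr() after adding such columns shows whether the ratio tracks the target more closely than the raw totals do.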
4. Prepare the data for Machine Learning algorithms.
4.1 Data Cleaning
Get rid of the districts with missing values. housing.dropna(subset=["total_bedrooms"]) # option 1
Get rid of the whole attribute. housing.drop("total_bedrooms", axis=1) # option 2
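Set the missing values to some value (zero, the mean, the median, etc.). For example:
# fill the missing values directly with the median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median) # option 3; fillna() returns a new Series, so assign it back
Or:
# let an imputer handle the missing values (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; use sklearn.impute.SimpleImputer)
imputer = sklearn.impute.SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)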
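The three options, plus the imputer variant, can be compared side by side. This is a sketch on assumed toy values, using SimpleImputer from current scikit-learn releases; the opt1, opt2, and opt3 names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing total_bedrooms value.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0],
})

# Option 1: drop the rows (districts) where the value is missing.
opt1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute.
opt2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with the median (median() skips NaN by default).
median = housing["total_bedrooms"].median()
opt3 = housing.assign(total_bedrooms=housing["total_bedrooms"].fillna(median))

# Imputer variant: learns one median per column, then fills every column.
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing)
housing_tr = pd.DataFrame(X, columns=housing.columns, index=housing.index)

print(housing_tr)
```

The imputer stores the learned medians in imputer.statistics_, so the same values can later be applied to the test set with transform().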
Min-max scaling (subtract the min value, then divide by the max minus the min, so values end up in the range 0 to 1): sklearn.preprocessing.MinMaxScaler
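As a quick check of the min-max formula, the sketch below compares a manual computation with MinMaxScaler on a made-up single-column array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X = np.array([[1.0], [3.0], [5.0], [9.0]])

# Manual formula: (x - min) / (max - min), computed per column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Same transformation with scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(X)

print(scaled.ravel())
```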
Standardization (subtract the mean value, so standardized values always have a zero mean, then divide by the standard deviation, so the resulting distribution has unit variance): sklearn.preprocessing.StandardScaler
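The same kind of check for standardization, again on made-up values. The manual version divides by the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with mean 5 and standard deviation 2.
X = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual formula: (x - mean) / std, computed per column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(X)

print(scaled.ravel())
```

Unlike min-max scaling, the result is not bounded to a fixed range, but it is less affected by a single extreme value.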