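Day 10: A Full Machine Learning Workflow in Python

Data preprocessing and model building are two of the most important steps in practical machine learning. This post reviews the full preprocessing workflow, then trains and evaluates a few simple models on the processed data. Hyperparameter tuning is left for later.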

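I. Review of the Preprocessing Workflow

Model quality depends heavily on data quality. A standard preprocessing workflow looks like this:

Import libraries: load the Python libraries needed for data processing, analysis, visualization, and modeling.
Read and inspect the data: load the dataset and use info() and head() to get a first look at its structure and basic properties.
Handle missing values: identify and treat missing values.
Handle outliers: detect and treat anomalous data points.
Handle discrete values: convert discrete (categorical) features into a format the model can use.
Feature engineering: feature scaling, deriving new features, and feature selection.
Split the dataset: divide the data into training and test sets for model training and evaluation.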

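1.1 Importing the required packages

import pandas as pd  # data processing and analysis for tabular data
import numpy as np   # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # higher-level statistical plotting built on matplotlib

# Configure a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei, a common font on Windows
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly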

1.2 Inspecting the data

data = pd.read_csv('data.csv')    # read the data
print("Basic information about the data:")
data.info()
print("\nPreview of the first 5 rows:")
print(data.head())

Basic information about the data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

Preview of the first 5 rows:

   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0   
1   1       Own Home      1025487.0            10+ years        0.0   
2   2  Home Mortgage       751412.0              8 years        0.0   
3   3       Own Home       805068.0              6 years        0.0   
4   4           Rent       776264.0              8 years        0.0   

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0   
1                     15.0                     15.3            1181730.0   
2                     11.0                     35.0            1182434.0   
3                      8.0                     22.5             147400.0   
4                     13.0                     13.6             385836.0   

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0   
1                        0.0                           NaN           0.0   
2                        0.0                           NaN           0.0   
3                        1.0                           NaN           1.0   
4                        1.0                           NaN           0.0   

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0   
1  debt consolidation   Long Term             264968.0   
2  debt consolidation  Short Term           99999999.0   
3  debt consolidation  Short Term             121396.0   
4  debt consolidation  Short Term             125840.0   

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default  
0                 47386.0        7914.0         749.0               0  
1                394972.0       18373.0         737.0               1  
2                308389.0       13651.0         742.0               0  
3                 95855.0       11338.0         694.0               0  
4                 93309.0        7180.0         719.0               0

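1.3 Handling missing values

Annual Income: 1,557 missing values. These can be filled with the mean income of groups defined by a related feature such as "Home Ownership".
Years in current job: 371 missing values. Convert the strings to numbers first, then fill with the mode or the median.
Months since last delinquent: a large number of missing values (4,081). Depending on how much the feature affects the target, options include multiple imputation or dropping the rows with missing values; note that dropping rows would discard more than half of the 7,500 records.
Credit Score: 1,557 missing values. Handle it in a way similar to "Annual Income".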
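
The group-wise fill suggested for "Annual Income" could look roughly like the sketch below. It runs on a small made-up DataFrame rather than the real data.csv; the column names follow the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: column names follow the dataset, values are made up.
df = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Own Home", "Own Home", "Rent"],
    "Annual Income": [400000.0, np.nan, 900000.0, np.nan, 600000.0],
})

# Mean income of each ownership group, broadcast back to every row.
group_mean = df.groupby("Home Ownership")["Annual Income"].transform("mean")

# Fill only the missing entries with the mean of their own group.
df["Annual Income"] = df["Annual Income"].fillna(group_mean)

print(df)
```

Section 2.2 of this post uses a simpler approach and fills every numeric column with its overall median; the group-wise version is only one possible refinement.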

1.4 Data type conversion

Years in current job: convert from string to numeric.
Home Ownership, Purpose, Term: choose one-hot encoding or label encoding depending on the nature of each feature.
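
As one possible way to turn the "Years in current job" strings into numbers, the sketch below extracts the digits with a regular expression. This is an alternative to the explicit mapping dictionary used in section 2.1, not the method this post applies; treating "< 1 year" as 0 is an assumption made here for illustration.

```python
import pandas as pd

# Toy values in the same format as the "Years in current job" column.
years = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", None])

# Pull out the first run of digits; missing entries stay NaN.
numeric_years = years.str.extract(r"(\d+)", expand=False).astype(float)

# "< 1 year" would otherwise parse as 1, so set it to 0 explicitly (assumption).
numeric_years.loc[years == "< 1 year"] = 0.0

print(numeric_years)
```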

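1.5 Handling outliers

For numeric features such as "Annual Income" and "Current Loan Amount", box plots can be used to detect outliers; whether to treat them depends on the actual situation.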
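
A box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in this minimal sketch. The series is a toy sample: some values are taken from the preview above, and 310000.0 is made up to fill it out.

```python
import pandas as pd

# Toy "Current Loan Amount" sample; 99999999.0 appears in the data preview.
loan = pd.Series([121396.0, 125840.0, 264968.0, 310000.0, 99999999.0])

q1 = loan.quantile(0.25)
q3 = loan.quantile(0.75)
iqr = q3 - q1

# Standard box-plot fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (loan < lower) | (loan > upper)
print(loan[is_outlier])
```

Whether a flagged value such as 99999999.0 is a genuine amount or a placeholder is a judgment about the data that the rule itself cannot make.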

1.6 Feature scaling

Apply Min-Max normalization or Z-score standardization to the numeric features so that they share a comparable range of values.
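
The two scaling methods can be written out directly from their formulas, x' = (x - min) / (max - min) and z = (x - mean) / std. The sketch below does this with plain pandas on the five "Credit Score" values from the preview; using the population standard deviation (ddof=0) is a choice made here.

```python
import pandas as pd

# Credit Score values from the first five rows of the preview.
score = pd.Series([694.0, 719.0, 737.0, 742.0, 749.0])

# Min-Max normalization: rescale to the [0, 1] interval.
min_max = (score - score.min()) / (score.max() - score.min())

# Z-score standardization: zero mean, unit (population) standard deviation.
z_score = (score - score.mean()) / score.std(ddof=0)

print(min_max.round(3).tolist())
print(z_score.round(3).tolist())
```

In a real pipeline the min/max or mean/std would usually be computed on the training set only and then reused to transform the test set.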

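1.7 Feature engineering

Deriving new features: for example, computing a debt-to-income ratio.
Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.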
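
Both ideas can be sketched on the five preview rows. The derived column name "Debt to Income" and the choice to annualize "Monthly Debt" (assuming it is a per-month figure while income is per-year) are assumptions made here, not definitions from the dataset.

```python
import pandas as pd

# Values taken from the first five rows of the preview.
df = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature: yearly debt payments divided by yearly income (assumed definition).
df["Debt to Income"] = df["Monthly Debt"] * 12 / df["Annual Income"]

# Rank features by absolute Pearson correlation with the target.
corr = (
    df.corr()["Credit Default"]
    .drop("Credit Default")
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

Five rows are far too few for the correlations to mean anything; the point is only the mechanics, which carry over unchanged to the full DataFrame.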

II. Preprocessing in Practice

2.1 Handling object-type variables

# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)

# Show the unique values of each string column
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())

How each variable is handled:

Home Ownership: label encoding

mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}
data['Home Ownership'] = data['Home Ownership'].map(mapping)
data.head()

Years in current job: label encoding

years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

Purpose: one-hot encoding

data = pd.get_dummies(data, columns=['Purpose'])
# Convert the boolean one-hot columns to integers
for col in data.columns:
    if 'Purpose' in col:
        data[col] = data[col].astype(int)

Term: 0/1 mapping

term_mapping = {
    'Short Term': 0,
    'Long Term': 1
}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)

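2.2 Handling numeric variables

# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fill missing values with the column median
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    data[feature] = data[feature].fillna(median_value)

Data information after processing: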

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   int64  
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64  
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64  
 17  Purpose_business loan         7500 non-null   int32  
 18  Purpose_buy a car             7500 non-null   int32  
 19  Purpose_buy house             7500 non-null   int32  
 20  Purpose_debt consolidation    7500 non-null   int32  
 21  Purpose_educational expenses  7500 non-null   int32  
 22  Purpose_home improvements     7500 non-null   int32  
 23  Purpose_major purchase        7500 non-null   int32  
 24  Purpose_medical bills         7500 non-null   int32  
 25  Purpose_moving                7500 non-null   int32  
 26  Purpose_other                 7500 non-null   int32  
 27  Purpose_renewable energy      7500 non-null   int32  
 28  Purpose_small business        7500 non-null   int32  
 29  Purpose_take a trip           7500 non-null   int32  
 30  Purpose_vacation              7500 non-null   int32  
 31  Purpose_wedding               7500 non-null   int32  
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB

III. Model Building and Evaluation

3.1 Splitting the data

from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features
y = data['Credit Default']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")

Result:

Training set shape: (6000, 31), test set shape: (1500, 31)

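3.2 Model training and evaluation

Several common classification models are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.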

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")

# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
print(f"Precision: {precision_score(y_test, logreg_pred):.4f}")
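
The fit, predict, and report steps above are the same for every model, so they could be folded into one helper and run in a loop. The sketch below is one way to do that, not the code used in this post: the evaluate function is a hypothetical name, and it runs on synthetic data from make_classification so that it works without data.csv.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a classifier and return the four metrics printed in this post."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


# Synthetic stand-in data (assumption: not the credit dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Adding another classifier then only takes one more entry in the models dictionary.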

@浙大疏锦行