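In machine learning practice, data preprocessing and model building are critical steps. This article reviews the full preprocessing workflow, then uses the processed data to build and evaluate a few simple machine learning models. Hyperparameter tuning is left out for now.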
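1. Review of the Preprocessing Workflow
The success of a machine learning project depends heavily on data quality. The standard preprocessing workflow is:
- Import libraries: load the Python libraries needed for data handling, analysis, visualization, and modeling.
- Read and understand the data: load the dataset, then use the info() and head() methods for a first look at its basic information and structure.
- Handle missing values: identify and treat missing values in the data.
- Handle outliers: detect and treat anomalous data points.
- Handle discrete values: convert discrete (categorical) data into a format that models can consume.
- Feature engineering: feature scaling, deriving new features, and feature selection.
- Split the dataset: divide the data into a training set and a test set for model training and evaluation.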
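1.1 Importing the Required Packages
import pandas as pd  # data handling and analysis for tabular data
import numpy as np  # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting of all kinds of charts
import seaborn as sns  # high-level statistical plotting built on matplotlib
# Configure a font with Chinese glyphs so that Chinese labels render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei, a common font on Windows
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly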
1.2 Inspecting the Data
data = pd.read_csv('data.csv')  # load the dataset
print("Basic data information:")
data.info()
print("\nPreview of the first 5 rows:")
print(data.head())
Output:
Basic data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Id                            7500 non-null   int64
 1   Home Ownership                7500 non-null   object
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object
 12  Term                          7500 non-null   object
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

Preview of the first 5 rows:
   Id  Home Ownership  Annual Income  Years in current job  Tax Liens  \
0   0        Own Home       482087.0                   NaN        0.0
1   1        Own Home      1025487.0             10+ years        0.0
2   2   Home Mortgage       751412.0               8 years        0.0
3   3        Own Home       805068.0               6 years        0.0
4   4            Rent       776264.0               8 years        0.0

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0
1                     15.0                     15.3            1181730.0
2                     11.0                     35.0            1182434.0
3                      8.0                     22.5             147400.0
4                     13.0                     13.6             385836.0

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0
1                        0.0                           NaN           0.0
2                        0.0                           NaN           0.0
3                        1.0                           NaN           1.0
4                        1.0                           NaN           0.0

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0
1  debt consolidation   Long Term             264968.0
2  debt consolidation  Short Term           99999999.0
3  debt consolidation  Short Term             121396.0
4  debt consolidation  Short Term             125840.0

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default
0                 47386.0        7914.0         749.0               0
1                394972.0       18373.0         737.0               1
2                308389.0       13651.0         742.0               0
3                 95855.0       11338.0         694.0               0
4                 93309.0        7180.0         719.0               0
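1.3 Handling Missing Values
- Annual Income: 1,557 missing values. These can be filled with the mean income of a related group, for example each "Home Ownership" category.
- Years in current job: 371 missing values. Convert the strings to numbers first, then fill with the mode or the median.
- Months since last delinquent: a large number of missing values (4,081). Depending on how strongly the feature affects the target variable, use multiple imputation or drop the rows with missing values. Note that dropping rows would discard more than half of the 7,500 records.
- Credit Score: 1,557 missing values. Handle them the same way as "Annual Income".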
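The group-wise fill suggested for "Annual Income" can be sketched as follows. This is a minimal illustration on a toy DataFrame with hypothetical values (not rows from data.csv), assuming the group mean is an acceptable estimate for the missing incomes.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, used only to illustrate the idea
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Rent", "Own Home", "Own Home"],
    "Annual Income": [400000.0, np.nan, 600000.0, 900000.0, np.nan],
})

# Mean income within each ownership group, aligned back to the original rows
group_mean = toy.groupby("Home Ownership")["Annual Income"].transform("mean")

# Fill only the missing entries, each with the mean of its own group
toy["Annual Income"] = toy["Annual Income"].fillna(group_mean)
print(toy)
```

Using transform keeps the result the same length as the original column, so it can be passed straight to fillna without a merge.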
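1.4 Data Type Conversion
- Years in current job: convert the string values to numeric values.
- Home Ownership, Purpose, Term: choose one-hot encoding or label encoding depending on the nature of each feature.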
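One possible way to turn the "Years in current job" strings into numbers is to extract the digits with a regular expression. The sketch below is an alternative to an explicit mapping dictionary, not the method used later in this article; treating "< 1 year" as 0 is an assumption made for the example.

```python
import pandas as pd

# Sample values in the same format as the "Years in current job" column
jobs = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", None])

# Pull out the first run of digits; missing entries stay NaN
years = pd.to_numeric(jobs.str.extract(r"(\d+)", expand=False))

# "< 1 year" would otherwise parse as 1; map it to 0 (an assumption)
years[jobs == "< 1 year"] = 0.0

# Fill the remaining gap with the median of the observed values
years = years.fillna(years.median())
print(years.tolist())
```

The "10+ years" bucket is capped at 10 here, so the numeric column is still only an ordinal approximation of true tenure.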
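1.5 Handling Outliers
For numeric features such as "Annual Income" and "Current Loan Amount", outliers can be detected with a box plot; whether to treat them depends on the actual situation.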
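A box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically. The sketch below uses a small illustrative series: four of the values appear in the data preview above, while 310000.0 is a made-up filler value.

```python
import pandas as pd

# Illustrative loan amounts; 310000.0 is hypothetical
loan = pd.Series(
    [121396.0, 125840.0, 264968.0, 310000.0, 99999999.0],
    name="Current Loan Amount",
)

# Quartiles and interquartile range
q1 = loan.quantile(0.25)
q3 = loan.quantile(0.75)
iqr = q3 - q1

# The usual box-plot fences: 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = loan[(loan < lower) | (loan > upper)]
print(outliers)
```

Whether a flagged value such as 99999999.0 should be capped, replaced, or kept is a judgment call that depends on what the value means in the dataset.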
1.6 Feature Scaling
Apply Min-Max normalization or Z-score standardization to the numeric features so that they share a comparable range of values.
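Both scaling schemes are short formulas, so they can be written directly in pandas. The sketch below uses the five "Credit Score" values from the data preview; the population standard deviation (ddof=0) is chosen here as an assumption, to match what scikit-learn's StandardScaler computes.

```python
import pandas as pd

# Credit scores from the first five rows of the preview
scores = pd.Series([694.0, 719.0, 737.0, 742.0, 749.0], name="Credit Score")

# Min-Max normalization: rescale to the interval [0, 1]
min_max = (scores - scores.min()) / (scores.max() - scores.min())

# Z-score standardization: zero mean and unit (population) standard deviation
z_score = (scores - scores.mean()) / scores.std(ddof=0)

print(min_max.round(3).tolist())
print(z_score.round(3).tolist())
```

In a real pipeline the minimum, maximum, mean, and standard deviation would normally be computed on the training set only and then reused for the test set.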
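1.7 Feature Engineering
- Deriving new features: for example, computing the debt-to-income ratio.
- Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.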
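The two ideas above can be sketched on the five preview rows. The exact definition of the ratio is an assumption here (monthly debt annualized and divided by annual income), and the column name "Debt to Income" is hypothetical. With only five rows the correlations are not meaningful; the point is the mechanics.

```python
import pandas as pd

# Values taken from the first five rows of the data preview
toy = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature: annualized debt divided by annual income (assumed definition)
toy["Debt to Income"] = toy["Monthly Debt"] * 12 / toy["Annual Income"]

# Rank features by absolute Pearson correlation with the target
corr = (
    toy.corr()["Credit Default"]
    .drop("Credit Default")
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

Pearson correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless to a non-linear model.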
2. Preprocessing in Practice
2.1 Handling object-Type Variables
# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)
# Show the value counts of each string column
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())
Handling of each variable:
- Home Ownership: label encoding
mapping = {'Own Home': 1, 'Rent': 2, 'Have Mortgage': 3, 'Home Mortgage': 4}
data['Home Ownership'] = data['Home Ownership'].map(mapping)
data.head()
- Years in current job: label encoding
years_in_job_mapping = {
    '< 1 year': 1, '1 year': 2, '2 years': 3, '3 years': 4, '4 years': 5, '5 years': 6,
    '6 years': 7, '7 years': 8, '8 years': 9, '9 years': 10, '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)
- Purpose: one-hot encoding
data = pd.get_dummies(data, columns=['Purpose'])
# Convert the boolean one-hot columns to integers
for col in data.columns:
    if 'Purpose' in col:
        data[col] = data[col].astype(int)
- Term: 0/1 mapping
term_mapping = {'Short Term': 0, 'Long Term': 1}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)
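2.2 Handling Numeric Variables
# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Fill missing values with the median
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
    data[feature] = data[feature].fillna(median_value)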
Data information after processing:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Id                            7500 non-null   int64
 1   Home Ownership                7500 non-null   int64
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64
 17  Purpose_business loan         7500 non-null   int32
 18  Purpose_buy a car             7500 non-null   int32
 19  Purpose_buy house             7500 non-null   int32
 20  Purpose_debt consolidation    7500 non-null   int32
 21  Purpose_educational expenses  7500 non-null   int32
 22  Purpose_home improvements     7500 non-null   int32
 23  Purpose_major purchase        7500 non-null   int32
 24  Purpose_medical bills         7500 non-null   int32
 25  Purpose_moving                7500 non-null   int32
 26  Purpose_other                 7500 non-null   int32
 27  Purpose_renewable energy      7500 non-null   int32
 28  Purpose_small business        7500 non-null   int32
 29  Purpose_take a trip           7500 non-null   int32
 30  Purpose_vacation              7500 non-null   int32
 31  Purpose_wedding               7500 non-null   int32
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB
3. Model Building and Evaluation
3.1 Splitting the Data
from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features
y = data['Credit Default']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")
Result:
Training set shape: (6000, 31), test set shape: (1500, 31)
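For a classification target whose classes are imbalanced, a stratified split is a common variation on the call above. The sketch below is not what this article uses; it runs on synthetic data, and the 28% positive rate is a hypothetical figure chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 28% positives (hypothetical class balance)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 280 + [0] * 720, name="Credit Default")

# stratify=y_toy keeps the class proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate in training set: {y_tr.mean():.3f}")
print(f"Positive rate in test set: {y_te.mean():.3f}")
```

Without stratify, the positive rate in the test set can drift away from the overall rate by chance, which makes metrics such as recall harder to compare across runs.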
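3.2 Model Training and Evaluation
Several common classifiers are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.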
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")
# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")
# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")
# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
print(f"Precision: {precision_score(y_test, logreg
@浙大疏锦行