2025/9/25 22:39:29/
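Contents
1. Logistic regression for binary classification
2. Spam filtering
2.1 Performance metrics
2.2 Accuracy
2.3 Precision and recall
2.4 F1 score
2.5 ROC and AUC
3. Tuning hyperparameters with grid search
4. Multi-class classification
5. Multi-label classification
5.1 Multi-label classification performance metrics

These are study notes on the book scikit-learn机器学习 (2nd edition). Logistic regression is commonly used for classification tasks.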
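1. Logistic regression for binary classification
See also: 《统计学习方法》, the logistic regression (LR) model.
Definition: let $X$ be a continuous random variable. $X$ follows the logistic distribution if it has the following distribution function and density function:
$$F(x) = P(X \leq x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$
$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
Here $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter.
In logistic regression, when the predicted probability is greater than or equal to a threshold, the positive class is predicted; otherwise the negative class is predicted.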
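The two formulas and the threshold rule can be sketched in a few lines of plain Python. This is only an illustration: the defaults mu=0 and gamma=1 and the 0.5 threshold are assumptions chosen for the example, not values taken from the notes.

```python
import math


def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))


def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density f(x) = F'(x) of the logistic distribution."""
    z = math.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)


def predict_label(probability, threshold=0.5):
    """Positive class (1) when the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


probability = logistic_cdf(2.0)
print(probability, predict_label(probability))
```

Raising the threshold makes the classifier more reluctant to predict the positive class, which is the trade-off the later sections on precision and recall measure.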
2. Spam filtering
Extract TF-IDF features from the messages and classify them with logistic regression.
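To make "TF-IDF features" concrete, here is a rough pure-Python sketch. It assumes whitespace tokenization and the smoothed weighting idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as the TfidfVectorizer default; the three messages are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0.0:
            vectors.append({})
        else:
            vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical messages, not rows from SMSSpamCollection.
docs = ["free prize call now", "call me later", "see you later"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first message, "free" appears in only one document while "call" appears in two, so "free" receives the larger weight: rare terms count for more.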
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer

# Tab-separated file without a header row: column 0 is the label, column 1 is the message
data = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
data[data[0] == 'ham'][0].count()   # 4825 legitimate (ham) messages
data[data[0] == 'spam'][0].count()  # 747 spam messages

X = data[1].values
y = data[0].values
lb = LabelBinarizer()
y = lb.fit_transform(y)  # ham -> 0, spam -> 1, returned as a column vector

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

# Fit the vocabulary and IDF weights on the training messages only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train.ravel())  # ravel() passes the labels as a 1-D array

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print('Predicted: %s, message: %s, actual: %s' % (pred_i, X_test_raw[i], y_test[i]))

Predicted: 0, message: Aww thats the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl cant go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads dont come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zahers? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]

2.1 Performance metrics
Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Use a separate name so the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, pred)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()
plt.show()

2.2 Accuracy
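scores = cross_val_score(classifier, X_train, y_train.ravel(), cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))

Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318

Accuracy alone is not a very suitable metric here: it cannot tell whether an error is a positive predicted as negative (a false negative) or a negative predicted as positive (a false positive).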
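A quick way to see the weakness of accuracy on this data set is to score a classifier that never flags anything. The sketch below uses only the class counts quoted earlier (4825 ham, 747 spam); the "always predict ham" rule is a hypothetical baseline, not something the notes train.

```python
n_ham, n_spam = 4825, 747

# True labels: 0 = ham, 1 = spam.
y_true = [0] * n_ham + [1] * n_spam
# Hypothetical baseline: call every message ham.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print('Accuracy of always predicting ham: %.4f' % accuracy)
print('Spam messages caught: %d of %d' % (spam_caught, n_spam))
```

The baseline scores roughly 87% accuracy while catching no spam at all, which is why the following sections look at precision, recall and F1 instead.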
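2.3 Precision and recall
See also: [Hands On ML] 3. Classification (MNIST handwritten digit prediction)
Looking at precision alone or recall alone is not meaningful, because either one can be pushed up at the expense of the other.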
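As a reminder of the definitions, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from confusion-matrix counts; the counts are invented for illustration and are not the counts of the spam classifier in these notes.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical cautious classifier: flags few messages, misses some spam.
cautious = {'tp': 8, 'fp': 2, 'fn': 4}
# Hypothetical classifier that flags every message: no spam missed, many false alarms.
flag_everything = {'tp': 12, 'fp': 88, 'fn': 0}

for name, c in (('cautious', cautious), ('flag everything', flag_everything)):
    print(name, precision(c['tp'], c['fp']), recall(c['tp'], c['fn']))
```

The second classifier reaches a recall of 1.0 with a precision of only 0.12, which shows why the two numbers have to be read together.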
from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)

Precision: 0.9852941176470589
Almost every message predicted as spam really is spam.
Recall: 0.6979166666666666
About 30% of the spam messages were predicted as non-spam.

2.4 F1 score
The F1 score balances the precision and recall above: it is their harmonic mean.
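The harmonic mean is F1 = 2PR / (P + R). As a small check, the sketch below plugs in the precision and recall reported in section 2.3; the helper name f1_from is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p = 0.9852941176470589  # precision reported in section 2.3
r = 0.6979166666666666  # recall reported in section 2.3
print('F1 score: %s' % f1_from(p, r))
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1 score sits closer to the weak recall than an arithmetic average would.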
f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074

2.5 ROC and AUC
The closer a classifier's AUC (the area under the ROC curve) is to 1, the better; a random classifier has an AUC of 0.5.
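One way to build intuition for these two reference values: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python; the labels and scores are toy values, not outputs of the spam classifier.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
print(auc_by_ranking(y_true, [0.1, 0.2, 0.8, 0.9]))   # every positive outranks every negative
print(auc_by_ranking(y_true, [0.5, 0.5, 0.5, 0.5]))   # scores carry no information
print(auc_by_ranking(y_true, [0.1, 0.4, 0.35, 0.8]))  # one pair is ranked the wrong way
```

This pairwise version is quadratic in the number of examples, so it is meant for understanding the metric rather than for large test sets.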
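from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Score with the predicted spam probability rather than the hard 0/1 labels,
# so the curve is traced over many thresholds instead of a single one
spam_probability = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test.ravel(), spam_probability)
# Use a separate name so the imported roc_auc_score function is not shadowed
roc_auc = roc_auc_score(y_test.ravel(), spam_probability)

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()

3. Tuning hyperparameters with grid search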
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the l1 and l2 penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),  # key format: <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == '__main__':
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)

    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))

    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))

Best score: 0.985
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.5
    vect__max_features: 5000
    vect__ngram_range: (1, 2)
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231

Tuning the parameters improved recall (about 0.86 here, compared with about 0.70 before tuning).
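It helps to know how much work a grid like the one in this section implies before launching it. The sketch below enumerates the same parameter grid with itertools.product and counts the candidate settings; it only counts combinations and does not call scikit-learn.

```python
from itertools import product

# Same grid as in this section, copied here so the snippet stands alone.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# Every candidate is one choice per parameter.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_candidates = len(candidates)
cv_folds = 3  # matches cv=3 in the grid search above
total_fits = n_candidates * cv_folds

print('Candidates: %d, cross-validation fits: %d' % (n_candidates, total_fits))
```

The count grows multiplicatively with every parameter added, which is the reason the search is run with n_jobs=-1.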
4. Multi-class classification
Predicting the sentiment of movie reviews.
data = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
data['Sentiment'].describe()

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64

On average, the sentiment is fairly neutral.
data['Sentiment'].value_counts() / data['Sentiment'].count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

About 50% of the examples carry the neutral label (2).
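The class proportions above also give a useful reference point: a classifier that always predicts the most frequent label would be right about half the time. The sketch below works that out from the printed proportions, assuming any held-out data has the same class mix.

```python
# Class proportions copied from the value_counts() output above.
class_proportions = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}

# Majority-class baseline: always predict the most frequent label.
majority_class = max(class_proportions, key=class_proportions.get)
baseline_accuracy = class_proportions[majority_class]

print('Majority class: %d' % majority_class)
print('Baseline accuracy: %.3f' % baseline_accuracy)
```

Any accuracy reported for this task is easier to judge against this baseline than against zero.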
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
# Hold out half of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Best score: 0.619
Best parameters set:
    clf__C: 10
    vect__max_df: 0.25
    vect__ngram_range: (1, 2)
    vect__use_idf: False

Performance metrics
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))

Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030

5. Multi-label classification
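An instance can be assigned more than one label.
Problem transformation
- Turn each combination of labels seen on an instance (say L1 and L2) into a single class "L1 and L2", and so on. Drawbacks: this produces a large number of classes, and the model can only learn the combinations that occur in the training data, so many possible combinations may not be covered.
- Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores the relationships between labels.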
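The two transformations can be illustrated on a handful of made-up instances. This sketch only builds the training targets each approach would use; the label sets are hypothetical and no classifier is trained.

```python
# Hypothetical label sets for four instances.
samples = [{'L1', 'L2'}, {'L1'}, {'L2'}, {'L1', 'L2'}]
all_labels = sorted(set().union(*samples))

# Transformation 1: every distinct label combination becomes one class.
combined_targets = [' and '.join(sorted(labels)) for labels in samples]

# Transformation 2: one binary target vector per label, one classifier each.
binary_targets = {
    label: [int(label in labels) for labels in samples]
    for label in all_labels
}

print(combined_targets)
print(binary_targets)
```

With k labels the first approach can need up to 2^k classes, while the second always needs exactly k binary problems but treats each label in isolation.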
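5.1 Multi-label classification performance metrics
- Hamming loss: the average fraction of incorrect labels; 0 is best.
- Jaccard similarity coefficient: the size of the intersection of the predicted and true labels divided by the size of their union; 1 is best.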
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

y_true = np.array([[0.0, 1.0], [1.0, 1.0]])

print(hamming_loss(y_true, np.array([[0.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(y_true, np.array([[1.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(y_true, np.array([[1.0, 1.0], [0.0, 1.0]])))

# average=None returns one Jaccard score per label (column)
print(jaccard_score(y_true, np.array([[0.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(y_true, np.array([[1.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(y_true, np.array([[1.0, 1.0], [0.0, 1.0]]), average=None))

0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
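The printed values can be checked by hand from the definitions. The sketch below reimplements both metrics in plain Python for 0/1 label matrices; the function names are made up for this example, and returning 0.0 when a label's union is empty is an assumption of the sketch.

```python
def hamming_loss_by_hand(y_true, y_pred):
    """Fraction of individual label entries that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def jaccard_per_label(y_true, y_pred):
    """Per-label |intersection| / |union| of the instances carrying that label."""
    scores = []
    for j in range(len(y_true[0])):
        pairs = [(true_row[j], pred_row[j]) for true_row, pred_row in zip(y_true, y_pred)]
        intersection = sum(1 for t, p in pairs if t == 1 and p == 1)
        union = sum(1 for t, p in pairs if t == 1 or p == 1)
        scores.append(intersection / union if union else 0.0)
    return scores


y_true = [[0, 1], [1, 1]]
for y_pred in ([[0, 1], [1, 1]], [[1, 1], [1, 1]], [[1, 1], [0, 1]]):
    print(hamming_loss_by_hand(y_true, y_pred), jaccard_per_label(y_true, y_pred))
```

In the third case both entries of the first label are wrong, so that label's Jaccard score drops to 0 while the second label stays at 1.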