CT5120 Introduction to Natural Language Processing, Lab 4: Text Classification


# 4. Text Classification
## 4.0 Learning Objectives

* Conduct exploratory data analysis (EDA)
* Preprocess text
* Extract features
* Train, predict with, and evaluate ML models
* Run inference on new text
First, let us download the required corpora and import our dependencies.

```python
import nltk
nltk.download(['brown', 'stopwords'])
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import brown
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
import string
import pandas as pd
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
```

## 4.1 Scikit-learn (sklearn)

**sklearn** (scikit-learn) is a machine learning framework that also provides convenient tools for feature extraction.

Some useful sklearn classes:
1. [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn-feature-extraction-text-countvectorizer): this
converts a collection of text documents to a matrix of token counts.

2. [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html): this converts a collection of raw documents to a matrix of TF-IDF features.

3. [sklearn.feature_extraction.text.TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html): this transforms a count matrix to a normalized TF or TF-IDF representation.

4. [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html): this builds a text report showing the main classification metrics. It shows macro-average, weighted-average and per class scores for `precision`, `recall` and `f1`. It also displays support, which is the actual occurrence of the class/label in the dataset.
In order to feed the text to `CountVectorizer`, it needs to exist as plain sentence strings, as shown in the following example:

| text | label |
| ---- | ----- |
|The capital expansion programs business firms involve multi-year budgeting true country development programs|government|
|Now Dogtown one places creeps marrow worms get old wood veneer|mystery|
|This claim submitted District Court dismissed 126 F.Supp.235 alleged violation 7 Clayton Act also 1 2 Sherman Act|government|
|Mrs. Meeker struck ready seek anyone's advice least Garth's| mystery|
|Richmond Va. |government|

Basically, we need:

* `X`: an array of sentences
* `y`: an array of corresponding labels

The corpus we are using is already tokenized, so it could be used as is. In real life, however, a corpus is rarely pre-tokenized, so we first prepare the data as sentences and labels before proceeding with the exercise.
## 4.2 Tokenization and Detokenization
Detokenization is conceptually similar to the string `.join()` method.

The default tokenization method in NLTK uses regular expressions as defined in the Penn Treebank (designed for English text). It assumes that the text is already split into sentences.

This is a very useful form of tokenization, since it incorporates several linguistic rules to split the sentence into well-formed tokens.

A detokenizer is required to put the sentence back together from a list of words, with punctuation properly attached.


[Demo link](https://github.com/gauneg/material_lab_nlp)
```python
detokenizer = TreebankWordDetokenizer()
tokenizer = TreebankWordTokenizer()
```
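As a quick illustration, here is a round trip on a made-up sentence (the example text below is our own, not from the corpus):

```python
sentence = "Mr. Holmes doesn't trust the government's report."  # made-up example
tokens = tokenizer.tokenize(sentence)   # splits clitics such as "n't" and "'s" into separate tokens
print(tokens)
print(detokenizer.detokenize(tokens))   # re-attaches punctuation to rebuild the sentence
```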
## 4.3 Dataset

The [Brown Corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html) consists of one million words of American English texts printed in 1961. The texts for the corpus were sampled from 15 different text categories to make the corpus a good standard reference.

From this dataset we select two categories:
* government: Text from government documents
* mystery: Text from mystery and detective fiction

We then create our own dataset by detokenizing the sentences from these two categories and shuffling them.
```python
for category in brown.categories():
    corpus_length = len(brown.sents(categories=[category]))
    print(f'Category: {category:<16}, Dataset Size: {corpus_length}')

english_stopwords = stopwords.words('english')
punctuations = list(string.punctuation)

print('\n\nSelecting `government` and `mystery` categories from brown corpus')

def filter_and_join(sent_arr, lab):
    filtered_tokens = [token for token in sent_arr if (token not in english_stopwords and token not in punctuations)]
    return [detokenizer.detokenize(filtered_tokens), lab]

## Using the filter_and_join function on all the text inputs of the government category
government_text = list(map(lambda x: filter_and_join(x, 'government'), brown.sents(categories=['government'])))

## Using the filter_and_join function on all the text inputs of the mystery category
mystery_text = list(map(lambda x: filter_and_join(x, 'mystery'), brown.sents(categories=['mystery'])))

dataset = pd.DataFrame(government_text + mystery_text, columns=['text', 'label'])
dataset = dataset.sample(frac=1)  # Shuffle the rows
dataset.head()
```
## 4.4 Exploratory Data Analysis (EDA)
Let us check the first five rows and the last five rows of the dataset.
```python
dataset.head()      # First five rows
dataset.tail()      # Last five rows
dataset.shape       # Number of rows and columns
dataset.describe()  # Dataset statistics
```
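As one more quick check, we can look at the class balance, since a heavily skewed label distribution would affect the classifier:

```python
dataset['label'].value_counts()  # Number of sentences per class
```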
## 4.5 Overall Task

Use the given corpus to perform the following tasks:

1. Data Split: Split the dataset into train and test sets (train = 90%, test = 10%).

2. Feature Extraction: Using the `text` column, extract representations, i.e., count vectors and TF-IDF features.

3. Train ML model: Use the extracted representations to train two `Naive Bayes` models (a model trained using the Count Vectors and another trained on TF-IDF features).

4. Evaluation: Calculate the precision, recall and F1 score.
Hint: Use the classification report.

5. Inference: Use the given strings and the trained models to predict the class/label of the text.

**OPTIONAL TASK**: Train any other model of your choice and try to achieve better performance than Naive Bayes.
## 4.6 Exercise 1
**Instruction**: Split the dataset into train and test sets. The test set should be 10% of the overall dataset size.

Hint: Use sklearn's `train_test_split` function.

```python
# Enter your code below this line

print(dataset.shape)
print(train_data.shape)
print(test_data.shape)
train_data.head()
```
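One possible solution (a sketch): sklearn's `train_test_split` can split the DataFrame directly; the `random_state` value below is an arbitrary choice for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows as the test set
train_data, test_data = train_test_split(dataset, test_size=0.1, random_state=42)
```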
## 4.7 Feature Engineering using Raw Counts and TF-IDF

 

### 4.7a Example
Let us look at an example of vector representation of a text using counts.

**Note**: Try not to mix up calling `fit()` for learning vocabulary during feature extraction and calling `fit()` during model training.

```python
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

from sklearn.feature_extraction.text import CountVectorizer
count_corpus = CountVectorizer() # Create an object of the CountVectorizer class
count_corpus.fit(corpus) # Learn vocabulary on the corpus
print(count_corpus.get_feature_names_out()) # Display the learned vocabulary

# Extract token counts out of raw text documents using the vocabulary fitted with fit
count_corpus_transform = count_corpus.transform(corpus)

print() # This is just a line break
print(count_corpus_transform.toarray()) # Display the vectors
```
- In the example above, the method `get_feature_names_out()` returns the vocabulary of the corpus, i.e., its unique words.
- Each document in the corpus is represented with reference to this vocabulary.
- Example: document 1, i.e. **"This is the first document."**, can be rearranged against the vocabulary as **[0, "document", "first", "is", 0, 0, "the", 0, "this"]**, which is then transformed into the count vector **[0 1 1 1 0 0 1 0 1]** based on the number of times each word occurs in the document.

The example below shows the vector representation of the above corpus using TF-IDF.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Create another instance/object of the TfidfVectorizer
tfidf_corpus = TfidfVectorizer()

tfidf_corpus.fit(corpus) # Learn vocabulary on the corpus
print(tfidf_corpus.get_feature_names_out()) # Display the learned vocabulary

# Extract TF-IDF features out of raw text documents using the vocabulary fitted with fit
tfidf_corpus_transform = tfidf_corpus.transform(corpus)

print() # This is just a line break
print(tfidf_corpus_transform.toarray()) # Display the vectors
```
- Similar to count vector, each index of the TF-IDF features represents a word in the vocabulary.
- Each value represents the L2-normalized TF-IDF score of the word in the document.
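For reference, with its default settings (`smooth_idf=True`), scikit-learn computes the TF-IDF of a term `t` in a document `d` as shown below, where `n` is the total number of documents and `df(t)` is the number of documents containing `t`; each document vector is then L2-normalized:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \bigg(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\bigg)$$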
### 4.7b Performing feature extraction on the provided dataset
First, let us split our
* train_data into X_train and y_train
* test_data into X_test and y_test

Note that `X` is capitalized, following the common sklearn convention (uppercase for feature matrices, lowercase for label vectors).

```python
X_train = train_data["text"]
y_train = train_data["label"]

X_test = test_data["text"]
y_test = test_data["label"]
```
Let us apply the same technique we learnt in 4.7a to our dataset.

```python
# Using CountVectorizer
count_vector = CountVectorizer()
count_vector.fit(X_train)
print(count_vector.get_feature_names_out()[-50:]) # Display the last 50 items in the vocabulary

X_train_counts = count_vector.transform(X_train)

print() # This is just a line break
print(X_train_counts.toarray())
```
```python
# Using TfidfVectorizer
tfidf_vector = TfidfVectorizer()
tfidf_vector.fit(X_train)
print(tfidf_vector.get_feature_names_out()[-50:]) # Display the last 50 items in the vocabulary

X_train_tfidf = tfidf_vector.transform(X_train)

print() # This is just a line break
print(X_train_tfidf.toarray())
```
## 4.8 Exercise 2
The features for the training set have already been generated. Now, generate the features for the test set.

**Warning:** Make sure that you do not change the features based on the test set. In other words, do not call the `fit()` method of the `CountVectorizer` or `TfidfVectorizer` on the test split; only `fit()` on the train split when learning the vector representations.

```python
X_test_counts = # Enter your code here
X_test_tfidf = # Enter your code here
```
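One possible solution (a sketch): reuse the vectorizers fitted on the train split and call `transform()` only.

```python
# transform() only: the vocabularies were already learned on X_train
X_test_counts = count_vector.transform(X_test)
X_test_tfidf = tfidf_vector.transform(X_test)
```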
## 4.9 Training and Evaluation
In this section, we will use a Naive Bayes classifier as a case study.

**Naive Bayes** is a generative classification model.

A generative model learns parameters by maximizing the joint probability $P(X, Y)$ through Bayes' rule, by learning $P(Y)$ and $P(X \mid Y)$ (where $X$ are features and $Y$ are labels).

Prediction with Naive Bayes:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(\text{features} \mid \text{label})}{P(\text{features})}$$

The assumption that all features are independent simplifies the formula to:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(f_1 \mid \text{label}) \times \dots \times P(f_n \mid \text{label})}{P(\text{features})}$$

```python
# Import dependencies
from sklearn.naive_bayes import MultinomialNB
from tqdm import tqdm
from sklearn.metrics import classification_report
```
### 4.9a Training

Let us train a Naive Bayes classifier using the count features.

```python
NB_classifier_counts = MultinomialNB() # Create an object of the Naive Bayes class
NB_classifier_counts.fit(X_train_counts.toarray(), y_train) # The fit() method here trains on the train vectors and labels
```
### 4.9b Prediction

Let us make **predictions** on our test data.

Remember that prediction is performed on the test features only. In other words, given a particular sentence, what label would the model predict?
```python
preds = NB_classifier_counts.predict(X_test_counts.toarray())
```
### 4.9c Evaluation

Let us **evaluate** our model performance using automatic metrics: *Precision*, *Recall* and *F1 Score*.

To do this, we can leverage the `classification_report()` function from `sklearn.metrics`. The function takes two arguments: the true labels and predictions.
```python
print(classification_report(y_test, preds))
```
## 4.10 Exercise 3
Train Naive Bayes using TF-IDF features. Is there a difference in the performance compared to using count vectors?
```python
NB_classifier_tfidf = # Enter your code here
# Enter your code to train the NB_classifier_tfidf model below this line
```
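One possible solution (a sketch), mirroring the count-based training in 4.9a:

```python
NB_classifier_tfidf = MultinomialNB()
NB_classifier_tfidf.fit(X_train_tfidf.toarray(), y_train) # Train on the TF-IDF features
```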

## 4.11 Exercise 4
Evaluate the results on the test set.
```python
preds_tfidf = # Enter your code here
# Enter your code below this line to display Precision, Recall and F1 score
```
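One possible solution (a sketch):

```python
preds_tfidf = NB_classifier_tfidf.predict(X_test_tfidf.toarray())
print(classification_report(y_test, preds_tfidf)) # Precision, recall and F1 per class
```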

## 4.12a Evaluating our model on random examples (texts from the internet)

We are given three texts containing different reviews. We will use our model to classify these texts.

The backslash `\` in the texts signifies that the text on the next line is a continuation of the previous line; we break the strings across lines only for readability. So we have just three texts altogether.
```python
citizen_info_ireland = '. The Government is chosen by and is collectively responsible to the Dáil. \
There must be a minimum of 7 and a maximum of 15 Ministers. \
The Taoiseach, the Tanaiste and the Minister for Finance must be members of the Dáil.\
It is possible to have 2 Ministers who are members of the Senate but this rarely happens.'

gone_girl_review = 'Audience Reviews for Gone Girl ... \
Mesmerizing performances, tense atmosphere, unexpected plot twists and turns \
of events, this movie is a real crime thriller!'

sherlock_bbc_review = 'Dr Watson, a former army doctor, finds himself sharing a flat with Sherlock Holmes, \
an eccentric individual with a knack for solving crimes. Together, they take on the most unusual cases.'
```
## 4.12b Exercise 5
Predict the labels for the above texts, using either of the models trained in the previous exercises.

First, let us convert the given texts into vector representations using CountVectorizer or TfidfVectorizer.
```python
# Using count vectors
# We reuse the count vectorizer fitted on X_train. No need to call fit() again
# We pass the three texts at once as a list to the transform() function

new_test_count = count_vector.transform([citizen_info_ireland, gone_girl_review, sherlock_bbc_review])
```
Since the three texts were vectorized with `CountVectorizer`, we will use the Naive Bayes classifier trained on the count features (`NB_classifier_counts`).
```python
# Enter your code below this line
```
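One possible solution (a sketch), predicting with the count-based classifier:

```python
print(NB_classifier_counts.predict(new_test_count.toarray()))
```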

The above prediction shows that our model predicted the three texts as: **'government', 'mystery', 'government'** respectively.

🤔 From manual inspection, is the model prediction accurate?
Let us check the probability of the predictions.
```python
NB_classifier_counts.predict_proba(new_test_count.toarray())
```
## Exercise 6 [OPTIONAL]: Random Forest Classifier

Random Forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
* Train on the train data
* Predict on test features
* Test the model on the three random texts
```python
from sklearn.ensemble import RandomForestClassifier

# Enter your code below this line. Save the model as random_forest_classifier

print(random_forest_classifier.predict(new_test_count.toarray()))
print(random_forest_classifier.predict_proba(new_test_count.toarray()))
```
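One possible solution (a sketch); `n_estimators` and `random_state` below are arbitrary choices. Once the model is fitted, the two `print()` calls above yield the predictions and probabilities for the three texts.

```python
# Train a random forest on the count features
random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest_classifier.fit(X_train_counts, y_train)

# Precision, recall and F1 on the held-out test set
print(classification_report(y_test, random_forest_classifier.predict(X_test_counts)))
```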
### The End
