CT5120 Introduction to Natural Language Processing, Lab 4: Text Classification


# 4. Text Classification
## 4.0 Learning Objectives

* Conduct exploratory data analysis (EDA)
* Preprocess text
* Extract features
* Train, predict with, and evaluate ML models
* Run inference on new text
First, let us download the required corpora and import our dependencies.

```python
import nltk
nltk.download(['brown', 'stopwords'])
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import brown
from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
import string
import pandas as pd
from nltk.tokenize.treebank import TreebankWordDetokenizer, TreebankWordTokenizer
```

## 4.1 Scikit-learn (sklearn)

**sklearn** (scikit-learn) is a machine learning framework that also provides convenient tools for feature extraction.

Some useful sklearn classes:
1. [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn-feature-extraction-text-countvectorizer): this
converts a collection of text documents to a matrix of token counts.

2. [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html): this converts a collection of raw documents to a matrix of TF-IDF features.

3. [sklearn.feature_extraction.text.TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html): this transforms a count matrix to a normalized TF or TF-IDF representation.

4. [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html): this builds a text report showing the main classification metrics. It shows macro-average, weighted-average and per class scores for `precision`, `recall` and `f1`. It also displays support, which is the actual occurrence of the class/label in the dataset.
In order to feed the text to `CountVectorizer`, it needs to exist as plain sentence strings, as shown in the following example:

| text | label |
| ---- | ----- |
|The capital expansion programs business firms involve multi-year budgeting true country development programs|government|
|Now Dogtown one places creeps marrow worms get old wood veneer|mystery|
|This claim submitted District Court dismissed 126 F.Supp.235 alleged violation 7 Clayton Act also 1 2 Sherman Act|government|
|Mrs. Meeker struck ready seek anyone's advice least Garth's| mystery|
|Richmond Va. |government|

Basically, we need:

* `X`: an array of sentences
* `y`: an array of corresponding labels

The corpus we are using is already tokenized, so it could be used as is. In real life, however, a corpus is rarely pre-tokenized, so we first prepare the data as sentences and labels before proceeding with the exercise.
## 4.2 Tokenization and Detokenization
Detokenization is conceptually similar to the string `.join()` method.

The default tokenization method in NLTK uses regular expressions as defined in the Penn Treebank (designed for English text). It assumes that the text is already split into sentences.

This is a very useful form of tokenization, since it incorporates several linguistic rules to split the sentence into well-formed tokens.

A detokenizer is required to put the sentence back together from a list of words, with punctuation properly attached.


[Demo link](https://github.com/gauneg/material_lab_nlp)
```python
detokenizer = TreebankWordDetokenizer()
tokenizer = TreebankWordTokenizer()
```
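As a quick illustration, here is a round trip on a made-up sentence (the example text below is our own, not from the corpus):

```python
sentence = "Mr. Holmes doesn't trust the government's report."  # made-up example
tokens = tokenizer.tokenize(sentence)   # splits clitics such as "n't" and "'s" into separate tokens
print(tokens)
print(detokenizer.detokenize(tokens))   # re-attaches punctuation to rebuild the sentence
```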
## 4.3 Dataset

The [Brown Corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html) consists of one million words of American English texts printed in 1961. The texts for the corpus were sampled from 15 different text categories to make the corpus a good standard reference.

From this dataset we select two categories:
* government: Text from government documents
* mystery: Text from mystery and detective fiction

We then create our own dataset by detokenizing the sentences from these two categories and shuffling them.
```python
for category in brown.categories():
    corpus_length = len(brown.sents(categories=[category]))
    print(f'Category: {category:<16}, Dataset Size: {corpus_length}')

english_stopwords = stopwords.words('english')
punctuations = list(string.punctuation)

print('\n\nSelecting `government` and `mystery` categories from brown corpus')

def filter_and_join(sent_arr, lab):
    filtered_tokens = [token for token in sent_arr if (token not in english_stopwords and token not in punctuations)]
    return [detokenizer.detokenize(filtered_tokens), lab]

## Using the filter_and_join function on all the text inputs of the government category
government_text = list(map(lambda x: filter_and_join(x, 'government'), brown.sents(categories=['government'])))

## Using the filter_and_join function on all the text inputs of the mystery category
mystery_text = list(map(lambda x: filter_and_join(x, 'mystery'), brown.sents(categories=['mystery'])))

dataset = pd.DataFrame(government_text + mystery_text, columns=['text', 'label'])
dataset = dataset.sample(frac=1)  # Shuffle the rows
dataset.head()
```
## 4.4 Exploratory Data Analysis (EDA)
Let us check the first five rows and the last five rows of the dataset.
```python
dataset.head()      # First five rows
dataset.tail()      # Last five rows
dataset.shape       # Number of rows and columns
dataset.describe()  # Dataset statistics
```
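As one more quick check, we can look at the class balance, since a heavily skewed label distribution would affect the classifier:

```python
dataset['label'].value_counts()  # Number of sentences per class
```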
## 4.5 Overall Task

Use the given corpus to perform the following tasks:

1. Data Split: Split the dataset into train and test sets (train = 90%, test = 10%).

2. Feature Extraction: Using the `text` column, extract representations, i.e., count vectors and TF-IDF features.

3. Train ML model: Use the extracted representations to train two `Naive Bayes` models (a model trained using the Count Vectors and another trained on TF-IDF features).

4. Evaluation: Calculate the precision, recall and F1 score.
Hint: Use the classification report.

5. Inference: Use the given strings and the trained models to predict the class/label of the text.

**OPTIONAL TASK**: Train any other model of your choice and try to achieve better performance than Naive Bayes.
## 4.6 Exercise 1
**Instruction**: Split the dataset into train and test sets. The test set should be 10% of the overall dataset size.

Hint: Use sklearn's `train_test_split` function.

```python
# Enter your code below this line

print(dataset.shape)
print(train_data.shape)
print(test_data.shape)
train_data.head()
```
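One possible solution (a sketch): sklearn's `train_test_split` can split the DataFrame directly; the `random_state` value below is an arbitrary choice for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows as the test set
train_data, test_data = train_test_split(dataset, test_size=0.1, random_state=42)
```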
## 4.7 Feature Engineering using Raw Counts and TF-IDF

 

### 4.7a Example
Let us look at an example of vector representation of a text using counts.

**Note**: Try not to mix up calling `fit()` for learning vocabulary during feature extraction and calling `fit()` during model training.

```python
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

from sklearn.feature_extraction.text import CountVectorizer
count_corpus = CountVectorizer() # Create an object of the CountVectorizer class
count_corpus.fit(corpus) # Learn vocabulary on the corpus
print(count_corpus.get_feature_names_out()) # Display the learned vocabulary

# Extract token counts out of raw text documents using the vocabulary fitted with fit
count_corpus_transform = count_corpus.transform(corpus)

print() # This is just a line break
print(count_corpus_transform.toarray()) # Display the vectors
```
- In the example above, the method `get_feature_names_out()` returns the vocabulary of the corpus, i.e., its unique words.
- Each document in the corpus is represented with reference to this vocabulary.
- Example: document 1, i.e. **"This is the first document."**, can be rearranged against the vocabulary as **[0, "document", "first", "is", 0, 0, "the", 0, "this"]**, which is then transformed into the count vector **[0 1 1 1 0 0 1 0 1]** based on the number of times each word occurs in the document.

The example below shows the vector representation of the above corpus using TF-IDF.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Create another instance/object of the TfidfVectorizer
tfidf_corpus = TfidfVectorizer()

tfidf_corpus.fit(corpus) # Learn vocabulary on the corpus
print(tfidf_corpus.get_feature_names_out()) # Display the learned vocabulary

# Extract TF-IDF features out of raw text documents using the vocabulary fitted with fit
tfidf_corpus_transform = tfidf_corpus.transform(corpus)

print() # This is just a line break
print(tfidf_corpus_transform.toarray()) # Display the vectors
```
- Similar to count vector, each index of the TF-IDF features represents a word in the vocabulary.
- Each value represents the L2-normalized TF-IDF score of the word in the document.
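For reference, with its default settings (`smooth_idf=True`), scikit-learn computes the TF-IDF of a term `t` in a document `d` as shown below, where `n` is the total number of documents and `df(t)` is the number of documents containing `t`; each document vector is then L2-normalized:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \bigg(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\bigg)$$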
### 4.7b Performing feature extraction on the provided dataset
First, let us split our
* train_data into X_train and y_train
* test_data into X_test and y_test

Note that `X` is capitalized, following the common sklearn convention (uppercase for feature matrices, lowercase for label vectors).

```python
X_train = train_data["text"]
y_train = train_data["label"]

X_test = test_data["text"]
y_test = test_data["label"]
```
Let us apply the same technique we learnt in 4.7a to our dataset.

```python
# Using CountVectorizer
count_vector = CountVectorizer()
count_vector.fit(X_train)
print(count_vector.get_feature_names_out()[-50:]) # Display the last 50 items in the vocabulary

X_train_counts = count_vector.transform(X_train)

print() # This is just a line break
print(X_train_counts.toarray())
```
```python
# Using TfidfVectorizer
tfidf_vector = TfidfVectorizer()
tfidf_vector.fit(X_train)
print(tfidf_vector.get_feature_names_out()[-50:]) # Display the last 50 items in the vocabulary

X_train_tfidf = tfidf_vector.transform(X_train)

print() # This is just a line break
print(X_train_tfidf.toarray())
```
## 4.8 Exercise 2
The features for the training set have already been generated. Now, generate the features for the test set.

**Warning:** Make sure that you do not change the features based on the test set. In other words, do not call the `fit()` method of the `CountVectorizer` or `TfidfVectorizer` on the test split; only `fit()` on the train split when learning the vector representations.

```python
X_test_counts = # Enter your code here
X_test_tfidf = # Enter your code here
```
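One possible solution (a sketch): reuse the vectorizers fitted on the train split and call `transform()` only.

```python
# transform() only: the vocabularies were already learned on X_train
X_test_counts = count_vector.transform(X_test)
X_test_tfidf = tfidf_vector.transform(X_test)
```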
## 4.9 Training and Evaluation
In this section, we will use a Naive Bayes classifier as a case study.

**Naive Bayes** is a generative classification model.

A generative model learns parameters by maximizing the joint probability $P(X, Y)$ through Bayes' rule, by learning $P(Y)$ and $P(X \mid Y)$ (where $X$ are features and $Y$ are labels).

Prediction with Naive Bayes:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(\text{features} \mid \text{label})}{P(\text{features})}$$

The assumption that all features are independent simplifies the formula to:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \times P(f_1 \mid \text{label}) \times \dots \times P(f_n \mid \text{label})}{P(\text{features})}$$

```python
# Import dependencies
from sklearn.naive_bayes import MultinomialNB
from tqdm import tqdm
from sklearn.metrics import classification_report
```
### 4.9a Training

Let us train a Naive Bayes classifier using the count features.

```python
NB_classifier_counts = MultinomialNB() # Create an object of the Naive Bayes class
NB_classifier_counts.fit(X_train_counts.toarray(), y_train) # The fit() method here trains on the train vectors and labels
```
### 4.9b Prediction

Let us make **predictions** on our test data.

Remember that prediction is performed on the test features only. In other words, given a particular sentence, what label would the model predict?
```python
preds = NB_classifier_counts.predict(X_test_counts.toarray())
```
### 4.9c Evaluation

Let us **evaluate** our model performance using automatic metrics: *Precision*, *Recall* and *F1 Score*.

To do this, we can leverage the `classification_report()` function from `sklearn.metrics`. The function takes two arguments: the true labels and predictions.
```python
print(classification_report(y_test, preds))
```
## 4.10 Exercise 3
Train Naive Bayes using TF-IDF features. Is there a difference in the performance compared to using count vectors?
```python
NB_classifier_tfidf = # Enter your code here
# Enter your code to train the NB_classifier_tfidf model below this line
```
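One possible solution (a sketch), mirroring the count-based training in 4.9a:

```python
NB_classifier_tfidf = MultinomialNB()
NB_classifier_tfidf.fit(X_train_tfidf.toarray(), y_train) # Train on the TF-IDF features
```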

## 4.11 Exercise 4
Evaluate the results on the test set.
```python
preds_tfidf = # Enter your code here
# Enter your code below this line to display Precision, Recall and F1 score
```
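One possible solution (a sketch):

```python
preds_tfidf = NB_classifier_tfidf.predict(X_test_tfidf.toarray())
print(classification_report(y_test, preds_tfidf)) # Precision, recall and F1 per class
```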

## 4.12a Evaluating our model on random examples (texts from the internet)

We are given three texts containing different reviews. We will use our model to classify these texts.

The backslash `\` in the texts signifies that the text on the next line is a continuation of the previous line; we break the strings across lines only for readability. So we have just three texts altogether.
```python
citizen_info_ireland = '. The Government is chosen by and is collectively responsible to the Dáil. \
There must be a minimum of 7 and a maximum of 15 Ministers. \
The Taoiseach, the Tanaiste and the Minister for Finance must be members of the Dáil.\
It is possible to have 2 Ministers who are members of the Senate but this rarely happens.'

gone_girl_review = 'Audience Reviews for Gone Girl ... \
Mesmerizing performances, tense atmosphere, unexpected plot twists and turns \
of events, this movie is a real crime thriller!'

sherlock_bbc_review = 'Dr Watson, a former army doctor, finds himself sharing a flat with Sherlock Holmes, \
an eccentric individual with a knack for solving crimes. Together, they take on the most unusual cases.'
```
## 4.12b Exercise 5
Predict the labels for the above texts, using either of the models trained in the previous exercises.

First, let us convert the given texts into vector representations using CountVectorizer or TfidfVectorizer.
```python
# Using count vectors
# We reuse the count vectorizer fitted on X_train. No need to call fit() again
# We pass the three texts at once as a list to the transform() function

new_test_count = count_vector.transform([citizen_info_ireland, gone_girl_review, sherlock_bbc_review])
```
Since the three texts were vectorized with `CountVectorizer`, we will use the Naive Bayes classifier trained on the count features (`NB_classifier_counts`).
```python
# Enter your code below this line
```
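One possible solution (a sketch), predicting with the count-based classifier:

```python
print(NB_classifier_counts.predict(new_test_count.toarray()))
```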

The above prediction shows that our model predicted the three texts as: **'government', 'mystery', 'government'** respectively.

🤔 From manual inspection, is the model prediction accurate?
Let us check the probability of the predictions.
```python
NB_classifier_counts.predict_proba(new_test_count.toarray())
```
## Exercise 6 [OPTIONAL]: Random Forest Classifier

Random Forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
* Train on the train data
* Predict on test features
* Test the model on the three random texts
```python
from sklearn.ensemble import RandomForestClassifier

# Enter your code below this line. Save the model as random_forest_classifier

print(random_forest_classifier.predict(new_test_count.toarray()))
print(random_forest_classifier.predict_proba(new_test_count.toarray()))
```
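One possible solution (a sketch); `n_estimators` and `random_state` below are arbitrary choices. Once the model is fitted, the two `print()` calls above yield the predictions and probabilities for the three texts.

```python
# Train a random forest on the count features
random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest_classifier.fit(X_train_counts, y_train)

# Precision, recall and F1 on the held-out test set
print(classification_report(y_test, random_forest_classifier.predict(X_test_counts)))
```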
### The End
