A Simple TfidfVectorizer and CountVectorizer Recommendation System for Beginners

The Goal

Recommendation systems are widely used in many industries to suggest items to customers. For example, a radio station may use one to compile the top 100 songs of the month for its audience, or to find songs in the same genre as one a listener has requested. Following that pattern, we are going to build a recommendation system for anime data. It would be nice if anime fans could see an updated top 100 chart every time they walk into an anime store, or receive an email suggesting anime based on the genres they like.

We will apply two different recommendation models to the anime data: a simple recommendation system and a content-based recommendation system.

Overview

For the simple recommendation system, we need to calculate a weighted rating so that the same average score carries a different weight depending on how many votes it is based on. For example, an average rating of 9.0 from 10 people carries less weight than an average rating of 9.0 from 1,000 people. Once the weighted rating is calculated, we can produce a top chart of anime.

For the content-based recommendation system, we need to decide which features to use in the analysis. We will use sklearn to measure how similar anime are to one another based on their text features and to generate suggestions.

Data Overview

The anime data contains 12,294 anime, each described by 7 columns: anime_id, name, genre, type, episodes, rating, and members.

Implementation

1. Import Data

We need to import pandas, as this will let us load the data neatly into a DataFrame.

import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)

anime.info()

anime.describe()

We can see that the minimum rating is 1.67 and the maximum rating is 10. The smallest member count is 5 and the largest is 1,013,917.

anime_dup = anime[anime.duplicated()]
print(anime_dup)

There are no duplicated rows that need to be cleaned.

type_values = anime['type'].value_counts()
print(type_values)

Most anime are broadcast on TV, followed by OVA.

2. Simple Recommendation System

First, we need to know how the weighted rating (WR) is calculated:

WR = (v / (v + m)) × R + (m / (v + m)) × C

v is the number of votes for the anime; m is the minimum number of votes required to be listed in the chart; R is the average rating of the anime; C is the mean rating across the whole dataset. In this dataset, the members column stands in for the number of votes.
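As a rough illustration of the weighting, the sketch below applies the same formula to the two cases from the overview: a 9.0 average from 10 votes and a 9.0 average from 1,000 votes. The function name weighted_rating and the values chosen for m and C are made-up assumptions for this example, not figures from the anime data.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, chosen only for illustration
m = 100   # assumed minimum-votes threshold
C = 6.5   # assumed mean rating across all items

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)

print(round(few_votes, 2))   # 6.73
print(round(many_votes, 2))  # 8.77
```

With few votes the score is pulled toward the global mean C, while with many votes it stays close to the item's own rating R.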

Next, we need to decide which data will go into this calculation.

m = anime['members'].quantile(0.75)
print(m)

Based on this result, we will only use anime with more than 9,437 members to build the recommendation system.
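To see how a quantile threshold filters rows, here is a minimal sketch on a toy DataFrame. The column values and the names toy, m_toy and qualified_toy are invented for illustration and are not taken from the anime file.

```python
import pandas as pd

# Toy data, not the real anime file
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'members': [5, 100, 2000, 9000, 50000],
})

# 75th percentile of the member counts (pandas interpolates linearly by default)
m_toy = toy['members'].quantile(0.75)
# Keep only rows strictly above the threshold
qualified_toy = toy.loc[toy['members'] > m_toy]

print(m_toy)                            # 9000.0
print(qualified_toy['name'].tolist())   # ['E']
```

Because the comparison is strict, a row sitting exactly on the threshold is excluded.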

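qualified_anime = anime.loc[anime['members'] > m].copy()
C = anime['rating'].mean()

def WR(x, C=C, m=m):
    # v: member count (used as the vote count), R: average rating of the anime
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)

qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)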

This is the list of the top 15 anime based on the weighted rating.

3. Genre-Based Recommendation System

For the genre-based recommendation, we will use the sklearn package to help us analyse the genre text. We need to compute how similar the genres of different anime are. The two methods we are going to use are TfidfVectorizer and CountVectorizer.

TfidfVectorizer weights each word's frequency by how rare the word is across all documents, so words that appear in many documents count for less. CountVectorizer is simpler: it only counts how many times each word occurs.
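The difference can be seen on a tiny example. The sketch below fits both vectorizers, with default settings, on three invented genre strings (toy_genres is a made-up list, not the anime data) and compares the weights given to a common word and a rarer one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented genre strings for illustration
toy_genres = ['Action, Comedy', 'Action, Drama', 'Romance']

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(toy_genres).toarray()
tfidf = tfidf_vec.fit_transform(toy_genres).toarray()

# Column positions of the two words in each matrix
c_action, c_comedy = count_vec.vocabulary_['action'], count_vec.vocabulary_['comedy']
t_action, t_comedy = tfidf_vec.vocabulary_['action'], tfidf_vec.vocabulary_['comedy']

# Raw counts treat both words in the first row equally
print(counts[0, c_action], counts[0, c_comedy])
# TF-IDF gives the rarer word ("comedy") more weight than the common one ("action")
print(tfidf[0, t_action] < tfidf[0, t_comedy])
```

Here "action" appears in two of the three strings and "comedy" in only one, so TF-IDF down-weights "action" while the raw counts stay equal.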

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
# Replace missing genres with empty strings so the vectorizer can process them
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape

The shape of the matrix shows 46 distinct terms across the 12,294 anime.
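Note that these terms are word tokens rather than whole genre labels. As a small sketch of what the vectorizer does with multi-word labels, the example below fits the same settings on one invented genre string (toy_genres is an assumption for illustration) and lists the resulting vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One invented genre string containing two multi-word labels
toy_genres = ['Sci-Fi, Slice of Life']

vec = TfidfVectorizer(lowercase=True, stop_words='english')
vec.fit(toy_genres)

# vocabulary_ maps each token to its column; sorting the keys lists the tokens
terms = sorted(vec.vocabulary_)
print(terms)  # ['fi', 'life', 'sci', 'slice']
```

The default tokenizer splits on punctuation and spaces, and the English stop word list removes "of", so two labels become four separate terms.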

from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
# Map each anime name to its row position; keep the first entry for repeated names
indices = pd.Series(anime.index, index=anime['name'])
indices = indices[~indices.index.duplicated()]

def recommendations(name, cosine_sim=cosine_sim):
    # Pair every row position with its similarity to the requested title
    similarity_scores = list(enumerate(cosine_sim[indices[name]]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the title itself) and keep the next 20
    similarity_scores = similarity_scores[1:21]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]

recommendations('Kimi no Na wa.')

Based on the TF-IDF calculation, these are the top 20 anime most similar to "Kimi no Na wa."

Next, we are going to look at another model, CountVectorizer(), and compare the results of cosine_similarity and linear_kernel.
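The two functions do not always agree. TfidfVectorizer L2-normalizes each row by default, so a plain dot product (linear_kernel) already equals the cosine similarity, whereas raw counts are not normalized. The sketch below checks this on three invented genre strings (toy_genres is made up for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented genre strings for illustration
toy_genres = ['Action, Comedy', 'Action, Drama, Romance', 'Comedy']

tfidf = TfidfVectorizer().fit_transform(toy_genres)
counts = CountVectorizer().fit_transform(toy_genres)

# On L2-normalized TF-IDF rows, the dot product and the cosine coincide
same_on_tfidf = np.allclose(linear_kernel(tfidf, tfidf),
                            cosine_similarity(tfidf, tfidf))
# On raw counts, the dot product grows with the number of genre words
same_on_counts = np.allclose(linear_kernel(counts, counts),
                             cosine_similarity(counts, counts))

print(same_on_tfidf)   # True
print(same_on_counts)  # False
```

With a count matrix, linear_kernel therefore tends to favour anime that list more genre words, while cosine_similarity corrects for that.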

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])
# Cosine similarity on raw genre counts
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)

# Unnormalized dot product on the same counts, for comparison
cosine_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)

Summary

In this article, we looked at the anime data and built two types of recommendation systems. The simple recommendation system gives us a top chart of anime, produced by applying the weighted rating calculation to the ratings and member counts. We then built a recommendation system based on each anime's genre, applying both TfidfVectorizer and CountVectorizer to see how their recommendations differ.

I hope you enjoyed this article!

1. https://www.datacamp.com/community/tutorials/recommender-systems-python

Translated from: https://medium.com/analytics-vidhya/recommendation-system-for-anime-data-784c78952ba5
