Recommendation System for Anime Data

A simple recommendation system, plus TfidfVectorizer and CountVectorizer recommendation systems, for beginners.

The Goal

Recommendation systems are widely used across industries to suggest items to customers. For example, a radio station might use one to compile the top 100 songs of the month for its audience, or to find songs in the same genre as a track a listener has requested. Following that pattern, we are going to build a recommendation system for anime data. It would be nice if anime fans could see an updated top 100 chart every time they walk into an anime store, or receive an email suggesting anime in the genres they like.

We will apply two different models to the anime data, a simple recommendation system and a content-based recommendation system, to analyse the data and generate recommendations.

Overview

For the simple recommendation system, we calculate a weighted rating so that the same score backed by different numbers of votes does not carry the same weight. For example, an average rating of 9.0 from 10 people carries less weight than an average rating of 9.0 from 1,000 people. Once the weighted rating is calculated, we can produce a top chart of anime.

For the content-based recommendation system, we first need to decide which features to analyse. We then use sklearn to measure the similarity between the text of those features and generate anime suggestions.

Data Overview

The dataset contains 12,294 anime and 7 columns: anime_id, name, genre, type, episodes, rating, and members.

Implementation

1. Import Data

We need to import pandas, as this will let us load the data neatly into a DataFrame.

import pandas as pd
anime = pd.read_csv('…/anime.csv')
anime.head(5)
anime.info()
anime.describe()

We can see that the minimum rating is 1.67 and the maximum rating is 10. The smallest member count is 5 and the largest is 1,013,917.

anime_dup = anime[anime.duplicated()]
print(anime_dup)

There are no duplicate rows that need to be cleaned.

type_values = anime['type'].value_counts()
print(type_values)

Most anime are broadcast on TV, followed by OVA releases.

2. Simple Recommendation System

First, we need to know how the weighted rating (WR) is calculated.

WR = (v / (v + m)) * R + (m / (v + m)) * C

v is the number of votes for the anime; m is the minimum number of votes required to be listed in the chart; R is the average rating of the anime; C is the mean rating across the whole dataset. In the code that follows, the members column stands in for the number of votes.
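To make the effect of the formula concrete, here is a minimal sketch that applies it to the 9.0 example from the overview. The values of m and C are hypothetical and chosen only for illustration; they are not taken from the anime dataset, and the function name weighted_rating is an assumption.

```python
def weighted_rating(v, R, m, C):
    """Blend an item's average rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m = 100   # assumed minimum-vote threshold
C = 6.5   # assumed mean rating across all titles

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=1000, R=9.0, m=m, C=C)
print(round(few_votes, 2), round(many_votes, 2))  # 6.73 8.77
```

With few votes the score is pulled towards the global mean C; with many votes it stays close to the item's own average R.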

Next, we need to decide which data to use in this calculation.

m = anime['members'].quantile(0.75)
print(m)

Based on this result, we will build the recommendation system from the anime that have more than 9,437 members, which is the 75th percentile of the members column.
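As a rough illustration of how a quantile threshold filters the data, the sketch below uses a made-up five-row DataFrame. The titles and member counts are hypothetical, as are the names toy, m_toy and qualified.

```python
import pandas as pd

# Toy data: titles and member counts are made up for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'members': [10, 200, 3000, 40000, 500000],
})

m_toy = toy['members'].quantile(0.75)          # 75th percentile of member counts
qualified = toy.loc[toy['members'] > m_toy]    # keep only rows strictly above it

print(m_toy)                        # 40000.0
print(qualified['name'].tolist())   # ['E']
```

Because the comparison is strict (>), a row sitting exactly on the threshold is excluded.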

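# Keep only anime above the member threshold; copy so we can add a column safely
qualified_anime = anime.loc[anime['members'] > m].copy()
C = anime['rating'].mean()

def WR(x, C=C, m=m):
    v = x['members']
    R = x['rating']
    return (v / (v + m) * R) + (m / (v + m) * C)

qualified_anime['score'] = WR(qualified_anime)
# sort_values returns a new DataFrame, so assign the result back
qualified_anime = qualified_anime.sort_values('score', ascending=False)
qualified_anime.head(15)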

This is the list of the top 15 anime based on the weighted rating calculation.

3. Genre-Based Recommendation System

For the genre-based recommendation, we will use the sklearn package to help us analyse the genre text. We need to compute the similarity between the genres of different anime. The two methods we are going to use are TfidfVectorizer and CountVectorizer.

TfidfVectorizer weights each word's frequency in a document by how common the word is across all documents, so words that appear everywhere count for less. CountVectorizer is simpler: it only counts how many times each word occurs.
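The difference between the two vectorizers can be seen on a tiny example. The sketch below uses three made-up genre strings (not rows from the dataset) in which "action" appears in every document and "comedy" in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up genre strings for illustration
genres = ['Action, Comedy', 'Action, Drama', 'Action, Romance']

count_vec = CountVectorizer()
counts = count_vec.fit_transform(genres).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(genres).toarray()

# Column positions of the two words in each vocabulary
c_action, c_comedy = count_vec.vocabulary_['action'], count_vec.vocabulary_['comedy']
t_action, t_comedy = tfidf_vec.vocabulary_['action'], tfidf_vec.vocabulary_['comedy']

# Raw counts treat both words of the first document the same
print(counts[0, c_action], counts[0, c_comedy])
# TF-IDF gives the word shared by every document a lower weight
print(weights[0, t_action] < weights[0, t_comedy])
```

In the first document both words occur once, so their counts are equal, while TF-IDF down-weights "action" because it appears in all three documents.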

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(lowercase=True, stop_words='english')
# Replace missing genres with an empty string so the vectorizer can process them
anime['genre'] = anime['genre'].fillna('')
tf_idf_matrix = tf_idf.fit_transform(anime['genre'])
tf_idf_matrix.shape

We can see that the genre text of the 12,294 anime contains 46 distinct words.
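Note that the vocabulary counts tokens, not genre labels. The following sketch, which uses made-up genre strings and assumes scikit-learn's default token pattern, shows how a hyphenated or multi-word genre is split into several words and how English stop words are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up genre strings for illustration
sample = ['Sci-Fi, Slice of Life', 'Comedy, Sci-Fi']

vec = TfidfVectorizer(lowercase=True, stop_words='english')
vec.fit(sample)

# vocabulary_ maps each learned token to its column index
tokens = sorted(vec.vocabulary_)
print(tokens)  # ['comedy', 'fi', 'life', 'sci', 'slice']
```

Here "Sci-Fi" becomes the two tokens "sci" and "fi", and "of" is removed as a stop word, so three genre labels yield five vocabulary entries.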

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
# Map each anime name to its row position
indices = pd.Series(anime.index, index=anime['name'])
# Keep the first entry for any repeated name so a lookup returns a single position
indices = indices[~indices.index.duplicated(keep='first')]

def recommendations(name, cosine_sim=cosine_sim):
    similarity_scores = list(enumerate(cosine_sim[indices[name]]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the anime itself) and keep the next 20
    similarity_scores = similarity_scores[1:21]
    anime_indices = [i[0] for i in similarity_scores]
    return anime['name'].iloc[anime_indices]

recommendations('Kimi no Na wa.')

Based on the TF-IDF calculation, these are the top 20 anime recommendations most similar to Kimi no Na wa.

Next, we look at another model, CountVectorizer(), and compare the results obtained with cosine_similarity and with linear_kernel.
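One reason the two functions can give different rankings on count data: linear_kernel computes a plain dot product, while cosine_similarity first normalizes each row to unit length. TfidfVectorizer L2-normalizes its rows by default, so the two agree there, but CountVectorizer does not normalize. A minimal sketch on made-up genre strings:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up genre strings for illustration
genres = ['Action, Comedy', 'Action, Comedy, Drama, Romance', 'Drama']

tfidf = TfidfVectorizer().fit_transform(genres)   # rows are L2-normalized by default
counts = CountVectorizer().fit_transform(genres)  # raw, unnormalized counts

same_on_tfidf = np.allclose(linear_kernel(tfidf, tfidf),
                            cosine_similarity(tfidf, tfidf))
same_on_counts = np.allclose(linear_kernel(counts, counts),
                             cosine_similarity(counts, counts))
print(same_on_tfidf, same_on_counts)  # True False
```

On raw counts the dot product favours documents with more genre words, which is why the two similarity measures can recommend different titles.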

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(anime['genre'])

# Recommendations using cosine similarity on the count matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)

# Recommendations using the linear kernel (dot product) on the same matrix
cosine_sim2 = linear_kernel(count_matrix, count_matrix)
recommendations('Kimi no Na wa.', cosine_sim2)

Summary

In this article, we looked at the anime data and built two types of recommendation system. The simple recommendation system gives us a top chart of anime, computed with a weighted rating based on each anime's rating and number of members. We then built a recommendation system based on each anime's genre, applying both TfidfVectorizer and CountVectorizer to see how their recommendations differ.

I hope you enjoyed this article!

1. https://www.datacamp.com/community/tutorials/recommender-systems-python


Translated from: https://medium.com/analytics-vidhya/recommendation-system-for-anime-data-784c78952ba5
