Learn Web Scraping in 15 Minutes

What is Web Scraping?

Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. The collected information is then exported into a format that is more useful for the user, such as a spreadsheet or an API. Although web scraping can be done manually, automated tools are preferred in most cases because they are less costly and work faster.

Is Web Scraping Legal?

The simplest way is to check the robots.txt file of the website, which lives at the domain root: append “/robots.txt” to the base URL of the site you want to scrape. If the rules under ‘User-agent: *’ disallow the pages you are after, then you’re not allowed to scrape them. For this article, I am scraping the Flipkart website, so the file to check is at www.flipkart.com/robots.txt.
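
You can automate this check with urllib.robotparser from Python's standard library. A minimal sketch (the search URL is the one we scrape later in this article):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.flipkart.com/robots.txt")
rp.read()  # fetch and parse the robots.txt rules
print(rp.can_fetch("*", "https://www.flipkart.com/search?q=laptops"))  # True if allowed for all user agents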

Libraries used for Web Scraping

BeautifulSoup: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Pandas: Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.

Why BeautifulSoup?

It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages. For more details, refer to the BeautifulSoup documentation.
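
As a taste of those filters, find_all accepts tag names, attribute filters, and a limit on matches. A small self-contained sketch (the toy HTML below is illustrative, not taken from Flipkart):

from bs4 import BeautifulSoup

html = "<ul><li class='a'>one</li><li class='b'>two</li><li class='a'>three</li></ul>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('li', class_='a'))           # filter by class name
print(soup.find_all('li', limit=2))              # cap the number of matches
print([li.text for li in soup.find_all('li')])   # keep only the text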

Scraping the Flipkart Website

from bs4 import BeautifulSoup 
import requests
import csv
import pandas as pd

First, we import BeautifulSoup and requests, the two libraries that do the heavy lifting for web scraping, along with pandas for assembling the results. (The csv import is not actually used below.)

requests: requests is one of the packages that made Python pleasant for HTTP work. Under the hood it builds on urllib3 and offers a much friendlier API than the standard library’s urllib modules.

req = requests.get("https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1")  # URL of the website which you want to scrape
content = req.content # Get the content

To get the contents of the specified URL, we submit a request using the requests library. The URL above is a Flipkart search results page for laptops.
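
Before parsing, it is worth confirming that the request actually succeeded. A minimal check with the requests API:

req.raise_for_status()  # raises an HTTPError if the response is a 4xx/5xx
print(req.status_code)  # 200 means the page was fetched successfully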

[Screenshot: the Flipkart search results page for laptops]

This is the Flipkart page listing different laptops; it contains the details of 24 laptops. Looking at it, we try to extract the different features of each laptop: the description (model name along with the specifications), Processor (Intel/AMD, i3/i5/i7/Ryzen 3/Ryzen 5/Ryzen 7), RAM (4/8/16 GB), Operating System (Windows/Mac), Disk Drive Storage (SSD/HDD, 256 GB/512 GB/1 TB), Display (13.3/14/15.6 inches), Warranty (Onsite/Limited Hardware/International), Rating (4.1–5), and Price (Rupees).

soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://rukminim1.flixcart.com" rel="dns-prefetch"/>
<link href="https://img1a.flixcart.com" rel="dns-prefetch"/>
<link href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.21be2e.css" rel="stylesheet"/>
<link as="image" href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fk-logo_9fddff.png" rel="preload"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="102988293558" property="fb:page_id"/>
<meta content="658873552,624500995,100000233612389" property="fb:admins"/>
<meta content="noodp" name="robots"/>
<link href="https://img1a.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon">
....
....
</script>
<script async="" defer="" id="omni_script" nonce="7596241618870897262" src="//img1a.flixcart.com/www/linchpin/batman-returns/omni/omni16.js">
</script>
</body>
</html>

Here we pass in the content variable along with the parser to use, in this case Python’s built-in HTML parser. soup is now a BeautifulSoup object holding our parsed HTML, and soup.prettify() prints the entire markup of the webpage in indented form.
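
Beyond prettify(), the soup object lets you navigate the parse tree directly. A small sketch using tags visible in the output above:

print(soup.title)       # the page's <title> tag, if present
print(soup.title.text)  # just the title text
print(soup.find('meta', attrs={'name': 'robots'}))  # the robots meta tag from the <head>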

Extracting the Descriptions

[Screenshot: the browser inspector highlighting a description div with class '_3wU53n']

When you click on the “Inspect” tab, you will see a “Browser Inspector Box” open. We observe that the class name of the descriptions is ‘_3wU53n’, so we use the find_all method to extract the descriptions of the laptops.

desc = soup.find_all('div', class_='_3wU53n')
desc

[<div class="_3wU53n">HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop</div>,
<div class="_3wU53n">HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop</div>,
<div class="_3wU53n">Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop</div>,
<div class="_3wU53n">Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...</div>,
....
....
<div class="_3wU53n">MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...</div>,
<div class="_3wU53n">Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop</div>]

We extract the descriptions using find_all, grabbing every div tag whose class name is ‘_3wU53n’; this returns all the matching div tags. Because class is a reserved keyword in Python, BeautifulSoup expects the class_ keyword argument instead.
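
The same query can also be spelled with the attrs dictionary; the two calls below are equivalent:

desc = soup.find_all('div', class_='_3wU53n')
desc = soup.find_all('div', attrs={'class': '_3wU53n'})  # equivalent form without the class_ keyword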

descriptions = []  # Create a list to store the descriptions
for i in range(len(desc)):
    descriptions.append(desc[i].text)

len(descriptions)
24  # Number of laptops

descriptions
['HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop',
'HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop',
'Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop',
'Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...',
....
....
'MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...',
'Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop']

Create an empty list to store the descriptions of all the laptops. Now iterate through all the tags and use the .text method to extract only the text content from each tag, appending it to the descriptions list in every iteration. (We could also reach child tags with dot access.) After iterating through all the tags, the descriptions list holds the text for every laptop: the model name along with its specifications.
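
The same loop can be written more compactly as a list comprehension, which is the idiomatic Python form:

descriptions = [tag.text for tag in desc]  # one text entry per laptop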

Similarly, we apply the same approach to extract all the other features.

Extracting the specifications

[Screenshot: the browser inspector showing the specification 'li' tags sharing class 'tVe95H']

We observe that the various specifications are under the same div and the class names are the same for all those 5 features (Processor, RAM, Disk Drive, Display, Warranty).

All the features sit inside ‘li’ tags, and the class name, ‘tVe95H’, is the same for every one of them, so we need a small trick to separate the distinct features.

# Grab all the specification tags; they share the class name 'tVe95H'
commonclass = soup.find_all('li', class_='tVe95H')

# Create empty lists for the features
processors = []
ram = []
os = []        # note: this list shadows Python's os module name
storage = []
inches = []
warranty = []

for i in range(0, len(commonclass)):
    p = commonclass[i].text  # Extracting the text from the tags
    if "Core" in p:
        processors.append(p)
    elif "RAM" in p:
        # If "RAM" is present in the text, append it to the ram list;
        # the same pattern sorts each remaining feature into its own list
        ram.append(p)
    elif "HDD" in p or "SSD" in p:
        storage.append(p)
    elif "Operating" in p:
        os.append(p)
    elif "Display" in p:
        inches.append(p)
    elif "Warranty" in p:
        warranty.append(p)

The .text method extracts the text from each tag, giving us the values for Processor, RAM, Disk Drive, Display, and Warranty, with each keyword test routing a value into the right list.

print(len(processors))
print(len(warranty))
print(len(os))
print(len(ram))
print(len(inches))

24
24
24
24
24

Extracting the price

price = soup.find_all('div', class_='_1vC4OE _2rQ-NK')
# Extracting the price of each laptop from the website
prices = []
for i in range(len(price)):
    prices.append(price[i].text)

len(prices)
24

prices
['₹52,990',
'₹34,990',
'₹29,990',
'₹56,990',
'₹54,990',
....
....
'₹78,990',
'₹1,59,990',
'₹52,990']

In the same manner, we extract the price of each laptop and add all the prices to the prices list.

rating = soup.find_all('div', class_='hGSR34')
# Extracting the ratings of each laptop from the website
ratings = []
for i in range(len(rating)):
    ratings.append(rating[i].text)

len(ratings)
37

ratings
['4.4',
'4.5',
'4.4',
'4.4',
'4.2',
'4.5',
'4.4',
'4.5',
'4.4',
'4.2',
....
....]

Here we are getting the length of the ratings to be 37. But what’s the reason behind it?

We observe that the rating elements of the recommended laptops carry the same class name as those of the featured laptops, so find_all picks up the ratings of the recommended laptops as well. That inflates the count: it should be 24, but now it's 37!
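
One way around this is to scope the rating lookup to each product card instead of searching the whole page. In the sketch below, the card class name '_3O0U0u' is a placeholder you would need to confirm in the inspector:

cards = soup.find_all('div', class_='_3O0U0u')   # hypothetical product-card class; verify on the live page
ratings = []
for card in cards:
    r = card.find('div', class_='hGSR34')        # look for the rating inside this card only
    ratings.append(r.text if r else None)        # None when a laptop has no rating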

Last but not least, we merge all the features into a single data frame and store the data in the required format!

d = {'Description': descriptions, 'Processor': processors, 'RAM': ram, 'Operating System': os, 'Storage': storage, 'Display': inches, 'Warranty': warranty, 'Price': prices}
dataset = pd.DataFrame(data=d)

The final dataset

Saving the dataset to a CSV file

dataset.to_csv('laptops.csv')

Now we get the whole dataset into a CSV file.
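
Note that to_csv also writes the DataFrame's row index as an extra unnamed column by default, which is why the reloaded file below shows 9 columns for 8 features. If you only want the feature columns, pass index=False:

dataset.to_csv('laptops.csv', index=False)  # skip the row-index column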

To verify it again, we read the downloaded CSV file in Jupyter Notebook.

df = pd.read_csv('laptops.csv')
df.shape

(24, 9)

As this is a dynamic website, the content keeps on changing!
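
Because the listings change between runs, one way to keep your analysis reproducible is to save the raw HTML you fetched and re-parse that snapshot later; a minimal sketch:

with open('flipkart_laptops.html', 'w', encoding='utf-8') as f:
    f.write(req.text)  # raw HTML snapshot; re-parse later with BeautifulSoup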

You can always refer to my GitHub Repository for the entire code.

Connect with me on LinkedIn here

“For every $20 you spend on web analytics tools, you should spend $80 on the brains to make sense of the data.” — Jeff Sauer

I hope you found the article insightful. I would love to hear feedback so I can improve it and come back with better content.

Thank you so much for reading!

Translated from: https://towardsdatascience.com/learn-web-scraping-in-15-minutes-27e5ebb1c28e
