Scraping JD and Tencent (QQ) job postings with Python and Scrapy

1. The settings.py file

# -*- coding: utf-8 -*-

# Scrapy settings for jd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jd'

SPIDER_MODULES = ['jd.spiders']
NEWSPIDER_MODULE = 'jd.spiders'

LOG_LEVEL = "WARNING"
LOG_FILE = "./jingdong1.log"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jd (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jd.middlewares.JdSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jd.middlewares.JdDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jd.pipelines.JdPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
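With LOG_LEVEL set to WARNING and LOG_FILE pointing at ./jingdong1.log, everything the spider records via logging.warning() ends up in that file instead of on the console. The project is normally run with `scrapy crawl jingdong` from the project root; as a minimal sketch, the same run can be started programmatically (CrawlerProcess and get_project_settings are part of Scrapy's public API):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run the 'jingdong' spider (defined below) with the settings above;
# equivalent to `scrapy crawl jingdong` from the project root.
process = CrawlerProcess(get_project_settings())
process.crawl("jingdong")
process.start()  # blocks until the crawl finishes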

2. The jingdong.py file

# -*- coding: utf-8 -*-
import scrapy
import logging
import json

logger = logging.getLogger(__name__)


class JingdongSpider(scrapy.Spider):
    name = 'jingdong'
    allowed_domains = ['zhaopin.jd.com']
    start_urls = ['http://zhaopin.jd.com/web/job/job_list?page=1']
    pageNum = 1

    def parse(self, response):
        content = response.body.decode()
        content = json.loads(content)
        ########## remove empty values from the dicts in the list ##########
        for i in range(len(content)):
            # list(content[i].keys()) snapshots the keys of the current dict
            for key in list(content[i].keys()):  # content[i] is a dict
                if not content[i].get(key):  # content[i].get(key) looks up the value
                    del content[i][key]  # drop the key whose value is empty
        # log every job record
        for i in range(len(content)):
            logging.warning(content[i])
        # pagination
        self.pageNum = self.pageNum + 1
        if self.pageNum <= 355:
            next_url = "http://zhaopin.jd.com/web/job/job_list?page=" + str(self.pageNum)
            yield scrapy.Request(next_url, callback=self.parse)
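The nested loop that strips empty values works, but deleting keys while iterating over a snapshot of them is easy to get wrong. As a sketch, the same cleanup can be written as a dict comprehension (the helper name is mine, not from the original code):

def drop_empty_values(record):
    """Return a copy of `record` without keys whose values are falsy."""
    return {key: value for key, value in record.items() if value}

# Example: drop_empty_values({"jobName": "engineer", "salary": ""})
# -> {"jobName": "engineer"}

Inside parse() this would become content = [drop_empty_values(c) for c in content].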

3. Note: JD's job-listing pagination is driven by JavaScript, so a CrawlSpider cannot follow the page links automatically. Instead, open the browser's Network panel and find the request the page itself uses to fetch its data.

For example: http://zhaopin.jd.com/web/job/job_list?page=2
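Before wiring a URL like this into a spider, it is worth confirming that the endpoint really returns plain JSON (the spider above treats the body as a JSON list of job dicts). A minimal check outside Scrapy, assuming the endpoint is still live and the requests library is installed:

import requests

url = "http://zhaopin.jd.com/web/job/job_list?page=2"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
jobs = resp.json()  # the body is pure JSON, no HTML to parse
print(type(jobs), len(jobs))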

 

############# JD works, so now let's try Tencent's job postings ###############

Give it a test run!

You can guess the result. Time to get to work!

1. settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

LOG_LEVEL = "WARNING"
LOG_FILE = "./qq.log"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'tencent.pipelines.TencentPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

2. mahuateng.py

# -*- coding: utf-8 -*-
import scrapy
import json
import logging


class MahuatengSpider(scrapy.Spider):
    name = 'mahuateng'
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    pageNum = 1

    def parse(self, response):
        content = response.body.decode()
        content = json.loads(content)
        content = content['Data']['Posts']
        # drop empty values from each post dict
        for con in content:
            # print(con)
            for key in list(con.keys()):
                if not con.get(key):
                    del con[key]
        # log every job posting
        for con in content:
            logging.warning(con)
        ##### pagination ######
        self.pageNum = self.pageNum + 1
        if self.pageNum <= 118:
            next_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1561688387174&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40003&attrId=&keyword=&pageIndex=" + str(self.pageNum) + "&pageSize=10&language=zh-cn&area=cn"
            yield scrapy.Request(next_url, callback=self.parse)
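Both spiders only log the cleaned records. To persist them instead, the commented-out ITEM_PIPELINES block in settings.py can be enabled and pointed at a pipeline like the minimal sketch below. This assumes the spider is changed to `yield con` for each cleaned dict rather than only logging it; the JSON-lines filename is my choice, not from the original post.

# pipelines.py -- a minimal sketch, not from the original post.
import json

class TencentPipeline:
    def open_spider(self, spider):
        self.file = open("tencent_jobs.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write each job posting as one JSON line, keeping Chinese text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

To activate it, uncomment ITEM_PIPELINES = {'tencent.pipelines.TencentPipeline': 300} in settings.py.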

It worked in my own testing; your mileage may vary, haha!

This is all just personal tinkering, and the code is admittedly a bit rough.

Reposted from: https://www.cnblogs.com/ywjfx/p/11101091.html

