0. Contents
1. Analyzing the page
2. Initial code
3. Complete code
4. Summary
5. Addendum
1. Analyzing the page
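Last time we used XPath to fetch the posters of the movies currently showing on Douban; this time we look at how to do the same with BeautifulSoup. Previous article: 启程, "Python crawler: fetching the posters of Douban's now-showing movies with XPath" (zhuanlan.zhihu.com)
First, open Douban's now-playing page, right-click, and view the page source. The poster URL and the movie name we need both sit on the img tag, as its src and alt attributes. As we learned last time, the search also has to be restricted to <div id="nowplaying">, otherwise img tags from other parts of the page are picked up as well.
[Screenshot: the page source, opened via right-click]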
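To see the scoping step in isolation, here is a minimal sketch that runs against a small hand-written HTML string instead of the live page. The markup, the example.com URLs, and the helper name `extract_posters` are assumptions made up for illustration; Douban's real page is larger and may differ. The sketch uses Python's built-in `html.parser` so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul>
    <li><img src="https://example.com/a.jpg" alt="Movie A"></li>
    <li><img src="https://example.com/b.jpg" alt="Movie B"></li>
  </ul>
</div>
<div id="upcoming">
  <ul>
    <li><img src="https://example.com/c.jpg" alt="Movie C"></li>
  </ul>
</div>
"""


def extract_posters(html):
    """Return (name, url) pairs for the img tags inside div#nowplaying."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", {"id": "nowplaying"})
    if section is None:
        # The container is missing, so there is nothing to scope to.
        return []
    return [(img.get("alt"), img.get("src")) for img in section.find_all("img")]


if __name__ == "__main__":
    for name, url in extract_posters(SAMPLE_HTML):
        print(name, url)
```

Because the search starts from the `nowplaying` container, the image inside the second div is never visited.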
2. Initial code
# encoding: utf-8
from bs4 import BeautifulSoup
import requests


def get_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        'Referer': 'https://movie.douban.com/',
    }
    # headers must be passed by keyword; the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    return response.text


def get_img(url):
    text = get_page(url)
    # Create the BeautifulSoup object
    soup = BeautifulSoup(text, 'lxml')
    # Restrict the search to the "now playing" section
    new = soup.find('div', {'id': "nowplaying"})
    # Find the img tags
    trs = new.find_all('img')
    for tr in trs:
        # Read the src and alt attributes of each img tag
        url_img = tr.attrs['src']
        name = tr.attrs['alt']
        print(name)
        print(url_img)


def main():
    url = 'https://movie.douban.com/cinema/nowplaying/guangzhou/'
    get_img(url)


if __name__ == '__main__':
    main()
[Screenshot: part of the program output]
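3. Complete code
The output shows that we now have the data we want. The next step is to download each poster from its URL and use the movie name as the file name. Before running the download, create a folder named images in the directory the script is run from (the relative path images/ is resolved against the current working directory).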
# encoding: utf-8
from bs4 import BeautifulSoup
from urllib import request
import requests


def get_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        'Referer': 'https://movie.douban.com/',
    }
    # headers must be passed by keyword; the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    return response.text


def get_img(url):
    text = get_page(url)
    # Create the BeautifulSoup object
    soup = BeautifulSoup(text, 'lxml')
    # Restrict the search to the "now playing" section
    new = soup.find('div', {'id': "nowplaying"})
    # Find the img tags
    trs = new.find_all('img')
    fns_num = 1
    num = len(trs)
    for tr in trs:
        # Read the src and alt attributes of each img tag
        url_img = tr.attrs['src']
        name = tr.attrs['alt']
        # Download the poster, using the movie name as the file name
        request.urlretrieve(url_img, 'images/' + name + '.jpg')
        # Show the download progress on a single, continuously updated line
        print("\r完成进度: {:.2f}%".format(fns_num * 100 / num), end="")
        fns_num += 1


def main():
    url = 'https://movie.douban.com/cinema/nowplaying/guangzhou/'
    get_img(url)


if __name__ == '__main__':
    main()
[Screenshot: part of the program output]
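The complete code above expects the images folder to exist already and uses the movie name directly as the file name. As a possible refinement, the sketch below creates the folder on demand and replaces characters that are not allowed in file names on common systems. The helper names `safe_filename` and `poster_path` are hypothetical and not part of the original script, and the set of replaced characters is an assumption based on Windows file-name restrictions.

```python
import os
import re


def safe_filename(name, default="untitled"):
    """Replace characters that file systems commonly reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    # Fall back to a default if nothing usable is left.
    return cleaned or default


def poster_path(name, folder="images"):
    """Create the target folder if needed and return the .jpg path for a movie."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe_filename(name) + ".jpg")
```

With these helpers, the download line in the loop could be written as `request.urlretrieve(url_img, poster_path(name))`, which removes the manual step of creating the folder first.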
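4. Summary
Next time we will practice further with regular expressions and compare XPath, BeautifulSoup, and regular expressions.
If you also want the rating, you can get it like this: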
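new = soup.find('div', {'id': "nowplaying"})
# Keep only the li tags that carry a data-score attribute, so the lookup below cannot raise KeyError
trs = new.find_all('li', attrs={'data-score': True})
for tr in trs:
    score = tr.attrs['data-score']
    print(score)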
5. Addendum
When we run into markup such as
<span class="title">肖申克的救赎</span>
XPath can extract it like this:
title = tr.xpath(".//span[@class='title']/text()")[0]
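BeautifulSoup can extract it like this:
# find() returns a Tag object; get_text() yields the string, matching the XPath text() result
title = soup.find('span', {'class': "title"}).get_text()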
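The BeautifulSoup lookup above can be tried on a one-line sample, as in the small sketch below. The surrounding div and its class name are assumptions added only to make the snippet self-contained. Note that `find()` returns a Tag object, so `get_text()` is called to obtain the string; the CSS-selector form with `select_one` is shown as an alternative spelling of the same lookup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample of the markup discussed above.
html = '<div class="hd"><span class="title">肖申克的救赎</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns a Tag object, not a string.
tag = soup.find("span", {"class": "title"})
# get_text() extracts the text inside the tag.
title = tag.get_text()
# The same lookup written as a CSS selector.
same_title = soup.select_one("span.title").get_text()

print(title)
```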