Web Scraping: Simple Static and Dynamic Pages

2025/11/15 18:49:37 / Source: https://www.cnblogs.com/xxxlR/p/19226043

Two scraping experiments

Reflections

Scraping experiments

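Experiment 1: Static pages

What to show?

Experiment 1: overall workflow

Step 1: Install the three scraping essentials from the terminal

Step 2: Create the file. Since VS Code is already installed, I open the file with `code` and run it with `python`. Guess what it scrapes?

Step 3: Scraping in progress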

```python
import requests
from bs4 import BeautifulSoup
import json
import time


def get_page(url):
    """Fetcher: download the page content."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        # A timeout keeps the script from hanging forever on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        response.encoding = 'utf-8'  # make sure Chinese text is decoded correctly
        if response.status_code == 200:
            return response.text
        else:
            print(f'Request failed, status code: {response.status_code}')
            return None
    except Exception as e:
        print(f'Request error: {e}')
        return None


def parse_page(html):
    """Parser: extract the movie information."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []

    # Find every movie entry
    movie_items = soup.find_all('div', class_='item')

    for item in movie_items:
        try:
            # Rank
            rank = item.find('em').get_text()

            # Title
            title_div = item.find('div', class_='info').find('div', class_='hd')
            title_link = title_div.find('a')
            title = title_link.find('span', class_='title').get_text()

            # Link to the movie page
            movie_url = title_link['href']

            # Rating
            rating_num = item.find('span', class_='rating_num').get_text()

            # Quote (may be missing)
            quote_span = item.find('span', class_='inq')
            quote = quote_span.get_text() if quote_span else "No quote"

            movie_info = {
                'rank': rank,
                'title': title,
                'url': movie_url,
                'rating': rating_num,
                'quote': quote
            }
            movies.append(movie_info)
            print(f'✅ Extracted movie: {rank} - {title} - rating: {rating_num}')

        except Exception as e:
            print(f'❌ Error while parsing a movie entry: {e}')
            continue

    return movies


def write_to_file(content, filename='douban_movies.txt'):
    """Append one record to the output file."""
    try:
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False, indent=2) + ',\n')
        print(f'💾 Saved: {content["rank"]} - {content["title"]}')
    except Exception as e:
        print(f'❌ Error while saving the file: {e}')


def check_environment():
    """Check the runtime environment."""
    try:
        import requests
        import bs4
        print('✅ Environment check passed')
        return True
    except ImportError as e:
        print(f'❌ Missing a required library: {e}')
        print('Please run: pip install requests beautifulsoup4')
        return False


def main():
    """Entry point."""
    print('🎬 Douban Movie Top 250 scraper starting...')

    # Check the environment
    if not check_environment():
        return

    base_url = 'https://movie.douban.com/top250'

    # Scrape only the first 2 pages as a demo (change to 10 pages to get everything)
    for i in range(2):  # change to range(10) to scrape all 10 pages
        start = i * 25
        url = f'{base_url}?start={start}&filter='
        print(f'\n📄 Scraping page {i+1}: {url}')

        html = get_page(url)
        if html:
            movies = parse_page(html)
            for movie in movies:
                write_to_file(movie)
        else:
            print(f'❌ Failed to scrape page {i+1}')

        # Add a delay so requests are not too frequent
        print('⏳ Waiting 2 seconds...')
        time.sleep(2)

    print('\n🎉 Done! Data saved to douban_movies.txt')
    print('📊 You can open this file with Excel or another tool to view the data')


if __name__ == '__main__':
    main()
```
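The code is above (written by AI); I love watching movies.
This example implements all four core steps of a web scraper:
Send the request: get_page sends an HTTP request to the Douban server to obtain the raw page data.
Receive the response: if the request succeeds (status code 200), the function returns the page's HTML.
Parse the content: parse_page uses the BeautifulSoup library to parse the HTML and extract each movie's rank, title, rating, and other fields.
Store the data: write_to_file appends each extracted record to a text file as a JSON object.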
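
One note on the storage step: the script appends each record as pretty-printed JSON followed by a comma, so the file is easy to read by eye but is not a single valid JSON document. A possible alternative, which is not what the script does, is to write one JSON object per line (the JSON Lines convention) so the file can be read back reliably. This is a minimal sketch; the file name `movies.jsonl`, the helper names, and the records are made up for illustration.

```python
import json
import os
import tempfile


def append_record(record, filename):
    """Append one record as a single JSON line (hypothetical helper)."""
    with open(filename, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(filename):
    """Read every non-empty line back into a list of dicts."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with made-up records written to a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "movies.jsonl")
append_record({"rank": "1", "title": "Example Movie A", "rating": "9.0"}, demo_path)
append_record({"rank": "2", "title": "示例电影 B", "rating": "8.5"}, demo_path)

records = load_records(demo_path)
print(len(records), records[1]["title"])
```

Because every line is a complete JSON value, a run that is interrupted halfway still leaves a file whose existing lines can be parsed.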

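What do you need to learn?

Core commands to learn:

1. How to set up the scraper
pip install requests beautifulsoup4 # install the required Python libraries
python --version # check the Python environment
pip list # list the installed libraries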
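
The same check can also be done from Python. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name. Note that `beautifulsoup4` is imported under the name `bs4`.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# requests and beautifulsoup4 are imported as "requests" and "bs4"
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Missing:", ", ".join(missing), "-> run: pip install requests beautifulsoup4")
else:
    print("Environment looks ready")
```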

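2. How to create, save, and run the scraper file
Create:
mkdir my_spider_project # create the project directory (replace my_spider_project with your own path); in my case, mkdir E:\课-网安\网导5\python_spider
cd my_spider_project # enter the directory; likewise, cd E:\课-网安\网导5\python_spider
Write:
Option 1: create the file in an editor # I have VS Code installed and saved the file as douban_spider.py: code douban_spider.py
Option 2: write the file line by line with echo (the code is long, so this is error-prone and not recommended)
Option 3: nano douban_spider.py (Mac/Linux) or notepad douban_spider.py (Windows) # in nano, press Ctrl+O to save and Ctrl+X to exit; notepad opens a window where you paste the code and save. touch douban_spider.py only creates an empty file, which you then open in an editor.
Run:
python douban_spider.py # run the scraper; the same on every platform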

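3. The scraper workflow
Send request [HTTP request, request headers] -> Receive response [status code (200/404), HTML] -> Parse content [data extraction, selectors] -> Store data [file output, database]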
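
The four stages can be sketched as one small pipeline. This is only an illustration, not the script above. So that it runs offline with nothing extra installed, `fetch` is a stub that returns hand-written HTML (`SAMPLE_HTML`, made up for this example), and the standard library's `html.parser` stands in for BeautifulSoup.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample standing in for a downloaded page
SAMPLE_HTML = """
<div class="item"><span class="title">Movie A</span><span class="rating_num">9.1</span></div>
<div class="item"><span class="title">Movie B</span><span class="rating_num">8.7</span></div>
"""


def fetch(url):
    """Stages 1-2 (request, response): stubbed to return (status, html) without network access."""
    return 200, SAMPLE_HTML


class MovieParser(HTMLParser):
    """Stage 3 (parse): collect the text of the title and rating_num spans of each item."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "item":
            self.movies.append({})
        elif tag == "span" and css_class in ("title", "rating_num"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.movies:
            self.movies[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse(html):
    parser = MovieParser()
    parser.feed(html)
    return parser.movies


def store(movies):
    """Stage 4 (store): serialize to a JSON string; a real script would write it to a file."""
    return json.dumps(movies, ensure_ascii=False)


status, html = fetch("https://movie.douban.com/top250?start=0&filter=")
movies = parse(html) if status == 200 else []
saved = store(movies)
print(saved)
```

Keeping each stage in its own function means one stage can be replaced (for example, a real HTTP request instead of the stub) without touching the others.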

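4. What the configuration parameters mean

You do not need to be able to write these, but you should understand what they do.

headers = {
'User-Agent': 'Mozilla/5.0...' # tells the server that a "browser" is visiting
}

time.sleep(2) # wait 2 seconds after each request; a polite crawler
In other words:
* User-Agent: presents the request as coming from a browser, so it is less likely to be identified as a crawler
* Delay: avoids putting pressure on the server; be a "polite" visitor
* Robots protocol: the rules by which a site tells crawlers what may and may not be crawled # for example, Douban's: https://www.douban.com/robots.txt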
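
Python's standard library can read robots rules directly. The sketch below parses a made-up rule set (`SAMPLE_ROBOTS` is hypothetical and is not Douban's actual file) so that it runs offline. A real crawler would load the site's own robots.txt instead. The `example.com` URLs and both helper names are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS.splitlines())


def is_allowed(url, user_agent="*"):
    """True if the parsed rules permit this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)


def polite_delay(user_agent="*", default=2):
    """Seconds to wait between requests: the site's Crawl-delay if given, else a default."""
    delay = robots.crawl_delay(user_agent)
    return delay if delay is not None else default


print(is_allowed("https://example.com/public/page"))
print(is_allowed("https://example.com/private/data"))
print(polite_delay())
```

Checking `is_allowed` before each request and sleeping for `polite_delay()` afterwards covers the delay and robots points from the list above.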

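Experiment 2: Handling dynamic content

Learning goals: understand the difference between dynamic and static web pages.
Learn to install Selenium and the matching browser driver.
Use Selenium to simulate browser actions and retrieve dynamically loaded data.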
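
The difference between the two kinds of page can be seen without a browser. In the sketch below, both pages are hand-written examples. The static one ships its items in the HTML, while the dynamic one ships an empty list plus a script that would fill it in a browser. A client that does not run JavaScript (here the standard library's `html.parser`, used as a stand-in for what a plain HTTP scraper sees) finds the items only in the first.

```python
from html.parser import HTMLParser

# Hypothetical static page: the items are already in the HTML
STATIC_PAGE = """
<ul id="list">
  <li class="item">Item 1</li>
  <li class="item">Item 2</li>
</ul>
"""

# Hypothetical dynamic page: the items would be created by JavaScript in a browser
DYNAMIC_PAGE = """
<ul id="list"></ul>
<script>
  for (const name of ["Item 1", "Item 2"]) {
    const li = document.createElement("li");
    li.className = "item";
    li.textContent = name;
    document.getElementById("list").appendChild(li);
  }
</script>
"""


class ItemCounter(HTMLParser):
    """Count <li class="item"> elements present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "item":
            self.count += 1


def count_items(html):
    counter = ItemCounter()
    counter.feed(html)
    return counter.count


print(count_items(STATIC_PAGE), count_items(DYNAMIC_PAGE))
```

This is why a dynamic page needs a tool that actually executes the script, such as a real browser driven by Selenium.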

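Experiment 2: overall workflow

Step 1: Install the Selenium library

Step 2: Download the browser driver
Chrome
Edge
Step 3: Configure the driver path
Omitted
Step 4: Start scraping (creating the file is the same as in Experiment 1, so it is omitted)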

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.edge.service import Service
import time
import csv
import json
import os
import sys


class DynamicSpider:
    def __init__(self):
        # Option 1: look for the Edge driver in likely locations
        driver_path = self.find_edge_driver()

        if driver_path:
            # Use the driver path that was found
            service = Service(executable_path=driver_path)
            self.driver = webdriver.Edge(service=service)
        else:
            # Option 2: let the system locate the driver automatically
            try:
                self.driver = webdriver.Edge()
            except Exception as e:
                print(f"❌ Could not locate the Edge driver automatically: {e}")
                print("\n💡 Please specify the Edge driver path manually")
                driver_path = input("Enter the full path to msedgedriver.exe: ").strip()

                if driver_path and os.path.exists(driver_path):
                    service = Service(executable_path=driver_path)
                    self.driver = webdriver.Edge(service=service)
                else:
                    raise Exception("Edge driver not found; make sure it is installed correctly")

        # Set the wait timeout
        self.wait = WebDriverWait(self.driver, 15)

        # Hide automation fingerprints
        self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
        self.driver.execute_cdp_cmd('Network.setUserAgentOverride', {
            "userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188'
        })

    def find_edge_driver(self):
        """Look for the Edge driver file in likely locations."""
        possible_paths = [
            # Current directory
            "msedgedriver.exe",
            # Python Scripts directory (sys.prefix is the root of the Python installation)
            os.path.join(sys.prefix, "Scripts", "msedgedriver.exe"),
            # Common install locations
            os.path.join(os.environ.get('PROGRAMFILES', ''), "Microsoft Web Driver", "msedgedriver.exe"),
            os.path.join(os.environ.get('LOCALAPPDATA', ''), "Microsoft", "Edge", "Application", "msedgedriver.exe"),
        ]

        for path in possible_paths:
            if os.path.exists(path):
                print(f"✅ Found Edge driver: {path}")
                return path

        print("❌ Edge driver file not found")
        return None

    def open_website(self, url):
        """Open the target website."""
        print(f"🌐 Opening: {url}")

        try:
            self.driver.get(url)
            # Wait for the page to load
            time.sleep(5)
            print("✅ Website opened")
            return True
        except Exception as e:
            print(f"❌ Failed to open the website: {e}")
            return False

    def handle_possible_login(self):
        """Handle a possible login requirement."""
        print("🔍 Checking page state...")
        time.sleep(3)

        current_url = self.driver.current_url
        page_title = self.driver.title.lower()

        # Check whether this is a login page (Chinese keywords kept for Chinese sites)
        login_indicators = ['login', 'signin', '登录', '账号', 'password']
        if any(indicator in current_url.lower() or indicator in page_title for indicator in login_indicators):
            print("🔐 Login page detected; manual login required")
            return self.manual_login()

        print("✅ Page state is normal")
        return True

    def manual_login(self):
        """Helper for logging in by hand."""
        print("\n" + "=" * 60)
        print("🔐 Login required")
        print("Please complete these steps in the browser:")
        print("1. Enter your username and password to log in")
        print("2. Complete any verification steps")
        print("3. Wait for the redirect to the target page")
        print("4. Return to this window and press Enter to continue")
        print("=" * 60)

        input("Press Enter when done...")

        # Wait for the page to settle
        time.sleep(3)
        return True

    def scroll_to_bottom(self):
        """Scroll to the bottom of the page."""
        print("📜 Scrolling the page...")

        scroll_attempts = 0
        max_attempts = 5

        while scroll_attempts < max_attempts:
            # Current scroll height
            current_height = self.driver.execute_script("return document.body.scrollHeight")

            # Scroll to the bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

            # New scroll height; stop when nothing more was loaded
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == current_height:
                break

            scroll_attempts += 1
            print(f"📜 Scroll progress: {scroll_attempts}/{max_attempts}")

    def extract_product_info(self):
        """Extract product information."""
        print("🔍 Extracting product information...")
        products = []

        try:
            # Show information about the current page
            print(f"📄 Current page: {self.driver.title}")
            print(f"🔗 Current URL: {self.driver.current_url}")

            # Try several product selectors
            selectors_to_try = [
                ".product", ".item", ".goods", ".product-item",
                ".commodity", "[class*='product']", "[class*='item']",
                "div[class*='product']", "li[class*='product']"
            ]

            found_elements = []
            for selector in selectors_to_try:
                try:
                    elements = self.driver.find_elements(By.CSS_SELECTOR, selector)
                    if elements:
                        print(f"✅ Selector '{selector}' matched {len(elements)} elements")
                        found_elements = elements
                        break
                except Exception:
                    continue

            if not found_elements:
                print("❌ No product elements found")
                # Save the page source for debugging
                self.save_page_source()
                return products

            # Extract product information
            for i, element in enumerate(found_elements[:10]):  # limit to the first 10 products
                try:
                    # Text of the element
                    element_text = element.text.strip()
                    if not element_text:
                        continue

                    # Simple extraction: first line is the name
                    lines = element_text.split('\n')
                    name = lines[0] if lines else "Unknown product"
                    price = "Unknown price"

                    # Look for a line that contains a price
                    for line in lines:
                        if '¥' in line or '￥' in line or '元' in line:
                            price = line
                            break

                    product_info = {
                        "index": i + 1,
                        "name": name,
                        "price": price,
                        "raw_text": element_text[:100]  # first 100 characters
                    }

                    products.append(product_info)
                    print(f"✅ Extracted product {i+1}: {name}")

                except Exception as e:
                    print(f"❌ Error while extracting product {i+1}: {e}")
                    continue

        except Exception as e:
            print(f"❌ Error while extracting product information: {e}")

        return products

    def save_page_source(self):
        """Save the page source for debugging."""
        try:
            filename = f"debug_page_{int(time.time())}.html"
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(self.driver.page_source)
            print(f"💾 Page source saved to: {filename}")
        except Exception as e:
            print(f"❌ Failed to save the page source: {e}")

    def save_data(self, products, base_filename="products"):
        """Save the data to files."""
        if not products:
            print("❌ No data to save")
            return

        timestamp = int(time.time())
        csv_filename = f"{base_filename}_{timestamp}.csv"
        json_filename = f"{base_filename}_{timestamp}.json"

        # Save as CSV
        try:
            with open(csv_filename, 'w', newline='', encoding='utf-8') as file:
                writer = csv.DictWriter(file, fieldnames=products[0].keys())
                writer.writeheader()
                writer.writerows(products)
            print(f"💾 CSV data saved to: {csv_filename}")
        except Exception as e:
            print(f"❌ Failed to save CSV: {e}")

        # Save as JSON
        try:
            with open(json_filename, 'w', encoding='utf-8') as file:
                json.dump(products, file, ensure_ascii=False, indent=2)
            print(f"💾 JSON data saved to: {json_filename}")
        except Exception as e:
            print(f"❌ Failed to save JSON: {e}")

    def close(self):
        """Close the browser."""
        try:
            self.driver.quit()
            print("🔚 Browser closed")
        except Exception:
            print("⚠️  Warning while closing the browser")


def main():
    """Entry point."""
    print("🚀 Starting the simplified Edge scraper...")

    spider = None
    try:
        # Create the spider instance
        spider = DynamicSpider()

        # Get the target URL
        target_url = input("Enter the URL to scrape (press Enter to use the JD example): ").strip()
        if not target_url:
            target_url = "https://www.jd.com"

        # Open the website
        if not spider.open_website(target_url):
            return

        # Handle login (if needed)
        spider.handle_possible_login()

        # Scroll to load more content
        spider.scroll_to_bottom()

        # Extract product information
        products = spider.extract_product_info()

        # Save the data
        if products:
            spider.save_data(products)
            print(f"🎉 Extracted {len(products)} products")
        else:
            print("😞 No product information extracted")
            print("💡 Suggestions:")
            print("   - Check whether the site requires login")
            print("   - Browse manually to confirm that products are displayed")
            print("   - Inspect the generated debug_page file to analyze the page structure")

    except Exception as e:
        print(f"❌ Program error: {e}")
        print("💡 Possible fixes:")
        print("   - Confirm that the Edge browser is installed")
        print("   - Download the matching Edge driver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/")
        print("   - Put msedgedriver.exe in the project folder")

    finally:
        if spider:
            input("Press Enter to close the browser...")
            spider.close()


if __name__ == "__main__":
    main()
```
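The target sites were JD.com and Taobao.
How dynamic pages work: JavaScript renders the content in the browser
Scraping method: Selenium simulates a real browser
Dynamic rendering: understand how JavaScript generates content dynamically
Browser automation: how Selenium simulates real user actions
Waiting: why waiting is needed and how to wait correctly
Element location: which locating method suits which situation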
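
On the waiting point: a fixed `time.sleep` is either longer than needed or too short. The usual alternative is to poll until a condition holds, giving up after a timeout. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` and `FakePage` are hypothetical names invented for the example.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose content appears only after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def find_items(self):
        self.polls += 1
        return ["item"] if self.polls >= self.ready_after else []


page = FakePage(ready_after=3)
items = wait_until(page.find_items, timeout=2.0, poll_interval=0.01)
print(items, page.polls)
```

In Selenium the same pattern is written as `WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".item")))`, where the `.item` selector is only an example. The Experiment 2 script imports these names but relies on `time.sleep` for its waits.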

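Reflections

1. In Markdown, a line break is made by ending the line with two spaces and then pressing Enter.
2. The textbook 《网络空间安全导论（微课版）》 (Introduction to Cyberspace Security, Micro-lecture Edition) is about understanding, with no hands-on work; it really is an introduction through and through.
3. Prioritization. In most cases, think first about the things that are important and genuinely valuable. I cannot yet write scraper code myself, but the point of studying this is to become familiar with the scraping workflow and to pick up the knowledge beyond the code. Do not chase so-called completeness or "perfection" and spend your time on trivial details, only to end up accomplishing nothing.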
