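Introduction
The number of pages on a site gives a basis for estimating how long a crawl will take.
The technology stack indicates how difficult the site will be to crawl.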
Estimating the size of a site
Open www.baidu.com and search for site:www.cnblogs.com to see roughly how many pages of cnblogs (博客园) are indexed: about 10 million.
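To turn a page count into a time estimate, a rough calculation like the following can help. It is only a sketch: the function name, the one-second delay, and the single worker are assumptions for illustration, and real crawls also spend time on retries and parsing.

```python
def estimate_crawl_days(page_count: int, delay_seconds: float = 1.0, workers: int = 1) -> float:
    """Rough crawl duration in days.

    Assumes one request per page and a fixed delay between requests
    on each worker (both are illustrative assumptions).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = page_count * delay_seconds / workers
    return total_seconds / 86400  # 86400 seconds in a day


# About 10 million pages (the cnblogs estimate), one request per second, one worker
days = estimate_crawl_days(10_000_000)
print(f"{days:.1f} days")  # 115.7 days
```

At one request per second, 10 million pages take close to four months, which is why the page count is worth checking before starting.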
Identifying the technology stack a site uses
pip install builtwith
import builtwith
builtwith.parse('http://www.cnblogs.com')
{'advertising-networks': ['DoubleClick for Publishers (DFP)'], 'javascript-frameworks': ['Vue.js', 'jQuery']}
# The result shows the site uses Vue.js and jQuery.
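One way to use this result is as a hint about whether plain HTTP requests will be enough. The sketch below works on the dictionary shown in the builtwith output; the helper name and the set of frameworks treated as client-side renderers are assumptions, and a detected framework only suggests, rather than proves, that content is rendered in the browser.

```python
# Hypothetical list of frameworks that often render content in the browser
CLIENT_SIDE_FRAMEWORKS = {"Vue.js", "React", "AngularJS", "Ember.js"}


def likely_needs_js_rendering(tech: dict) -> bool:
    """Return True if any detected JavaScript framework is in the set above."""
    frameworks = set(tech.get("javascript-frameworks", []))
    return bool(frameworks & CLIENT_SIDE_FRAMEWORKS)


# Result copied from the builtwith.parse() output for cnblogs
cnblogs_tech = {
    "advertising-networks": ["DoubleClick for Publishers (DFP)"],
    "javascript-frameworks": ["Vue.js", "jQuery"],
}
print(likely_needs_js_rendering(cnblogs_tech))  # True
```

If the check returns True, it is worth inspecting a few pages by hand to see whether the data is already in the raw HTML before reaching for a headless browser.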
Finding the site's owner
pip install python-whois
import whois
print(whois.whois('www.changeworld.shop'))
{"domain_name": "CHANGEWORLD.SHOP","registrar": "Bizcn.com,Inc","whois_server": null,"referral_url": null,"updated_date": "2019-04-24 04:22:03","creation_date": "2019-04-15 14:23:58","expiration_date": "2020-04-15 23:59:59","name_servers": ["NS1.BDYDNS.CN","NS2.BDYDNS.CN"],"status": "clientTransferProhibited https://icann.org/epp#clientTransferProhibited","emails": null,"dnssec": "unsigned","name": null,"org": null,"address": null,"city": null,"state": "Zhejiang","zipcode": null,"country": "CN"
}
This gives a rough picture of who registered the domain and where.
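The fields that matter most can be pulled into a short summary. This is a minimal sketch over a plain dictionary holding the values printed in the WHOIS output; the function name and the fallback from org to name to registrar are assumptions. It also assumes dates are strings in the printed format, whereas the whois library may return datetime objects or lists for some domains.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # format of the dates as printed in the output


def summarize_whois(record: dict) -> dict:
    """Pick out the fields most useful for identifying who runs a site."""
    # Owner details are often hidden, so fall back to the registrar (an assumption)
    contact = record.get("org") or record.get("name") or record.get("registrar")
    created = datetime.strptime(record["creation_date"], DATE_FORMAT)
    expires = datetime.strptime(record["expiration_date"], DATE_FORMAT)
    return {
        "contact": contact,
        "country": record.get("country"),
        "name_servers": record.get("name_servers") or [],
        "registered_days": (expires - created).days,
    }


# Values copied from the WHOIS output for changeworld.shop
record = {
    "registrar": "Bizcn.com,Inc",
    "creation_date": "2019-04-15 14:23:58",
    "expiration_date": "2020-04-15 23:59:59",
    "name_servers": ["NS1.BDYDNS.CN", "NS2.BDYDNS.CN"],
    "name": None,
    "org": None,
    "country": "CN",
}
print(summarize_whois(record))
```

Here the owner fields are empty, so the registrar is the best available lead.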