102302145 黄加鸿 Data Collection and Fusion Technology, Assignment 2

Published 2025/11/11 18:48:29 · Source: https://www.cnblogs.com/jh2680513769/p/19211090

Assignment 2


Contents
  • Assignment 2
    • Task ①
      • 1) Code and Results
      • 2) Reflections
      • 3) Gitee Link
    • Task ②
      • 1) Code and Results
      • 2) Reflections
      • 3) Gitee Link
    • Task ③
      • 1) Code and Results
        • F12 Debugging Analysis GIF
      • 2) Reflections
      • 3) Gitee Link


Task ①

1) Code and Results

The China Weather Network site (weather.com.cn) was already analyzed in an earlier assignment, so the page analysis is not repeated here; in the end BeautifulSoup was chosen to parse the pages.

Core code

import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

class WeatherForecast:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode = {"福州": "101230101", "厦门": "101230201", "泉州": "101230501"}  # city code map

    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return
        # Build a per-city URL from the city code so several pages can be crawled
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            # Locate the forecast table
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    if li.select('p[class="tem"] span'):
                        temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
                    else:
                        temp = li.select('p[class="tem"] i')[0].text
                    print(city, date, weather, temp)
                    self.db.insert(city, date, weather, temp)  # self.db: database helper initialized elsewhere
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
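The forecastCity method saves each row through self.db, but the database helper itself is not shown in the post. A minimal sketch of such a helper, assuming sqlite3 and a weathers(city, date, weather, temp) table; the file name and column names here are my own guesses, not the author's:

```python
# Hypothetical database helper for the weather crawler above.
# Table layout is assumed from the insert call: (city, date, weather, temp).
import sqlite3

class WeatherDB:
    def openDB(self):
        self.con = sqlite3.connect("./weathers.db")
        self.cursor = self.con.cursor()
        try:
            # Composite key: one row per city per forecast date
            self.cursor.execute(
                "create table weathers(wCity varchar(16), wDate varchar(16), "
                "wWeather varchar(64), wTemp varchar(32), "
                "constraint pk_weather primary key (wCity, wDate))")
        except Exception:
            # Table already exists: clear it for a fresh crawl
            self.cursor.execute("delete from weathers")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        try:
            self.cursor.execute(
                "insert into weathers (wCity, wDate, wWeather, wTemp) values (?,?,?,?)",
                (city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        self.cursor.execute("select * from weathers")
        return self.cursor.fetchall()
```

With this helper, the crawler class would set self.db = WeatherDB() and call self.db.openDB() before crawling and self.db.closeDB() afterwards.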

Run result

Weather console screenshot

Checking the data saved in the database:

Database view screenshot 1

2) Reflections

Tasks like this basically all require defining a crawler class plus a database class, which keeps the design clear. I learned how to define and write a database class that implements open, close, insert, and query operations.

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/1.py

Task ②

1) Code and Results

First open the site and inspect it with F12; after refreshing the page, the JSON file holding the stock data (a request whose name starts with "get?") shows up in the Network log, as shown:

Network analysis screenshot

Stock page screenshot

By comparing the URL parameters used to load different stock pages (for example, the 'pn' parameter is the page number), a paged, multi-page crawl can be designed.

Also, the response body is a dictionary-style '{Key}: {Value}' structure wrapped in a jQuery callback, so the json library is needed to parse it into a structure first; fields such as 'f2', 'f12', and 'f14' then yield the stock price, code, name, and so on.
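The unwrapping step can be shown on its own: strip the jQuery callback wrapper down to the JSON body, then parse. A minimal sketch; the callback name and the field values in `sample` are made up for illustration:

```python
# Strip a JSONP wrapper "callback({...})" down to its JSON body and parse it.
# The callback name and the f2/f12/f14 values below are invented examples.
import json

def parse_jsonp(jsonp_str):
    start = jsonp_str.find('(') + 1   # first '(' ends the callback name
    end = jsonp_str.rfind(')')        # last ')' closes the wrapper
    return json.loads(jsonp_str[start:end])

sample = 'jQuery12345_1762785231380({"data": {"diff": [{"f12": "600000", "f14": "浦发银行", "f2": 1234}]}});'
obj = parse_jsonp(sample)
stock = obj["data"]["diff"][0]
print(stock["f12"], stock["f14"], f"{stock['f2']/100:.2f}")  # f2 is price * 100
```

Dividing 'f2' by 100 mirrors the scaling the full crawler applies: the API returns prices and percentages as integers scaled by 100.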

Based on this analysis, a multi-page crawler that saves its data was designed; the core code follows:

Core code

import requests
import json
import sqlite3

class StockDB:
    def openDB(self):
        self.con = sqlite3.connect("./stocks.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table stocks(sNum varchar(16), sCode varchar(16), sName varchar(32), sNewest varchar(16), sUpdown varchar(16), sUpdown_num varchar(16), sTurnover varchar(32), sAmplitude varchar(16), constraint pk_stocks primary key (sCode))")
        except:
            self.cursor.execute("delete from stocks")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, num, code, name, newest, updown, updown_num, turnover, amplitude):
        # The parameter was originally spelled "Code" while the body used "code",
        # which would raise NameError; renamed to match.
        try:
            self.cursor.execute(
                "insert into stocks (sNum, sCode, sName, sNewest, sUpdown, sUpdown_num, sTurnover, sAmplitude) values (?,?,?,?,?,?,?,?)",
                (num, code, name, newest, updown, updown_num, turnover, amplitude))
        except Exception as err:
            print(err)

# Vary the API parameters: pn (page number) drives the paged crawl
urls = [(
    f"https://push2.eastmoney.com/api/qt/clist/get?np=1&fltt=1&invt=2&cb=jQuery371037824690299744046_1762785231380&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A81%2Bs%3A262144%2Bf%3A!2&fields=f12%2Cf13%2Cf14%2Cf1%2Cf2%2Cf4%2Cf3%2Cf152%2Cf5%2Cf6%2Cf7%2Cf15%2Cf18%2Cf16%2Cf17%2Cf10%2Cf8%2Cf9%2Cf23&fid=f3&"
    f"pn={page}"  # use the API's pn parameter for paging
    f"&pz=20&po=1&dect=1&ut=fa5fd1943c7b386f172d6893dbfba10b&wbp2u=%7C0%7C0%7C0%7Cweb&_=1762785231382")
    for page in range(1, 6)]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 SLBrowser/9.0.6.8151 SLBChan/111 SLBVPV/64-bit'}

def parse_jsonp(jsonp_str):
    """Parse JSONP-wrapped data"""
    try:
        # Take the text between the first '(' and the last ')'
        start = jsonp_str.find('(') + 1
        end = jsonp_str.rfind(')')
        json_str_clean = jsonp_str[start:end]
        return json.loads(json_str_clean)
    except Exception as e:
        print(f"解析JSONP失败: {e}")
        return None

# Create a database instance and open it
db = StockDB()
db.openDB()
print(f"{'序号':<6}{'代码':<12}{'名称':<12}{'最新价':<8}{'涨跌幅':<8}{'涨跌额':<8}{'成交量(手)':<8}{'振幅':>6}")
num = 0
for url in urls:
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'
    json_str = resp.text
    structure_data = parse_jsonp(json_str)
    # Extract the stock list (guard against a failed parse returning None)
    if structure_data and 'data' in structure_data and 'diff' in structure_data['data']:
        stocks = structure_data['data']['diff']
        for i, stock in enumerate(stocks):
            num = num + 1
            code = stock.get('f12')
            name = stock.get('f14')
            newest = f"{stock.get('f2')/100:.2f}"
            updown = f"{stock.get('f3')/100:.2f}%"
            updown_num = f"{stock.get('f4')/100:.2f}"
            turnover = f"{stock.get('f5')/10000:.2f}万"
            amplitude = f"{stock.get('f7')/100:.2f}%"
            # Save the row into the database
            db.insert(str(num), code, name, newest, updown, updown_num, turnover, amplitude)
            # Print only the first and last ten rows
            if num <= 10 or num >= 90:
                print("%-6d%-12s%-12s%-10s%-10s%-10s%-15s%-10s" % (num, code, name, newest, updown, updown_num, turnover, amplitude))
print("... ...")
# Close the database
db.closeDB()
print("completed")

Run result

Stock console screenshot

2) Reflections

Parsing and extracting with the json library is convenient, but the raw values differ from what the page displays, so a bit of patience is needed for formatted output and adding units. I learned to use F12 to inspect the Network log, analyze an API's parameters, and use those parameters to design the crawler.

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/2.py

Task ③

1) Code and Results

As before, start by analyzing the Network log with F12; a file named payload.js turns out to hold the university ranking data:

Page information screenshot

University page screenshot

Core code

import requests
import re
import sqlite3

class UniversityDB:
    def openDB(self):
        self.con = sqlite3.connect("./universities.db")
        self.cursor = self.con.cursor()
        try:
            # Constraint renamed from the original pk_stocks (copy-paste slip)
            self.cursor.execute("create table universities(uRank varchar(16), uName varchar(64), uProvince varchar(16), uCategory varchar(16), uScore varchar(16), constraint pk_universities primary key (uName))")
        except:
            self.cursor.execute("delete from universities")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insertDB(self, rank, name, province, category, score):
        try:
            self.cursor.execute(
                "insert into universities (uRank, uName, uProvince, uCategory, uScore) values (?,?,?,?,?)",
                (rank, name, province, category, score))
        except Exception as err:
            print(err)

    def showDB(self):
        self.cursor.execute("select * from universities")
        rows = self.cursor.fetchall()
        print("\n数据库中的大学排名数据:")
        print(f"{'排名':<6}{'学校':<20}{'省市':<8}{'类型':<8}{'总分':<8}")
        for row in rows:
            print(f"{row[0]:<6}{row[1]:<20}{row[2]:<8}{row[3]:<8}{row[4]:<8}")

url = "https://www.shanghairanking.cn/_nuxt/static/1762223212/rankings/bcur/2021/payload.js"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 SLBrowser/9.0.6.8151 SLBChan/111 SLBVPV/64-bit',
    'Referer': 'https://www.shanghairanking.cn/rankings/bcur/2021'  # added after getting blocked: requests without a Referer were rejected
}
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
content = resp.text
# Extract the fields with regular expressions
ranks = re.findall(r'ranking:([^,]+)', content)
names = re.findall(r'univNameCn:"([^"]+)"', content)
scores = re.findall(r'score:([^,]+)', content)
provinces = re.findall(r'province:([^,]+)', content)
categorys = re.findall(r'univCategory:([a-zA-Z])', content)
# Check that the extracted field counts match
if len(names) != len(scores) or len(scores) != len(provinces) or len(provinces) != len(categorys):
    print("错误,提取信息数量不匹配!")
else:
    # Create a database instance and open it
    db = UniversityDB()
    db.openDB()
    # Build the token-to-value mapping dictionary
    arr1 = 'a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, _, $, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, an, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, aA, aB, aC, aD, aE, aF, aG, aH, aI, aJ, aK, aL, aM, aN, aO, aP, aQ, aR, aS, aT, aU, aV, aW, aX, aY, aZ, a_, a$, ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp, bq, br, bs, bt, bu, bv, bw, bx, by, bz, bA, bB, bC, bD, bE, bF, bG, bH, bI, bJ, bK, bL, bM, bN, bO, bP, bQ, bR, bS, bT, bU, bV, bW, bX, bY, bZ, b_, b$, ca, cb, cc, cd, ce, cf, cg, ch, ci, cj, ck, cl, cm, cn, co, cp, cq, cr, cs, ct, cu, cv, cw, cx, cy, cz, cA, cB, cC, cD, cE, cF, cG, cH, cI, cJ, cK, cL, cM, cN, cO, cP, cQ, cR, cS, cT, cU, cV, cW, cX, cY, cZ, c_, c$, da, db, dc, dd, de, df, dg, dh, di, dj, dk, dl, dm, dn, do0, dp, dq, dr, ds, dt, du, dv, dw, dx, dy, dz, dA, dB, dC, dD, dE, dF, dG, dH, dI, dJ, dK, dL, dM, dN, dO, dP, dQ, dR, dS, dT, dU, dV, dW, dX, dY, dZ, d_, d$, ea, eb, ec, ed, ee, ef, eg, eh, ei, ej, ek, el, em, en, eo, ep, eq, er, es, et, eu, ev, ew, ex, ey, ez, eA, eB, eC, eD, eE, eF, eG, eH, eI, eJ, eK, eL, eM, eN, eO, eP, eQ, eR, eS, eT, eU, eV, eW, eX, eY, eZ, e_, e$, fa, fb, fc, fd, fe, ff, fg, fh, fi, fj, fk, fl, fm, fn, fo, fp, fq, fr, fs, ft, fu, fv, fw, fx, fy, fz, fA, fB, fC, fD, fE, fF, fG, fH, fI, fJ, fK, fL, fM, fN, fO, fP, fQ, fR, fS, fT, fU, fV, fW, fX, fY, fZ, f_, f$, ga, gb, gc, gd, ge, gf, gg, gh, gi, gj, gk, gl, gm, gn, go, gp, gq, gr, gs, gt, gu, gv, gw, gx, gy, gz, gA, gB, gC, gD, gE, gF, gG, gH, gI, gJ, gK, gL, gM, gN, gO, gP, gQ, gR, gS, gT, gU, gV, gW, gX, gY, gZ, g_, g$, ha, hb, hc, hd, he, hf, hg, hh, hi, hj, hk, hl, hm, hn, ho, hp, hq, hr, hs, ht, hu, hv, hw, hx, hy, hz, hA, hB, hC, hD, hE, hF, hG, hH, hI, hJ, hK, hL, hM, hN, hO, hP, hQ, hR, hS, hT, hU, hV, hW, hX, hY, hZ, h_, h$, ia, ib, ic, id, ie, if0, ig, ih, ii, ij, ik, 
il, im, in0, io, ip, iq, ir, is, it, iu, iv, iw, ix, iy, iz, iA, iB, iC, iD, iE, iF, iG, iH, iI, iJ, iK, iL, iM, iN, iO, iP, iQ, iR, iS, iT, iU, iV, iW, iX, iY, iZ, i_, i$, ja, jb, jc, jd, je, jf, jg, jh, ji, jj, jk, jl, jm, jn, jo, jp, jq, jr, js, jt, ju, jv, jw, jx, jy, jz, jA, jB, jC, jD, jE, jF, jG, jH, jI, jJ, jK, jL, jM, jN, jO, jP, jQ, jR, jS, jT, jU, jV, jW, jX, jY, jZ, j_, j$, ka, kb, kc, kd, ke, kf, kg, kh, ki, kj, kk, kl, km, kn, ko, kp, kq, kr, ks, kt, ku, kv, kw, kx, ky, kz, kA, kB, kC, kD, kE, kF, kG, kH, kI, kJ, kK, kL, kM, kN, kO, kP, kQ, kR, kS, kT, kU, kV, kW, kX, kY, kZ, k_, k$, la, lb, lc, ld, le, lf, lg, lh, li, lj, lk, ll, lm, ln, lo, lp, lq, lr, ls, lt, lu, lv, lw, lx, ly, lz, lA, lB, lC, lD, lE, lF, lG, lH, lI, lJ, lK, lL, lM, lN, lO, lP, lQ, lR, lS, lT, lU, lV, lW, lX, lY, lZ, l_, l$, ma, mb, mc, md, me, mf, mg, mh, mi, mj, mk, ml, mm, mn, mo, mp, mq, mr, ms, mt, mu, mv, mw, mx, my, mz, mA, mB, mC, mD, mE, mF, mG, mH, mI, mJ, mK, mL, mM, mN, mO, mP, mQ, mR, mS, mT, mU, mV, mW, mX, mY, mZ, m_, m$, na, nb, nc, nd, ne, nf, ng, nh, ni, nj, nk, nl, nm, nn, no, np, nq, nr, ns, nt, nu, nv, nw, nx, ny, nz, nA, nB, nC, nD, nE, nF, nG, nH, nI, nJ, nK, nL, nM, nN, nO, nP, nQ, nR, nS, nT, nU, nV, nW, nX, nY, nZ, n_, n$, oa, ob, oc, od, oe, of, og, oh, oi, oj, ok, ol, om, on, oo, op, oq, or, os, ot, ou, ov, ow, ox, oy, oz, oA, oB, oC, oD, oE, oF, oG, oH, oI, oJ, oK, oL, oM, oN, oO, oP, oQ, oR, oS, oT, oU, oV, oW, oX, oY, oZ, o_, o$, pa, pb, pc, pd, pe, pf, pg, ph, pi, pj, pk, pl, pm, pn, po, pp, pq, pr, ps, pt, pu, pv, pw, px, py, pz, pA, pB, pC, pD, pE'arr2 = ["", 'false', 'null', 0, "理工", "综合", 'true', "师范", "双一流", "211", "江苏", "985", "农业", "山东", "河南", "河北", "北京", "辽宁", "陕西", "四川", "广东", "湖北", "湖南", "浙江", "安徽", "江西", 1, "黑龙江", "吉林", "上海", 2, "福建", "山西", "云南", "广西", "贵州", "甘肃", "内蒙古", "重庆", "天津", "新疆", "467", "496", "2025,2024,2023,2022,2021,2020", "林业", "5.8", "533", "2023-01-05T00:00:00+08:00", "23.1", "7.3", "海南", "37.9", "28.0", "4.3", "12.1", 
"16.8", "11.7", "3.7", "4.6", "297", "397", "21.8", "32.2", "16.6", "37.6", "24.6", "13.6", "13.9", "3.3", "5.2", "8.1", "3.9", "5.1", "5.6", "5.4", "2.6", "162", 93.5, 89.4, "宁夏", "青海", "西藏", 7, "11.3", "35.2", "9.5", "35.0", "32.7", "23.7", "33.2", "9.2", "30.6", "8.5", "22.7", "26.3", "8.0", "10.9", "26.0", "3.2", "6.8", "5.7", "13.8", "6.5", "5.5", "5.0", "13.2", "13.3", "15.6", "18.3", "3.0", "21.3", "12.0", "22.8", "3.6", "3.4", "3.5", "95", "109", "117", "129", "138", "147", "159", "185", "191", "193", "196", "213", "232", "237", "240", "267", "275", "301", "309", "314", "318", "332", "334", "339", "341", "354", "365", "371", "378", "384", "388", "403", "416", "418", "420", "423", "430", "438", "444", "449", "452", "457", "461", "465", "474", "477", "485", "487", "491", "501", "508", "513", "518", "522", "528", 83.4, "538", "555", 2021, 11, 14, 10, "12.8", "42.9", "18.8", "36.6", "4.8", "40.0", "37.7", "11.9", "45.2", "31.8", "10.4", "40.3", "11.2", "30.9", "37.8", "16.1", "19.7", "11.1", "23.8", "29.1", "0.2", "24.0", "27.3", "24.9", "39.5", "20.5", "23.4", "9.0", "4.1", "25.6", "12.9", "6.4", "18.0", "24.2", "7.4", "29.7", "26.5", "22.6", "29.9", "28.6", "10.1", "16.2", "19.4", "19.5", "18.6", "27.4", "17.1", "16.0", "27.6", "7.9", "28.7", "19.3", "29.5", "38.2", "8.9", "3.8", "15.7", "13.5", "1.7", "16.9", "33.4", "132.7", "15.2", "8.7", "20.3", "5.3", "0.3", "4.0", "17.4", "2.7", "160", "161", "164", "165", "166", "167", "168", 130.6, 105.5, 2025, "学生、家长、高校管理人员、高教研究人员等", "中国大学排名(主榜)", 25, 13, 12, "全部", "1", "88.0", 5, "2", "36.1", "25.9", "3", "34.3", "4", "35.5", "21.6", "39.2", "5", "10.8", "4.9", "30.4", "6", "46.2", "7", "0.8", "42.1", "8", "32.1", "22.9", "31.3", "9", "43.0", "25.7", "10", "34.5", "10.0", "26.2", "46.5", "11", "47.0", "33.5", "35.8", "25.8", "12", "46.7", "13.7", "31.4", "33.3", "13", "34.8", "42.3", "13.4", "29.4", "14", "30.7", "15", "42.6", "26.7", "16", "12.5", "17", "12.4", "44.5", "44.8", "18", "10.3", "15.8", "19", "32.3", 
"19.2", "20", "21", "28.8", "9.6", "22", "45.0", "23", "30.8", "16.7", "16.3", "24", "25", "32.4", "26", "9.4", "27", "33.7", "18.5", "21.9", "28", "30.2", "31.0", "16.4", "29", "34.4", "41.2", "2.9", "30", "38.4", "6.6", "31", "4.4", "17.0", "32", "26.4", "33", "6.1", "34", "38.8", "17.7", "35", "36", "38.1", "11.5", "14.9", "37", "14.3", "18.9", "38", "13.0", "39", "27.8", "33.8", "3.1", "40", "41", "28.9", "42", "28.5", "38.0", "34.0", "1.5", "43", "15.1", "44", "31.2", "120.0", "14.4", "45", "149.8", "7.5", "46", "47", "38.6", "48", "49", "25.2", "50", "19.8", "51", "5.9", "6.7", "52", "4.2", "53", "1.6", "54", "55", "20.0", "56", "39.8", "18.1", "57", "35.6", "58", "10.5", "14.1", "59", "8.2", "60", "140.8", "12.6", "61", "62", "17.6", "63", "64", "1.1", "65", "20.9", "66", "67", "68", "2.1", "69", "123.9", "27.1", "70", "25.5", "37.4", "71", "72", "73", "74", "75", "76", "27.9", "7.0", "77", "78", "79", "80", "81", "82", "83", "84", "1.4", "85", "86", "87", "88", "89", "90", "91", "92", "93", "109.0", "94", 235.7, "97", "98", "99", "100", "101", "102", "103", "104", "105", "106", "107", "108", 223.8, "111", "112", "113", "114", "115", "116", 215.5, "119", "120", "121", "122", "123", "124", "125", "126", "127", "128", 206.7, "131", "132", "133", "134", "135", "136", "137", 201, "140", "141", "142", "143", "144", "145", "146", 194.6, "149", "150", "151", "152", "153", "154", "155", "156", "157", "158", 183.3, "169", "170", "171", "172", "173", "174", "175", "176", "177", "178", "179", "180", "181", "182", "183", "184", 169.6, "187", "188", "189", "190", 168.1, 167, "195", 165.5, "198", "199", "200", "201", "202", "203", "204", "205", "206", "207", "208", "209", "210", "212", 160.5, "215", "216", "217", "218", "219", "220", "221", "222", "223", "224", "225", "226", "227", "228", "229", "230", "231", 153.3, "234", "235", "236", 150.8, "239", 149.9, "242", "243", "244", "245", "246", "247", "248", "249", "250", "251", "252", "253", "254", "255", "256", "257", 
"258", "259", "260", "261", "262", "263", "264", "265", "266", 139.7, "269", "270", "271", "272", "273", "274", 137, "277", "278", "279", "280", "281", "282", "283", "284", "285", "286", "287", "288", "289", "290", "291", "292", "293", "294", "295", "296", "300", 130.2, "303", "304", "305", "306", "307", "308", 128.4, "311", "312", "313", 125.9, "316", "317", 124.9, "320", "321", "Wuyi University", "322", "323", "324", "325", "326", "327", "328", "329", "330", "331", 120.9, 120.8, "Taizhou University", "336", "337", "338", 119.9, 119.7, "343", "344", "345", "346", "347", "348", "349", "350", "351", "352", "353", 115.4, "356", "357", "358", "359", "360", "361", "362", "363", "364", 112.6, "367", "368", "369", "370", 111, "373", "374", "375", "376", "377", 109.4, "380", "381", "382", "383", 107.6, "386", "387", 107.1, "390", "391", "392", "393", "394", "395", "396", "400", "401", "402", 104.7, "405", "406", "407", "408", "409", "410", "411", "412", "413", "414", "415", 101.2, 101.1, 100.9, "422", 100.3, "425", "426", "427", "428", "429", 99, "432", "433", "434", "435", "436", "437", 97.6, "440", "441", "442", "443", 96.5, "446", "447", "448", 95.8, "451", 95.2, "454", "455", "456", 94.8, "459", "460", 94.3, "463", "464", 93.6, "472", "473", 92.3, "476", 91.7, "479", "480", "481", "482", "483", "484", 90.7, 90.6, "489", "490", 90.2, "493", "494", "495", 89.3, "503", "504", "505", "506", "507", 87.4, "510", "511", "512", 86.8, "515", "516", "517", 86.2, "520", "521", 85.8, "524", "525", "526", "527", 84.6, "530", "531", "532", "537", 82.8, "540", "541", "542", "543", "544", "545", "546", "547", "548", "549", "550", "551", "552", "553", "554", 78.1, "557", "558", "559", "560", "561", "562", "563", "564", "565", "566", "567", "568", "569", "570", "571", "572", "573", "574", "575", "576", "577", "578", "579", "580", "581", "582", 4, "2025-04-15T00:00:00+08:00", "logo\u002Fannual\u002Fbcur\u002F2025.png", 
"软科中国大学排名于2015年首次发布,多年来以专业、客观、透明的优势赢得了高等教育领域内外的广泛关注和认可,已经成为具有重要社会影响力和权威参考价值的中国大学排名领先品牌。软科中国大学排名以服务中国高等教育发展和进步为导向,采用数百项指标变量对中国大学进行全方位、分类别、监测式评价,向学生、家长和全社会提供及时、可靠、丰富的中国高校可比信息。", 2024, 2023, 2022, 15, 2020, 2019, 2018, 2017, 2016, 2015]arr2_str = [str(i) for i in arr2]dict_map = dict(zip(arr1.split(', '), arr2_str))print(f"爬取中国大学排名(主榜)共{len(names)}所学校")print(f"{'排名':<6}{'学校':^20}{'省市':>8}{'类型':>8}{'总分':>8}")for i in range(len(names)):#使用get方法,如果找不到对应的映射就使用原值rank = dict_map.get(ranks[i], ranks[i])name = names[i]province = dict_map.get(provinces[i], provinces[i])category = dict_map.get(categorys[i], categorys[i])score = dict_map.get(scores[i], scores[i])#把数据存入数据库db.insertDB(rank, name, province, category, score)if int(rank) <=10 or int(rank) >= 570:print(f"{rank:<6}{name:^20}{province:>8}{category:>8}{score:>8}")elif int(rank) == 11:print("... ...")db.closeDB()print("所有数据已保存至数据库niversities.db,任务完成!")

Partway through, probably because repeated debugging kept re-requesting the page, the site started blocking me; after adding the Referer header, access worked again. Next time I should try saving the file locally first.

University anti-crawl screenshot

Run result

University console screenshot

F12 Debugging Analysis GIF

University analysis GIF 4

2) Reflections

The raw data file is not standard JSON, so parsing it directly with the json library fails. My first few crawling attempts left strange tokens such as 'aB' and 'jJ' in the database tables; only after re-examining the file's structure did I realize that each token stands for a value and that a mapping dictionary is needed. The takeaway: don't rush. Study the raw data file and understand its format and structure before writing the crawler.
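The token mapping described here can be reproduced at toy scale. In the sketch below, the payload fragment and the two short arrays are invented stand-ins for the real payload.js function parameters and call arguments:

```python
# payload.js encodes values positionally: short tokens in the source refer to
# the arguments the Nuxt payload function is called with, so mapping
# token -> value recovers readable fields. The fragment and arrays here are
# toy examples, not the real shanghairanking payload.
import re

fragment = 'univNameCn:"清华大学",province:b,univCategory:a,score:c'
arg_names = ['a', 'b', 'c']              # function parameter list
arg_values = ['综合', '北京', 852.5]      # actual call arguments
dict_map = dict(zip(arg_names, [str(v) for v in arg_values]))

name = re.findall(r'univNameCn:"([^"]+)"', fragment)[0]
province = dict_map.get(re.findall(r'province:([^,]+)', fragment)[0], '?')
category = dict_map.get(re.findall(r'univCategory:([a-zA-Z])', fragment)[0], '?')
score = dict_map.get(re.findall(r'score:([^,]+)', fragment)[0], '?')
print(name, province, category, score)  # 清华大学 北京 综合 852.5
```

Falling back to the raw token via .get(token, token) is what keeps unmapped literal values (real numbers and quoted strings) intact.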

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/3.py

