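In this post I will show you how to use a high-performance crawler to fetch and parse page content quickly. If we need to request many pages and send the requests strictly one after another, most of the run time is spent waiting for responses, which is a poor trade. A high-performance crawler avoids this by fetching and parsing pages concurrently with a pool of workers (the code in this post uses the pool from multiprocessing.dummy), so we get the results we want in much less time. The post works through one example, crawling Douban Movie, to show how it is done.

1. Page analysis

First, let's look at the HTML of the Douban Movie page. In this example we want to scrape two things from the Douban Movie Top 250 list: each movie's title (title) and its rating (star).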
Inspecting the page shows that all of the movie entries sit inside the <ol class="grid_view"> tag, so our XPath expressions can start from that element and pick up every entry in the list. The movie title shows up in two places: on the poster image (its alt attribute) and in span tags with class="title". Each movie has several of those span titles, so to avoid mixing them up we take the title text attached to the image and locate it with XPath: //ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt

The rating is even simpler. Every score sits in a span inside a <div class="star"> tag, so the XPath expression is: //ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()

This returns the ratings for a whole page. Of course, so far we only have the data from the first page. How do we get the second page?

2. Pagination

Pagination on Douban Movie is fairly simple. Looking at the URLs of pages one, two and three, we find:

https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

The parameters are passed in the query string after the question mark, so each page can be fetched with an ordinary GET request. There is no POST body or encrypted parameter to deal with, which makes this easy. We can also see that the start parameter keeps changing while filter stays the same, so once we know the pattern behind start we know how to write the crawler.

start grows by 25 with every page, because each page lists exactly 25 movies. With that pattern in hand, we can start writing the crawler.

3. Writing the crawler

In the code we import the Pool class from multiprocessing.dummy. Turning the serial crawler into a concurrent one only takes two extra lines on top of the original code:

    pool = Pool(4)
    pool.map(get_information, number_ls)
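To see what the Pool and map calls do in isolation, here is a minimal sketch that needs no network access. The function fake_fetch is a hypothetical stand-in for the crawler function (it only sleeps for a moment to imitate waiting on a server), and run_serial and run_pooled are illustrative names that do not appear in the original code.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool, same API as multiprocessing.Pool


def fake_fetch(start):
    """Hypothetical stand-in for the crawler function: sleeps instead of downloading."""
    time.sleep(0.1)  # imitates the time spent waiting for a response
    return f"page starting at {start}"


def run_serial(offsets):
    """Call fake_fetch once per offset, one after another."""
    return [fake_fetch(s) for s in offsets]


def run_pooled(offsets, workers=4):
    """Call fake_fetch once per offset using a pool of worker threads."""
    with Pool(workers) as pool:
        # map blocks until every call has finished and returns
        # the results in the same order as the input list
        return pool.map(fake_fetch, offsets)


if __name__ == "__main__":
    offsets = list(range(0, 100, 25))  # 0, 25, 50, 75

    t0 = time.perf_counter()
    serial = run_serial(offsets)
    t1 = time.perf_counter()
    pooled = run_pooled(offsets)
    t2 = time.perf_counter()

    print(f"serial: {t1 - t0:.2f}s  pooled: {t2 - t1:.2f}s")
    print(serial == pooled)
```

With four offsets and four workers the sleeps overlap, so the pooled run should take noticeably less time than the serial one, while both return the same list in the same order.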
In pool = Pool(4), the argument 4 means the pages are fetched by a pool of 4 workers. Because Pool comes from multiprocessing.dummy, these workers are threads rather than separate processes. Crawling is I/O-bound (each worker spends most of its time waiting for the server), so the pool size is not tied to the number of CPU cores. A larger pool usually finishes sooner, up to the point where the network or the site's tolerance for rapid requests becomes the limit. The one-worker-per-core rule of thumb applies to CPU-bound work in a real process pool (multiprocessing.Pool).

The second line calls map. Its first argument is the crawler function and its second is the list of arguments; map calls the function once for every item in the list. Hand these two things to map and the high-performance crawl begins.

The complete code is as follows:

    import requests
    from lxml import etree
    from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    }
    url = 'https://movie.douban.com/top250'

    # Offsets of the 10 pages: 0, 25, ..., 225
    number_ls = list(range(0, 250, 25))
    print(number_ls)

    def get_information(start):
        """Fetch one page of the Top 250 list and print each title and rating."""
        param = {
            'start': start,
            'filter': '',
        }
        page_content = requests.get(url=url, headers=headers, params=param, timeout=10).text
        # One file per page, so concurrent workers do not overwrite each other
        with open(f'douban_{start}.html', 'w', encoding='utf-8') as fp:
            fp.write(page_content)
        tree = etree.HTML(page_content)
        video_titles = tree.xpath('//ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt')
        stars = tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()')
        for title, star in zip(video_titles, stars):
            print("the movie is ", title)
            print("the star is ", star)
            print()

    if __name__ == '__main__':
        pool = Pool(4)  # 4 worker threads
        pool.map(get_information, number_ls)
        pool.close()
        pool.join()

4. Output

The output looks just right: 250 movies in total. Because the pages are fetched concurrently, they are printed in the order their requests finish rather than in ranking order. The output is shown below, with each movie's title and rating on one line:

    [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]
    the movie is 搏擊俱樂部 the star is 9.0
    the movie is 教父2 the star is 9.2
    the movie is 獅子王 the star is 9.0
    the movie is 指環王2:雙塔奇兵 the star is 9.1
    the movie is 死亡詩社 the star is 9.1
    the movie is 鋼琴家 the star is 9.2
    the movie is 黑客帝國 the star is 9.0
    the movie is 指環王1:魔戒再現 the star is 9.0
    the movie is 飲食男女 the star is 9.1
    the movie is 竊聽風暴 the star is 9.1
    the movie is 美麗心靈 the star is 9.0
    the movie is 讓子彈飛 the star is 8.8
    the movie is 綠皮書 the star is 8.9
    the movie is 兩桿大煙槍 the star is 9.1
    the movie is 本杰明·巴頓奇事 the star is 8.9
    the movie is 海蒂和爺爺 the star is 9.2
    the movie is 飛越瘋人院 the star is 9.1
    the movie is 看不見的客人 the star is 8.8
    the movie is 西西里的美麗傳說 the star is 8.9
    the movie is 拯救大兵瑞恩 the star is 9.0
    the movie is 穿條紋睡衣的男孩 the star is 9.1
    the movie is 小鞋子 the star is 9.2
    the movie is 音樂之聲 the star is 9.0
    the movie is 情書 the star is 8.9
    the movie is 海豚灣 the star is 9.3
    the movie is 美國往事 the star is 9.2
    the movie is 致命魔術 the star is 8.9
    the movie is 沉默的羔羊 the star is 8.9
    the movie is 低俗小說 the star is 8.9
    the movie is 禁閉島 the star is 8.8
    the movie is 蝴蝶效應 the star is 8.8
    the movie is 七宗罪 the star is 8.8
    the movie is 心靈捕手 the star is 8.9
    the movie is 布達佩斯大飯店 the star is 8.9
    the movie is 春光乍泄 the star is 8.9
    the movie is 摩登時代 the star is 9.3
    the movie is 被嫌棄的松子的一生 the star is 8.9
    the movie is 哈利·波特與死亡圣器(下) the star is 8.9
    the movie is 阿凡達 the star is 8.7
    the movie is 喜劇之王 the star is 8.8
    the movie is 致命ID the star is 8.8
    the movie is 剪刀手愛德華 the star is 8.7
    the movie is 勇敢的心 the star is 8.9
    the movie is 加勒比海盜 the star is 8.8
    the movie is 殺人回憶 the star is 8.9
    the movie is 狩獵 the star is 9.1
    the movie is 請以你的名字呼喚我 the star is 8.9
    the movie is 天使愛美麗 the star is 8.7
    the movie is 斷背山 the star is 8.8
    the movie is 紅辣椒 the star is 9.0
    the movie is 觸不可及 the star is 9.2
    the movie is 蝙蝠俠:黑暗騎士 the star is 9.2
    the movie is 末代皇帝 the star is 9.3
    the movie is 活著 the star is 9.3
    the movie is 尋夢環游記 the star is 9.1
    the movie is 亂世佳人 the star is 9.3
    the movie is 何以為家 the star is 9.1
    the movie is 指環王3:王者無敵 the star is 9.2
    the movie is 飛屋環游記 the star is 9.0
    the movie is 摔跤吧!爸爸 the star is 9.0
    the movie is 哈利·波特與魔法石 the star is 9.1
    the movie is 素媛 the star is 9.3
    the movie is 少年派的奇幻漂流 the star is 9.1
    the movie is 十二怒漢 the star is 9.4
    the movie is 哈爾的移動城堡 the star is 9.1
    the movie is 鬼子來了 the star is 9.3
    the movie is 天空之城 the star is 9.1
    the movie is 大話西游之月光寶盒 the star is 9.0
    the movie is 我不是藥神 the star is 9.0
    the movie is 聞香識女人 the star is 9.1
    the movie is 羅馬假日 the star is 9.0
    the movie is 天堂電影院 the star is 9.2
    the movie is 辯護人 the star is 9.2
    the movie is 貓鼠游戲 the star is 9.0
    the movie is 大鬧天宮 the star is 9.4
    the movie is 肖申克的救贖 the star is 9.7
    the movie is 霸王別姬 the star is 9.6
    the movie is 阿甘正傳 the star is 9.5
    the movie is 這個殺手不太冷 the star is 9.4
    the movie is 泰坦尼克號 the star is 9.4
    the movie is 美麗人生 the star is 9.5
    the movie is 千與千尋 the star is 9.4
    the movie is 辛德勒的名單 the star is 9.5
    the movie is 盜夢空間 the star is 9.3
    the movie is 忠犬八公的故事 the star is 9.4
    the movie is 星際穿越 the star is 9.3
    the movie is 海上鋼琴師 the star is 9.3
    the movie is 楚門的世界 the star is 9.3
    the movie is 三傻大鬧寶萊塢 the star is 9.2
    the movie is 機器人總動員 the star is 9.3
    the movie is 放牛班的春天 the star is 9.3
    the movie is 大話西游之大圣娶親 the star is 9.2
    the movie is 瘋狂動物城 the star is 9.2
    the movie is 無間道 the star is 9.2
    the movie is 熔爐 the star is 9.3
    the movie is 教父 the star is 9.3
    the movie is 當幸福來敲門 the star is 9.1
    the movie is 龍貓 the star is 9.2
    the movie is 怦然心動 the star is 9.1
    the movie is 控方證人 the star is 9.6
    the movie is 7號房的禮物 the star is 8.9
    the movie is 幽靈公主 the star is 8.9
    the movie is 小森林 夏秋篇 the star is 9.0
    the movie is 陽光燦爛的日子 the star is 8.8
    the movie is 第六感 the star is 8.9
    the movie is 重慶森林 the star is 8.8
    the movie is 入殮師 the star is 8.9
    the movie is 唐伯虎點秋香 the star is 8.7
    the movie is 小森林 冬春篇 the star is 9.0
    the movie is 愛在黎明破曉前 the star is 8.8
    the movie is 超脫 the star is 8.9
    the movie is 消失的愛人 the star is 8.7
    the movie is 一一 the star is 9.0
    the movie is 菊次郎的夏天 the star is 8.8
    the movie is 蝙蝠俠:黑暗騎士崛起 the star is 8.8
    the movie is 側耳傾聽 the star is 8.9
    the movie is 倩女幽魂 the star is 8.7
    the movie is 功夫 the star is 8.6
    the movie is 超能陸戰隊 the star is 8.7
    the movie is 無人知曉 the star is 9.1
    the movie is 人生果實 the star is 9.5
    the movie is 螢火之森 the star is 8.9
    the movie is 甜蜜蜜 the star is 8.8
    the movie is 借東西的小人阿莉埃蒂 the star is 8.8
    the movie is 瑪麗和馬克思 the star is 8.9
    the movie is 愛在日落黃昏時 the star is 8.8
    the movie is 馴龍高手 the star is 8.7
    the movie is 完美的世界 the star is 9.1
    the movie is 幸福終點站 the star is 8.8
    the movie is 告白 the star is 8.7
    the movie is 大魚 the star is 8.8
    the movie is 陽光姐妹淘 the star is 8.8
    the movie is 射雕英雄傳之東成西就 the star is 8.7
    the movie is 哈利·波特與阿茲卡班的囚徒 the star is 8.8
    the movie is 恐怖直播 the star is 8.8
    the movie is 天書奇譚 the star is 9.2
    the movie is 怪獸電力公司 the star is 8.7
    the movie is 神偷奶爸 the star is 8.6
    the movie is 玩具總動員3 the star is 8.8
    the movie is 傲慢與偏見 the star is 8.6
    the movie is 時空戀旅人 the star is 8.8
    the movie is 哈利·波特與密室 the star is 8.7
    the movie is 教父3 the star is 8.9
    the movie is 釜山行 the star is 8.6
    the movie is 血戰鋼鋸嶺 the star is 8.7
    the movie is 哪吒鬧海 the star is 9.1
    the movie is 被解救的姜戈 the star is 8.7
    the movie is 七武士 the star is 9.3
    the movie is 喜宴 the star is 8.9
    the movie is 電鋸驚魂 the star is 8.7
    the movie is 爆裂鼓手 the star is 8.7
    the movie is 貧民窟的百萬富翁 the star is 8.6
    the movie is 螢火蟲之墓 the star is 8.7
    the movie is 東邪西毒 the star is 8.6
    the movie is 海街日記 the star is 8.8
    the movie is 黑天鵝 the star is 8.6
    the movie is 驚魂記 the star is 9.0
    the movie is 無敵破壞王 the star is 8.7
    the movie is 你看起來好像很好吃 the star is 8.9
    the movie is 冰川時代 the star is 8.6
    the movie is 雨人 the star is 8.7
    the movie is 小偷家族 the star is 8.7
    the movie is 綠里奇跡 the star is 8.9
    the movie is 戀戀筆記本 the star is 8.5
    the movie is 愛在午夜降臨前 the star is 8.8
    the movie is 瘋狂的石頭 the star is 8.5
    the movie is 哈利·波特與火焰杯 the star is 8.6
    the movie is 寄生蟲 the star is 8.7
    the movie is 恐怖游輪 the star is 8.5
    the movie is 奇跡男孩 the star is 8.6
    the movie is 雨中曲 the star is 9.0
    the movie is 魔女宅急便 the star is 8.7
    the movie is 二十二 the star is 8.7
    the movie is 海邊的曼徹斯特 the star is 8.6
    the movie is 房間 the star is 8.8
    the movie is 風之谷 the star is 8.9
    the movie is 一個叫歐維的男人決定去死 the star is 8.9
    the movie is 我是山姆 the star is 8.9
    the movie is 頭號玩家 the star is 8.7
    the movie is 英雄本色 the star is 8.7
    the movie is 上帝之城 the star is 9.0
    the movie is 諜影重重3 the star is 8.8
    the movie is 瘋狂原始人 the star is 8.7
    the movie is 未麻的部屋 the star is 9.0
    the movie is 歲月神偷 the star is 8.7
    the movie is 盧旺達飯店 the star is 8.9
    the movie is 縱橫四海 the star is 8.8
    the movie is 三塊廣告牌 the star is 8.7
    the movie is 達拉斯買家俱樂部 the star is 8.8
    the movie is 花樣年華 the star is 8.7
    the movie is 心迷宮 the star is 8.7
    the movie is 記憶碎片 the star is 8.6
    the movie is 模仿游戲 the star is 8.7
    the movie is 黑客帝國3:矩陣革命 the star is 8.8
    the movie is 新世界 the star is 8.8
    the movie is 頭腦特工隊 the star is 8.7
    the movie is 荒蠻故事 the star is 8.8
    the movie is 你的名字。 the star is 8.4
    the movie is 真愛至上 the star is 8.6
    the movie is 忠犬八公物語 the star is 9.2
    the movie is 諜影重重2 the star is 8.7
    the movie is 阿飛正傳 the star is 8.5
    the movie is 地球上的星星 the star is 8.9
    the movie is 彗星來的那一夜 the star is 8.5
    the movie is 完美陌生人 the star is 8.5
    the movie is 戰爭之王 the star is 8.7
    the movie is 諜影重重 the star is 8.6
    the movie is 香水 the star is 8.5
    the movie is 東京教父 the star is 9.0
    the movie is 東京物語 the star is 9.2
    the movie is 朗讀者 the star is 8.6
    the movie is 千鈞一發 the star is 8.8
    the movie is 再次出發之紐約遇見你 the star is 8.6
    the movie is 驢得水 the star is 8.3
    the movie is 猜火車 the star is 8.5
    the movie is 黑客帝國2:重裝上陣 the star is 8.6
    the movie is 無間道2 the star is 8.6
    the movie is 我愛你 the star is 9.1
    the movie is 浪潮 the star is 8.7
    the movie is 崖上的波妞 the star is 8.5
    the movie is 聚焦 the star is 8.8
    the movie is 小蘿莉的猴神大叔 the star is 8.4
    the movie is 追隨 the star is 8.9
    the movie is 黑鷹墜落 the star is 8.7
    the movie is 網絡謎蹤 the star is 8.6
    the movie is 虎口脫險 the star is 8.9
    the movie is 人工智能 the star is 8.7
    the movie is 九品芝麻官 the star is 8.6
    the movie is 2001太空漫游 the star is 8.8
    the movie is 可可西里 the star is 8.8
    the movie is 羅生門 the star is 8.8
    the movie is 色,戒 the star is 8.5
    the movie is 終結者2:審判日 the star is 8.7
    the movie is 城市之光 the star is 9.3
    the movie is 初戀這件小事 the star is 8.4
    the movie is 魂斷藍橋 the star is 8.8
    the movie is 牯嶺街少年殺人事件 the star is 8.9
    the movie is 遺愿清單 the star is 8.7
    the movie is 大佛普拉斯 the star is 8.7
    the movie is 新龍門客棧 the star is 8.6
    the movie is 波西米亞狂想曲 the star is 8.7
    the movie is 源代碼 the star is 8.5
    the movie is 青蛇 the star is 8.6
    the movie is 海洋 the star is 9.1
    the movie is 燃情歲月 the star is 8.8
    the movie is 無恥混蛋 the star is 8.6
    the movie is 瘋狂的麥克斯4:狂暴之路 the star is 8.6
    the movie is 血鉆 the star is 8.7
    the movie is 穿越時空的少女 the star is 8.6
    the movie is 步履不停 the star is 8.8
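The pagination rule from section 2 and the selectors from section 1 can also be tried offline, without sending any request to Douban. The sketch below is only an illustration under stated assumptions: SAMPLE is a made-up, heavily simplified fragment that mimics the structure described above, page_urls and parse_movies are hypothetical helper names, and the standard library's xml.etree.ElementTree is swapped in for lxml so that nothing needs to be installed. ElementTree supports only a subset of XPath (no @alt or text() steps), so the attribute and text are read in Python after the elements are selected.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"

# Hypothetical fragment for illustration only; the real page has far more markup.
SAMPLE = """
<body>
  <ol class="grid_view">
    <li>
      <div class="pic"><a href="#"><img alt="Movie A" src="a.jpg"/></a></div>
      <div class="star"><span class="rating_num">9.7</span></div>
    </li>
    <li>
      <div class="pic"><a href="#"><img alt="Movie B" src="b.jpg"/></a></div>
      <div class="star"><span class="rating_num">9.6</span></div>
    </li>
  </ol>
</body>
"""


def page_urls(total=250, per_page=25):
    """Build one URL per page by stepping the start parameter by per_page."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"
        for start in range(0, total, per_page)
    ]


def parse_movies(markup):
    """Return a list of (title, rating) pairs found inside the grid_view list."""
    root = ET.fromstring(markup)
    # Select the elements with ElementTree's XPath subset, then read the
    # alt attribute and the text content in Python.
    images = root.findall(".//ol[@class='grid_view']//div[@class='pic']//a/img")
    spans = root.findall(".//ol[@class='grid_view']//div[@class='star']/span[@class='rating_num']")
    titles = [img.get("alt") for img in images]
    stars = [span.text for span in spans]
    return list(zip(titles, stars))


if __name__ == "__main__":
    for page_url in page_urls()[:3]:
        print(page_url)
    for title, star in parse_movies(SAMPLE):
        print("the movie is ", title)
        print("the star is ", star)
```

The first three URLs printed by page_urls match the three addresses listed in section 2, which is a quick way to confirm the start rule before pointing the crawler at the live site.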