[Python] Implementing a Simple Web Crawler

安麓散人 2020-11-18


Introduction

A web crawler, also known as a web spider, is an Internet bot that automatically browses the World Wide Web, generally for the purpose of compiling a web index. -- Wikipedia

A web crawler can save the pages it visits so that a search engine can later build an index for users to search.
There are generally two steps: 1. fetch the page content 2. process the fetched page content

Preparation

Linux development environment

Python 3.6.1 installation guide: https://www.cnblogs.com/kimyeee/p/7250560.html

Install the required third-party libraries.
requests fetches the page content, and beautifulsoup4 parses and processes the fetched content.


pip3 install requests
pip3 install beautifulsoup4
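
A quick way to confirm that both installs worked is to import the libraries and print their versions. This is only a small sanity check; note that the pip package is called beautifulsoup4, while the module you import is bs4.

```python
# Sanity check after installation.
# The pip package is named beautifulsoup4, but it is imported as bs4.
import requests
import bs4

print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)
```

If either import fails with ModuleNotFoundError, the package was probably installed for a different Python interpreter than the one running the script.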

Step 1: Fetching

Use the get method of the requests library to request the page content at a given url.
Further reading: http://docs./en/master/
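
The object returned by get is a Response, and its raise_for_status() method raises an exception for 4xx and 5xx status codes only, so a redirect such as 301 passes silently. The sketch below illustrates this offline by building Response objects by hand; is_rejected is a hypothetical helper name, and in real code the Response would come from requests.get().

```python
import requests

def is_rejected(status_code):
    """Return True if raise_for_status() raises for this status code."""
    # Building a Response by hand is only for this offline illustration;
    # in real code the object comes from requests.get().
    r = requests.Response()
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

for code in (200, 301, 404, 500):
    print(code, is_rejected(code))
```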

Writing the code


[root@localhost demo]# touch demo.py
[root@localhost demo]# vim demo.py


# Web crawler study -- fetching
# Fetch a page
# Input: url
# Processing: fetch the page with the requests library and decode it as UTF-8 so it is human-readable
# Output: the fetched content
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()    # raise an exception for 4xx/5xx status codes
        r.encoding = 'utf-8'    # decode the response body as UTF-8
        return r.text
    except requests.RequestException:
        # the request failed or the server returned an error status
        return ' error '

url = 'http://www.baidu.com'
print(getHTMLText(url))

[root@localhost demo]# python3 demo.py
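
Returning a marker string on failure means the caller cannot easily tell an error apart from a real page. One possible variation, sketched below, returns None instead and lets requests guess the encoding through apparent_encoding rather than forcing utf-8. The name fetch_html and both design choices are assumptions for illustration, not part of the original script.

```python
import requests

def fetch_html(url, timeout=30):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        # Assumption: let requests guess the encoding from the body
        # instead of forcing utf-8.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as exc:
        print('request failed:', exc)
        return None

# A URL without a scheme is rejected before any network traffic is sent,
# so this call demonstrates the failure path without needing a connection.
print(fetch_html('not-a-url'))
```

The caller can then write `if html is None:` and skip the parsing step for pages that could not be fetched.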

Step 2: Parsing

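Use the BeautifulSoup class from the bs4 library to build an object from the page. Its find() and find_all() methods search the parsed HTML document and extract the specified information.
Further reading: https://www./software/BeautifulSoup/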
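
To see the difference between the two methods without touching the network, here is a minimal sketch on a made-up HTML fragment: find() returns the first match (or None), while find_all() returns a list of every match. The HTML content and variable names are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# A made-up HTML fragment for illustration.
html = """
<html><body>
  <h1>Search portal</h1>
  <a href="/news">Portal news</a>
  <a href="/maps">Maps</a>
  <p>About this portal</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')       # the first <a> tag, or None if there is none
all_links = soup.find_all('a')    # a list of every <a> tag
# Text nodes whose content matches the regular expression (case-sensitive).
matches = soup.find_all(string=re.compile('portal'))

print(first_link['href'])
print([a.get_text() for a in all_links])
print([str(s) for s in matches])
```

Because the pattern is case-sensitive, the link text "Portal news" is not among the matches; passing re.IGNORECASE to re.compile would include it.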

Writing the code


[root@localhost demo]# touch demo1.py
[root@localhost demo]# vim demo1.py


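# Web crawler study -- parsing
# Fetch a page and extract information from it
# Input: url
# Processing: fetch the page with the requests library, then extract the key information from the fetched content
# Output: print the extracted information
import re

import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()    # raise an exception for 4xx/5xx status codes
        r.encoding = 'utf-8'    # decode the response body as UTF-8
        return r.text
    except requests.RequestException:
        # the request failed or the server returned an error status
        return ' error '

def findHTMLText(text):
    soup = BeautifulSoup(text, 'html.parser')    # build a BeautifulSoup object
    # use a regular expression to match text fragments containing '百度' (Baidu)
    return soup.find_all(string=re.compile('百度'))

url = 'http://www.baidu.com'
text = getHTMLText(url)     # fetch the HTML text
res = findHTMLText(text)    # matching results
print(res)                  # print the results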

[root@localhost demo]# python3 demo1.py

An example: a Chinese university ranking crawler

Reference: https:///index/notebooks/python_programming_basic_v2


# e23.1CrawUnivRanking.py
import requests
from bs4 import BeautifulSoup

allUniv = []    # one list of cell texts per university

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ''

def fillUnivList(soup):
    data = soup.find_all('tr')
    for tr in data:
        ltd = tr.find_all('td')
        if len(ltd) == 0:    # skip rows without data cells, such as the header row
            continue
        singleUniv = []
        for td in ltd:
            # get_text() always returns a str; td.string can be None, which breaks format()
            singleUniv.append(td.get_text(strip=True))
        allUniv.append(singleUniv)

def printUnivList(num):
    print('{:^4}{:^10}{:^5}{:^8}{:^10}'.format('Rank', 'University', 'Province', 'Score', 'Training scale'))
    for u in allUniv[:num]:    # slicing avoids an IndexError when fewer than num rows were parsed
        print('{:^4}{:^10}{:^5}{:^8}{:^10}'.format(u[0], u[1], u[2], u[3], u[6]))

def main():
    url = 'http://www./zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    soup = BeautifulSoup(html, 'html.parser')
    fillUnivList(soup)
    printUnivList(10)

main()
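
The core of this example is turning each table row into a list of cell texts. The sketch below shows that step on a small made-up table so it can run offline; the HTML, the extract_rows name, and the column layout are assumptions for illustration and do not reflect the real ranking page.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration; the real page layout is not shown here.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Score</th></tr>
  <tr><td>1</td><td>University A</td><td>95.9</td></tr>
  <tr><td>2</td><td><a href="#">University B</a></td><td>82.6</td></tr>
</table>
"""

def extract_rows(soup):
    """Return one list of cell texts per data row (hypothetical helper)."""
    rows = []
    for tr in soup.find_all('tr'):
        cells = tr.find_all('td')
        if not cells:    # header rows use <th>, so they contain no <td>
            continue
        rows.append([td.get_text(strip=True) for td in cells])
    return rows

rows = extract_rows(BeautifulSoup(html, 'html.parser'))
for rank, name, score in rows:
    # '^' centers each value within the given column width
    print('{:^6}{:^16}{:^8}'.format(rank, name, score))
```

Returning the rows from the function, rather than appending to a module-level list, makes the parsing step easier to test on its own.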