Scrapy is an advanced topic in web crawling: it crawls target content concurrently, simplifies the code logic, and improves development efficiency, which is why it is so popular with crawler developers. This article uses a stock-quote website as an example to outline how to build a crawler with Scrapy. It is intended for learning and sharing only; corrections are welcome.

What is Scrapy?
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses Twisted, an efficient asynchronous networking framework, to handle network communication.

Scrapy architecture:
The components of the Scrapy architecture are described below:
- ScrapyEngine: the engine. It controls the flow of data between all components of the system and triggers events when the corresponding actions occur. This component is the "brain" of the crawler and the scheduling center of the whole framework.
- Scheduler: the scheduler. It receives requests from the engine and enqueues them. The initial URLs, and the follow-up URLs extracted from pages later, are placed in the scheduler and wait to be crawled; the scheduler automatically removes duplicate URLs.
- Downloader: the downloader. It fetches page data and hands it to the engine, which then passes it on to the spider.
- Spider: the crawler written by the user to parse responses, extract items, and find additional URLs to follow. The follow-up URLs are submitted to the ScrapyEngine and added to the Scheduler. Each spider is responsible for one specific website (or a few).
- ItemPipeline: processes the items extracted by the spider. Once the data parsed from a page has been stored in an Item, the item is sent through the pipelines in the configured order.
- DownloaderMiddlewares: downloader middleware, specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism to extend Scrapy by plugging in custom code, for example to rotate the user-agent or IP automatically (see the sketch after this list).
- SpiderMiddlewares: spider middleware, specific hooks between the engine and the spider that process the spider's input (responses) and output (items and requests). They provide the same simple mechanism to extend Scrapy with custom code.
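As mentioned in the DownloaderMiddlewares item above, downloader middleware is the usual place to rotate the user-agent. The following is only a minimal sketch under my own assumptions: the class name RandomUserAgentMiddleware and the USER_AGENTS list are illustrative and are not part of the example project built later in this article.

import random

# Illustrative list of user-agent strings to rotate through (assumed values).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
]

class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets the request continue through the middleware chain

To take effect, such a class would have to be registered under DOWNLOADER_MIDDLEWARES in settings.py.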
Scrapy data flow:
1. The ScrapyEngine opens a website, finds the Spider that handles that site, and asks the Spider for the first URL(s) to crawl.
2. The ScrapyEngine takes the first URL(s) to crawl and adds them to the Scheduler as requests, ready to be scheduled.
3. The ScrapyEngine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to the ScrapyEngine, which forwards it to the Downloader through the DownloaderMiddlewares.
5. Once the page has been downloaded, the Downloader builds a Response for the page and sends it back to the ScrapyEngine through the DownloaderMiddlewares.
6. The ScrapyEngine receives the Response from the Downloader and sends it to the Spider for processing through the SpiderMiddlewares.
7. The Spider processes the Response and returns the extracted Items and any new Requests to the ScrapyEngine.
8. The ScrapyEngine passes the Items returned by the Spider to the ItemPipeline and the Requests to the Scheduler, and the process repeats from step 2 until there are no pending Requests left in the Scheduler, at which point the ScrapyEngine shuts down.
Installing Scrapy
In a command-line window, install Scrapy with pip install scrapy, as shown below. When the following prompt appears, the installation succeeded.

Creating a Scrapy project
In a command-line window, switch to the directory where the project should live and create the crawler project with scrapy startproject stockstar, as shown below. Then, following the prompt, create a spider from the provided template [command format: scrapy genspider <spider-name> <domain>], as shown below.

Note: the spider name must not be the same as the project name, otherwise an error is raised, as shown below.

Open the newly created Scrapy project with PyCharm, as shown below.

Crawl target
This example crawls the stock IDs and stock names from the quote center of a securities website, as shown below.

Developing the Scrapy crawler
After the project has been created from the command line, the basic Scrapy crawler skeleton is already in place; what remains is to fill in the business code.

Item definition
Define the fields to be crawled, as shown below:

class StockstarItem(scrapy.Item):
    """
    Define the names of the fields to be crawled
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name
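For illustration only, here is a minimal sketch of how such an Item is populated (the field values are made up, not real crawl output); note that fields are assigned dictionary-style with item['key'] = value, a point repeated later in this article:

from stockstar.items import StockstarItem

item = StockstarItem()
item['stock_type'] = '滬A'       # assign fields dictionary-style
item['stock_id'] = '600000'      # illustrative value
item['stock_name'] = '浦發(fā)銀行'  # illustrative value
print(dict(item))                # an Item can be converted to a plain dict for inspection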
Writing the spider logic
The structure of a Scrapy spider is fixed: define a class that inherits from scrapy.Spider, set its attributes [spider name, allowed domains, start URLs], and override the parent's parse method; the crawling logic for the specific pages to be crawled is then written inside parse, as shown below:

class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # start URL
    def parse(self, response):
        """
        Parse the index page and yield one item per stock.
        :param response:
        :return:
        """
        styles = ['滬A', '滬B', '深A(yù)', '深B']
        # enumerate() supplies the index used in the ul id (index_data_0 ... index_data_3)
        for index, style in enumerate(styles):
            print('******************** crawling ' + style + ' stocks ********************')
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            # print('ids = ' + str(ids))
            # print('names = ' + str(names))
            for i in range(len(ids)):
                item = StockstarItem()
                item['stock_type'] = style
                item['stock_id'] = str(ids[i])
                item['stock_name'] = str(names[i])
                yield item
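When writing XPath expressions like the ones in parse, scrapy shell is handy for trying them out interactively before running the whole spider. A minimal sketch is shown below; the shortened XPath matches the ul element by its id alone, which should be equivalent to the full path as long as the id is unique on the page:

scrapy shell http://quote.stockstar.com/stock/stock_index.htm
>>> response.xpath('//ul[@id="index_data_0"]/li/span/a/text()').getall()[:5]   # first few stock IDs
>>> response.xpath('//ul[@id="index_data_0"]/li/a/text()').getall()[:5]        # first few stock names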
Processing the data
The scraped data is processed in the pipeline; to keep this example simple, the items are just printed to the console, as shown below:

class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type>>>>' + item['stock_type'] + ' stock code>>>>' + item['stock_id'] + ' stock name>>>>' + item['stock_name'])
        return item
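As a small variation (not part of the original example), a pipeline can also persist the items instead of printing them. Below is a minimal sketch that writes each item as one row of a CSV file, using the standard open_spider/close_spider hooks; the file name stocks.csv is an assumption made for illustration:

import csv

class StockstarCsvPipeline:
    """Illustrative pipeline: write each crawled item as one row of a CSV file."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('stocks.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['stock_type', 'stock_id', 'stock_name'])

    def process_item(self, item, spider):
        self.writer.writerow([item['stock_type'], item['stock_id'], item['stock_name']])
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

Like StockstarPipeline, it would only run if registered under ITEM_PIPELINES in settings.py.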
Note: item fields can only be assigned with the item['key'] = value syntax; assigning them with item.key = value is not supported and will raise an error.

Scrapy configuration
Scrapy is configured through the settings.py file, which covers the request headers, the pipelines, the robots.txt policy, and so on, as shown below:

# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'stockstar'
SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'
# Obey robots.txt rules (whether to obey the robots.txt protocol)
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36'
    # ,
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'stockstar.pipelines.StockstarPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Running Scrapy
Because a Scrapy project consists of separate modules rather than a single script, the spider is run from the command line, in the format scrapy crawl <spider-name> (for this example, scrapy crawl stock), as shown below:

Remarks
This example is deliberately simple and only illustrates common Scrapy usage; everything crawled here is already present in the page source returned by the first request, i.e. what you see is what you get.

Two small questions are left open:
- What about content that requires paging, i.e. several requests, before the crawl is complete?
- What about content that is loaded asynchronously, where the page request only returns a skeleton and the data is filled in afterwards, i.e. the common Ajax case?
These two questions will be analyzed further when they come up later. To close, a poem by Tao Yuanming, 歸田園居 (Returning to the Fields and Gardens), shared with you all.

To learn web crawling, start by following "老碼識(shí)途"!!!