Scrapy is an advanced topic in web crawling: it crawls target content concurrently, simplifies the code logic, and improves development efficiency, which is why it is so popular with crawler developers. This article uses a stock-quote website as an example to outline how to build a crawler with Scrapy. It is intended for learning and sharing only; corrections are welcome.

What is Scrapy?
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses Twisted, an efficient asynchronous networking framework, to handle network communication.

Scrapy architecture:
The components of the Scrapy architecture are described below:
- ScrapyEngine: the engine. It controls the flow of data between all components of the system and triggers events when the corresponding actions occur. This component is the "brain" of the crawler and the scheduling center of the whole framework.
- Scheduler: the scheduler. It receives requests from the engine and enqueues them. The initial URLs, and the follow-up URLs extracted from pages later, are placed in the scheduler and wait to be crawled; the scheduler automatically removes duplicate URLs.
- Downloader: the downloader. It fetches page data and hands it to the engine, which then passes it on to the spider.
- Spider: the crawler written by the user to parse responses, extract items, and find additional URLs to follow. The follow-up URLs are submitted to the ScrapyEngine and added to the Scheduler. Each spider is responsible for one specific website (or a few).
- ItemPipeline: processes the items extracted by the spider. Once the data parsed from a page has been stored in an Item, the item is sent through the pipelines in the configured order.
- DownloaderMiddlewares: downloader middleware, specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism to extend Scrapy by plugging in custom code, for example to rotate the user-agent or IP automatically (see the sketch after this list).
- SpiderMiddlewares: spider middleware, specific hooks between the engine and the spider that process the spider's input (responses) and output (items and requests). They provide the same simple mechanism to extend Scrapy with custom code.
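As mentioned in the DownloaderMiddlewares item above, downloader middleware is the usual place to rotate the user-agent. The following is only a minimal sketch under my own assumptions: the class name RandomUserAgentMiddleware and the USER_AGENTS list are illustrative and are not part of the example project built later in this article.

import random

# Illustrative list of user-agent strings to rotate through (assumed values).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
]

class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets the request continue through the middleware chain

To take effect, such a class would have to be registered under DOWNLOADER_MIDDLEWARES in settings.py.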
Scrapy data flow:
1. The ScrapyEngine opens a website, finds the Spider that handles that site, and asks the Spider for the first URL(s) to crawl.
2. The ScrapyEngine takes the first URL(s) to crawl and adds them to the Scheduler as requests, ready to be scheduled.
3. The ScrapyEngine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to the ScrapyEngine, which forwards it to the Downloader through the DownloaderMiddlewares.
5. Once the page has been downloaded, the Downloader builds a Response for the page and sends it back to the ScrapyEngine through the DownloaderMiddlewares.
6. The ScrapyEngine receives the Response from the Downloader and sends it to the Spider for processing through the SpiderMiddlewares.
7. The Spider processes the Response and returns the extracted Items and any new Requests to the ScrapyEngine.
8. The ScrapyEngine passes the Items returned by the Spider to the ItemPipeline and the Requests to the Scheduler, and the process repeats from step 2 until there are no pending Requests left in the Scheduler, at which point the ScrapyEngine shuts down.
Installing Scrapy
In a command-line window, install Scrapy with pip install scrapy, as shown below. When the following prompt appears, the installation succeeded.

Creating a Scrapy project
In a command-line window, switch to the directory where the project should live and create the crawler project with scrapy startproject stockstar, as shown below. Then, following the prompt, create a spider from the provided template [command format: scrapy genspider <spider-name> <domain>], as shown below.

Note: the spider name must not be the same as the project name, otherwise an error is raised, as shown below.

Open the newly created Scrapy project with PyCharm, as shown below.

Crawl target
This example crawls the stock IDs and stock names from the quote center of a securities website, as shown below.

Developing the Scrapy crawler
After the project has been created from the command line, the basic Scrapy crawler skeleton is already in place; what remains is to fill in the business code.

Item definition
Define the fields to be crawled, as shown below:

class StockstarItem(scrapy.Item):
    """
    Define the names of the fields to be crawled
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name
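For illustration only, here is a minimal sketch of how such an Item is populated (the field values are made up, not real crawl output); note that fields are assigned dictionary-style with item['key'] = value, a point repeated later in this article:

from stockstar.items import StockstarItem

item = StockstarItem()
item['stock_type'] = '滬A'       # assign fields dictionary-style
item['stock_id'] = '600000'      # illustrative value
item['stock_name'] = '浦發(fā)銀行'  # illustrative value
print(dict(item))                # an Item can be converted to a plain dict for inspection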
Writing the spider logic
The structure of a Scrapy spider is fixed: define a class that inherits from scrapy.Spider, set its attributes [spider name, allowed domains, start URLs], and override the parent's parse method; the crawling logic for the specific pages to be crawled is then written inside parse, as shown below:

class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # start URL
    def parse(self, response):
        """
        Parse the index page and yield one item per stock.
        :param response:
        :return:
        """
        styles = ['滬A', '滬B', '深A(yù)', '深B']
        # enumerate() supplies the index used in the ul id (index_data_0 ... index_data_3)
        for index, style in enumerate(styles):
            print('******************** crawling ' + style + ' stocks ********************')
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            # print('ids = ' + str(ids))
            # print('names = ' + str(names))
            for i in range(len(ids)):
                item = StockstarItem()
                item['stock_type'] = style
                item['stock_id'] = str(ids[i])
                item['stock_name'] = str(names[i])
                yield item
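When writing XPath expressions like the ones in parse, scrapy shell is handy for trying them out interactively before running the whole spider. A minimal sketch is shown below; the shortened XPath matches the ul element by its id alone, which should be equivalent to the full path as long as the id is unique on the page:

scrapy shell http://quote.stockstar.com/stock/stock_index.htm
>>> response.xpath('//ul[@id="index_data_0"]/li/span/a/text()').getall()[:5]   # first few stock IDs
>>> response.xpath('//ul[@id="index_data_0"]/li/a/text()').getall()[:5]        # first few stock names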
Processing the data
The scraped data is processed in the pipeline; to keep this example simple, the items are just printed to the console, as shown below:

class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type>>>>' + item['stock_type'] + ' stock code>>>>' + item['stock_id'] + ' stock name>>>>' + item['stock_name'])
        return item
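As a small variation (not part of the original example), a pipeline can also persist the items instead of printing them. Below is a minimal sketch that writes each item as one row of a CSV file, using the standard open_spider/close_spider hooks; the file name stocks.csv is an assumption made for illustration:

import csv

class StockstarCsvPipeline:
    """Illustrative pipeline: write each crawled item as one row of a CSV file."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('stocks.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['stock_type', 'stock_id', 'stock_name'])

    def process_item(self, item, spider):
        self.writer.writerow([item['stock_type'], item['stock_id'], item['stock_name']])
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

Like StockstarPipeline, it would only run if registered under ITEM_PIPELINES in settings.py.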
Note: item fields can only be assigned with the item['key'] = value syntax; assigning them with item.key = value is not supported and will raise an error.

Scrapy configuration
Scrapy is configured through the settings.py file, which covers the request headers, the pipelines, the robots.txt policy, and so on, as shown below:

# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'stockstar'
SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'
# Obey robots.txt rules (whether to obey the robots.txt protocol)
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36'
    # ,
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'stockstar.pipelines.StockstarPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Running Scrapy
Because a Scrapy project consists of separate modules rather than a single script, the spider is run from the command line, in the format scrapy crawl <spider-name> (for this example, scrapy crawl stock), as shown below:

Remarks
This example is deliberately simple and only illustrates common Scrapy usage; everything crawled here is already present in the page source returned by the first request, i.e. what you see is what you get.

Two small questions are left open:
- What about content that requires paging, i.e. several requests, before the crawl is complete?
- What about content that is loaded asynchronously, where the page request only returns a skeleton and the data is filled in afterwards, i.e. the common Ajax case?
These two questions will be analyzed further when they come up later. To close, a poem by Tao Yuanming, 歸田園居 (Returning to the Fields and Gardens), shared with you all.

To learn web crawling, start by following "老碼識(shí)途"!!!