

Learning web scraping? Still haven't picked up Scrapy? Come and get started

 老碼識(shí)途 2023-10-10 發(fā)布于廣東

Scrapy is a step up for crawler developers: it crawls target content concurrently (it is built on asynchronous I/O rather than threads), simplifies code logic, and improves development efficiency, which is why it is so popular. This article uses a stock-quotes website as an example to outline how to build a crawler with Scrapy. It is intended for learning and reference only; corrections are welcome.

What is Scrapy?

Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses Twisted, an efficient asynchronous networking framework, to handle network communication.

The components of the Scrapy architecture are described below:

  • ScrapyEngine: the engine. It controls how data flows between all components of the system and triggers events when particular actions occur. This component is the "brain" of the crawler and the scheduling hub of the whole system.

  • Scheduler: the scheduler. It receives requests from the engine and queues them. The initial URLs to crawl, as well as follow-up URLs discovered on pages, are placed in the scheduler to wait their turn; duplicate URLs are removed automatically.

  • Downloader: the downloader. It fetches page data and hands it to the engine, which then passes it on to the spider.

  • Spider: the crawler. User-written code that parses responses, extracts items, and collects additional URLs to follow. The follow-up URLs are handed to the engine and added to the scheduler. Each spider is responsible for one specific website (or a small set of them).

  • ItemPipeline: the item pipeline. It processes the items extracted by spiders. Once the data parsed from a page has been stored in an Item, the item is sent to the pipeline and passes through its stages in the configured order.

  • DownloaderMiddlewares: downloader middlewares. Specific hooks sitting between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism for extending Scrapy by plugging in custom code, for example to rotate the user-agent or IP automatically (a minimal sketch follows this list).

  • SpiderMiddlewares: spider middlewares. Specific hooks sitting between the engine and the spiders that process the spider's input (responses) and output (items and requests). They offer the same simple mechanism for extending Scrapy with custom code.
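
As an illustration of the downloader-middleware hook mentioned above, here is a minimal sketch of a middleware that picks a random User-Agent for each request. The class name and the user-agent strings are illustrative and are not part of the stockstar project; to take effect, such a class would have to be registered under DOWNLOADER_MIDDLEWARES in settings.py.

import random


class RandomUserAgentMiddleware:
    """Minimal downloader-middleware sketch: set a random User-Agent per request."""

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N)',
    ]

    def process_request(self, request, spider):
        # Called for every request passing through the downloader middleware.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy continue processing the request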

The Scrapy data flow:

  1. The engine opens a website, finds the spider that handles it, and asks that spider for the first URL(s) to crawl;

  2. The engine takes those first URL(s) and adds them to the Scheduler as requests, ready to be scheduled;

  3. The engine asks the Scheduler for the next URL to crawl;

  4. The Scheduler returns the next URL to the engine, and the engine forwards it to the Downloader through the downloader middlewares;

  5. Once the page has been downloaded, the Downloader generates a Response for it and sends it to the engine through the downloader middlewares;

  6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middlewares;

  7. The Spider processes the Response and returns the extracted Items, plus any new Requests, to the engine;

  8. The engine hands the Items returned by the Spider to the ItemPipeline and the Requests to the Scheduler, and the cycle repeats from step 2 until no pending requests remain in the Scheduler, at which point the engine shuts down.
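
To make the data flow concrete, the following minimal spider sketch shows a parse() callback yielding both items (handed to the item pipeline, step 8) and follow-up requests (handed back to the scheduler, step 8). It targets quotes.toscrape.com, the site used in Scrapy's own tutorial, rather than the stock site of this article; the spider name and selectors are illustrative.

import scrapy


class QuotesFlowSpider(scrapy.Spider):
    """Minimal sketch of the data flow: items go to the pipeline, requests go back to the scheduler."""
    name = 'quotes_flow'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Extracted data is yielded as an item and flows into the item pipeline.
            yield {'text': quote.css('span.text::text').get()}
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            # New requests are handed back to the engine and scheduled for crawling.
            yield response.follow(next_page, callback=self.parse)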

Installing Scrapy

From the command line, install Scrapy with the command pip install scrapy.

A success message at the end of pip's output indicates that the installation completed.
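
A quick way to confirm the install from the same command prompt is to ask Scrapy for its version (the exact number printed depends on your environment):

pip install scrapy
scrapy version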

 

Creating a Scrapy project

From the command line, switch to the directory where the project will live and create the crawler project with scrapy startproject stockstar.

Following the prompt, create a spider from the provided template (command format: scrapy genspider <spider-name> <domain>); for this project the commands are sketched below.
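
For this article's project the two steps would look roughly like this (the spider name stock and the domain quote.stockstar.com are taken from the spider code shown later):

scrapy startproject stockstar
cd stockstar
scrapy genspider stock quote.stockstar.com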

Note: the spider name must not be the same as the project name, otherwise Scrapy reports an error.

Open the newly created Scrapy project in PyCharm.

Crawl target

This example crawls stock IDs and names from the quotes center of a securities website.

Developing the Scrapy spider

Once the project has been created from the command line, the basic Scrapy skeleton is already in place; what remains is filling in the business code.
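
For reference, the project generated by scrapy startproject typically has the following layout (the file names are Scrapy's defaults; stock.py is the spider created by scrapy genspider):

stockstar/
    scrapy.cfg            # deployment configuration
    stockstar/
        __init__.py
        items.py          # item definitions (filled in below)
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines (filled in below)
        settings.py       # project settings (filled in below)
        spiders/
            __init__.py
            stock.py      # the spider created by scrapy genspider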

Item definition

Define the fields to be crawled:

import scrapy


class StockstarItem(scrapy.Item):
    """
    Define the fields to be crawled
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name

Customizing the spider logic

Scrapy spiders have a fixed structure: define a class that inherits from scrapy.Spider, give it the attributes (spider name, allowed domains, start URLs), and override the parent's parse method. The logic inside parse varies with the pages being scraped, as shown below:

import scrapy

from stockstar.items import StockstarItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # starting URL

    def parse(self, response):
        """
        Parsing callback.
        :param response: the downloaded index page
        :return: one item per stock
        """
        styles = ['滬A', '滬B', '深A', '深B']
        for index, style in enumerate(styles):  # index selects the matching ul element
            print('******************** crawling ' + style + ' stocks ********************')
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            # print('ids = ' + str(ids))
            # print('names = ' + str(names))
            for i in range(len(ids)):
                item = StockstarItem()  # create a fresh item for every stock
                item['stock_type'] = style
                item['stock_id'] = str(ids[i])
                item['stock_name'] = str(names[i])
                yield item

Processing the data

The pipeline processes the scraped items. For simplicity, this example just prints them to the console:

class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type>>>>' + item['stock_type'] + ' stock code>>>>' + item['stock_id'] + ' stock name>>>>' + item['stock_name'])
        return item

Note: when assigning values to an item, you must use the item['key'] = value form; attribute-style assignment (item.key = value) does not work.
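
A quick illustration of that rule, as a minimal sketch (the value is made up, and the exact AttributeError message may vary between Scrapy versions):

item = StockstarItem()
item['stock_id'] = '600000'   # OK: dict-style assignment to a declared field
item.stock_id = '600000'      # raises AttributeError: use item['stock_id'] = ... instead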

Scrapy configuration

Configuration lives in settings.py, including request headers, pipelines, the robots protocol, and so on:

# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stockstar'

SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to honour the robots protocol)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36'
    # ,
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'stockstar.pipelines.StockstarPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Running Scrapy

Because a Scrapy project is not a single standalone script, it is normally run from the command line, in the form scrapy crawl <spider-name>:

scrapy crawl stock

When the crawl runs, the pipeline prints each stock's type, code, and name to the console.
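
As a side note, the command line is not the only way to launch a crawl: Scrapy also provides a CrawlerProcess API for running a spider from an ordinary Python script. A minimal sketch, assuming it is run from the project root so that the project settings can be located:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the stockstar project's settings.py and run the 'stock' spider defined above.
process = CrawlerProcess(get_project_settings())
process.crawl('stock')
process.start()  # blocks until the crawl finishes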

Notes

This example is deliberately simple and only illustrates common Scrapy usage. Everything crawled here is present in the HTML returned by the very first request, i.e. what you see is what you get.

Two small questions are left open:

  1. What if the content to crawl spans multiple pages, i.e. requires several requests?

  2. What if the content is loaded asynchronously, i.e. the page request only returns a skeleton and the content is filled in afterwards in the usual Ajax way?

These two questions will be analysed further when they come up later. To close, a poem by Tao Yuanming, "Returning to Dwell in the Fields", shared with you.

Start learning web scraping by following "老碼識途"!!!
