Scrapy exposes two customizable middleware types and one data processor:

Name                                  Role                                  custom_settings behavior
Item Pipeline (data processor)        processes items                       overridden
Downloader Middleware                 processes requests/responses          merged
Spider Middleware                     processes items/responses/requests    merged
Notes:
"custom_settings behavior" describes how a value in the spider's custom_settings interacts with the project-level setting of the same name.
Oddly, all three of these classes inherit from plain object, so you have to consult the documentation for the hook signatures every time. Ideally Scrapy would provide an abstract base class with the interface methods for users to implement; it is not clear why it was designed this way.
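The "merged" behavior in the table can be illustrated with plain dictionaries: user middleware dicts are combined per key with Scrapy's built-in `*_BASE` defaults and then sorted by priority number. This is a simplified sketch of that idea, not Scrapy's actual Settings implementation (the two built-in entries are real Scrapy defaults, shown for illustration):

```python
# Built-in defaults (excerpt, for illustration only).
DOWNLOADER_MIDDLEWARES_BASE = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
}

# What the spider sets in custom_settings.
DOWNLOADER_MIDDLEWARES = {
    "scrapys.mymiddleware.MyMiddleware": 100,
}

def getwithbase(user, base):
    """Combine user settings with the *_BASE defaults (per-key merge)."""
    combined = dict(base)
    combined.update(user)
    return combined

merged = getwithbase(DOWNLOADER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES_BASE)

# Middlewares are ordered by ascending priority number.
order = [name for name, prio in sorted(merged.items(), key=lambda kv: kv[1])]
print(order[0])  # scrapys.mymiddleware.MyMiddleware
```

A scalar setting like SPIDER_DATA has no `*_BASE` counterpart, so the spider's value simply replaces the project-level one, which is the "overridden" row in the table.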
The code snippets and comments below briefly illustrate what each of the three components does.
1、Spider
baidu_spider.py
from scrapy import Spider, cmdline


class BaiduSpider(Spider):
    name = "baidu_spider"
    start_urls = [
        "https://www.baidu.com/"
    ]
    custom_settings = {
        "SPIDER_DATA": "this is spider data",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapys.mymiddleware.MyMiddleware": 100,
        },
        "ITEM_PIPELINES": {
            "scrapys.mypipeline.MyPipeline": 100,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapys.myspidermiddleware.MySpiderMiddleware": 100,
        },
    }

    def parse(self, response):
        pass


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu_spider".split())
2、Pipeline
mypipeline.py
class MyPipeline(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a Pipeline instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### pipeline get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_item(self, item, spider):
        """
        return item    -> continue processing
        raise DropItem -> discard the item
        """
        print("### call process_item")
        return item

    def open_spider(self, spider):
        """
        Called when the spider is opened.
        """
        print("### spider open {}".format(spider.name))

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        """
        print("### spider close {}".format(spider.name))
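As the process_item docstring notes, returning the item passes it on and raising DropItem discards it. A minimal standalone sketch of that contract (the DropItem class here stands in for scrapy.exceptions.DropItem, and the "price" field is a made-up example):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class PriceFilterPipeline:
    """Drops items that are missing a (hypothetical) 'price' field."""
    def process_item(self, item, spider):
        if item.get("price") is None:
            raise DropItem("missing price in {}".format(item))
        return item

pipeline = PriceFilterPipeline()

# Item with a price: returned unchanged, continues down the pipeline chain.
kept = pipeline.process_item({"name": "book", "price": 10}, spider=None)
print(kept)  # {'name': 'book', 'price': 10}

# Item without a price: DropItem aborts further pipeline processing.
try:
    pipeline.process_item({"name": "pen"}, spider=None)
except DropItem as exc:
    print("dropped:", exc)
```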
3、Downloader-Middleware
mymiddleware.py
class MyMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_request(self, request, spider):
        """
        return
            None:     continue processing this request
            Response: return this response (skip the download)
            Request:  reschedule the new request
        raise IgnoreRequest: process_exception -> Request.errback
        """
        print("### call process_request")

    def process_response(self, request, response, spider):
        """
        return
            Response: continue processing this response
            Request:  reschedule
        raise IgnoreRequest: Request.errback
        """
        print("### call process_response")
        return response

    def process_exception(self, request, exception, spider):
        """
        return
            None:     continue handling the exception
            Response: return this response
            Request:  reschedule
        """
        pass
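The return-value contract in the docstrings above can be sketched as a small dispatch loop. This is a simplified mental model of the downloader, not Scrapy's actual engine code; the Request/Response classes and the CacheMiddleware are hypothetical stand-ins:

```python
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, body=""):
        self.url = url
        self.body = body

def download(request, middlewares):
    """Simplified model: run process_request hooks in order, then 'download',
    then run process_response hooks in reverse order."""
    for mw in middlewares:
        result = mw.process_request(request)
        if isinstance(result, Response):
            response = result  # short-circuit: skip the actual download
            break
        if isinstance(result, Request):
            return download(result, middlewares)  # rescheduled request
        # None: fall through to the next middleware
    else:
        response = Response(request.url, body="downloaded")
    for mw in reversed(middlewares):
        result = mw.process_response(request, response)
        if isinstance(result, Request):
            return download(result, middlewares)
        response = result
    return response

class CacheMiddleware:
    """Hypothetical middleware that answers one URL from a local cache."""
    def process_request(self, request):
        if request.url == "https://cached.example/":
            return Response(request.url, body="from cache")
        return None  # let the request continue

    def process_response(self, request, response):
        return response

resp = download(Request("https://cached.example/"), [CacheMiddleware()])
print(resp.body)  # from cache
```

Returning a Response from process_request is how middlewares such as an HTTP cache avoid hitting the network at all; returning None simply hands the request to the next middleware in the chain.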
4、Spider-Middleware
myspidermiddleware.py
class MySpiderMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### spider middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_spider_input(self, response, spider):
        """
        Runs after the download finishes, before the response is handed
        to parse().
        return None -> continue processing the response
        raise Exception
        """
        print("### call process_spider_input")

    def process_spider_output(self, response, result, spider):
        """
        Called with the result the spider returns for a response; result
        must be an iterable of Request objects and items
        (i.e. yield item / yield Request(url)).
        return
            iterable of Request, dict, or Item
        """
        print("### call process_spider_output")
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """
        return
            None
            iterable of Response, dict, or Item
        """
        pass

    def process_start_requests(self, start_requests, spider):
        """
        Runs once when the spider starts, with its start requests.
        return: an iterable of Request objects
        """
        return start_requests
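Because process_spider_output sees everything the spider yields, it is a natural place to filter or transform results. A standalone sketch under simplified assumptions (request URLs are plain strings, items are dicts, and the offsite-style domain check is hypothetical):

```python
class DomainFilterMiddleware:
    """Drops yielded request URLs outside the allowed domain; items pass through."""
    def __init__(self, allowed_domain):
        self.allowed_domain = allowed_domain

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, dict):          # an item: always forward
                yield entry
            elif self.allowed_domain in entry:   # a request URL: filter
                yield entry

mw = DomainFilterMiddleware("baidu.com")
spider_result = [
    {"title": "an item"},
    "https://www.baidu.com/s?wd=scrapy",
    "https://other.example/",                    # filtered out
]
kept = list(mw.process_spider_output(None, spider_result, None))
print(len(kept))  # 2
```

Like the pass-through loop in MySpiderMiddleware above, the hook must itself return an iterable, which is why it is written as a generator.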
Run the spider and inspect the log:
### middleware get spider_data: this is spider data
### spider middleware get spider_data: this is spider data
### pipeline get spider_data: this is spider data
### spider open baidu_spider
### call process_request
### call process_response
### call process_spider_input
### call process_spider_output
### spider close baidu_spider
Middleware initialization order:
download middleware
spider middleware
pipeline
Hook call order:
spider open
process_request
process_response
process_spider_input
process_spider_output
spider close
References:
Item Pipeline
Downloader Middleware
Spider Middleware
Scrapy 1.5 documentation