Overview: Anyone familiar with web scraping knows that requests is easy to pick up — even a beginner with no background can learn to make simple requests to a website in a few days — while Scrapy is comparatively harder. This article walks through a few simple examples to explain how Scrapy works; once you understand its workflow, learning it becomes much easier.
Audience: This article is aimed at readers who already have some scraping basics and are just starting out with, or planning to learn, Scrapy.
The Scrapy framework:
The structure of the Scrapy framework:
The 5 components are:
1. spiders
2. engine
3. downloader
4. scheduler
5. item pipeline
The 2 middlewares are:
1. downloader middlewares
2. spider middlewares
Next, let's walk through a few examples to make Scrapy's workflow easier to understand. First, a word about crawlers in general: viewed as a whole, a crawler has three parts: requesting pages, parsing the responses, and storing the extracted data.
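The three parts can be sketched in plain Python, with no framework at all. Everything here is a hypothetical stand-in: `download` fakes an HTTP fetch (a real crawler would use `requests.get` or Scrapy's downloader), and the parser just pulls out `<h1>` text.

```python
from html.parser import HTMLParser

def download(url):
    # Hypothetical stand-in for a real HTTP fetch, so the sketch runs offline.
    return "<html><h1>Title for %s</h1></html>" % url

class TitleParser(HTMLParser):
    """Collects the text inside <h1> tags."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1:
            self.titles.append(data)

storage = []                          # part 3: store
for url in ["http://example.com/a"]:
    raw = download(url)               # part 1: request
    parser = TitleParser()
    parser.feed(raw)                  # part 2: parse
    storage.extend(parser.titles)

print(storage)
```

Scrapy's value is that it takes these three parts and splits the work across dedicated components, which is what the scenarios below illustrate.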
Scenario 1: start URLs: 1; parsing: no; storing data: no
(1) The spider passes the start URL through the engine to the scheduler, forming a scheduling queue (1 request).
(2) The scheduler dispatches the request through the engine to the downloader, which downloads the raw data.
Scenario 2: start URLs: 1; parsing: yes; storing data: no
(1) The spider passes the start URL through the engine to the scheduler, forming a scheduling queue (1 request).
(2) The scheduler dispatches the request through the engine to the downloader, which downloads the raw data.
(3) The raw data is passed back through the engine to the spider for parsing.
Scenario 3: start URLs: 1; parsing: yes; storing data: yes
(1) The spider passes the start URL through the engine to the scheduler, forming a scheduling queue (1 request).
(2) The scheduler dispatches the request through the engine to the downloader, which downloads the raw data.
(3) The raw data is passed back through the engine to the spider for parsing.
(4) The parsed data is passed through the engine to the item pipeline for storage.
Scenario 4: start URLs: multiple; parsing: yes; storing data: yes
(1) The spider passes the start URLs through the engine to the scheduler, forming a scheduling queue (multiple requests).
(2) The scheduler dispatches the first request through the engine to the downloader, which downloads the raw data.
(3) The raw data is passed back through the engine to the spider for parsing.
(4) The parsed data is passed through the engine to the item pipeline for storage.
(5) The scheduler dispatches the next request through the engine to the downloader, which downloads the raw data... steps (2) through (4) repeat until the scheduler has no more requests.
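The full Scenario 4 loop can be simulated in a few lines. This is a toy sketch, not real Scrapy code: the names mirror Scrapy's components, but `downloader` and `spider_parse` are hypothetical stand-ins, and the `while` loop plays the role of the engine shuttling data between them.

```python
from collections import deque

def downloader(request):
    # Stand-in for the downloader: pretend to fetch the page.
    return "raw page from %s" % request

def spider_parse(raw):
    # Stand-in for the spider's parse step: extract one item.
    return {"item": raw.upper()}

pipeline = []  # item pipeline: where parsed items end up

# (1) the spider hands the start URLs, via the engine, to the scheduler,
#     forming a queue of requests
scheduler = deque(["http://example.com/1", "http://example.com/2"])

# (2)-(5) the engine keeps pulling the next request until the queue is empty
while scheduler:
    request = scheduler.popleft()    # scheduler -> engine
    raw = downloader(request)        # engine -> downloader
    item = spider_parse(raw)         # engine -> spider
    pipeline.append(item)            # engine -> item pipeline

print(pipeline)
```

Scenarios 1 through 3 are just degenerate cases of this loop: one request in the queue, and the parse or store steps skipped.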