[Original] Python crawlers | A first crawler program with pyspider, done!

Python集中營 · 2022-10-10 · Posted from Gansu


For installing pyspider, see the earlier article "Pitfall notes: finally finishing, with some trepidation, the installation of the Python crawler library pyspider".

1. Start the pyspider service

pyspider all
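Once the command is running, it can be handy to confirm that the web UI is actually listening before opening the browser. The sketch below is only an illustration: the helper name `webui_is_up` is hypothetical, and it assumes the web UI is on pyspider's documented default address, http://localhost:5000; adjust host and port if your setup differs.

```python
import socket


def webui_is_up(host="localhost", port=5000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: the pyspider web UI listens on localhost:5000 (the default).
    This only checks that the port is open, not that pyspider is the listener.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable
        return False


if __name__ == "__main__":
    print("web UI reachable:", webui_is_up())
```

If this prints False while `pyspider all` is still running, the service has probably not finished starting or is bound to a different port.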

2. Create a pyspider project

3. Overview of the project areas

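4. Start crawling from the Baidu homepage

Enter the Baidu homepage URL and click run to start crawling, then click one of the crawled links to move on to the next step.

Click any of the crawled links to crawl the next level.

The content of the detail page you entered is returned.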
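The three steps above (fetch a page, collect its links, return data from a detail page) can be tried offline without pyspider. The following is a minimal sketch using only the standard library: `LinkAndTitleParser`, `parse_page` and `SAMPLE_HTML` are made-up names and data for illustration. It keeps only links whose `href` starts with `http`, which is the same rule as the `a[href^="http"]` selector in the handler code, and reads the page title the way the detail step does.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect absolute links and the <title> text from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Same rule as a[href^="http"]: keep hrefs starting with "http"
            if href and href.startswith("http"):
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    """Return (links, title) for the given HTML text."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.title.strip()


# Made-up sample page standing in for a fetched response
SAMPLE_HTML = """
<html><head><title> Sample home </title></head>
<body>
  <a href="http://example.com/news">News</a>
  <a href="/relative/path">Relative</a>
  <a href="https://example.com/maps">Maps</a>
  <a>No href</a>
</body></html>
"""

links, title = parse_page(SAMPLE_HTML)
print(links)  # ['http://example.com/news', 'https://example.com/maps']
print(title)  # Sample home
```

The relative link and the anchor without an `href` are skipped, which is why only some of a page's links show up after clicking run.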

5. Functions in the code editing area

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2021-04-10 11:24:26
# Project: test

from pyspider.libs.base_handler import *

# Handler class
class Handler(BaseHandler):
    # Crawler settings that apply to the whole project (a dict)
    crawl_config = {
        'url': 'http://www.baidu.com'
    }

    # Run once a day; minutes is given in minutes
    @every(minutes=24 * 60)
    # Entry point
    def on_start(self):
        # Set the start URL
        self.crawl('http://www.baidu.com', callback=self.index_page)

    # Do not crawl the same page again within 10 days; age is given in seconds
    @config(age=10 * 24 * 60 * 60)
    # Callback: parse the fetched page
    def index_page(self, response):
        # response.doc is a PyQuery object, so PyQuery selectors are used to parse it
        for each in response.doc('a[href^="http"]').items():
            # Loop over the links and crawl each detail page
            self.crawl(each.attr.href, callback=self.detail_page)

    # Task priority
    @config(priority=2)
    # Callback: return the result
    def detail_page(self, response):
        # Return the detail page data
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
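The two time values in the decorators are easy to misread because one is in minutes and the other in seconds. The sketch below just spells out the arithmetic and models `age` as a freshness window. `is_still_fresh` is a hypothetical helper written for this explanation, a simplified model of the idea rather than pyspider's actual scheduler logic.

```python
EVERY_MINUTES = 24 * 60          # value passed to @every(minutes=...)
AGE_SECONDS = 10 * 24 * 60 * 60  # value passed to @config(age=...)


def is_still_fresh(last_crawled_at, now, age_seconds=AGE_SECONDS):
    """Simplified model: a page fetched less than age_seconds ago
    is treated as unchanged, so it would not be crawled again yet.

    Both timestamps are in seconds (for example from time.time()).
    """
    return (now - last_crawled_at) < age_seconds


print(EVERY_MINUTES / 60)   # 24.0 hours, i.e. on_start runs once a day
print(AGE_SECONDS / 86400)  # 10.0 days

# A page crawled 3 days ago is still inside the 10-day window
three_days = 3 * 24 * 60 * 60
print(is_still_fresh(0, three_days))   # True
# After 11 days the window has passed
eleven_days = 11 * 24 * 60 * 60
print(is_still_fresh(0, eleven_days))  # False
```

Read together, the values mean that `on_start` is triggered daily, while an index page fetched within the last 10 days is considered still valid.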