Some Attempts at Improving Heritrix Crawl Efficiency

向日葵kk 2010-11-04


        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}

3. Modify the AbstractFrontier method (the constructor) of the AbstractFrontier class:

The key code segment is:

   String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                 "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                 ELFHashQueueAssignmentPolicy.class.getName() + " " +
                 HostnameQueueAssignmentPolicy.class.getName() + " " +
                 IPQueueAssignmentPolicy.class.getName() + " " +
                 BucketQueueAssignmentPolicy.class.getName() + " " +
                 SurtAuthorityQueueAssignmentPolicy.class.getName());
   Pattern p = Pattern.compile("\\s*,\\s*|\\s+");
   String [] queues = p.split(queueStr);

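The newly added code (shown in red in the original post) is the ELFHashQueueAssignmentPolicy line.

4. Modify the configuration in heritrix.properties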

        #############################################################################
        # F R O N T I E R
        #############################################################################

        # List here all queue assignment policies you'd have show as a
        # queue-assignment-policy choice in AbstractFrontier derived Frontiers
        # (e.g. BdbFrontier).
        org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
            org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
            org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
            org.archive.crawler.frontier.IPQueueAssignmentPolicy \
            org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
            org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
            org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
        org.archive.crawler.frontier.BdbFrontier.level = INFO

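The newly added entry (shown in red in the original post) is the ELFHashQueueAssignmentPolicy line.

The result of running Heritrix after adding the code is shown in the figure below: Heritrix now crawls pages with 50 threads at the same time.

Crawl speed improved considerably: 1.4 GB of pages were fetched in a little over 8 hours. The Hosts column shows that only pages under the ccer.pku.edu.cn domain were crawled.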
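Both the default string built in step 3 and the property value in step 4 are lists of class names that get split with the same delimiter pattern. The following is a minimal, self-contained sketch of that splitting step only. The class name PolicyListSplitSketch and the method splitPolicies are made up for illustration and are not part of Heritrix; the trim() call is also an addition here.

```java
import java.util.regex.Pattern;

final class PolicyListSplitSketch {
    // Same delimiter pattern as in the AbstractFrontier snippet:
    // a comma with optional surrounding whitespace, or a run of whitespace.
    private static final Pattern DELIMITER = Pattern.compile("\\s*,\\s*|\\s+");

    // Hypothetical helper: turn the configured value into individual class names.
    static String[] splitPolicies(String value) {
        // trim() is an addition of this sketch; it avoids an empty first
        // element when the value starts with whitespace.
        return DELIMITER.split(value.trim());
    }

    public static void main(String[] args) {
        String value = "org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy "
                + "org.archive.crawler.frontier.HostnameQueueAssignmentPolicy, "
                + "org.archive.crawler.frontier.IPQueueAssignmentPolicy";
        for (String name : splitPolicies(value)) {
            System.out.println(name);
        }
    }
}
```

Because the pattern accepts either commas or whitespace, the space-separated Java default and the line-continued properties value produce the same array of names.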


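Some analysis:

1) The added ELFHash code is not very complicated. The basic idea of the ELFHash algorithm is: for each character of the string in turn, shift the running hash value left by four bits and add the character to it, building up a long integer. If the high four bits (bits 28 to 31) of that value are non-zero, they are folded back by XORing them, shifted right by 24 bits, into the low-order part of the value and are then cleared. The final value taken modulo the hash table length gives the position in the hash table.

2) The ELFHash function hashes the input string and returns the resulting integer hash value. The getClassKey function calls ELFHash to compute the hash value, converts it to a string, and returns it to the caller. The value is taken modulo 100 because Heritrix typically runs 100 threads, corresponding to 100 different URI processing queues.

3) The comments in the source of the QueueAssignmentPolicy class say: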

* Establishes a mapping from CrawlURIs to String keys (queue names).
* Get the String key (name) of the queue to which the
* CrawlURI should be assigned.
* Note that changes to the CrawlURI, or its associated
* components (such as CrawlServer), may change its queue
* assignment.

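This shows that the class establishes the mapping between crawled URIs and crawl queue names. It is an abstract class; the different strategies, for example by hostname or by IP, are implemented by different subclasses.

4) AbstractFrontier is the base implementation of the frontier (the scheduler). It is a very complex class and was not studied in detail. The code added to it roughly serves to add the ELFHashQueueAssignmentPolicy strategy to the URI assignment policies available at run time. The change in heritrix.properties serves the same purpose.

5) As shown above, speed improved greatly with this policy. However, the 1.4 GB of data is somewhat smaller than what was crawled previously, probably because max-retries was set too low (originally 30, changed to 5), so a fair number of items were not fetched.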
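Points 1) to 3) can be illustrated with a small, self-contained sketch. It does not use the Heritrix classes: QueueKeyPolicy, HostnameKeyPolicy and ElfHashKeyPolicy are hypothetical stand-ins for the idea of an abstract queue assignment policy with one subclass per strategy, and plain strings stand in for CrawlURI. Hashing the full URI string is an assumption of this sketch.

```java
import java.net.URI;

// Hypothetical stand-in for the idea behind QueueAssignmentPolicy:
// map a URI to the name (key) of the queue it should be placed in.
abstract class QueueKeyPolicy {
    abstract String getClassKey(String uri);
}

// Keys by hostname: every URI of one host lands in the same queue.
final class HostnameKeyPolicy extends QueueKeyPolicy {
    @Override
    String getClassKey(String uri) {
        String host = URI.create(uri).getHost();
        return host == null ? "unknown" : host;
    }
}

// Keys by ELF hash modulo a fixed queue count, so URIs of a single host
// are spread over many queues.
final class ElfHashKeyPolicy extends QueueKeyPolicy {
    static final int QUEUE_COUNT = 100; // the modulus mentioned in point 2)

    static long elfHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            long x = hash & 0xF0000000L;   // bits 28..31
            if (x != 0) {
                hash ^= (x >> 24);         // fold them into the low-order bits
                hash &= ~x;                // then clear them
            }
        }
        return hash & 0x7FFFFFFF;          // mask keeps the result non-negative
    }

    @Override
    String getClassKey(String uri) {
        return Long.toString(elfHash(uri) % QUEUE_COUNT);
    }
}

final class QueueKeyDemo {
    public static void main(String[] args) {
        String[] uris = {
            "http://www.example.com/a/b.html",
            "http://www.example.com/a/c.html",
        };
        QueueKeyPolicy byHost = new HostnameKeyPolicy();
        QueueKeyPolicy byHash = new ElfHashKeyPolicy();
        for (String uri : uris) {
            System.out.println(uri + " host-key=" + byHost.getClassKey(uri)
                    + " hash-key=" + byHash.getClassKey(uri));
        }
    }
}
```

With the hostname policy both sample URIs share one key, so a single-site crawl ends up in one queue. With the hash policy the key depends on the whole URI, which is what lets many threads work on one site at once.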

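II. Crawling only HTML objects

The figure above shows that the crawled content includes file types that are not needed, such as pdf and jpeg. How can Heritrix be made to crawl only specific objects, for example only HTML? Section A.3 of the official "Heritrix User Manual" gives a solution: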

1）You would first need to create a job with the single seed http:///bar/. You'll need to add the MirrorWriterProcessor on the Modules screen and delete the ARCWriterProcessor. This will store your files in a directory structure that matches the crawled URIs, and the files will be stored in the crawl job's mirror directory.

2）Your job should use the DecidingScope with the following set of DecideRules:

RejectDecideRule

SurtPrefixedDecideRule

TooManyHopsDecideRule

PathologicalPathDecideRule

TooManyPathSegmentsDecideRule

NotMatchesFilePatternDecideRule

PrerequisiteAcceptDecideRule

We are using the NotMatchesFilePatternDecideRule so we can eliminate crawling any URIs that don't end with .html. It's important that this DecideRule be placed immediately before PrerequisiteAcceptDecideRule; otherwise the DNS and robots.txt prerequisites will be rejected since they won't match the regexp.

3）On the Setting screen, you'll want to set the following for the NotMatchesFilePatternDecideRule:

decision: REJECT

use-preset-pattern: CUSTOM

regexp: .*(/|\.html)$

The regular expression can be modified as needed; here it was changed to:

(.*(/|\.(html|htm|xml|asp))$)|(.*\.asp\?.*)
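To see which URIs the modified expression keeps, it can be tried outside Heritrix. The sketch below is only a rough check: the class name HtmlOnlyPatternSketch, the method isKept and the sample URIs are made up, and it assumes the rule tests the whole URI string against the expression, with non-matching URIs being the ones rejected.

```java
import java.util.regex.Pattern;

final class HtmlOnlyPatternSketch {
    // The modified expression from above, with backslashes escaped for a Java string.
    static final Pattern KEEP = Pattern.compile(
            "(.*(/|\\.(html|htm|xml|asp))$)|(.*\\.asp\\?.*)");

    // true when the whole URI matches, i.e. the URI would not be rejected
    // under the full-match assumption stated above.
    static boolean isKept(String uri) {
        return KEEP.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/",
            "http://www.example.com/index.html",
            "http://www.example.com/list.asp?id=3",
            "http://www.example.com/paper.pdf",
            "http://www.example.com/photo.jpeg",
        };
        for (String uri : samples) {
            System.out.println((isKept(uri) ? "keep   " : "reject ") + uri);
        }
    }
}
```

The second alternative, `.*\.asp\?.*`, is what lets dynamic pages with a query string through; without it only URIs ending in one of the listed extensions or in a slash would be kept.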
The crawl result is shown in the figure below:

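III. Removing the robots.txt restriction

robots.txt is a file written specifically for search-engine crawlers. When building a website, an author who wants to control which content search engines index can place a plain-text file named robots.txt on the site and declare in it the parts of the site that robots should not visit. Part or all of the site can then be kept out of search engines, or a search engine can be told to index only specified content. Because most websites do not provide a robots.txt file, the Heritrix crawler spends a lot of time checking whether the file exists, which increases the crawl time. The protocol is an add-on convention that nothing enforces, so a crawler can simply not follow it.

In Heritrix, robots.txt is handled in the PreconditionEnforcer processor. PreconditionEnforcer is a pre-fetch processor: whenever it handles a link, it checks whether any precondition must be satisfied first, and fetching robots.txt is one such precondition. PreconditionEnforcer has a private method declared as: private boolean considerRobotsPreconditions(CrawlURI curi). Before the link given as the parameter is fetched, this method checks whether robots.txt imposes a precondition on it. A return value of true means robots.txt has to be considered; false means it does not, and the link can be passed on to the following processors. So the simplest change is to comment out the entire body of this method and just return false.

According to claims found online, this can raise crawl speed by more than half. Because a crawl takes a long time, no side-by-side comparison was made. All the crawls above were run with robots.txt handling removed.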
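The shape of that change can be sketched as follows. This is not the Heritrix source: CrawlUriStub is a hypothetical stand-in for CrawlURI so that the example compiles on its own, and mustWaitForPrerequisite is an invented caller that only shows how the return value is used.

```java
// Hypothetical stand-in for Heritrix's CrawlURI, present only so that
// this sketch compiles without the Heritrix jars.
final class CrawlUriStub {
    final String uri;

    CrawlUriStub(String uri) {
        this.uri = uri;
    }
}

final class PreconditionEnforcerSketch {
    // Same name and return type as the method quoted above, with the stub
    // type swapped in for CrawlURI.
    private boolean considerRobotsPreconditions(CrawlUriStub curi) {
        // The original body would be commented out here.
        // Returning false reports that no robots.txt precondition is pending.
        return false;
    }

    // Invented caller: true would mean the URI has to wait for robots.txt.
    boolean mustWaitForPrerequisite(CrawlUriStub curi) {
        return considerRobotsPreconditions(curi);
    }

    public static void main(String[] args) {
        PreconditionEnforcerSketch enforcer = new PreconditionEnforcerSketch();
        CrawlUriStub curi = new CrawlUriStub("http://www.example.com/index.html");
        System.out.println("wait for robots.txt: " + enforcer.mustWaitForPrerequisite(curi));
    }
}
```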
