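A friend recently asked me to scrape post data from a forum. I searched online for open-source crawlers, read a number of comparisons, and concluded that Scrapy was the easiest to work with. I had only ever used Java and PHP and did not know Python, so I spent a day skimming the basics and then got started. I did not expect that just setting up a runtime environment would cost me another full day. What follows is a record of the pitfalls I hit while installing and configuring Scrapy.

Environment: CentOS 6.0 virtual machine

The first step was supposed to be installing Python. I ran the python command and found it was already there, to my secret delight (it turned out to be a big pit). I googled the installation steps and ran pip install Scrapy directly, which failed because pip was missing, so I installed pip. I ran pip install Scrapy again and this time python-devel was missing. I went back and forth like this for a whole morning. Later I downloaded the Scrapy source and installed from that, and it abruptly reported that Python 2.7 was required. python --version showed 2.6, and ten million grass-mud horses stampeded through my heart.

I checked the official documentation (http://doc./en/master/intro/install.html), and sure enough, Python 2.7 is required. There was nothing for it but to upgrade Python.

1. Upgrade Python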
# wget https://www./ftp/python/2.7.10/Python-2.7.10.tgz
# tar -zxvf Python-2.7.10.tgz
# cd Python-2.7.10
# ./configure
# make all
# make install
# make clean
# make distclean
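# python --version
It still reported 2.6, because /usr/bin/python was still the old interpreter. I backed it up and pointed /usr/bin/python at the new build: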
# mv /usr/bin/python /usr/bin/python2.6.6_bak
# ln -s /usr/local/bin/python2.7 /usr/bin/python
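The requirement that caused all this trouble is just a version comparison, and it can be checked from inside whichever interpreter the python command resolves to. The following is a minimal sketch, not part of the original steps; the helper name meets_minimum is made up for illustration, and the minimum of (2, 7) is taken from the error described in this post rather than from any current Scrapy release.

```python
import sys

# Minimum version reported by the Scrapy installer in this post (an assumption
# for any other Scrapy release).
MINIMUM = (2, 7)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if (major, minor) of version_info is at least minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Report which interpreter is actually running and whether it is new enough.
    found = "%d.%d.%d" % tuple(sys.version_info[:3])
    status = "ok" if meets_minimum(sys.version_info) else "too old"
    print("%s: Python %s (%s)" % (sys.executable, found, status))
```

Running the script with the python command shows the path of the interpreter that was actually picked up, which is useful when an old and a new build are installed side by side.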
# python --version
Python 2.7.10

At this point the Python upgrade was done, so I went back to installing Scrapy. pip install scrapy failed again:

Collecting Twisted>=10.0.0 (from scrapy)
  Could not find a version that satisfies the requirement Twisted>=10.0.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=10.0.0 (from scrapy)

Twisted was missing, so I installed Twisted.

2. Install Twisted
cd Twisted-15.2.1
python setup.py install
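Because pip fails on one dependency at a time, it can save a few rounds to check in a single pass which packages already import cleanly. This is only a sketch under assumptions: the helper name missing_modules is hypothetical, and the import names in the example list (twisted, lxml, OpenSSL, cryptography) are the ones that correspond to packages mentioned in this post.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that fail to import, in order."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers both "not installed" and a failed import of a
            # compiled extension inside the package.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed for the packages discussed in this post.
    wanted = ["twisted", "lxml", "OpenSSL", "cryptography"]
    print("missing: %s" % ", ".join(missing_modules(wanted)))
```

An empty result only means the imports succeed in the current shell session; it says nothing about whether the installed versions satisfy Scrapy's requirements.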
python
Python 2.7.10 (default, Jun 5 2015, 17:56:24)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>>

The import succeeds, which shows that Twisted is installed. I ran pip install scrapy again, and it still failed.

3. Install libxslt, libxml2 and xslt-config

Neither library can be installed through pip:

Collecting libxlst
  Could not find a version that satisfies the requirement libxlst (from versions: )
No matching distribution found for libxlst
Collecting libxml2
  Could not find a version that satisfies the requirement libxml2 (from versions: )
No matching distribution found for libxml2

So I built both from source:

wget http://xmlsoft.org/sources/libxslt-1.1.28.tar.gz
tar -zxvf libxslt-1.1.28.tar.gz
cd libxslt-1.1.28/
./configure
make
make install

wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
tar -zxvf libxml2-git-snapshot.tar.gz
cd libxml2-2.9.2/
./configure
make
make install

With those installed I ran pip install scrapy again. The lucky star had still not arrived.

4. Install cryptography

Failed building wheel for cryptography
Download cryptography (https://pypi./packages/source/c/cryptography/cryptography-0.4.tar.gz) and install it:

tar -zxvf cryptography-0.4.tar.gz
cd cryptography-0.4
python setup.py build
python setup.py install

The install failed with:

No package 'libffi' found

So I downloaded and installed libffi:

wget ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz
tar -zxvf libffi-3.2.1.tar.gz
cd libffi-3.2.1
./configure
make
make install

After that, the cryptography install still failed:

Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found

So I set PKG_CONFIG_PATH:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH

Then I installed Scrapy again:

pip install scrapy
Where had Lady Luck gone?

ImportError: libffi.so.6: cannot open shared object file: No such file or directory

So I looked for the library:

whereis libffi
libffi: /usr/local/lib/libffi.a /usr/local/lib/libffi.la /usr/local/lib/libffi.so
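The whereis output shows that the files exist on disk, which is a different question from whether the dynamic loader can find them. One way to test the second question directly is to ask the loader to open the library, which ctypes.CDLL does. The following is a minimal sketch; the helper names can_load and search_path_entries are made up for illustration, and the library name libffi.so.6 is taken from the error message above.

```python
import ctypes
import os


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


def search_path_entries(variable="LD_LIBRARY_PATH"):
    """Return the non-empty directories listed in a path-style variable."""
    raw = os.environ.get(variable, "")
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    print("LD_LIBRARY_PATH entries: %s" % search_path_entries())
    print("libffi.so.6 loadable: %s" % can_load("libffi.so.6"))
```

As far as I know, the loader reads LD_LIBRARY_PATH when the process starts, so the variable has to be exported in the shell before running the script; changing os.environ from inside Python is not expected to affect the result.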
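The library is installed where it should be. After some searching online I found that LD_LIBRARY_PATH was not set, so:

export LD_LIBRARY_PATH=/usr/local/lib

Then I went back to installing cryptography-0.4:

cd cryptography-0.4
python setup.py build
python setup.py install

This time it installed correctly, with no error messages.

5. Continue installing Scrapy

pip install scrapy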
I watched the progress messages:

Building wheels for collected packages: cryptography
  Running setup.py bdist_wheel for cryptography

It sat here for a long time, and I wondered whether Lady Luck had finally arrived. After a while:

Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.1.2-py2.7-linux-i686.egg (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=0.7 (from pyOpenSSL->scrapy)
  Using cached cryptography-0.9.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ordereddict in /usr/local/lib/python2.7/site-packages (from enum34->cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.7->pyOpenSSL->scrapy)
Building wheels for collected packages: cryptography
  Running setup.py bdist_wheel for cryptography
  Stored in directory: /root/.cache/pip/wheels/d7/64/02/7258f08eae0b9c930c04209959c9a0794b9729c2b64258117e
Successfully built cryptography
Installing collected packages: cryptography
  Found existing installation: cryptography 0.4
    Uninstalling cryptography-0.4:
      Successfully uninstalled cryptography-0.4
Successfully installed cryptography-0.9

Seeing this output, I was in tears. Thank you CCAV, thank you MTV, the Diaoyu Islands belong to China. It had finally installed.

6. Test Scrapy

Create a test script:

cat > myspider.py <<EOF
from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://www.cnblogs.com/rwxwsblog/']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]
EOF

Check that the test script runs:

scrapy runspider myspider.py

2015-06-06 20:25:16 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-06 20:25:16 [scrapy] INFO: Optional features available: ssl, http11
2015-06-06 20:25:16 [scrapy] INFO: Overridden settings: {}
2015-06-06 20:25:16 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi./pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-06-06 20:25:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-06 20:25:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-06 20:25:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-06 20:25:16 [scrapy] INFO: Enabled item pipelines:
2015-06-06 20:25:16 [scrapy] INFO: Spider opened
2015-06-06 20:25:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-06 20:25:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-06 20:25:17 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/rwxwsblog/> (referer: None)
2015-06-06 20:25:17 [scrapy] INFO: Closing spider (finished)
2015-06-06 20:25:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 5383,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 6, 12, 25, 17, 310084),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 6, 6, 12, 25, 16, 863599)}
2015-06-06 20:25:17 [scrapy] INFO: Spider closed (finished)

It ran normally (and I was quietly pleased, ^_^....).

7. Create my own Scrapy project (by now I had switched to a new session)

scrapy startproject tutorial

It printed the following:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 552, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2672, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2345, in load
    return self.resolve()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2351, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/__init__.py", line 48, in <module>
    from scrapy.spiders import Spider
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/request/form.py", line 9, in <module>
    import lxml.html
  File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 42, in <module>
    from lxml import etree
ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so)

Countless grass-mud horses stampeded through my heart once more. Why had it stopped working? Was this some kind of magic trick? I calmed down and read the last line: ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so). It looked awfully familiar. Wasn't it just like the earlier ImportError: libffi.so.6: cannot open shared object file: No such file or directory? In both cases the loader was not looking in /usr/local/lib; here it picked up the older system libxml2 in /usr/lib instead of the 2.9 build that lxml needs. So:

8. Add the environment variable

export LD_LIBRARY_PATH=/usr/local/lib

Run it again:

scrapy startproject tutorial

It printed the following:

[root@bogon scrapy]# scrapy startproject tutorial
2015-06-06 20:35:43 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-06 20:35:43 [scrapy] INFO: Optional features available: ssl, http11
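2015-06-06 20:35:43 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /root/scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

Finally, damn it, it worked. So with libraries built into /usr/local/lib like this, Scrapy needs LD_LIBRARY_PATH to be set at run time. An export only lasts for the shell session it was typed in, which is why the new session failed. To make it permanent, add it to the system profile:

vi /etc/profile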
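Add this line:

export LD_LIBRARY_PATH=/usr/local/lib

(The earlier PKG_CONFIG_PATH setting can go in as well: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH)

After saving, reload the file and check for errors:

source /etc/profile

Open a new session and run:

scrapy runspider myspider.py

It ran normally, so LD_LIBRARY_PATH is taking effect. With that, Scrapy was properly installed at last.

To check the Scrapy version, run scrapy version. Mine reported "Scrapy 1.0.0rc2".

9. Thoughts beyond programming (thank you for reading this far; I am a little dizzy myself.)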
10. References

http:///
http://doc./en/master/
http://blog.csdn.net/slvher/article/details/42346887
http://blog.csdn.net/niying/article/details/27103081
http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html