Writing a Web Crawler in Python

youxd 2016-12-27

CodeSnippet: Scraping Code Snippets

Goal

Scrape the code snippets from CodeSnippet.

[Figure: the code snippet to scrape]

Analysis

[Figure: DOM structure]

The content we want sits in the li class='con-code bbor' element, so we use BeautifulSoup's find() method to locate that tag and then read its text content.
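To make the lookup concrete, here is a minimal sketch that runs find() against a small inline HTML string instead of the live page. The SAMPLE_HTML markup and the extract_snippet helper are assumptions made up for illustration; only the tag name and the class value come from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page. Only the <li> tag and its
# class value "con-code bbor" are taken from the article.
SAMPLE_HTML = """
<ul>
  <li class="title">Snippet of the day</li>
  <li class="con-code bbor">print("hello")</li>
</ul>
"""


def extract_snippet(html):
    """Return the text of the first matching <li>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A class value containing a space is matched against the exact
    # string of the class attribute.
    tag = soup.find("li", {"class": "con-code bbor"})
    return tag.get_text() if tag is not None else None


snippet = extract_snippet(SAMPLE_HTML)
```

Checking for None before calling get_text() is a design choice of this sketch: find() returns None when nothing matches, and calling get_text() on None would raise AttributeError if the page layout changed.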

Preparation

Import the two modules our crawler needs:

    from urllib2 import urlopen
    from bs4 import BeautifulSoup

Writing the scraping code

    # Scrape the code snippet from http://www./index.html
    def GrapIndex():
        html = 'http://www./index.html'
        bsObj = BeautifulSoup(urlopen(html), 'html.parser')
        return bsObj.find('li', {'class': 'con-code bbor'}).get_text()

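Once we have the data we want, the next step would normally be to write it to a database. Because the data we scrape here is simple, writing it to a file is enough.

    def SaveResult():
        # 'a' opens the file in append mode
        with open('code.txt', 'a') as codeFile:
            # GrapIndex() returns one string, so write it in a single call
            codeFile.write(GrapIndex())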

Writing the file raised the following error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u751f' in position 0: ordinal not in range(128)

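Analysis

Python 2.7 falls back to the ASCII codec whenever it has to convert a unicode string to bytes implicitly, which is what happens when a unicode string is written to a file opened with open(). Any character outside the ASCII range then raises an exception (ordinal not in range(128)).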
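The failure can be reproduced without any scraping. The sketch below encodes the character from the traceback (U+751F) explicitly with the ASCII codec and then with UTF-8. The helper name ascii_error is hypothetical, and the snippet is written so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

# The character named in the traceback.
text = u"\u751f"


def ascii_error(value):
    """Return the codec's error reason if value is not pure ASCII, else None."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc.reason
    return None


reason = ascii_error(text)          # the ASCII codec rejects the character
utf8_bytes = text.encode("utf-8")   # UTF-8 can represent it in three bytes
```

The same character that ASCII rejects encodes cleanly as UTF-8, which is why the fix is about choosing the encoding rather than changing the data.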

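Solution

Change the interpreter's default encoding to UTF-8 (Python 2 only):

    import sys
    reload(sys)  # restores setdefaultencoding, which site.py removes at startup
    sys.setdefaultencoding('utf-8')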
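Changing the default encoding affects the whole process and has no equivalent in Python 3, where sys.setdefaultencoding does not exist. As an alternative sketch, not the approach this article uses, the encoding can be named on the file itself with io.open, which is available in both Python 2 and Python 3. The save_text helper, the temporary directory, and the sample text are assumptions for illustration.

```python
import io
import os
import tempfile


def save_text(path, text):
    """Append a unicode string to path, encoded as UTF-8."""
    # io.open encodes on write, so the interpreter's default encoding
    # never comes into play.
    with io.open(path, "a", encoding="utf-8") as code_file:
        code_file.write(text)


# Write into a throwaway directory so the example leaves no stray file behind.
path = os.path.join(tempfile.mkdtemp(), "code.txt")
save_text(path, u"\u751f = 1\n")
```

With this approach SaveResult would call something like save_text('code.txt', GrapIndex()) and the reload(sys) lines would not be needed.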

Full code

    from urllib2 import urlopen
    from bs4 import BeautifulSoup
    import sys

    # Python 2 only: make UTF-8 the default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')

    def GrapIndex():
        html = 'http://www./index.html'
        bsObj = BeautifulSoup(urlopen(html), 'html.parser')
        return bsObj.find('li', {'class': 'con-code bbor'}).get_text()

    def SaveResult():
        # 'a' opens the file in append mode
        with open('code.txt', 'a') as codeFile:
            codeFile.write(GrapIndex())

    if __name__ == '__main__':
        # Fetch the page nine times, appending each result to code.txt
        for i in range(0, 9):
            SaveResult()