1. Scraping the Douban movie chart
1.1 Objectives
- Get familiar with the structure of a crawler request.
- Gain a better understanding of how a web page is structured.
1.2 Hands-on code
import requests
url = 'https://movie.douban.com/j/chart/top_list'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70'}
# Query parameters sent with the GET request
form = {
    'type': '24',
    'interval_id': '100:90',
    'action': '',
    'start': '0',   # rank offset of the first item
    'limit': '20',  # number of items to return
}
resp = requests.get(url, params=form, headers=headers)
print(resp.json())
resp.close()  # close the response
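Note: in the request parameters, start is the rank position to begin from, and limit is the number of items to return. Scrolling down the HTML page triggers new requests, and comparing the parameters submitted by those requests leads to this conclusion. When scraping, change these two parameters to fetch a different slice of the chart.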
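The relationship between start and limit can be sketched as a small helper that yields the query parameters for consecutive batches. The names chart_params and batch_size are hypothetical and introduced only for illustration; the parameter keys and values are the ones used in the request above. The helper only builds dictionaries and sends no request.

```python
def chart_params(total, batch_size=20):
    """Yield query-parameter dicts that cover `total` items in batches.

    Hypothetical helper for illustration: it builds parameters only.
    """
    for start in range(0, total, batch_size):
        yield {
            'type': '24',
            'interval_id': '100:90',
            'action': '',
            'start': str(start),
            # The last batch may be smaller than batch_size.
            'limit': str(min(batch_size, total - start)),
        }


batches = list(chart_params(total=50, batch_size=20))
print([(p['start'], p['limit']) for p in batches])
```

Each dictionary could then be passed as params= to requests.get in place of form.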
2. Scraping the Douban Top 250 movies
2.1 Objectives
- Learn to extract data from a page after fetching it.
2.2 Scraping a single page
import requests
import re
# Fetch the page
url = 'https://movie.douban.com/top250'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70'}
resp = requests.get(url, headers=headers)
pagecontent = resp.text
resp.close()  # close the response
# Parse the data
# The literals 导演 (director), 主演 (starring) and 人评价 (number of ratings) must match the page text.
# Assumption: the raw HTML uses double-quoted attributes and separates these fields with the &nbsp; entity.
obj = re.compile(r'<li>\n.*?<div class="item">.*?<span class="title">(?P<movie>.*?)</span>.*?导演: (?P<director>.*?)&nbsp;.*?'
                 r'主演: (?P<actors>.*?)<br>(?P<year>.*?)&nbsp;/&nbsp;(?P<country>.*?)&nbsp;/&nbsp;(?P<type>.*?)</p>.*?<span class="rating_num" property="v:average">(?P<rate>.*?)</span>.*?<span>(?P<comments>.*?)人评价</span>', re.S)
result = obj.finditer(pagecontent)
for i in result:
    print(i.group('movie'))
    print(i.group('director'))
    print(i.group('actors'))
    print(i.group('year').strip())
    print(i.group('type').strip())
    print(i.group('rate'))
    print(i.group('comments'))
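To see how named groups and re.S behave without sending a request, the same idea can be tried on a small hand-written fragment. The HTML below is a simplified stand-in, not the real page markup, and the pattern is cut down to three groups. It is a sketch of the technique, not a replacement for the pattern above.

```python
import re

# Hand-written sample for illustration; the real page has more markup.
sample = '''
<li>
  <div class="item">
    <span class="title">Movie A</span>
    <span class="rating_num" property="v:average">9.1</span>
    <span>1200人评价</span>
  </div>
</li>
<li>
  <div class="item">
    <span class="title">Movie B</span>
    <span class="rating_num" property="v:average">8.7</span>
    <span>345人评价</span>
  </div>
</li>
'''

pattern = re.compile(
    r'<li>.*?<span class="title">(?P<movie>.*?)</span>.*?'
    r'<span class="rating_num" property="v:average">(?P<rate>.*?)</span>.*?'
    r'<span>(?P<comments>.*?)人评价</span>',
    re.S,  # let "." match newlines so one pattern can span several lines
)

# Each match exposes its named groups as a dict.
rows = [m.groupdict() for m in pattern.finditer(sample)]
for row in rows:
    print(row)
```

Without re.S the pattern would find nothing here, because "." would stop at the first line break inside each list item.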
2.3 Writing to a CSV file
# Save the data: import the csv module
import csv
# Write to CSV
result = obj.finditer(pagecontent)
with open('data.csv', mode='a', encoding='utf-8-sig', newline='') as f:
    csv_write = csv.writer(f)
    for it in result:
        # Convert the match object it into a dict of its named groups
        dic = it.groupdict()
        # Strip surrounding whitespace from the year and type values
        dic['year'] = dic['year'].strip()
        dic['type'] = dic['type'].strip()
        # Write the dict's values to data.csv
        csv_write.writerow(dic.values())
print('Write complete')
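Because the file is opened in append mode and no header row is written, a second run adds the same rows again and the columns stay unlabeled. One possible variation, assuming a header row is wanted, uses csv.DictWriter. The helper name rows_to_csv and the sample row are hypothetical, and the output goes to an in-memory buffer so the sketch runs without touching disk.

```python
import csv
import io

# Column order follows the named groups of the pattern in section 2.2.
FIELDS = ['movie', 'director', 'actors', 'year', 'country', 'type', 'rate', 'comments']


def rows_to_csv(rows, out):
    """Write dict rows to the file-like object `out`, header row first."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # Strip surrounding whitespace from every field, not only year and type.
        writer.writerow({key: row[key].strip() for key in FIELDS})


# Made-up row standing in for one it.groupdict() result.
sample_rows = [{
    'movie': 'Movie A', 'director': 'Director A', 'actors': 'Actor A',
    'year': '\n  1994', 'country': 'Country A', 'type': 'Drama\n  ',
    'rate': '9.1', 'comments': '1200',
}]

buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, opening it with mode='w' and newline='' would start each run from an empty file.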
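2.4 Scraping multiple pages
The code above only scrapes the first page. Clicking through the pager at the bottom of the page shows that the URL changes in a very simple way: the value of start= is the index of the first item on the page. There are ten pages in total; the first starts at 0, the second at 25, and so on in steps of 25. The code is therefore reorganized as follows: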
import requests
import re
import csv
import time
# Fetch the pages
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70'}
pagecontent = ''
for t in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={t}&filter='
    resp = requests.get(url, headers=headers)
    pagecontent = pagecontent + resp.text  # concatenate the HTML of every fetched page
    resp.close()  # close each response once its text has been read
    print(f'Finished through item {t + 25}')
    time.sleep(1)  # pause between requests
# Parse the data
# The literals 导演 (director), 主演 (starring) and 人评价 (number of ratings) must match the page text.
# Assumption: the raw HTML uses double-quoted attributes and separates these fields with the &nbsp; entity.
obj = re.compile(r'<li>.*?<span class="title">(?P<movie>.*?)</span>.*?导演: (?P<director>.*?)&nbsp;.*?主演: (?P<actors>.*?)<br>(?P<year>.*?)&nbsp;/&nbsp;(?P<country>.*?)&nbsp;/&nbsp;(?P<type>.*?)</p>.*?<span class="rating_num" property="v:average">(?P<rate>.*?)</span>.*?<span>(?P<comments>.*?)人评价</span>', re.S)
result = obj.finditer(pagecontent)
with open('movie.csv', mode='a', encoding='utf-8-sig', newline='') as f:
    csv_write = csv.writer(f)
    for it in result:
        # Convert the match object it into a dict of its named groups
        dic = it.groupdict()
        # Strip surrounding whitespace from the year and type values
        dic['year'] = dic['year'].strip()
        dic['type'] = dic['type'].strip()
        # Write the dict's values to movie.csv
        csv_write.writerow(dic.values())
print('Write complete')
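The offset arithmetic described in this section (ten pages, with start rising by 25 each time) can be checked separately from the network code. The sketch below only builds the URLs; the helper name page_urls and the constants are hypothetical, while the URL format is the one used in the loop above.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # items per list page
TOTAL = 250      # items in the whole list


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return the list-page URLs in order, one per `page_size` items."""
    return [f'{BASE_URL}?start={start}&filter='
            for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Building the URL list first makes it easy to confirm the page count and the last offset before any request is sent.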