Hello everyone, I'm 安果!
A weekend is a great time to get together with family and friends: pick a movie everyone likes, unwind, and spend a pleasant, memorable weekend together.
This article shows how to use Scrapy to crawl the latest movie releases.
Target site:
aHR0cHM6Ly93d3cubWFveWFuLmNvbS8=
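The target URL above is Base64-encoded so the site is not named directly. A quick sketch, using only Python's standard library, to recover it:
import base64

# Decode the Base64-encoded target URL above
print(base64.b64decode("aHR0cHM6Ly93d3cubWFveWFuLmNvbS8=").decode())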
1. Create the crawler project
# Create a new Scrapy project
scrapy startproject film
cd film
# Generate a spider inside the project
scrapy genspider maoyan_film https://www.*.com/
2. Create the database table and define the Item
Create a table in the database to save the scraped data.
Taking MySQL as an example:
create table xag.film
(
    id          int auto_increment primary key,
    film_name   varchar(100) null,
    show_time   varchar(100) null,
    file_type   varchar(100) null,
    actors      varchar(100) null,
    url         varchar(100) null,
    insert_time date null
);
Then, define an Item to hold the scraped data:
# items.py
import scrapy


class FilmItem(scrapy.Item):
    film_name = scrapy.Field()    # movie title
    show_time = scrapy.Field()    # release date
    file_type = scrapy.Field()    # genre
    actors = scrapy.Field()       # cast
    url = scrapy.Field()          # movie URL
    insert_time = scrapy.Field()  # insert date (year-month-day)
3. Write the spider and parse the home page
Here we use Selenium as an example, so the first step is to create a browser object.
PS: so that the spider can also run on a server, the code below handles CentOS in addition to Windows.
import scrapy
import time
import platform
from datetime import date
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from film.items import FilmItem  # time, Request, date and FilmItem are used in the snippets below
...
def parse(self, response):
    chrome_options = Options()
    if platform.system().lower() == 'windows':
        # Windows: point the Service at the local chromedriver
        s = Service(r"C:\work\chromedriver.exe")
        self.browser = webdriver.Chrome(service=s, options=chrome_options)
    else:
        # CentOS: run headless, without sandbox or GPU
        DRIVER_PATH = '/home/drivers/chromedriver'
        s = Service(DRIVER_PATH)
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--headless')  # headless mode
        chrome_options.add_argument('--disable-gpu')
        self.browser = webdriver.Chrome(service=s, options=chrome_options)
    self.browser.implicitly_wait(5)
...
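Note that the browser object is created lazily inside parse() but never released. A minimal sketch of cleaning it up, assuming the surrounding spider class (elided above) is a regular scrapy.Spider, whose closed() hook Scrapy calls when the spider finishes:
def closed(self, reason):
    # Called by Scrapy once the spider finishes; quit the shared browser
    if getattr(self, "browser", None):
        self.browser.quit()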
Next, analyze the page structure and use XPath to extract the recently released movies.
Here we pull out each movie's title and release date, along with the URL of its detail page.
...
url = response.url
# Open the page with Selenium
self.browser.get(url)
self.browser.maximize_window()
# Wait until the movie grid has rendered
WebDriverWait(self.browser, 20).until(
    EC.presence_of_element_located((By.XPATH, '//div[@class="movie-grid"]/div[2]//dl'))
)
time.sleep(10)
video_elements = self.browser.find_elements(By.XPATH, '//div[@class="movie-grid"]/div[2]//dl/dd')
for video_element in video_elements:
    # Movie title
    film_name = video_element.find_element(By.XPATH, './/div[contains(@class,"movie-title")]').text
    # Release date (strip the trailing "上映" / "released" suffix)
    show_time = video_element.find_element(By.XPATH, './/div[contains(@class,"movie-rt")]').text.replace("上映", "")
    # Detail page URL
    file_detail_a_element = video_element.find_element(By.XPATH, './/div[@class="movie-item"]/a')
    file_detail_url = file_detail_a_element.get_attribute("href")
    print('film_name:', film_name, ',show_time:', show_time, ",url:", file_detail_url)
    # self.headers is assumed to be defined on the spider
    yield Request(file_detail_url, callback=self.parse_detail, headers=self.headers,
                  meta={"film_name": film_name, "show_time": show_time}, dont_filter=True)
...
4. Parse the movie detail page
With the step above, we now have the detail page URL for each movie.
Note that opening this page directly with Selenium triggers the site's anti-bot detection, so we first need to patch the browser's fingerprint.
...
def parse_detail(self, response):
    """
    Parse a movie detail page
    :param response:
    :return:
    """
    # Anti-bot countermeasure: hide navigator.webdriver before any page script runs
    self.browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })
...
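Besides patching navigator.webdriver through CDP, it can also help to suppress Chrome's automation hints when the browser is first created in step 3. A sketch of optional flags (not part of the original code; they would sit next to the other chrome_options calls):
# Optional: reduce obvious automation fingerprints at browser startup
chrome_options.add_argument('--disable-blink-features=AutomationControlled')
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_experimental_option('useAutomationExtension', False)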
Next, open the target page and scrape the movie's genre and cast list.
Finally, store the data in the Item.
...
self.browser.get(response.url)
self.browser.maximize_window()
# Wait for the page to load
WebDriverWait(self.browser, 20).until(
    EC.presence_of_element_located((By.XPATH, '//div[@class="movie-brief-container"]//a[@class="text-link"]'))
)
# Genre: join all genre links with "|"
film_type_elements = self.browser.find_elements(
    By.XPATH, '//div[@class="movie-brief-container"]//a[@class="text-link"]')
file_type = "|".join(element.text for element in film_type_elements)
# Cast: take the first three names and join them with "|"
celebrity_elements = self.browser.find_elements(
    By.XPATH, '//div[@class="celebrity-group"][2]//div[@class="info"]//a[@class="name"]')[:3]
actors = "|".join(element.text for element in celebrity_elements)
item = FilmItem()
item['film_name'] = response.meta.get("film_name", "")
item['show_time'] = response.meta.get("show_time", "")
item["file_type"] = file_type
item["actors"] = actors
item['url'] = response.url
item['insert_time'] = date.today().strftime("%Y-%m-%d")
yield item
...
5. Write a database pipeline to save the data above into the database table
from film.items import FilmItem
import MySQLdb


class MysqlPipeline(object):
    def __init__(self):
        # Connect to the MySQL database (host, user and password are placeholders)
        self.conn = MySQLdb.connect("host", "root", "pwd", "xag", charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    # Insert data into MySQL synchronously
    def process_item(self, item, spider):
        table_name = ''
        if isinstance(item, FilmItem):
            table_name = 'film'
        # SQL statement
        insert_sql = """
            insert into {}(film_name,show_time,file_type,actors,url,insert_time) values(%s,%s,%s,%s,%s,%s)
        """.format(table_name)
        params = list()
        params.append(item.get("film_name", ""))
        params.append(item.get("show_time", ""))
        params.append(item.get("file_type", ""))
        params.append(item.get("actors", ""))
        params.append(item.get("url", ""))
        params.append(item.get("insert_time", ""))
        # Execute the insert
        self.cursor.execute(insert_sql, tuple(params))
        # Commit, saving the row to the database
        self.conn.commit()
        return item

    def close_spider(self, spider):
        """Release database resources"""
        self.cursor.close()
        self.conn.close()
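The pipeline above inserts rows synchronously, so every commit blocks Scrapy's event loop. If throughput becomes an issue, the usual remedy is Twisted's adbapi connection pool; a minimal sketch, with the same placeholder connection parameters as above:
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self):
        # Connection pool; inserts run on Twisted's thread pool instead of blocking
        self.dbpool = adbapi.ConnectionPool("MySQLdb", host="host", user="root",
                                            passwd="pwd", db="xag",
                                            charset="utf8", use_unicode=True)

    def process_item(self, item, spider):
        # Schedule the insert asynchronously
        self.dbpool.runInteraction(self.do_insert, item)
        return item

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into film(film_name,show_time,file_type,actors,url,insert_time)
            values(%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (item.get("film_name", ""), item.get("show_time", ""),
                                    item.get("file_type", ""), item.get("actors", ""),
                                    item.get("url", ""), item.get("insert_time", "")))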
6. Configure the project's settings.py
In settings.py, configure the download delay, the default request headers, the item Pipeline, and so on.
# settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Host': 'www.host.com'
}
ITEM_PIPELINES = {
    'film.pipelines.MysqlPipeline': 300,
}
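Optionally, Scrapy's built-in AutoThrottle extension can adjust the delay dynamically based on server load, instead of the fixed DOWNLOAD_DELAY; a sketch of the relevant settings:
# settings.py (optional): let Scrapy adapt the crawl speed automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 10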
7. Run entry point
Create a file in the project root directory that defines the spider's run entry point:
from scrapy.cmdline import execute
import sys
import os


def start_scrapy():
    sys.path.append(os.path.dirname(__file__))
    # Run a single spider
    execute(["scrapy", "crawl", "maoyan_film"])


if __name__ == '__main__':
    start_scrapy()
Finally, deploy the spider to a server and set up a scheduled task and message notifications.
That way, we can keep up with the latest releases and pick a movie we like based on its genre and cast.
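For the scheduled task, a plain cron entry on the server is enough. A sketch, assuming the entry file from step 7 is saved as main.py under /home/film (both the path and the file name are placeholders):
# Run the spider every Friday at 09:00, in time for the weekend
0 9 * * 5 cd /home/film && python3 main.py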