I've recently been learning to scrape data with Python web crawlers. Besides the video lessons at 天善學(xué)院, I also came across a pretty good book, 《Python網(wǎng)絡(luò)數(shù)據(jù)采集》 (Web Scraping with Python), which I recommend; if you'd like the e-book, add me on WeChat: tstoutiao. Along the way I'll also be writing some small crawler programs, and comments and discussion are welcome.
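The small program below is this installment's example. It walks listing pages 1 through 9 of an internship site for the keyword "python" (the domain is elided in the original, so it is left as-is here), pulls out the job title, location, company, salary, and detail-page link with lxml's XPath, then opens each detail page with BeautifulSoup to grab the job description, and finally writes everything into an Excel file with xlwt.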
import os

import requests
import xlwt
from bs4 import BeautifulSoup
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent()
# Send a randomized User-Agent with every request. Note that ua.random must
# not be quoted, or the literal text 'ua.random' is sent as the header value.
headers = {'User-Agent': ua.random}
# Accumulators for the fields scraped from the listing pages
job = []
location = []
company = []
salary = []
link = []
# Crawl listing pages 1 to 9 for the keyword 'python'
for k in range(1, 10):
    url = 'http://www./interns?k=python&p=' + str(k)
    r = requests.get(url, headers=headers).text
    s = etree.HTML(r)
    job1 = s.xpath('//a/h3/text()')
    location1 = s.xpath('//span/span/text()')
    company1 = s.xpath('//p/a/text()')
    # The inner quotes must differ from the outer ones, or the string literal breaks
    salary1 = s.xpath('//span[contains(@class, "money_box")]/text()')
    link1 = s.xpath('//div[@class="job_head"]/a/@href')
    for i in link1:
        url = 'http://www.' + i
        link.append(url)
    # Every other text node holds the salary figure; strip the blank lines around it
    salary11 = salary1[1::2]
    for i in salary11:
        salary.append(i.replace('\n\n', ''))
    job.extend(job1)
    location.extend(location1)
    company.extend(company1)
# Fetch each detail page and pull out the job description text
detail = []
for i in link:
    r = requests.get(i, headers=headers).text
    soup = BeautifulSoup(r, 'lxml')
    word = soup.find_all(class_='dec_content')
    for block in word:  # renamed from i so it does not shadow the outer loop variable
        detail.append(block.get_text())
book = xlwt.Workbook()
sheet = book.add_sheet('sheet', cell_overwrite_ok=True)
# Switch to the project directory (the absolute save path below makes this optional)
path = 'D:\\Pycharm\\spider'
os.chdir(path)
# Write one row per job; row 0 is left free for an optional header row
j = 0
for i in range(len(job)):
    try:
        sheet.write(i + 1, j, job[i])
        sheet.write(i + 1, j + 1, location[i])
        sheet.write(i + 1, j + 2, company[i])
        sheet.write(i + 1, j + 3, salary[i])
        sheet.write(i + 1, j + 4, link[i])
        sheet.write(i + 1, j + 5, detail[i])
    except Exception as e:
        # The lists can differ in length if an XPath query misses on some page,
        # so an IndexError here just skips that row
        print('Exception: ' + str(e))
        continue
book.save('d:\\python.xls')
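One caveat on the design: each field is collected into its own flat list, so if one XPath query misses a match on some page, every later row drifts out of alignment across columns, and the try/except above only papers over the resulting IndexError. A minimal alternative save step, sketched below assuming the same job/location/company/salary/link/detail lists have been built, pairs the rows up with zip() and writes a CSV with the standard library (the filename python_jobs.csv is just a placeholder):

import csv

# A minimal sketch, assuming the six lists built above. zip() pairs the
# fields index by index and stops at the shortest list, so a length
# mismatch surfaces as missing rows rather than an IndexError.
# utf-8-sig lets Excel open the Chinese text correctly.
with open('python_jobs.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['job', 'location', 'company', 'salary', 'link', 'detail'])
    for row in zip(job, location, company, salary, link, detail):
        writer.writerow(row)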