The rvest package for R

子歌-特斯拉 2016-09-01


rvest is a seriously sharp tool; RCurl and XML pale beside it.
Taking the data download in Chapter 10 of Automated Data Collection with R as an example, let's compare the two approaches.

The data come from a UK government news site. The URL is:

https://www./government/announcements?keywords=&announcement_type_option=press-releases&topics[]=all&departments[]=all&world_locations[]=all&from_date=&to_date=01%2F07%2F2010

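We select the press releases published by UK government departments before 1 July 2010: 749 releases in total, spread over 19 result pages.

The first task is to save the link paths of these 749 releases to the local disk. Browsing the site shows that, apart from the first page, the URLs of the other 18 pages follow a regular pattern, so the headline links are extracted in two steps: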

library(rvest) # scrape web page data

library(stringr) # string handling

# URL of the first page

url = 'https://www./government/announcements?keywords=&announcement_type_option=press-releases&topics[]=all&departments[]=all&world_locations[]=all&from_date=&to_date=01%2F07%2F2010'

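first = url %>% read_html() %>% html_nodes("h3 a") %>% html_attr("href")

# "h3 a" is the node position the browser's inspector shows when you hover over a headline on the page; quick and convenient.
# html_attr("href") returns the link path of each headline as a character vector.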


first = as.character(first)
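To see what the `h3 a` selector and the attribute extraction return, here is a small offline sketch. The HTML snippet and its link paths are invented for illustration and are not taken from the site; the rvest calls (`read_html()`, `html_nodes()`, `html_attr()`, `html_attrs()`) are the same ones used in the post.

```r
library(rvest)

# Hypothetical markup imitating a results page: each headline is a link inside an <h3>.
page <- read_html('<html><body>
  <h3><a href="/government/news/example-one">Example one</a></h3>
  <h3><a href="/government/news/example-two" class="x">Example two</a></h3>
  <p><a href="/not-a-headline">Some other link</a></p>
</body></html>')

# "h3 a" matches only <a> elements nested inside an <h3>.
links <- page %>% html_nodes("h3 a")

# html_attr("href") gives a plain character vector, one path per matched node.
hrefs <- links %>% html_attr("href")

# html_attrs() gives a list holding every attribute of each node, so a node
# with extra attributes (here: class) contributes more than one value.
all_attrs <- links %>% html_attrs()

print(hrefs)
print(lengths(all_attrs))
```

When only the link target is needed, `html_attr("href")` is the safer choice, because a later `as.character()` on the list from `html_attrs()` would mangle any node that carries more than one attribute.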

# URLs of the other 18 pages (str_c is vectorised over 2:19)

others = str_c("https://www./government/announcements?announcement_type_option=press-releases&departments%5B%5D=all&from_date=&keywords=&page=", 2:19, "&to_date=01%2F07%2F2010&topics%5B%5D=all&world_locations%5B%5D=all")

# Return the headline link paths found on the x-th of the remaining pages

myfun = function(x) {

others[x] %>% read_html() %>% html_nodes("h3 a") %>% html_attr("href")

}

doc = lapply(seq_along(others), myfun)
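When looping over many pages it can help to pause between requests and to survive a page that fails to load. The following is only a sketch under assumptions: `page_url()` and `collect_links()` are hypothetical helper names, and `https://example.org/announcements` is a placeholder base URL, not the site's real address.

```r
library(stringr)

# Hypothetical helper: build the URL of results page `i`.
# `base` is a placeholder, not the real search URL.
page_url <- function(i, base = "https://example.org/announcements") {
  str_c(base, "?page=", i, "&to_date=01%2F07%2F2010")
}

# str_c is vectorised, so one call yields all 18 URLs.
urls <- page_url(2:19)

# Hypothetical wrapper: apply `extract` to each URL, sleeping `pause` seconds
# before every request and returning character(0) for a page that errors.
collect_links <- function(urls, extract, pause = 1) {
  out <- lapply(urls, function(u) {
    Sys.sleep(pause)
    tryCatch(extract(u), error = function(e) character(0))
  })
  unlist(out)
}
```

With rvest loaded, a call such as `collect_links(others, function(u) u %>% read_html() %>% html_nodes("h3 a") %>% html_attr("href"))` would then play the role of `myfun` and the loop over pages.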

# The first page plus the other 18 gives 19 pages in total.

dat = c(first,unlist(doc))

dat = str_c("https://www.",dat)

# Create a directory and save each URL in its own file

dir.create("F:/Press_Releases")

for(i in seq_along(dat)) write(dat[i], file = str_c("F:/Press_Releases/", i, ".html"))

# Check the result

length(list.files("F:/Press_Releases"))

[1] 749

list.files("F:/Press_Releases")[1:3]

[1] "1.html" "10.html" "100.html"
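The listing above starts with "1.html" "10.html" "100.html" because `list.files()` sorts names as text, not as numbers. A minimal sketch of two ways around this, using a throwaway directory under `tempdir()` and made-up file names instead of the real `F:/Press_Releases` folder:

```r
# Work in a temporary directory so nothing on F:/ is touched.
out_dir <- file.path(tempdir(), "press_releases_demo")
dir.create(out_dir, showWarnings = FALSE)

n <- 12  # small stand-in for the 749 files

# Option 1: zero-pad the index so that text order equals numeric order.
padded <- sprintf("%03d.txt", seq_len(n))
for (f in padded) writeLines("placeholder", file.path(out_dir, f))
first_three <- list.files(out_dir)[1:3]
print(first_three)

# Option 2: for names that already exist without padding,
# order them by the number in front of the extension.
unpadded <- c("1.html", "10.html", "100.html", "2.html")
by_number <- unpadded[order(as.numeric(sub("\\.html$", "", unpadded)))]
print(by_number)
```

Zero-padding at write time keeps every later `list.files()` call in the same order as `dat`, which makes it easier to match a file back to its URL.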

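If you have the patience, go and look at the code in the book for the same task: it will wear you out before you get through it.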
