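Preface

The developer of this package is Huang Tianyuan.

First, I worked through tidyfst from start to finish, because it does not contain many functions: 38 at most by my count.

Now that my amplicon work has stabilized, I am gradually moving into metagenomics, where there is more software and a wider range of platforms, and R has started to show quite a few drawbacks.

Many tools have appeared in recent years to address this. To stay close to my existing habits, I want to tackle the problem with data.table and tidyfst. I hope it goes smoothly. Will the results match my expectations?

Advantages of the fst package (fstpackage/fst) behind tidyfst:

1. Fast reading and writing of data frames.
2. File compression. A data frame can be compressed when it is saved, which saves time when moving large data (from disk to another computer, or uploading to a server). The compression ratio is impressive, and one parameter controls the compression level; I usually set it to the maximum. I asked the original author, who explained that there are 100 compression levels. Reading and writing are fastest with no compression, but even with heavy compression they are still very fast. My own tests confirm this, so I always use the maximum level, and I wrapped his function to change the default compression from 50 to 100.

Testing the fst format

Why test this? Because fst is faster.

Build a huge data frame; the code is adapted from hopeR.

library(tidyfst)
library(data.table)  # loaded explicitly for data.table(), fread() and fsetequal()

# Build a data frame with 100 million rows and 4 columns
nr_of_rows <- 1e8
df <- data.table(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

Print the object size:

head(df)
object.size(df) %>% print(units = "auto")

Next, save the data and check how long it takes. sys_time_print is a helper function that the author wrapped in tidyfst.

# ?export_fst
sys_time_print({
  export_fst(df, "./df.fst")
})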
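The compression level described above can be tried directly with the underlying fst package. This is a minimal sketch on a small stand-in table (the 1e8-row table is too large for a quick check); it assumes the fst package is installed, and the names small, path_raw and path_max are made up for illustration. It uses fst::write_fst with its compress argument (0 to 100) and fst::read_fst with columns, from and to to read back only part of the file.

```r
library(fst)

# Small stand-in table; hypothetical data, not the article's 1e8-row table
set.seed(1)
small <- data.frame(
  Integer = sample(1L:100L, 1e5, replace = TRUE),
  Factor  = factor(sample(c("a", "b", "c"), 1e5, replace = TRUE))
)

path_raw <- tempfile(fileext = ".fst")
path_max <- tempfile(fileext = ".fst")

write_fst(small, path_raw, compress = 0)    # no compression
write_fst(small, path_max, compress = 100)  # maximum compression

# Compare the two file sizes in bytes
sizes <- file.size(c(path_raw, path_max))
print(sizes)

# Read back a single column and only the first 10 rows
part <- read_fst(path_max, columns = "Integer", from = 1, to = 10)
print(part)

unlink(c(path_raw, path_max))
```

Reading a subset of columns and rows without loading the whole file is the property that the _fst functions in the following sections rely on.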
# Remove the df data frame once the export is done
rm(df)

Read in the fst object:

parse_fst("./df.fst") -> ft
## Printing ft directly raised an error, so it stays commented out
# ft
head(ft)
colnames(ft)

Fast frequency counts

Functions that operate on fst data carry the suffix _fst. Here select_fst selects columns.

sys_time_print({
  ft %>% select_fst(Logical) %>% count_dt(Logical) -> res
})
res

slice_fst selects rows. Then compute a mean per group:

sys_time_print({
  ft %>% slice_fst(1:1000) %>%
    group_dt(
      by = Factor,
      summarise_dt(avg_int = mean(Integer))
    ) -> res
})
res

filter_fst filters rows, and count_dt counts frequencies.

sys_time_print({
  ft %>% filter_fst(Real >= 50) %>% count_dt(Factor) -> res
})
res

Delete the local file:

unlink("./df.fst")

Learning tidyfst properly

This package processes data quickly, so I want to use it to explore metagenomic data. The functions are covered one by one below.

1 arrange_dt: sorting

# Load the data
data(iris)
# Sort by a numeric column
iris %>% arrange_dt(Sepal.Length)
iris
# Sort in descending order
iris %>% arrange_dt(-Sepal.Length)
# Sort by two columns: first by the first column, then by the second within ties
iris %>% arrange_dt(Sepal.Length, Petal.Length)

2 as_fst: convert a data frame to an fst object

iris %>% as_fst() -> iris_fst
head(iris_fst)

3 complete_dt: complete a data frame with missing combinations of data

The function expands the data frame to every combination of the specified columns and returns the result.

df <- data.table(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
df
df %>% complete_dt(item_id, item_name)
df %>% complete_dt(item_id, item_name, fill = 0)
df %>% complete_dt("item")
df %>% complete_dt(item_id = 1:3)
df %>% complete_dt(item_id = 1:3, group = 1:2)
df %>% complete_dt(item_id = 1:3, group = 1:3, item_name = c("a", "b", "c"))

4 count_dt: frequency counts

iris %>% count_dt(Sepal.Width)
# Name the count column
iris %>% count_dt(Species, .name = "count")
# Count and append the count to the original data as a new column
iris %>% add_count_dt(Species)
# Name the appended column
iris %>% add_count_dt(Species, .name = "N")
# Count by two grouping columns
mtcars %>% count_dt(cyl, vs)
# Rename the count column; the result is sorted by default, so turn sorting off
mtcars %>% count_dt(cyl, vs, .name = "N", sort = FALSE)
# Append to the original data
mtcars %>% add_count_dt(cyl, vs)

5 cummean: cumulative mean

cummean(1:10)

6 distinct_dt: remove duplicates

iris %>% distinct_dt()
iris %>% distinct_dt(Species)
iris %>% distinct_dt(Species, .keep_all = TRUE)
mtcars %>% distinct_dt(cyl, vs)
mtcars %>% distinct_dt(cyl, vs, .keep_all = TRUE)

7 drop_na_dt: drop rows containing NA

df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b"))
df
# Drop every row that contains an NA
df %>% drop_na_dt()
# Drop rows with an NA in column x
df %>% drop_na_dt(x)
# Drop rows with an NA in column y
df %>% drop_na_dt(y)
# Drop rows with an NA in column x or y
df %>% drop_na_dt(x, y)

# Replace NA with 0
df %>% replace_na_dt(to = 0)
df %>% replace_na_dt(x, to = 0)
df %>% replace_na_dt(y, to = 0)
df %>% replace_na_dt(x, y, to = 0)

# Fill missing values
# Fill column x only
df %>% fill_na_dt(x)
# Not specified: fill all columns
df %>% fill_na_dt()
# Fill with the value from the next row
df %>% fill_na_dt(y, direction = "up")
# The NA in x is in the last row, so it cannot be filled upwards
df %>% fill_na_dt(x, direction = "up")
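To make the two fill directions concrete, here is a small sketch using data.table::nafill on the same x values. The assumption (not confirmed by the text above) is that fill_na_dt's default downward fill corresponds to "last observation carried forward" and direction = "up" to "next observation carried backward". nafill only handles numeric vectors, so the character column y is left out.

```r
library(data.table)

x <- c(1, 2, NA)

# "locf": carry the last observation forward (fill downwards)
down <- nafill(x, type = "locf")
# "nocb": carry the next observation backward (fill upwards)
up <- nafill(x, type = "nocb")

print(down)  # 1 2 2
print(up)    # 1 2 NA
```

The trailing NA survives the upward fill because there is no later value to borrow from, which matches the last fill_na_dt example.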
x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5), z = rep(NA, 4))
x
# Delete columns that are entirely NA
x %>% delete_na_cols()
# Delete columns whose proportion of NA is at least 0.75
x %>% delete_na_cols(prop = 0.75)
x %>% delete_na_cols(prop = 0.5)
x %>% delete_na_cols(prop = 0.24)
# Delete columns with at least 2 NAs
x %>% delete_na_cols(n = 2)
# Delete rows whose proportion of NA is at least 0.6
x %>% delete_na_rows(prop = 0.6)
# Delete rows with at least 2 NAs
x %>% delete_na_rows(n = 2)

# shift_fill
y = c("a", NA, "b", NA, "c")
y
# Fill; equivalent to shift_fill(y, "down")
shift_fill(y)
shift_fill(y, "up")

8 dummy_dt: expand a column into dummy (indicator) columns

iris %>% dummy_dt(Species)
# Use the original level names as column names
iris %>% dummy_dt(Species, longname = FALSE)
# Expand two columns
mtcars %>% head() %>% dummy_dt(vs, am)
mtcars %>% head() %>% dummy_dt("cyl|gear")

9 export_fst: saving data in fst format

export_fst(iris, "iris_fst_test.fst")
iris_dt = import_fst("iris_fst_test.fst")
iris_dt
unlink("iris_fst_test.fst")

10 filter_dt: row filtering

iris %>% filter_dt(Sepal.Length > 7)
iris %>% filter_dt(Sepal.Length > 7, Sepal.Width > 3)
iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
iris %>% filter_dt(Sepal.Length == max(Sepal.Length))

11 slice_fst: select rows; select_fst: select columns; filter_fst: filter rows

These functions work directly on the fst format, which shortens processing time further. Essential for large data.

## Not run:
fst::write_fst(iris, "iris_test.fst")
# parse the file but not reading it
parse_fst("iris_test.fst") -> ft
# ft
class(ft)
lapply(ft, class)
names(ft)
dim(ft)
# Select the first three rows
ft %>% slice_fst(1:3)
# Select rows 1 and 3
ft %>% slice_fst(c(1, 3))
ft %>% select_fst(Sepal.Length)
ft %>% select_fst(Sepal.Length, Sepal.Width)
ft %>% select_fst("Sepal.Length")
ft %>% select_fst(1:3)
ft %>% select_fst(1, 3)
ft %>% select_fst("Se")
ft %>% select_fst("nothing")
ft %>% select_fst("Se|Sp")
ft %>% select_fst(cols = names(iris)[2:3])
ft %>% filter_fst(Sepal.Width > 3)
ft %>% filter_fst(Sepal.Length > 6, Species == "virginica")
ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3)
unlink("iris_test.fst")

12 group_by_dt: grouping

Combined with head, this lets you compute on the first few rows of each group. Combined with sorting, it can summarise the entries with the highest or lowest abundance.

# aggregation after grouping using group_exe_dt
as.data.table(iris) -> a
# ?group_exe_dt
# Specify the grouping; head is then applied per group (rarely needed on its own)
a %>% group_by_dt(Species) %>% group_exe_dt(head(3))
a
# Specify the grouping, then compute on the first four rows of each group
a %>% group_by_dt(Species) %>%
  group_exe_dt(
    head(4) %>%
      summarise_dt(sum = mean(Sepal.Length))
  )
# Compute with two grouping columns
mtcars %>% group_by_dt("cyl|am") %>%
  group_exe_dt(
    summarise_dt(mpg_sum = sum(mpg))
  )
# Same as the previous call
mtcars %>% group_by_dt(cols = c("cyl", "am")) %>%
  group_exe_dt(
    summarise_dt(mpg_sum = sum(mpg))
  )

13 group_dt: grouped computation

# Take the first three rows of each group
iris %>% group_dt(by = Species, slice_dt(1:3))
# Keep the row with the maximum value in each group, retaining the other columns
iris %>% group_dt(Species, filter_dt(Sepal.Length == max(Sepal.Length)))
# Compute the maximum per group; only the summary column is returned
iris %>% group_dt(Species, summarise_dt(new = max(Sepal.Length)))
# Add a column, then sum that column per group
iris %>% group_dt(Species,
  mutate_dt(max = max(Sepal.Length)) %>%
    summarise_dt(sum = sum(max)))
# .SD can be used directly
# Extract the first and last row of each group
iris %>% group_dt(
  by = Species,
  rbind(.SD[1], .SD[.N])
)
# summarise_dt has a built-in by argument, so grouping can be done inside the call
mtcars %>% summarise_dt(
  disp = mean(disp),
  hp = mean(hp),
  by = cyl
)
# Or group with group_dt
mtcars %>% group_dt(by = .(vs, am),
  summarise_dt(avg = mean(mpg)))
# As in data.table, .() is equivalent to list() here
mtcars %>% group_dt(by = list(vs, am),
  summarise_dt(avg = mean(mpg)))
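For comparison, the grouped calls above can be written in plain data.table syntax. This is a sketch of the presumed equivalents, not a description of how tidyfst implements group_dt internally; the object names dt, by_dot, by_list and first_last are illustrative.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Grouped mean with .() and with list(): both spell the same thing
by_dot  <- dt[, .(avg = mean(mpg)), by = .(vs, am)]
by_list <- dt[, list(avg = mean(mpg)), by = list(vs, am)]
print(by_dot)

# First and last row of each group via .SD and .N
first_last <- as.data.table(iris)[, rbind(.SD[1], .SD[.N]), by = Species]
print(first_last)
```

Inside the square brackets, .SD is the subset of data for the current group and .N is its row count, which is why .SD[.N] picks the last row.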
# mutate_dt adds a column; mean() returns one overall value, which is recycled to fill both rows
df <- data.table(x = 1:2, y = 3:4, z = 4:5)
df
df %>% mutate_dt(m = mean(c(x, y, z)))
# Wrapped in rowwise_dt, the mean is computed within each row instead
df %>% rowwise_dt(
  mutate_dt(m = mean(c(x, y, z)))
)

14 in_dt: general-purpose function

Sort, then extract the sorted rows per group. Very useful for microbiome data.

iris %>% as_dt()
# Sort, then take the first row of each group
iris %>% in_dt(order(-Sepal.Length), .SD[1], by = Species)

15 lead_dt: shift a vector forward or backward

lead_dt(1:5)
lag_dt(1:5)
lead_dt(1:5, 2)
lead_dt(1:5, n = 2, fill = 0)

16 _join_dt: the most important group of functions, merging data frames

# Build a data.table object
workers = fread("
    name company
    Nick Acme
    John Ajax
    Daniela Ajax
")
# Build a second data.table object
positions = fread("
    name position
    John designer
    Daniela engineer
    Cathie manager
")
# ?inner_join
# Merge the data frames
# Keep only rows present in both
workers %>% inner_join_dt(positions)
# Keep all rows of the left table
workers %>% left_join_dt(positions)
# Keep all rows of the right table
workers %>% right_join_dt(positions)
# Keep all rows
workers %>% full_join_dt(positions)
# Rows of the left table that have no match in the right table
workers %>% anti_join_dt(positions)
# Rows of the left table that have a match in the right table
workers %>% semi_join_dt(positions)
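The six join types can be checked against plain data.table operations. The sketch below rebuilds the two small tables and uses merge() plus data.table's on = syntax; it is offered as an assumed equivalent for comparison, not as the code tidyfst runs internally.

```r
library(data.table)

workers <- data.table(
  name    = c("Nick", "John", "Daniela"),
  company = c("Acme", "Ajax", "Ajax")
)
positions <- data.table(
  name     = c("John", "Daniela", "Cathie"),
  position = c("designer", "engineer", "manager")
)

inner <- merge(workers, positions, by = "name")                # rows in both
left  <- merge(workers, positions, by = "name", all.x = TRUE)  # all left rows
right <- merge(workers, positions, by = "name", all.y = TRUE)  # all right rows
full  <- merge(workers, positions, by = "name", all = TRUE)    # all rows

anti <- workers[!positions, on = "name"]        # left rows without a match
semi <- workers[name %in% positions$name]       # left rows with a match

print(full)
```

Nick has no position and Cathie has no company, so they appear only in the left, right or full results, with NA in the missing column.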
# Specify the key column with the by argument
workers %>% left_join_dt(positions, by = "name")
# Rename the first column in 'positions'
positions2 = setNames(positions, c("worker", "position"))
# If the key columns have different names in the two tables, match them with an equals sign
workers %>% inner_join_dt(positions2, by = c("name" = "worker"))
# Equivalent
workers %>% ijoin(positions2, by = "name==worker")

# The two approaches below give the same result
x = data.table(a = 1:5, a1 = 2:6, b = 11:15)
y = data.table(a = c(1:4, 6), a1 = c(1, 2, 4, 5, 1), c = c(101:104, 106))
# Default merge on the shared columns
merge(x, y, all = TRUE) -> a
# Join on two columns
fjoin(x, y, by = c("a", "a1")) -> b
data.table::setcolorder(a, names(b))
fsetequal(a, b)

16 longer_dt: wide to long

## Build the data
stocks = data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocks
# Wide to long
stocks %>% longer_dt(time)
# A partial (regular-expression) match on the column name is enough
stocks %>% longer_dt("ti")

# I could not find the "billboard" dataset, so I did not run this part.
# (The dataset ships with the tidyr package.)
# library(tidyr)
# # install.packages("billboard")
# library("billboard")
# data(billboard)
#
# billboard %>%
#   longer_dt(
#     -"wk",
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )
#
# billboard
# # or use:
# billboard %>%
#   longer_dt(
#     artist, track, date.entered,
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )
# # or use:
# billboard %>%
#   longer_dt(
#     1:3,
#     name = "week",
#     value = "rank",
#     na.rm = TRUE
#   )

17 df_mat: fast conversion between matrices and long tables

This is very useful for network analysis and correlation analysis.

mm = matrix(c(1:8, NA), ncol = 3, dimnames = list(letters[1:3], LETTERS[1:3]))
mm
# Matrix to long table
tdf = mat_df(mm)
tdf
# Long table back to matrix
mat = df_mat(tdf, row, col, value)
mat
setequal(mm, mat)
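The same round trip can be sketched in base R, which helps show what the long table contains: one row per matrix cell with its row name, column name and value. This uses as.table and as.data.frame rather than mat_df and df_mat, so the column names row, col and value are set by hand here and are only assumed to mirror the tidyfst output.

```r
mm <- matrix(c(1:8, NA), ncol = 3,
             dimnames = list(letters[1:3], LETTERS[1:3]))

# Matrix to long table: one row per cell, in column-major order
long <- as.data.frame(as.table(mm), stringsAsFactors = FALSE)
names(long) <- c("row", "col", "value")
print(long)

# Long table back to matrix; relies on the column-major order being unchanged
back <- matrix(long$value, nrow = nrow(mm), dimnames = dimnames(mm))
print(back)
```

For correlation or network work, the long form is an edge list (row, col, value), which is the shape most graph tools expect.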
tdf %>%
  setNames(c("A", "B", "C")) %>%
  df_mat(A, B, C)

18 mutate_dt: add new columns

# Add new columns; they are appended after the existing ones
iris %>% mutate_dt(one = 1, Sepal.Length = Sepal.Length + 1)
# Keep only the newly computed columns
iris %>% transmute_dt(one = 1, Sepal.Length = Sepal.Length + 1)
# `.GRP` adds a group label; pay attention to these special symbols
iris %>% mutate_dt(id = 1:.N, grp = .GRP, by = Species)

18 mutate_when; mutate_vars: conditional and multi-column updates

Add or update columns where a condition holds, or apply an operation to several columns selected by a condition.

iris[3:8,]
# Update values only in rows that meet the condition
iris[3:8,] %>%
  mutate_when(Petal.Width == .2,
              one = 1, Sepal.Length = 2)
# Standardise the columns whose names match the pattern
iris %>% mutate_vars("Pe", scale)
# Standardise all numeric columns
iris %>% mutate_vars(is.numeric, scale)
# Standardise the non-factor columns
iris %>% mutate_vars(-is.factor, scale)
# Standardise the first two columns
iris %>% mutate_vars(1:2, scale)
# Convert every column to character
iris %>% mutate_vars(.func = as.character)

Part two

19 nest_dt: converting between data frames and list columns

library(tidyfst)
# Split the data frame by group
a = mtcars %>% nest_dt(cyl)
# Inspect the structure
# str(a)
# Inspect the list column
# a[[2]]
mtcars %>% nest_dt("cyl")
mtcars %>% nest_dt(cyl, vs)
mtcars %>% nest_dt(vs:am)
mtcars %>% nest_dt("cyl|vs")
mtcars %>% nest_dt(c("cyl", "vs"))
# Nest two sets of columns into two list columns
a = iris %>% nest_dt(mcols = list(petal = "^Pe", sepal = "^Se"))
# Inspect the second list column
# a[[3]]
# Restore the original; ndt is the list column to unnest
mtcars %>% nest_dt("cyl|vs") %>%
  unnest_dt(ndt)
mtcars %>% nest_dt("cyl|vs") %>%
  unnest_dt("ndt")
# List columns and ordinary columns can be built together
df <- data.table(
  a = list(c("a", "b"), "c"),
  b = list(c(TRUE, TRUE), FALSE),
  c = list(3, c(1, 2)),
  d = c(11, 22)
)
# str(df)

20 nth: extract a value from a vector

Extract the value at a given position. A negative position counts from the end.

x = 1:10
nth(x, 1)
nth(x, 5)
nth(x, -2)

22 pull_dt: extract a single variable from a data frame (as a vector)

Can you pull two columns at once? Of course not!

# These three forms return the same result
mtcars %>% pull_dt(2)
mtcars %>% pull_dt(cyl)
mtcars %>% pull_dt("cyl")
# Check the column names
colnames(mtcars)

24 relocate_dt: change column positions

This tool is very powerful, and it will also be very useful in the microbiome field.

df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df
# Move column f to the first position
df %>% relocate_dt(f)
# Move column a to the last position
df %>% relocate_dt(a, how = "last")
# Move the character columns to the front
df %>% relocate_dt(is.character)
# Move the numeric columns to the end
df %>% relocate_dt(is.numeric, how = "last")
# Move the columns whose names match the regular expression to the front
df %>% relocate_dt("[aeiou]")
# Place a after f
df %>% relocate_dt(a, how = "after", where = f)
# Place f before a
df %>% relocate_dt(f, how = "before", where = a)
# Place f before c
df %>% relocate_dt(f, how = "before", where = c)
df %>% relocate_dt(f, how = "after", where = c)
df2 <- data.table(a = 1, b = "a", c = 1, d = "a")
# Place the numeric columns after the character columns
df2 %>% relocate_dt(is.numeric,
                    how = "after",
                    where = is.character)
df2 %>% relocate_dt(is.numeric,
                    how = "before",
                    where = is.character)

25 rename_dt: rename columns

# Rename with new_name = old_name
iris %>%
  rename_dt(sl = Sepal.Length, sw = Sepal.Width) %>%
  head()

26 replace_dt: replace values in a column (conditionally)

iris %>% mutate_vars(is.factor, as.character) -> new_iris
# Specify the column and replace a string value
new_iris %>%
  replace_dt(Species, from = "setosa", to = "SS")
new_iris %>%
  replace_dt(Species, from = c("setosa", "virginica"), to = "sv")
# Replace a numeric value
new_iris %>%
  replace_dt(Petal.Width, from = .2, to = 2)
new_iris %>%
  replace_dt(from = .2, to = NA)
# Use a function as the condition
new_iris %>%
  replace_dt(is.numeric, from = function(x) x > 3, to = 9999)

27 rn_col: move row names into the first column and back

# Move the row names into the first column
mtcars %>% rn_col()
# Move the row names into the first column and name that column rn
mtcars %>% rn_col("rn")
# Assign to a new data frame
mtcars %>% rn_col() -> new_mtcars
# Reverse: turn the first column back into row names
new_mtcars %>% col_rn() -> old_mtcars
old_mtcars
setequal(mtcars, old_mtcars)

28 sample_n_dt: random row sampling

# Sample rows
sample_n_dt(mtcars, 10)
# Sample rows with replacement
sample_n_dt(mtcars, 50, replace = TRUE)
# Sample a fraction of the rows
sample_frac_dt(mtcars, 0.1)
# With replacement, you can sample more rows than the original data contains
sample_frac_dt(mtcars, 1.5, replace = TRUE)
# Alternative spelling
sample_dt(mtcars, n = 10)
sample_dt(mtcars, prop = 0.1)

29 select_dt: the column selection toolbox

# select_dt is a large function with many practical features
# Select one column
iris %>% select_dt(Species)
# Select two columns
iris %>% select_dt(Sepal.Length, Sepal.Width)
# Select every column between these two (inclusive)
iris %>% select_dt(Sepal.Length:Petal.Length)
# Drop one column
iris %>% select_dt(-Sepal.Length)
# Drop two columns
iris %>% select_dt(-Sepal.Length, -Petal.Length)
# Drop every column between these two (inclusive)
iris %>% select_dt(-(Sepal.Length:Petal.Length))
# Character vectors work the same way
iris %>% select_dt(c("Sepal.Length", "Sepal.Width"))
iris %>% select_dt(-c("Sepal.Length", "Sepal.Width"))
# Column indices work the same way
iris %>% select_dt(1)
iris %>% select_dt(-1)
iris %>% select_dt(1:3)
iris %>% select_dt(-(1:3))
iris %>% select_dt(1, 3)
# Partial (regular-expression) matching and logical operators are supported
iris %>% select_dt("Pe")
iris %>% select_dt(-"Se")
iris %>% select_dt(!"Se")
# ?select_dt
iris %>% select_dt("Pe", negate = TRUE)
iris %>% select_dt("Pe|Sp")
iris %>% select_dt(cols = 2:3)
# With negate = TRUE, the columns that do not match are returned
iris %>% select_dt(cols = 2:3, negate = TRUE)
iris %>% select_dt(cols = c("Sepal.Length", "Sepal.Width"))
iris %>% select_dt(cols = names(iris)[2:3])
iris %>% select_dt(is.factor)
iris %>% select_dt(-is.factor)
iris %>% select_dt(!is.factor)
# select_mix is very flexible: several kinds of selector can be combined in one call
select_mix(iris, Species, "Sepal.Length")
select_mix(iris, 1:2, is.factor)
select_mix(iris, Sepal.Length, is.numeric)
# rm.dup: whether to remove duplicated columns
select_mix(iris, Sepal.Length, is.numeric, rm.dup = FALSE)

30 separate_dt: split strings

Very useful for taxonomic annotation data.

# Split a string column
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
df %>% separate_dt(x, c("A", "B"))
# equals to
df %>% separate_dt("x", c("A", "B"))

31 slice_dt: select rows by position

iris %>% slice_dt(1:3)
iris %>% slice_dt(1, 3)
iris %>% slice_dt(c(1, 3))

31 summarise_dt: summary statistics for data frames

# Mean of one column
iris %>% summarise_dt(avg = mean(Sepal.Length))
# by argument: mean per group
iris %>% summarise_dt(avg = mean(Sepal.Length), by = Species)
# Mean by several grouping columns
mtcars %>% summarise_dt(avg = mean(hp), by = .(cyl, vs))
# Count rows per group
mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs))
# `.` is short for list
# Minimum of each numeric column
iris %>% summarise_vars(is.numeric, min)
# Same as above
iris %>% summarise_vars(-is.factor, min)
# Minimum of the first four columns
iris %>% summarise_vars(1:4, min)
# Convert all columns to character
iris %>% summarise_vars(.func = as.character)
# Minimum of the numeric columns per group
iris %>% summarise_vars(is.numeric, min, by = "Species")
# To group by two columns, separate them with a comma inside one quoted string
mtcars %>% summarise_vars(is.numeric, mean, by = "vs,am")

32 sys_time_print: measure run time

sys_time_print(Sys.sleep(1))
a = iris
# tidyfst is usually applied to large data, where run time matters, so it provides a function for timing code
sys_time_print({
  res = iris %>%
    mutate_dt(one = 1)
})
res

33 top_n_dt: extract the top rows by a column

# The 10 rows with the largest Sepal.Length
iris %>% top_n_dt(10, Sepal.Length)
# A negative n returns the 10 rows with the smallest Sepal.Length
iris %>% top_n_dt(-10, Sepal.Length)
iris %>% top_frac_dt(.1, Sepal.Length)
iris %>% top_frac_dt(-.1, Sepal.Length)
# For `top_dt`, you can use both modes above
iris %>% top_dt(Sepal.Length, n = 10)
iris %>% top_dt(Sepal.Length, prop = .1)

34 t_dt: transpose a data frame

# ?t_dt
t_dt(iris)
t_dt(mtcars)

35 uncount_dt: expand counts into individual rows

df <- data.table(x = c("a", "b"), n = c(1, 2))
df
# Repeat each row according to its count
uncount_dt(df, n)
# With FALSE as the third argument, the count column is kept in the output
uncount_dt(df, n, FALSE)

36 unite_dt: merge columns

This is very helpful for taxonomic annotation data in metagenomics.

df <- expand.grid(x = c("a", NA), y = c("b", NA))
df
# Treat missing value as character "NA"
df %>% unite_dt("z", x:y, remove = FALSE)
# With na.rm = TRUE, missing values are removed before the columns are united
df %>% unite_dt("z", x:y, na.rm = TRUE, remove = FALSE)
# By default missing values are kept
df %>% unite_dt("xy", x:y)
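The two ways of handling missing values when uniting columns can be illustrated in base R. This sketch uses paste and is only an analogy: the separator "_" and the column names z_keep and z_drop are chosen for illustration, and the exact strings produced by unite_dt are not asserted here.

```r
df <- expand.grid(x = c("a", NA), y = c("b", NA), stringsAsFactors = FALSE)

# Keeping missing values: paste() turns NA into the string "NA"
df$z_keep <- paste(df$x, df$y, sep = "_")

# Dropping missing values first: only the non-NA parts of each row are joined
df$z_drop <- apply(df[, c("x", "y")], 1,
                   function(r) paste(r[!is.na(r)], collapse = "_"))

print(df)
```

For taxonomic strings this difference matters: keeping NA yields labels such as "a_NA", while dropping NA shortens the label to the ranks that are actually assigned.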
# Unite all columns into one (the empty pattern matches every column)
iris %>% unite_dt("merged_name", "")

37 utf8_encoding: encode a data frame as UTF-8

This is very helpful for Chinese text.

utf8_encoding(iris)

38 wider_dt: long to wide

# Build the data and convert it to long format
longer_stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
) %>%
  longer_dt(time)
longer_stocks
# Long to wide
longer_stocks %>%
  wider_dt("time",
           name = "name",
           value = "value")
# Add a filler column and spread that instead
longer_stocks %>%
  mutate_dt(one = 1) %>%
  wider_dt("time",
           name = "name",
           value = "one")
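The long and wide reshapes can be cross-checked with data.table's own melt and dcast. This is a sketch of the presumed equivalents of longer_dt and wider_dt (an assumption, since the internals are not shown above); the column names name and value are passed explicitly to match the tidyfst examples.

```r
library(data.table)

set.seed(1)
stocks <- data.table(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# Wide to long: keep `time` as the id column, stack X, Y and Z
long <- melt(stocks, id.vars = "time",
             variable.name = "name", value.name = "value")

# Long to wide: one column per level of `name`
wide <- dcast(long, time ~ name, value.var = "value")

print(head(long))
print(head(wide))
```

A round trip through melt and dcast should return the original values, which is a convenient sanity check before applying the same steps to a real abundance table.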
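## using "fun" parameter for aggregation
DT <- data.table(v1 = rep(1:2, each = 6),
                 v2 = rep(rep(1:3, 2), each = 2),
                 v3 = rep(1:2, 6),
                 v4 = rnorm(6))
DT
## Use two columns as labels, then compute the sum
DT %>%
  wider_dt(v1, v2,
           value = "v4",
           name = ".",
           fun = sum)
# Compute the minimum
DT %>%
  wider_dt(v1, v2,
           value = "v4",
           name = ".",
           fun = min)

Afterword

That completes my study of data processing with tidyfst. I have annotated the code along the way, so it should be easy to follow. About 5% of the code is still unclear to me; for that I will need to read the source code or keep reading the author's documentation.

Once I finished, it struck me that when I started learning R, dplyr was not yet popular and nobody taught me tools of this kind, so my way of handling data frames still carries traces of plyr, apply, and even Perl, with many operations written as for loops. To handle large data I now have to strip all of that out and switch to the readable style of dplyr and tidyr.

This study used example data. It still needs to be run on real data, which I will test in the next post. I hope it does not disappoint.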