1. Data Transformation

So far we have looked at rearranging data. Another important class of operations consists of filtering, cleaning, and other transformations.

2. Removing Duplicates

Duplicate rows often turn up in a DataFrame. Here is an example:

In [4]: data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
   ...:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [5]: data
Out[5]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

[7 rows x 2 columns]

The duplicated method returns a boolean Series indicating whether each row is a duplicate of an earlier one:

In [6]: data.duplicated()
Out[6]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame with the duplicated rows removed:

In [7]: data.drop_duplicates()
Out[7]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

[4 rows x 2 columns]

Both methods consider all of the columns by default; alternatively, you can specify a subset of them to detect duplicates on. Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:

In [8]: data['v1'] = range(7)

In [9]: data
Out[9]:
    k1  k2  v1
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6

[7 rows x 3 columns]

In [10]: data.drop_duplicates(['k1'])
Out[10]:
    k1  k2  v1
0  one   1   0
3  two   3   3

[2 rows x 3 columns]

duplicated and drop_duplicates keep the first observed value combination by default. Passing take_last=True keeps the last one instead (in modern pandas this argument has been replaced by keep='last'):

In [11]: data.drop_duplicates(['k1', 'k2'], take_last=True)
Out[11]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

[4 rows x 3 columns]

3. Transforming Data Using a Function or Mapping

For many data sets, you may wish to perform a transformation based on the values in an array, Series, or DataFrame column. Consider the following data about various kinds of meat:

In [12]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
   ....:                               'corned beef', 'Bacon', 'pastrami',
   ....:                               'honey ham', 'nova lox'],
   ....:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [13]: data
Out[13]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

[9 rows x 2 columns]

Suppose you wanted to add a column indicating the animal that each food came from. Let's write down a mapping:

In [14]: meat_to_animal = {
   ....:     'bacon': 'pig',
   ....:     'pulled pork': 'pig',
   ....:     'pastrami': 'cow',
   ....:     'corned beef': 'cow',
   ....:     'honey ham': 'pig',
   ....:     'nova lox': 'salmon'
   ....: }

Some of the meats are capitalized while others are not, so we first convert each value to lowercase with str.lower before mapping:

In [15]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

In [16]: data
Out[16]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

[9 rows x 3 columns]

We could also have passed a function that does all the work:

In [17]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[17]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Note: using map is a convenient way to perform element-wise transformations and other data-cleaning operations.

4. Replacing Values

Filling in missing data with the fillna method can be seen as a special case of more general value replacement. While map, as shown above, can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Consider this Series:

In [18]: data = pd.Series([1., -999, 2., -999, -1000., 3.])

In [19]: data
Out[19]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: float64

The -999 values might be sentinel values for missing data. To replace them with NaN values that pandas understands, use replace, which produces a new Series:

In [20]: data.replace(-999, np.nan)
Out[20]:
0       1
1     NaN
2       2
3     NaN
4   -1000
5       3
dtype: float64

To replace multiple values at once, pass a list and then the substitute value:

In [21]: data.replace([-999, -1000], np.nan)
Out[21]:
0     1
1   NaN
2     2
3   NaN
4   NaN
5     3
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [22]: data.replace([-999, -1000], [np.nan, 0])
Out[22]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

The argument passed can also be a dict:

In [23]: data.replace({-999: np.nan, -1000: 0})
Out[23]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

5. Renaming Axis Indexes

Like values in a Series, axis labels can be transformed by a function or some form of mapping to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure. Here is a simple example:

In [24]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....:                     index=['Ohio', 'Colorado', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])
Like a Series, the axis indexes have a map method:

In [25]: data.index.map(str.upper)
Out[25]: array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)

You can assign the result to index, modifying the DataFrame in place:

In [26]: data.index = data.index.map(str.upper)

In [27]: data
Out[27]:
          one  two  three  four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

[3 rows x 4 columns]

If you want to create a transformed version of the data set without modifying the original, a useful method is rename:

In [28]: data.rename(index=str.title, columns=str.upper)
Out[28]:
          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

[3 rows x 4 columns]

Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [31]: data.rename(index={'OHIO': 'INDIANA'},
   ....:             columns={'three': 'peekaboo'})
Out[31]:
          one  two  peekaboo  four
INDIANA     0    1         2     3
COLORADO    4    5         6     7
NEW YORK    8    9        10    11

[3 rows x 4 columns]

rename saves you from the chore of copying the DataFrame by hand. If you wish to modify a data set in place, pass inplace=True:

In [32]: # always returns a reference to the DataFrame

In [33]: _ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [34]: data
Out[34]:
          one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

[3 rows x 4 columns]

6. Discretization and Binning

Continuous data is often discretized or otherwise split into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [35]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

To divide these into bins, use pd.cut with a list of bin edges:

In [36]: bins = [18, 25, 35, 60, 100]

In [37]: cats = pd.cut(ages, bins)

In [38]: cats
Out[38]:
(18, 25]
(18, 25]
(18, 25]
(25, 35]
(18, 25]
(18, 25]
(35, 60]
(25, 35]
(60, 100]
(35, 60]
(35, 60]
(25, 35]
Levels (4): Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype=object)

The object returned is a Categorical. You can treat it like an array of strings giving each bin name; internally it contains a levels array of the distinct category names along with an integer labels array for each age:

In [39]: cats.labels
Out[39]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])

In [40]: cats.levels
Out[40]: Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [41]: pd.value_counts(cats)
Out[41]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
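The labels and levels attributes in the transcript above come from an older pandas; in current versions the Categorical exposes the same information as codes and categories instead. A minimal sketch (assuming pandas >= 0.15):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
cats = pd.cut(ages, [18, 25, 35, 60, 100])

# .codes holds the integer bin id for each age and .categories holds
# the four intervals themselves, replacing .labels and .levels.
print([int(c) for c in cats.codes])
print(cats.categories)
```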
Consistent with mathematical interval notation, a parenthesis means that side is open (exclusive), while a square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:

In [42]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[42]:
[18, 26)
[18, 26)
[18, 26)
[26, 36)
[18, 26)
[18, 26)
[36, 61)
[26, 36)
[61, 100)
[36, 61)
[36, 61)
[26, 36)
Levels (4): Index(['[18, 26)', '[26, 36)', '[36, 61)', '[61, 100)'], dtype=object)

You can also pass your own bin names via the labels option:

In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [44]: pd.cut(ages, bins, labels=group_names)
Out[44]:
Youth
Youth
Youth
YoungAdult
Youth
Youth
MiddleAged
YoungAdult
Senior
MiddleAged
MiddleAged
YoungAdult
Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)

If you pass cut an integer number of bins instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data. Consider some uniformly distributed data chopped into fourths (precision=2 limits the decimal precision of the bin edges):

In [45]: data = np.random.rand(20)

In [46]: pd.cut(data, 4, precision=2)
Out[46]:
(0.037, 0.26]
(0.037, 0.26]
(0.48, 0.7]
(0.7, 0.92]
(0.037, 0.26]
(0.037, 0.26]
(0.7, 0.92]
(0.7, 0.92]
(0.037, 0.26]
(0.26, 0.48]
(0.26, 0.48]
(0.26, 0.48]
(0.037, 0.26]
(0.26, 0.48]
(0.48, 0.7]
(0.7, 0.92]
(0.037, 0.26]
(0.7, 0.92]
(0.037, 0.26]
(0.037, 0.26]
Levels (4): Index(['(0.037, 0.26]', '(0.26, 0.48]', '(0.48, 0.7]',
'(0.7, 0.92]'], dtype=object)

A closely related function, qcut, bins the data based on sample quantiles. Since qcut uses the quantiles of the data, by definition you obtain roughly equal-size bins:

In [48]: data = np.random.randn(1000)  # normally distributed

In [49]: cats = pd.qcut(data, 4)  # cut into quartiles

In [50]: cats
Out[50]:
[-3.636, -0.717]
(0.647, 3.531]
[-3.636, -0.717]
[-3.636, -0.717]
[-3.636, -0.717]
(0.647, 3.531]
[-3.636, -0.717]
(-0.717, -0.0323]
(-0.717, -0.0323]
(0.647, 3.531]
[-3.636, -0.717]
(-0.717, -0.0323]
(0.647, 3.531]
...
[-3.636, -0.717]
[-3.636, -0.717]
(0.647, 3.531]
(-0.717, -0.0323]
(0.647, 3.531]
[-3.636, -0.717]
[-3.636, -0.717]
(-0.0323, 0.647]
[-3.636, -0.717]
(-0.717, -0.0323]
(-0.717, -0.0323]
(-0.0323, 0.647]
(0.647, 3.531]
Levels (4): Index(['[-3.636, -0.717]', '(-0.717, -0.0323]',
'(-0.0323, 0.647]', '(0.647, 3.531]'], dtype=object)
Length: 1000

In [51]: pd.value_counts(cats)
Out[51]:
(-0.717, -0.0323]    250
(-0.0323, 0.647]     250
(0.647, 3.531]       250
[-3.636, -0.717]     250
dtype: int64
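As an aside, qcut (like cut) can return bare integer bin ids instead of a Categorical by passing labels=False, which is handy when the codes are all you need. A small sketch on fresh random data (the generator and seed here are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# labels=False yields the quartile index (0-3) of each observation
# rather than interval-labeled categories.
quartile_ids = pd.qcut(data, 4, labels=False)
print(np.bincount(quartile_ids))  # 250 observations land in each quartile
```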
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [52]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[52]:
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
[-3.636, -1.323]
[-3.636, -1.323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-1.323, -0.0323]
(1.234, 3.531]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-0.0323, 1.234]
...
[-3.636, -1.323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-0.0323, 1.234]
[-3.636, -1.323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-0.0323, 1.234]
Levels (4): Index(['[-3.636, -1.323]', '(-1.323, -0.0323]',
'(-0.0323, 1.234]', '(1.234, 3.531]'], dtype=object)
Length: 1000

Note: we will return to cut and qcut later when discussing aggregation and group operations, as these discretization functions are particularly important for quantile and group analysis.

7. Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [53]: np.random.seed(12345)

In [54]: data = pd.DataFrame(np.random.randn(1000, 4))

In [55]: data.describe()
Out[55]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528

[8 rows x 4 columns]
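As an aside, values outside a range can also be capped in a single call with the clip method; a minimal sketch on data like the above:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))

# clip bounds every value to [-3, 3], an alternative to masked
# assignment for capping outliers.
capped = data.clip(-3, 3)
print(float(capped.abs().max().max()))  # the extremes are capped at 3.0
```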
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [56]: col = data[3]

In [57]: col[np.abs(col) > 3]
Out[57]:
97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64

To select all rows having a value exceeding 3 or -3, use the any method on a boolean DataFrame:

In [58]: data[(np.abs(data) > 3).any(1)]
Out[58]:
            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
400  0.146326  0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301
900  3.366626 -2.372214  0.851010  1.332846

[11 rows x 4 columns]

Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:

In [59]: data[np.abs(data) > 3] = np.sign(data) * 3

In [60]: data.describe()
Out[60]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067623     0.068473     0.025153    -0.002081
std       0.995485     0.990253     1.003977     0.989736
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.000000     2.653656     3.000000     3.000000

[8 rows x 4 columns]

Note: the ufunc np.sign returns an array of 1s and -1s depending on the sign of the original values.

8. Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [61]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

In [62]: sampler = np.random.permutation(5)

In [63]: sampler
Out[63]: array([1, 0, 2, 3, 4])

In [64]: df
Out[64]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

[5 rows x 4 columns]
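As a modern aside, recent pandas (0.16 and later) offers DataFrame.sample, which covers row permutation and sampling directly; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# frac=1 shuffles all rows; n draws that many rows without replacement;
# replace=True samples with replacement. random_state makes it repeatable.
shuffled = df.sample(frac=1, random_state=42)
subset = df.sample(n=3, random_state=42)
draws = df.sample(n=10, replace=True, random_state=42)
print(len(shuffled), len(subset), len(draws))  # 5 3 10
```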
The sampler array can then be used in take-based indexing:

In [65]: df.take(sampler)
Out[65]:
    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

[5 rows x 4 columns]

To select a random subset without replacement, one way is to slice off the first k elements of the array returned by permutation, where k is the desired subset size:

In [66]: df.take(np.random.permutation(len(df))[:3])
Out[66]:
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19

[3 rows x 4 columns]

To generate a sample with replacement, draw random integers with np.random.randint:

In [67]: bag = np.array([5, 7, -1, 6, 4])

In [68]: sampler = np.random.randint(0, len(bag), size=10)

In [69]: sampler
Out[69]: array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])

In [70]: draws = bag.take(sampler)

In [71]: draws
Out[71]: array([ 4,  4, -1, -1, -1,  5,  6,  5,  4,  7])

9. Computing Indicator/Dummy Variables

Another type of transformation commonly used for statistical modeling or machine learning is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Returning to an earlier example:

In [72]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   ....:                    'data1': range(6)})

In [73]: pd.get_dummies(df['key'])
Out[73]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

[6 rows x 3 columns]

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [74]: dummies = pd.get_dummies(df['key'], prefix='key')

In [75]: df_with_dummy = df[['data1']].join(dummies)

In [76]: df_with_dummy
Out[76]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

[6 rows x 4 columns]

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Consider the MovieLens 1M data set:

In [77]: mnames = ['movie_id', 'title', 'genres']

In [78]: movies = pd.read_table('movies.dat', sep='::', header=None,
   .....:                       names=mnames)

In [79]: movies[:10]
Out[79]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
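In recent pandas, multi-membership indicators like these genre columns can also be built in one step with Series.str.get_dummies; a minimal sketch on a few hypothetical genre strings:

```python
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy",
                    "Adventure|Children's|Fantasy",
                    'Comedy|Romance'])

# Split each string on '|' and produce one 0/1 column per
# distinct genre, in sorted order.
dummies = genres.str.get_dummies('|')
print(dummies.columns.tolist())
```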
Adding indicator variables for each genre requires a little wrangling. First, extract the list of unique genres in the data set:

In [80]: genre_iter = (set(x.split('|')) for x in movies.genres)

In [81]: genres = sorted(set.union(*genre_iter))

One way to construct the indicator DataFrame is to start with one consisting entirely of zeros:

In [82]: dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

Then, iterate through each movie and set the appropriate entries in each row of dummies to 1 (the .ix indexer used here has since been removed from pandas; use .loc in modern versions):

In [83]: for i, gen in enumerate(movies.genres):
   .....:     dummies.ix[i, gen.split('|')] = 1

Then, as before, join this with movies:

In [84]: movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [85]: movies_windic.ix[0]
Out[85]:
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Action                                   0
Genre_Adventure                                0
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Crime                                    0
Genre_Documentary                              0
Genre_Drama                                    0
Genre_Fantasy                                  0
Genre_Film-Noir                                0
Genre_Horror                                   0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Romance                                  0
Genre_Sci-Fi                                   0
Genre_Thriller                                 0
Genre_War                                      0
Genre_Western                                  0
Name: 0

Note: for much larger data, this way of constructing indicator variables with multiple membership is not especially fast. A lower-level function exploiting the DataFrame's internals would be needed to go faster.

A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:

In [86]: values = np.random.rand(10)

In [87]: values
Out[87]:
array([ 0.75603383,  0.90830844,  0.96588737,  0.17373658,  0.87592824,
        0.75415641,  0.163486  ,  0.23784062,  0.85564381,  0.58743194])

In [88]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [89]: pd.get_dummies(pd.cut(values, bins))
Out[89]:
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0         0           0           0           1         0
1         0           0           0           0         1
2         0           0           0           0         1
3         1           0           0           0         0
4         0           0           0           0         1
5         0           0           0           1         0
6         1           0           0           0         0
7         0           1           0           0         0
8         0           0           0           0         1
9         0           0           1           0         0

[10 rows x 5 columns]
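The same recipe works with named bins; a reproducible sketch (the seed and bin names here are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
values = rng.random(10)

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
names = ['very_low', 'low', 'mid', 'high', 'very_high']

# One indicator column per named bin; each value falls into
# exactly one bin, so every row sums to 1.
indicators = pd.get_dummies(pd.cut(values, bins, labels=names))
print(indicators.sum(axis=1).tolist())
```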