Repost notice: this article is original content from Lighthouse Big Data (燈塔大數據). Individuals are welcome to share it to their WeChat Moments; other organizations reposting it should mark the top of the article with "Reposted from: Lighthouse Big Data."
Top 15 Python Libraries for Data Science in 2017

As Python has gained a lot of traction in recent years in the data science industry, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience. This installment covers the second half of the list, libraries 8 through 15.
And since all of these libraries are open source, we have added the commit count, contributor count, and other metrics from GitHub, which can serve as proxy measures of each library's popularity.
Machine Learning.
8. SciKit-Learn (Commits: 21793, Contributors: 842)

Scikits are additional packages of the SciPy Stack designed for specific functionality, such as image processing and machine learning facilitation. With regard to the latter, one of the most prominent of these packages is scikit-learn. The package is built on top of SciPy and makes heavy use of its math operations.

scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems. The library combines quality code with good documentation, ease of use, and high performance, and it is the de facto industry standard for machine learning with Python.
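To make that consistent interface concrete, here is a minimal sketch of the usual estimator workflow; the dataset and model choice are our own illustration rather than anything prescribed by the library:

```python
# Fit and evaluate a classifier with the standard scikit-learn workflow.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)         # every estimator exposes fit()
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```

The same fit/score pattern applies to virtually every estimator in the library, which is what makes swapping models in and out so painless.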
Deep Learning: Keras / TensorFlow / Theano

With regard to deep learning, one of the most prominent and convenient libraries for Python in this field is Keras, which can function on top of either TensorFlow or Theano. Let's reveal some details about all of them.
9. Theano (Commits: 25870, Contributors: 300)

Theano is a Python package that defines multi-dimensional arrays similar to NumPy's, along with math operations and expressions. The library is compiled, letting it run efficiently on all architectures. Originally developed by the Machine Learning group of Université de Montréal, it is primarily used for the needs of machine learning.

The important thing to note is that Theano integrates tightly with NumPy at the low level of its operations. The library also optimizes the use of GPU and CPU, making data-intensive computation even faster. Efficiency and stability tweaks allow much more precise results even with very small values; for example, the computation of log(1+x) gives sensible results even for the smallest values of x.

10. TensorFlow (Commits: 16785, Contributors: 795)

Coming from developers at Google, TensorFlow is an open-source library for dataflow-graph computations, sharpened for machine learning. It was designed to meet Google's high-demand requirements for training neural networks and is a successor of DistBelief, a machine learning system based on neural networks. However, TensorFlow isn't strictly for scientific use within Google's borders; it is general enough to be used in a variety of real-world applications.

The key feature of TensorFlow is its multi-layered node system, which enables quick training of artificial neural networks on large datasets. This powers Google's voice recognition and object identification from pictures.
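As a small illustration of the dataflow-graph idea, here is a minimal sketch in the graph-and-session style of the TensorFlow 1.x releases that were current when this article was written; newer TensorFlow versions execute eagerly and expose this API under tf.compat.v1:

```python
# Build a tiny dataflow graph, then execute it in a session.
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")  # graph inputs, filled in at run time
b = tf.placeholder(tf.float32, name="b")
c = a * b                                 # a node in the graph; nothing is computed yet

with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # prints 12.0
```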
11. Keras (Commits: 3519, Contributors: 428)

And finally, let's look at Keras. It is an open-source library, written in Python, for building neural networks through a high-level interface. It is minimalistic and straightforward, with a high level of extensibility. It uses Theano or TensorFlow as its backend, and Microsoft is now working to integrate CNTK (Microsoft's Cognitive Toolkit) as a new backend.

The minimalistic design aims at fast and easy experimentation through the building of compact systems. Keras is really easy to get started with and to keep going with for quick prototyping. It is written in pure Python and is high-level by nature, as well as highly modular and extensible. Notwithstanding its ease, simplicity, and high-level orientation, Keras is still deep and powerful enough for serious modeling.

The general idea of Keras is based on layers, and everything else is built around them. Data is prepared as tensors; the first layer is responsible for the input tensors, the last layer is responsible for output, and the model is built in between.
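A minimal sketch of that layer-centric idiom, assuming the standalone keras package of that era with a TensorFlow or Theano backend; the layer sizes here are arbitrary, for illustration only:

```python
# Stack layers between the input tensor and the output tensor.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(20,)))  # first layer receives the input tensor
model.add(Dense(1, activation='sigmoid'))                   # last layer produces the output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```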
Natural Language Processing.
12. NLTK (Commits: 12449, Contributors: 196)

The name of this suite of libraries stands for Natural Language Toolkit and, as the name implies, it is used for common tasks of symbolic and statistical natural language processing. NLTK was intended to facilitate teaching and research in NLP and related fields (linguistics, cognitive science, artificial intelligence, etc.), and it is used with that focus today.

NLTK supports a wide range of operations, such as text tagging, classification and tokenizing, named entity identification, building corpus trees that reveal inter- and intra-sentence dependencies, stemming, and semantic reasoning. All of these building blocks allow complex research systems to be built for different tasks, for example sentiment analytics or automatic summarization.
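A minimal sketch of a few of those building blocks; the example sentence is our own, and the commented nltk.download() calls fetch the models these functions rely on:

```python
import nltk

# One-time downloads of the required models and corpora:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Guido van Rossum created Python at CWI in Amsterdam."
tokens = nltk.word_tokenize(sentence)  # tokenizing
tagged = nltk.pos_tag(tokens)          # part-of-speech tagging
tree = nltk.ne_chunk(tagged)           # named entity identification
print(tree)
```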
13. Gensim (Commits: 2878, Contributors: 179)

Gensim is an open-source library for Python that implements tools for vector space modeling and topic modeling. The library is designed to be efficient with large texts, not only texts that fit in memory. The efficiency is achieved through extensive use of NumPy data structures and SciPy operations. It is both efficient and easy to use.

Gensim is intended for use with raw, unstructured digital texts. It implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec, and document2vec, which facilitate examining texts for recurring patterns of words across a set of documents (often referred to as a corpus). All of the algorithms are unsupervised, meaning no annotations are needed; the only input is the corpus.
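A minimal sketch of that corpus workflow: build a dictionary, convert documents to bag-of-words vectors, and fit an LDA topic model. The toy documents and topic count are our own illustration:

```python
from gensim import corpora, models

documents = [
    ["human", "machine", "interface", "time"],
    ["graph", "trees", "minors", "survey"],
    ["human", "interface", "survey", "graph"],
]
dictionary = corpora.Dictionary(documents)               # token -> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # the corpus is the only input
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)  # unsupervised topic model
print(lda.print_topics())
```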
Data Mining. Statistics.
14. Scrapy (Commits: 6325, Contributors: 243)

Scrapy is a library for making crawling programs, also known as spider bots, that retrieve structured data such as contact info or URLs from the web. It is open source and written in Python. It was originally designed strictly for scraping, as its name indicates, but it has evolved into a full-fledged framework that can gather data from APIs and act as a general-purpose crawler.

The library follows the famous Don't Repeat Yourself principle in its interface design: it prompts its users to write general, reusable code, which makes it suitable for building and scaling large crawlers. The architecture of Scrapy is built around the Spider class, which encapsulates the set of instructions followed by the crawler.
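A minimal sketch of such a Spider subclass, using the practice site from Scrapy's own tutorial; the CSS selectors match that site and would need adapting elsewhere:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Encapsulates the crawl instructions: where to start and how to parse."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield structured items extracted from the page...
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # ...and follow pagination links, letting Scrapy schedule the requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running `scrapy runspider quotes_spider.py -o quotes.json` would crawl the site and write the collected items to a JSON file.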
15. Statsmodels (Commits: 8960, Contributors: 119)

As you have probably guessed from the name, statsmodels is a library for Python that enables its users to conduct data exploration through various methods of estimating statistical models and performing statistical tests and analysis.

Among its many useful features are descriptive and result statistics via linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators. The library also provides extensive plotting functions designed specifically for statistical analysis and tweaked for good performance with big statistical data sets.
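A minimal sketch of the estimation workflow, fitting an ordinary least squares regression; the data here is synthetic, generated just for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)  # true slope 2, intercept 1

X = sm.add_constant(x)      # add the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())      # the descriptive and result statistics mentioned above
```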
Conclusions.

These are the libraries considered to be at the top of the list by many data scientists and engineers, and they are worth looking at, or at least familiarizing yourself with.