9 Lesser-Known Python Libraries for Machine Learning | With Code Examples

Posted 2021-03-31 小白玩转Python

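This article presents a collection of Python libraries for machine learning. Over the past few years, a considerable number of machine learning libraries have appeared on GitHub, so I have compiled a list of 9 useful packages that are under development.

Whether you are a beginner or an expert in machine learning, the list is well worth a look, because you may find a library that is useful for your project.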

Chainer

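Link: https://pypi.org/project/chainer/

Chainer is a deep learning framework designed to speed up the research process. It provides a way to implement most state-of-the-art models, including but not limited to recurrent neural networks and variational autoencoders. As of December 2019, development is limited to bug fixes and maintenance.

Installation

To install Chainer, run the following command:

pip install chainer

Example usage

The following example code determines whether a mushroom is edible or poisonous.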

import chainer as ch
from chainer import datasets
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions

import numpy as np
import matplotlib
matplotlib.use('Agg')

# Load the dataset and encode every categorical column as integer codes
mushroomsfile = 'mushrooms.csv'
data_array = np.genfromtxt(
    mushroomsfile, delimiter=',', dtype=str, skip_header=1)
for col in range(data_array.shape[1]):
    data_array[:, col] = np.unique(data_array[:, col], return_inverse=True)[1]

X = data_array[:, 1:].astype(np.float32)
Y = data_array[:, 0].astype(np.int32)[:, None]
train, test = datasets.split_dataset_random(
    datasets.TupleDataset(X, Y), int(data_array.shape[0] * .7))

# Configure iterators
train_iter = ch.iterators.SerialIterator(train, 100)
test_iter = ch.iterators.SerialIterator(
    test, 100, repeat=False, shuffle=False)

# Network definition
def MLP(n_units, n_out):
    layer = ch.Sequential(L.Linear(n_units), F.relu)
    model = layer.repeat(2)
    model.append(L.Linear(n_out))
    return model

# Model initialization
model = L.Classifier(
    MLP(44, 1), lossfun=F.sigmoid_cross_entropy, accfun=F.binary_accuracy)

# Set up an optimizer
optimizer = ch.optimizers.SGD().setup(model)

# Create the updater, using the optimizer
updater = training.StandardUpdater(train_iter, optimizer, device=-1)

# Set up a trainer
trainer = training.Trainer(updater, (50, 'epoch'), out='result')

# Evaluate the model with the test dataset for each epoch
trainer.extend(extensions.Evaluator(test_iter, model, device=-1))

# Dump a computational graph from 'loss' variable at the first iteration
# The "main" refers to the target link of the "main" optimizer.
trainer.extend(extensions.DumpGraph('main/loss'))

# Take a snapshot once every 20 epochs
trainer.extend(extensions.snapshot(), trigger=(20, 'epoch'))

# Write a log of evaluation statistics for each epoch
trainer.extend(extensions.LogReport())

# Save two plot images to the result dir
trainer.extend(
    extensions.PlotReport(['main/loss', 'validation/main/loss'],
                          'epoch', file_name='loss.png'))
trainer.extend(
    extensions.PlotReport(
        ['main/accuracy', 'validation/main/accuracy'],
        'epoch', file_name='accuracy.png'))

# Print selected entries of the log to stdout
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))

# Run the training
trainer.run()

# Prediction on one random test sample
x, t = test[np.random.randint(len(test))]

predict = model.predictor(x[None]).array
predict = predict[0][0]

# The network outputs a raw logit, so 0 is the decision boundary
if predict >= 0:
    print('Predicted Poisonous, Actual ' + ['Edible', 'Poisonous'][t[0]])
else:
    print('Predicted Edible, Actual ' + ['Edible', 'Poisonous'][t[0]])
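The final check compares the network output with 0 rather than with 0.5 because the output is a raw logit, not a probability. As a minimal sketch in plain Python (this is not Chainer code, and the helper names `sigmoid` and `classify` are made up for illustration), a logit of 0 maps to a sigmoid probability of 0.5, so thresholding the logit at 0 is equivalent to thresholding the probability at 0.5:

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(logit):
    """Hypothetical helper: label 1 (Poisonous) when the logit is non-negative."""
    return "Poisonous" if logit >= 0 else "Edible"


for logit in (-2.0, 0.0, 1.5):
    # Print the logit, its probability, and the resulting label
    print(logit, round(sigmoid(logit), 3), classify(logit))
```

Working on the logit avoids an extra sigmoid call at prediction time, which is presumably why the example tests `predict >= 0` directly.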

Data Version Control (DVC)

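Link: https://pypi.org/project/dvc/

As the name suggests, DVC is a version control tool built specifically for machine learning and data science projects. It resembles a combination of Git-LFS and Makefiles. It stores data and models and connects them to a Git repository. It also records the instructions for building models from other data and code.

Installation

There are quite a few ways to install DVC.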

# snapcraft / Linux
snap install dvc --classic
# chocolatey / Windows
choco install dvc
# homebrew / macOS
brew install dvc
# anaconda
conda install -c conda-forge dvc
# pip
pip install dvc

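Additional dependencies are available depending on your preferred remote storage.

Example usage

To track data, run the following commands:

git add train.py
dvc add images.zip

To connect code and data, use:

dvc run -d images.zip -o images/ unzip -q images.zip
dvc run -d images/ -d train.py -o model.p python train.py

Make changes and reproduce:

vi train.py
dvc repro model.p.dvc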
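The reproduce step only makes sense if the tool can tell that a dependency has changed since the last run. The sketch below illustrates that general idea in plain Python, under stated assumptions: it is not DVC's implementation or file format, SHA-256 is chosen only for the sketch, and `file_hash` and `is_stale` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(dependencies, recorded):
    """A stage is stale if any dependency's hash differs from the recorded one."""
    return any(file_hash(p) != recorded.get(str(p)) for p in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text("print('v1')\n")
    # Record the hash at the time the stage was last run
    recorded = {str(script): file_hash(script)}
    fresh_before_edit = not is_stale([script], recorded)
    # Editing the script changes its hash, so the stage must be rerun
    script.write_text("print('v2')\n")
    stale_after_edit = is_stale([script], recorded)

print(fresh_before_edit, stale_after_edit)
```

Comparing content hashes rather than timestamps means that touching a file without changing it does not trigger a rebuild.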

You can also share code through git commands as usual:

git add .
git commit -m 'The baseline model'
git push

To share data and machine learning models, use:

dvc remote add myremote -d s3://mybucket/image_cnn
dvc push

Neural Network Intelligence (NNI)

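Link: https://pypi.org/project/nni/

NNI is mainly used to automate parts of the machine learning lifecycle, including feature engineering, neural architecture search and hyperparameter tuning. It provides both a command-line tool and a web UI.

Installation

NNI can be installed with pip. Some examples require TensorFlow 1.x instead. Consult the official documentation for more information.

pip install --upgrade nni

Example usage

Run the following command in a terminal to launch the MNIST example: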

nnictl create --config nni\examples\trials\mnist-tfv1\config_windows.yml
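To make the idea of automated hyperparameter tuning concrete, here is a minimal random-search sketch in plain Python. It does not use NNI's API: the `objective` function is a made-up stand-in for a real training run, and the search space, `random_search` and its parameters are assumptions chosen for illustration.

```python
import random


def objective(params):
    """Hypothetical validation loss, lowest at lr=0.1 and hidden=64."""
    return (params["lr"] - 0.1) ** 2 + ((params["hidden"] - 64) / 64) ** 2


def random_search(search_space, n_trials, seed=0):
    """Sample random configurations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its list of choices
        params = {name: rng.choice(choices)
                  for name, choices in search_space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


search_space = {"lr": [0.001, 0.01, 0.1, 0.5], "hidden": [16, 32, 64, 128]}
best_params, best_loss = random_search(search_space, n_trials=200)
print(best_params, best_loss)
```

A tool such as NNI wraps this kind of loop, replaces the random sampler with smarter tuners, and runs the trials for you.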

ONNX Runtime

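Link: https://pypi.org/project/onnxruntime/

If you are looking for a way to write code in Python but deploy it in C#/C++/Java applications, ONNX Runtime is the right choice. It is primarily a cross-platform accelerator package for inference and training.

It also supports training existing PyTorch models on its backend through a Python API for PyTorch called ORTTrainer.

Installation

Run the following command in a terminal to install it:

pip install onnxruntime

Example usage

In practice, you can train a model with your favorite framework and convert it to the ONNX format. Consider the following example, which uses the well-known Iris dataset.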

# Step 1: train a scikit-learn model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.linear_model import LogisticRegression
clr = LogisticRegression()
clr.fit(X_train, y_train)
print(clr)

# Step 2: convert the model to ONNX and save it
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, 4]))]
onx = convert_sklearn(clr, initial_types=initial_type)
with open("logreg_iris.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Step 3: run inference with ONNX Runtime
import numpy
import onnxruntime as rt

sess = rt.InferenceSession("logreg_iris.onnx")
input_name = sess.get_inputs()[0].name
pred_onx = sess.run(None, {input_name: X_test.astype(numpy.float32)})[0]
print(pred_onx)

PaddlePaddle

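Link: https://pypi.org/project/paddlepaddle/

PaddlePaddle is an independently developed deep learning platform. It originated from industrial practice and has been applied successfully in many industries, from manufacturing to agriculture.

It provides the following core features:

An agile framework for industrial development of deep neural networks
Support for ultra-large-scale training of deep neural networks
Accelerated high-performance inference across ubiquitous deployments
Industry-oriented models and libraries with open-source repositories

In China, many commercial AI solutions are built on PaddlePaddle. In addition, the documentation is available in both Chinese and English.

Installation

You can install it with pip, choosing the package according to whether you use a GPU.

# Linux CPU
pip install paddlepaddle
# Linux GPU cuda10cudnn7
pip install paddlepaddle-gpu
# Linux GPU cuda9cudnn7
pip install paddlepaddle-gpu==1.8.4.post97

Example usage

To get off to a good start, first learn the basic concepts behind Fluid programming. Once you have done that, you should be able to create a simple model like this: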

import paddle.fluid as fluid
import numpy

# define data
train_data = numpy.array([[1.0], [2.0], [3.0], [4.0]]).astype('float32')
y_true = numpy.array([[2.0], [4.0], [6.0], [8.0]]).astype('float32')

# define network
x = fluid.layers.data(name="x", shape=[1], dtype='float32')
y = fluid.layers.data(name="y", shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)

# define loss function
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(cost)

# define optimization algorithm
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.01)
sgd_optimizer.minimize(avg_cost)

# initialize parameters
cpu = fluid.core.CPUPlace()
exe = fluid.Executor(cpu)
exe.run(fluid.default_startup_program())

# start training and iterate 100 times
for i in range(100):
    outs = exe.run(
        feed={'x': train_data, 'y': y_true},
        fetch_list=[y_predict.name, avg_cost.name])

# observe result
print(outs)
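The Fluid program fits a single fully connected unit to data that follows y = 2x, using SGD on the mean squared error. The sketch below reproduces that computation in plain Python so the update rule is visible. It is not Paddle code: `fit` is a hypothetical helper, the parameters start at zero as a simplification, and it runs more steps than the 100 used in the example so that the fit visibly converges.

```python
def fit(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # simplification: start from zero
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w*x + b - y)^2) with respect to w and b
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


w, b, loss = fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(w, 3), round(b, 3))
```

The learned weight approaches 2 and the bias approaches 0, which matches the relationship in the training data.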

Pycaret

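Link: https://pypi.org/project/pycaret/

Pycaret is a Python wrapper around several popular machine learning frameworks. It lets you run complex machine learning tasks with only a few lines of code. It follows these principles:

Simple
Easy to use
Deployment ready

It is very useful for building proof-of-concept projects or quickly testing end-to-end experiments.

Installation

The simplest way to install it is:

pip install pycaret

Example usage

The following snippet is an example of an NLP task using Pycaret: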

# check version
from pycaret.utils import version
version()

# load dataset
from pycaret.datasets import get_data
data = get_data('kiva')

# initialization
from pycaret.nlp import *
nlp1 = setup(data, target='en', session_id=123, log_experiment=True,
             log_plots=True, experiment_name='kiva1')

# models, run them one by one in jupyter notebook
models()
lda = create_model('lda')
nmf = create_model('nmf', num_topics=6)

# assign labels
lda_results = assign_model(lda)
lda_results.head()

# analyze model
plot_model(lda, plot='bigram')  # you can test using plot = tsne as well
evaluate_model(lda)

PyOD

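Link: https://pypi.org/project/pyod/

PyOD is a Python toolkit dedicated to detecting outliers in multivariate data. Detecting outlying objects is commonly referred to as outlier detection or anomaly detection. At the time of writing, it includes a fair number of neural-network-based models and supports more than 30 detection algorithms.

Installation

Installing with pip is recommended:

pip install pyod

Example usage

Consider the following example, which uses KNN for outlier detection.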

from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

contamination = 0.1  # percentage of outliers
n_train = 200  # number of training points
n_test = 100  # number of testing points

# Generate sample data
X_train, y_train, X_test, y_test = generate_data(
    n_train=n_train, n_test=n_test, n_features=2,
    contamination=contamination, random_state=42)

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=True)
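To give some intuition for a kNN-based outlier score, the sketch below scores each point by the distance to its k-th nearest neighbour, so isolated points receive large scores. It is plain Python written for illustration, not PyOD's implementation: the function name `knn_outlier_scores`, the toy points and the choice of k are all assumptions.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Five points in a tight cluster plus one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]
scores = knn_outlier_scores(points, k=2)
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(outlier_index)  # → 5
```

A detector then turns these raw scores into 0/1 labels by choosing a threshold, for example from an expected fraction of outliers such as the `contamination` value used earlier.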

SHAP

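SHAP stands for SHapley Additive exPlanations. It uses a game-theoretic approach to explain the output of any machine learning model. You can think of it as a visualization tool aimed at the black-box problem of machine learning models. The insights it provides are essential when you are fine-tuning a model.

Installation

SHAP can be installed directly from PyPI:

pip install shap

Example usage

The following example trains an XGBoost model and uses SHAP to visualize it.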

import xgboost
import shap

# load JS visualization code to notebook
shap.initjs()

# train XGBoost model
X, y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True to avoid javascript)
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])
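The game-theoretic idea behind these attributions can be shown on a tiny example. The sketch below computes exact Shapley values for a made-up two-feature model by averaging each feature's marginal contribution over all feature orderings. It does not use the shap library or its algorithms; `model`, `value`, `shapley_values`, the instance and the all-zero baseline are assumptions chosen for illustration.

```python
from itertools import permutations
from math import factorial


def model(x1, x2):
    """Toy model with an interaction term."""
    return 2 * x1 + 3 * x2 + x1 * x2


def value(coalition, instance, baseline):
    """Model output when only the features in `coalition` take their real values."""
    inputs = [instance[i] if i in coalition else baseline[i]
              for i in range(len(instance))]
    return model(*inputs)


def shapley_values(instance, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(instance)
    phi = [0.0] * n
    for order in permutations(range(n)):
        coalition = set()
        for feature in order:
            before = value(coalition, instance, baseline)
            coalition.add(feature)
            phi[feature] += value(coalition, instance, baseline) - before
    return [p / factorial(n) for p in phi]


instance, baseline = (1.0, 2.0), (0.0, 0.0)
phi = shapley_values(instance, baseline)
print(phi)  # → [3.0, 7.0]
```

The two values sum to 10, which is the difference between the model output for the instance and for the baseline. This additivity is what lets a force plot show each feature pushing the prediction away from the expected value.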

Trax

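Trax is an end-to-end deep learning library actively maintained by the Google Brain team. It focuses on clear code and fast execution.

Installation

You can install it with pip as follows:

pip install trax

Example usage

The following gist shows an example of English-to-German translation: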

import trax

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize a sentence.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),  # Operates on streams.
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]

# Decode from the Transformer.
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)  # Higher temperature: more diverse results.

# De-tokenize.
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
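The `temperature` argument in the decoding call controls how random the sampled tokens are. The sketch below shows one common way such a parameter behaves, in plain Python: at temperature 0 the highest-scoring token is always picked, and higher temperatures flatten the distribution. It is an illustration only, not Trax's implementation, and `sample_token` and the toy logits are made up.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index from raw scores, with temperature-scaled sampling."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]
greedy = sample_token(logits, temperature=0.0)
sampled = sample_token(logits, temperature=1.0, rng=random.Random(0))
print(greedy, sampled)
```

With `temperature=0.0`, as in the translation example, decoding is deterministic, which is usually what you want for translation.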

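Summary

By now you should have a general idea of some lesser-known Python libraries for machine learning. Every package has its own strengths and weaknesses. As developers, we should look beyond the popular choices and discover the hidden gems that benefit our projects.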
