将 python scikit 学习模型导出到 pmml
Posted
技术标签:
【中文标题】将 python scikit 学习模型导出到 pmml【英文标题】:Export python scikit learn models into pmml 【发布时间】:2016-01-18 04:59:56 【问题描述】:我想将 python scikit-learn 模型导出到 PMML。
什么 python 包最适合?
我阅读了有关 Augustus 的信息,但我找不到任何使用 scikit-learn 模型的示例。
【问题讨论】:
您可以使用 sklearn2pmml 包将 Scikit-Learn 模型和转换器转换为 PMML。 jpmml-sklearn 包从 python 3.4 开始支持。是否有支持 python 2.7 的替代方案 JPMML-SkLearn 也支持 Python 2.7,但目前没有宣传。 【参考方案1】:SkLearn2PMML
是
JPMML-SkLearn 命令行应用程序的精简包装器。有关受支持的 Scikit-Learn Estimator 和 Transformer 类型的列表,请参阅 JPMML-SkLearn 项目的文档。
正如@user1808924 所说,它支持 Python 2.7 或 3.4+。它还需要 Java 1.7+
通过以下方式安装:(需要git)
pip install git+https://github.com/jpmml/sklearn2pmml.git
如何将分类器树导出到 PMML 的示例。首先生长树:
# example tree & viz from http://scikit-learn.org/stable/modules/tree.html
from sklearn import datasets, tree
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
SkLearn2PMML 转换有两个部分,一个估计器(我们的clf
)和一个映射器(用于离散化或 PCA 等预处理步骤)。我们的映射器非常基础,因为我们没有进行任何转换。
from sklearn_pandas import DataFrameMapper
default_mapper = DataFrameMapper([(i, None) for i in iris.feature_names + ['Species']])
from sklearn2pmml import sklearn2pmml
sklearn2pmml(estimator=clf,
mapper=default_mapper,
pmml="D:/workspace/IrisClassificationTree.pmml")
有可能(尽管没有记录)传递mapper=None
,但您会看到预测器名称丢失(返回x1
而不是sepal length
等)。
我们来看看.pmml
文件:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
<Header>
<Application name="JPMML-SkLearn" version="1.1.1"/>
<Timestamp>2016-09-26T19:21:43Z</Timestamp>
</Header>
<DataDictionary>
<DataField name="sepal length (cm)" optype="continuous" dataType="float"/>
<DataField name="sepal width (cm)" optype="continuous" dataType="float"/>
<DataField name="petal length (cm)" optype="continuous" dataType="float"/>
<DataField name="petal width (cm)" optype="continuous" dataType="float"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>
<TreeModel functionName="classification" splitCharacteristic="binarySplit">
<MiningSchema>
<MiningField name="Species" usageType="target"/>
<MiningField name="sepal length (cm)"/>
<MiningField name="sepal width (cm)"/>
<MiningField name="petal length (cm)"/>
<MiningField name="petal width (cm)"/>
</MiningSchema>
<Output>
<OutputField name="probability_setosa" dataType="double" feature="probability" value="setosa"/>
<OutputField name="probability_versicolor" dataType="double" feature="probability" value="versicolor"/>
<OutputField name="probability_virginica" dataType="double" feature="probability" value="virginica"/>
</Output>
<Node id="1">
<True/>
<Node id="2" score="setosa" recordCount="50.0">
<SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="0.8"/>
<ScoreDistribution value="setosa" recordCount="50.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="0.0"/>
</Node>
<Node id="3">
<SimplePredicate field="petal width (cm)" operator="greaterThan" value="0.8"/>
<Node id="4">
<SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.75"/>
<Node id="5">
<SimplePredicate field="petal length (cm)" operator="lessOrEqual" value="4.95"/>
<Node id="6" score="versicolor" recordCount="47.0">
<SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.6500001"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="47.0"/>
<ScoreDistribution value="virginica" recordCount="0.0"/>
</Node>
<Node id="7" score="virginica" recordCount="1.0">
<SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.6500001"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="1.0"/>
</Node>
</Node>
<Node id="8">
<SimplePredicate field="petal length (cm)" operator="greaterThan" value="4.95"/>
<Node id="9" score="virginica" recordCount="3.0">
<SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.55"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="3.0"/>
</Node>
<Node id="10">
<SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.55"/>
<Node id="11" score="versicolor" recordCount="2.0">
<SimplePredicate field="sepal length (cm)" operator="lessOrEqual" value="6.95"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="2.0"/>
<ScoreDistribution value="virginica" recordCount="0.0"/>
</Node>
<Node id="12" score="virginica" recordCount="1.0">
<SimplePredicate field="sepal length (cm)" operator="greaterThan" value="6.95"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="1.0"/>
</Node>
</Node>
</Node>
</Node>
<Node id="13">
<SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.75"/>
<Node id="14">
<SimplePredicate field="petal length (cm)" operator="lessOrEqual" value="4.8500004"/>
<Node id="15" score="virginica" recordCount="2.0">
<SimplePredicate field="sepal width (cm)" operator="lessOrEqual" value="3.1"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="2.0"/>
</Node>
<Node id="16" score="versicolor" recordCount="1.0">
<SimplePredicate field="sepal width (cm)" operator="greaterThan" value="3.1"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="1.0"/>
<ScoreDistribution value="virginica" recordCount="0.0"/>
</Node>
</Node>
<Node id="17" score="virginica" recordCount="43.0">
<SimplePredicate field="petal length (cm)" operator="greaterThan" value="4.8500004"/>
<ScoreDistribution value="setosa" recordCount="0.0"/>
<ScoreDistribution value="versicolor" recordCount="0.0"/>
<ScoreDistribution value="virginica" recordCount="43.0"/>
</Node>
</Node>
</Node>
</Node>
</TreeModel>
</PMML>
第一个分割(节点 1)的花瓣宽度为 0.8。节点 2(花瓣宽度
您可以将 pmml 输出与graphviz
输出进行比较:
from sklearn.externals.six import StringIO
import pydotplus # this might be pydot for python 2.7
dot_data = StringIO()
tree.export_graphviz(clf,
out_file=dot_data,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("D:/workspace/iris.pdf")
# for in-line display, you can also do:
# from IPython.display import Image
# Image(graph.create_png())
【讨论】:
有没有办法在不使用映射器时保留预测器名称?我真的需要在评估者方面了解它们,但是仅仅为此构建一个映射器太过分了。 @K 我无法弄清楚如何在没有映射器的情况下保留预测器名称。您可以尝试发布问题。 答案似乎已过时:sklearn2pmml
现在使用PMMLPipeline
。【参考方案2】:
随意尝试 Nyoka。导出 SKL 模型,然后导出一些。
【讨论】:
【参考方案3】:Nyoka 是一个支持Scikit-learn
、XGBoost
、LightGBM
、Keras
和Statsmodels
的python 库。
除了大约 500 个 Python 类,每个类都涵盖了标准中定义的 PMML 标记和所有构造函数参数/属性,Nyoka 还提供了越来越多的便利类和函数,例如通过读取或写入使数据科学家的生活更轻松在您最喜欢的 Python 环境中,一行代码中的任何 PMML 文件。
它可以使用 PyPi 安装:
pip install nyoka
示例代码
示例 1
import pandas as pd
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Imputer
from sklearn_pandas import DataFrameMapper
from sklearn.ensemble import RandomForestClassifier
iris = datasets.load_iris()
irisd = pd.DataFrame(iris.data, columns=iris.feature_names)
irisd['Species'] = iris.target
features = irisd.columns.drop('Species')
target = 'Species'
pipeline_obj = Pipeline([
("mapping", DataFrameMapper([
(['sepal length (cm)', 'sepal width (cm)'], StandardScaler()) ,
(['petal length (cm)', 'petal width (cm)'], Imputer())
])),
("rfc", RandomForestClassifier(n_estimators = 100))
])
pipeline_obj.fit(irisd[features], irisd[target])
from nyoka import skl_to_pmml
skl_to_pmml(pipeline_obj, features, target, "rf_pmml.pmml")
示例 2
from keras import applications
from keras.layers import Flatten, Dense
from keras.models import Model
model = applications.MobileNet(weights='imagenet', include_top=False,input_shape = (224, 224,3))
activType='sigmoid'
x = model.output
x = Flatten()(x)
x = Dense(1024, activation="relu")(x)
predictions = Dense(2, activation=activType)(x)
model_final = Model(inputs =model.input, outputs = predictions,name='predictions')
from nyoka import KerasToPmml
cnn_pmml = KerasToPmml(model_final,dataSet='image',predictedClasses=['cats','dogs'])
cnn_pmml.export(open('2classMBNet.pmml', "w"), 0)
更多示例可以在Nyoka's Github Page 中找到。
【讨论】:
以上是关于将 python scikit 学习模型导出到 pmml的主要内容,如果未能解决你的问题,请参考以下文章
Python scikit-learn:导出经过训练的分类器
《机器学习实战:基于Scikit-LearnKeras和TensorFlow第2版》-学习笔记:训练模型
机器学习系列4 使用Python创建Scikit-Learn回归模型
将 PMML 模型导入 Python (Scikit-learn)