sklearn: a custom class inheriting from TransformerMixin inside a DataFrameMapper makes sklearn2pmml fail when generating PMML

Posted 2023-04-13


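I wrote a custom class that converts placeholder strings to NaN inside a DataFrameMapper. When I use sklearn2pmml to generate the PMML file, it fails with an encoding error.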
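import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataEncode(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Treat the placeholder strings "\N" and "-" as missing values
        X = X.replace("\\N", np.nan)
        X = X.replace("-", np.nan)
        X = X.astype(float)
        return pd.concat([X], axis=1)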

from sklearn.preprocessing import Imputer, StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    (['Sepal.Length'], [DataEncode(), ContinuousDomain(), Imputer(), StandardScaler()]),
    (['Sepal.Width'], [DataEncode(), ContinuousDomain(), Imputer(), StandardScaler()]),
    (['Petal.Length'], [DataEncode(), ContinuousDomain(), Imputer(), StandardScaler()]),
    (['Petal.Width'], [DataEncode(), ContinuousDomain(), Imputer(), StandardScaler()]),
], input_df=True)

from sklearn2pmml.pipeline import PMMLPipeline
gbdt_pipline = PMMLPipeline([
    ('mapper', mapper),
    ('classifier', clf)
])

from sklearn2pmml import sklearn2pmml
sklearn2pmml(gbdt_pipline, "D:/mlfile/test/test_iris.pmml", with_repr=True, debug=True)

Error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-92-8e29dc6f358c> in <module>()
----> 1 sklearn2pmml(gbdt_pipline,"D:/mlfile/test/test_iris.pmml",with_repr=True,debug=True)

D:\anaconda-hh\lib\site-packages\sklearn2pmml\__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug)
231 print("Standard output is empty")
232 if(len(error) > 0):
--> 233 print("Standard error:\n{0}".format(error.decode("UTF-8")))
234 else:
235 print("Standard error is empty")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 4: invalid continuation byte

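I tried changing the encoding, and I also tried converting the pkl file to PMML with the jar directly. Neither worked. Has anyone solved this?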
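One plausible reading of the traceback, not confirmed by the source: the failing line decodes the converter's standard error as UTF-8, and on a Chinese-locale Windows machine that output may be GBK-encoded, so the decode itself fails and hides the underlying message. The sketch below only illustrates that mechanism. The sample bytes and the helper name `decode_stderr` are hypothetical and are not part of sklearn2pmml.

```python
# Hypothetical stderr bytes: a GBK-encoded message, as a Chinese-locale
# Windows console might produce. This is NOT real sklearn2pmml output.
raw = "错误: 找不到类".encode("gbk")


def decode_stderr(data, encodings=("utf-8", "gbk")):
    """Try each encoding in turn; fall back to a lossy UTF-8 decode."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be read and mark undecodable bytes
    return data.decode("utf-8", errors="replace")


try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # same failure class as in the traceback above

message = decode_stderr(raw)
print(utf8_ok, message)
```

Decoding this way would only reveal the hidden error text; it would not by itself make the conversion succeed.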

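Answer A: I ran into this problem too. I asked the author of the sklearn2pmml project, who replied that custom transformer classes are not supported; only transformers from the standard libraries are.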
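If custom transformers cannot be converted, one possible workaround is to do the string-to-NaN cleaning in pandas before the data reaches the pipeline, so the mapper contains only standard steps. This is a minimal sketch under that assumption. The helper `clean_numeric_columns` and the sample frame are hypothetical, and the source does not confirm that this makes the conversion succeed.

```python
import pandas as pd


def clean_numeric_columns(df, columns, placeholders=("\\N", "-")):
    """Return a copy of df with placeholder strings turned into NaN floats."""
    cleaned = df.copy()
    for col in columns:
        # Blank out the placeholder strings, then convert the rest to numbers
        masked = cleaned[col].mask(cleaned[col].isin(list(placeholders)))
        cleaned[col] = pd.to_numeric(masked)
    return cleaned


# Hypothetical sample resembling the iris columns used in the question
raw = pd.DataFrame({
    "Sepal.Length": ["5.1", "\\N", "4.7"],
    "Sepal.Width": ["3.5", "-", "3.2"],
})
clean = clean_numeric_columns(raw, ["Sepal.Length", "Sepal.Width"])
print(clean)
```

The trade-off is that this cleaning step is not recorded in the PMML file, so whatever system scores the model would have to apply the same replacement to its inputs.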

Accessing the numbers in a classification report - sklearn

【Title】: access to numbers in classification_report - sklearn 【Posted】: 2018-07-03 05:25:30 【Question】:

Here is a simple example of classification_report in sklearn:

from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
#             precision    recall  f1-score   support
#
#    class 0       0.50      1.00      0.67         1
#    class 1       0.00      0.00      0.00         1
#    class 2       1.00      0.67      0.80         3
#
#avg / total       0.70      0.60      0.61         5

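I want to access the avg / total row. For example, I want to extract the f1-score from the report, which is 0.61.

How can I access the numbers in classification_report?

【Comments】:

Are you interested in the f1-score itself, or in extracting the f1-score from the classification report?

@PratikKumar In extracting it from the classification report. I also need the other reports.

【Answer 1】:

You can use precision_recall_fscore_support to get everything at once: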

from sklearn.metrics import precision_recall_fscore_support as score
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
precision, recall, fscore, support = score(y_true, y_pred, average='macro')
# When average is set, support is returned as None
print('Precision : {}'.format(precision))
print('Recall    : {}'.format(recall))
print('F-score   : {}'.format(fscore))
print('Support   : {}'.format(support))

The original answer linked to the module's documentation here.
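Note that average='macro' does not reproduce the avg / total row from the question. In the old text report that row is the support-weighted average of the per-class scores. The following sketch recomputes both averages with plain counting, without sklearn, to show where 0.61 comes from. The helper `per_class_scores` is a hypothetical name introduced for illustration.

```python
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} by plain counting."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


scores = per_class_scores(y_true, y_pred)
total = sum(s[3] for s in scores.values())

# Macro: unweighted mean over classes
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
# Weighted: mean over classes, weighted by support (the old avg / total row)
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.49 0.61
```

With sklearn itself, passing average='weighted' instead of average='macro' should give the same 0.61.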

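【Comments】:

The answer is correct, but note that you passed the arguments in the wrong order: the first argument is y_true and the second should be y_pred.

Does this also work for multi-class datasets?

【Answer 2】:

You can output the classification report as a dict: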

report = classification_report(y_true, y_pred, output_dict=True)

Then access its individual values as you would with a normal Python dictionary.

For example, the macro metrics:

macro_precision =  report['macro avg']['precision'] 
macro_recall = report['macro avg']['recall']    
macro_f1 = report['macro avg']['f1-score']

Or the accuracy:

accuracy = report['accuracy']
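To tie this back to the question, here is a small sketch that pulls the 0.61 figure out of the dictionary. It assumes a scikit-learn version that supports output_dict and names the rows 'macro avg' and 'weighted avg'. In those versions the old avg / total row corresponds to 'weighted avg'.

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']

report = classification_report(
    y_true, y_pred, target_names=target_names, output_dict=True
)

# Support-weighted average: the equivalent of the old "avg / total" row
weighted_f1 = report['weighted avg']['f1-score']
# Per-class rows are keyed by the names given in target_names
class2_recall = report['class 2']['recall']

print(round(weighted_f1, 2), round(class2_recall, 2))
```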

【Comments】:

【Answer 3】:

You can use the output_dict parameter of the built-in classification report to return a dictionary:

classification_report(y_true,y_pred,output_dict=True)

【Comments】:

【Answer 4】:

classification_report returns a string by default, so I suggest using f1_score from scikit-learn instead:

from sklearn.metrics import f1_score
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']

print(f1_score(y_true, y_pred, average=None))

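Output: the per-class F1 scores, which match the f1-score column of the report above (about 0.67, 0.00, and 0.80).

【Comments】:

Thanks. So there is no way to extract it from the classification report? What about the other reports?

Maybe you could use a regular expression to extract this value.

Can you name the other reports? If you mean recall and precision, then yes, sklearn has functions such as recall_score and precision_score.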
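The regular-expression idea from the comments could look roughly like the sketch below. It parses the text report exactly as printed in the question, so it depends on that layout and is more fragile than the dictionary output. The helper `parse_report_row` is a hypothetical name, not a scikit-learn function.

```python
import re

# The text report from the question, copied as a plain string
report_text = """\
             precision    recall  f1-score   support

    class 0       0.50      1.00      0.67         1
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5
"""


def parse_report_row(text, row_name):
    """Return the four numbers on the report line that starts with row_name."""
    pattern = re.compile(
        r"^\s*" + re.escape(row_name)
        + r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(text)
    if match is None:
        raise KeyError(row_name)
    precision, recall, f1, support = match.groups()
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1-score": float(f1),
        "support": int(support),
    }


avg_row = parse_report_row(report_text, "avg / total")
print(avg_row["f1-score"])  # 0.61
```

The values recovered this way are rounded to the two decimals shown in the report, which is another reason to prefer output_dict or the dedicated metric functions when they are available.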
