Scikit-learn's logistic regression is performing poorer than self-written logistic regression in Python
Posted: 2018-11-18 02:05:48

I have written logistic regression code in Python and compared its results with Scikit-learn's logistic regression. On a simple one-dimensional sample dataset, the latter performed worse, as shown below:
My logistic regression code:
import pandas as pd
import numpy as np

def findProb(xBias, beta):
    # Sigmoid probability for each row of the design matrix
    z = []
    for i in range(len(xBias)):
        z.append(xBias.iloc[i, 0]*beta[0] + xBias.iloc[i, 1]*beta[1])
    prob = [(1/(1+np.exp(-i))) for i in z]
    return prob

def calDerv(xBias, y, beta, prob):
    # Average gradient of the log-loss with respect to each coefficient
    derv = []
    for i in range(len(beta)):
        helpVar1 = 0
        for j in range(len(xBias)):
            helpVar2 = prob[j]*xBias.iloc[j, i] - y[j]*xBias.iloc[j, i]
            helpVar1 = helpVar1 + helpVar2
        derv.append(helpVar1/len(xBias))
    return derv

def updateBeta(beta, alpha, derv):
    # Plain gradient-descent step
    for i in range(len(beta)):
        beta[i] = beta[i] - derv[i]*alpha
    return beta

def calCost(y, prob):
    # Log-loss, defined for inspection (not called in the training loop)
    cost = 0
    for i in range(len(y)):
        if y[i] == 1: eachCost = -y[i]*np.log(prob[i])
        else: eachCost = -(1-y[i])*np.log(1-prob[i])
        cost = cost + eachCost
    return cost

def myLogistic(x, y, alpha, iters):
    beta = [0 for i in range(2)]
    bias = [1 for i in range(len(x))]
    xBias = pd.DataFrame({'bias': bias, 'x': x})  # design matrix: bias column of ones plus the feature
    for i in range(iters):
        prob = findProb(xBias, beta)
        derv = calDerv(xBias, y, beta, prob)
        beta = updateBeta(beta, alpha, derv)
    return beta
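For reference, here is a vectorized NumPy sketch of the same procedure (not part of the original question; it assumes the same bias-then-x column order as xBias above). It computes the same average gradient as calDerv and the same update as updateBeta, which makes it easier to see that the training loop is plain unregularized batch gradient descent on the logistic loss:

import numpy as np

def myLogisticVectorized(x, y, alpha, iters):
    # Design matrix with a bias column of ones, matching xBias above
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))  # sigmoid of the linear score
        grad = X.T @ (prob - y) / len(y)        # average gradient, same as calDerv
        beta -= alpha * grad                    # same step as updateBeta
    return beta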
Comparing the results on a small sample dataset:
input = list(range(1, 11))
labels = [0,0,0,0,0,1,1,1,1,1]
print("\nmy logistic")
learningRate = 0.01
iterations = 10000
beta = myLogistic(input, labels, learningRate, iterations)
print("coefficients: ", beta)
print("decision boundary is at x = ", -beta[0]/beta[1])
decision = -beta[0]/beta[1]
predicted = [0 if i < decision else 1 for i in input]
print("predicted values: ", predicted)
Output: 0, 0, 0, 0, 0, 1, 1, 1, 1, 1
print("\npython logistic")
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
input = np.reshape(input, (-1,1))
lr.fit(input, labels)
print("coefficient = ", lr.coef_)
print("intercept = ", lr.intercept_)
print("decision = ", -lr.intercept_/lr.coef_)
predicted = lr.predict(input)
print(predicted)
Output: 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
Answer 1: Your implementation has no regularization term. The LogisticRegression estimator includes regularization with inverse strength C = 1.0 by default. When you set C to a larger value, i.e. weaken the regularization, the decision boundary moves closer to 5.5:
for C in [1.0, 1000.0, 1e+8]:
    lr = LogisticRegression(C=C)
    lr.fit(inp, labels)  # inp: the reshaped feature array, as in the question's np.reshape(input, (-1, 1))
    print(f'C = {C}, decision boundary @ {(-lr.intercept_/lr.coef_[0])[0]}')
Output:
C = 1.0, decision boundary @ 3.6888430562595116
C = 1000.0, decision boundary @ 5.474229032805065
C = 100000000.0, decision boundary @ 5.499634348989383
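To make scikit-learn reproduce the unregularized, hand-written result directly, you can make the penalty negligible. A minimal sketch, assuming the same data as in the question (recent scikit-learn versions also accept penalty=None to switch the penalty off, but a very large C works across versions):

import numpy as np
from sklearn.linear_model import LogisticRegression

inp = np.arange(1, 11).reshape(-1, 1)
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

lr = LogisticRegression(C=1e8)  # very weak regularization, as in the loop above
lr.fit(inp, labels)
print("decision boundary @", (-lr.intercept_ / lr.coef_[0])[0])  # about 5.4996, close to the hand-written 5.5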
Answer 2: A custom implementation, or any logistic-regression routine trained this way, will depend on the following:
- learning rate (alpha)
- number of iterations
So, with some tuning of the learning rate and the number of iterations, roughly the same weights can be found.
For further analysis, you can refer to this link - https://medium.com/@martinpella/logistic-regression-from-scratch-in-python-124c5636b8ac
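As a rough illustration (a sketch, not from the original answer; it assumes the myLogistic function and the data from the question are available), varying the learning rate and the number of iterations shows that the hand-written, unregularized model keeps its boundary near 5.5 while the coefficient magnitudes keep growing:

input = list(range(1, 11))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
for alpha, iters in [(0.01, 1000), (0.01, 10000), (0.1, 10000)]:
    beta = myLogistic(input, labels, alpha, iters)
    print(alpha, iters, "boundary =", -beta[0]/beta[1])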