python, using logistic regression to see which variable is adding more weight towards a positive prediction
【Posted】2020-03-13 10:44:44
【Question】So I have a bank dataset where I have to predict whether a client will subscribe to a term deposit. I have a column called job; it is categorical and holds each client's job type. I am currently in the EDA process and want to figure out which job category contributes the most towards a positive prediction.
I plan to do this with logistic regression (not sure whether that is the right approach; suggestions for alternatives are welcome).
This is what I did:
I one-hot encoded the job column, so there is one 0/1 dummy column per job type (all twelve categories are kept, not k-1 of them), and Target_yes holds 0/1 values (1 = the client accepted the term deposit, 0 = the client did not). A sketch of one way to produce such an encoding appears after the two listings below.
job_management job_technician job_entrepreneur job_blue-collar job_unknown job_retired job_admin. job_services job_self-employed job_unemployed job_housemaid job_student
0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
45206 0 1 0 0 0 0 0 0 0 0 0 0
45207 0 0 0 0 0 1 0 0 0 0 0 0
45208 0 0 0 0 0 1 0 0 0 0 0 0
45209 0 0 0 1 0 0 0 0 0 0 0 0
45210 0 0 1 0 0 0 0 0 0 0 0 0
45211 rows × 12 columns
The target column looks like this:
0 0
1 0
2 0
3 0
4 0
..
45206 1
45207 1
45208 1
45209 0
45210 0
Name: Target_yes, Length: 45211, dtype: int32
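For reference, a sketch of how such an encoding could be produced with pandas. The raw frame name bank and its column names job and y are assumptions for illustration, not from the original post:

import pandas as pd

# Hypothetical raw frame `bank` with a categorical `job` column and a yes/no target `y`
vari = pd.get_dummies(bank['job'], prefix='job').astype(int)  # one 0/1 column per job type
tgt = (bank['y'] == 'yes').astype(int).rename('Target_yes')   # 1 = accepted the term deposit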
I fit this to a sklearn logistic regression model and got the coefficients. Unable to interpret them, I looked for alternatives, came across the statsmodels version, and did the same thing with its Logit function. In the examples I saw online, sm.add_constant was applied to the x variables.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
model = LogisticRegression(solver='liblinear')
model.fit(vari,tgt)
model.score(vari,tgt)
df = pd.DataFrame(model.coef_)
df['inter'] = model.intercept_
print(df)
The model score and the print(df) output are as follows:
0.8830151954170445 (model score)
print(df)
0 1 2 3 4 5 6 \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201 0.573717 -0.177778
7 8 9 10 11 inter
0 -0.530802 -0.210549 0.099326 -0.539109 0.879504 -1.795323
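The sklearn coefficients come back as a bare array in input-column order; pairing them with the column names makes them readable. A small sketch using the model and vari objects from above:

import pandas as pd

# Label each coefficient with the dummy column it belongs to
coefs = pd.Series(model.coef_[0], index=vari.columns).sort_values()
print(coefs)             # job_student, job_retired and job_unemployed come out positive
print(model.intercept_)  # the shared intercept, about -1.795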
When I use sm.add_constant, I get coefficients similar to sklearn's LogisticRegression, but the z-scores (which I had planned to use to find the job type contributing the most towards a positive prediction) all become nan.
import statsmodels.api as sm
logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
logit.summary2()
The result is:
E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning:
Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning:
invalid value encountered in sqrt
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in greater
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in less
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning:
invalid value encountered in less_equal
Optimization terminated successfully.
Current function value: 0.352610
Iterations 13
Model: Logit Pseudo R-squared: 0.023
Dependent Variable: Target_yes AIC: 31907.6785
Date: 2019-11-18 10:17 BIC: 32012.3076
No. Observations: 45211 Log-Likelihood: -15942.
Df Model: 11 LL-Null: -16315.
Df Residuals: 45199 LLR p-value: 3.9218e-153
Converged: 1.0000 Scale: 1.0000
No. Iterations: 13.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
const -1.7968 nan nan nan nan nan
job_management -0.0390 nan nan nan nan nan
job_technician -0.2882 nan nan nan nan nan
job_entrepreneur -0.6092 nan nan nan nan nan
job_blue-collar -0.7484 nan nan nan nan nan
job_unknown -0.2142 nan nan nan nan nan
job_retired 0.5766 nan nan nan nan nan
job_admin. -0.1766 nan nan nan nan nan
job_services -0.5312 nan nan nan nan nan
job_self-employed -0.2106 nan nan nan nan nan
job_unemployed 0.1011 nan nan nan nan nan
job_housemaid -0.5427 nan nan nan nan nan
job_student 0.8857 nan nan nan nan nan
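A quick way to confirm what is going wrong here (a sketch; vari is the dummy frame from above): check the rank of the design matrix once the constant is added. The twelve mutually exclusive dummies sum to 1 in every row, i.e. to the constant column, so the matrix is rank-deficient and standard errors cannot be computed.

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(vari)
print(np.linalg.matrix_rank(X.values), 'of', X.shape[1], 'columns')  # 12 of 13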
If I use the statsmodels Logit without sm.add_constant, I get coefficients very different from sklearn's logistic regression, but I do get values for the z-scores (all of them negative).
import statsmodels.api as sm
logit = sm.Logit(tgt, vari).fit()
logit.summary2()
The result is:
Optimization terminated successfully.
Current function value: 0.352610
Iterations 6
Model: Logit Pseudo R-squared: 0.023
Dependent Variable: Target_yes AIC: 31907.6785
Date: 2019-11-18 10:18 BIC: 32012.3076
No. Observations: 45211 Log-Likelihood: -15942.
Df Model: 11 LL-Null: -16315.
Df Residuals: 45199 LLR p-value: 3.9218e-153
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
job_management -1.8357 0.0299 -61.4917 0.0000 -1.8943 -1.7772
job_technician -2.0849 0.0366 -56.9885 0.0000 -2.1566 -2.0132
job_entrepreneur -2.4060 0.0941 -25.5563 0.0000 -2.5905 -2.2215
job_blue-collar -2.5452 0.0390 -65.2134 0.0000 -2.6217 -2.4687
job_unknown -2.0110 0.1826 -11.0120 0.0000 -2.3689 -1.6531
job_retired -1.2201 0.0501 -24.3534 0.0000 -1.3183 -1.1219
job_admin. -1.9734 0.0425 -46.4478 0.0000 -2.0566 -1.8901
job_services -2.3280 0.0545 -42.6871 0.0000 -2.4349 -2.2211
job_self-employed -2.0074 0.0779 -25.7739 0.0000 -2.1600 -1.8547
job_unemployed -1.6957 0.0765 -22.1538 0.0000 -1.8457 -1.5457
job_housemaid -2.3395 0.1003 -23.3270 0.0000 -2.5361 -2.1429
job_student -0.9111 0.0722 -12.6195 0.0000 -1.0526 -0.7696
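One way to read this no-constant fit (a sketch, assuming the logit object fitted just above): with mutually exclusive dummies and no intercept, each coefficient is simply the log-odds of acceptance within that job group, so a sigmoid transform recovers each group's raw acceptance rate. That is also why every coefficient is negative: each group accepts less than half the time.

import numpy as np

# Each no-intercept coefficient is the within-group log-odds, so the
# sigmoid gives the raw acceptance rate per job category.
rates = 1 / (1 + np.exp(-logit.params))
print(rates.sort_values(ascending=False))  # e.g. job_student ~ 0.287, job_blue-collar ~ 0.073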
Which of the two is better? Or should I be using a completely different approach?
【Comments】:
My guess is that, because of the dummy-variable trap, you have a singular design matrix when you add the constant.
Will definitely look into what that is and what I can do about it! Thanks for your time!
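Following up on that comment, a minimal sketch of the usual fix. The raw frame name bank is an assumption, as in the encoding sketch above; dropping one category means the remaining columns are no longer collinear with the constant:

import pandas as pd
import statsmodels.api as sm

# Hypothetical: rebuild the dummies with one job category dropped as the baseline
vari_k1 = pd.get_dummies(bank['job'], prefix='job', drop_first=True).astype(int)
logit_k1 = sm.Logit(tgt, sm.add_constant(vari_k1)).fit()
print(logit_k1.summary2())  # standard errors and z-scores should now be finite
# Each coefficient is a log-odds ratio relative to the dropped baseline job.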
【Answer 1】:
Quoting the question: "I fit it to the sklearn logistic regression model and got the coefficients. Unable to interpret them, I looked for alternatives and came across the statsmodels version."
print(df)
0 1 2 3 4 5 6 \
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201 0.573717 -0.177778
7 8 9 10 11 inter
0 -0.530802 -0.210549 0.099326 -0.539109 0.879504 -1.795323
The interpretation goes like this: exponentiating a log-odds coefficient gives the odds ratio for a one-unit increase in that variable. For example, if Target_yes (1 = the client accepted the term deposit, 0 = the client did not) is the outcome and a logistic regression coefficient is 0.573717, then you can assert that the odds of an "accepted" outcome are exp(0.573717) ≈ 1.77485 times the odds of a "not accepted" outcome.
【Discussion】:
Thank you so much for clarifying this for me! So the odds of accepting are 1.77485 times the odds of not accepting. What order are these coefficients in; is it the same as the input variables? And what does exp of a negative coefficient give? exp(-0.539109) = 0.5832677124519539; is that the odds of not accepting?
The interpretation is the same for negative coefficients as for positive ones.
Thank you so much for taking the time to explain this to me! Much appreciated!