How can I use a for loop or a condition to build multiple regression models (statsmodels) on subsets of a pandas DataFrame?
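I have a DataFrame that contains a variable state with 51 unique values. I have to build one model for each state. For certain reasons I am restricted to regression (statsmodels). Suppose the variable V1 is predicted by the variables X1, X2 and X3.
The state takes the values 1:51 and will be used as the condition for splitting the DataFrame.
How can I automate this task with a for loop?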
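Answer
Assuming you are only concerned with the loop and not with splitting the DataFrame into 51 subsets, here is my attempt at your question.
Say you define the OLS function as: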
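from sklearn.metrics import r2_score
from statsmodels.api import OLS

def OLSfunction(y):
    # traindf, testdf, x_traindf and x_testdf must already exist
    # for the target y being modelled
    y_train = traindf[y]
    y_test = testdf[y]
    x_train = x_traindf
    x_test = x_testdf
    model = OLS(y_train, x_train)
    result = model.fit()
    print(result.summary())
    pred_OLS = result.predict(x_test)
    print("R2", r2_score(y_test, pred_OLS))

Y_s = ['1', '2', ..., '51']  # abbreviated here: one entry per target
for y in Y_s:
    OLSfunction(y)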
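Note that you have to derive traindf and testdf appropriately for the specific Y you are building a model for, and they must reach the OLS function correctly. Since I have no idea what your data looks like, I have not gone into splitting or creating traindf / testdf.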
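For the question as asked (V1 predicted by X1, X2, X3, one model per state), the split and the loop can be combined with groupby. The following is only a sketch under assumptions: the column names state, V1, X1, X2, X3 are taken from the question, while fit_ols_per_state and the synthetic demo frame are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_ols_per_state(df, formula="V1 ~ X1 + X2 + X3", group_col="state"):
    """Fit one OLS model per value of group_col; return {state: fitted result}."""
    models = {}
    for state, subset in df.groupby(group_col):
        # Each subset holds only the rows of one state
        models[state] = smf.ols(formula, data=subset).fit()
    return models

# Synthetic stand-in data (hypothetical): 3 states, 40 rows each
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "state": np.repeat([1, 2, 3], 40),
    "X1": rng.normal(size=120),
    "X2": rng.normal(size=120),
    "X3": rng.normal(size=120),
})
demo["V1"] = (1.0 + 2.0 * demo["X1"] - 0.5 * demo["X2"]
              + rng.normal(scale=0.1, size=120))

models = fit_ols_per_state(demo)
for state, res in models.items():
    print(state, round(res.rsquared, 3))
```

Calling models[1].summary() then shows the full regression table for state 1. The formula interface adds an intercept automatically, which the array-based OLS call above does not.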
Another answer
import pandas as pd
import os
import numpy as np
import statsmodels.api as sm  # sm.Logit(y, X) is called below with arrays, not formulas
First, I created a dict to hold the 51 datasets:
d = {}
for x in range(0, 52):
    d[x] = ccf.loc[ccf['state'] == x]
d.keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51])
To check:
d[1].head()
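The same dict can be built without hard-coding the range of state codes. A minimal sketch, assuming a frame shaped like ccf with a state column; ccf_demo and d_demo are hypothetical names used only for this illustration:

```python
import pandas as pd

# Hypothetical miniature of the ccf frame
ccf_demo = pd.DataFrame({
    "state": [1, 1, 2, 3, 3, 3],
    "fraudRisk": [0, 1, 0, 1, 0, 1],
})

# groupby yields (key, sub-frame) pairs only for states present in the data,
# so no empty frame is created for a code that never occurs (such as 0)
d_demo = {state: frame for state, frame in ccf_demo.groupby("state")}

print(sorted(d_demo))
print({state: len(frame) for state, frame in d_demo.items()})
```

With the explicit range(0, 52) loop, a key such as 0 still gets an entry even when no row has that state, and the entry is an empty DataFrame that a later model fit would fail on.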
Then I ran the code in a loop, using the keys of the dict:
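results = {}
for x in range(1, 52):  # states 1..51; range(1, 51) would stop at state 50
    # names is the list of predictor column names
    results[x] = sm.Logit(d[x].fraudRisk, d[x][names]).fit().summary2()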
But I felt I should use multiple classifiers in sklearn. First, I need to split the data as described above.
from sklearn.model_selection import train_test_split
# Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
#Model Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
lr={}
gnb={}
svc={}
rfc={}
classifier={}
regr_1={}
regr_2={}
import datetime
datetime.datetime.now()
for x in range(1, 52):  # states 1..51; range(1, 51) would stop at state 50
    X_train, X_test, y_train, y_test = train_test_split(d[x][names], d[x].fraudRisk, test_size=0.3)
    print(len(X_train))
    print(len(y_test))
    # Fit each classifier and keep its test-set predictions
    lr[x] = LogisticRegression().fit(X_train, y_train).predict(X_test)
    gnb[x] = GaussianNB().fit(X_train, y_train).predict(X_test)
    svc[x] = LinearSVC(C=1.0).fit(X_train, y_train).predict(X_test)
    rfc[x] = RandomForestClassifier(n_estimators=1).fit(X_train, y_train).predict(X_test)
    classifier[x] = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)
    print(datetime.datetime.now())
    print("Accuracy Score for model for state ", x, 'is ')
    print('LogisticRegression', accuracy_score(y_test, lr[x]))
    print('GaussianNB', accuracy_score(y_test, gnb[x]))
    print('LinearSVC', accuracy_score(y_test, svc[x]))
    print('RandomForestClassifier', accuracy_score(y_test, rfc[x]))
    print('KNeighborsClassifier', accuracy_score(y_test, classifier[x]))
    print("Classification Report for model for state ", x, 'is ')
    print('LogisticRegression', classification_report(y_test, lr[x]))
    print('GaussianNB', classification_report(y_test, gnb[x]))
    print('LinearSVC', classification_report(y_test, svc[x]))
    print('RandomForestClassifier', classification_report(y_test, rfc[x]))
    print('KNeighborsClassifier', classification_report(y_test, classifier[x]))
    print("Confusion Matrix Report for model for state ", x, 'is ')
    print('LogisticRegression', confusion_matrix(y_test, lr[x]))
    print('GaussianNB', confusion_matrix(y_test, gnb[x]))
    print('LinearSVC', confusion_matrix(y_test, svc[x]))
    print('RandomForestClassifier', confusion_matrix(y_test, rfc[x]))
    print('KNeighborsClassifier', confusion_matrix(y_test, classifier[x]))
    print("Area Under Curve for model for state ", x, 'is ')
    print('LogisticRegression', roc_auc_score(y_test, lr[x]))
    print('GaussianNB', roc_auc_score(y_test, gnb[x]))
    print('LinearSVC', roc_auc_score(y_test, svc[x]))
    print('RandomForestClassifier', roc_auc_score(y_test, rfc[x]))
    print('KNeighborsClassifier', roc_auc_score(y_test, classifier[x]))
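It took a long time to produce multiple metrics for 5 models x 51 states, but it was worth it. Let me know if there is a faster or better way to write more elegant, less hacky code.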
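One possible way to cut the repetition is to keep the classifiers in a dict and collect the metrics as rows of a DataFrame instead of printing each one. This is a sketch, not a drop-in replacement: make_models, score_states, the synthetic demo frame and the reduced set of three classifiers are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def make_models():
    # Fresh, unfitted estimators for every state
    return {
        "LogisticRegression": LogisticRegression(),
        "GaussianNB": GaussianNB(),
        "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    }

def score_states(df, features, target="fraudRisk", group_col="state", seed=0):
    rows = []
    for state, frame in df.groupby(group_col):
        # stratify keeps both classes in the test split, which roc_auc_score needs
        X_train, X_test, y_train, y_test = train_test_split(
            frame[features], frame[target],
            test_size=0.3, random_state=seed, stratify=frame[target],
        )
        for name, model in make_models().items():
            pred = model.fit(X_train, y_train).predict(X_test)
            rows.append({
                "state": state,
                "model": name,
                "accuracy": accuracy_score(y_test, pred),
                # AUC computed from hard labels, as in the loop above
                "auc": roc_auc_score(y_test, pred),
            })
    return pd.DataFrame(rows)

# Synthetic stand-in data (hypothetical): 3 states, 200 rows each
rng = np.random.default_rng(2)
features = ["x1", "x2"]
parts = []
for state in (1, 2, 3):
    X = pd.DataFrame(rng.normal(size=(200, 2)), columns=features)
    p = 1.0 / (1.0 + np.exp(-(1.5 * X["x1"] - 1.0 * X["x2"])))
    parts.append(X.assign(state=state, fraudRisk=rng.binomial(1, p.to_numpy())))
demo = pd.concat(parts, ignore_index=True)

scores = score_states(demo, features)
print(scores.pivot(index="state", columns="model", values="accuracy"))
```

Adding another classifier or metric then means one more dict entry or one more key in the row, and the resulting table can be sorted or pivoted instead of being read from printed output.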