Is there a way to suitably adjust this sklearn logistic regression function to account for multiple independent variables and fixed effects?


Posted: 2021-11-03 23:00:20

Question:

I would like to adjust the LogitRegression function included below so that it can handle additional independent variables and fixed effects.

The following code is adapted from the answer provided here: how to use sklearn when target variable is a proportion

from sklearn.linear_model import LinearRegression
from random import choices
import numpy as np
import pandas as pd

class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # p must lie strictly between 0 and 1, or the logit below is undefined
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)
    

if __name__ == '__main__':
    
    ### 1. Original version with a single independent variable
    # generate example data

    np.random.seed(42)
    n = 100
    
    ## orig version provided in the link - single random independent variable
    x = np.random.randn(n).reshape(-1,1)
    
    # defining the response (dependent) variable (a proportion strictly between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5
    
    # applying the model - this works
    model = LogitRegression()
    model.fit(x, p) 

    ### 2. Adding additional independent variables and a fixed effects variable
    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)
    
    # a fixed effects variable
    cats = choices(["France", "Norway", "Ireland"], k=n)

    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})

    # adding the fixed effects country columns
    df = pd.concat([df,pd.get_dummies(df.countries)],axis=1)
                 
    print(df)

    # ideally I would like to use the independent variables x1,x2,x3 and the fixed
    # effects column, countries, from the above df but I'm not sure how best to edit the
    # LogitRegression class to account for this. The dependent variable is a proportion.
    # x = np.array(df)
    
    model = LogitRegression()
    model.fit(x, p) 
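One caveat worth noting about the get_dummies step above (my own observation, not part of the original question): keeping all three country columns alongside an intercept makes the design matrix perfectly collinear. Passing drop_first=True keeps one country as the baseline level; a minimal sketch:

```python
# Sketch: drop one dummy level so the remaining columns are offsets
# from the baseline country rather than a collinear full set.
import pandas as pd

df = pd.DataFrame({"countries": ["France", "Norway", "Ireland", "France"]})
dummies = pd.get_dummies(df["countries"], drop_first=True)
print(dummies.columns.tolist())
```

Here France (the first level alphabetically) becomes the baseline, and the fitted intercept absorbs its effect.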

I would like the predicted output to be a proportion between 0 and 1. I previously tried sklearn's linear regression approach, but that gave predictions outside the expected range. I also looked into the statsmodels OLS function; while I could include multiple independent variables, I could not find a way to include fixed effects.
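Regarding the statsmodels route mentioned above: the formula interface (rather than the array-based OLS class) can include fixed effects through a C() categorical term. A minimal sketch with synthetic data (the variable names are illustrative assumptions, not from the question):

```python
# Sketch (assuming statsmodels is available): logit-transform the proportion,
# fit OLS with country fixed effects via C(), then back-transform predictions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "countries": rng.choice(["France", "Norway", "Ireland"], size=n),
})
# a synthetic proportion strictly between 0 and 1
p = 1 / (1 + np.exp(-(df["x1"] + rng.normal(scale=0.1, size=n))))
df["logit_p"] = np.log(p / (1 - p))  # logit-transform the response first

# C(countries) expands into country dummy columns (the fixed effects)
fit = smf.ols("logit_p ~ x1 + C(countries)", data=df).fit()
pred = 1 / (1 + np.exp(-fit.predict(df)))  # back-transform to (0, 1)
print(pred.between(0, 1).all())
```

The back-transformed predictions are guaranteed to fall in (0, 1) because of the sigmoid.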

Thanks in advance for any help with this, and please let me know if there is another suitable method I could use instead.

Comments:

Answer 1:

I managed to solve this with the small adjustments below, passing the independent and fixed-effects variables to the function via a dataframe (writing out a simplified example of the problem helped me considerably in finding the answer):

from sklearn.linear_model import LinearRegression
from random import choices
import numpy as np
import pandas as pd

class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # p must lie strictly between 0 and 1, or the logit below is undefined
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)
    

if __name__ == '__main__':
    
    # generate example data
    np.random.seed(42)
    n = 100
    
    x = np.random.randn(n).reshape(-1,1)
    
    # defining the response (dependent) variable (a proportion strictly between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5
    
    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)
    
    # a fixed effects variable
    cats = choices(["France", "Norway", "Ireland"], k=n)

    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})

    # adding the fixed effects country columns
    df = pd.concat([df,pd.get_dummies(df.countries)],axis=1)
                 
    print(df)

    # Using the independent variables x1,x2,x3 and the fixed effects column, countries, from the above df. The dependent variable is a proportion.
    # x = np.array(df)
    categories = df['countries'].unique()
    x = df.loc[:,np.concatenate((["x1","x2","x3"],categories))]
    
    model = LogitRegression()
    model.fit(x, p) 
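As a quick check of the approach above (a sketch with synthetic multi-column data, not the asker's dataset), the fitted model's predictions stay inside (0, 1) because of the sigmoid applied in predict:

```python
# Sketch: the same LogitRegression class, verified on a random
# three-feature design matrix with a synthetic proportional response.
import numpy as np
from sklearn.linear_model import LinearRegression

class LogitRegression(LinearRegression):
    def fit(self, x, p):
        p = np.asarray(p)  # proportions strictly between 0 and 1
        return super().fit(x, np.log(p / (1 - p)))

    def predict(self, x):
        return 1 / (np.exp(-super().predict(x)) + 1)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
p = 1 / (1 + np.exp(-X.sum(axis=1)))  # synthetic proportions

model = LogitRegression().fit(X, p)
preds = model.predict(X)
print(preds.min() > 0 and preds.max() < 1)  # sigmoid keeps predictions in (0, 1)
```

The same call works unchanged whether x holds one column or many, which is why only the inputs (and not the class itself) needed to change in the answer.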

Discussion:
