Python linear regression diagnostic plots similar to R

Posted 2023-03-12

【Title】Python linear regression diagnostic plots similar to R 【Posted】2018-03-18 09:12:26 【Question】:

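I am trying to get diagnostic plots for a linear regression in Python, and I would like to know whether there is a quick way to do this.

In R, you can use the code snippet below, which gives you the Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage plots.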

m1 <- lm(cost~ distance, data = df1)
summary(m1)
plot(m1)

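Is there a quick way to do this in Python?

There is a great blog post that shows how to get the same plots as R using Python code, but it takes quite a lot of code (at least compared with the R approach). Link: https://underthecurve.github.io/jekyll/update/2016/07/01/one-regression-six-ways.html#Python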

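【Comments】:

You could write a function or module, import it, and then use a one-liner such as my_plot(formula, data). That is also what R does under the hood. Some R code that may be (not sure, sorry) the source of plot: github.com/SurajGupta/r-source/blob/master/src/library/stats/R/…

【Answer 1】: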

I prefer to keep everything in pandas and to plot with DataFrame.plot() wherever possible:

from matplotlib import pyplot as plt
from pandas import DataFrame
import scipy.stats as stats
import statsmodels.api as sm


def linear_regression(df: DataFrame) -> DataFrame:
    """Perform a univariate regression and store results in a new data frame.

    Args:
        df (DataFrame): original data set with x and y.

    Returns:
        DataFrame: another dataframe with raw data and results.
    """
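    # Add an intercept column; sm.OLS does not include one by default (unlike R's lm).
    mod = sm.OLS(endog=df['y'], exog=sm.add_constant(df['x'])).fit()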
    influence = mod.get_influence()

    res = df.copy()
    res['resid'] = mod.resid
    res['fittedvalues'] = mod.fittedvalues
    # Internally studentized residuals, which R reports as standardized residuals.
    res['resid_std'] = influence.resid_studentized_internal
    res['leverage'] = influence.hat_matrix_diag
    return res


def plot_diagnosis(df: DataFrame):
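    # Set the style before creating the figure so that it applies to the axes.
    # The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6.
    plt.style.use('seaborn-v0_8')
    fig, axes = plt.subplots(nrows=2, ncols=2)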

    # Residual against fitted values.
    df.plot.scatter(
        x='fittedvalues', y='resid', ax=axes[0, 0]
    )
    axes[0, 0].axhline(y=0, color='grey', linestyle='dashed')
    axes[0, 0].set_xlabel('Fitted Values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

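    # Normal Q-Q plot: residuals are standardized by fit=True and compared
    # with normal quantiles (not t quantiles), matching the panel title.
    sm.qqplot(
        df['resid'], dist=stats.norm, fit=True, line='45',
        ax=axes[0, 1], c='#4C72B0'
    )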
    axes[0, 1].set_title('Normal Q-Q')

    # The scale-location plot: sqrt(|standardized residuals|) vs. fitted values.
    df.assign(resid_std_sqrt=df['resid_std'].abs() ** 0.5).plot.scatter(
        x='fittedvalues', y='resid_std_sqrt', ax=axes[1, 0]
    )
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('Sqrt(|standardized residuals|)')
    axes[1, 0].set_title('Scale-Location')

    # Standardized residuals vs. leverage
    df.plot.scatter(
        x='leverage', y='resid_std', ax=axes[1, 1]
    )
    axes[1, 1].axhline(y=0, color='grey', linestyle='dashed')
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    plt.tight_layout()
    plt.show()
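
The functions above expect columns named x and y. As a rough sketch of the same idea with the statsmodels formula interface, which is closer to the R call lm(cost ~ distance, data = df1), something like the following could be used. The helper name diagnostics_frame and the synthetic df1 are assumptions made for illustration only; the output column names match those used by plot_diagnosis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def diagnostics_frame(formula: str, data: pd.DataFrame) -> pd.DataFrame:
    """Fit OLS from an R-style formula and collect diagnostic columns.

    Hypothetical helper: the formula interface adds the intercept
    automatically, as R's lm() does.
    """
    mod = smf.ols(formula, data=data).fit()
    influence = mod.get_influence()

    res = data.copy()
    res['resid'] = mod.resid
    res['fittedvalues'] = mod.fittedvalues
    # Internally studentized residuals (R's "standardized residuals").
    res['resid_std'] = influence.resid_studentized_internal
    res['leverage'] = influence.hat_matrix_diag
    return res


# Synthetic stand-in for df1; these numbers are made up for illustration.
rng = np.random.default_rng(0)
distance = rng.uniform(1, 100, size=50)
df1 = pd.DataFrame({
    'distance': distance,
    'cost': 5.0 + 0.8 * distance + rng.normal(0, 4, size=50),
})

diag = diagnostics_frame('cost ~ distance', df1)
print(diag.head())
```

The returned frame can then be passed to plot_diagnosis(diag).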

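Many features are still missing, but this is a good start. I learned how to extract the influence statistics from this question: Access standardized residuals, cook's values, hatvalues (leverage) etc. easily in Python?

By the way, there is a package, dynobo/lmdiag, that has all of these features.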
