Iterating a regression ML model by group

Posted 2023-03-12

【Title】: Iterating Regression ML model by group 【Posted】: 2021-12-16 12:14:37 【Question】:

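I have a data frame with several hundred thousand rows, grouped by US county. I have already trained and tested my model on the nationwide data, but I want to check whether fitting the model county by county improves accuracy.

To do that, I want to run a decision tree regression (DTR) for each county, which means doing a train/test split within each group and then fitting a DTR on each group. However, I have not managed to split the data by group, and I do not know how to fit a DTR per group.

I am also not sure whether I need to run it by group at all, since I understand that a DTR treats the county name as categorical data and learns from it. I would still like to test the per-county approach.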

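import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the real county-level data frame
df3 = pd.DataFrame({
  'y': np.random.randn(20),
  'a': np.random.randn(20),
  'b': np.random.randn(20),
  'color': ['alf', 'bet', 'sar', 'tep'] * 5,
  'county': ['a', 'b'] * 10})

df3.head()


X = df3.drop('y', axis=1)
y = df3.y

# scikit-learn trees need numeric input, so one-hot encode the string columns
X_encoded = pd.get_dummies(X, columns=['color', 'county'], dtype=float)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=0)


# max_features='auto' was removed in scikit-learn 1.3; for regression trees it
# meant "use all features", which is what None does
regressor = DecisionTreeRegressor(max_depth=10, max_features=None, min_samples_leaf=5,
                      min_samples_split=5, random_state=42)
regressor.fit(X_train, y_train)
regressor.score(X_test, y_test)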

【Answer 1】:

What is wrong with simply looping over the counties and splitting the data for each one?

from sklearn.model_selection import train_test_split

regressors = {}  # one fitted model per county
for county in set(X['county']):
    # Keep only this county's rows and the numeric features 'a' and 'b'
    X_train, X_test, y_train, y_test = train_test_split(X[X['county']==county][['a','b']], 
            y[X['county']==county], test_size=0.2, random_state=0)

    # max_features='auto' was removed in scikit-learn 1.3; None uses all features
    regressor = DecisionTreeRegressor(max_depth=10, max_features=None, min_samples_leaf=5,
                          min_samples_split=5, random_state=42)
    regressor.fit(X_train, y_train)
    print(regressor.score(X_test, y_test),end='\n\n')
    regressors[county] = regressor
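
The same loop can also be written with `DataFrame.groupby`, which hands back each county's rows directly. The following is only a sketch under a few assumptions: the helper name `fit_per_group` and its `min_rows` threshold are made up for illustration, the toy frame mimics the question's `df3`, and `max_features` is left at its default (all features).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_per_group(df, group_col, feature_cols, target_col, min_rows=10):
    """Fit one DecisionTreeRegressor per group (hypothetical helper).

    Returns two dicts keyed by group name: fitted models and R^2 test scores.
    """
    models, scores = {}, {}
    for name, group in df.groupby(group_col):
        if len(group) < min_rows:
            # Too few rows for a useful train/test split; skip this group.
            continue
        X_train, X_test, y_train, y_test = train_test_split(
            group[feature_cols], group[target_col],
            test_size=0.2, random_state=0)
        model = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5,
                                      min_samples_split=5, random_state=42)
        model.fit(X_train, y_train)
        models[name] = model
        scores[name] = model.score(X_test, y_test)  # R^2 on held-out rows
    return models, scores


# Toy data shaped like the question's df3 (random values, illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y': rng.normal(size=20),
    'a': rng.normal(size=20),
    'b': rng.normal(size=20),
    'county': ['a', 'b'] * 10,
})

models, scores = fit_per_group(df, 'county', ['a', 'b'], 'y')
print(scores)
```

Skipping small groups is a design choice, not a requirement: with real county data some groups may hold only a handful of rows, and a score computed on one or two test rows says very little.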

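As for whether this is appropriate for your data, I cannot say. It depends on your implementation and on how you want the information incorporated into the model.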
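
One way to check this empirically is to score a single pooled model (with the county one-hot encoded) and the per-county models on the same held-out rows. The sketch below is not taken from the thread: the synthetic data, the `make_tree` helper, and the choice of mean squared error as the metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the effect of 'a' on 'y' differs by county.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'county': rng.choice(['a', 'b'], size=n),
})
slope = np.where(df['county'] == 'a', 2.0, -1.0)
df['y'] = slope * df['a'] + rng.normal(scale=0.1, size=n)

# One split shared by both approaches, stratified so every county is in both parts.
train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=df['county'])


def make_tree():
    # Hypothetical helper so both approaches use identical hyperparameters.
    return DecisionTreeRegressor(max_depth=10, min_samples_leaf=5,
                                 min_samples_split=5, random_state=42)


# Pooled model: county enters as one-hot encoded columns.
X_train_pooled = pd.get_dummies(train[['a', 'b', 'county']],
                                columns=['county'], dtype=float)
X_test_pooled = pd.get_dummies(test[['a', 'b', 'county']],
                               columns=['county'], dtype=float)
X_test_pooled = X_test_pooled.reindex(columns=X_train_pooled.columns,
                                      fill_value=0.0)
pooled = make_tree().fit(X_train_pooled, train['y'])
pooled_pred = pd.Series(pooled.predict(X_test_pooled), index=test.index)

# Per-county models: one tree per county, predictions written back by row index.
per_county_pred = pd.Series(index=test.index, dtype=float)
for county, train_rows in train.groupby('county'):
    model = make_tree().fit(train_rows[['a', 'b']], train_rows['y'])
    test_rows = test[test['county'] == county]
    per_county_pred.loc[test_rows.index] = model.predict(test_rows[['a', 'b']])

pooled_mse = mean_squared_error(test['y'], pooled_pred)
per_county_mse = mean_squared_error(test['y'], per_county_pred)
print(f"pooled MSE:     {pooled_mse:.4f}")
print(f"per-county MSE: {per_county_mse:.4f}")
```

Which approach comes out ahead depends on the data, so the numbers from this toy example should not be read as a general result.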

【Comments】:

It gives me an endless list of errors.

Hmm, how so? It runs fine on my machine. Did you simply copy and paste it into the right place?
