How to correlate multiple subsets in R

Posted 2023-03-12

Asked: 2020-11-11 12:34:49

Question:

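How do I correlate each of 8 subsets with two different dependent variables? I keep getting the same correlation coefficients for two different subsets (example below). Here is the input: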

with(subset(mydata2, PARTYID_Strength = 1), cor.test(PARTYID_Strength,
                                                     mean.legit))

with(subset(mydata2, PARTYID_Strength = 1), cor.test(PARTYID_Strength,
                                                     mean.leegauthor))

with(subset(mydata2, PARTYID_Strength = 2), cor.test(PARTYID_Strength,
                                                     mean.legit))

with(subset(mydata2, PARTYID_Strength = 2), cor.test(PARTYID_Strength,
                                                     mean.leegauthor))

Output (I get this for both PARTYID_Strength = 1 and 2):

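        Pearson's product-moment correlation

data:  PARTYID_Strength and mean.legit
t = 3.1005, df = 607, p-value = 0.002022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.0458644 0.2023031
sample estimates:
      cor 
0.1248597 

        Pearson's product-moment correlation

data:  PARTYID_Strength and mean.leegauthor
t = 2.8474, df = 607, p-value = 0.004557
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.03568431 0.19250344
sample estimates:
      cor 
0.1148091 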

Sample data:

> dput(head(mydata2, 10))
structure(list(PARTYID = c(1, 3, 1, 1, 1, 4, 3, 1, 1, 1), PARTYID_Other = c("NA", 
"NA", "NA", "NA", "NA", "Green", "NA", "NA", "NA", "NA"), PARTYID_Strength = c(1, 
7, 1, 2, 1, 8, 1, 6, 1, 1), PARTYID_Strength_Other = c("NA", 
"NA", "NA", "NA", "NA", "Green", "NA", "NA", "NA", "NA"), THERM_Dem = c(80, 
65, 85, 30, 76, 15, 55, 62, 90, 95), THERM_Rep = c(1, 45, 10, 
5, 14, 14, 0, 4, 10, 3), Gender = c("Female", "Male", "Male", 
"Female", "Female", "Male", "Male", "Female", "Female", "Male"
), `MEAN Age` = c(29.5, 49.5, 29.5, 39.5, 29.5, 21, 39.5, 39.5, 
29.5, 65), Age = c("25 - 34", "45 - 54", "25 - 34", "35 - 44", 
"25 - 34", "18 - 24", "35 - 44", "35 - 44", "25 - 34", "65+"), 
Ethnicity = c("White or Caucasian", "Asian or Asian American", 
"White or Caucasian", "White or Caucasian", "Hispanic or Latino", 
"White or Caucasian", "White or Caucasian", "White or Caucasian", 
"White or Caucasian", "White or Caucasian"), Ethnicity_Other = c("NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"), States = c("Texas", 
"Texas", "Ohio", "Texas", "Puerto Rico", "New Hampshire", 
"South Carolina", "Texas", "Texas", "Texas"), Education = c("Master's degree", 
"Bachelor's degree in college (4-year)", "Bachelor's degree in college (4-year)", 
"Master's degree", "Master's degree", "Less than high school degree", 
"Some college but no degree", "Master's degree", "Master's degree", 
"Some college but no degree"), `MEAN Income` = c(30000, 140000, 
150000, 60000, 80000, 30000, 30000, 120000, 150000, 60000
), Income = c("Less than $30,000", "$130,001 to $150,000", 
"More than $150,000", "$50,001 to $70,000", "$70,001 to $90,000", 
"Less than $30,000", "Less than $30,000", "$110,001 to $130,000", 
"More than $150,000", "$50,001 to $70,000"), mean.partystrength = c(3.875, 
2.875, 2.375, 3.5, 2.625, 3.125, 3.375, 3.125, 3.25, 3.625
), mean.traitrep = c(2.5, 2.625, 2.25, 2.625, 2.75, 1.875, 
2.75, 2.875, 2.75, 3), mean.traitdem = c(2.25, 2.625, 2.375, 
2.75, 2.625, 2.125, 1.875, 3, 2, 2.5), mean.leegauthor = c(1, 
2, 2, 4, 1, 4, 1, 1, 1, 1), mean.legit = c(1.71428571428571, 
3.28571428571429, 2.42857142857143, 2.42857142857143, 2.14285714285714, 
1.28571428571429, 1.42857142857143, 1.14285714285714, 2.14285714285714, 
1.28571428571429)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Thanks!

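Comments:

A logical test needs == rather than =, so write PARTYID_Strength == 1.

@dcarlson Thank you! Although now I get this result: Pearson's product-moment correlation data: PARTYID_Strength and mean.legit t = NA, df = 67, p-value = NA alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: NA NA sample estimates: cor NA

You selected only the rows with PARTYID_Strength == 1, so that variable is a constant. The correlation of a constant with any other variable cannot be computed (its standard deviation is zero), which is why R returns NA. If you want to subset the data, don't use the subsetting variable in the correlation.

@dcarlson Ah, I see. So maybe I shouldn't measure the parties separately, but grouped together? Also, if I use = instead of ==, what was my original formula actually measuring?

It did nothing. R did not complain; subset() just returned the original data, so every call ran on the full data set.

Answer 1: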

To run the tests, create a vector with the names of the columns of interest, then sapply an anonymous function over that vector.

fixed <- "PARTYID_Strength"
cols <- c("mean.leegauthor", "mean.legit")

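cor_test_result <- sapply(cols, function(x) {
  # Build a one-sided formula such as ~ PARTYID_Strength + mean.legit
  fmla <- paste(fixed, x, sep = "+")
  fmla <- as.formula(paste("~", fmla))
  # Formula interface of cor.test: the second argument is the data frame
  cor.test(fmla, mydata2)
}, simplify = FALSE)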

cor_test_result$mean.leegauthor
#
#        Pearson's product-moment correlation
#
#data:  PARTYID_Strength and mean.leegauthor
#t = 1.4804, df = 8, p-value = 0.177
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# -0.2343269  0.8462610
#sample estimates:
#      cor 
#0.4637152
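
The comments under the question make two points that can be sketched together: filtering needs `==`, and the subsetting variable is constant inside each subset, so it cannot be one of the correlated variables. The sketch below is only an illustration under assumptions that are not in the original post: the data frame `toy` is invented, the helper `cor_by_group` is hypothetical, and it assumes the goal is to correlate `mean.partystrength` with each of the two outcome columns within every `PARTYID_Strength` group.

```r
# Toy data, invented for illustration (not the poster's data set).
toy <- data.frame(
  PARTYID_Strength   = rep(c(1, 2), each = 5),
  mean.partystrength = c(3.9, 2.9, 2.4, 3.5, 2.6, 3.1, 3.4, 3.1, 3.3, 3.6),
  mean.legit         = c(1.7, 3.3, 2.4, 2.4, 2.1, 1.3, 1.4, 1.1, 2.1, 1.3),
  mean.leegauthor    = c(1, 2, 2, 4, 1, 4, 1, 1, 1, 1)
)

# With `=`, subset() receives a stray named argument and filters nothing;
# with `==`, it receives a logical test and keeps only the matching rows.
n_with_assign <- nrow(subset(toy, PARTYID_Strength = 1))
n_with_equals <- nrow(subset(toy, PARTYID_Strength == 1))

outcomes <- c("mean.legit", "mean.leegauthor")

# Hypothetical helper: split the data by the grouping column, then correlate
# one predictor with each outcome inside every group. The grouping column
# itself is never passed to cor.test().
cor_by_group <- function(data, group, predictor, outcomes) {
  lapply(split(data, data[[group]]), function(d) {
    sapply(outcomes, function(y) {
      unname(cor.test(d[[predictor]], d[[y]])$estimate)
    })
  })
}

res <- cor_by_group(toy, "PARTYID_Strength", "mean.partystrength", outcomes)
res
```

`res` is a list with one element per group, each holding one coefficient per outcome. Returning the full `cor.test()` objects instead of `$estimate` would also keep the p-values and confidence intervals.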

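Comments:

Thank you so much! Some follow-up questions: 1. I get the results below, I assume because I have the full data set (n = 609): data: PARTYID_Strength and mean.leegauthor t = 2.8474, df = 607, p-value = 0.004557 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.03568431 0.19250344 sample estimates: cor 0.1148091 2. Why do these formulas give different results for the same measures? cor.test(PARTYID_Strength, mean.legit + mean.leegauthor, data = mydata2) cor(mydata2$PARTYID_Strength,mydata2$mean.legit + mean.leegauthor)

@LisaByers The plus sign in a formula is not the addition operator, and in the calls you posted in your comment you are not using the formula interface. See the help pages for cor and cor.test.

I see, thanks. I'm new to this and still learning it all! Thank you for your help.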
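
The difference described in the last reply can be shown in a few lines. In the formula interface, `~ x + y` only names the two variables to correlate. In the default (vector) interface, `y + z` is ordinary arithmetic and produces an element-wise sum. This is a small sketch with made-up values; the data frame `d` and its columns `x`, `y`, `z` are illustrative and not taken from the post.

```r
# Made-up values for illustration only.
d <- data.frame(
  x = c(1, 7, 1, 2, 1, 8, 1, 6),
  y = c(1.7, 3.3, 2.4, 2.4, 2.1, 1.3, 1.4, 1.1),
  z = c(1, 2, 2, 4, 1, 4, 1, 1)
)

# Formula interface: `+` separates the two variables; nothing is added up.
r_formula <- cor.test(~ x + y, data = d)$estimate

# Default interface: the same pair passed as two vectors.
r_vectors <- cor.test(d$x, d$y)$estimate

# Here `+` is arithmetic: x is correlated with the element-wise sum y + z,
# which is a different quantity from the correlation of x with y.
r_sum <- cor(d$x, d$y + d$z)

# The two interfaces agree on the x-y correlation.
same_pair <- isTRUE(all.equal(unname(r_formula), unname(r_vectors)))
```

Note that the vector form takes columns as `d$x`, not bare names with a `data =` argument; `data` belongs to the formula method, as the `cor.test` help page describes.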
