将多个值列重塑为宽格式
Posted
技术标签:
【中文标题】将多个值列重塑为宽格式【英文标题】:Reshape multiple value columns to wide format 【发布时间】:2012-07-21 10:21:00 【问题描述】:我有以下数据框,我想使用 cast 创建一个“数据透视表”,其中包含两个值(值和百分比)的列。 这是数据框:
expensesByMonth <- structure(list(month = c("2012-02-01", "2012-02-01", "2012-02-01",
"2012-02-01", "2012-02-01", "2012-02-01", "2012-02-01", "2012-02-01",
"2012-02-01", "2012-02-01", "2012-02-01", "2012-02-01", "2012-03-01",
"2012-03-01", "2012-03-01", "2012-03-01", "2012-03-01", "2012-03-01",
"2012-03-01", "2012-03-01", "2012-03-01", "2012-03-01", "2012-03-01",
"2012-03-01", "2012-03-01", "2012-03-01", "2012-03-01", "2012-04-01",
"2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01",
"2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01",
"2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01", "2012-04-01",
"2012-04-01", "2012-04-01", "2012-05-01", "2012-05-01", "2012-05-01",
"2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01",
"2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01",
"2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01", "2012-05-01",
"2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01",
"2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01",
"2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01",
"2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01", "2012-06-01",
"2012-07-01", "2012-07-01", "2012-07-01", "2012-07-01", "2012-07-01",
"2012-07-01", "2012-07-01", "2012-07-01", "2012-07-01", "2012-07-01",
"2012-07-01", "2012-07-01", "2012-07-01"),
expense_type = c("Adjustment", "Bank Service Charge", "Cable", "Clubbing", "Dining", "Education",
"Gifts", "Groceries", "Lunch", "Personal Care", "Rent", "Transportation",
"Adjustment", "Bank Service Charge", "Cable", "Clubbing", "Dining",
"Gifts", "Groceries", "Lunch", "Medical Expenses", "Miscellaneous",
"Personal Care", "Phone", "Recreation", "Rent", "Transportation",
"Adjustment", "Bank Service Charge", "Clothes", "Clubbing", "Computer",
"Dining", "Gifts", "Groceries", "Lunch", "Maintenance", "Medical Expenses",
"Miscellaneous", "Personal Care", "Phone", "Recreation", "Rent",
"Transportation", "Travel", "Bank Service Charge", "Cable", "Clothes",
"Clubbing", "Computer", "Dining", "Electric", "Gifts", "Groceries",
"Lunch", "Maintenance", "Medical Expenses", "Miscellaneous",
"Personal Care", "Phone", "Recreation", "Rent", "Transportation",
"Adjustment", "Bank Service Charge", "Cable", "Charity", "Clothes",
"Computer", "Dining", "Education", "Electric", "Gifts", "Groceries",
"Lunch", "Maintenance", "Medical Expenses", "Miscellaneous",
"Personal Care", "Phone", "Recreation", "Rent", "Transportation",
"Computer", "Gifts", "Groceries", "Lunch", "Maintenance", "Medical Expenses",
"Miscellaneous", "Personal Care", "Phone", "Recreation", "Rent",
"Repair and Maintenance", "Transportation"),
value = c(442.37, 200, 21.33, 75, 22.5, 1800, 10, 233.33, 154.75, 30, 545, 32.5,
2, 200, 36.33, 206.55, 74.5, 89, 372.68, 383.75, 144.19, 508.11,
30, 38.4, 81.75, 1746.7, 35, 16.37, 200, 806.9, 324.81, 756,
80.5, 100, 398.37, 326.25, 151, 29.95, 101, 90, 38.45, 61, 743.75,
129, 228.53, 200, 39.05, 237, 40, 283.83, 141.32, 32.88, 30,
424.4, 412, 142.75, 86.55, 1051.5, 30, 38.9, 51.5, 749.7, 35,
10, 200, 16, 32.59, 149.81, 100, 80, 60, 31.91, 55, 397.25, 486.4,
115.6, 47.08, 1000, 120, 41.11, 256, 761.6, 55, 10.54, 10, 342.11,
291, 76.5, 66.8, 1008, 30, 41.11, 316, 765, 65, 62),
percent = c(0.124025030980324, 0.0560729845967511, 0.00598018380724351, 0.0210273692237817,
0.0063082107671345, 0.50465686137076, 0.00280364922983756, 0.0654175474797997,
0.0433864718317362, 0.00841094768951267, 0.152798883026147, 0.00911185999697206,
0.000506462461002391, 0.0506462461002391, 0.00919989060410842,
0.0523049106600219, 0.018865726672339, 0.0225375795146064, 0.0943742149831854,
0.0971774847048337, 0.0365134111259673, 0.128669320529962, 0.00759693691503586,
0.0097240792512459, 0.0207016530934727, 0.442318990316438, 0.00886309306754183,
0.00357276925628781, 0.0436502047194601, 0.176106750940662, 0.0708901149746392,
0.164997773839559, 0.0175692073995827, 0.0218251023597301, 0.0869446602704567,
0.0712043964486193, 0.0329559045631924, 0.00653661815673915,
0.0220433533833274, 0.0196425921237571, 0.00839175185731621,
0.0133133124394353, 0.162324198800492, 0.0281543820440518, 0.0498769064226911,
0.0496724104530621, 0.00969853814096037, 0.0588618063868785,
0.00993448209061241, 0.070492601294463, 0.0350985252261336, 0.0081661442784834,
0.00745086156795931, 0.105404854981398, 0.102325165533308, 0.035453682960873,
0.0214957356235626, 0.261152697956974, 0.00745086156795931, 0.00966128383312057,
0.0127906456916635, 0.186197030583303, 0.00869267182928586, 0.00249044292527426,
0.0498088585054852, 0.00398470868043882, 0.00811635349346881,
0.0373093254635337, 0.0249044292527426, 0.0199235434021941, 0.0149426575516456,
0.00794700337455016, 0.0136974360890084, 0.09893284520652, 0.12113514388534,
0.0287895202161704, 0.0117250052921912, 0.249044292527426, 0.0298853151032911,
0.0102382108658025, 0.0637553388870211, 0.189672133188888, 0.0136974360890084,
0.00341757293956667, 0.0032424790697976, 0.110928451456846, 0.0943561409311103,
0.0248049648839517, 0.021659760186248, 0.326841890235599, 0.00972743720939281,
0.013329831455938, 0.102462338605604, 0.248049648839517, 0.0210761139536844,
0.0201033702327451)),
.Names = c("month", "expense_type", "value", "percent"),
row.names = c(NA, -96L),
class = "data.frame"
)
这是我想要创建的(当然,使用不同的标题名称,例如:[month]_value、[month]_percent):
expenses value percent value.1 percent.1 value.2 percent.2 value.3 percent.3 value.4 percent.4 value.5 percent.5
1 Adjustment 442.37 0.124025031 2.00 0.000506462 16.37 0.003572769 0.00 0.000000000 10.00 0.002490443 0.00 0.000000000
2 Bank Service Charge 200.00 0.056072985 200.00 0.050646246 200.00 0.043650205 200.00 0.049672410 200.00 0.049808859 0.00 0.000000000
3 Cable 21.33 0.005980184 36.33 0.009199891 0.00 0.000000000 39.05 0.009698538 16.00 0.003984709 0.00 0.000000000
4 Charity 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 32.59 0.008116353 0.00 0.000000000
5 Clothes 0.00 0.000000000 0.00 0.000000000 806.90 0.176106751 237.00 0.058861806 149.81 0.037309325 0.00 0.000000000
6 Clubbing 75.00 0.021027369 206.55 0.052304911 324.81 0.070890115 40.00 0.009934482 0.00 0.000000000 0.00 0.000000000
7 Computer 0.00 0.000000000 0.00 0.000000000 756.00 0.164997774 283.83 0.070492601 100.00 0.024904429 10.54 0.003417573
8 Dining 22.50 0.006308211 74.50 0.018865727 80.50 0.017569207 141.32 0.035098525 80.00 0.019923543 0.00 0.000000000
9 Education 1800.00 0.504656861 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 60.00 0.014942658 0.00 0.000000000
10 Electric 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 32.88 0.008166144 31.91 0.007947003 0.00 0.000000000
11 Gifts 10.00 0.002803649 89.00 0.022537580 100.00 0.021825102 30.00 0.007450862 55.00 0.013697436 10.00 0.003242479
12 Groceries 233.33 0.065417547 372.68 0.094374215 398.37 0.086944660 424.40 0.105404855 397.25 0.098932845 342.11 0.110928451
13 Lunch 154.75 0.043386472 383.75 0.097177485 326.25 0.071204396 412.00 0.102325166 486.40 0.121135144 291.00 0.094356141
14 Maintenance 0.00 0.000000000 0.00 0.000000000 151.00 0.032955905 142.75 0.035453683 115.60 0.028789520 76.50 0.024804965
15 Medical Expenses 0.00 0.000000000 144.19 0.036513411 29.95 0.006536618 86.55 0.021495736 47.08 0.011725005 66.80 0.021659760
16 Miscellaneous 0.00 0.000000000 508.11 0.128669321 101.00 0.022043353 1051.50 0.261152698 1000.00 0.249044293 1008.00 0.326841890
17 Personal Care 30.00 0.008410948 30.00 0.007596937 90.00 0.019642592 30.00 0.007450862 120.00 0.029885315 30.00 0.009727437
18 Phone 0.00 0.000000000 38.40 0.009724079 38.45 0.008391752 38.90 0.009661284 41.11 0.010238211 41.11 0.013329831
19 Recreation 0.00 0.000000000 81.75 0.020701653 61.00 0.013313312 51.50 0.012790646 256.00 0.063755339 316.00 0.102462339
20 Rent 545.00 0.152798883 1746.70 0.442318990 743.75 0.162324199 749.70 0.186197031 761.60 0.189672133 765.00 0.248049649
21 Repair and Maintenance 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 65.00 0.021076114
22 Transportation 32.50 0.009111860 35.00 0.008863093 129.00 0.028154382 35.00 0.008692672 55.00 0.013697436 62.00 0.020103370
23 Travel 0.00 0.000000000 0.00 0.000000000 228.53 0.049876906 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000
在对单个值列使用强制转换时,我还遇到了以下错误:它没有考虑“值”参数。因此,即使我指定 value = "percent",它仍然会显示 "value" 列中的值。
cast(expensesByMonth, expense_type ~ month, fun.aggregate = sum, value = "percent")
【问题讨论】:
下面的答案很有帮助;谢谢。不过,我想知道:这项任务可以通过 tidyr 完成吗? 【参考方案1】:您最好的选择是将数据重新整形为长格式,使用melt
,然后使用dcast
:
library(reshape2)
meltExpensesByMonth <- melt(expensesByMonth, id.vars=1:2)
dcast(meltExpensesByMonth, expense_type ~ month + variable, fun.aggregate = sum)
输出的前几行:
expense_type 2012-02-01_value 2012-02-01_percent 2012-03-01_value 2012-03-01_percent
1 Adjustment 442.37 0.124025031 2.00 0.0005064625
2 Bank Service Charge 200.00 0.056072985 200.00 0.0506462461
3 Cable 21.33 0.005980184 36.33 0.0091998906
4 Charity 0.00 0.000000000 0.00 0.0000000000
【讨论】:
谢谢,这就是我想要的。另外,你知道为什么演员表没有按预期工作吗? cast(expensesByMonth,expense_type ~ 月, fun.aggregate = sum, value = "percent")cast
是(现已失效的)包reshape
中的一个函数。它已在reshape2
中替换为dcast
和acast
。我不再安装旧版本。
这个问题/答案是纯艺术。谢谢。
如果我每个月有多个相同费用类型的条目,我该怎么做?我不想添加之前的值。【参考方案2】:
data.table 可以转换为多个value.var
变量。这是非常直接(且有效)的。
因此:
library(data.table) # v1.9.5+
dcast(setDT(expensesByMonth), expense_type ~ month, value.var = c("value", "percent"))
【讨论】:
【参考方案3】:现在可以使用 tidyr 1.0.0 中引入的新功能 pivot_wider()
将具有多个值/度量列的长格式重塑为宽格式。
这比spread()
比spread()
优于之前的 tidyr 策略spread()
,因为不再删除属性(例如,日期仍然是日期,字符串仍然是字符串)。
pivot_wider()
(对应方:pivot_longer()
)的工作方式类似于spread()
。
但是,它提供了额外的功能,例如多值列。
为此,参数values_from
——指示取自哪一列的值——可能采用多个列名。
NA
s 可以使用参数values_fill
填充。
library("tidyr")
library("magrittr")
pivot_wider(expensesByMonth,
id_cols = expense_type,
names_from = month,
values_from = c(value, percent))
#> # A tibble: 23 x 13
#> expense_type `value_2012-02-~ `value_2012-03-~ `value_2012-04-~
#> <chr> <dbl> <dbl> <dbl>
#> 1 Adjustment 442. 2 16.4
#> 2 Bank Servic~ 200 200 200
#> 3 Cable 21.3 36.3 NA
#> 4 Clubbing 75 207. 325.
#> 5 Dining 22.5 74.5 80.5
#> 6 Education 1800 NA NA
#> 7 Gifts 10 89 100
#> 8 Groceries 233. 373. 398.
#> 9 Lunch 155. 384. 326.
#> 10 Personal Ca~ 30 30 90
#> # ... with 13 more rows, and 9 more variables: `value_2012-05-01` <dbl>,
#> # `value_2012-06-01` <dbl>, `value_2012-07-01` <dbl>,
#> # `percent_2012-02-01` <dbl>, `percent_2012-03-01` <dbl>,
#> # `percent_2012-04-01` <dbl>, `percent_2012-05-01` <dbl>,
#> # `percent_2012-06-01` <dbl>, `percent_2012-07-01` <dbl>
或者,可以使用提供更好控制的pivot规范来完成重塑(参见下面的链接):
# see also ?build_wider_spec
spec <- expensesByMonth %>%
expand(month, .value = c("percent", "value")) %>%
dplyr::mutate(.name = paste(.$month, .$.value, sep = "_"))
pivot_wider_spec(expensesByMonth, spec = spec)
由reprex package (v0.2.1) 于 2019 年 3 月 26 日创建
另见:https://tidyr.tidyverse.org/dev/articles/pivot.html
【讨论】:
【参考方案4】:由于这个问题经常被访问,我认为它也应该得到一个完整的基本 R 答案。来自基本 R 的 reshape
-function 非常通用,也可以轻松应用于此问题:
expenses <- reshape(expensesByMonth, idvar = 'expense_type', direction = 'wide',
timevar = 'month', sep = '_')
带有NA
-values 的单元格可以用0
替换为:
expenses[is.na(expenses)] <- 0
给出(由expense_type
排序以便更容易与所需输出进行比较):
> expenses[order(expenses$expense_type),] expense_type value_2012-02-01 percent_2012-02-01 value_2012-03-01 percent_2012-03-01 value_2012-04-01 percent_2012-04-01 value_2012-05-01 percent_2012-05-01 value_2012-06-01 percent_2012-06-01 value_2012-07-01 percent_2012-07-01 1 Adjustment 442.37 0.124025031 2.00 0.0005064625 16.37 0.003572769 0.00 0.000000000 10.00 0.002490443 0.00 0.000000000 2 Bank Service Charge 200.00 0.056072985 200.00 0.0506462461 200.00 0.043650205 200.00 0.049672410 200.00 0.049808859 0.00 0.000000000 3 Cable 21.33 0.005980184 36.33 0.0091998906 0.00 0.000000000 39.05 0.009698538 16.00 0.003984709 0.00 0.000000000 67 Charity 0.00 0.000000000 0.00 0.0000000000 0.00 0.000000000 0.00 0.000000000 32.59 0.008116353 0.00 0.000000000 30 Clothes 0.00 0.000000000 0.00 0.0000000000 806.90 0.176106751 237.00 0.058861806 149.81 0.037309325 0.00 0.000000000 4 Clubbing 75.00 0.021027369 206.55 0.0523049107 324.81 0.070890115 40.00 0.009934482 0.00 0.000000000 0.00 0.000000000 32 Computer 0.00 0.000000000 0.00 0.0000000000 756.00 0.164997774 283.83 0.070492601 100.00 0.024904429 10.54 0.003417573 5 Dining 22.50 0.006308211 74.50 0.0188657267 80.50 0.017569207 141.32 0.035098525 80.00 0.019923543 0.00 0.000000000 6 Education 1800.00 0.504656861 0.00 0.0000000000 0.00 0.000000000 0.00 0.000000000 60.00 0.014942658 0.00 0.000000000 52 Electric 0.00 0.000000000 0.00 0.0000000000 0.00 0.000000000 32.88 0.008166144 31.91 0.007947003 0.00 0.000000000 7 Gifts 10.00 0.002803649 89.00 0.0225375795 100.00 0.021825102 30.00 0.007450862 55.00 0.013697436 10.00 0.003242479 8 Groceries 233.33 0.065417547 372.68 0.0943742150 398.37 0.086944660 424.40 0.105404855 397.25 0.098932845 342.11 0.110928451 9 Lunch 154.75 0.043386472 383.75 0.0971774847 326.25 0.071204396 412.00 0.102325166 486.40 0.121135144 291.00 0.094356141 37 Maintenance 0.00 0.000000000 0.00 0.0000000000 151.00 0.032955905 142.75 0.035453683 115.60 0.028789520 76.50 0.024804965 21 Medical Expenses 0.00 0.000000000 144.19 0.0365134111 29.95 0.006536618 86.55 0.021495736 47.08 0.011725005 66.80 0.021659760 22 Miscellaneous 0.00 0.000000000 508.11 0.1286693205 101.00 0.022043353 1051.50 0.261152698 1000.00 0.249044293 1008.00 0.326841890 10 Personal Care 30.00 0.008410948 30.00 0.0075969369 90.00 0.019642592 30.00 0.007450862 120.00 0.029885315 30.00 0.009727437 24 Phone 0.00 0.000000000 38.40 0.0097240793 38.45 0.008391752 38.90 0.009661284 41.11 0.010238211 41.11 0.013329831 25 Recreation 0.00 0.000000000 81.75 0.0207016531 61.00 0.013313312 51.50 0.012790646 256.00 0.063755339 316.00 0.102462339 11 Rent 545.00 0.152798883 1746.70 0.4423189903 743.75 0.162324199 749.70 0.186197031 761.60 0.189672133 765.00 0.248049649 95 Repair and Maintenance 0.00 0.000000000 0.00 0.0000000000 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000 65.00 0.021076114 12 Transportation 32.50 0.009111860 35.00 0.0088630931 129.00 0.028154382 35.00 0.008692672 55.00 0.013697436 62.00 0.020103370 45 Travel 0.00 0.000000000 0.00 0.0000000000 228.53 0.049876906 0.00 0.000000000 0.00 0.000000000 0.00 0.000000000
您也可以使用tidyverse
:
library(dplyr)
library(tidyr)
expensesByMonth %>%
gather(k, v, 3:4) %>%
unite(km, k, month) %>%
spread(km, v, fill = 0)
【讨论】:
【参考方案5】:为此,我更喜欢 tables
包中的 tabulate
函数。它需要因素,但无论如何这对于您拥有的数据类型来说是一个好主意。
library(tables)
expensesByMonth$month= as.factor(expensesByMonth$month)
expensesByMonth$expense_type= as.factor(expensesByMonth$expense_type)
tabular(expense_type~(month)*(value+percent)*(sum),data=expensesByMonth)
# Optional formatting
tabular(expense_type~month*
((Format(digits=1))*value+(Format(digits=3))*percent)*sum,
data=expensesByMonth)
部分输出:
value percent value percent value percent
expense_type sum sum sum sum sum sum
Adjustment 442 0.124025 2 0.000506 16 0.003573
Bank Service Charge 200 0.056073 200 0.050646 200 0.043650
Cable 21 0.005980 36 0.009200 0 0.000000
【讨论】:
有些 () 是多余的,但有时我会保留这些以防我想添加更多术语。 我以前从未使用过这门课。如果这不是太多要求,您能否提供一个如何从中提取数据的示例? 这不是单行的。该软件包的作者 Duncan Murdoch 编写了一个与 tables 软件包一起提供的漂亮插图。请参阅包文档/指南和小插图概述。 所以没有简单的方法来恢复过程以防万一您需要(假设)?感谢您的回答。 不,那是 reshape (base) 的域和 reshape 包中的函数。表格是纯显示。以上是关于将多个值列重塑为宽格式的主要内容,如果未能解决你的问题,请参考以下文章