How to create a sankey diagram when certain values are omitted


【Posted】: 2021-10-24 14:30:03 【Question】:

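I need to create a sankey diagram spanning 3 years in R with plotly. My group column should provide the nodes (1 == worst, 2 == bad, 3 == good, 4 == best), but in 2019 and 2020 I have/need an additional node 5 == not available.

My data is very large, so I am only showing a short snippet: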

dt.2018 <- structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
2018L, 2018L, 2018L, 2018L), GPNRPlan = c(100236L, 101554L, 111328L, 
124213L, 127434L, 128509L, 130058L, 130192L, 130224L, 130309L
), TB.Info = c("Below TB", "Over TB", "In TB", "In TB", "In TB", 
"Below TB", "Over TB", "Below TB", "Below TB", "Below TB"), Qeff = c(-0.01, 
0, 0, 0, 0, 0, 0, 0, -0.01, -0.01), group = c(1, 1, 3, 4, 2, 
2, 1, 4, 2, 3)), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L))

dt.2019 <- structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 
2019L, 2019L, 2019L, 2019L), GPNRPlan = c(100236L, 101554L, 111328L, 
124213L, 127434L, 128003L, 128509L, 130058L, 130192L, 130351L
), TB.Info = c("Below TB", "Over TB", "In TB", "In TB", "In TB", 
"Over TB", "In TB", "Over TB", "Below TB", "Over TB"), Qeff = c(-0.01, 
0.04, -0.01, 0, 0, 0, 0, 0, 0, 0), group = c(1, 2, 3, 1, 2, 4, 
1, 1, 3, 2)), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L))

dt.2020 <- structure(list(Year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 
2020L, 2020L, 2020L, 2020L), GPNRPlan = c(100236L, 111328L, 128003L, 
130058L, 130192L, 133874L, 135886L, 137792L, 138153L, 142309L
), TB.Info = c("Below TB", "In TB", "Over TB", "Below TB", "Below TB", 
"Over TB", "Below TB", "Over TB", "Over TB", "In TB"), Qeff = c(0, 
-0.01, 0, 0, -0.01, 0.02, -0.01, -0.01, 0.01, 0), group = c(2, 
3, 1, 4, 2, 3, 1, 1, 2, 4)), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))

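Now I want to see which customers from 2018 (customer ID == GPNRPlan) are still in the same group in 2019 and which have changed groups. Customers who no longer appear in 2019 should be assigned to group 5, i.e. not available. The same should apply from 2019 to 2020. How can this be done?

Is it possible to show 2018 through 2020 in the same sankey diagram?

For this example, my sankey diagram would look like this (made by hand):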

【Comments】:

【Answer 1】:

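This is mostly a matter of getting the data into the right format.

I joined the different data.tables to obtain the NA values.

Also, please check the different arrangement options. I don't think the output you are asking for can be achieved 100%: either the nodes overlap, or using "snap" changes the order of the nodes.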
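As a minimal sketch of why the join produces the NA values, here is a left join on toy data. The IDs and column names (`id`, `src`, `tgt`) are made up for illustration and are not taken from the question; the real code further down joins on `GPNRPlan`.

```r
library(data.table)

# Toy data (hypothetical IDs): customer 2 is missing in the target year
src <- data.table(id = c(1L, 2L, 3L), group = c(1, 2, 3))
tgt <- data.table(id = c(1L, 3L), group = c(1, 4))

# Left join: keep every source row; unmatched target columns become NA
joined <- merge(src, tgt, by = "id", all.x = TRUE,
                suffixes = c(".source", ".target"))

# paste0() renders NA as the string "NA", which yields a usable node ID
joined[, node_id.target := paste0(2019, "_", group.target)]
print(joined)
```

The customer without a match ends up with the node ID `2019_NA`, which is what lets a lookup table map the string "NA" to a "not available" node.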

library(data.table)
library(plotly)
library(scales)

dt.2018 <- structure(list(Year = c(2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L),
                          GPNRPlan = c(100236L, 101554L, 111328L, 124213L, 127434L, 128509L, 130058L, 130192L, 130224L, 130309L),
                          TB.Info = c("Below TB", "Over TB", "In TB", "In TB", "In TB", "Below TB", "Over TB", "Below TB", "Below TB", "Below TB"),
                          Qeff = c(-0.01, 0, 0, 0, 0, 0, 0, 0, -0.01, -0.01), 
                          group = c(1, 1, 3, 4, 2, 2, 1, 4, 2, 3)),
                     class = c("data.table", "data.frame"), row.names = c(NA, -10L))

dt.2019 <- structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L), 
                          GPNRPlan = c(100236L, 101554L, 111328L, 124213L, 127434L, 128003L, 128509L, 130058L, 130192L, 130351L), 
                          TB.Info = c("Below TB", "Over TB", "In TB", "In TB", "In TB", "Over TB", "In TB", "Over TB", "Below TB", "Over TB"), 
                          Qeff = c(-0.01, 0.04, -0.01, 0, 0, 0, 0, 0, 0, 0),
                          group = c(1, 2, 3, 1, 2, 4, 1, 1, 3, 2)),
                     class = c("data.table", "data.frame"), row.names = c(NA, -10L))

dt.2020 <- structure(list(Year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L), 
                          GPNRPlan = c(100236L, 111328L, 128003L, 130058L, 130192L, 133874L, 135886L, 137792L, 138153L, 142309L), 
                          TB.Info = c("Below TB", "In TB", "Over TB", "Below TB", "Below TB", "Over TB", "Below TB", "Over TB", "Over TB", "In TB"), 
                          Qeff = c(0, -0.01, 0, 0, -0.01, 0.02, -0.01, -0.01, 0.01, 0), group = c(2, 3, 1, 4, 2, 3, 1, 1, 2, 4)),
                     class = c("data.table", "data.frame"), row.names = c(NA, -10L))

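# Lookup table: group code -> display name and color
# ("NA" stands for customers missing in the target year)
lookUpDT <- data.table(
  group = c(as.character(1:4), "NA"),
  group_name = c("worst", "bad", "good", "best", "not available"),
  color = c("red", "orange", "yellow", "green", "darkgrey")
)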

# Left-join consecutive years on the customer ID;
# customers absent in the target year get NA in the .target columns
sankeyDT <- rbindlist(list(
  merge.data.table(dt.2018, dt.2019, by = "GPNRPlan", all.x = TRUE, suffixes = c(".source", ".target"))[, Year.target := 2019],
  merge.data.table(dt.2019, dt.2020, by = "GPNRPlan", all.x = TRUE, suffixes = c(".source", ".target"))[, Year.target := 2020]
))

sankeyDT[, node_id.source := paste0(Year.source, "_", group.source)]
sankeyDT[, node_id.target := paste0(Year.target, "_", group.target)]

charCols <- c("group.source", "group.target")
sankeyDT[,(charCols):= lapply(.SD, as.character), .SDcols = charCols]

sankeyDT <- merge.data.table(sankeyDT, lookUpDT, by.x = "group.source", by.y = "group")

sankeyLabelsDT <- data.table(node_id = sort(unique(c(sankeyDT$node_id.source, sankeyDT$node_id.target)), na.last = TRUE))
sankeyLabelsDT[, c("year", "group") := tstrsplit(node_id, "_", fixed=TRUE)]
sankeyLabelsDT[, x_scale := .GRP, by = year][, y_scale := .GRP, by = group]
sankeyLabelsDT[, x_scale := rescale(x_scale, to=c(0, 0.9))][, y_scale := rescale(y_scale, to=c(0.2, 0.75))]
sankeyLabelsDT <- merge.data.table(sankeyLabelsDT, lookUpDT, by = "group")
sankeyLabelsDT[, label := paste(year, "-", group_name)]
setorder(sankeyLabelsDT, year, group, na.last = TRUE)


fig <- plot_ly(
  data = sankeyDT,
  type = "sankey",
  arrangement = "perpendicular", # options: "snap", "perpendicular", "freeform", "fixed"
  orientation = "h",
  
  node = list(
    label = sankeyLabelsDT$label,
    color = sankeyLabelsDT$color,
    x = sankeyLabelsDT$x_scale,
    y = sankeyLabelsDT$y_scale,
    pad = 10 # padding between nodes, in pixels
  ),
  
  link = list(
    # plotly expects 0-based node indices, hence the -1
    source = match(sankeyDT$node_id.source, sankeyLabelsDT$node_id)-1,
    target = match(sankeyDT$node_id.target, sankeyLabelsDT$node_id)-1,
    value =  rep(1, nrow(sankeyDT)),
    label = paste("customer:", sankeyDT$GPNRPlan),
    color = sankeyDT$color # default: grey
  )
)

fig <- fig %>% layout(
  title = "Sankey Diagram",
  font = list(
    size = 10
  )
)

fig
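The `source` and `target` values in the code above are positions in the node table, shifted by one because R's `match()` is 1-based while the sankey trace counts nodes from 0. A small sketch of that mapping, using hypothetical node IDs rather than the full data:

```r
# The order of the node table defines the indices plotly uses
node_ids <- c("2018_1", "2018_2", "2019_1", "2019_NA")

# Link endpoints expressed as node IDs (hypothetical values)
link_source <- c("2018_1", "2018_2", "2018_2")
link_target <- c("2019_1", "2019_1", "2019_NA")

# match() returns 1-based positions; subtract 1 for 0-based indices
source_idx <- match(link_source, node_ids) - 1
target_idx <- match(link_target, node_ids) - 1

# An NA here would mean a link refers to a node missing from the node table
stopifnot(!anyNA(source_idx), !anyNA(target_idx))
```

A check like the final `stopifnot()` can help catch node IDs that were dropped from the label table, for example by an inner join against the lookup table.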

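【Comments】:

Thanks for your answer! This is exactly what I need, but two things are missing: I need to replace the numbers with 1 == worst, 2 == bad, 3 == good, 4 == best and NA == not available. I would also like each identical group (worst, bad, good, best and not available) to have the same color.

How do I change the line width? And are group 4 and NA in 2020 the same??

No. As mentioned in my answer, they overlap. Change the arrangement parameter for different behavior.

Changing arrangement doesn't change anything in the sankey diagram.

Yes, it does. Change it from "perpendicular" to "snap" and see the difference.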
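On the line-width question: in a sankey trace the width of a link follows its `value`. The answer's code draws one link of value 1 per customer, so one possible approach (a sketch only, not part of the original answer) is to aggregate the links per source/target pair and use the count as the value. The node IDs below are hypothetical toy data.

```r
library(data.table)

# Toy links: one row per customer (hypothetical node IDs)
links <- data.table(
  node_id.source = c("2018_1", "2018_1", "2018_2", "2018_2", "2018_2"),
  node_id.target = c("2019_1", "2019_1", "2019_1", "2019_NA", "2019_NA")
)

# One row per (source, target) pair; n counts the customers on that path
linkWidths <- links[, .(n = .N), by = .(node_id.source, node_id.target)]
print(linkWidths)
```

The aggregated table would then supply `source`, `target` and `value = linkWidths$n` for the link list. The trade-off is that the per-customer hover labels are lost, since each link now stands for several customers.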
