Apache Druid GroupBy virtual columns

Posted 2023-02-16

[Title]: Apache Druid GroupBy Virtual columns [Asked]: 2020-11-26 13:28:19 [Question]:

I am trying to group by a virtual column in a Druid native query, as shown below...

{
  "queryType": "groupBy",
  "dataSource": "trace_info",
  "granularity": "none",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')"
    }
  ],
  "dimensions": [
    "tenant"
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "trc",
      "fieldName": "rc"
    }
  ],

...
...
...

  "intervals": [
    "..."
  ]
}

This returns a single row containing the longSum of all row_count values, as if the groupBy column were null.
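The symptom described above can be checked mechanically. The following is a minimal Python sketch, assuming the query response uses the default native groupBy row shape (a list of objects with an "event" map). The helper name `grouping_collapsed` and the sample rows are hypothetical and only illustrate the shape of the problem; the values are made up.

```python
def grouping_collapsed(rows, dimension):
    """Return True when a groupBy result is a single row whose group key is null or empty."""
    if len(rows) != 1:
        return False
    event = rows[0].get("event", {})
    return event.get(dimension) in (None, "")


# Illustrative rows only (made-up values), shaped like native groupBy output.
collapsed = [
    {"version": "v1", "timestamp": "2020-11-01T00:00:00.000Z",
     "event": {"tenant": None, "trc": 1234}},
]
grouped = [
    {"version": "v1", "timestamp": "2020-11-01T00:00:00.000Z",
     "event": {"tenant": "a", "trc": 1000}},
    {"version": "v1", "timestamp": "2020-11-01T00:00:00.000Z",
     "event": {"tenant": "b", "trc": 234}},
]

if __name__ == "__main__":
    print(grouping_collapsed(collapsed, "tenant"))
    print(grouping_collapsed(grouped, "tenant"))
```

A check like this only detects the collapsed result; it does not say why the group key came back null.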

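Is my usage correct, or is this a known issue in Druid? The documentation says virtual columns can be used like normal dimensions, but it is not very clear about how, and it lacks a working example.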

Thanks! Pani

[Comments]:

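Even if I use a default dimension spec to access the virtual column, the result is the same.

I have successfully used groupBy with virtual columns, just as in your example. Are you sure the expression works correctly? It might be worth testing the expression in a simpler query.

Yes, the expression is correct. I tried the same expressions in a scan query, and the virtual columns show up fine in the results. If it helps, I am using Druid 0.18.1. @legoscia In your usage, are you also summing (aggregating) a VC?

Please read the apache tag description before adding it again. If you do, edit the question to describe its relevance, because currently there is none.

[Answer 1]: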

Latest edit...

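Digging further, I found that the problem was the missing "outputType" attribute on the virtual columns. This is odd, because the aggregator was able to detect the type automatically and compute the long sum correctly, even though the grouping result was wrong.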

  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')",
      "outputType": "LONG"
    }
  ],
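For anyone assembling such a query programmatically, here is a minimal Python sketch that builds the groupBy query with "outputType" set on both virtual columns and flags any expression virtual column that omits it. The function names `build_groupby_query` and `missing_output_types` are hypothetical, the interval in the usage lines is only an example value, and the parts of the original query that were elided with "..." are left out.

```python
import json

TENANT_EXPR = "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')"
ROW_COUNT_EXPR = "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')"


def build_groupby_query(intervals):
    """Build the groupBy query from the question, with outputType on each virtual column."""
    return {
        "queryType": "groupBy",
        "dataSource": "trace_info",
        "granularity": "none",
        "virtualColumns": [
            {"type": "expression", "name": "tenant",
             "expression": TENANT_EXPR, "outputType": "STRING"},
            {"type": "expression", "name": "rc",
             "expression": ROW_COUNT_EXPR, "outputType": "LONG"},
        ],
        "dimensions": ["tenant"],
        "aggregations": [{"type": "longSum", "name": "trc", "fieldName": "rc"}],
        "intervals": list(intervals),
    }


def missing_output_types(query):
    """Return the names of expression virtual columns that do not declare outputType."""
    return [
        vc["name"]
        for vc in query.get("virtualColumns", [])
        if vc.get("type") == "expression" and "outputType" not in vc
    ]


if __name__ == "__main__":
    # Example interval only; substitute the real one.
    query = build_groupby_query(["2020-11-01/2020-12-01"])
    print(missing_output_types(query))
    print(json.dumps(query, indent=2))
```

The check is deliberately simple: it only looks for the presence of the attribute, not whether the declared type matches what the expression actually produces.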

See above (what follows is probably a less efficient way to work around the problem).

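After some trial and error, I have a workaround that uses an extraction dimension. I am not sure, but I suspect this is a temporary issue in Druid 0.18.1. Hopefully grouping on a VC will work as advertised in future builds.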

{
  "queryType": "groupBy",
  "dataSource": "trace_info",
  "granularity": "none",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "tenant",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'tenant')), 'tenant:', '')"
    },
    {
      "type": "expression",
      "name": "rc",
      "expression": "replace(array_offset(tags, array_offset_of(tagNames, 'row_count')), 'row_count:', '')"
    }
  ],
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "tenant",
      "outputName": "t",
      "extractionFn": {
        "type" : "substring", "index" : 1
      }
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "trc",
      "fieldName": "rc"
    }
  ],

...
...
...

  "intervals": [
    "..."
  ]
}
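One thing to keep in mind with this workaround is that a substring extraction with "index": 1 starts at the second character, so the grouped values in "t" would differ from the raw "tenant" values. The Python sketch below imitates that behaviour as I understand it from the Druid documentation. The function `substring_extract` is a hypothetical stand-in, not Druid code; returning None when the index reaches the end of the string is an assumption, and Python indexes by code point whereas Druid counts UTF-16 code units.

```python
def substring_extract(value, index, length=None):
    """Imitate a substring extraction: characters from `index`, optionally capped at `length`."""
    if value is None or index >= len(value):
        # Assumption: nothing left to extract is treated as null.
        return None
    if length is None:
        return value[index:]
    return value[index:index + length]


if __name__ == "__main__":
    # "acme" is a made-up tenant value used only for illustration.
    print(substring_extract("acme", 1))
    print(substring_extract("acme", 0))
```

If the goal is to group on the full tenant value, an index of 0 in this imitation leaves the value unchanged, whereas an index of 1 drops its first character.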

[Comments]:

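This is not an answer. Please edit your question with this information instead of posting it as an answer. You might want to take the tour or read How to Ask.

...or is it? Reading this post again, it is not clear to me whether it adds further explanation to your question or actually solves the problem. Which is it? Please use the Post Your Answer button only for actual answers. You should edit your original question to add additional information.

What you have put here appears to be partly additional information for your question and partly a workaround/answer. Please edit your question and this answer so that all question material is in the question and all answer content is in the answer.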
