Filter by MIN(date) over partition | Data Studio

【Title】: Filter by MIN(date) over partition | Data Studio 【Posted】: 2020-06-23 16:31:10 【Question】:

I'm trying to connect date parameters from BigQuery to Data Studio, so I added some date variables to my query. However, I'm having trouble filtering on this date.

Here is my query:

  SELECT first_item,
  COUNT(*) AS first_purchases,
  SUM(purchases_within_90_days) AS purchased_within_90_days,
  SUM(purchases_within_180_days) AS purchased_within_180_days,
  SUM(purchases_within_270_days) AS purchased_within_270_days,
  SUM(revenue90days) as total_revenue_90,
  SUM(revenue180days) as total_revenue_180,
  SUM(revenue270days) as total_revenue_270
  FROM (

  SELECT email, first_item, processed_at, 
    SUM(purch_90_days) OVER(PARTITION BY email) AS purchases_within_90_days, SUM(rev_90) OVER(PARTITION BY email) AS revenue90days,
    SUM(purch_180days) OVER(PARTITION BY email) AS purchases_within_180_days, SUM(rev_180) OVER(PARTITION BY email) AS revenue180days,
    SUM(purch_270days) OVER(PARTITION BY email) AS purchases_within_270_days, SUM(rev_270) OVER(PARTITION BY email) AS revenue270days
  FROM (

SELECT email, first_item, processed_at, SUM(purchases_within_90_days) as purch_90_days, SUM(purchases_within_180_days) as purch_180days, SUM(purchases_within_270_days) as purch_270days, SUM(revenue_within_90_days) as rev_90, SUM(revenue_within_180_days) as rev_180, SUM(revenue_within_270_days) as rev_270
FROM (

SELECT   email, processed_at, first_item, MAX(CASE WHEN hours_since_first_purchase < 90 * 24 AND hours_since_first_purchase > 0 THEN 1 ELSE 0 END) AS purchases_within_90_days,
  MAX(CASE WHEN hours_since_first_purchase < 180 * 24 AND hours_since_first_purchase > 0 THEN 1 ELSE 0 END) AS purchases_within_180_days,
  MAX(CASE WHEN hours_since_first_purchase < 270 * 24 AND hours_since_first_purchase > 0 THEN 1 ELSE 0 END) AS purchases_within_270_days,
  SUM(CASE WHEN hours_since_first_purchase < 90 * 24 AND hours_since_first_purchase > 0 THEN price ELSE 0 END) AS revenue_within_90_days,
  SUM(CASE WHEN hours_since_first_purchase < 180 * 24 AND hours_since_first_purchase > 0 THEN price ELSE 0 END) AS revenue_within_180_days,
  SUM(CASE WHEN hours_since_first_purchase < 270 * 24 AND hours_since_first_purchase > 0 THEN price ELSE 0 END) AS revenue_within_270_days,
FROM (
  
 SELECT order_number, email, processed_at, sku, price, hours_since_first_purchase, first_date,
 CASE
   WHEN hours_since_first_purchase = 0 OR hours_since_first_purchase is null then sku
   else null
   end as first_item,
 FROM (

SELECT order_number, customer.id, email, MIN(processed_at) over(partition by email) as first_date, processed_at, title, price,sku,
    CASE
     WHEN ROW_NUMBER() OVER(PARTITION BY customer.id ORDER BY processed_at) = 1 THEN null
      ELSE TIMESTAMP_DIFF(processed_at, FIRST_VALUE(processed_at) OVER(PARTITION BY customer.id ORDER BY processed_at), HOUR)
      END AS hours_since_first_purchase      
FROM (

SELECT * EXCEPT(instance, line_items) FROM (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
  FROM `table.orders`
), UNNEST(line_items) as item
   -- identify duplicate rows
WHERE instance = 1 
)

order by email desc
  )

where first_date >  PARSE_DATE('%Y%m%d', @DS_START_DATE) and first_date < PARSE_DATE('%Y%m%d', @DS_END_DATE);
--where first_date <= '2019-09-28'--and first_date > '2020-06-07'
)
  
group by first_item, email, processed_at
)
    
where email <> ""
group by email, first_item,processed_at
order by processed_at asc
)
    
order by processed_at asc
  )
  where first_item is not null and first_item <> "" and first_item <> "unknown" and first_item not like '%variant%' and first_item not like '%product%' 
  group by first_item

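When I try to filter on the first_date variable, the query fails in Data Studio. Is there anything I can do to filter on this new variable I added?

The error I get is "The query returned an error".

The line of code that causes the error is: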

where first_date >  PARSE_DATE('%Y%m%d', @DS_START_DATE) and first_date < PARSE_DATE('%Y%m%d', @DS_END_DATE)

When I swap that line for the following, my query runs perfectly:

where first_date <= '2019-09-28'--and first_date > '2020-06-07'

Update:

This is very close to working. It works when I apply one filter, but when I apply a second filter it throws the same error.

It works when I add this line:

where cast(first_date as date) <=  PARSE_DATE('%Y%m%d', @DS_END_DATE)

But it throws the error again when I use this:

where cast(first_date as date) <=  PARSE_DATE('%Y%m%d', @DS_END_DATE) and cast(first_date as date) >=  PARSE_DATE('%Y%m%d', @DS_START_DATE)

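【Comments】:

You cannot reference a column alias in any other clause at the same level as the SELECT that defines it. Use a subquery or a CTE.

Are you running this query in Data Studio?

Yes, sorry. I've updated the text above to reflect that. I'm running it in Data Studio and it gives me an error. I can filter the first_date variable with hard-coded dates, but not with what I currently have.

Can you share the error you are getting? I can't tell whether your problem is related to the parameters or to something else in the query. Also, if possible, share your query as text so that it is easier for others to reproduce your problem.

【Answer 1】: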

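Your first_date field is probably a TIMESTAMP rather than a DATE. To demonstrate the problem, I'll use a public table (bigquery-public-data.covid19_italy.data_by_region).

As the image below shows, this table has a TIMESTAMP field named date. To reproduce your problem, I'll try to access this table through Data Studio.

In Data Studio, if I try your approach, I get an error, as shown below:

1 - Query

2 - Error

However, if I change the query to the one below, it works fine, as you can see in the images.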

SELECT * FROM `bigquery-public-data.covid19_italy.data_by_region` WHERE cast(date as date) < PARSE_DATE('%Y%m%d',@DS_START_DATE)

1 - Updated query

2 - Dashboard working
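
The same idea can be sketched against the shape of the asker's query. This is a minimal sketch, not the asker's actual data: the `orders` rows are hypothetical sample values, and the string literals '20200101' and '20200630' stand in for @DS_START_DATE and @DS_END_DATE so that the query can be run directly in the BigQuery console. It assumes first_date is a TIMESTAMP, as the answer suggests. A TIMESTAMP cannot be compared directly with the DATE that PARSE_DATE returns, whereas a string literal such as '2019-09-28' is coerced to the column's type, which would explain why the hard-coded version runs.

```sql
-- Hypothetical sample rows standing in for the orders table.
WITH orders AS (
  SELECT 'a@example.com' AS email, TIMESTAMP '2020-03-15 10:30:00' AS processed_at
  UNION ALL
  SELECT 'a@example.com', TIMESTAMP '2020-05-02 08:00:00'
  UNION ALL
  SELECT 'b@example.com', TIMESTAMP '2019-11-20 17:45:00'
),
with_first_date AS (
  -- The window alias is defined at this level, so it is filtered one level up.
  SELECT
    email,
    processed_at,
    MIN(processed_at) OVER (PARTITION BY email) AS first_date  -- TIMESTAMP, not DATE
  FROM orders
)
SELECT email, processed_at, first_date
FROM with_first_date
-- Literal strings stand in for @DS_START_DATE / @DS_END_DATE (format YYYYMMDD).
-- DATE(first_date) converts the TIMESTAMP so both sides of each comparison are DATE.
WHERE DATE(first_date) >= PARSE_DATE('%Y%m%d', '20200101')
  AND DATE(first_date) <= PARSE_DATE('%Y%m%d', '20200630');
```

With these sample rows, only the two a@example.com orders pass the filter, because that customer's first purchase falls inside the range.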

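【Comments】:

I added updated text above. This works almost perfectly.

I used BETWEEN instead of AND, and that seems to work. Thanks!

You're welcome :) If this post answered your question, please consider awarding the bounty.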
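
The BETWEEN form mentioned in the last comments might look like the sketch below, written against the public table from the answer. The asker did not post the final query, so this is an assumption about its shape. It is meant to be pasted into a Data Studio custom query with date range parameters enabled, since @DS_START_DATE and @DS_END_DATE are only supplied there.

```sql
SELECT *
FROM `bigquery-public-data.covid19_italy.data_by_region`
-- Convert the TIMESTAMP column to DATE so it matches what PARSE_DATE returns.
WHERE DATE(date) BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
                     AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
```

BETWEEN is inclusive at both ends, so it is logically equivalent to the `>=` / `<=` pair from the update. The thread does not establish why the AND version failed for the asker.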
