Calculating concurrency from a set of ranges

【Title】: Calculating concurrency from a set of ranges 【Posted】: 2016-02-16 19:13:21 【Question】:

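I have a set of rows, each with a start timestamp and a duration. I want to run various aggregations based on how the rows overlap, that is, on their concurrency.

For example: the daily peak concurrency, or the peak concurrency grouped by another column.

Sample data: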

timestamp,duration
2016-01-01 12:00:00,300
2016-01-01 12:01:00,300
2016-01-01 12:06:00,300

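From this I would like to learn that the peak for the period is 2 concurrent entries, between 12:01:00 and 12:05:00.

Any ideas on how to achieve this with BigQuery or, more excitingly, a Map/Reduce job?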

【Answer 1】:

A per-minute solution, for sessions up to 255 minutes long:

SELECT session_minute, COUNT(*) c
FROM (
  SELECT start, DATE_ADD(start, i, 'MINUTE') session_minute FROM (
    SELECT * FROM (
      SELECT TIMESTAMP("2015-04-30 10:14") start, 7 minutes
    ),(
      SELECT TIMESTAMP("2015-04-30 10:15") start, 12 minutes
    ),(
      SELECT TIMESTAMP("2015-04-30 10:15") start, 12 minutes
    ),(
      SELECT TIMESTAMP("2015-04-30 10:18") start, 12 minutes
    ),(
      SELECT TIMESTAMP("2015-04-30 10:23") start, 3 minutes
    ) 
  ) a
  CROSS JOIN [fh-bigquery:public_dump.numbers_255] b
  WHERE a.minutes>b.i
)
GROUP BY 1
ORDER BY 1
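
The same per-minute expansion can be sketched outside BigQuery. This is only an illustration in Python, not part of the original answer: the helper name `sessions_per_minute` is made up here, and it assumes the numbers table starts at 0, so each session contributes offsets 0 through minutes-1.

```python
from collections import Counter
from datetime import datetime, timedelta

# Same five sessions as the query above: (start, length in minutes).
sessions = [
    (datetime(2015, 4, 30, 10, 14), 7),
    (datetime(2015, 4, 30, 10, 15), 12),
    (datetime(2015, 4, 30, 10, 15), 12),
    (datetime(2015, 4, 30, 10, 18), 12),
    (datetime(2015, 4, 30, 10, 23), 3),
]

def sessions_per_minute(sessions):
    """Count active sessions for every minute touched by any session."""
    counts = Counter()
    for start, minutes in sessions:
        # Mirrors CROSS JOIN ... WHERE minutes > i, assuming i starts at 0.
        for i in range(minutes):
            counts[start + timedelta(minutes=i)] += 1
    return dict(sorted(counts.items()))

per_minute = sessions_per_minute(sessions)
for minute, c in per_minute.items():
    print(minute, c)
```

Like the query, this costs one row per session-minute, so it suits short sessions at minute resolution rather than long or second-level ranges.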

【Answer 2】:

Step 1 - first, find every period (start and finish) together with the number of entries concurrent in it:

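-- BigQuery legacy SQL: the comma between the two subqueries acts as UNION ALL.
SELECT ts AS start, LEAD(ts) OVER(ORDER BY ts) AS finish,
       SUM(entry) OVER(ORDER BY ts) AS concurrent_entries  -- running total of the +1/-1 events
FROM (
  SELECT ts, SUM(entry) AS entry
  FROM
    (SELECT ts, 1 AS entry FROM yourTable),  -- +1 when an entry starts
    (SELECT DATE_ADD(ts, duration, 'second') AS ts, -1 AS entry FROM yourTable)  -- -1 when it ends
  GROUP BY ts
  HAVING entry != 0  -- skip instants where starts and ends cancel out
)
ORDER BY ts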

Assuming the following input:

(SELECT TIMESTAMP('2016-01-01 12:00:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:01:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:06:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:07:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:10:00') AS ts, 300 AS duration),
(SELECT TIMESTAMP('2016-01-01 12:11:00') AS ts, 300 AS duration)

the output of the query above looks like this:

start                       finish                      concurrent_entries   
2016-01-01 12:00:00 UTC     2016-01-01 12:01:00 UTC     1    
2016-01-01 12:01:00 UTC     2016-01-01 12:05:00 UTC     2    
2016-01-01 12:05:00 UTC     2016-01-01 12:07:00 UTC     1    
2016-01-01 12:07:00 UTC     2016-01-01 12:10:00 UTC     2    
2016-01-01 12:10:00 UTC     2016-01-01 12:12:00 UTC     3    
2016-01-01 12:12:00 UTC     2016-01-01 12:15:00 UTC     2    
2016-01-01 12:15:00 UTC     2016-01-01 12:16:00 UTC     1    
2016-01-01 12:16:00 UTC     null                        0   

You may still want to refine the query above a little, but for the most part it does what you need.
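
For readers who want to check the logic outside BigQuery, here is a minimal Python sketch of the same +1/-1 event idea. It is an illustration rather than the answer's own code: `concurrency_periods` is a hypothetical name, and the input rows are the six sample rows listed above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def concurrency_periods(rows):
    """Turn (ts, duration_seconds) rows into (start, finish, concurrent_entries)."""
    deltas = defaultdict(int)
    for ts, duration in rows:
        deltas[ts] += 1                                # an entry starts
        deltas[ts + timedelta(seconds=duration)] -= 1  # an entry ends
    # Drop instants where starts and ends cancel out (the HAVING entry != 0 step).
    points = sorted((t, d) for t, d in deltas.items() if d != 0)
    periods, running = [], 0
    for i, (t, d) in enumerate(points):
        running += d  # running total, like SUM(entry) OVER(ORDER BY ts)
        finish = points[i + 1][0] if i + 1 < len(points) else None  # like LEAD(ts)
        periods.append((t, finish, running))
    return periods

# The six sample rows: start minutes 12:00, 12:01, 12:06, 12:07, 12:10, 12:11, 300 s each.
rows = [(datetime(2016, 1, 1, 12, m), 300) for m in (0, 1, 6, 7, 10, 11)]
periods = concurrency_periods(rows)
for start, finish, n in periods:
    print(start, finish, n)
```

The work grows with the number of rows rather than with their durations, which is the main difference from a per-minute expansion.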

Step 2 - now you can compute any statistics from the result above

For example, the peak over the whole period:

SELECT 
  start, finish, concurrent_entries, RANK() OVER(ORDER BY concurrent_entries DESC) AS peak
FROM (
  SELECT ts AS start, LEAD(ts) OVER(ORDER BY ts) AS finish, 
         SUM(entry) OVER(ORDER BY ts) AS concurrent_entries
  FROM (
    SELECT ts, SUM(entry) AS entry FROM
      (SELECT ts, 1 AS entry FROM yourTable),
      (SELECT DATE_ADD(ts, duration, 'second') AS ts, -1 AS entry FROM yourTable)
    GROUP BY ts
    HAVING entry != 0
  )
)
ORDER BY peak
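
The question also asks for a daily peak. A possible way to get both the overall and the per-day peak from the Step 1 periods is sketched below in Python. The helper names (`at`, `peak_periods`, `daily_peaks`) are hypothetical, the periods are copied from the Step 1 output table, and as a simplifying assumption each period is attributed to the date on which it starts, so a period that spans midnight counts only toward its first day.

```python
from datetime import datetime

def at(hour, minute):
    """Shorthand for a timestamp on 2016-01-01, the sample day."""
    return datetime(2016, 1, 1, hour, minute)

# (start, finish, concurrent_entries), copied from the Step 1 output table.
periods = [
    (at(12, 0), at(12, 1), 1),
    (at(12, 1), at(12, 5), 2),
    (at(12, 5), at(12, 7), 1),
    (at(12, 7), at(12, 10), 2),
    (at(12, 10), at(12, 12), 3),
    (at(12, 12), at(12, 15), 2),
    (at(12, 15), at(12, 16), 1),
    (at(12, 16), None, 0),
]

def peak_periods(periods):
    """Return every period whose concurrency equals the overall maximum."""
    top = max(n for _, _, n in periods)
    return [p for p in periods if p[2] == top]

def daily_peaks(periods):
    """Map each start date to the highest concurrency seen on that date."""
    peaks = {}
    for start, _finish, n in periods:
        day = start.date()  # assumption: a period belongs to its start date
        peaks[day] = max(peaks.get(day, 0), n)
    return peaks

print(peak_periods(periods))
print(daily_peaks(periods))
```

Grouping by another column would follow the same pattern, with that column's value used as the dictionary key in place of the date.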
