Finding the time spent by id in each location
Posted: 2017-03-16 03:35:29
Question: I am trying to find out how long each id spent in its starting location.
For example, in the dataset below, the starting Geohash of id 286 is "abcdef". Geohash "abcdef" appears in 3 rows for id 286. So the total time spent by id 286 is (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) plus (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).
   Id   DateTime                     Latitude   Longitude   Geohash
0  286  2017-02-13 12:24:36 UTC      40.769230  -73.01205   abcdef
1  286  2017-02-13 12:33:02.063 UTC  40.769230  -73.01202   abcdef
2  286  2017-02-13 12:33:05.063 UTC  40.769230  -73.01202   cvzvvv
3  286  2017-02-13 12:33:08 UTC      40.769280  -73.01212   abcdef
4  286  2017-02-13 12:34:29 UTC      40.769306  -73.01207   hsffds
5  368  2017-02-13 00:23:07.063 UTC  33.392820  -111.8262   weruio
6  141  2017-02-13 00:00:41 UTC      33.287117  -111.84150  oqruqq
I don't know whether the pandas DataFrame has a function for this kind of operation.
Any help would be greatly appreciated!
Answer 1: The following is for BigQuery Standard SQL
#standardSQL
SELECT
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT
    Id, Geohash, DateTime,
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash
You can test it with the dummy data from your example:
#standardSQL
WITH yourTable AS (
  SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL
  SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL
  SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL
  SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq'
)
SELECT
  Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
  SELECT
    Id, Geohash, DateTime,
    TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
    FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
  FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash
The result is as follows:
Id   Geohash  StartDateTime            TimeSpent
286  abcdef   2017-02-13 12:24:36 UTC  590
368  weruio   2017-02-13 00:23:07 UTC  null
141  oqruqq   2017-02-13 00:00:41 UTC  null
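Note: 590 above is the sum of the time spent (in seconds) across three rows, not just the two stated in your question. I think that is just a typo on your side.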
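To see where 590 comes from, the arithmetic for id 286 can be redone by hand. The following is only a sketch in plain Python, not part of the original answer: it assumes each difference is cut down to whole seconds before summing (which is how the 590 in the result table comes out), and the names `rows`, `parsed` and `intervals` are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (DateTime, Geohash) rows for Id 286, copied from the question (all UTC).
rows = [
    ("2017-02-13 12:24:36.000", "abcdef"),
    ("2017-02-13 12:33:02.063", "abcdef"),
    ("2017-02-13 12:33:05.063", "cvzvvv"),
    ("2017-02-13 12:33:08.000", "abcdef"),
    ("2017-02-13 12:34:29.000", "hsffds"),
]
parsed = [(datetime.strptime(ts, FMT), gh) for ts, gh in rows]
first_geohash = parsed[0][1]  # rows are already in time order

intervals = []
# Pair every row with the row that follows it, like LEAD(DateTime) does.
for (start, geohash), (end, _) in zip(parsed, parsed[1:]):
    if geohash == first_geohash:
        # Whole seconds only (assumed to mirror TIMESTAMP_DIFF(..., SECOND)).
        intervals.append((end - start) // timedelta(seconds=1))

print(intervals, sum(intervals))  # [506, 3, 81] 590
```

The middle value, 3 seconds, is the interval from the second "abcdef" row to the "cvzvvv" row, which is the one the question left out.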
Answer 2: If I understand correctly, you want something like this:
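def timedelta(df):
    # Sort so the first and last rows are the earliest and latest records.
    df = df.sort_values(by='DateTime')
    # Latest minus earliest, so the duration is not negative.
    return df.iloc[-1]['DateTime'] - df.iloc[0]['DateTime']

df.groupby(['Id', 'Geohash']).apply(timedelta)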
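Note that grouping by Id and Geohash and comparing the first and last timestamps measures the span between the first and last record of each pair, which also covers any time spent elsewhere in between. If the goal is the per-visit logic of the BigQuery answer, a pandas version might look like the sketch below. It is an illustration only: the column names `TimeSpent` and `FirstGeohash` are borrowed from the SQL, and the timestamps are written with explicit milliseconds so they all parse with one format.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000", "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063", "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000", "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef",
                "hsffds", "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])

# Seconds until the next record of the same Id (NaN for the last record),
# the pandas counterpart of LEAD(DateTime) - DateTime.
next_time = df.groupby("Id")["DateTime"].shift(-1)
df["TimeSpent"] = (next_time - df["DateTime"]).dt.total_seconds()

# Geohash of the earliest record of each Id, like FIRST_VALUE(Geohash).
df["FirstGeohash"] = df.groupby("Id")["Geohash"].transform("first")

# Keep only rows at the starting location and total them per Id.
# min_count=1 leaves NaN for ids that have a single record.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = at_start.groupby("Id")["TimeSpent"].sum(min_count=1)
print(result)
```

Unlike the SQL, this keeps fractional seconds, so id 286 comes out at about 590.063 rather than 590. Ids 368 and 141 have only one record each and therefore get NaN.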