How to efficiently compute a rolling unique count in a pandas time series?
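【Posted】: 2018-03-10 06:59:57
【Question】: I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people who visited the building in the last 365 days (i.e. a rolling unique count with a 365-day window).
`pandas` does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are many unique visitors and/or a large window. (The actual data is larger than this example.)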
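Is there a better way to do this calculation than what I have done below? I don't know why the fast method I created, `windowed_nunique` (under "Speed Test 3"), is off by 1.
Thanks for your help!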
Related links:
Source Jupyter Notebook: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc
Related `pandas` issue: https://github.com/pandas-dev/pandas/issues/14336
Initialization
In [1]:
# Import libraries.
import pandas as pd
import numba
import numpy as np
In [2]:
# Create data of people visiting a building.
np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)
df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)
print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns
Out[2]:
Created data of people visiting a building:
| | Date | PersonId |
|---|------------|----------|
| 0 | 2010-01-01 | 76 |
| 1 | 2010-01-01 | 63 |
| 2 | 2010-01-01 | 89 |
| 3 | 2010-01-01 | 81 |
| 4 | 2010-01-01 | 7 |
Speed reference
In [3]:
%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()
3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Speed Test 1
In [4]:
%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())
2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]:
# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values
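As an independent cross-check on the reference values, the same rolling unique count can be computed by brute force without `rolling().apply()`. This is a minimal sketch, assuming rows are sorted by date and that the window follows the pandas default for offset windows (left-open, right-closed, looking only at rows up to and including the current one). `brute_force_nunique` is a hypothetical helper name, and the approach is quadratic, so it only suits small inputs.

```python
import numpy as np
import pandas as pd


def brute_force_nunique(dates, pids, window_days):
    """For each row i, count distinct pids among rows 0..i whose date
    lies in (dates[i] - window_days, dates[i]]."""
    dates = np.asarray(dates).astype('datetime64[D]')
    pids = np.asarray(pids)
    width = np.timedelta64(window_days, 'D')
    out = np.empty(len(dates), dtype=np.int64)
    for i in range(len(dates)):
        # Boolean mask over rows 0..i that are still inside the window.
        in_window = dates[:i + 1] > dates[i] - width
        out[i] = len(set(pids[:i + 1][in_window].tolist()))
    return out


# Made-up sample data for illustration only.
small = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02',
                            '2010-01-04', '2010-01-04']),
    'PersonId': [7, 7, 3, 7, 5]})
counts = brute_force_nunique(small['Date'].values, small['PersonId'].values, window_days=3)
print(counts.tolist())  # [1, 1, 2, 2, 3]
```

On 2010-01-04 the 3-day window no longer contains 2010-01-01, so the first count on that day comes only from persons 3 and 7.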
Speed Test 2
In [6]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))
In [7]:
%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)
430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]:
# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)
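One caveat when reproducing this cell: on recent pandas versions, `Rolling.apply` passes each window to the function as a `Series` unless `raw=True` is given, and a numba nopython function generally expects a NumPy array. The sketch below uses a plain lambda instead of the compiled `nunique` so it runs without numba; the sample data is made up for illustration.

```python
import pandas as pd

# Three people (7, 3, 5) over four days; duplicate timestamps are allowed
# as long as the index is sorted.
visits = pd.Series(
    [7, 7, 3, 7, 5],
    index=pd.to_datetime(['2010-01-01', '2010-01-01', '2010-01-02',
                          '2010-01-04', '2010-01-04']))

# raw=True hands each window to the function as a NumPy array, not a Series.
rolling_unique = visits.rolling('3D').apply(lambda arr: len(set(arr)), raw=True)
print(rolling_unique.tolist())  # [1.0, 1.0, 2.0, 2.0, 3.0]
```

The result is a float series, which is why the reference values in this notebook print as `56.0` rather than `56`.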
Speed Test 3
In [9]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """
    # Check arguments.
    assert dates.shape == pids.shape
    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1
    # For each (date, person)...
    while idx < idx_max:
        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # For past dates outside of window...
        while (date - date_min) > window:
            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]
        # Record unique person count.
        ucts[idx] = uct
        idx += 1
    return ucts
In [10]:
# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
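The same day numbers can also be obtained directly from NumPy's `datetime64` casting. This is a small sketch, assuming the dates are timezone-naive; truncating to day resolution and then casting to `int64` yields whole days since 1970-01-01. The three sample dates are chosen for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(['1970-01-02', '2010-01-01', '2010-01-19']))
# Truncate to day resolution, then cast to integer days since the epoch.
date_epoch = dates.values.astype('datetime64[D]').astype(np.int64)
print(date_epoch.tolist())  # [1, 14610, 14628]
```

The value 14628 for 2010-01-19 agrees with the `DateEpoch` column shown in `Out[13]`.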
In [11]:
%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))
In [13]:
# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns
Out[13]:
Where reference ('ref') calculation of number of unique people doesn't match 'test':
| | Date | PersonId | DateEpoch | ref | test |
|----|------------|----------|-----------|------|------|
| 78 | 2010-01-19 | 99 | 14628 | 56.0 | 55 |
| 79 | 2010-01-19 | 96 | 14628 | 56.0 | 55 |
| 80 | 2010-01-19 | 88 | 14628 | 56.0 | 55 |
| 81 | 2010-01-20 | 94 | 14629 | 56.0 | 55 |
| 82 | 2010-01-20 | 48 | 14629 | 57.0 | 56 |
【Comments on the question】:
Sorry if this is a silly comment, but wouldn't a 365 rolling count of unique IDs be as simple as `df.rolling(365)['PersonId'].apply(lambda x: len(set(x)))`?
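@WoodyPride Thanks, that is what I did under "Speed Test 2", but with a just-in-time compiler (see the function `nunique`). The calculation is correct but very inefficient, because `set` operates on every element in the window each time a window calculation is performed. Keeping a running tally of each element is more efficient, as in "Speed Test 3" (comparing "Speed Test 2" with "Speed Test 3", roughly 4000x more efficient on the example data). However, my implementation `windowed_nunique` is off by 1, and I am wondering whether anyone can help find the problem.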
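The running-tally idea described in this comment can be sketched in pure Python with a dictionary of per-ID counts and a trailing pointer. This is an illustrative sketch rather than the asker's numba code: `sliding_nunique` is a hypothetical name, days are assumed to be sorted integers, and the window is assumed to cover the current day plus the `window - 1` days before it.

```python
from collections import defaultdict


def sliding_nunique(days, ids, window):
    """One-pass rolling unique count over rows sorted by day.

    Row i counts distinct ids among rows j <= i with days[i] - days[j] < window.
    Requires window >= 1, so the current row is never evicted.
    """
    counts = defaultdict(int)  # id -> occurrences currently inside the window
    result = []
    left = 0  # index of the oldest row still inside the window
    for day, pid in zip(days, ids):
        counts[pid] += 1
        # Evict rows that have fallen out of the window.
        while day - days[left] >= window:
            old = ids[left]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        # Distinct ids in the window = number of keys with a non-zero count.
        result.append(len(counts))
    return result


print(sliding_nunique([0, 0, 1, 3, 3], ['a', 'a', 'b', 'a', 'c'], window=3))
# [1, 1, 2, 2, 3]
```

Each row is added once and evicted at most once, so the work is linear in the number of rows regardless of the window size, whereas recomputing a `set` per window scales with rows times window length. A dictionary also accepts non-integer IDs, at the cost of being much slower than a compiled array-based loop.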
Got it! I don't think I looked at the problem deeply enough.
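Great work! I tried applying your Speed Test 3 but keep getting the error below; any idea what gives? TypingError: Failed in nopython mode pipeline (step: nopython frontend) non-precise type array(pyobject, 1d, C) During: ...
【Answer 1】:
I had two errors in the fast method `windowed_nunique`, now corrected in `windowed_nunique_corrected` below: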
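- The array used to memoize the count for each person ID within the window, `pid_cts`, was too small.
- Because the leading and trailing edges of the window include integer days, `date_min` should be updated when `(date - date_min + 1) > window`.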
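Both fixes can be checked on toy data. The sketch below is illustrative only (the five-day series and the three IDs are made up): it shows that a pandas `'3D'` offset window holds at most three consecutive daily rows, so a row must be evicted once `date - date_min + 1 > 3`, and that an array indexed by person ID needs `max(pids) + 1` slots because IDs start at 0.

```python
import numpy as np
import pandas as pd

# Five consecutive days, one row per day.
s = pd.Series(1.0, index=pd.date_range('2010-01-01', periods=5, freq='D'))
rows_in_window = s.rolling('3D').count()
print(rows_in_window.tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0]

# A count array indexed by person ID needs max(pids) + 1 slots.
pids = np.array([0, 2, 5])
pid_cts = np.zeros(np.max(pids) + 1, dtype=np.int64)
pid_cts[pids] += 1
print(pid_cts.tolist())  # [1, 0, 1, 0, 0, 1]
```

With only `np.max(pids)` slots, the write for the largest ID falls outside the array. NumPy raises `IndexError` in that case, while numba's nopython mode does not check bounds by default, so the problem can go unnoticed.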
Related links:
Source Jupyter Notebook updated with the solution: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc
In [14]:
# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """
    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1
    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1
    # For each (date, person)...
    while idx < idx_max:
        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]
        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:
            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]
        # Record unique person count.
        ucts[idx] = uct
        idx += 1
    return ucts
In [15]:
# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
In [16]:
%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [17]:
# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
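A comment on the question reports a numba `TypingError` mentioning `array(pyobject, 1d, C)`, which typically indicates that an object-dtype array (for example, string IDs) was passed to a nopython function. One possible workaround, sketched here under that assumption, is to map the IDs to consecutive non-negative integer codes with `pd.factorize` before calling the compiled function. The sample IDs are hypothetical.

```python
import numpy as np
import pandas as pd

raw_ids = pd.Series(['ab12', 'zz99', 'ab12', 'q7', 'zz99'])  # hypothetical string IDs
# factorize assigns codes 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(raw_ids)
codes = codes.astype(np.int64)
print(codes.tolist())  # [0, 1, 0, 2, 1]
print(list(uniques))   # ['ab12', 'zz99', 'q7']
```

The resulting `codes` array satisfies the `min(pids) >= 0` requirement in the docstring above and keeps `pid_cts` as small as possible, since the codes are dense.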
【Comments】:
【Answer 2】: Very close to your timing in Speed Test 2, but as a one-liner, resampled by year.
df.resample('AS',on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])
Timing result
1 loop, best of 3: 483 ms per loop
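【Comments】:
This is close to the speed of "Speed Test 2", but `np.unique` operates on every element in the window. Keeping a running tally of each element, as in "Speed Test 3", is more efficient. (See my comment to Woody Pride.) However, my running-count implementation `windowed_nunique` is off by 1. Any other ideas? Thanks.
【Answer 3】: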
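If you only want the number of unique people who entered the building in the last 365 days, you can first restrict the dataset to the last 365 days with `.loc`:
df = df.loc[df['Date'] > '2016-09-28', :]
With a groupby, you get as many rows as there are unique people who came in, and if you aggregate with count, you also get the number of times each of them came in:
df = df.groupby('PersonId').count()
This seems to answer your question, but maybe I got it wrong. Have a nice day.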
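【Comments】:
Thanks, but I am looking for an efficient rolling unique count. The output must have the same `len` as the input (from the example, `len(df) == len(ref) == 9181`) and be faster than "Speed Test 2".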
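@SamuelHarrold, what do you mean by a rolling unique count? Over how many periods are you rolling within a year?
@djk47463 An example of a rolling unique count (similar to the function `nunique` defined under "Speed Test 2" above): `df.rolling(window='365D', on='Date').apply(lambda arr: len(set(arr)))`. The challenge is how to make it more efficient (compare "Speed Test 2" with "Speed Test 3"). I have almost succeeded, but my solution `windowed_nunique` is off by 1, and I am wondering whether anyone can find my mistake.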
@djk47463 `windowed_nunique` is off by 1 at record 78 (see `Out[13]` in the example), whose date is "2010-01-19", before any extra leap days.