How to aggregate a function on a specific column for each day

Posted 2023-03-12


【Title】How to aggregate a function on a specific column for each day 【Asked】2021-04-09 02:54:40 【Question】:

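I have a CSV file containing minute-by-minute data.

The end goal is to compute, for each day, the standard deviation of all the lows (the 'Low' column), using all of that day's data.

The problem is that the CSV file has gaps, so a day does not always contain exactly 390 rows (the number of minutes in a trading day). The code is as follows: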

import pandas as pd
import datetime as dt


df = pd.read_csv('/Volumes/Seagate Portable/S&P 500 List/AAPL.txt')
df.columns = ['Extra', 'Dates', 'Open', 'High', 'Low', 'Close', 'Volume']
df.drop(['Extra', 'Open', 'High', 'Volume'], axis=1, inplace=True)
df.Dates = pd.to_datetime(df.Dates)
df.set_index(df.Dates, inplace=True)
df = df.between_time('9:30', '16:00')
print(df.Low[::390])

The output is:

Dates
2020-01-02 09:30:00     73.8475
2020-01-02 16:00:00     75.0875
2020-01-03 15:59:00     74.3375
2020-01-06 15:58:00     74.9125
2020-01-07 15:57:00     74.5028
                         ...   
2020-12-14 09:41:00    122.8800
2020-12-15 09:40:00    125.9900
2020-12-16 09:39:00    126.5600
2020-12-17 09:38:00    129.1500
2020-12-18 09:37:00    127.9900
Name: Low, Length: 245, dtype: float64

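As the output shows, as soon as a single 9:30 row is missing, stepping through the data 390 rows at a time no longer lands on the start of each day. My idea is to use as much data as is available for each day, even when minutes are missing, by detecting where one day ends and the next begins: that is, where the timestamp jumps from 15:59 or 16:00 to 9:31 or 9:32, in other words where it wraps from around 16:00 back to around 9:30. Is there a better approach? If this is the right one, what is the best way to code it?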


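【Answer 1】: Use .groupby() with pandas.Grouper() on 'date', with freq='D' for daily groups, then aggregate 'low' with .std(). Because the grouping is by calendar day, it does not matter how many rows each day contains. The 'date' column must have a datetime dtype; if it does not, convert the 'Dates' column with pd.to_datetime(). To keep only times within a given range, use df = df.set_index('date').between_time('9:30', '16:00').reset_index(). Do this before the .groupby(). The 'date' column must be the index in order to use .between_time().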
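The steps just listed can be sketched on data shaped like the question's. This is a minimal illustration, not the asker's actual file: the minute bars are synthetic, the column names 'date' and 'low' follow this answer's convention, and the dropped rows are a made-up way of simulating gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical minute bars spanning two days, including out-of-hours minutes
idx = pd.date_range('2020-01-02 09:00', '2020-01-03 17:00', freq='min')
rng = np.random.default_rng(0)
df = pd.DataFrame({'date': idx, 'low': 100 + rng.normal(size=len(idx)).cumsum()})

# Simulate gaps: remove a few arbitrary minutes from the first day
df = df.drop(df.index[[31, 200, 201]]).reset_index(drop=True)

# 'date' must be the index for between_time; restore it as a column afterwards
df = df.set_index('date').between_time('9:30', '16:00').reset_index()

# One group per calendar day, however many rows that day has
daily_std = (
    df.groupby(pd.Grouper(key='date', freq='D'))['low']
      .std()
      .reset_index(name='low std')
)
print(daily_std)
```

The first day ends up with fewer rows than the second, yet both produce a standard deviation, which is why no fixed 390-row stride is needed.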

import requests
import pandas as pd

# sample stock data
periods = '3600'
resp = requests.get('https://api.cryptowat.ch/markets/poloniex/ethusdt/ohlc', params={'periods': periods})
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=['date', 'open', 'high', 'low', 'close', 'volume', 'amount'])

# convert to a datetime format
df['date'] = pd.to_datetime(df['date'], unit='s')

# display(df.head())
                 date        open        high         low       close       volume        amount
0 2020-11-22 02:00:00  550.544464  554.812114  536.523241  542.000000  2865.381737  1.567462e+06
1 2020-11-22 03:00:00  541.485933  551.621355  540.992000  548.500000  1061.275481  5.796859e+05
2 2020-11-22 04:00:00  548.722267  549.751680  545.153196  549.441709   310.874748  1.703272e+05
3 2020-11-22 05:00:00  549.157866  549.499632  544.135302  546.913493   259.077448  1.416777e+05
4 2020-11-22 06:00:00  547.600000  548.000000  541.668524  544.241871   363.433373  1.979504e+05

# groupby day, using pd.Grouper and then get std of low
std = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std().reset_index(name='low std')

# display(std)
         date    low std
0  2020-11-22  14.751495
1  2020-11-23  14.964803
2  2020-11-24   6.542568
3  2020-11-25   9.523858
4  2020-11-26  24.041421
5  2020-11-27   8.272477
6  2020-11-28  12.340238
7  2020-11-29   8.444779
8  2020-11-30  10.290333
9  2020-12-01  13.605846
10 2020-12-02   6.201248
11 2020-12-03   9.403853
12 2020-12-04  12.667251
13 2020-12-05  10.180626
14 2020-12-06   4.481538
15 2020-12-07   3.881311
16 2020-12-08  10.518746
17 2020-12-09  12.077622
18 2020-12-10   6.161330
19 2020-12-11   5.035066
20 2020-12-12   6.297173
21 2020-12-13   9.739574
22 2020-12-14   3.505540
23 2020-12-15   3.304968
24 2020-12-16  16.753780
25 2020-12-17  10.963064
26 2020-12-18   5.574997
27 2020-12-19   4.976494
28 2020-12-20   7.243917
29 2020-12-21  16.844777
30 2020-12-22  10.348576
31 2020-12-23  15.769288
32 2020-12-24  10.329158
33 2020-12-25   5.980148
34 2020-12-26   8.530006
35 2020-12-27  21.136509
36 2020-12-28  16.115898
37 2020-12-29  10.587339
38 2020-12-30   7.634897
39 2020-12-31   7.278866
40 2021-01-01   6.617027
41 2021-01-02  19.708119
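One difference worth noting for stock data: the sample above trades every calendar day, so every daily bin has rows. With freq='D', pd.Grouper emits a bin for every calendar day between the first and last timestamp, so weekends and holidays would show up as NaN. A small sketch of two ways to handle that, using made-up values for a Friday and the following Monday:

```python
import pandas as pd

# Hypothetical bars: Friday 2020-01-03 and Monday 2020-01-06, nothing in between
idx = pd.date_range('2020-01-03 09:30', periods=5, freq='min').append(
    pd.date_range('2020-01-06 09:30', periods=5, freq='min'))
df = pd.DataFrame({'date': idx,
                   'low': [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0]})

# Grouper creates a bin for every calendar day in the range, empty ones included
by_day = df.groupby(pd.Grouper(key='date', freq='D'))['low'].std()

# Option 1: drop the empty (NaN) days after aggregating
trading_days = by_day.dropna()

# Option 2: group on the timestamp truncated to midnight,
# which only produces days that actually occur in the data
by_present_day = df.groupby(df['date'].dt.normalize())['low'].std()

print(by_day)
print(trading_days)
```

Note that dropna() also removes any day with a single row, since the sample standard deviation of one value is NaN.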
