Can one use comparisons to merge two pandas data-frames?
Posted: 2014-10-30 13:31:20
Question:
With the following command:
pandas.merge(df_1, df_2, left_on=['date'], right_on=['from_date'])
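I merge a row from the first table with a row from the second table if the value in the date column of the first table equals the value in the from_date column of the second table.
Now I want to make it slightly more complex. I need to merge a row from the first table with a row from the second table if the value in the date column of the first table is greater than or equal to the value in the from_date column of the second table and smaller than the value in the upto_date column of the second table.
In SQL, one would use something like: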
select
*
from
table_1
join
table_2
on
table_1.date >= table_2.from_date
and
table_1.date < table_2.upto_date
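Is it possible to do this in pandas?
Comments:
Could you provide a short example of df1 and df2?
Since the values you are joining on are no longer unique, you may not be able to merge as expected. If you simply want to add the two tables together, look at .join or .concat.
Possible duplicate of ***.com/questions/23508351/…
There is a proposed issue about conditional joins for pandas DataFrames (github.com/pydata/pandas/issues/7480).
I wonder whether a non-SQL solution would be easier (i.e. parse + merge in Python).
Answer 1:
pandasql is a really useful tool for querying pandas DataFrames using SQLite query syntax.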
Resources:
pandasql - PyPI Documentation
yhat/pandasql - Source on GitHub
Blog post with more examples
pip install -U pandasql
Here is an example similar to what you described.
Imports
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# standard-library StringIO; importing it from pandas.io.parsers fails on recent pandas versions
from io import StringIO
import pandas as pd
from pandasql import sqldf
# helper func useful for saving keystrokes
# when running multiple queries
def dbGetQuery(q):
    return sqldf(q, globals())
Fake some data
sample_a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,75
2014-01-21 00:00:00,49
2014-01-23 00:00:00,28
2014-01-24 00:00:00,91
2014-01-25 00:00:00,88
2014-01-27 00:00:00,98
2014-01-28 00:00:00,39
2014-01-29 00:00:00,90
2014-01-30 00:00:00,63
2014-01-31 00:00:00,77
"""
sample_b = """from_date,to_date,measure
2014-01-02 00:00:00,2014-01-06 00:00:00,89
2014-01-03 00:00:00,2014-01-07 00:00:00,80
2014-01-04 00:00:00,2014-01-05 00:00:00,44
2014-01-05 00:00:00,2014-01-12 00:00:00,68
2014-01-06 00:00:00,2014-01-11 00:00:00,62
2014-01-07 00:00:00,2014-01-14 00:00:00,5
2014-01-08 00:00:00,2014-01-09 00:00:00,23
"""
Read the datasets to create 2 DataFrames
df1 = pd.read_csv(StringIO(sample_a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(sample_b), parse_dates=['from_date', 'to_date'])
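Write the SQL query
Note that this one uses the SQLite BETWEEN operator, which is inclusive on both ends. If you prefer, you can swap it out and use something like ON timepoint >= from_date AND timepoint < to_date.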
query = """
SELECT
DATE(df1.timepoint) AS timepoint
, DATE(df2.from_date) AS start
, DATE(df2.to_date) AS end
, df1.measure AS measure_a
, df2.measure AS measure_b
FROM
df1
INNER JOIN df2
ON df1.timepoint BETWEEN
df2.from_date AND df2.to_date
ORDER BY
df1.timepoint;
"""
Run the query using the helper function
df3 = dbGetQuery(query)
df3
timepoint start end measure_a measure_b
0 2014-01-03 2014-01-02 2014-01-06 5 89
1 2014-01-03 2014-01-03 2014-01-07 5 80
2 2014-01-04 2014-01-02 2014-01-06 73 89
3 2014-01-04 2014-01-03 2014-01-07 73 80
4 2014-01-04 2014-01-04 2014-01-05 73 44
5 2014-01-05 2014-01-02 2014-01-06 40 89
6 2014-01-05 2014-01-03 2014-01-07 40 80
7 2014-01-05 2014-01-04 2014-01-05 40 44
8 2014-01-05 2014-01-05 2014-01-12 40 68
9 2014-01-06 2014-01-02 2014-01-06 45 89
10 2014-01-06 2014-01-03 2014-01-07 45 80
11 2014-01-06 2014-01-05 2014-01-12 45 68
12 2014-01-06 2014-01-06 2014-01-11 45 62
13 2014-01-08 2014-01-05 2014-01-12 2 68
14 2014-01-08 2014-01-06 2014-01-11 2 62
15 2014-01-08 2014-01-07 2014-01-14 2 5
16 2014-01-08 2014-01-08 2014-01-09 2 23
17 2014-01-09 2014-01-05 2014-01-12 96 68
18 2014-01-09 2014-01-06 2014-01-11 96 62
19 2014-01-09 2014-01-07 2014-01-14 96 5
20 2014-01-09 2014-01-08 2014-01-09 96 23
21 2014-01-10 2014-01-05 2014-01-12 82 68
22 2014-01-10 2014-01-06 2014-01-11 82 62
23 2014-01-10 2014-01-07 2014-01-14 82 5
24 2014-01-11 2014-01-05 2014-01-12 61 68
25 2014-01-11 2014-01-06 2014-01-11 61 62
26 2014-01-11 2014-01-07 2014-01-14 61 5
27 2014-01-12 2014-01-05 2014-01-12 68 68
28 2014-01-12 2014-01-07 2014-01-14 68 5
29 2014-01-13 2014-01-07 2014-01-14 8 5
30 2014-01-14 2014-01-07 2014-01-14 94 5
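The question asks for a half-open range (date >= from_date and date < upto_date), whereas BETWEEN also matches the upper bound. The following is a minimal sketch of the half-open variant. It makes a few assumptions that are not part of the answer: it uses the standard-library sqlite3 module together with pandas' to_sql and read_sql_query instead of pandasql, the two small frames are made-up miniature data, and the dates are kept as ISO strings so that plain text comparison orders them correctly.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature tables; dates stay as ISO strings (an assumption of this sketch)
df1 = pd.DataFrame({
    "timepoint": ["2014-01-03", "2014-01-05", "2014-01-06"],
    "measure": [5, 40, 45],
})
df2 = pd.DataFrame({
    "from_date": ["2014-01-02", "2014-01-04"],
    "to_date": ["2014-01-06", "2014-01-05"],
    "measure": [89, 44],
})

# Load both frames into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)

# Half-open interval: lower bound inclusive, upper bound exclusive
query = """
SELECT df1.timepoint, df2.from_date, df2.to_date,
       df1.measure AS measure_a, df2.measure AS measure_b
FROM df1
INNER JOIN df2
    ON df1.timepoint >= df2.from_date
   AND df1.timepoint <  df2.to_date
ORDER BY df1.timepoint, df2.from_date;
"""
half_open = pd.read_sql_query(query, conn)
conn.close()
print(half_open)
```

On this toy data the half-open condition keeps two pairs, while BETWEEN would keep four, because the timepoints 2014-01-05 and 2014-01-06 each coincide with a to_date.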
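Comments:
Python tells me pandasql has no attribute 'dbGetQuery'. I also cannot find any information about this module online. Does this code really work?
I defined dbGetQuery at the top of the answer. It is just a helper function that I write often.
Answer 2:
I think I found a solution. However, I am not sure whether it is elegant and optimal: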
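# add a constant key so that every row of df_1 pairs with every row of df_2
df_1['A'] = 'A'
df_2['A'] = 'A'
df = pandas.merge(df_1, df_2, on=['A'])
# keep only the pairs that satisfy the range condition (column names as in the question)
df = df[(df['date'] >= df['from_date']) & (df['date'] < df['upto_date'])]
del df['A']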
Posted on behalf of the question asker.
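The constant-key trick in this answer builds a Cartesian product and then filters it. As a hedged sketch, assuming pandas 1.2 or later, the same idea can be written with merge(how="cross"), which avoids the temporary column. The frames and the value and label columns below are made up for illustration; only date, from_date and upto_date come from the question.

```python
import pandas as pd

# Hypothetical miniature tables using the column names from the question
df_1 = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-03", "2014-01-05", "2014-01-06"]),
    "value": [5, 40, 45],
})
df_2 = pd.DataFrame({
    "from_date": pd.to_datetime(["2014-01-02", "2014-01-04"]),
    "upto_date": pd.to_datetime(["2014-01-06", "2014-01-05"]),
    "label": ["a", "b"],
})

# Cartesian product: every row of df_1 paired with every row of df_2
pairs = df_1.merge(df_2, how="cross")

# Keep the pairs where from_date <= date < upto_date
mask = (pairs["date"] >= pairs["from_date"]) & (pairs["date"] < pairs["upto_date"])
result = pairs.loc[mask].reset_index(drop=True)
print(result)
```

Like the original, this materializes len(df_1) * len(df_2) rows before filtering, so it is only practical when that product fits in memory.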
Answer 3:
conditional_join from pyjanitor may help with non-equi joins.
Using @hernamesbarbara's fake data:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(df1.conditional_join(
df2,
('timepoint', 'from_date', '>='),
('timepoint', 'to_date', '<='))
)
left right
timepoint measure from_date to_date measure
0 2014-01-03 5 2014-01-02 2014-01-06 89
1 2014-01-03 5 2014-01-03 2014-01-07 80
2 2014-01-04 73 2014-01-02 2014-01-06 89
3 2014-01-04 73 2014-01-03 2014-01-07 80
4 2014-01-04 73 2014-01-04 2014-01-05 44
5 2014-01-05 40 2014-01-02 2014-01-06 89
6 2014-01-05 40 2014-01-03 2014-01-07 80
7 2014-01-05 40 2014-01-04 2014-01-05 44
8 2014-01-05 40 2014-01-05 2014-01-12 68
9 2014-01-06 45 2014-01-02 2014-01-06 89
10 2014-01-06 45 2014-01-03 2014-01-07 80
11 2014-01-06 45 2014-01-05 2014-01-12 68
12 2014-01-06 45 2014-01-06 2014-01-11 62
13 2014-01-08 2 2014-01-05 2014-01-12 68
14 2014-01-08 2 2014-01-06 2014-01-11 62
15 2014-01-08 2 2014-01-07 2014-01-14 5
16 2014-01-08 2 2014-01-08 2014-01-09 23
17 2014-01-09 96 2014-01-05 2014-01-12 68
18 2014-01-09 96 2014-01-06 2014-01-11 62
19 2014-01-09 96 2014-01-07 2014-01-14 5
20 2014-01-09 96 2014-01-08 2014-01-09 23
21 2014-01-10 82 2014-01-05 2014-01-12 68
22 2014-01-10 82 2014-01-06 2014-01-11 62
23 2014-01-10 82 2014-01-07 2014-01-14 5
24 2014-01-11 61 2014-01-05 2014-01-12 68
25 2014-01-11 61 2014-01-06 2014-01-11 62
26 2014-01-11 61 2014-01-07 2014-01-14 5
27 2014-01-12 68 2014-01-05 2014-01-12 68
28 2014-01-12 68 2014-01-07 2014-01-14 5
29 2014-01-13 8 2014-01-07 2014-01-14 5
30 2014-01-14 94 2014-01-07 2014-01-14 5
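For readers who do not have pyjanitor installed, the same inclusive condition can be sketched with NumPy broadcasting. This is an illustrative alternative, not a description of how conditional_join is implemented. The miniature frames and the _left/_right suffixes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of df1 and df2
df1 = pd.DataFrame({
    "timepoint": pd.to_datetime(["2014-01-03", "2014-01-05", "2014-01-06"]),
    "measure": [5, 40, 45],
})
df2 = pd.DataFrame({
    "from_date": pd.to_datetime(["2014-01-02", "2014-01-04"]),
    "to_date": pd.to_datetime(["2014-01-06", "2014-01-05"]),
    "measure": [89, 44],
})

# Compare every timepoint (rows) against every interval (columns)
t = df1["timepoint"].to_numpy()[:, None]   # shape (len(df1), 1)
lo = df2["from_date"].to_numpy()[None, :]  # shape (1, len(df2))
hi = df2["to_date"].to_numpy()[None, :]
left_idx, right_idx = np.nonzero((t >= lo) & (t <= hi))

# Assemble the matching row pairs side by side
joined = pd.concat(
    [
        df1.iloc[left_idx].reset_index(drop=True).add_suffix("_left"),
        df2.iloc[right_idx].reset_index(drop=True).add_suffix("_right"),
    ],
    axis=1,
)
print(joined)
```

The boolean matrix has len(df1) * len(df2) entries, so this approach trades memory for simplicity. Changing <= to < in the second comparison gives the half-open range from the question.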