SQL query in pandas: Concat multiple rows in column based on a combination of other columns
【Posted】: 2022-01-09 14:51:26
【Question】:
This is the response of a SQL query in pandas. I want to concatenate the `Label` column based on `Issue` and `Client`. I tried `GROUP BY`, but it only works for integer values. Any ideas on how I can do this? A pandas-based solution would also be fine.
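I also tried `.groupby` in pandas (the command and its output are below), but it only gives me a subset of the desired dataframe.
Is it possible to update the `Label` column for each `Issue` in the first dataframe, drop the duplicates, and get the expected output shown below?
SQL version: Microsoft SQL Server 2014
Output: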
| | Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | Label | BuiltOn | CreatedBy | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | Enhancement | None | John | InProgress |
1 | 2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | Feature | None | Andre | New |
2 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | Feature | None | Keith | New |
3 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | Internal | None | Keith | New |
4 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
5 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
6 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
7 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
8 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
9 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
10 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
11 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
12 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Enhancement | None | Maya | Analysis |
13 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Internal | None | Maya | Analysis |
Expected (note the `Label` column):
| | Issue | Subject | Issue_type | Team | Sub Team | Client | Priority | CreatedOn | Label | BuiltOn | CreatedBy | Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | Enhancement | None | John | InProgress |
1 | 2 | DEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | Feature | None | Andre | New |
2 | 3 | HIG | Bug | Develop | Testing | Cathy | 2 | 2021-02-11 00:00:00 | Feature, Internal | None | Keith | New |
3 | 4 | XYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
4 | 4 | XYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
5 | 4 | XYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
6 | 4 | XYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
7 | 4 | XYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | Enhancement, Internal | None | Maya | Analysis |
Update: here is the query:
SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
Update 2: the `df.groupby` call I tried:
df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index()
Output:
| | Issue | Client | Label |
---|---|---|---|
0 | 1 | Andy | Enhancement |
1 | 2 | Judy | Feature |
2 | 3 | Cathy | Feature,Internal |
3 | 4 | Colin | Enhancement,Internal |
4 | 4 | Jack | Enhancement,Internal |
5 | 4 | Larry | Enhancement,Internal |
6 | 4 | Lisa | Enhancement,Internal |
7 | 4 | Nitin | Enhancement,Internal |
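Clarification: grouping by all columns other than `Label` will not work, because in some cases some of the other data may be null or may differ between rows, which could lose data entirely. If the data in the other columns differs, I am fine with keeping the first instance of that data.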
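【Comments】:
Please also show us the query.
In pandas, groupby can be used on non-numeric columns.
@Squirrel, I added the query.
@EmiOB, I also added the pandas groupby result.
@akshat What do you want to happen to the other columns? Should Priority be averaged? Which CreatedOn date do you want to keep? The same goes for CreatedBy, Status, and so on.
【Answer 1】:
Update: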
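After the OP's clarification, it seems the problem is actually a bit different: the contents of the columns other than the grouping columns `Issue` and `Client` may differ between the grouped rows, and the final result should contain the column values from the first of those rows.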
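One way to do this is to perform the grouping in Python as before and then join the result (using `merge()`) with a version of the original dataframe from which you drop the `Label` column and all duplicates based on `Issue` and `Client`. If the data differs, this gives you the first instance of each grouped row.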
Without further arguments, `merge()` automatically performs an inner join on all columns that are present in both dataframes, in this case `Issue` and `Client`:
df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index().merge(df.drop('Label', axis=1).drop_duplicates(['Issue', 'Client']))
Output:
| | Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | BuiltOn | CreatedBy | Status | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | None | John | InProgress | Enhancement |
1 | 2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | None | Andre | New | Feature |
2 | 3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | None | Keith | New | Feature,Internal |
3 | 4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | 4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
5 | 4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
6 | 4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
7 | 4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
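As an alternative to the merge above (not the method used in this answer), the same shape of result can be sketched with a single `groupby` that applies a different aggregation per column. The small frame, the `keys` list and the `agg_spec` dictionary are illustrative assumptions. One difference to keep in mind: the `"first"` aggregation returns the first non-null value in each group, whereas `drop_duplicates` keeps the first row as it is, nulls included.

```python
import pandas as pd

# Hypothetical sample: the two Cathy rows differ in Priority and Status.
df = pd.DataFrame(
    {
        "Issue": [3, 3, 4, 4],
        "Client": ["Cathy", "Cathy", "Jack", "Jack"],
        "Priority": [2, 5, 1, 1],
        "Status": ["New", None, "Analysis", "Analysis"],
        "Label": ["Feature", "Internal", "Enhancement", "Internal"],
    }
)

keys = ["Issue", "Client"]

# Every non-key column keeps its first non-null value; Label is joined instead.
agg_spec = {col: "first" for col in df.columns if col not in keys + ["Label"]}
agg_spec["Label"] = ",".join

result = df.groupby(keys, as_index=False).agg(agg_spec)
print(result)
```

Because the aggregation dictionary is built from `df.columns`, this also avoids typing out every column name when the frame has many columns.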
Previous answer:
Simply group by all the other columns:
import pandas as pd
data = [[1, 'ABCABC', 'Bug', 'Develop', 'Automation', 'Andy', 0, '2021-01-11 00:00:00', 'Enhancement', 'None', 'John', 'InProgress'],
[2, 'DEFDEF', 'Bug', 'Develop', 'Automation', 'Judy', 0, '2021-01-10 00:00:00', 'Feature', 'None', 'Andre', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Feature', 'None', 'Keith', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Internal', 'None', 'Keith', 'New'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis']]
df = pd.DataFrame(data, columns = ['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'Label', 'BuiltOn', 'CreatedBy', 'Status'])
df.groupby(['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'BuiltOn', 'CreatedBy', 'Status'])['Label'].apply(','.join).reset_index()
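If you don't want to write out all the column names, you can also build the list automatically with a list comprehension that excludes the `Label` column, similar to this SO answer: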
df.groupby([col for col in list(df) if col not in ['Label']])['Label'].apply(','.join).reset_index()
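One caveat when grouping by every other column: by default, `groupby` drops rows whose key columns contain nulls, so whole groups can disappear. The sample data above stores `BuiltOn` as the string `'None'`, which is not affected. The sketch below assumes the nulls arrive as real `None`/`NaN` values and uses the `dropna=False` argument (available since pandas 1.1) to keep those groups. The tiny frame and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample where BuiltOn is a real null for Issue 1.
df = pd.DataFrame(
    {
        "Issue": [1, 1, 2, 2],
        "Client": ["Andy", "Andy", "Judy", "Judy"],
        "BuiltOn": [None, None, "2021-01-10", "2021-01-10"],
        "Label": ["Enhancement", "Internal", "Feature", "Internal"],
    }
)

group_cols = [col for col in df.columns if col != "Label"]

# Default behaviour: groups with a null key are silently dropped.
default = df.groupby(group_cols)["Label"].agg(",".join).reset_index()

# dropna=False keeps groups whose keys contain nulls.
keep_nulls = df.groupby(group_cols, dropna=False)["Label"].agg(",".join).reset_index()

print(len(default), len(keep_nulls))
```

This only addresses null keys. If the other columns hold different values within one `Issue`/`Client` pair, grouping by all of them still splits the pair into several rows.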
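【Comments】:
I tried this as well, but it has two problems. My actual dataframe has more than 50 columns and I have several queries like this, so writing out all the columns is error-prone. Also, in some cases some of the other data may be null or different, which could lose data altogether.
OK, then please edit your question so that your sample data and your question reflect this, and tell us what you expect the result to look like when the data differs.
I modified my answer to fit your revised problem description.
【Answer 2】:
I will offer a second approach that uses SQL.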
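Concatenating strings within a `GROUP BY` is a bit tricky in MS SQL Server 2014, because it lacks the direct aggregate function that other RDBMSs offer (`STRING_AGG` only arrived in SQL Server 2017). However, there is a workaround using `FOR XML` and `PATH` that can be adapted to your problem.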
The following statement gives you the concatenated labels, based on your original query:
WITH tmp AS (SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT A.Issue,
A.Client,
STUFF((
SELECT ', ' + B.Label
FROM tmp B
WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
ORDER BY B.Issue
FOR XML PATH('')), 1, 2, ''
) AS Label
FROM
tmp A
GROUP BY
A.Issue, A.Client
This gives you:
Issue | Client | Label |
---|---|---|
1 | Andy | Enhancement |
2 | Judy | Feature |
3 | Cathy | Feature,Internal |
4 | Colin | Enhancement,Internal |
4 | Jack | Enhancement,Internal |
4 | Larry | Enhancement,Internal |
4 | Lisa | Enhancement,Internal |
4 | Nitin | Enhancement,Internal |
You can then use `ROW_NUMBER()` to `JOIN` this with the first row of each `Issue`-`Client` combination:
WITH tmp AS (SELECT I.Issue,
I.Subject,
I.type,
P.Team,
P.Subteam,
CR.Client,
I.Priority,
I.CreatedOn,
L.Label,
I.BuiltOn,
I.CreatedBy,
I.Status
FROM master.IssueRequests AS I
JOIN master.Participants AS P
ON P.Issue = I.Issue
JOIN master.ClientRecords AS CR
ON CR.Issue = I.Issue
JOIN master.IssueLabels AS L
ON L.Issue = I.Issue
WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT C.Issue,
C.Subject,
       C.type,
C.Team,
C.Subteam,
C.Client,
C.Priority,
C.CreatedOn,
D.Label,
C.BuiltOn,
C.CreatedBy,
C.Status
FROM (SELECT tmp.*,
row_number() OVER (PARTITION BY Issue, Client ORDER BY Issue) as rn
FROM tmp) C
JOIN (SELECT A.Issue,
A.Client,
STUFF((
SELECT ', ' + B.Label
FROM tmp B
WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
ORDER BY B.Issue
FOR XML PATH('')), 1, 2, ''
) AS Label
FROM tmp A
GROUP BY A.Issue, A.Client) D
ON C.Issue = D.Issue AND C.Client = D.Client AND C.rn = 1
This gives you the desired result:
Issue | Subject | type | Team | Sub Team | Client | Priority | CreatedOn | BuiltOn | CreatedBy | Status | Label |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ABCABC | Bug | Develop | Automation | Andy | 0 | 2021-01-11 00:00:00 | None | John | InProgress | Enhancement |
2 | DEFDEF | Bug | Develop | Automation | Judy | 0 | 2021-01-10 00:00:00 | None | Andre | New | Feature |
3 | HIGHIG | Bug | Develop | Testing123 | Cathy | 2 | 2021-02-11 00:00:00 | None | Keith | New | Feature,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Colin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Jack | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Larry | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Lisa | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
4 | XYZXYZ | Bug | Develop | Automation | Nitin | 1 | 2021-05-11 00:00:00 | None | Maya | Analysis | Enhancement,Internal |
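For readers following along in pandas, the `ROW_NUMBER() ... rn = 1` filter corresponds roughly to numbering the rows inside each group with `cumcount()`. The sketch below is an analogue under assumed sample data, not a translation of the query: the frame and the names `rn`, `first_rows` and `labels` are illustrative. Unlike the SQL version, where `ORDER BY Issue` inside the partition does not pin down which row comes first, pandas numbers rows in their existing order.

```python
import pandas as pd

# Hypothetical sample: the two Jack rows differ in Status.
df = pd.DataFrame(
    {
        "Issue": [4, 4, 4, 4],
        "Client": ["Jack", "Jack", "Larry", "Larry"],
        "Status": ["Analysis", "Closed", "Analysis", "Analysis"],
        "Label": ["Enhancement", "Internal", "Enhancement", "Internal"],
    }
)

keys = ["Issue", "Client"]

# rn mirrors ROW_NUMBER() OVER (PARTITION BY Issue, Client):
# it is 1 for the first row of each group, 2 for the second, and so on.
rn = df.groupby(keys).cumcount() + 1
first_rows = df[rn == 1].drop(columns="Label")

# Concatenated labels per group, as in the STUFF / FOR XML PATH subquery.
labels = df.groupby(keys)["Label"].agg(", ".join).reset_index()

# Join the first row of each group with its concatenated labels.
result = first_rows.merge(labels, on=keys)
print(result)
```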
You can test it in this db<>fiddle.
【Comments】: