SQL query in pandas: Concat multiple rows in column based on a combination of other columns


【Posted】: 2022-01-09 14:51:26 【Question】:

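This is the response of a SQL query loaded into pandas. I want to concatenate the Label column based on Issue and Client. I tried GROUP BY, but it only works for integer values. Any ideas how I can do this? A pandas-based solution would be fine too.

I also tried .groupby in pandas; the command and its output are below, but it only gives me a subset of the desired dataframe.

Is it possible to update the Label column for each Issue in the first dataframe, drop the duplicates, and get the expected output shown below?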

The SQL version is:

Microsoft SQL Server 2014

Output:

Issue Subject type Team Sub Team Client Priority CreatedOn Label BuiltOn CreatedBy Status
0 1 ABCABC Bug Develop Automation Andy 0 2021-01-11 00:00:00 Enhancement None John InProgress
1 2 DEFDEF Bug Develop Automation Judy 0 2021-01-10 00:00:00 Feature None Andre New
2 3 HIGHIG Bug Develop Testing123 Cathy 2 2021-02-11 00:00:00 Feature None Keith New
3 3 HIGHIG Bug Develop Testing123 Cathy 2 2021-02-11 00:00:00 Internal None Keith New
4 4 XYZXYZ Bug Develop Automation Jack 1 2021-05-11 00:00:00 Enhancement None Maya Analysis
5 4 XYZXYZ Bug Develop Automation Jack 1 2021-05-11 00:00:00 Internal None Maya Analysis
6 4 XYZXYZ Bug Develop Automation Larry 1 2021-05-11 00:00:00 Enhancement None Maya Analysis
7 4 XYZXYZ Bug Develop Automation Larry 1 2021-05-11 00:00:00 Internal None Maya Analysis
8 4 XYZXYZ Bug Develop Automation Colin 1 2021-05-11 00:00:00 Enhancement None Maya Analysis
9 4 XYZXYZ Bug Develop Automation Colin 1 2021-05-11 00:00:00 Internal None Maya Analysis
10 4 XYZXYZ Bug Develop Automation Nitin 1 2021-05-11 00:00:00 Enhancement None Maya Analysis
11 4 XYZXYZ Bug Develop Automation Nitin 1 2021-05-11 00:00:00 Internal None Maya Analysis
12 4 XYZXYZ Bug Develop Automation Lisa 1 2021-05-11 00:00:00 Enhancement None Maya Analysis
13 4 XYZXYZ Bug Develop Automation Lisa 1 2021-05-11 00:00:00 Internal None Maya Analysis

Expected (note the Label column):

Issue Subject Issue_type Team Sub Team Client Priority CreatedOn Label BuiltOn CreatedBy Status
0 1 ABC Bug Develop Automation Andy 0 2021-01-11 00:00:00 Enhancement None John InProgress
1 2 DEF Bug Develop Automation Judy 0 2021-01-10 00:00:00 Feature None Andre New
2 3 HIG Bug Develop Testing Cathy 2 2021-02-11 00:00:00 Feature, Internal None Keith New
3 4 XYZ Bug Develop Automation Jack 1 2021-05-11 00:00:00 Enhancement, Internal None Maya Analysis
4 4 XYZ Bug Develop Automation Larry 1 2021-05-11 00:00:00 Enhancement, Internal None Maya Analysis
5 4 XYZ Bug Develop Automation Colin 1 2021-05-11 00:00:00 Enhancement, Internal None Maya Analysis
6 4 XYZ Bug Develop Automation Nitin 1 2021-05-11 00:00:00 Enhancement, Internal None Maya Analysis
7 4 XYZ Bug Develop Automation Lisa 1 2021-05-11 00:00:00 Enhancement, Internal None Maya Analysis

Update: here is the query:

SELECT I.Issue,
       I.Subject,
       I.type, 
       P.Team, 
       P.Subteam,
       CR.Client,
       I.Priority,
       I.CreatedOn,
       L.Label,
       I.BuiltOn,
       I.CreatedBy,
       I.Status
 FROM master.IssueRequests AS I 
 JOIN master.Participants AS P 
   ON P.Issue = I.Issue 
 JOIN master.ClientRecords AS CR 
   ON CR.Issue = I.Issue 
 JOIN master.IssueLabels AS L
   ON L.Issue = I.Issue
 WHERE I.Issue IN ('2652523', '2703670', '2984120')

Update 2: output of df.groupby:

df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index()

Output:

Issue Client Label
0 1 Andy Enhancement
1 2 Judy Feature
2 3 Cathy Feature,Internal
3 4 Colin Enhancement,Internal
4 4 Jack Enhancement,Internal
5 4 Larry Enhancement,Internal
6 4 Lisa Enhancement,Internal
7 4 Nitin Enhancement,Internal

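Clarification: grouping by all columns except Label will not work, because in some cases some of the other data may be null or different, which could cause data to be lost altogether. If the data in the other columns differs, I am fine with keeping the first instance of that data.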
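The data loss described in the clarification can be reproduced on a small frame. The sketch below is only an illustration: the four-column frame is a hypothetical miniature of the question's data, with one Priority value deliberately set to null. It relies on the fact that pandas' groupby drops rows whose grouping key contains a null unless dropna=False is passed (available from pandas 1.1).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: the second row of Issue 3
# has a missing Priority.
df = pd.DataFrame({
    "Issue": [3, 3, 4],
    "Client": ["Cathy", "Cathy", "Jack"],
    "Priority": [2.0, np.nan, 1.0],
    "Label": ["Feature", "Internal", "Enhancement"],
})

other_cols = [col for col in df.columns if col != "Label"]

# Default (dropna=True): a row with a null in any grouping key is dropped,
# so the "Internal" label disappears from the result.
strict = df.groupby(other_cols)["Label"].agg(",".join).reset_index()

# dropna=False keeps that row, but it forms a group of its own.
lenient = df.groupby(other_cols, dropna=False)["Label"].agg(",".join).reset_index()

# Grouping only by the real keys keeps all labels of an Issue/Client together.
by_keys = df.groupby(["Issue", "Client"])["Label"].agg(",".join).reset_index()

print(strict)
print(lenient)
print(by_keys)
```

Grouping by Issue and Client alone avoids both problems, which is why the remaining columns have to be brought back in a separate step.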

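【Comments】:

Please also show us the query.
In pandas, groupby can be used on non-numeric columns.
@Squirrel, added the query.
@EmiOB, also added the pandas groupby result.
@akshat What do you want to happen with the other columns? Should Priority be an average? Which CreatedOn date do you want to keep? The same goes for CreatedBy, Status, etc.

【Answer 1】: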

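Update: After the OP's clarification, it seems the problem is actually a bit different: the contents of the columns other than the grouping columns Issue and Client may differ between the grouped rows, and where they differ the final result should contain the column values of the first of those rows.

A way to do this might be to perform the grouping in Python as before and then join (using merge()) with a version of the original dataframe from which you drop the Label column and all duplicates based on Issue and Client. In case the data differs, this gives you the first instance of each grouped row.

Without further arguments, merge() automatically performs an inner join on all columns that are present in both dataframes, in this case Issue and Client.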

df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index().merge(df.drop('Label', axis=1).drop_duplicates(['Issue', 'Client']))

Output:

Issue Subject type Team Sub Team Client Priority CreatedOn BuiltOn CreatedBy Status Label
0 1 ABCABC Bug Develop Automation Andy 0 2021-01-11 00:00:00 None John InProgress Enhancement
1 2 DEFDEF Bug Develop Automation Judy 0 2021-01-10 00:00:00 None Andre New Feature
2 3 HIGHIG Bug Develop Testing123 Cathy 2 2021-02-11 00:00:00 None Keith New Feature,Internal
3 4 XYZXYZ Bug Develop Automation Colin 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
4 4 XYZXYZ Bug Develop Automation Jack 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
5 4 XYZXYZ Bug Develop Automation Larry 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
6 4 XYZXYZ Bug Develop Automation Lisa 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
7 4 XYZXYZ Bug Develop Automation Nitin 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
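As a possible alternative to the groupby-plus-merge chain, the same shape of result can be sketched in a single aggregation: "first" for every non-key column and a comma join for Label. This is not the answer's method, and the small frame is hypothetical. One difference is worth noting: the "first" aggregation returns the first non-null value per group, whereas drop_duplicates keeps the first row as it is, nulls included.

```python
import pandas as pd

# Hypothetical miniature of the question's data; Status differs within
# Issue 3 and is missing in the first row of Issue 4.
df = pd.DataFrame({
    "Issue": [3, 3, 4, 4],
    "Client": ["Cathy", "Cathy", "Jack", "Jack"],
    "Status": ["New", "Analysis", None, "Analysis"],
    "Label": ["Feature", "Internal", "Enhancement", "Internal"],
})

keys = ["Issue", "Client"]

# "first" for every non-key column, a comma join for Label.
agg_spec = {col: "first" for col in df.columns if col not in keys + ["Label"]}
agg_spec["Label"] = ",".join

result = df.groupby(keys, as_index=False, sort=False).agg(agg_spec)
print(result)
```

Because the aggregation dictionary is built from df.columns, it does not need to be rewritten for frames with many columns.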

Previous answer

Simply group by all the other columns:

import pandas as pd

data = [[1, 'ABCABC', 'Bug', 'Develop', 'Automation', 'Andy', 0, '2021-01-11 00:00:00', 'Enhancement', 'None', 'John', 'InProgress'],
[2, 'DEFDEF', 'Bug', 'Develop', 'Automation', 'Judy', 0, '2021-01-10 00:00:00', 'Feature', 'None', 'Andre', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Feature', 'None', 'Keith', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Internal', 'None', 'Keith', 'New'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis']]

df = pd.DataFrame(data, columns = ['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'Label', 'BuiltOn', 'CreatedBy', 'Status']) 

df.groupby(['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'BuiltOn', 'CreatedBy', 'Status'])['Label'].apply(','.join).reset_index()

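If you do not want to write out all the column names, you can also build the list automatically with a list comprehension that excludes the Label column, similar to this SO answer: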

df.groupby([col for col in list(df) if col not in ['Label']])['Label'].apply(','.join).reset_index()

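【Comments】:

I tried this too, but there are two problems. My actual dataframe has more than 50 columns and I have several such queries, so writing out all the columns is error-prone. Also, in some cases some of the other data may be null or different, which could cause data to be lost altogether.
Well, then please edit your question so that your sample data and your question reflect that, and tell us what result you expect when the data differs.
I have revised my answer to fit your revised problem description.

【Answer 2】: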

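I will offer a second approach, this time using SQL.

Concatenating strings within a GROUP BY is a bit tricky in MS SQL Server 2014, because it has no direct aggregate function for this as other RDBMSs do (STRING_AGG only arrived in SQL Server 2017). However, there is a workaround using FOR XML and PATH that can be adapted to your problem.

The following statement gives you the concatenated labels based on your original query: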

WITH tmp AS (SELECT I.Issue,
                    I.Subject,
                    I.type, 
                    P.Team, 
                    P.Subteam,
                    CR.Client,
                    I.Priority,
                    I.CreatedOn,
                    L.Label,
                    I.BuiltOn,
                    I.CreatedBy,
                    I.Status
              FROM master.IssueRequests AS I 
              JOIN master.Participants AS P 
                ON P.Issue = I.Issue 
              JOIN master.ClientRecords AS CR 
                ON CR.Issue = I.Issue 
              JOIN master.IssueLabels AS L
                ON L.Issue = I.Issue
              WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT A.Issue,
       A.Client,
       STUFF((
           SELECT ', ' + B.Label 
             FROM tmp B 
            WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
              AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
         ORDER BY B.Issue 
          FOR XML PATH('')), 1, 2, ''
    ) AS Label
FROM
    tmp A
GROUP BY 
    A.Issue, A.Client

This gives you:

Issue Client Label
1 Andy Enhancement
2 Judy Feature
3 Cathy Feature,Internal
4 Colin Enhancement,Internal
4 Jack Enhancement,Internal
4 Larry Enhancement,Internal
4 Lisa Enhancement,Internal
4 Nitin Enhancement,Internal
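For comparison, here is roughly what the "direct function" looks like in an engine that has one. The sketch uses SQLite's group_concat through Python's built-in sqlite3 module, purely because it runs without a server; it is not SQL Server syntax. The table tmp and its four rows are hypothetical stand-ins for the CTE above. SQLite does not guarantee the order of the concatenated elements, so no expected output is shown.

```python
import sqlite3

# In-memory database with a hypothetical stand-in for the tmp CTE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (Issue INTEGER, Client TEXT, Label TEXT)")
conn.executemany(
    "INSERT INTO tmp VALUES (?, ?, ?)",
    [
        (3, "Cathy", "Feature"),
        (3, "Cathy", "Internal"),
        (4, "Jack", "Enhancement"),
        (4, "Jack", "Internal"),
    ],
)

# One aggregate call replaces the STUFF(... FOR XML PATH('')) subquery.
rows = conn.execute(
    """
    SELECT Issue, Client, group_concat(Label, ', ') AS Label
      FROM tmp
     GROUP BY Issue, Client
     ORDER BY Issue, Client
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```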

You can then use ROW_NUMBER() to JOIN the concatenated labels with the first row of each Issue-Client combination:

WITH tmp AS (SELECT I.Issue,
                    I.Subject,
                    I.type, 
                    P.Team, 
                    P.Subteam,
                    CR.Client,
                    I.Priority,
                    I.CreatedOn,
                    L.Label,
                    I.BuiltOn,
                    I.CreatedBy,
                    I.Status
              FROM master.IssueRequests AS I 
              JOIN master.Participants AS P 
                ON P.Issue = I.Issue 
              JOIN master.ClientRecords AS CR 
                ON CR.Issue = I.Issue 
              JOIN master.IssueLabels AS L
                ON L.Issue = I.Issue
              WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT C.Issue,
       C.Subject,
       C.type, 
       C.Team, 
       C.Subteam,
       C.Client,
       C.Priority,
       C.CreatedOn,
       D.Label,
       C.BuiltOn,
       C.CreatedBy,
       C.Status  
  FROM (SELECT tmp.*, 
               row_number() OVER (PARTITION BY Issue, Client ORDER BY Issue) as rn
        FROM tmp) C
  JOIN (SELECT A.Issue,
               A.Client,
               STUFF((
                   SELECT ', ' + B.Label 
                     FROM tmp B 
                    WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
                      AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
                 ORDER BY B.Issue 
                  FOR XML PATH('')), 1, 2, ''
            ) AS Label
        FROM tmp A
        GROUP BY A.Issue, A.Client) D
   ON C.Issue = D.Issue AND C.Client = D.Client AND C.rn = 1

This gives you the desired result:

Issue Subject type Team Sub Team Client Priority CreatedOn BuiltOn CreatedBy Status Label
1 ABCABC Bug Develop Automation Andy 0 2021-01-11 00:00:00 None John InProgress Enhancement
2 DEFDEF Bug Develop Automation Judy 0 2021-01-10 00:00:00 None Andre New Feature
3 HIGHIG Bug Develop Testing123 Cathy 2 2021-02-11 00:00:00 None Keith New Feature,Internal
4 XYZXYZ Bug Develop Automation Colin 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
4 XYZXYZ Bug Develop Automation Jack 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
4 XYZXYZ Bug Develop Automation Larry 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
4 XYZXYZ Bug Develop Automation Lisa 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal
4 XYZXYZ Bug Develop Automation Nitin 1 2021-05-11 00:00:00 None Maya Analysis Enhancement,Internal

You can test it in this db<>fiddle.
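If the rows are already in a dataframe, the ROW_NUMBER() step has a rough pandas counterpart in groupby(...).cumcount(). The sketch below mirrors the structure of the SQL: one frame with the first row per Issue/Client, one with the joined labels, and a join between the two. The small frame is hypothetical, and "first" here means first in the dataframe's current row order.

```python
import pandas as pd

# Hypothetical miniature of the query result.
df = pd.DataFrame({
    "Issue": [3, 3, 4, 4],
    "Client": ["Cathy", "Cathy", "Jack", "Jack"],
    "Status": ["New", "Analysis", "Analysis", "Analysis"],
    "Label": ["Feature", "Internal", "Enhancement", "Internal"],
})
keys = ["Issue", "Client"]

# rn mirrors ROW_NUMBER() OVER (PARTITION BY Issue, Client), starting at 1.
rn = df.groupby(keys).cumcount() + 1
first_rows = df[rn == 1].drop(columns="Label")

# Counterpart of the STUFF(... FOR XML PATH('')) subquery.
labels = df.groupby(keys)["Label"].agg(", ".join).reset_index()

# Counterpart of the final JOIN on Issue and Client.
result = first_rows.merge(labels, on=keys)
print(result)
```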
