Identical Pandas and SQLite queries not giving same results

Posted 2023-03-11

【Title】Identical Pandas and SQLite queries not giving same results 【Posted】2021-09-04 14:12:12 【Question】:

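I have written a query on the file "Sample - Superstore.csv" from https://github.com/mikemooreviz/superstore that should give me the count of each case matching a certain string criterion, plus the count of the cases that match none of the previous criteria.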

The string column to analyze is "CustomerName".

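Basically: count the number of clients whose full name starts with a capital "A", the number whose full name contains a non-capitalized "t", the number whose full name ends with a non-capitalized "n", and then the number whose full name meets none of the above criteria.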

Here is the query in pandas:

import pandas as pd
import numpy as np
import re

df = pd.read_csv("path_of_csv_file", sep=";")

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df['strings_conditions'] = np.where(
    df['CustomerName'].str.startswith("A"),
    'Starts with a capital A',
    np.where(
        df['CustomerName'].str.contains("t"),
        'Has a non-capitalized t',
        np.where(
            df['CustomerName'].str.endswith("n"),
            'Finishes with a non-capitalized n',
            'Something else'
        )
    )
)

df_new = df.loc[:,['strings_conditions','CustomerName']].drop_duplicates().dropna()

df_new.groupby(['strings_conditions'])['strings_conditions'].count()

which gives the following results:

strings_conditions	count
Finishes with a non-capitalized n	100
Has a non-capitalized t	288
Something else	341
Starts with a capital A	64

But the same query in SQLite:

SELECT 'Finishes with a non-capitalized n' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "*n"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Has a non-capitalized t' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "*t*"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Something else' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName NOT glob "A*"
    AND mag_correction.CustomerName NOT glob "*t*"
    AND mag_correction.CustomerName NOT glob "*n"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Starts with a capital A' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "A*"
    GROUP by CustomerName
)

gives me:

strings_conditions	count
Finishes with a non-capitalized n	187
Has a non-capitalized t	313
Something else	341
Starts with a capital A	64

And creating a view in SQLite that is meant to be exactly like df_new in pandas, using the following query:

SELECT
    CASE 
       WHEN CustomerName glob "*n" 
       THEN "Finishes with a non-capitalized n" 

       WHEN CustomerName glob "*t*" 
       THEN "Has a non-capitalized t" 

       WHEN CustomerName NOT glob "A*"
        AND CustomerName NOT glob "*t*"
        AND CustomerName NOT glob "*n"
       THEN "Something else"

       WHEN CustomerName glob "A*"
       THEN "Starts with a capital A" 
    END strings_conditions
  , CustomerName
FROM mag_correction
GROUP by CustomerName

and then querying it:

SELECT df_new.strings_conditions, count(*)
FROM df_new
GROUP by df_new.strings_conditions

gives yet another set of results (only two rows match the other SQLite query):

strings_conditions	count
Finishes with a non-capitalized n	187
Has a non-capitalized t	234
Something else	341
Starts with a capital A	31

Does anyone know why the results are not the same in all 3 cases?

If any clarification is needed, I'd be happy to provide more.

【Comments】:

【Answer 1】:

Actually, the pandas code and the SQL queries are not exactly identical, for the following reasons:

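Lack of mutual exclusivity: Your conditions are not mutually exclusive. In pandas, you do not count overlapping instances; each name is assigned to the first condition it matches. In the first SQL query, you do count overlaps, because you split the four conditions into separate SELECT queries (i.e., four subsets) and count each one on its own.

First match precedence: In pandas, the nested np.where calls return the first match, in the order of the logical conditions, for overlapping instances. Similarly, SQL's CASE takes the first match in order. If you use the same order of conditions in pandas and in the second query, the results should match.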
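The two points above can be seen on a handful of rows. The following is a minimal sketch with made-up names (they are not taken from the Superstore file, apart from the one quoted elsewhere in this thread): it counts each condition independently, as separate SELECT queries do, and then with first-match precedence, as nested np.where or CASE do.

```python
import pandas as pd

# Hypothetical names chosen so that several of them satisfy more than one condition.
names = pd.Series(
    ["Art Furguson", "Anna Smith", "Stan Lee", "Bob Martin", "Ben Brown", "Joe Doe"]
)

starts_A = names.str.startswith("A")
has_t = names.str.contains("t")
ends_n = names.str.endswith("n")

# Independent counts: every condition is evaluated over all names,
# so a name that satisfies several conditions is counted several times.
independent = {
    "A": int(starts_A.sum()),
    "t": int(has_t.sum()),
    "n": int(ends_n.sum()),
}

# First-match counts: a name is only counted under the first condition it meets.
first_match = {
    "A": int(starts_A.sum()),
    "t": int((~starts_A & has_t).sum()),
    "n": int((~starts_A & ~has_t & ends_n).sum()),
}

print(independent)
print(first_match)
```

Only the bucket tested first keeps the same count in both approaches, which is the pattern in the question: "Starts with a capital A" is 64 in both the pandas result and the first SQL query, while the other two buckets differ.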

To make the pandas code exactly match the first SQL query, you would need pandas.concat + drop_duplicates() to replicate SQL's UNION ALL:

cond1 = df['CustomerName'].str.startswith("A")
cond2 = df['CustomerName'].str.contains("t")
cond3 = df['CustomerName'].str.endswith("n")

# CONCATENATE FOUR SUBSETS
union_df = (pd.concat([
    df[cond1].assign(strings_conditions="Starts with a capital A"),
    df[cond2].assign(strings_conditions="Has a non-capitalized t"),
    df[cond3].assign(strings_conditions="Finishes with a non-capitalized n"),
    df[(cond1 == False) &
       (cond2 == False) & 
       (cond3 == False)].assign(strings_conditions="Something else")
]).reindex(['strings_conditions','CustomerName'], axis="columns")
  .drop_duplicates().dropna()
)

agg = union_df.groupby(['strings_conditions'])['strings_conditions'].count() 

agg
#                   strings_conditions  count
# 0  Finishes with a non-capitalized n    187
# 1            Has a non-capitalized t    313
# 2                     Something else    341
# 3            Starts with a capital A     64

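To make the second SQL query match the pandas code, adjust the order of the CASE conditions. But as mentioned, the following undercounts the true instances because of the overlaps.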

import sqlite3
...

db = sqlite3.connect("/path/to/database.db")

sql = """WITH sub AS (
   SELECT CASE 
             WHEN CustomerName glob "A*"               -- MOVED FROM LAST TO FIRST
             THEN "Starts with a capital A" 

             WHEN CustomerName glob "*t*"              -- STAYED AS SECOND
             THEN "Has a non-capitalized t" 

             WHEN CustomerName glob "*n"               -- MOVED FROM FIRST TO THIRD
             THEN "Finishes with a non-capitalized n" 

             WHEN CustomerName NOT glob "A*"           -- MOVED TO LAST
              AND CustomerName NOT glob "*t*" 
              AND CustomerName NOT glob "*n" 
             THEN "Something else" 
          END strings_conditions
        , CustomerName

   FROM mag_correction 
   GROUP by CustomerName
)

SELECT strings_conditions
     , COUNT(*) AS count
FROM sub
GROUP by strings_conditions"""

agg_db = pd.read_sql(sql, db)

agg_db
#                   strings_conditions  count
# 0  Finishes with a non-capitalized n    100
# 1            Has a non-capitalized t    288
# 2                     Something else    341
# 3            Starts with a capital A     64
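As a side note, the nested np.where calls from the question can be flattened with np.select, which makes the first-match order explicit as a list. This is only a sketch on a small made-up DataFrame; the `na=False` arguments are an added assumption so that missing names fall through to the default label instead of producing a non-boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the question the names come from the CSV's CustomerName column.
df = pd.DataFrame({"CustomerName": ["Art Furguson", "Stan Lee", "Ben Brown", "Joe Doe"]})

name = df["CustomerName"]

# Conditions are listed in priority order: the first True one decides the label.
conditions = [
    name.str.startswith("A", na=False),
    name.str.contains("t", na=False),
    name.str.endswith("n", na=False),
]
labels = [
    "Starts with a capital A",
    "Has a non-capitalized t",
    "Finishes with a non-capitalized n",
]

# Rows that match no condition receive the default label.
df["strings_conditions"] = np.select(conditions, labels, default="Something else")

print(df)
```

Reordering the `conditions` and `labels` lists together has the same effect as reordering the WHEN branches of the CASE expression.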

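【Comments】:

【Answer 2】:

The pandas line df['strings_conditions'] = .... assigns one condition to each dataframe row, and the same condition whenever that name is encountered again. The view behaves the same way: one "strings_conditions" value per name. (The numbers differ because the tests are in a different order than in the np.where calls.)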

The "UNION" SQL scans the table several times, so a given name is counted under every condition it satisfies.

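For example, the name "Art Furguson" meets all three conditions, but the value of df["strings_conditions"] will be "Starts with a capital A", while the view's df_new.strings_conditions will be "Finishes with a non-capitalized n".

That is why the results are not the same in all three cases.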
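The "Art Furguson" example can be reproduced with an in-memory SQLite database. This is a sketch under assumptions: the table holds only two made-up customers, the single-letter labels stand in for the full strings, and the CASE branches follow the order used in the question's view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mag_correction (CustomerName TEXT)")

# Hypothetical rows; one name appears twice to mimic a customer with several orders.
con.executemany(
    "INSERT INTO mag_correction VALUES (?)",
    [("Art Furguson",), ("Art Furguson",), ("Joe Doe",)],
)

# UNION ALL style: one scan per pattern, so the same name is counted under each one.
union_counts = [
    con.execute(
        "SELECT count(DISTINCT CustomerName) FROM mag_correction "
        "WHERE CustomerName GLOB ?",
        (pattern,),
    ).fetchone()[0]
    for pattern in ("A*", "*t*", "*n")
]

# CASE style: the first matching WHEN wins, so the name receives exactly one label.
case_label = con.execute(
    "SELECT CASE WHEN CustomerName GLOB '*n' THEN 'n' "
    "WHEN CustomerName GLOB '*t*' THEN 't' "
    "WHEN CustomerName GLOB 'A*' THEN 'A' END "
    "FROM mag_correction WHERE CustomerName = 'Art Furguson' LIMIT 1"
).fetchone()[0]

print(union_counts, case_label)
con.close()
```

The name contributes one to each of the three scans, but under CASE it ends up only in the bucket whose WHEN branch is listed first.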

【Comments】:

Sorry for the late reply. I tried your solution and it worked. I will go through the explanation you gave. Thanks a lot.
