Identical Pandas and SQLite queries not giving same results

Posted: 2021-09-04 14:12:12

Question:

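I wrote a query against the file "Sample - Superstore.csv" from https://github.com/mikemooreviz/superstore that should give me a count of the cases matching each of several string criteria, plus a count of the cases that match none of those criteria.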

The string column to analyze is "CustomerName".

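Basically: count the number of clients whose full name starts with a capital "A", the number whose full name contains a lowercase "t", the number whose full name ends with a lowercase "n", and then the number whose full name meets none of the above conditions.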

Here is the query in pandas:

import pandas as pd
import numpy as np

df = pd.read_csv("path_of_csv_file", sep=";")

# Show full rows and columns when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df['strings_conditions'] = np.where(
    df['CustomerName'].str.startswith("A"),
    'Starts with a capital A',
    np.where(
        df['CustomerName'].str.contains("t"),
        'Has a non-capitalized t',
        np.where(
            df['CustomerName'].str.endswith("n"),
            'Finishes with a non-capitalized n',
            'Something else'
        )
    )
)

df_new = df.loc[:,['strings_conditions','CustomerName']].drop_duplicates().dropna()

df_new.groupby(['strings_conditions'])['strings_conditions'].count()

This gives the following result:

strings_conditions count
Finishes with a non-capitalized n 100
Has a non-capitalized t 288
Something else 341
Starts with a capital A 64

But the same query in SQLite:

SELECT 'Finishes with a non-capitalized n' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "*n"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Has a non-capitalized t' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "*t*"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Something else' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName NOT glob "A*"
    AND mag_correction.CustomerName NOT glob "*t*"
    AND mag_correction.CustomerName NOT glob "*n"
    GROUP by CustomerName
)

UNION ALL

SELECT 'Starts with a capital A' AS strings_conditions, count(*) 
FROM (
    SELECT CustomerName
    FROM mag_correction
    WHERE mag_correction.CustomerName glob "A*"
    GROUP by CustomerName
)

gives me:

strings_conditions count
Finishes with a non-capitalized n 187
Has a non-capitalized t 313
Something else 341
Starts with a capital A 64

And creating a view in SQLite that is exactly the same as df_new in pandas, using the following query:

SELECT
    CASE 
       WHEN CustomerName glob "*n" 
       THEN "Finishes with a non-capitalized n" 

       WHEN CustomerName glob "*t*" 
       THEN "Has a non-capitalized t" 

       WHEN CustomerName NOT glob "A*"
        AND CustomerName NOT glob "*t*"
        AND CustomerName NOT glob "*n"
       THEN "Something else"

       WHEN CustomerName glob "A*"
       THEN "Starts with a capital A" 
    END strings_conditions
  , CustomerName
FROM mag_correction
GROUP by CustomerName

and then querying it:

SELECT df_new.strings_conditions, count(*)
FROM df_new
GROUP by df_new.strings_conditions

again gives a different set of results (only two rows agree with the other SQLite query):

strings_conditions count
Finishes with a non-capitalized n 187
Has a non-capitalized t 234
Something else 341
Starts with a capital A 31

Does anyone know why the results are not the same in all 3 cases?

If any clarification is needed, I am happy to provide more.


Answer 1:

Actually, the pandas code and the SQL queries are not exactly the same, for the following reasons:

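Lack of mutual exclusivity: your conditions are not mutually exclusive. In pandas, you do not count overlapping instances, because each row is labeled with the first condition it matches. In the first SQL query, you do count overlaps, because you split each of the four conditions into its own SELECT query (i.e. four subsets) for counting.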

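First-match precedence: in pandas, the nested np.where calls return the first match, in the order of the logical conditions, for overlapping instances. Similarly, SQL's CASE takes the first matching branch in order. If you align the order of conditions in pandas and in the second query, the results should match.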
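The two points above can be sketched on a handful of made-up names. This is only an illustration under assumptions: the names are hypothetical (not taken from the Superstore file), the labels are shortened, and np.select stands in for the nested np.where calls because it applies the same first-match rule.

```python
import numpy as np
import pandas as pd

# Hypothetical names chosen so that some satisfy several conditions at once
names = pd.Series(
    ["Art Furguson", "Anna Bell", "Pete Kriz",
     "Matt Collison", "Dean Solomon", "Ben Wallace"],
    name="CustomerName",
)

starts_a = names.str.startswith("A")
has_t = names.str.contains("t")
ends_n = names.str.endswith("n")

# Independent counting (like the UNION ALL query):
# a name is counted once for every condition it satisfies.
overlap_counts = {
    "A": int(starts_a.sum()),
    "t": int(has_t.sum()),
    "n": int(ends_n.sum()),
    "else": int((~starts_a & ~has_t & ~ends_n).sum()),
}

# First-match labelling (like nested np.where or CASE):
# each name receives exactly one label, the first condition that is true.
first_match = np.select([starts_a, has_t, ends_n], ["A", "t", "n"], default="else")
first_match_counts = pd.Series(first_match).value_counts().to_dict()

print(overlap_counts)
print(first_match_counts)
```

With these six names the independent counts add up to more than six, while the first-match counts add up to exactly six. The "else" bucket is the same in both, which is consistent with "Something else" being 341 in every result in the question.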

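To have the pandas code exactly match the first SQL query, you need pandas.concat to replicate SQL's UNION ALL, plus drop_duplicates() to mirror the GROUP BY CustomerName inside each subquery: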

cond1 = df['CustomerName'].str.startswith("A")
cond2 = df['CustomerName'].str.contains("t")
cond3 = df['CustomerName'].str.endswith("n")

# CONCATENATE FOUR SUBSETS
union_df = (pd.concat([
    df[cond1].assign(strings_conditions="Starts with a capital A"),
    df[cond2].assign(strings_conditions="Has a non-capitalized t"),
    df[cond3].assign(strings_conditions="Finishes with a non-capitalized n"),
    df[(cond1 == False) &
       (cond2 == False) & 
       (cond3 == False)].assign(strings_conditions="Something else")
]).reindex(['strings_conditions','CustomerName'], axis="columns")
  .drop_duplicates().dropna()
)

agg = union_df.groupby(['strings_conditions'])['strings_conditions'].count() 

agg
#                   strings_conditions  count
# 0  Finishes with a non-capitalized n    187
# 1            Has a non-capitalized t    313
# 2                     Something else    341
# 3            Starts with a capital A     64

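To have the second SQL query match the pandas code, adjust the order of the CASE conditions. But as mentioned, the query below undercounts the true number of instances because of the overlap.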

import sqlite3
...

db = sqlite3.connect("/path/to/database.db")

sql = """WITH sub AS (
   SELECT CASE 
             WHEN CustomerName glob "A*"               -- MOVED FROM LAST TO FIRST
             THEN "Starts with a capital A" 

             WHEN CustomerName glob "*t*"              -- STAYED AS SECOND
             THEN "Has a non-capitalized t" 

             WHEN CustomerName glob "*n"               -- MOVED FROM FIRST TO THIRD
             THEN "Finishes with a non-capitalized n" 

             WHEN CustomerName NOT glob "A*"           -- MOVED TO LAST
              AND CustomerName NOT glob "*t*" 
              AND CustomerName NOT glob "*n" 
             THEN "Something else" 
          END strings_conditions
        , CustomerName

   FROM mag_correction 
   GROUP by CustomerName
)

SELECT strings_conditions
     , COUNT(*) AS count
FROM sub
GROUP by strings_conditions"""

agg_db = pd.read_sql(sql, db)

agg_db
#                   strings_conditions  count
# 0  Finishes with a non-capitalized n    100
# 1            Has a non-capitalized t    288
# 2                     Something else    341
# 3            Starts with a capital A     64
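The effect of the CASE order can also be checked without the real database. The sketch below is an assumption-laden toy: it builds an in-memory SQLite table with hypothetical names, uses shortened labels, and uses ELSE instead of the explicit NOT GLOB branch. The helper label_counts is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mag_correction (CustomerName TEXT)")

# Hypothetical rows; the repeated name mimics several order lines per customer
names = ["Art Furguson", "Art Furguson", "Anna Bell", "Pete Kriz",
         "Matt Collison", "Dean Solomon", "Ben Wallace"]
conn.executemany("INSERT INTO mag_correction VALUES (?)", [(n,) for n in names])


def label_counts(case_branches):
    """Label each distinct name with the first matching CASE branch, then count labels."""
    sql = f"""
        SELECT strings_conditions, COUNT(*)
        FROM (
            SELECT DISTINCT CustomerName,
                   CASE {case_branches} ELSE 'else' END AS strings_conditions
            FROM mag_correction
        )
        GROUP BY strings_conditions
    """
    return dict(conn.execute(sql).fetchall())


# Same order as the nested np.where calls: A, then t, then n
pandas_order = label_counts(
    "WHEN CustomerName GLOB 'A*' THEN 'A' "
    "WHEN CustomerName GLOB '*t*' THEN 't' "
    "WHEN CustomerName GLOB '*n' THEN 'n'"
)

# Same order as the view in the question: n, then t, then A
view_order = label_counts(
    "WHEN CustomerName GLOB '*n' THEN 'n' "
    "WHEN CustomerName GLOB '*t*' THEN 't' "
    "WHEN CustomerName GLOB 'A*' THEN 'A'"
)

print(pandas_order)
print(view_order)
```

Both orders label every distinct name exactly once, so the totals agree; only the split between the labels moves, which is the pattern seen between the pandas result and the view in the question.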


Answer 2:

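The pandas line df['strings_conditions'] = ... assigns exactly one condition to each dataframe row, and the same condition every time that name is encountered again. The view has the same property: one "strings_conditions" value per name. (The numbers differ because the order of the tests differs from that of the np.where calls.)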

The "UNION" SQL scans the table multiple times, so a given name is counted under every condition it satisfies.

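For example, the name "Art Furguson" meets all three conditions, but the value of df["strings_conditions"] will be "Starts with a capital A", while df_new.strings_conditions will be "Finishes with a non-capitalized n".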

That is why the results are not the same in all three cases.
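To see which conditions a single name satisfies, a quick check in plain Python may help. This is only a sketch: it assumes that fnmatch.fnmatchcase (case-sensitive wildcard matching from the standard library) is a close enough stand-in for SQLite's GLOB on these simple patterns, and matching_conditions is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns taken from the queries in the question, keyed by their labels
patterns = {
    "Starts with a capital A": "A*",
    "Has a non-capitalized t": "*t*",
    "Finishes with a non-capitalized n": "*n",
}


def matching_conditions(name):
    """Return every label whose pattern the name satisfies, in dictionary order."""
    return [label for label, pattern in patterns.items() if fnmatchcase(name, pattern)]


matches = matching_conditions("Art Furguson")
print(matches)
```

A name that returns more than one label here is counted several times by the UNION ALL query, but only once (under whichever label is tested first) by np.where or CASE.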

Comments:

Sorry for the late reply. I tried your solution and it worked. I will go through the explanation you gave. Thank you very much.
