Identical Pandas and SQLite queries not giving same results
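【Title】: Identical Pandas and SQLite queries not giving same results
【Posted】: 2021-09-04 14:12:12
【Question】: I have written a query against the file "Sample - Superstore.csv" from https://github.com/mikemooreviz/superstore that should give me the count of cases matching each of several string criteria, plus the count of cases matching none of those criteria.
The string column to be analysed is "CustomerName".
Basically: count the number of clients whose full name starts with a capital "A", the number whose full name contains a lowercase "t", the number whose full name ends with a lowercase "n", and then the number whose full name meets none of the above.
Here is the query in pandas: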
import numpy as np
import pandas as pd

df = pd.read_csv("path_of_csv_file", sep=";")

# Show full frames when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Label each row with the first condition it satisfies
df['strings_conditions'] = np.where(
    df['CustomerName'].str.startswith("A"),
    'Starts with a capital A',
    np.where(
        df['CustomerName'].str.contains("t"),
        'Has a non-capitalized t',
        np.where(
            df['CustomerName'].str.endswith("n"),
            'Finishes with a non-capitalized n',
            'Something else'
        )
    )
)

# One row per (label, name) pair, then count names per label
df_new = df.loc[:, ['strings_conditions', 'CustomerName']].drop_duplicates().dropna()
df_new.groupby(['strings_conditions'])['strings_conditions'].count()
which gives the following results:
strings_conditions | count |
---|---|
Finishes with a non-capitalized n | 100 |
Has a non-capitalized t | 288 |
Something else | 341 |
Starts with a capital A | 64 |
But the same query in SQLite:
SELECT 'Finishes with a non-capitalized n' AS strings_conditions, count(*)
FROM (
SELECT CustomerName
FROM mag_correction
WHERE mag_correction.CustomerName glob "*n"
GROUP by CustomerName
)
UNION ALL
SELECT 'Has a non-capitalized t' AS strings_conditions, count(*)
FROM (
SELECT CustomerName
FROM mag_correction
WHERE mag_correction.CustomerName glob "*t*"
GROUP by CustomerName
)
UNION ALL
SELECT 'Something else' AS strings_conditions, count(*)
FROM (
SELECT CustomerName
FROM mag_correction
WHERE mag_correction.CustomerName NOT glob "A*"
AND mag_correction.CustomerName NOT glob "*t*"
AND mag_correction.CustomerName NOT glob "*n"
GROUP by CustomerName
)
UNION ALL
SELECT 'Starts with a capital A' AS strings_conditions, count(*)
FROM (
SELECT CustomerName
FROM mag_correction
WHERE mag_correction.CustomerName glob "A*"
GROUP by CustomerName
)
gives me:
strings_conditions | count |
---|---|
Finishes with a non-capitalized n | 187 |
Has a non-capitalized t | 313 |
Something else | 341 |
Starts with a capital A | 64 |
And creating a view in SQLite that should be exactly equivalent to df_new in pandas, using the following query:
SELECT
CASE
WHEN CustomerName glob "*n"
THEN "Finishes with a non-capitalized n"
WHEN CustomerName glob "*t*"
THEN "Has a non-capitalized t"
WHEN CustomerName NOT glob "A*"
AND CustomerName NOT glob "*t*"
AND CustomerName NOT glob "*n"
THEN "Something else"
WHEN CustomerName glob "A*"
THEN "Starts with a capital A"
END strings_conditions
, CustomerName
FROM mag_correction
GROUP by CustomerName
and then querying it:
SELECT df_new.strings_conditions, count(*)
FROM df_new
GROUP by df_new.strings_conditions
gives yet another set of results (only two rows agree with the other SQLite query):
strings_conditions | count |
---|---|
Finishes with a non-capitalized n | 187 |
Has a non-capitalized t | 234 |
Something else | 341 |
Starts with a capital A | 31 |
Does anyone know why the results are not the same in all three cases?
If any clarification is needed, I am happy to provide more.
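【Answer 1】: Actually, the pandas code and the SQL queries are not exactly the same, for the following reasons:
Lack of mutual exclusivity: Your conditions are not mutually exclusive. In pandas, you do not count overlapping instances; each name is counted only under the first condition it matches. In the first SQL query, you do count overlaps, because you split the four conditions into separate SELECT queries (i.e., four subsets) and count each one on its own.
First-match precedence: In pandas, the nested np.where returns the first match, in the order the conditions are written, for names that satisfy more than one condition. Similarly, SQL's CASE takes the first matching branch in order. If you use the same order in pandas and in the second query, the results should match.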
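The two counting behaviours described above can be sketched without pandas or SQLite. This is only an illustration: the eight names are made up (they are not read from the Superstore file), and the helper names first_match and all_matches are hypothetical.

```python
from collections import Counter

# Hypothetical names, chosen so that some satisfy several conditions at once.
names = [
    "Art Furguson",    # starts with A, contains t, ends with n
    "Anna Gayman",     # starts with A, ends with n
    "Alex Russell",    # starts with A only
    "Pete Kriz",       # contains t only
    "Justin Ritter",   # contains t only
    "Sean Braxton",    # contains t, ends with n
    "Karen Ferguson",  # ends with n only
    "Ben Wallace",     # matches nothing
]

# Same conditions, in the same order as the nested np.where in the question.
conditions = [
    ("Starts with a capital A", lambda s: s.startswith("A")),
    ("Has a non-capitalized t", lambda s: "t" in s),
    ("Finishes with a non-capitalized n", lambda s: s.endswith("n")),
]


def first_match(name):
    """Return only the first label whose test passes (np.where / CASE behaviour)."""
    for label, test in conditions:
        if test(name):
            return label
    return "Something else"


def all_matches(name):
    """Return every label whose test passes (one SELECT per condition)."""
    labels = [label for label, test in conditions if test(name)]
    return labels or ["Something else"]


first_counts = Counter(first_match(n) for n in names)
overlap_counts = Counter(label for n in names for label in all_matches(n))

print(sum(first_counts.values()))    # one label per name
print(sum(overlap_counts.values()))  # overlapping names are counted several times
```

With first-match labelling the counts always add up to the number of distinct names; with one count per condition they can add up to more, which is the pattern seen in the question's first SQL result.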
To make the pandas code match the first SQL query exactly, you need pandas.concat to replicate SQL's UNION ALL, plus drop_duplicates() to mirror each subquery's GROUP BY CustomerName:
cond1 = df['CustomerName'].str.startswith("A")
cond2 = df['CustomerName'].str.contains("t")
cond3 = df['CustomerName'].str.endswith("n")

# CONCATENATE FOUR SUBSETS
union_df = (
    pd.concat([
        df[cond1].assign(strings_conditions="Starts with a capital A"),
        df[cond2].assign(strings_conditions="Has a non-capitalized t"),
        df[cond3].assign(strings_conditions="Finishes with a non-capitalized n"),
        df[(cond1 == False) &
           (cond2 == False) &
           (cond3 == False)].assign(strings_conditions="Something else")
    ])
    .reindex(['strings_conditions', 'CustomerName'], axis="columns")
    .drop_duplicates()
    .dropna()
)

agg = union_df.groupby(['strings_conditions'])['strings_conditions'].count()
agg
# strings_conditions count
# 0 Finishes with a non-capitalized n 187
# 1 Has a non-capitalized t 313
# 2 Something else 341
# 3 Starts with a capital A 64
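To make the second SQL query match the pandas code, rearrange the CASE conditions into the same order as the np.where chain. But as mentioned, the following undercounts the true number of matches per condition because of the overlaps.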
import sqlite3
...
db = sqlite3.connect("/path/to/database.db")
sql = """WITH sub AS (
SELECT CASE
WHEN CustomerName glob "A*" -- MOVED FROM LAST TO FIRST
THEN "Starts with a capital A"
WHEN CustomerName glob "*t*" -- STAYED AS SECOND
THEN "Has a non-capitalized t"
WHEN CustomerName glob "*n" -- MOVED FROM FIRST TO THIRD
THEN "Finishes with a non-capitalized n"
WHEN CustomerName NOT glob "A*" -- MOVED TO LAST
AND CustomerName NOT glob "*t*"
AND CustomerName NOT glob "*n"
THEN "Something else"
END strings_conditions
, CustomerName
FROM mag_correction
GROUP by CustomerName
)
SELECT strings_conditions
, COUNT(*) AS count
FROM sub
GROUP by strings_conditions"""
agg_db = pd.read_sql(sql, db)
agg_db
# strings_conditions count
# 0 Finishes with a non-capitalized n 100
# 1 Has a non-capitalized t 288
# 2 Something else 341
# 3 Starts with a capital A 64
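To see the effect of CASE order in isolation, here is a minimal sketch using an in-memory SQLite database. The six names are made up for illustration and are not taken from the Superstore file; the table and column names follow the question, while count_by_label is a hypothetical helper. An ELSE branch stands in for the explicit "Something else" condition.

```python
import sqlite3

# Hypothetical rows; table and column names follow the question.
rows = [
    ("Art Furguson",), ("Anna Gayman",), ("Alex Russell",),
    ("Pete Kriz",), ("Sean Braxton",), ("Ben Wallace",),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mag_correction (CustomerName TEXT)")
con.executemany("INSERT INTO mag_correction VALUES (?)", rows)


def count_by_label(case_branches):
    """Label each distinct name with a CASE built from the branches, then count."""
    sql = f"""
        SELECT strings_conditions, COUNT(*)
        FROM (
            SELECT DISTINCT CustomerName,
                   CASE {case_branches} ELSE 'Something else' END AS strings_conditions
            FROM mag_correction
        )
        GROUP BY strings_conditions
    """
    return dict(con.execute(sql).fetchall())


# Single-quoted string literals; GLOB is case-sensitive.
a = "WHEN CustomerName GLOB 'A*' THEN 'Starts with a capital A'"
t = "WHEN CustomerName GLOB '*t*' THEN 'Has a non-capitalized t'"
n = "WHEN CustomerName GLOB '*n' THEN 'Finishes with a non-capitalized n'"

pandas_order = count_by_label(f"{a} {t} {n}")  # same order as the np.where chain
view_order = count_by_label(f"{n} {t} {a}")    # equivalent to the order in the question's view

print(pandas_order)
print(view_order)
```

Both calls label the same six names, yet the per-label counts differ because names such as "Art Furguson" fall into whichever branch is tested first.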
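【Answer 2】: The pandas line df['strings_conditions'] = .... assigns exactly one condition to each dataframe row, and the same condition every time that name is encountered again. The view has the same issue: one "strings_condition" per name. (The numbers differ because the order of the tests is not the same as in the np.wheres.)
The "UNION" SQL scans the table several times, so a given name is counted under every condition it satisfies.
For example, the name "Art Furguson" satisfies all three conditions, but the value of df["strings_condition"] will be "Starts with a capital A", while df_new.strings_condition will be "Finishes with a non-capitalized n".
That is why the results are not the same in all three cases.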
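The claim about "Art Furguson" can be checked directly. The sketch below uses Python's fnmatchcase as a stand-in for SQLite's GLOB (an assumption that holds for simple * patterns like these, since both are case-sensitive) and plain string tests as stand-ins for the pandas .str methods.

```python
from fnmatch import fnmatchcase

name = "Art Furguson"

# GLOB-style patterns from the SQL, paired with the equivalent plain string tests.
checks = {
    "A*": name.startswith("A"),
    "*t*": "t" in name,
    "*n": name.endswith("n"),
}

for pattern, string_test_result in checks.items():
    # fnmatchcase never folds case, so "A*" does not match a leading "a".
    glob_result = fnmatchcase(name, pattern)
    print(pattern, glob_result, string_test_result)
```

Since the name passes all three tests, its label is decided purely by which test comes first, and it is counted three times when each condition is counted separately.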
【Comments】:
Sorry for the late reply. I tried your solution and it worked. I will go through the explanation you gave. Thank you very much.