SQLite vs Pandas

Posted by DaisyLinux


Analysis details

For the analysis, we ran each of the six tasks 10 times, at 5 different sample sizes, for each of 3 programs: pandas, sqlite, and memory-sqlite (where the database lives in memory instead of on disk). See below for the definitions of each task.
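For a sense of the setup, here is a minimal sketch of a timing harness along these lines (the names here are illustrative, not necessarily what the repo's driver code uses):

import statistics
import time

def time_task(task_fn, repeats=10):
    """Run task_fn `repeats` times; return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)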

Our sample data was randomly generated. Here’s what it looks like:

 

sql_vs_pandas$ head -n 5 data/sample.100.csv

 

qqFjQHQc,c,1981,82405.59262172286

vILuhVGz,a,1908,27712.27152250119

mwCjpoOF,f,1992,58974.38538762843

kGbriYAK,d,1927,42258.24179716961

MeoxuJng,c,1955,96907.56416314292

 

This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.
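Data in this shape is easy to regenerate; a sketch, assuming the filter column draws from the letters a-f (the exact alphabet used isn't stated):

import random
import string

def random_row():
    # 8-character random string, one filter character, a year, and a float
    name = ''.join(random.choices(string.ascii_letters, k=8))
    dept = random.choice('abcdef')  # assumed alphabet for the filter column
    birth = random.randint(1900, 2000)
    salary = random.uniform(10000, 100000)
    return f'{name},{dept},{birth},{salary}'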

Results

sqlite or memory-sqlite is faster for the following tasks:

  • select two columns from the data (<0.1 ms at any data size for sqlite; pandas scales with the data, up to just under 0.5 seconds for 10 million records)
  • filter data (10x-50x faster with sqlite; the difference becomes more pronounced as the data grows)
  • sort by a single column (pandas is always a bit slower, but this was the closest task)

pandas is faster for the following tasks:

  • groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records)
  • load data from disk (5x faster for >10k records, even better for smaller data)
  • join data (2-5x faster, but slower for smallest dataset of 1000 rows)

Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.

There is no significant speedup from loading sqlite in its own shell vs. via pandas.
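(For reference, the memory-sqlite variant differs from plain sqlite only in the connection string; the filename here is illustrative:)

import sqlite3

disk_conn = sqlite3.connect('data/sample.db')  # sqlite: database file on disk
mem_conn = sqlite3.connect(':memory:')         # memory-sqlite: database in RAM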

Overall, joining and loading data is the slowest whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas' second-fastest operation for larger data).

Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.

All code is on our GitHub page.

 

Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).

sqlite is first, followed by pandas:

sort

def sort(self):
    # sqlite: ORDER BY on the name column
    self._cursor.execute('SELECT * FROM employee ORDER BY name ASC;')
    self._conn.commit()


def sort(self):
    # pandas: sort the frame by the name column
    self.df_employee.sort_values(by='name')

  

 

select

def select(self):
    # sqlite: project two columns
    self._cursor.execute('SELECT name, dept FROM employee;')
    self._conn.commit()


def select(self):
    # pandas: select two columns
    self.df_employee[["name", "dept"]]

 

 

load

def load(self):
    # sqlite: create the tables, then bulk-load both CSVs via pandas to_sql
    self._cursor.execute('CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);')
    df = pd.read_csv(self.employee_file)
    df.columns = employee_columns
    df.to_sql('employee', self._conn, if_exists='replace')

    self._cursor.execute('CREATE TABLE bonus (name varchar(255), bonus double);')
    df_bonus = pd.read_csv(self.bonus_file)
    df_bonus.columns = bonus_columns
    df_bonus.to_sql('bonus', self._conn, if_exists='replace')


def load(self):
    # pandas: read both CSVs into dataframes
    self.df_employee = pd.read_csv(self.employee_file)
    self.df_employee.columns = employee_columns

    self.df_bonus = pd.read_csv(self.bonus_file)
    self.df_bonus.columns = bonus_columns

 

 

join

def join(self):
    # sqlite: inner join on name, computing salary + bonus in SQL
    self._cursor.execute('SELECT employee.name, employee.salary + bonus.bonus '
                         'FROM employee INNER JOIN bonus ON employee.name = bonus.name')
    self._conn.commit()


def join(self):
    # pandas: merge on name, then compute the total in a new column
    joined = self.df_employee.merge(self.df_bonus, on='name')
    joined['total'] = joined['bonus'] + joined['salary']

 

 

filter

def filter(self):
    # sqlite: filter rows where dept = 'a' (single quotes are the standard
    # SQL string-literal syntax)
    self._cursor.execute("SELECT * FROM employee WHERE dept = 'a';")
    self._conn.commit()


def filter(self):
    # pandas: boolean-mask filtering on the dept column
    self.df_employee[self.df_employee['dept'] == 'a']

  

 

group by

def groupby(self):
    # sqlite: mean birth year and total salary per department
    self._cursor.execute('SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;')
    self._conn.commit()


def groupby(self):
    # pandas: per-department mean birth year and salary sum
    self.df_employee.groupby('dept').agg({'birth': np.mean, 'salary': np.sum})
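On pandas 0.25 or later, the same aggregation can also be written with named aggregation, which labels the output columns; a sketch:

# equivalent aggregation with named output columns (pandas >= 0.25)
self.df_employee.groupby('dept').agg(
    avg_birth=('birth', 'mean'),
    total_salary=('salary', 'sum'),
)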

 

 

SQLite Query Performance, Transpose, Melt and Pandas

Posted 2018-09-30 12:22:06

Question description:

Background

I'm working with the Water Survey of Canada's HyDat Database, which covers more than 8,000 hydrometric stations. I've written code to query the daily flow data:

conn = create_connection('db/Hydat.sqlite3')
cur = conn.cursor()
cur.execute("SELECT * FROM DLY_FLOWS WHERE STATION_NUMBER=?", (station,))

After loading the returned data into a dataframe:

rows = cur.fetchall()    
column_headers = [description[0] for description in cur.description]
df = pd.DataFrame(rows, columns=column_headers)    
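(As an aside, the fetchall/DataFrame steps can be collapsed into one call with pandas' read_sql_query; the behavior should be equivalent:)

df = pd.read_sql_query(
    "SELECT * FROM DLY_FLOWS WHERE STATION_NUMBER=?",
    conn,
    params=(station,),
)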

The data comes in roughly this format:

STATION_NUM  YEAR  MONTH ...  FLOW1  FLAG1  FLOW2  FLAG2  ...  
02QC003      1965  02    ...  32.5   E      33.4   A      ...
02QC003      1965  03    ...  44.6   E      45.4   A      ...
02QC003      1965  04    ...  54.3   E      56.2   A      ... 
...          ...   ...   ...  ...    ...    ...    ...    ...

The N in the FLOWN and FLAGN columns runs from 1 to 31, corresponding to the day of the month.

I'm trying to improve the performance of querying the data and reshaping it into the following daily time-series format:

STATION_NUM  YEAR  MONTH  DAY  FLOW  FLAG
02QC003      1965  02     1    32.5  E
02QC003      1965  02     2    33.4  A
02QC003      1965  02     3    33.7  A
...          ...   ...    ...  ...   ...

The number of daily-value rows I'm trying to unpivot runs as high as ~1000 (roughly equivalent to 100 years, where one row represents a month and the daily values are in columns). Handling a few queries is not a problem, but my target is ~40M queries. Currently I'm using the Pandas melt function, first for the daily flows and then for the data flags (not shown, for brevity):

id_var_headers = column_headers[:11]

all_val_vars = [e for e in column_headers if 'FLOW' in e]
flow_val_vars = [e for e in all_val_vars if '_' not in e]

df_flows = pd.melt(df,
                   id_vars=id_var_headers,
                   value_vars=flow_val_vars,
                   value_name='DAILY_FLOW', 
                   var_name='DAY').sort_values(by=['YEAR', 'MONTH'])

df_flows['DAY'] = df_flows['DAY'].apply(map_day_to_var_name)

def map_day_to_var_name(s):
    # strip the 'FLOW' prefix, leaving just the day number
    if re.search(r'\d', s):
        return s[re.search(r'\d', s).span()[0]:]

I found that the second-slowest operation in the whole sequence runs in 10^-3 s; the limiting step is the melt function, which appears to be about 10x slower. I'm hoping for a 10x or better improvement on this step.
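(A vectorized alternative to the .apply call, assuming the melted DAY column holds names like FLOW1 through FLOW31, would be:)

# pull the trailing digits out of e.g. 'FLOW17' without a Python-level loop
df_flows['DAY'] = df_flows['DAY'].str.extract(r'(\d+)$', expand=False).astype(int)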

I tried to use this as an opportunity to learn more about SQLite, and spent some time trying to figure out how to structure the query to see whether the "transform" part of my ETL process could be merged into a single step and outperform pandas. What I came up with works in theory (see this SQLFiddle), but I'm struggling to implement it in my code. See the 2018-05-02 update for details.

This answer from user piRSquared seems very close to what I'm after, though I'm stuck on the groupby step. Following the steps outlined in piRSquared's answer, I expected YEAR and MONTH to expand to unpivot the daily values, which leads me to believe I'm misapplying the groupby function.

Any help is much appreciated, as is any feedback on how I've phrased the question (this is my first time posting one).

Update 2018-04-20: wide_to_long

Scott Boston's suggestion is much tidier, though I needed to add a few steps:

First, I zero-padded the day numbers in the column names: df.rename(columns={'FLOW1': 'FLOW01', ...}, inplace=True)

Because I need a DAY column, I also omitted the .drop('VARIABLE', axis=1) when creating the rows:

df = pd.wide_to_long(raw_df, ['FLOW', 'FLOW_SYMBOL'], idx_cols, 'DAY', sep='', suffix='.').reset_index()

Testing on a record with 76K daily values, I get ~0.1 s with the melt function and ~0.6 s with the wide_to_long function.

Are there other ways to improve this step?

Update 2018-05-02

I went back and checked response times for queries that roughly represent the bounding record lengths. For a query returning ~10 rows (a short record) and one returning >1K rows (roughly the longest period of record in the database), I get 0.04 to 0.1 s per query, respectively. This result suggests to me that a better SQL query would not beat a simple query followed by the pandas melt function.

I therefore think my current process is performing about as well as can be expected.


Answer 1:

I think what you need is pd.wide_to_long:

Given df:

  STATION_NUM  YEAR  MONTH  FLOW1 FLAG1  FLOW2 FLAG2
0     02QC003  1965      2   32.5     E   33.4     A
1     02QC003  1965      3   44.6     E   45.4     A
2     02QC003  1965      4   54.3     E   56.2     A

Using pd.wide_to_long:

pd.wide_to_long(df,['FLOW','FLAG'],['STATION_NUM','YEAR','MONTH'],'VARIABLE',sep='',suffix='.')\
  .reset_index().drop('VARIABLE', axis=1)

Output:

  STATION_NUM  YEAR  MONTH  FLOW FLAG
0     02QC003  1965      2  32.5    E
1     02QC003  1965      2  33.4    A
2     02QC003  1965      3  44.6    E
3     02QC003  1965      3  45.4    A
4     02QC003  1965      4  54.3    E
5     02QC003  1965      4  56.2    A

Discussion:

Thanks Scott. Your wide_to_long solution is much tidier, but each call seems to perform on the same order (~10^-1 s). I'll update my question with this information. If there's no faster way with the pandas functions, is there a way to improve the data transformation with another approach, i.e. a better SQLite query?

Answer 2:

Have you considered doing 31 SELECTs, one for each day's column, and then UNIONing the days together? It would certainly be a lot of repetitive SQL, and I can't predict whether it would be faster than Pandas.
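A sketch of how that query might be built (untested; assumes the FLOW/FLAG column naming shown in the question):

# one SELECT per day column, stitched together with UNION ALL
parts = [
    f"SELECT STATION_NUMBER, YEAR, MONTH, {d} AS DAY, "
    f"FLOW{d} AS FLOW, FLAG{d} AS FLAG "
    f"FROM DLY_FLOWS WHERE STATION_NUMBER = ?"
    for d in range(1, 32)
]
query = " UNION ALL ".join(parts)
params = ("02QC003",) * 31  # the placeholder repeats once per SELECT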

Discussion:

Thanks for the reply, Alex. I went back and tested the performance of the simple query (ignoring the complexity of handling the 2x31 concatenated queries), and it doesn't appear to be faster than my Pandas approach. I've updated my question with this information.
