Transpose rows and columns (a.k.a. pivot) only with a minimum COUNT()?
Posted: 2012-10-31 21:59:22
[Question]:
This is my table "tab_test":
year animal price
2000 kittens 79
2000 kittens 93
2000 kittens 100
2000 puppies 15
2000 puppies 32
2001 kittens 31
2001 kittens 17
2001 puppies 65
2001 puppies 48
2002 kittens 84
2002 kittens 86
2002 puppies 15
2002 puppies 95
2003 kittens 62
2003 kittens 24
2003 puppies 36
2003 puppies 41
2004 kittens 65
2004 kittens 85
2004 puppies 58
2004 puppies 95
2005 kittens 45
2005 kittens 25
2005 puppies 15
2005 puppies 35
2006 kittens 50
2006 kittens 80
2006 puppies 95
2006 puppies 49
2007 kittens 40
2007 kittens 19
2007 puppies 81
2007 puppies 38
2008 kittens 37
2008 kittens 51
2008 puppies 29
2008 puppies 72
2009 kittens 84
2009 kittens 26
2009 puppies 49
2009 puppies 34
2010 kittens 75
2010 kittens 96
2010 puppies 18
2010 puppies 26
2011 kittens 35
2011 kittens 21
2011 puppies 90
2011 puppies 18
2012 kittens 12
2012 kittens 23
2012 puppies 74
2012 puppies 79
Here is some code that transposes rows and columns, so that I get the average price of "kittens" and "puppies" per year:
SELECT
year,
AVG(CASE WHEN animal = 'kittens' THEN price END) AS "kittens",
AVG(CASE WHEN animal = 'puppies' THEN price END) AS "puppies"
FROM tab_test
GROUP BY year
ORDER BY year;
The output of the code above is:
year kittens puppies
2000 90.6666666666667 23.5
2001 24.0 56.5
2002 85.0 55.0
2003 43.0 38.5
2004 75.0 76.5
2005 35.0 25.0
2006 65.0 72.0
2007 29.5 59.5
2008 44.0 50.5
2009 55.0 41.5
2010 85.5 22.0
2011 28.0 54.0
2012 17.5 76.5
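I want a table like the second one, but containing only the items whose COUNT() in the first table is at least 3. In other words, the goal is to get this as output: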
year kittens
2000 90.6666666666667
Only "kittens" in 2000 has at least 3 instances in the first table. Is this possible in PostgreSQL?
[Answer 1]:
CASE
If your case is as simple as demonstrated, a CASE expression will do:
SELECT year
, sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens
, sum(CASE WHEN animal = 'puppies' THEN price END) AS puppies
FROM (
SELECT year, animal, avg(price) AS price
FROM tab_test
GROUP BY year, animal
HAVING count(*) > 2
) t
GROUP BY year
ORDER BY year;
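It doesn't matter whether you use sum(), max() or min() as the aggregate function in the outer query. The subquery returns at most one row per year and animal, so they all yield the same value in this case.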
SQL Fiddle
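To illustrate the point about the outer aggregate: the subquery returns at most one row per (year, animal), so each CASE expression sees a single non-null value per year. A small check, as a sketch against the question's tab_test (the column aliases kittens_sum, kittens_min and kittens_max are made up for this illustration):

```sql
-- All three aggregates collapse to the same value, because the derived
-- table t holds at most one row per (year, animal).
SELECT year
     , sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens_sum
     , min(CASE WHEN animal = 'kittens' THEN price END) AS kittens_min
     , max(CASE WHEN animal = 'kittens' THEN price END) AS kittens_max
FROM  (
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2          -- keep only groups with at least 3 rows
   ) t
GROUP  BY year
ORDER  BY year;
```

With the sample data from the question, only the group (2000, kittens) has three rows, so this should return a single row for 2000 with three identical values.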
crosstab()
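With more categories, a crosstab() query will be simpler. It should also be faster for bigger tables.
You need to install the additional module tablefunc (once per database). Since Postgres 9.1 that is as simple as: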
CREATE EXTENSION tablefunc;
Details in this related answer:
PostgreSQL Crosstab Query
SELECT * FROM crosstab(
'SELECT year, animal, avg(price) AS price
FROM tab_test
GROUP BY animal, year
HAVING count(*) > 2
ORDER BY 1,2'
,$$VALUES ('kittens'::text), ('puppies')$$)
AS ct ("year" text, "kittens" numeric, "puppies" numeric);
There is no sqlfiddle for this one, because the site doesn't allow additional modules.
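Since crosstab() only exists once tablefunc is installed, it can help to check for the module first. A minimal sketch, assuming Postgres 9.1 or later (the IF NOT EXISTS variant and the pg_extension lookup are not part of the original answer):

```sql
-- Install the module only if it is not there yet (no error on a second run).
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Show the installed version; an empty result means tablefunc is missing.
SELECT extname, extversion
FROM   pg_extension
WHERE  extname = 'tablefunc';
```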
Benchmark
To verify my claims, I ran a quick benchmark with close-to-real data in my small test database. PostgreSQL 9.1.6. Tested with EXPLAIN ANALYZE, best of 10.
Test setup with 10020 rows:
CREATE TABLE tab_test (year int, animal text, price numeric);
-- years with lots of rows
INSERT INTO tab_test
SELECT 2000 + ((g + random() * 300))::int/1000
, CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
, (random() * 200)::numeric
FROM generate_series(1,10000) g;
-- .. and some years with only few rows to include cases with count < 3
INSERT INTO tab_test
SELECT 2010 + ((g + random() * 10))::int/2
, CASE WHEN (g + (random() * 1.5)::int) %2 = 0 THEN 'kittens' ELSE 'puppies' END
, (random() * 200)::numeric
FROM generate_series(1,20) g;
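A sketch of how one candidate could be timed against this setup, assuming the table was just created and filled as above. The row count check and the choice of the CASE query are only for illustration; the reported time varies between runs, which is why the best of several runs is taken:

```sql
-- Sanity check: both INSERTs together should have added 10020 rows.
SELECT count(*) FROM tab_test;

-- Execute the query for real and report the actual plan and run time.
EXPLAIN ANALYZE
SELECT year
     , sum(CASE WHEN animal = 'kittens' THEN price END) AS kittens
     , sum(CASE WHEN animal = 'puppies' THEN price END) AS puppies
FROM  (
   SELECT year, animal, avg(price) AS price
   FROM   tab_test
   GROUP  BY year, animal
   HAVING count(*) > 2
   ) t
GROUP  BY year
ORDER  BY year;
```

Because the test data comes from random(), absolute numbers will differ from one setup to the next; only the relative order of the candidates is meaningful.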
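Results:
@bluefeet
Total runtime: 95.401 ms
@wildplasser (different results, includes rows with count <= 3)
Total runtime: 64.497 ms
@Andriy (+ ORDER BY) & @Erwin1 - CASE (performance about the same for both)
Total runtime: 39.105 ms
@Erwin2 - crosstab()
Total runtime: 17.644 ms
With only 20 rows the results are largely proportional (but irrelevant). Only @wildplasser's CTE has more overhead and spikes a little.
With more than a handful of rows, crosstab() quickly takes the lead.
@Andriy's query performs about the same as my simplified version. The aggregate function in the outer SELECT (min(), max(), sum()) makes no measurable difference (there are just two rows per group).
Everything as expected, no surprises. Take my setup and try it @home.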
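[Comments]:
The code works well, though it's a bit slower than @AndriyM's. Still, thanks for the extra tips!
@user1626730: Both are slower? The CASE version should be equally fast or faster, apart from the ORDER BY that is missing in @Andriy's version. Hm, maybe sum() is slower than max(), but that should be hardly relevant. For more complex cases or bigger tables, the crosstab() version will be faster.
Well, the first one takes between 100-200 ms on SQLfiddle, while @AndriyM's takes between 1-10 ms. Even so, the crosstab information may be useful to me in the future, since I do plan on creating custom crosstabs.
@user1626730: Those numbers are measurement artifacts. With small data sets you can hardly get reliable timings on sqlfiddle. I ran a quick benchmark locally to verify my claims and added it to my answer.
Ah, I see. In that case, I'll try all of this code. Thanks.
[Answer 2]: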
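Here is an alternative to @bluefeet's suggestion, somewhat similar but avoiding the join (instead, the upper-level grouping is applied to the already grouped result set):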
SELECT
year,
MAX(CASE animal WHEN 'kittens' THEN avg_price END) AS "kittens",
MAX(CASE animal WHEN 'puppies' THEN avg_price END) AS "puppies"
FROM (
SELECT
animal,
year,
COUNT(*) AS cnt,
AVG(Price) AS avg_price
FROM tab_test
GROUP BY
animal,
year
) s
WHERE cnt >= 3
GROUP BY
year
;
[Answer 3]:
Is this what you are looking for:
SELECT t1.year,
AVG(CASE WHEN t1.animal = 'kittens' THEN t1.price END) AS "kittens",
AVG(CASE WHEN t1.animal = 'puppies' THEN t1.price END) AS "puppies"
FROM tab_test t1
inner join
(
select animal, count(*) YearCount, year
from tab_test
group by animal, year
) t2
on t1.animal = t2.animal
and t1.year = t2.year
where t2.YearCount >= 3
group by t1.year
See SQL Fiddle with Demo
[Answer 4]:
CREATE TABLE pussyriot(year INTEGER NOT NULL
, animal varchar
, price integer
);
INSERT INTO pussyriot(year , animal , price ) VALUES
(2000, 'kittens', 79)
, (2000, 'kittens', 93)
...
, (2007, 'puppies', 81)
, (2007, 'puppies', 38)
;
-- a self join is a poor man's pivot:
WITH cal AS ( -- generate calendar file
SELECT generate_series(MIN(pr.year) , MAX(pr.year)) AS year
FROM pussyriot pr
)
, fur AS (
SELECT distinct year, animal, AVG(price) AS price
FROM pussyriot
GROUP BY year, animal
-- UPDATE: added next line
HAVING COUNT(*) >= 3
)
SELECT cal.year
, pussy.price AS price_of_the_pussy
, puppy.price AS price_of_the_puppy
FROM cal
LEFT JOIN fur pussy ON pussy.year=cal.year AND pussy.animal='kittens'
LEFT JOIN fur puppy ON puppy.year=cal.year AND puppy.animal='puppies'
;
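As written, the query above returns one row for every year in the calendar, with NULL wherever a group does not reach the threshold. A possible variation, assuming years without any qualifying average should be dropped (the WHERE clause, the ORDER BY and the aliases kitten, kittens and puppies are additions for illustration, not part of the original answer):

```sql
WITH cal AS (                       -- one row per year in the data's range
    SELECT generate_series(MIN(pr.year), MAX(pr.year)) AS year
    FROM pussyriot pr
    )
, fur AS (                          -- average per (year, animal), min. 3 rows
    SELECT year, animal, AVG(price) AS price
    FROM pussyriot
    GROUP BY year, animal
    HAVING COUNT(*) >= 3
    )
SELECT cal.year
    , kitten.price AS kittens
    , puppy.price  AS puppies
FROM cal
LEFT JOIN fur kitten ON kitten.year = cal.year AND kitten.animal = 'kittens'
LEFT JOIN fur puppy  ON puppy.year  = cal.year AND puppy.animal  = 'puppies'
WHERE kitten.price IS NOT NULL      -- keep a year only if at least one
   OR puppy.price  IS NOT NULL      -- animal reached the threshold
ORDER BY cal.year;
```

The DISTINCT from the original CTE is left out here, since GROUP BY year, animal already yields one row per combination.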