Multiple joins and last N rows per each parent

Posted 2023-02-16

Asked: 2017-05-30 13:56:18

Question:

I have 3 tables.

companies
- id
- name
- user_id

departments
- id
- name
- user_id
- company_id

invoices
- id
- department_id
- price
- created_at

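For performance reasons, I'm trying to fetch all the data needed for a "dashboard" screen in one large MySQL query. It's worth mentioning that the invoices table has about 700k records and will only keep growing.

So I need to fetch all of the user's companies, their departments, and the last 2 invoices for each department (the 2 highest dates per department id).

I have no problem with the first 2; I can do that easily, for example: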

SELECT companies.id as company_id, companies.name as company_name, departments.id as department_id, departments.name as department_name
FROM companies
LEFT JOIN departments
ON companies.id = departments.company_id
WHERE companies.user_id = 1

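I'm just struggling to get the latest 2 invoices for each department. What's the best way to do this in the same query?

Here is the requested data, the same as in the SQL Fiddle.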

CREATE TABLE `companies` (
  `id` int(10) UNSIGNED NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `user_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `departments` (
  `id` int(10) UNSIGNED NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `user_id` int(11) NOT NULL,
  `company_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `invoices` (
  `id` int(10) UNSIGNED NOT NULL,
  `price` decimal(6,2)  NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `department_id` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `companies`
  ADD PRIMARY KEY (`id`);

ALTER TABLE `departments`
  ADD PRIMARY KEY (`id`);

ALTER TABLE `invoices`
  ADD PRIMARY KEY (`id`);

ALTER TABLE `companies`
  MODIFY `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;

ALTER TABLE `departments`
  MODIFY `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;

ALTER TABLE `invoices`
  MODIFY `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;

INSERT INTO companies
  (`name`, `user_id`)
VALUES
  ('Google', 1),
  ('Apple', 1),
  ('IBM', 1)
;

INSERT INTO departments
  (`name`, `user_id`, `company_id`)
VALUES
  ('Billing', 1, 1),
  ('Support', 1, 1),
  ('Tech', 1, 1),
  ('Billing Dept', 1, 2),
  ('Support Dept', 1, 2),
  ('Tech Dept', 1, 2),
  ('HR', 1, 3),
  ('IT', 1, 3),
  ('Executive', 1, 3)
;

INSERT INTO invoices
  (`price`, `created_at`, `department_id`)
VALUES
  (155.23, '2016-04-07 14:39:29', 1),
  (123.23, '2016-04-07 14:40:26', 1),
  (150.50, '2016-04-07 14:40:30', 1),
  (123.23, '2016-04-07 14:41:38', 1),
  (432.65, '2016-04-07 14:44:15', 1),
  (323.23, '2016-04-07 14:44:22', 2),
  (541.43, '2016-04-07 14:44:33', 2),
  (1232.23, '2016-04-07 14:44:36', 2),
  (433.42, '2016-04-07 14:44:37', 2),
  (1232.43, '2016-04-07 14:44:39', 2),
  (850.40, '2016-04-07 14:44:46', 3),
  (133.32, '2016-04-07 14:45:11', 3),
  (12.43, '2016-04-07 14:45:15', 3),
  (154.23, '2016-04-07 14:45:25', 3),
  (132.43, '2016-04-07 14:46:01', 3),
  (859.55, '2016-04-07 14:53:11', 4),
  (123.43, '2016-04-07 14:53:45', 4),
  (433.33, '2016-04-07 14:54:14', 4),
  (545.12, '2016-04-07 14:54:54', 4),
  (949.99, '2016-04-07 14:55:10', 4),
  (1112.32, '2016-04-07 14:53:40', 5),
  (132.32, '2016-04-07 14:53:44', 5),
  (42.43, '2016-04-07 14:53:48', 5),
  (545.34, '2016-04-07 14:53:56', 5),
  (2343.32, '2016-04-07 14:54:05', 5),
  (3432.43, '2016-04-07 14:54:02', 6),
  (231.32, '2016-04-07 14:54:22', 6),
  (1242.33, '2016-04-07 14:54:54', 6),
  (232.32, '2016-04-07 14:55:12', 6),
  (43.12, '2016-04-07 14:55:23', 6),
  (4343.23, '2016-04-07 14:55:24', 7),
  (1123.32, '2016-04-07 14:55:31', 7),
  (4343.32, '2016-04-07 14:55:56', 7),
  (354.23, '2016-04-07 14:56:04', 7),
  (867.76, '2016-04-07 14:56:12', 7),
  (45.76, '2016-04-07 14:55:54', 8),
  (756.65, '2016-04-07 14:56:08', 8),
  (153.74, '2016-04-07 14:56:14', 8),
  (534.86, '2016-04-07 14:56:23', 8),
  (867.65, '2016-04-07 14:56:55', 8),
  (433.56, '2016-04-07 14:56:32', 9),
  (1423.43, '2016-04-07 14:56:54', 9),
  (342.56, '2016-04-07 14:57:11', 9),
  (343.75, '2016-04-07 14:57:23', 9),
  (1232.43, '2016-04-07 14:57:34', 9)
;

Here is the desired result.

company_id| company_name| department_id | department_name | invoice_price | invoice_created_at
         1| Google      |             1 | Billing         |        123.23 | 2016-04-07 14:41:38 | 
         1| Google      |             1 | Billing         |        432.65 | 2016-04-07 14:44:15 | 
         1| Google      |             2 | Support         |        433.42 | 2016-04-07 14:44:37 | 
         1| Google      |             2 | Support         |       1232.43 | 2016-04-07 14:44:39 | 
         1| Google      |             3 | Tech            |        154.23 | 2016-04-07 14:45:25 | 
         1| Google      |             3 | Tech            |        132.43 | 2016-04-07 14:46:01 | 
         2| Apple       |             4 | Billing Dept    |        545.12 | 2016-04-07 14:54:54 | 
         2| Apple       |             4 | Billing Dept    |        949.99 | 2016-04-07 14:55:10 | 
         2| Apple       |             5 | Support Dept    |        545.34 | 2016-04-07 14:53:56 | 
         2| Apple       |             5 | Support Dept    |       2343.32 | 2016-04-07 14:54:05 | 
         2| Apple       |             6 | Tech Dept       |        232.32 | 2016-04-07 14:55:12 | 
         2| Apple       |             6 | Tech Dept       |         43.12 | 2016-04-07 14:55:23 | 
         3| IBM         |             7 | HR              |        354.23 | 2016-04-07 14:56:04 | 
         3| IBM         |             7 | HR              |        867.76 | 2016-04-07 14:56:12 | 
         3| IBM         |             8 | IT              |        534.86 | 2016-04-07 14:56:23 | 
         3| IBM         |             8 | IT              |        867.65 | 2016-04-07 14:56:55 | 
         3| IBM         |             9 | Executive       |        343.75 | 2016-04-07 14:57:23 | 
         3| IBM         |             9 | Executive       |       1232.43 | 2016-04-07 14:57:34 |

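Comments:

"Now I have no problem with the first 2..." - your query doesn't do that.

@Strawberry Thanks. I added the SQL Fiddle data.

@PaulSpiegel Not sure why you think that. See the SQL Fiddle link I posted, which shows the query working as expected.

@zen "No problem with the first 2" - sorry, I thought you meant the first two invoices. But now I guess you mean the first two tables.

@Strawberry I added the desired result, comma-separated. Hopefully this helps clarify the question.

Answer 1:

I must admit I struggled a bit to see how your result set matches your description and data set, but here's something to play with...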

SELECT x.price
     , x.created_at
     , x.department_id
     , x.department
     , x.department_user 
     , x.company_id
     , x.company
     , x.company_user 
  FROM 
     ( SELECT i.id
            , i.price
            , i.created_at
            , i.department_id
            , d.name department
            , d.user_id department_user 
            , d.company_id
            , c.name company
            , c.user_id company_user
            , CASE WHEN @prev=department_id THEN @i:=@i+1 ELSE @i:=1 END i
            , @prev := i.department_id
         FROM invoices i 
         JOIN departments d 
           ON d.id = i.department_id 
         JOIN companies c 
           ON c.id = d.company_id
         JOIN (SELECT @prev:=null, @i:=0) vars
        ORDER 
           BY department_id
            , created_at DESC
     ) x
 WHERE i<=2;

Here's a slower way of conceptualizing the same idea (I've omitted the less relevant parts)...

SELECT x.* 
  FROM invoices x 
  JOIN invoices y 
    ON y.department_id = x.department_id 
   AND y.created_at >= x.created_at  -- count rows at least as recent as x, so the newest 2 are kept
 GROUP 
    BY x.department_id
     , x.created_at
HAVING COUNT(*) <=2;
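
For comparison, the per-department ranking that the first query builds with user variables can also be written with ROW_NUMBER() on MySQL 8.0 or later. This is a minimal sketch and not part of the original answer. It assumes an in-memory SQLite database (3.25+ for window functions) driven from Python in place of MySQL, only so that it is self-contained, and it uses a trimmed copy of the sample invoices.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0+; the SQL below uses only
# syntax that both engines accept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (
  id INTEGER PRIMARY KEY,
  price REAL NOT NULL,
  created_at TEXT,
  department_id INTEGER NOT NULL
);
INSERT INTO invoices (price, created_at, department_id) VALUES
  (155.23, '2016-04-07 14:39:29', 1),
  (123.23, '2016-04-07 14:41:38', 1),
  (432.65, '2016-04-07 14:44:15', 1),
  (323.23, '2016-04-07 14:44:22', 2),
  (433.42, '2016-04-07 14:44:37', 2),
  (1232.43, '2016-04-07 14:44:39', 2);
""")

# Number the invoices of each department from newest to oldest,
# then keep only the rows numbered 1 and 2.
rows = conn.execute("""
SELECT department_id, price, created_at
FROM (
  SELECT i.*,
         ROW_NUMBER() OVER (PARTITION BY department_id
                            ORDER BY created_at DESC) AS rn
  FROM invoices i
) ranked
WHERE rn <= 2
ORDER BY department_id, created_at
""").fetchall()

for row in rows:
    print(row)
```

Changing `rn <= 2` to `rn <= N` gives the last N rows per department.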

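Comments:

Yes, it kind of works, but on my live data it takes about 7-8 seconds. I think Paul Spiegel's answer is faster and more accurate in terms of the expected result. Thanks for the help though!

@zen I seriously doubt that. But hey, whatever, right? ;-)

@zen ... and the index should be on (department_id, created_at)

That index sped it up by about 1 second. The fastest I got was 6.6620 seconds. Also, as you mentioned, even in the test case it doesn't return exactly the correct results.

@zen To be fair, I didn't use created_at in my query, on the assumption that the IDs are in the same order. What is the runtime if you use ORDER BY department_id, i.ID DESC?

Answer 2:

One idea is to add one more JOIN, to the invoices table: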

LEFT JOIN invoices i ON  i.department_id = departments.id

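That way you get all invoices for each department. But you need to limit them to the last two per department. One way is an additional IN condition with a correlated subquery that uses LIMIT 2: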

LEFT JOIN invoices i
  ON  i.department_id = departments.id
  AND i.id IN (
    SELECT i1.id
    FROM invoices i1
    WHERE i1.department_id = departments.id
    ORDER BY i1.id DESC
    LIMIT 2
  )

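But for some odd reason, MySQL doesn't allow LIMIT inside an IN subquery. So we need to be trickier and avoid the IN condition. Instead, we can use >= and select the second-highest id with LIMIT 1 OFFSET 1: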

  AND i.id >= (
    SELECT i1.id
    FROM invoices i1
    WHERE i1.department_id = departments.id
    ORDER BY i1.id DESC
    LIMIT 1
    OFFSET 1
  )

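Now one last problem: if a department has only one invoice, the subquery finds no second one. It returns NULL and the condition always fails. For that case we use COALESCE to replace NULL with 0.

So the final query should look like this: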
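
The NULL case can be seen on a tiny data set. The sketch below is illustrative only: it assumes an in-memory SQLite database driven from Python rather than MySQL, the data is hypothetical (department 1 has three invoices, department 2 has one), and `kept_ids` is a helper name made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id INTEGER PRIMARY KEY, department_id INTEGER NOT NULL);
-- Hypothetical data: department 1 has three invoices, department 2 only one.
INSERT INTO invoices (id, department_id) VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Correlated subquery: second-highest invoice id within the same department.
SECOND_HIGHEST = """
SELECT i1.id FROM invoices i1
WHERE i1.department_id = i.department_id
ORDER BY i1.id DESC LIMIT 1 OFFSET 1
"""

def kept_ids(threshold_expr):
    """Return the invoice ids that satisfy i.id >= threshold_expr."""
    sql = f"SELECT i.id FROM invoices i WHERE i.id >= {threshold_expr} ORDER BY i.id"
    return [r[0] for r in conn.execute(sql)]

# Without COALESCE the comparison against NULL drops department 2 entirely.
without_coalesce = kept_ids(f"({SECOND_HIGHEST})")
# With COALESCE the threshold falls back to 0, so the single invoice is kept.
with_coalesce = kept_ids(f"COALESCE(({SECOND_HIGHEST}), 0)")

print(without_coalesce)  # [2, 3]
print(with_coalesce)     # [2, 3, 4]
```

The fallback value 0 works here because the ids are positive AUTO_INCREMENT values, so every id is greater than or equal to it.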

SELECT companies.id as company_id,
       companies.name as company_name,
       departments.id as department_id,
       departments.name as department_name,
       i.id as invoice_id,
       i.price as invoice_price
FROM companies
LEFT JOIN departments
  ON companies.id = departments.company_id
LEFT JOIN invoices i
  ON  i.department_id = departments.id
  AND i.id >= COALESCE((
    SELECT i1.id
    FROM invoices i1
    WHERE i1.department_id = departments.id
    ORDER BY i1.id DESC
    LIMIT 1
    OFFSET 1
  ), 0)
WHERE companies.user_id = 1

http://sqlfiddle.com/#!9/8a956/14

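Comments:

This works great in SQL Fiddle, but it blows up on my live data. The invoices table has 700k rows, departments has 1.5k, and companies has 25, so that may be why. Any idea how efficient this is on large data sets and which indexes might help? Right now I only have primary key indexes on the id columns.

@zen You should at least have an index on every foreign key. The most important one is invoices.department_id.

Well, that makes sense. I added indexes on the foreign keys and it executes in 0.0098 seconds! It's very fast. Thanks a lot, I can definitely go with this. One more question for you: which part of the query limits it to the 2 latest invoices?

@zen The join condition AND i.id >= (...). The subquery uses ORDER BY i1.ID DESC LIMIT 1 OFFSET 1 to select the second-highest ID for the given department. So you only get rows whose ID is greater than or equal to the second-highest ID. If you need a limit of 10, you would use LIMIT 1 OFFSET 9 - that means "skip 9 and take 1", i.e. the 10th row.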
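
The "skip N-1 and take 1" rule from the last comment, together with the foreign-key index, can be sketched as a small helper. This is an illustration under stated assumptions, not code from the thread: it runs against an in-memory SQLite database from Python instead of MySQL, and the function name `last_n_invoice_ids`, the index name, and the sample rows are all hypothetical.

```python
import sqlite3

def last_n_invoice_ids(conn, n):
    """Return (department_id, invoice_id) pairs for the n highest invoice ids per department."""
    if n < 1:
        raise ValueError("n must be at least 1")
    sql = """
    SELECT d.id, i.id
    FROM departments d
    LEFT JOIN invoices i
      ON  i.department_id = d.id
      AND i.id >= COALESCE((
        SELECT i1.id
        FROM invoices i1
        WHERE i1.department_id = d.id
        ORDER BY i1.id DESC
        LIMIT 1 OFFSET ?      -- skip n-1 rows, take the n-th highest id
      ), 0)
    ORDER BY d.id, i.id
    """
    return conn.execute(sql, (n - 1,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, department_id INTEGER NOT NULL);
-- Index on the foreign key, as recommended in the comments (name is made up).
CREATE INDEX idx_invoices_department_id ON invoices (department_id);
INSERT INTO departments (id, name) VALUES (1, 'Billing'), (2, 'Support'), (3, 'Tech');
INSERT INTO invoices (id, department_id) VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2);
""")

last_two = last_n_invoice_ids(conn, 2)
last_three = last_n_invoice_ids(conn, 3)
print(last_two)
print(last_three)
```

Because of the LEFT JOIN, a department with no invoices (department 3 here) still appears once, with a NULL invoice id.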
