Make JPQL/QueryDSL not generate terrible queries

Posted 2021-11-15 07:41:18

I am using QueryDSL 4.4.0 with Hibernate 5.4.32 to query a simple blogging platform against a PostgreSQL database. My problem is that JPQL, and by extension QueryDSL, insists on generating truly shockingly bad queries, and I would like to know whether there is a way to make it stop. I'd rather not drop down to native queries, since the queries are already being generated for me.

I have essentially three entities:

@Entity
@Table(indexes = { ... })
public class Note {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @NotNull
    private UUID id;

    @ManyToMany(fetch = FetchType.EAGER)
    private List<Keyword> keywords;

    ...
}


@Entity
@Table(indexes = { @Index(name = "keyword_parent", columnList = "parent_id"), ... })
public class Keyword {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @NotNull
    private UUID id;

    @EqualsAndHashCode.Exclude
    @ManyToOne(fetch = FetchType.LAZY)
    private Keyword parent;

    @ManyToMany(fetch = FetchType.LAZY)
    private List<Keyword> implies = new ArrayList<>();

    @OneToMany(fetch = FetchType.LAZY, mappedBy = "parent", orphanRemoval = false)
    private List<Keyword> children = new ArrayList<>();

    ...
}


@Entity
@Table(indexes = { @Index(columnList = "child_id"), @Index(columnList = "parent_id"),
        @Index(columnList = "child_id,parent_id", unique = true), @Index(columnList = "ref"), ... })
@IdClass(KeywordCacheId.class)
@Where(clause = "ref > 0")
public class KeywordCache implements Serializable {
    private static final long serialVersionUID = 1L;

    @NotNull
    @ManyToOne(fetch = FetchType.EAGER)
    @Id
    private Keyword child;

    @NotNull
    @ManyToOne(fetch = FetchType.EAGER)
    @Id
    private Keyword parent;

    private int ref;

    ...
}


public class KeywordCacheId implements Serializable {
    private static final long serialVersionUID = 1L;

    private UUID child;
    private UUID parent;

    // equals + hashCode
}

(simplified to show only the main structure)

A note has many keywords. Keywords have hierarchical relationships plus auxiliary ("implies") relationships. These relationships are too complex to handle in SQL, so a cache is built that reflects whether two keywords are related.
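As an illustration (this is not the author's code), the relation cache described above can be thought of as the transitive closure of the keyword graph: for each keyword, the set of keywords reachable via parent and implies links. A minimal, self-contained sketch with plain collections, assuming keyword IDs are strings and both kinds of link are merged into one edge map:

```java
import java.util.*;

// Hypothetical sketch: compute, for every keyword, the set of keywords it is
// related to, i.e. the transitive closure of the combined parent/implies edges.
// The real KeywordCache additionally stores a reference count ("ref") so that
// entries can be maintained incrementally; that part is omitted here.
public class KeywordClosure {

    // edges: keyword id -> directly related keyword ids (parent + implies)
    public static Map<String, Set<String>> closure(Map<String, List<String>> edges) {
        Map<String, Set<String>> result = new HashMap<>();
        for (String start : edges.keySet()) {
            Set<String> reachable = new LinkedHashSet<>();
            Deque<String> todo = new ArrayDeque<>(edges.get(start));
            while (!todo.isEmpty()) {          // simple graph walk
                String next = todo.pop();
                if (reachable.add(next)) {     // only expand nodes seen for the first time
                    todo.addAll(edges.getOrDefault(next, List.of()));
                }
            }
            result.put(start, reachable);
        }
        return result;
    }
}
```

With such a closure materialized as rows of (child, ancestor) pairs, the "are these two keywords related?" question becomes a single indexed lookup, which is exactly what the KeywordCache table provides to the queries below.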

I have 7680 notes, 1308 keywords, 39k note-keyword relations, 12 keyword-keyword relations, and a computed cache of 3002 relations. In other words, a small database.

I want to find all notes that contain a keyword related to a given list of keyword IDs.

My first attempt was

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_" + variable);
            query.from(cache);
            query.where(cache.child.in(QNote.note.keywords));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

    public Page<Note> find(List<String> filter, Pageable page) {
        JPAQuery<Note> query = new JPAQuery<>(entityManager);
        query.from(QNote.note);
        query.select(QNote.note);
        query.distinct();
        query = addFilter(query, filter);
        query.offset(page.getOffset());
        query.limit(page.getPageSize());
        QueryResults<Note> data = query.fetchResults();
        return new PageImpl<>(data.getResults(), page, data.getTotal());
    }

This produces reasonable JPQL, which gets translated into SQL that is completely out of its mind:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
where kc_6205f3b41e354d63909ef253866371b1.child member of note.keywords and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
where
    ( keywordcac1_.ref > 0)
    and (keywordcac1_.child_id in (
    select
        keywords2_.keywords_id
    from
        Note_Keyword keywords2_
    where
        note0_.id = keywords2_.Note_id))
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=10229575.18..10229575.19 rows=1 width=8)
  ->  Nested Loop  (cost=4.30..10229542.12 rows=13222 width=16)
        Join Filter: (SubPlan 1)
        ->  Seq Scan on note note0_  (cost=0.00..206.15 rows=8815 width=16)
        ->  Materialize  (cost=4.30..13.31 rows=3 width=16)
              ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                    Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                    Filter: (ref > 0)
                    ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                          Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        SubPlan 1
          ->  Seq Scan on note_keyword keywords2_  (cost=0.00..773.59 rows=5 width=16)
                Filter: (note0_.id = note_id)

The count is there because of the pagination. The crazy "in" construct makes this query take about 150 seconds.

Replacing the filter method with

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_" + variable);
            query.from(cache);
            query.where(QNote.note.keywords.any().eq(cache.child));
//          query.where(cache.child.in(QNote.note.keywords));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

gives me slightly worse JPQL, and the SQL still looks overly complicated because of an unnecessary subselect:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
where exists (select 1
from note.keywords as note_keywords_0
where note_keywords_0 = kc_6205f3b41e354d63909ef253866371b1.child) and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
where
    ( keywordcac1_.ref > 0)
    and (exists (
    select
        1
    from
        Note_Keyword keywords2_,
        Keyword keyword3_
    where
        note0_.id = keywords2_.Note_id
        and keywords2_.keywords_id = keyword3_.id
        and keyword3_.id = keywordcac1_.child_id))
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=3212.04..3212.05 rows=1 width=8)
  ->  Hash Semi Join  (cost=1459.43..3211.99 rows=18 width=16)
        Hash Cond: ((note0_.id = keywords2_.note_id) AND (keywordcac1_.child_id = keywords2_.keywords_id))
        ->  Nested Loop  (cost=4.30..550.01 rows=26445 width=32)
              ->  Seq Scan on note note0_  (cost=0.00..206.15 rows=8815 width=16)
              ->  Materialize  (cost=4.30..13.31 rows=3 width=16)
                    ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                          Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          Filter: (ref > 0)
                          ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                                Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Hash  (cost=871.22..871.22 rows=38927 width=48)
              ->  Hash Join  (cost=92.43..871.22 rows=38927 width=48)
                    Hash Cond: (keywords2_.keywords_id = keyword3_.id)
                    ->  Seq Scan on note_keyword keywords2_  (cost=0.00..676.27 rows=38927 width=32)
                    ->  Hash  (cost=76.08..76.08 rows=1308 width=16)
                          ->  Seq Scan on keyword keyword3_  (cost=0.00..76.08 rows=1308 width=16)

Now the query can make use of my indexes and takes about 120 ms. It still uses a completely unnecessary, stupid subselect, though. Writing the query by hand, I get

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_,
    Note_Keyword keywords2_,
    KeywordCache keywordcac1_
where
    keywordcac1_.ref > 0
    and note0_.id = keywords2_.Note_id
    and keywords2_.keywords_id = keywordcac1_.child_id
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=816.18..816.19 rows=1 width=8)
  ->  Nested Loop  (cost=13.61..815.99 rows=75 width=16)
        ->  Hash Join  (cost=13.33..792.09 rows=75 width=16)
              Hash Cond: (keywords2_.keywords_id = keywordcac1_.child_id)
              ->  Seq Scan on note_keyword keywords2_  (cost=0.00..676.27 rows=38927 width=32)
              ->  Hash  (cost=13.29..13.29 rows=3 width=16)
                    ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                          Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          Filter: (ref > 0)
                          ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                                Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Index Only Scan using note_pkey on note note0_  (cost=0.29..0.32 rows=1 width=16)
              Index Cond: (id = keywords2_.note_id)

This query takes 30 ms. While the improvement from 120 ms to 30 ms may look insignificant, it is still a factor of 4, and this is a central loop in my application, so I want to keep it fast. Especially since I can add multiple keywords (normal usage is expected to have 3-6 keywords in the list) and I plan to add sorting, so the subselects have to be efficient (or absent).

So, is there a way to make QueryDSL generate better JPQL in the second case, or to make JPQL (as implemented by Hibernate) drop its obsession with the wild "in" subselect over the collection from the first case?

Comments:

It's the any() that introduces the subquery in the SQL; drop it if you don't need it, and replace it with your own subquery or a join.

Thanks for the suggestion. Unfortunately it is needed, because keywords is a collection, so I can't compare it directly to cache.child. Using "in" would be the natural way, but that takes me back to my first attempt, which produces nicer JPQL that gets translated into terrible SQL.

JPQL supports association joins, which render exactly the underlying SQL you want, and Querydsl renders association joins just fine: .innerJoin(QNote.note.keywords, QKeyword.keyword) is the syntax you are looking for, and it can filter in the ON or WHERE clause. I strongly recommend updating to Querydsl 5.0.0, because 4.x still uses Hibernate 4 legacy joins instead of JPA 2.1 joins.

Thanks for the clarification. The trick I had overlooked was joining the keywords explicitly. It adds an extra join in the SQL, but avoiding the materialized subselect more than makes up for it. I will look into QueryDSL 5; I'm on the version managed by Spring Boot, but they seem to be considering dropping version management for QueryDSL, so this may be a good time to jump.

Answer 1:

Thanks to Jan-Willem Gmelig Meyling's comment, I got it working using an explicit join with the keywords:

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_" + variable);
            QKeyword keyword = new QKeyword("k_" + variable);
            query.from(cache);
            query.innerJoin(QNote.note.keywords, keyword);
            query.where(keyword.eq(cache.parent));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

This results in a query much closer to what I wrote by hand:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
  inner join note.keywords as k_6205f3b41e354d63909ef253866371b1
where k_6205f3b41e354d63909ef253866371b1 = kc_6205f3b41e354d63909ef253866371b1.parent and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
inner join Note_Keyword keywords2_ on
    note0_.id = keywords2_.Note_id
inner join Keyword keyword3_ on
    keywords2_.keywords_id = keyword3_.id
where
    ( keywordcac1_.ref > 0)
    and keyword3_.id = keywordcac1_.parent_id
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=812.06..812.07 rows=1 width=8)
  ->  Nested Loop  (cost=4.87..812.04 rows=6 width=16)
        ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.45 rows=3 width=16)
              Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
              Filter: (ref > 0)
              ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                    Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Materialize  (cost=0.56..798.52 rows=2 width=32)
              ->  Nested Loop  (cost=0.56..798.51 rows=2 width=32)
                    ->  Index Only Scan using keyword_pkey on keyword keyword3_  (cost=0.28..8.29 rows=1 width=16)
                          Index Cond: (id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                    ->  Nested Loop  (cost=0.29..790.19 rows=2 width=32)
                          ->  Seq Scan on note_keyword keywords2_  (cost=0.00..773.59 rows=2 width=32)
                                Filter: (keywords_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          ->  Index Only Scan using note_pkey on note note0_  (cost=0.29..8.30 rows=1 width=16)
                                Index Cond: (id = keywords2_.note_id)

The extra join is more than made up for by avoiding the subselect, making performance similar to the hand-written query.
