Mysql 统计标签出现次数（一行变多行）

Posted 2020-12-12 苏轶然

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了Mysql 统计标签出现次数（一行变多行）相关的知识，希望对你有一定的参考价值。

需求背景

需求

一张数据表
其中有sid字段，代表tag，每行数据可能有多个tag字段
统计全量数据中所有tag出现的次数（按tag分组，分别有多少数据）

source table demo

id	sid
1	a3,a4,a1,a2,a5
2	a5,a3,a4,a2
3	a5,a3,a4

target table demo

sid_tag	sid_occurrence
a1	1
a2	2
a3	3
a4	3
a5	3

SQL query in mysql

select substring_index( substring_index( sid , ‘,‘, id_table.help_topic_id + 1 ), ‘,‘,- 1 ) as sid_tag,count(sid) as sid_occurrence from src_t JOIN mysql.help_topic id_table on id_table.help_topic_id < (
  length(src_t.sid) - length(replace(src_t.sid, ‘,‘, ‘‘)) + 1
) GROUP BY sid_tag

hive

select tag,count(*) from(
select tag from table_name
lateral view explode(split(taglist,‘,‘)) r1 AS tag) a group by a.tag;

解析

利用系统表mysql.help_topic进行了index的记录（可以使用其他表），更暴力的解法是直接新建一张只有id列的表，可以预防系统表id不够用（一般足够了）
JOIN a.id < b.id，解决了复制行的操作；若b.id = 5，a.id 为（1，2，3，4，5），返回数据为 [(1,5),(2,5),(3,5),(4,5)]
substring_index( substring_index( sid , ‘,‘, id_table.help_topic_id + 1 ), ‘,‘,- 1 ) 利用index表中的index，作为子串索引进行数据拆分。索引构建基于上一步的join。
随后正常groupby统计

实验

mysql

select * from (
  SELECT concat(‘a‘,topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN (‘1‘, ‘2‘, ‘3‘, ‘4‘, ‘5‘)
) as id_table JOIN (
  SELECT concat(‘b‘,topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN (‘1‘, ‘2‘, ‘3‘)
) as id_table2 on id_table2.id <= id_table.id

结果

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

以上是关于Mysql 统计标签出现次数（一行变多行）的主要内容，如果未能解决你的问题，请参考以下文章

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3