Count how many times distinct words appear in a column (Oracle 12c SQL)
【Posted】: 2020-11-23 13:14:45
【Question】: How can I count the number of times each distinct word appears in a column?
Sample data and the expected output are shown below.
+--------+------------------------------+
| PERIOD | STRING                       |
+--------+------------------------------+
| 1      | this is some text            |
| 2      | more text                    |
| 3      | this could be some more text |
+--------+------------------------------+
+-------+-------+
| WORD  | COUNT |
+-------+-------+
| this  | 2     |
| is    | 1     |
| some  | 2     |
| text  | 3     |
| more  | 2     |
| could | 1     |
| be    | 1     |
+-------+-------+
Thanks,
【Question comments】:
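Do you want to do this in pure SQL, or can you use a language such as PL/SQL?
Alternatively, PL/SQL can be used as a solution as well as SQL.
By the way, the text may contain both upper-case and lower-case letters, e.g. This and this. Does that matter? That is, are the two considered equal?
@BarbarosÖzhan Case doesn't matter, so the two are equal.
Almost a duplicate of ***.com/q/38371989/1509264, just with a COUNT step added at the end.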
【Answer 1】:
You can use a hierarchical query, for example:
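WITH t2 AS
(
  SELECT REGEXP_SUBSTR(LOWER(string),'[^[:space:]]+',1,level) AS word
    FROM t
  -- one level per word: number of whitespace runs + 1
  -- (the POSIX class must sit inside a bracket expression: '[[:space:]]')
  CONNECT BY level <= REGEXP_COUNT(string,'[[:space:]]+') + 1
      -- keeps each source row's hierarchy separate and avoids a cycle error
      AND PRIOR SYS_GUID() IS NOT NULL
      AND PRIOR period = period
)
SELECT word, COUNT(*) AS count
  FROM t2
 WHERE word IS NOT NULL -- drops empty tokens, e.g. from trailing whitespace
 GROUP BY word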
Demo
P.S. The LOWER() function is applied to handle case sensitivity, so that This and this are counted as the same word.
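To see what the hierarchical step generates before the aggregation, here is a minimal sketch that lists one row per word for an inline copy of two of the sample rows. The inline t CTE and the word_no alias are assumptions made for illustration. The sketch also counts words directly with REGEXP_COUNT(string, '[^[:space:]]+') instead of counting separators, which is a variation and not the answer's exact condition.

```sql
-- Illustrative only: inline sample rows stand in for the real table t.
WITH t (period, string) AS (
  SELECT 1, 'this is some text' FROM DUAL UNION ALL
  SELECT 2, 'more text'         FROM DUAL
)
SELECT period,
       LEVEL AS word_no,  -- position of the word within its row
       REGEXP_SUBSTR(LOWER(string), '[^[:space:]]+', 1, LEVEL) AS word
FROM   t
-- generate as many levels as there are runs of non-whitespace characters
CONNECT BY LEVEL <= REGEXP_COUNT(string, '[^[:space:]]+')
       AND PRIOR period = period           -- stay within the same source row
       AND PRIOR SYS_GUID() IS NOT NULL    -- non-deterministic call, prevents a false cycle error
ORDER BY period, word_no;
```

For these two rows the query should return four rows for period 1 (this, is, some, text) and two rows for period 2 (more, text). Grouping that result by word gives the counts the question asks for.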
【Comments】:
【Answer 2】: The trick is splitting the strings into words. One method uses a recursive CTE:
with words(word, string, n) as (
select regexp_substr(string, '[^ ]+', 1, 1) as word, string, 1 as n
from t
union all
select regexp_substr(string, '[^ ]+', 1, n + 1), string, n + 1
from words
where regexp_substr(string, '[^ ]+', 1, n + 1) is not null
)
select word, count(*)
from words
group by word;
Here is a dbfiddle.
【Comments】:
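The recursive query in this answer groups words exactly as they are stored, so This and this would be counted separately. Since the asker said case does not matter, a possible variation is sketched below. It lowercases the string once in the anchor member and filters out the NULL word that an empty or NULL string would otherwise produce. The frequency alias and the ORDER BY are additions for readability, not part of the original answer.

```sql
WITH words (word, string, n) AS (
  -- anchor member: first word of each lowercased string
  SELECT REGEXP_SUBSTR(LOWER(string), '[^ ]+', 1, 1), LOWER(string), 1
  FROM   t
  UNION ALL
  -- recursive member: the (n + 1)-th word, for as long as one exists
  SELECT REGEXP_SUBSTR(string, '[^ ]+', 1, n + 1), string, n + 1
  FROM   words
  WHERE  REGEXP_SUBSTR(string, '[^ ]+', 1, n + 1) IS NOT NULL
)
SELECT word, COUNT(*) AS frequency
FROM   words
WHERE  word IS NOT NULL   -- skip rows whose string contained no word at all
GROUP BY word
ORDER BY frequency DESC, word;
```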
【Answer 3】: You can do this without (slow) regular expressions by using simple string functions:
WITH word_bounds ( string, start_pos, end_pos ) AS (
SELECT string,
1,
INSTR( string, ' ', 1 )
FROM table_name
UNION ALL
SELECT string,
end_pos + 1,
INSTR( string, ' ', end_pos + 1 )
FROM word_bounds
WHERE end_pos > 0
),
words ( word ) AS (
SELECT CASE end_pos
WHEN 0
THEN SUBSTR( string, start_pos )
ELSE SUBSTR( string, start_pos, end_pos - start_pos )
END
FROM word_bounds
)
SELECT word,
COUNT(*) AS frequency
FROM words
GROUP BY
word
ORDER BY
frequency desc, word;
Which, for the sample data:
CREATE TABLE table_name ( PERIOD, STRING ) AS
SELECT 1, 'this is some text' FROM DUAL UNION ALL
SELECT 2, 'more text' FROM DUAL UNION ALL
SELECT 3, 'this could be some more text' FROM DUAL
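Outputs:
WORD | FREQUENCY
:---- | --------:
text | 3
more | 2
some | 2
this | 2
be | 1
could | 1
is | 1
There is a discussion of the performance of different ways of splitting delimited strings here.
db<>fiddle here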
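The string-function query in this answer compares words case-sensitively, and consecutive or trailing spaces would yield zero-length substrings, which Oracle treats as NULL and which would then be counted as a NULL group. A hedged sketch of one way to cover both cases follows. It reuses the answer's table_name and column names, and the LOWER call and the NULL filter are the only changes.

```sql
WITH word_bounds (string, start_pos, end_pos) AS (
  -- anchor member: lowercase once, then locate the first space
  SELECT LOWER(string), 1, INSTR(string, ' ', 1)
  FROM   table_name
  UNION ALL
  -- recursive member: move past the previous space and find the next one
  SELECT string, end_pos + 1, INSTR(string, ' ', end_pos + 1)
  FROM   word_bounds
  WHERE  end_pos > 0
),
words (word) AS (
  SELECT CASE end_pos
           WHEN 0 THEN SUBSTR(string, start_pos)                 -- last word in the string
           ELSE SUBSTR(string, start_pos, end_pos - start_pos)   -- word before the next space
         END
  FROM   word_bounds
)
SELECT word, COUNT(*) AS frequency
FROM   words
WHERE  word IS NOT NULL   -- empty tokens from repeated or trailing spaces
GROUP BY word
ORDER BY frequency DESC, word;
```

On the sample data, which is already lowercase and single-spaced, this should return the same rows as the original query.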
【Comments】: