Linux Learning Series -- Word Frequency Count

Posted 2022-08-02 躬匠

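1. Problem

Write a bash script that counts how often each word occurs in a text file, words.txt.

For simplicity, you may assume that no line has extra leading or trailing whitespace.

Example:

Suppose words.txt has the following content: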

the day is sunny the the
the sunny is is

Your script should output the following (sorted by frequency in descending order):

the 4
is 3
sunny 2
day 1

Note: use pipes wherever possible.
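To try the solutions that follow, the sample file from the example has to exist first. A minimal sketch, assuming it is fine to create words.txt in the current directory:

```shell
# Write the two sample lines from the example to words.txt
# (assumption: the current directory is writable and may hold the file).
printf '%s\n' 'the day is sunny the the' 'the sunny is is' > words.txt

# Show the file to confirm its content.
cat words.txt
```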

Solution 1:

cat words.txt | awk '{ for (i = 1; i <= NF; i++) count[$i]++ } END { for (k in count) print k" "count[k] }' | sort -k2,2nr

Solution 2:

cat words.txt | grep -Po '[a-z]+' | sort | uniq -c | sort -rn | awk '{print $2, $1}'

Note: grep -Eo can be used in place of grep -Po above.
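The grep step is what turns lines into words: -o prints each match on its own line, so the pattern [a-z]+ yields one lowercase word per line. The sketch below isolates that step; split_words is a hypothetical helper name, and -E is used here because -P is a GNU grep extension.

```shell
# Hypothetical helper: print every run of lowercase letters
# read from stdin on a separate line.
split_words() {
    grep -Eo '[a-z]+'
}

# Two input lines become five output lines, one word each.
printf 'the day is\nsunny the\n' | split_words
```

Once every word sits on its own line, sort | uniq -c can group and count identical lines.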

Solution 3:

cat words.txt | xargs -n1 | awk '{ ++word[$0] } END { for (i in word) print i, word[i] }' | sort -nrk 2

Solution 4:

cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn | awk '{print $2" "$1}'

Here, tr -s ' ' '\n' converts each run of spaces into a single newline, so every word ends up on its own line.
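The effect of the -s (squeeze) option is easiest to see on input with repeated spaces. A small sketch, with one_word_per_line as a hypothetical helper name:

```shell
# Hypothetical helper: translate spaces to newlines and squeeze
# repeated newlines, so runs of spaces do not leave empty lines.
one_word_per_line() {
    tr -s ' ' '\n'
}

# Three spaces and two spaces each collapse into one line break.
printf 'the   day  is\n' | one_word_per_line
```

Without -s, each extra space would produce an empty line, which uniq -c would then count as a word of its own.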

Solution 5:

sed -e 's/ /\n/g' words.txt | sort | uniq -c | sort -rn | awk '{print $2" "$1}'

Note: sed -e 's/ /\n/g' replaces every space with a newline, so that each word is printed on its own line.
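Since the task asks for a script, the sed-based pipeline can be wrapped in a function that takes the file name as an argument. This is only a sketch: word_freq is a hypothetical name, the grep -v '^$' step is an addition that drops the empty lines sed leaves behind when words are separated by several spaces, and it assumes GNU sed, where \n in the replacement means a newline.

```shell
# Hypothetical wrapper around the sed-based pipeline.
# $1: path of the text file to analyse.
word_freq() {
    sed -e 's/ /\n/g' "$1" |      # one word per line (assumes GNU sed)
        grep -v '^$' |            # added step: drop empty lines
        sort | uniq -c |          # count identical words
        sort -rn |                # highest count first
        awk '{print $2" "$1}'     # reorder to "word count"
}
```

Calling word_freq words.txt on the sample file should then print the four lines shown in the example. The order of words that share the same count is not specified by this sketch.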
