Counting word frequencies in a file (bash4, awk)

Posted 2021-02-25


# A "word" is defined as a space-delimited token, i.e. "one" and "one." are different
# words. The awk expression acts as the tokenizer: it prints one token per line, and
# those lines are used to build an associative array. Blank lines are counted under the
# array entry a[Blank Line] (a key that cannot result from tokenizing, since it
# contains a space).
unset a; declare -A a;
while read -r; do
    ! [[ $REPLY ]] && REPLY="Blank Line"
    ((a[\$REPLY]++))
done < <(awk -v OFS='\n' '{$1=$1} 1' ./file)
# print the results, sorted by frequency
for word in "${!a[@]}"; do
    printf '%d\t%s\n' "${a[$word]}" "$word"
done | sort -n
 
# NOTE: The "\$REPLY" (i.e. the backslash) is needed in bash4, because otherwise
# a "[" or "]" in a token would break the expansion/evaluation of the subscript.
# NOTE ALSO: This is mainly a proof of concept. It is very slow and would be better
# implemented completely in e.g. awk!
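
As the note above suggests, the counting can be moved entirely into awk. The following is a minimal sketch of that idea, not the original author's code. The function name `count_words` is a hypothetical helper introduced here, and the sketch assumes that whitespace-only lines should be counted as "Blank Line", as in the bash version.

```shell
# Hypothetical helper (not part of the original snippet): count
# space-delimited tokens of the file given as $1, entirely in awk.
count_words() {
    awk '
        # Blank or whitespace-only line: count it under a key that
        # tokenizing can never produce, since it contains a space.
        NF == 0 { count["Blank Line"]++; next }
        # Default field splitting acts as the tokenizer.
        { for (i = 1; i <= NF; i++) count[$i]++ }
        # Print "frequency<TAB>word" for every key that was seen.
        END { for (w in count) printf "%d\t%s\n", count[w], w }
    ' "$1" | sort -n   # sort numerically by frequency, like the bash version
}
```

Usage: `count_words ./file`. Since awk array keys are plain strings, tokens containing "[" or "]" need no special escaping here.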
