How to plot a word frequency ranking in ggplot when I only have one variable?

Posted 2023-02-14


Asked: 2021-11-15 00:13:26

Question:

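I am trying to plot my word frequency ranking from quanteda with ggplot. Passing the "frequency" variable to base plot() works, but I would like a nicer-looking chart.

ggplot needs two variables for aes(). I tried seq_along(), as suggested in a somewhat similar thread, but the chart draws nothing.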

ggplot(word_list, aes(x = seq_along(freqs), y = freqs, group = 1)) + 
        geom_line() +
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

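Any input is appreciated!

library(quanteda)
library(ggplot2)

# X is my data frame with TEXT and id columns (not shown here)
symptoms_corpus <- corpus(X$TEXT, docnames = X$id )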

summary(symptoms_corpus)

# print text of any element of the corpus by index
cat(as.character(symptoms_corpus[6500]))

# Create Document Feature Matrix (tokenize first; recent quanteda versions expect a tokens object)
Symptoms_DFM <- dfm(tokens(symptoms_corpus))
Symptoms_DFM

# sum columns for word counts
freqs <- colSums(Symptoms_DFM)
# get vocabulary vector
words <- colnames(Symptoms_DFM)
# combine words and their frequencies in a data frame
word_list <- data.frame(words, freqs)
# re-order the wordlist by decreasing frequency
word_indexes <- order(word_list[, "freqs"], decreasing = TRUE)

word_list <- word_list[word_indexes, ]
# show the most frequent words
head(word_list, 25)

#plot
ggplot(word_list, aes(x = seq_along(freqs), y = freqs, group = 1)) + 
        geom_line() +
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

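By a nicer chart I mean this: the base plot() call below works and shows the rank distribution, but it takes only one variable. ggplot needs two, and that is where my problem arises. The ggplot code draws the chart but shows no data.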

plot(word_list$freqs , type = "l", lwd=2, main = "Rank frequency Plot", xlab="Rank", ylab ="Frequency")

Sample dataset below:

first_column <- c("the", "patient", "arm", "rash", "tingling", "was", "in", "not")
# frequencies stored as numbers, not strings, so they map to a continuous axis
second_column <- c(4116407, 3599537, 2582586, 1323883, 1220894, 1012042, 925339, 822150)

word_list2 <- data.frame(first_column, second_column)
colnames(word_list2) <- c("word", "freqs")

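Comments:

Can you provide a reproducible example? :) Maybe you need a bar chart rather than a line chart. Are you looking for the chart in this tutorial? tidytextmining.com/tidytext.html

I have tried that, but it crashes my computer. I think my dataset is too large, at 6 million tokens. I will open another question with that code, since it has been bothering me for a week. Thanks.

Answer 1:

Here is a cleaner, reproducible demonstration of the plot using a built-in corpus.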

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

symptoms_corpus <- data_corpus_inaugural
# tokenize, then build the document-feature matrix (nested call, so no pipe operator is needed)
Symptoms_DFM <- dfm(tokens(symptoms_corpus))

It is better to use quanteda.textstats::textstat_frequency() here:

# create frequency table
library("quanteda.textstats")
word_list <- textstat_frequency(Symptoms_DFM)
head(word_list, 25)
##    feature frequency rank docfreq group
## 1      the     10183    1      59   all
## 2       of      7180    2      59   all
## 3        ,      7173    3      59   all
## 4      and      5406    4      59   all
## 5        .      5155    5      59   all
## 6       to      4591    6      59   all
## 7       in      2827    7      59   all
## 8        a      2292    8      58   all
## 9      our      2224    9      58   all
## 10      we      1827   10      58   all
## 11    that      1813   11      59   all
## 12      be      1502   12      59   all
## 13      is      1491   13      58   all
## 14      it      1398   14      59   all
## 15     for      1230   15      59   all
## 16      by      1091   16      59   all
## 17    have      1031   17      59   all
## 18   which      1007   18      57   all
## 19     not       980   19      58   all
## 20    with       970   20      58   all
## 21      as       966   21      58   all
## 22    will       944   22      57   all
## 23    this       874   23      59   all
## 24       i       871   24      58   all
## 25     all       836   25      59   all

Then plot it:

# Zipf's law plot
library("ggplot2")
ggplot(word_list, aes(x = seq_len(nrow(word_list)), y = frequency, group = 1)) +
  geom_line() +
  coord_trans(y = "log10", x = "log10") +
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")
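The frequency table shown above already carries a `rank` column, so it can be mapped to x directly instead of rebuilding the index inside aes(). The sketch below is an alternative, not the answerer's code: it uses a toy data frame built from the first five rows printed above (so it runs without quanteda), and it swaps coord_trans() for scale_x_log10() and scale_y_log10(), which transform the data before plotting rather than the coordinate system afterwards.

```r
library(ggplot2)

# Toy stand-in for textstat_frequency() output (assumption: same column names),
# filled with the first five rows printed above
word_list <- data.frame(
  feature   = c("the", "of", ",", "and", "."),
  frequency = c(10183, 7180, 7173, 5406, 5155),
  rank      = 1:5
)

# Map the existing rank column to x and put both axes on a log10 scale
p <- ggplot(word_list, aes(x = rank, y = frequency)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

print(p)
```

For a dense rank-frequency line, either approach should give a very similar picture. The scale version also labels the axis breaks in the original units.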

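Answer 2:

I'm not sure what you mean by "a nicer chart". Could you be more specific? The code you provided does not reproduce the example, because we don't have your dataset.

Perhaps you can simply add the row number as the x value, as shown below. This produces an ordered plot.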

library(ggplot2)

word_list <- data.frame(freq = c(10, 12, 18, 19))

ggplot(word_list, aes(x = 1:nrow(word_list), y = freq, group = 1)) + 
  geom_line() +
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")
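A variation on the same idea, offered as a sketch: sort the data frame first and store the rank as a real column, so that aes() refers only to columns of the data. The words and counts are taken from the question's sample data; the column name `rank` is my own choice for illustration.

```r
library(ggplot2)

# Small unsorted example using values from the question's sample data
word_list <- data.frame(
  words = c("rash", "the", "arm", "patient"),
  freqs = c(1323883, 4116407, 2582586, 3599537)
)

# Sort by decreasing frequency, then store the rank explicitly
word_list <- word_list[order(word_list$freqs, decreasing = TRUE), ]
word_list$rank <- seq_len(nrow(word_list))

p <- ggplot(word_list, aes(x = rank, y = freqs)) +
  geom_line() +
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

print(p)
```

Keeping the rank in the data frame also makes it easy to inspect the values or filter to the top N words before plotting.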

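Answer 3:

I needed logarithmic scaling. The dataset is large, so the line did not show up on linear axes. The example above from @TrineCosmusNobel pointed this out. Thanks. Updated code below: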

ggplot(word_list, aes(x = 1:nrow(word_list), y = freqs, group = 1)) + 
        geom_line() +
        coord_trans(y ='log10', x='log10') +
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")
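To illustrate why a linear-scale plot of such data can look empty, here is a small base R sketch. It assumes synthetic Zipf-like counts (frequency proportional to 1/rank), which only approximates a real corpus. Almost every word ends up with a frequency far below the maximum, so on linear axes the line hugs the two axes.

```r
# Synthetic Zipf-like frequencies (an assumption for illustration, not real data)
n_words <- 100000
freqs <- round(1e6 / seq_len(n_words))

# Share of words whose frequency is below 1% of the largest frequency
share_tiny <- mean(freqs < max(freqs) / 100)
print(share_tiny)

# On a log10 scale the same values spread over a few orders of magnitude
log_range <- range(log10(freqs))
print(log_range)
```

With these synthetic counts, nearly all points sit in a thin band along the x axis on a linear scale, which is consistent with the "nothing shows up" symptom described in the question.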
