Is Weibull the right distribution for this data? How do I find the best parameters using R?
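【Posted】: 2022-01-08 23:38:47
【Question】: I have time-to-occurrence data for a large number (thousands) of event groups. I need to cluster the event groups that show similar distributions and find the parameters for each cluster. Each event group has 5-15 data points. I took a random sample of 50 event groups and plotted frequency against time for each of them.
To me the distribution looks like a Weibull, so I am now trying to find its parameters, but I have not been able to get stable estimates. I used the nls2 package together with nls() to look for stable parameters for a single event group: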
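library(nls2)  # provides nls2() for the brute-force grid search

# single_event holds the rows for one event group
dat <- data.frame(x = single_event$time, y = single_event$freq_density)
# Grid of candidate starting values: a = Weibull shape, b = Weibull scale
pars <- expand.grid(a = seq(0.01, 10, len = 20),
                    b = seq(1, 50, len = 20))
# Least-squares fit of the Weibull density; the grid search seeds nls()
res <- nls2(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)), data = dat,
            start = pars, algorithm = 'brute-force')
res1 <- nls(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)), data = dat,
            start = as.list(coef(res)))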
However, I cannot get meaningful output. For most event groups I get this error:
Error in nls(y ~ ((a/b) * ((x/b)^(a - 1)) * exp(-(x/b)^a)), data = dat, : singular gradient
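Now I am wondering whether I picked the right distribution.
How do I identify the right distribution for this data, and how do I find its parameters?
Here is some sample data: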
event_group <- c('group_A', 'group_B', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_B', 'group_D', 'group_E', 'group_B', 'group_E', 'group_B', 'group_D', 'group_E', 'group_E')
freq_density <- c(0.005747126, 0.015151515, 0.057471264, 0.089552239, 0.015151515, 0.104477612, 0.033057851, 0.103448276, 0.28358209, 0.106060606, 0.044776119, 0.140495868, 0.25862069, 0.298507463, 0.181818182, 0.164179104, 0.090909091, 0.206896552, 0.164179104, 0.212121212, 0.268656716, 0.347107438, 0.247126437, 0.059701493, 0.151515152, 0.179104478, 0.190082645, 0.114942529, 0.074626866, 0.121212121, 0.074626866, 0.05785124, 0.005747126, 0.029850746, 0.075757576, 0.119402985, 0.033057851, 0.045454545, 0.029850746, 0.033057851, 0.060606061, 0.049586777, 0.015151515, 0.014925373, 0.008264463, 0.016528926)
time_min <- c(10, 30, 40, 45, 45, 45, 55, 55, 60, 60, 60, 70, 70, 75, 75, 75, 85, 85, 90, 90, 90, 100, 100, 105, 105, 105, 115, 115, 120, 120, 120, 130, 130, 135, 135, 135, 145, 150, 150, 160, 165, 175, 180, 195, 235, 250)
sample_data <- data.frame(event_group, time_min, freq_density, stringsAsFactors=FALSE)
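One possible reading of the singular gradient error, offered as an assumption rather than a diagnosis: the starting grid for b stops at 50 while the sample times run from 10 to 250 minutes, and the freq_density values appear to sum to 1 within a group (checked for group_A and group_B), so they look like per-bin proportions rather than densities per minute. The sketch below sidesteps nls() and instead maximizes a frequency-weighted Weibull log-likelihood with base R's optim(). The weighting scheme and the helper name fit_weighted_weibull are illustrative choices, not something taken from the original post.

```r
# Rows of group_A copied from the sample data in the question
time_min     <- c(10, 40, 55, 70, 85, 100, 115, 130)
freq_density <- c(0.005747126, 0.057471264, 0.103448276, 0.25862069,
                  0.206896552, 0.247126437, 0.114942529, 0.005747126)

# Hypothetical helper: treats each frequency as a weight on its time point
fit_weighted_weibull <- function(t, w) {
  w <- w / sum(w)  # normalise the weights so they sum to 1
  # Negative weighted log-likelihood. Parameters are optimised on the log
  # scale so that shape and scale stay positive.
  nll <- function(p) {
    -sum(w * dweibull(t, shape = exp(p[1]), scale = exp(p[2]), log = TRUE))
  }
  # Starting point: shape 2, scale equal to the weighted mean time
  start <- c(log(2), log(sum(w * t)))
  opt <- optim(start, nll)  # Nelder-Mead is the default method
  c(shape = exp(opt$par[1]), scale = exp(opt$par[2]),
    convergence = opt$convergence)
}

fit_A <- fit_weighted_weibull(time_min, freq_density)
print(fit_A)
```

A convergence value of 0 means optim() reported success. The same helper could be applied to every group with split() and lapply(), which would give one shape and one scale per group as input for clustering.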
【Comments】:
【Answer 1】: fitdistrplus::fitdist() can be used to estimate the parameters:
fitdistrplus::fitdist(sample_data$freq_density, distr = "gamma")
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.25139 0.2341895
#> rate 11.51292 2.6352952
fitdistrplus::fitdist(sample_data$freq_density, distr = "weibull")
#> Fitting of the distribution ' weibull ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.1657556 0.13768844
#> scale 0.1145993 0.01526602
# Use a Cullen and Frey graph to choose the 'best' fitting distribution
fitdistrplus::descdist(sample_data$freq_density)
#> summary statistics
#> ------
#> min: 0.005747126 max: 0.3471074
#> median: 0.08265491
#> mean: 0.1086957
#> estimated sd: 0.09034791
#> estimated skewness: 0.9060949
#> estimated kurtosis: 2.942441
Created on 2021-12-02 by the reprex package (v2.0.1)
According to the Cullen and Frey graph, a gamma distribution appears to be a reasonable choice for the given data.
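The Cullen and Frey graph is a visual guide, so a numeric comparison can be a useful complement. As a minimal sketch, assuming the same pooled freq_density values as above, the two candidate fits can be compared by AIC (lower is better) and by the goodness-of-fit statistics that fitdistrplus::gofstat() reports. The object names fit_gamma, fit_weibull and aic are illustrative.

```r
library(fitdistrplus)

# Pooled freq_density values copied from the sample data in the question
freq_density <- c(0.005747126, 0.015151515, 0.057471264, 0.089552239,
                  0.015151515, 0.104477612, 0.033057851, 0.103448276,
                  0.28358209, 0.106060606, 0.044776119, 0.140495868,
                  0.25862069, 0.298507463, 0.181818182, 0.164179104,
                  0.090909091, 0.206896552, 0.164179104, 0.212121212,
                  0.268656716, 0.347107438, 0.247126437, 0.059701493,
                  0.151515152, 0.179104478, 0.190082645, 0.114942529,
                  0.074626866, 0.121212121, 0.074626866, 0.05785124,
                  0.005747126, 0.029850746, 0.075757576, 0.119402985,
                  0.033057851, 0.045454545, 0.029850746, 0.033057851,
                  0.060606061, 0.049586777, 0.015151515, 0.014925373,
                  0.008264463, 0.016528926)

# Maximum-likelihood fits of both candidate distributions
fit_gamma   <- fitdist(freq_density, distr = "gamma")
fit_weibull <- fitdist(freq_density, distr = "weibull")

# AIC of each fit; the distribution with the lower value is preferred
aic <- c(gamma = fit_gamma$aic, weibull = fit_weibull$aic)
print(aic)

# Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics
gofstat(list(fit_gamma, fit_weibull), fitnames = c("gamma", "weibull"))
```

With only 46 pooled values, small differences between the two criteria should not be over-interpreted.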
If you want to apply fitdistrplus::fitdist() to several groups, you can use purrr::map(), for example:
library(dplyr)
sample_data %>%
split(.$event_group) %>%
purrr::map(~fitdistrplus::fitdist(.$freq_density, distr = "gamma"))
#> $group_A
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 0.8847797 0.3852533
#> rate 7.0784485 4.0716225
#>
#> $group_B
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.465481 0.5678731
#> rate 16.121401 7.4261676
#>
#> $group_C
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.906359 0.9434099
#> rate 13.344416 7.5468387
#>
#> $group_D
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.71704 0.7441117
#> rate 15.45395 7.7658146
#>
#> $group_E
#> Fitting of the distribution ' gamma ' by maximum likelihood
#> Parameters:
#> estimate Std. Error
#> shape 1.104798 0.4184115
#> rate 12.152399 5.7735560
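Since the question also asks how to cluster groups with similar distributions, here is a hedged sketch of one way to go from per-group estimates to clusters. It copies the shape and rate estimates printed above into a data frame and runs hierarchical clustering with base R. The choice of average linkage and of k = 2 clusters is arbitrary and only for illustration.

```r
# Gamma estimates copied from the per-group output above
params <- data.frame(
  event_group = c("group_A", "group_B", "group_C", "group_D", "group_E"),
  shape = c(0.8847797, 1.465481, 1.906359, 1.71704, 1.104798),
  rate  = c(7.0784485, 16.121401, 13.344416, 15.45395, 12.152399)
)

# Standardise both columns so shape and rate contribute comparably
scaled <- scale(params[, c("shape", "rate")])
rownames(scaled) <- params$event_group

# Hierarchical clustering on Euclidean distances between groups
tree <- hclust(dist(scaled), method = "average")

# Cut the tree into k clusters (k = 2 is an illustrative choice)
clusters <- cutree(tree, k = 2)
print(clusters)
```

When the fits are stored in a list, the estimates can be pulled out with something like sapply(fits, function(f) f$estimate) instead of being typed in by hand. Note that the standard errors in the output above are large relative to the estimates, so clusters built on 5-15 points per group are likely to be noisy.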
【Discussion】: