基于最小二乘拟合 SIR 模型
Posted
技术标签:
【中文标题】基于最小二乘拟合 SIR 模型【英文标题】:Fitting SIR model based on least squares 【发布时间】:2016-03-29 02:31:39 【问题描述】:我想优化 SIR 模型的拟合。如果我只用 60 个数据点拟合 SIR 模型,我会得到一个“好”的结果。 “好”意味着,拟合的模型曲线接近数据点,直到 t=40。我的问题是,我怎样才能更好地适应,也许基于所有数据点?
ydata = ['1e-06', '1.49920166169172e-06', '2.24595472686361e-06', '3.36377954575331e-06', '5.03793663882291e-06', '7.54533628058909e-06', '1.13006564683911e-05', '1.69249500601052e-05', '2.53483161761933e-05', '3.79636391699325e-05', '5.68567547875179e-05', '8.51509649182741e-05', '0.000127522555808945', '0.000189928392105942', '0.000283447055673738', '0.000423064043409294', '0.000631295993246634', '0.000941024110897193', '0.00140281896645859', '0.00209085569326554', '0.00311449589149717', '0.00463557784224762', '0.00689146863803467', '0.010227347567051', '0.0151380084180746', '0.0223233100045688', '0.0327384810150231', '0.0476330618585758', '0.0685260046667727', '0.0970432959143974', '0.134525888779423', '0.181363340075877', '0.236189247803334', '0.295374180276257', '0.353377036130714', '0.404138746080267', '0.442876028839178', '0.467273954573897', '0.477529937494976', '0.475582401936257', '0.464137179474659', '0.445930281787152', '0.423331710456602', '0.39821360956389', '0.371967226561944', '0.345577884704341', '0.319716449520481', '0.294819942458255', '0.271156813453547', '0.24887641905719', '0.228045466022105', '0.208674420183194', '0.190736203926912', '0.174179448652951', '0.158937806544529', '0.144936441326754', '0.132096533873646', '0.120338367115739', '0.10958340819268', '0.099755679236243', '0.0907826241267504', '0.0825956203546979', '0.0751302384111894', '0.0683263295744258', '0.0621279977639921', '0.0564834809370572', '0.0513449852139111', '0.0466684871328814', '0.042413516167789', '0.0385429293775096', '0.035022685071934', '0.0318216204865132', '0.0289112368382048', '0.0262654939162707', '0.0238606155312519', '0.021674906523588', '0.0196885815912485', '0.0178836058829335', '0.0162435470852779', '0.0147534385851646', '0.0133996531928511', '0.0121697868544064', '0.0110525517526551', '0.0100376781867076', '0.00911582462544914', '0.00827849534575178', '0.00751796508841916', '0.00682721019158058', '0.00619984569061827', '0.00563006790443123', '0.00511260205894446', '0.00464265452957236', '0.00421586931435123', '0.00382828837833139', '0.00347631553734708', '0.00315668357532714', '0.00286642431380459', '0.00260284137520731', '0.00236348540287827', '0.00214613152062159', '0.00194875883295343']
ydata = [float(d) for d in ydata]
xdata = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101']
xdata = [float(t) for t in xdata]
from scipy.optimize import minimize
from scipy import integrate
import numpy as np
import pylab as pl
def fitFunc(sir_values, time, beta, gamma, k):
s = sir_values[0]
i = sir_values[1]
r = sir_values[2]
res = np.zeros((3))
res[0] = - beta * s * i
res[1] = beta * s * i - gamma * i
res[2] = gamma * i
return res
def lsq(model, xdata, ydata, n):
"""least squares"""
time_total = xdata
# original record data
data_record = ydata
# normalize train data
k = 1.0/sum(data_record)
# init t = 0 values + normalized
I0 = data_record[0]*k
S0 = 1 - I0
R0 = 0
N0 = [S0,I0,R0]
# Set initial parameter values
param_init = [0.75, 0.75]
param_init.append(k)
# fitting
param = minimize(sse(model, N0, time_total, k, data_record, n), param_init, method="nelder-mead").x
# get the fitted model
Nt = integrate.odeint(model, N0, time_total, args=tuple(param))
# scale out
Nt = np.divide(Nt, k)
# Get the second column of data corresponding to I
return Nt[:,1]
def sse(model, N0, time_total, k, data_record, n):
"""sum of square errors"""
def result(x):
Nt = integrate.odeint(model, N0, time_total[:n], args=tuple(x))
INt = [row[1] for row in Nt]
INt = np.divide(INt, k)
difference = data_record[:n] - INt
# square the difference
diff = np.dot(difference, difference)
return diff
return result
result = lsq(fitFunc, xdata, ydata, 60)
# Plot data and fit
pl.clf()
pl.plot(xdata, ydata, "o")
pl.plot(xdata, result)
pl.show()
我希望是这样的:
【问题讨论】:
@Reti43 和ydata
是受感染的数据。显示的绿色曲线是拟合模型。我除了,如果我用我的ydata
拟合模型,我会得到一条接近我的ydata
的曲线(拟合模型)。我知道,ydata 具有从 t=40 到可能 t=60 的高误差值(由平方误差之和计算)。另一个灵魂将随机值作为输入值。拟合模型后,我检查了错误,如果错误低于阈值(例如 1.0),我找到了一个好的模型。它的工作方式类似于 RANSAC。
@Reti43 "fitFunc() 有参数 time 和 k" 没错,这个值是必需的,否则我会出错。
@Reti43 因为,数据点是一场流行病的真实流感数据,它们“看起来像”一条正常的感染曲线link。这不是作业。这是一个项目,我们要拟合流行病数据。
没有理由使用k
规范化您的数据。 ydata
似乎在 0 和 1 之间。更重要的是,由于您约束 S0 + I0 + R0
的方式,您已决定总体常数为 1。摆脱 k(将其设置为 1),一切正常。
@Reti43 感谢您的帮助。
【参考方案1】:
我正在将我的评论转换为完整的答案。
问题是由于模型设置不正确造成的。为了简化微分方程,我将dS(t)/dt
和dI(t)/dt
分别称为S
和I
。
# incorrect
S = -S * I * beta
I = S * I * beta - I * gamma
# correct
S = -S * I * beta / N
I = S * I * beta / N - I * gamma
通过错误地设置微分方程,变化率,即从 y(t) 到 y(t+dt) 的变化是错误的。所以,你不仅得到了不正确积分的 I(t),而且你进一步将它除以 N(或 k,正如你所说的那样),使它更加错误。
我们知道这些特定方程的耦合系统要求 S(t) + I(t) + R(t) = N,其中 N 是总体常数。从你声明初始条件的方式,我们推断N为1。注意这也和max(ydata)
一致,小于1。
# IO + SO + R0 is always 1 regardless of "value"
I0 = value
S0 = 1 - I0
R0 = 0
此外,你处理k
的方式真的很可疑。您的数据似乎已经标准化,但您将其乘以 0.1 倍。如您所见,k = 1./sum(ydata)
与总体常数无关。通过执行I0 = ydata[0] * k
并将I(t) 除以k
,您可以有效地缩小数据规模,以便以后扩大规模。无论总体常数是多少,这几乎都将 I(t) 限制在 0-1 范围内。
您可以通过简单地设置所有初始条件和未知参数并查看odeint()
的结果来验证您的模型是否错误。你会注意到 S(0)、I(0) 和 R(0) 可能与你给它们的值不对应,这表明 k
做错了。但是要发现有缺陷的动力学演化,您必须简单地查看您的模型。
一个 hacky 解决方案是设置k = 1.0
。一切正常,因为乘法和除法没有效果,即使你在技术上仍然在做错误的计算。但是,如果您的总体常数应该与 1 不同,那么一切都会中断。所以,为了彻底,
手动将 k
设置为总体常数,除非您还尝试适应 S0、I0 和/或 R0,否则您应该知道这一点。
在模型中为S
和I
写出正确的变化率。
摆脱您拥有的任何np.divide(array, k)
计算,并且
从 fitFunc()
参数中删除 k
并且不要将其附加到 param_init
列表中。虽然最后一个操作是可选的并且不会影响结果,但它在技术上仍然是正确的。这是因为通过传递k
,优化求解器会尝试为其找到最佳值,即使您最终没有在任何地方使用它来影响您的计算。
用curve_fit()解决同样的问题
如果要进行最小二乘拟合,可以使用curve_fit(),它在内部调用最小二乘法。您仍然需要为拟合创建一个包装函数,该函数必须针对各种 beta 和 gamma 值对系统进行数值积分,但您不必手动进行任何 SSE 计算。
curve_fit()
还将返回协方差矩阵,您可以使用它来估计您的拟合变量的confidence intervals。有关从协方差矩阵计算置信区间的进一步相关讨论,请参见here。
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate, optimize
ydata = ['1e-06', '1.49920166169172e-06', '2.24595472686361e-06', '3.36377954575331e-06', '5.03793663882291e-06', '7.54533628058909e-06', '1.13006564683911e-05', '1.69249500601052e-05', '2.53483161761933e-05', '3.79636391699325e-05', '5.68567547875179e-05', '8.51509649182741e-05', '0.000127522555808945', '0.000189928392105942', '0.000283447055673738', '0.000423064043409294', '0.000631295993246634', '0.000941024110897193', '0.00140281896645859', '0.00209085569326554', '0.00311449589149717', '0.00463557784224762', '0.00689146863803467', '0.010227347567051', '0.0151380084180746', '0.0223233100045688', '0.0327384810150231', '0.0476330618585758', '0.0685260046667727', '0.0970432959143974', '0.134525888779423', '0.181363340075877', '0.236189247803334', '0.295374180276257', '0.353377036130714', '0.404138746080267', '0.442876028839178', '0.467273954573897', '0.477529937494976', '0.475582401936257', '0.464137179474659', '0.445930281787152', '0.423331710456602', '0.39821360956389', '0.371967226561944', '0.345577884704341', '0.319716449520481', '0.294819942458255', '0.271156813453547', '0.24887641905719', '0.228045466022105', '0.208674420183194', '0.190736203926912', '0.174179448652951', '0.158937806544529', '0.144936441326754', '0.132096533873646', '0.120338367115739', '0.10958340819268', '0.099755679236243', '0.0907826241267504', '0.0825956203546979', '0.0751302384111894', '0.0683263295744258', '0.0621279977639921', '0.0564834809370572', '0.0513449852139111', '0.0466684871328814', '0.042413516167789', '0.0385429293775096', '0.035022685071934', '0.0318216204865132', '0.0289112368382048', '0.0262654939162707', '0.0238606155312519', '0.021674906523588', '0.0196885815912485', '0.0178836058829335', '0.0162435470852779', '0.0147534385851646', '0.0133996531928511', '0.0121697868544064', '0.0110525517526551', '0.0100376781867076', '0.00911582462544914', '0.00827849534575178', '0.00751796508841916', '0.00682721019158058', '0.00619984569061827', '0.00563006790443123', '0.00511260205894446', '0.00464265452957236', '0.00421586931435123', '0.00382828837833139', '0.00347631553734708', '0.00315668357532714', '0.00286642431380459', '0.00260284137520731', '0.00236348540287827', '0.00214613152062159', '0.00194875883295343']
xdata = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101']
ydata = np.array(ydata, dtype=float)
xdata = np.array(xdata, dtype=float)
def sir_model(y, x, beta, gamma):
S = -beta * y[0] * y[1] / N
R = gamma * y[1]
I = -(S + R)
return S, I, R
def fit_odeint(x, beta, gamma):
return integrate.odeint(sir_model, (S0, I0, R0), x, args=(beta, gamma))[:,1]
N = 1.0
I0 = ydata[0]
S0 = N - I0
R0 = 0.0
popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata)
fitted = fit_odeint(xdata, *popt)
plt.plot(xdata, ydata, 'o')
plt.plot(xdata, fitted)
plt.show()
您可能会注意到一些运行时警告,但它们主要是由于对最小化求解器 (Levenburg-Marquardt) 的初始搜索,该求解器尝试了 beta
和 gamma
的一些值,这些值会在积分期间导致数值溢出。但是,它应该尽快稳定到更合理的值。如果您为 minimize()
尝试不同的求解器,您会注意到类似的警告。
【讨论】:
@Reti43 你知道处理多种流行病的算法吗? @Sam 恐怕我没有。流行病学不是我的专长,我只是碰巧知道几个 SIR 模型的基础知识。我建议在这里进行文献搜索,可能从 Scholar Google 开始,看看是否存在任何此类模型。以上是关于基于最小二乘拟合 SIR 模型的主要内容,如果未能解决你的问题,请参考以下文章
MATLAB点云处理(十八):直线拟合(最小二乘 | RANSAC)
LSSVM回归预测基于matlab灰狼算法优化最小支持向量机GWO-LSSVM数据预测含Matlab源码 2259期