基于最小二乘拟合 SIR 模型

Posted

技术标签:

【中文标题】基于最小二乘拟合 SIR 模型【英文标题】:Fitting SIR model based on least squares 【发布时间】:2016-03-29 02:31:39 【问题描述】:

我想优化 SIR 模型的拟合。如果我只用 60 个数据点拟合 SIR 模型,我会得到一个“好”的结果。 “好”意味着,拟合的模型曲线接近数据点,直到 t=40。我的问题是,我怎样才能更好地适应,也许基于所有数据点?

ydata = ['1e-06', '1.49920166169172e-06', '2.24595472686361e-06', '3.36377954575331e-06', '5.03793663882291e-06', '7.54533628058909e-06', '1.13006564683911e-05', '1.69249500601052e-05', '2.53483161761933e-05', '3.79636391699325e-05', '5.68567547875179e-05', '8.51509649182741e-05', '0.000127522555808945', '0.000189928392105942', '0.000283447055673738', '0.000423064043409294', '0.000631295993246634', '0.000941024110897193', '0.00140281896645859', '0.00209085569326554', '0.00311449589149717', '0.00463557784224762', '0.00689146863803467', '0.010227347567051', '0.0151380084180746', '0.0223233100045688', '0.0327384810150231', '0.0476330618585758', '0.0685260046667727', '0.0970432959143974', '0.134525888779423', '0.181363340075877', '0.236189247803334', '0.295374180276257', '0.353377036130714', '0.404138746080267', '0.442876028839178', '0.467273954573897', '0.477529937494976', '0.475582401936257', '0.464137179474659', '0.445930281787152', '0.423331710456602', '0.39821360956389', '0.371967226561944', '0.345577884704341', '0.319716449520481', '0.294819942458255', '0.271156813453547', '0.24887641905719', '0.228045466022105', '0.208674420183194', '0.190736203926912', '0.174179448652951', '0.158937806544529', '0.144936441326754', '0.132096533873646', '0.120338367115739', '0.10958340819268', '0.099755679236243', '0.0907826241267504', '0.0825956203546979', '0.0751302384111894', '0.0683263295744258', '0.0621279977639921', '0.0564834809370572', '0.0513449852139111', '0.0466684871328814', '0.042413516167789', '0.0385429293775096', '0.035022685071934', '0.0318216204865132', '0.0289112368382048', '0.0262654939162707', '0.0238606155312519', '0.021674906523588', '0.0196885815912485', '0.0178836058829335', '0.0162435470852779', '0.0147534385851646', '0.0133996531928511', '0.0121697868544064', '0.0110525517526551', '0.0100376781867076', '0.00911582462544914', '0.00827849534575178', '0.00751796508841916', '0.00682721019158058', '0.00619984569061827', '0.00563006790443123', '0.00511260205894446', '0.00464265452957236', '0.00421586931435123', '0.00382828837833139', '0.00347631553734708', '0.00315668357532714', '0.00286642431380459', '0.00260284137520731', '0.00236348540287827', '0.00214613152062159', '0.00194875883295343']
ydata = [float(d) for d in ydata]
xdata = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101']
xdata = [float(t) for t in xdata]

from scipy.optimize import minimize
from scipy import integrate
import numpy as np
import pylab as pl

def fitFunc(sir_values, time, beta, gamma, k):
    s = sir_values[0]
    i = sir_values[1]
    r = sir_values[2]

    res = np.zeros((3))
    res[0] = - beta * s * i
    res[1] = beta * s * i - gamma * i
    res[2] = gamma * i
    return res

def lsq(model, xdata, ydata, n):
    """least squares"""
    time_total = xdata
    # original record data
    data_record = ydata
    # normalize train data
    k = 1.0/sum(data_record)
    # init t = 0 values + normalized
    I0 = data_record[0]*k
    S0 = 1 - I0
    R0 = 0 
    N0 = [S0,I0,R0]
    # Set initial parameter values
    param_init = [0.75, 0.75]
    param_init.append(k)
    # fitting
    param = minimize(sse(model, N0, time_total, k, data_record, n), param_init, method="nelder-mead").x
    # get the fitted model
    Nt = integrate.odeint(model, N0, time_total, args=tuple(param))
    # scale out
    Nt = np.divide(Nt, k)
    # Get the second column of data corresponding to I
    return Nt[:,1]

def sse(model, N0, time_total, k, data_record, n):
    """sum of square errors"""
    def result(x):
        Nt = integrate.odeint(model, N0, time_total[:n], args=tuple(x))
        INt = [row[1] for row in Nt]
        INt = np.divide(INt, k)
        difference = data_record[:n] - INt
        # square the difference
        diff = np.dot(difference, difference)
        return diff
    return result

result = lsq(fitFunc, xdata, ydata, 60)

# Plot data and fit
pl.clf()
pl.plot(xdata, ydata, "o")
pl.plot(xdata, result)
pl.show()

我希望是这样的:

【问题讨论】:

@Reti43 和 ydata 是受感染的数据。显示的绿色曲线是拟合模型。我除了,如果我用我的ydata 拟合模型,我会得到一条接近我的ydata 的曲线(拟合模型)。我知道,ydata 具有从 t=40 到可能 t=60 的高误差值(由平方误差之和计算)。另一个灵魂将随机值作为输入值。拟合模型后,我检查了错误,如果错误低于阈值(例如 1.0),我找到了一个好的模型。它的工作方式类似于 RANSAC。 @Reti43 "fitFunc() 有参数 time 和 k" 没错,这个值是必需的,否则我会出错。 @Reti43 因为,数据点是一场流行病的真实流感数据,它们“看起来像”一条正常的感染曲线link。这不是作业。这是一个项目,我们要拟合流行病数据。 没有理由使用k 规范化您的数据。 ydata 似乎在 0 和 1 之间。更重要的是,由于您约束 S0 + I0 + R0 的方式,您已决定总体常数为 1。摆脱 k(将其设置为 1),一切正常。 @Reti43 感谢您的帮助。 【参考方案1】:

我正在将我的评论转换为完整的答案。

问题是由于模型设置不正确造成的。为了简化微分方程,我将dS(t)/dtdI(t)/dt分别称为SI

# incorrect
S = -S * I * beta
I = S * I * beta - I * gamma

# correct
S = -S * I * beta / N
I = S * I * beta / N - I * gamma

通过错误地设置微分方程,变化率,即从 y(t) 到 y(t+dt) 的变化是错误的。所以,你不仅得到了不正确积分的 I(t),而且你进一步将它除以 N(或 k,正如你所说的那样),使它更加错误。

我们知道这些特定方程的耦合系统要求 S(t) + I(t) + R(t) = N,其中 N 是总体常数。从你声明初始条件的方式,我们推断N为1。注意这也和max(ydata)一致,小于1。

# IO + SO + R0 is always 1 regardless of "value"
I0 = value
S0 = 1 - I0
R0 = 0

此外,你处理k 的方式真的很可疑。您的数据似乎已经标准化,但您将其乘以 0.1 倍。如您所见,k = 1./sum(ydata) 与总体常数无关。通过执行I0 = ydata[0] * k 并将I(t) 除以k,您可以有效地缩小数据规模,以便以后扩大规模。无论总体常数是多少,这几乎都将 I(t) 限制在 0-1 范围内。

您可以通过简单地设置所有初始条件和未知参数并查看odeint() 的结果来验证您的模型是否错误。你会注意到 S(0)、I(0) 和 R(0) 可能与你给它们的值不对应,这表明 k 做错了。但是要发现有缺陷的动力学演化,您必须简单地查看您的模型。

一个 hacky 解决方案是设置k = 1.0。一切正常,因为乘法和除法没有效果,即使你在技术上仍然在做错误的计算。但是,如果您的总体常数应该与 1 不同,那么一切都会中断。所以,为了彻底,

手动将 k 设置为总体常数,除非您还尝试适应 S0、I0 和/或 R0,否则您应该知道这一点。

在模型中为SI 写出正确的变化率。

摆脱您拥有的任何np.divide(array, k) 计算,并且

fitFunc() 参数中删除 k 并且不要将其附加到 param_init 列表中。虽然最后一个操作是可选的并且不会影响结果,但它在技术上仍然是正确的。这是因为通过传递k,优化求解器会尝试为其找到最佳值,即使您最终没有在任何地方使用它来影响您的计算。

用curve_fit()解决同样的问题

如果要进行最小二乘拟合,可以使用curve_fit(),它在内部调用最小二乘法。您仍然需要为拟合创建一个包装函数,该函数必须针对各种 beta 和 gamma 值对系统进行数值积分,但您不必手动进行任何 SSE 计算。

curve_fit() 还将返回协方差矩阵,您可以使用它来估计您的拟合变量的confidence intervals。有关从协方差矩阵计算置信区间的进一步相关讨论,请参见here。

import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate, optimize

ydata = ['1e-06', '1.49920166169172e-06', '2.24595472686361e-06', '3.36377954575331e-06', '5.03793663882291e-06', '7.54533628058909e-06', '1.13006564683911e-05', '1.69249500601052e-05', '2.53483161761933e-05', '3.79636391699325e-05', '5.68567547875179e-05', '8.51509649182741e-05', '0.000127522555808945', '0.000189928392105942', '0.000283447055673738', '0.000423064043409294', '0.000631295993246634', '0.000941024110897193', '0.00140281896645859', '0.00209085569326554', '0.00311449589149717', '0.00463557784224762', '0.00689146863803467', '0.010227347567051', '0.0151380084180746', '0.0223233100045688', '0.0327384810150231', '0.0476330618585758', '0.0685260046667727', '0.0970432959143974', '0.134525888779423', '0.181363340075877', '0.236189247803334', '0.295374180276257', '0.353377036130714', '0.404138746080267', '0.442876028839178', '0.467273954573897', '0.477529937494976', '0.475582401936257', '0.464137179474659', '0.445930281787152', '0.423331710456602', '0.39821360956389', '0.371967226561944', '0.345577884704341', '0.319716449520481', '0.294819942458255', '0.271156813453547', '0.24887641905719', '0.228045466022105', '0.208674420183194', '0.190736203926912', '0.174179448652951', '0.158937806544529', '0.144936441326754', '0.132096533873646', '0.120338367115739', '0.10958340819268', '0.099755679236243', '0.0907826241267504', '0.0825956203546979', '0.0751302384111894', '0.0683263295744258', '0.0621279977639921', '0.0564834809370572', '0.0513449852139111', '0.0466684871328814', '0.042413516167789', '0.0385429293775096', '0.035022685071934', '0.0318216204865132', '0.0289112368382048', '0.0262654939162707', '0.0238606155312519', '0.021674906523588', '0.0196885815912485', '0.0178836058829335', '0.0162435470852779', '0.0147534385851646', '0.0133996531928511', '0.0121697868544064', '0.0110525517526551', '0.0100376781867076', '0.00911582462544914', '0.00827849534575178', '0.00751796508841916', '0.00682721019158058', '0.00619984569061827', '0.00563006790443123', '0.00511260205894446', '0.00464265452957236', '0.00421586931435123', '0.00382828837833139', '0.00347631553734708', '0.00315668357532714', '0.00286642431380459', '0.00260284137520731', '0.00236348540287827', '0.00214613152062159', '0.00194875883295343']
xdata = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101']

ydata = np.array(ydata, dtype=float)
xdata = np.array(xdata, dtype=float)

def sir_model(y, x, beta, gamma):
    S = -beta * y[0] * y[1] / N
    R = gamma * y[1]
    I = -(S + R)
    return S, I, R

def fit_odeint(x, beta, gamma):
    return integrate.odeint(sir_model, (S0, I0, R0), x, args=(beta, gamma))[:,1]

N = 1.0
I0 = ydata[0]
S0 = N - I0
R0 = 0.0

popt, pcov = optimize.curve_fit(fit_odeint, xdata, ydata)
fitted = fit_odeint(xdata, *popt)

plt.plot(xdata, ydata, 'o')
plt.plot(xdata, fitted)
plt.show()

您可能会注意到一些运行时警告,但它们主要是由于对最小化求解器 (Levenburg-Marquardt) 的初始搜索,该求解器尝试了 betagamma 的一些值,这些值会在积分期间导致数值溢出。但是,它应该尽快稳定到更合理的值。如果您为 minimize() 尝试不同的求解器,您会注意到类似的警告。

【讨论】:

@Reti43 你知道处理多种流行病的算法吗? @Sam 恐怕我没有。流行病学不是我的专长,我只是碰巧知道几个 SIR 模型的基础知识。我建议在这里进行文献搜索,可能从 Scholar Google 开始,看看是否存在任何此类模型。

以上是关于基于最小二乘拟合 SIR 模型的主要内容,如果未能解决你的问题,请参考以下文章

最小二乘拟合(scipy实现)

MATLAB点云处理(十八):直线拟合(最小二乘 | RANSAC)

LSSVM回归预测基于matlab灰狼算法优化最小支持向量机GWO-LSSVM数据预测含Matlab源码 2259期

LSSVM回归预测基于matlab灰狼算法优化最小支持向量机GWO-LSSVM数据预测含Matlab源码 2259期

曲线拟合的最小二乘原理

最小二乘曲线拟合的C++实现