Kats时间序列开源库的使用笔记

Posted 悟乙己

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Kats时间序列开源库的使用笔记相关的知识,希望对你有一定的参考价值。

1 Kats的千辛万苦安装之路

不知道是不是笔者的window笔记本的问题,按照kats出现的很多问题

安装Kats时候,会报错:

error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

一般是按照prophet的时候会出现:
其实是可以 直接跳过Microsoft Visual C++,安装:

conda install pystan==2.19.1.1

其中,pystan一定要通过conda安装,才能解决Microsoft Visual C++
不然无法解决

另外报错:

SyntaxError: future feature annotations is not defined

这是版本问题,可见issues
Kats只支持3.7 / 3.8 ,笔者是3.6,所以报错了.

切换成py3.7之后,出现了以下的问题:

ImportError: cannot import name 'utils' from partially initialized module 'kats' (most likely due to a circular import)

到了文件夹一看,package安装之后,很多配套组件没有在文件夹之中,
于是笔者手动复制进去:

复制之后,又报错,

ModuleNotFoundError: No module named 'deprecated'

所以又需要重新安装一波新package
这下总算ok了…

2 案例实践

2.1 时序数据分析工具Kats:01基础用法

时序数据分析工具Kats:01基础用法
已经把官方文档其中那份最基础的翻译了,kats_101_basics.ipynb

数据来源:data

2.2 异常、突变点监测-Detection

有两种点的类型:离群点、突变点(changepoint)

kats_202_detection.ipynb

2.2.1 CUSUMDetector 异常点

先来看看CUSUMDetector几个参数:

  • threshold: float, significance level;
  • max_iter: int, maximum iteration in finding the changepoint;
  • delta_std_ratio: float, the mean delta has to be larger than this parameter times std of the data to be considered as a change;
  • min_abs_change: int, minimal absolute delta between mu0 and mu1
  • start_point: int, the start idx of the changepoint, None means the middle of the time series;
  • change_directions: list[str], a list contain either or both ‘increase’ and ‘decrease’ to specify what type of change to be detected;监测递增突变、还是递减突变,还是两者都需要,默认是Both
  • interest_window: list[int, int], a list containing the start and end of the interest window where we will look for a change point. Note that the llr will still be calculated using all data points;监测区间,可能只想监测某一个区间的异常值
  • magnitude_quantile: float, the quantile for magnitude comparison, if none, will skip the magnitude comparison;
  • magnitude_ratio: float, comparable ratio;
  • magnitude_comparable_day: float, maximal percentage of days can have comparable magnitude to be considered as regression;
  • return_all_changepoints: bool, return all the changepoints found, even the insignificant ones.

来看例子:

# 生成数据
# import packages
from kats.detectors.cusum_detection import CUSUMDetector

# synthesize data with simulation
np.random.seed(10)
df_increase_decrease = pd.DataFrame(
    
        'time': pd.date_range('2019-01-01', periods=60),
        'increase':np.concatenate([np.random.normal(1,0.2,30), np.random.normal(2,0.2,30)]),
        'decrease':np.concatenate([np.random.normal(1,0.3,50), np.random.normal(0.5,0.3,10)]),
    
)

# 突变点监测1
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["increase"])

plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()

# 突变点监测2
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','decrease']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["decrease"])

plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()

2.2.2 BOCPDetector 异常区间检测

“Bayesian Online Changepoint Detection” (Adams & McKay, 2007)

贝叶斯在线变化点检测(BOCPD)是一种检测时间序列中持续的突然变化的方法

例子:

from kats.utils.simulator import Simulator

sim = Simulator(n=450, start='2020-01-01', freq='H')
ts_bocpd = sim.level_shift_sim(noise=0.05, seasonal_period=1)

# plot the simulated data
ts_bocpd.plot(cols=['value'])

from kats.detectors.bocpd import BOCPDetector, BOCPDModelType, TrendChangeParameters

# Initialize the detector
detector = BOCPDetector(ts_bocpd)


changepoints = detector.detector(
    model=BOCPDModelType.NORMAL_KNOWN_MODEL # this is the default choice
)

# Plot the data
plt.xticks(rotation=45)
detector.plot(changepoints)
plt.show()

cp = changepoints[0]
cp

具体的突变区间有哪些:

BOCPDChangePoint(start_time: 2020-01-05T03:00:00.000000000, end_time: 2020-01-05T03:00:00.000000000, confidence: 0.9752663452994146, model: <BOCPDModelType.NORMAL_KNOWN_MODEL: 1>, ts_name: value)

2.2.3 RobustStatDetector多突变点检测

RobustStatDetector主要是突变点的检测,但是可以检测多个突变点。
所以效果又跟BOCPDetector区间类似

# import package
from kats.detectors.robust_stat_detection import RobustStatDetector

tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = RobustStatDetector(tsd)
change_points = detector.detector()

plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()

change_point = change_points[0]
change_point

结果:

TimeSeriesChangePoint(start_time: 2019-01-31T00:00:00.000000000, end_time: 2019-01-31T00:00:00.000000000, confidence: 0.9932728251020368, metric: -0.32345014754902635, index: 30)

2.3 离群点检测

2.3.1 OutlierDetector

# 生成数据
# load the air_passenger data set
air_passengers_outlier_df = pd.read_csv("~/Kats/kats/data/air_passengers.csv")
air_passengers_outlier_df.columns = ["time", "value"]

# manually add outlier on the date of '1950-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1950-12-01','value']*=5
# manually add outlier on the date of '1959-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1959-12-01', 'value']*=4

# transform the outlier data into `TimeSeriesData` Object
air_passengers_outlier_ts = TimeSeriesData(air_passengers_outlier_df)

# 离群点
from kats.detectors.outlier import OutlierDetector

ts_outlierDetection = OutlierDetector(air_passengers_outlier_ts, 'additive') # call OutlierDetector
ts_outlierDetection.detector() # apply OutlierDetector

# 离群点
ts_outlierDetection.outliers[0]

结果输出:

[Timestamp('1950-12-01 00:00:00'),
 Timestamp('1959-11-01 00:00:00'),
 Timestamp('1959-12-01 00:00:00')]

2.3.2 MultivariateAnomalyDetector

这种异常检测方法对于检测跨多个时间序列的异常非常有用。异常是基于偏离预测稳态行为检测。一个度量系统的稳态行为是通过使用向量自回归(VAR)模型建模时间序列之间的线性相关性来预测的。

from kats.detectors.outlier import MultivariateAnomalyDetector, MultivariateAnomalyDetectorType
from kats.models.var import VARParams

import warnings
warnings.filterwarnings(action='ignore',category=FutureWarning)

# 生成数据
# Load the demo data into a TimeSeriesData Object
multi_anomaly_df = pd.read_csv("~/Kats/kats/data/multivariate_anomaly_simulated_data.csv")
multi_anomaly_ts = TimeSeriesData(multi_anomaly_df)

# Plot the data
multi_anomaly_ts.plot(cols=multi_anomaly_ts.value.columns.tolist())

# VAR	检测
np.random.seed(10)
params = VARParams(maxlags=2)
d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=60)
anomaly_score_df = d.detector()
d.plot()

# 检测
anomalies = d.get_anomaly_timepoints(alpha=0.05)
anomalies

检测的结果为:

[Timestamp('2020-01-23 00:00:00'),
 Timestamp('2020-01-24 00:00:00'),
 Timestamp('2020-01-25 00:00:00'),
 Timestamp('2020-01-26 00:00:00'),
 Timestamp('2020-01-27 00:00:00'),
 Timestamp('2020-01-28 00:00:00')]

相关的是:

d.get_anomalous_metrics(anomalies[0], top_k=3)

然后使用get_abnormous_metrics函数,我们可以查看导致多变量异常的顶部指标。
在我们发现的异常时间的情况下,我们可以验证最大的异常分数来自指标5和6。

2.4 Trend detection 趋势检测

趋势检测试图识别时间序列中显著和长期的变化。
趋势检测不是识别变化点,而是识别逐渐和长期变化的窗口。

MKDetector 是我们在Kats中包含的趋势检测算法,它基于非参数Mann-Kendall (MK)检验。
MKDetector 本质上做的是将MK应用于一个大小调整的窗口(由detector方法中的window_size参数指定),并返回该测试具有统计意义的每个窗口的结束点。
趋势窗口是基于窗口内时间序列的增加或减少的单调性来检测的,而不是窗口内时间序列值变化的幅度。

from kats.utils.simulator import Simulator

sim = Simulator(n=365, start='2020-01-01', freq='D')
tsd = sim.trend_shift_sim(noise=200, seasonal_period=7, seasonal_magnitude=0.007,
                              cp_arr=[250], intercept = 10000, trend_arr=[40,-20])

# plot the simulated data
tsd.plot(cols=['value'])
plt.show()

# 寻找趋势向下的离群点
from kats.detectors.trend_mk import MKDetector

detector = MKDetector(data=tsd, threshold=.8)
# run detector
detected_time_points = detector.detector(direction='down', window_size=20, freq='weekly')
# plot the results
detector.plot(detected_time_points)
plt.show()


# 结果
cp = detected_time_points[0]
cp

输出的结果:

MKChangePoint(start_time: 2020-10-03 00:00:00, end_time: 2020-10-03 00:00:00, confidence: 0.9999998244557637, detector_type: kats.detectors.trend_mk.MKDetector, is_multivariate: False, trend_direction: decreasing, Tau: -0.8526315789473684)

3 特征抽取-TsFeatures

3.1 特征抽取

import sys
sys.path.append("../")

import numpy as np
import pandas as pd
import pprint
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from kats.consts import TimeSeriesData
from statsmodels.tsa.seasonal import STL
from kats.utils.simulator import Simulator
from sklearn.preprocessing import StandardScaler
from kats.tsfeatures.tsfeatures import TsFeatures

import warnings
warnings.simplefilter(action='ignore')

# 数据生成
sim = Simulator(n=90, freq="D", start = "2021-01-01") # simulate 90 days of data
random_seed = 100

# generate 10 TimeSeriesData with arima_sim
np.random.seed(random_seed) # setting numpy seed
arima_sim_list = [sim.arima_sim(ar=[0.1, 0.05], ma = [0.04, 0.1], d = 1) for _ in range(10)]

# generate 10 TimeSeriesData with trend shifts
trend_sim_list = [
    sim.trend_shift_sim(
        cp_arr = [30, 60, 75],
        trend_arr=[3, 15, 2, 8],
        intercept=30,
        noise=50,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(10, 100),
        random_seed=random_seed
    ) for _ in range(10)
]


# generate 10 TimeSeriesData with level shifts
level_shift_list = [
    sim.level_shift_sim(
        cp_arr = [30, 60, 75],
        level_arr=[1.35, 1.05, 1.35, 1.2],
        noise=0.05,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(0.1, 1.0),
        random_seed=random_seed
    ) for _ in range(10)
]

ts_list = arima_sim_list + trend_sim_list + level_shift_list


# 时序
ts = ts_list[0]

# plot the time series
ts.plot(cols=['value'])
plt.xticks(rotation = 45)
plt.show()

# 时序特征
# Step 1. initiate TsFeatures
model = TsFeatures()

# Step 2. use .transform() method, and apply on the target time series data
output_features = model.transform(ts)
output_features


特征包括:

'length': 90,
 'mean': -4.973228083549793,
 'var': 50.69499812650379,
 'entropy': 0.2742447620827895,
 'lumpiness': 10.258210327109449,
 'stability': 45.07760417461487,
 'flat_spots': 1,
 'hurst': 0.41884368965647256,
 'std1st_der': 0.8773588739369633,
 'crossing_points': 5,
 'binarize_mean': 0.43333333333333335,
 'unitroot_kpss': 0.41641147078333146,
 'heterogeneity': 73.29527168434541,
 'histogram_mode': -11.841676172131818,
 'linearity': 0.8346355269096618,
 'trend_strength': 0.9853025999592567,
 'seasonality_strength': 0.3521955818150291,
 'spikiness': 0.00020455870537077636,
 'peak': 1,
 'trough': 6,
 'level_shift_idx': 23,
 'level_shift_size': 0.7134342301151566,
 'y_acf1': 0.9597578784708428,
 'y_acf5': 4.0361834721280365,
 'diff1y_acf1': 0.1830233735938267,
 'diff1y_acf5': 0.0794760417768679,
 'diff2y_acf1': -0.4816907863327952,
 'diff2y_acf5': 0.24476824866108501,
 'y_pacf5': 0.9862593061001357,
 'diff1y_pacf5': 0.07981792144706332,
 'diff2y_pacf5': 0.36145785941160113,
 'seas_acf1': 0.8149983814152568,
 'seas_pacf1': 0.03034496255047374,
 'firstmin_ac': 53,
 'firstzero_ac': 30,
 'holt_alpha': 0.9999999850969992,
 'holt_beta': 0.13656079676064467,
 'hw_alpha': nan,
 'hw_beta': nan,
 'hw_gamma': nan

其中包括:

  • STL (Seasonal and Trend decomposition using Loess) Features
    • Strength of Seasonality 周期强度
    • Strength of Trend 趋势强度
    • Spikiness 尖刺
    • Linearity 线性度
  • Amount of Level Shift 数量级偏移
  • Presence of Flat Segments 扁平段
  • ACF and PACF Features 稳定性特征:自相关和偏自相关
  • Hurst Exponent 趋势持续性特征:赫斯特指数
  • ARCH Statistic 拱形统计

3.2 根据特征进行聚类

Cluster Similar Time Series,这么多生成的特征,可以先PCA降维,然后进行聚类

# performing dimension reduction on the time series samples
ls_features = ['lumpiness', 'entropy', 'seasonality_strength', 'stability', 'level_shift_size']
df_dataset = df_features[ls_features]
x_2d = PCA(n_components=2).fit_transform(X=StandardScaler().fit_transform(df_dataset[ls_features].values))
df_dataset['pca_component_1'] = x_2d[:,0]
df_dataset['pca_component_2'] = x_2d[:,1]

# 聚类
# Plot the PCA projections of each time series
plt.figure(figsize = (15,8))
# Plot ARIMA time series in red
ax = df_dataset.iloc[0:10].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='red')
# Plot trend shift time series in green
df_dataset.iloc[10:20].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='green', ax=ax)
# Plot level shift time series in yellow
df_dataset.iloc[20:].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='yellow', ax=ax)

plt.title('Visualization of the dimension reduced time series samples')
plt.legend(['ARIMA', 'Trend Shift', 'Level Shift'])
plt.show()

聚类结果:

以上是关于Kats时间序列开源库的使用笔记的主要内容,如果未能解决你的问题,请参考以下文章

Facebook开源时序利器-Kats

Facebook开源时序利器-Kats

再次出发!FaceBook 开源“一站式服务“时序利器 Kats !

太棒了!FaceBook 开源全网第一个时序王器 Kats !

再次出发!FaceBook 开源“一站式服务“时序利器 Kats !

CocoaPods安装和使用笔记 by STP