A Cloud-Native Power Tool -- SkyWalking

1 Introduction to SkyWalking

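SkyWalking is an APM (application performance monitor) system designed specifically for microservices, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures.
Its features cover monitoring, tracing, and diagnosing distributed systems in Cloud Native architectures. The core features are: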

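  • Metrics analysis for services, service instances, and endpoints
  • Root cause analysis, with code profiling at runtime
  • Service topology map analysis
  • Dependency analysis for services, service instances, and endpoints
  • Detection of slow services and endpoints
  • Performance optimization
  • Distributed tracing and context propagation
  • Database access metrics, with detection of slow database access statements (including SQL statements)
  • Alarms
  • Browser performance monitoring

For details, see the GitHub repository: https://github.com/apache/skywalking. This article walks through deploying and using SkyWalking 8.3.0 in a K8s environment, hands-on from start to finish.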

2 Deploying on K8s


# Create the namespace: monitoring
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring


# Create the RBAC permissions for SkyWalking
# For the related files, see the K8s configuration under https://github.com/apache/skywalking-kubernetes/tree/master/chart/skywalking/templates
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: skywalking-oap-server
    release: 8.3.0
  name: skywalking-oap-server
  namespace: monitoring
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
  - apiGroups: [""]
    resources: ["pods","configmaps"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
- apiGroups: [""]
  resources: ["pods", "endpoints", "services"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["extensions"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap-server
subjects:
  - kind: ServiceAccount
    name: skywalking-oap-server
    namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking-oap-server
subjects:
- kind: ServiceAccount
  name: skywalking-oap-server
  namespace: monitoring


# Create the ConfigMap that holds SkyWalking's alarm-settings.yml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alarm-settings
  namespace: monitoring
data:
  alarm-settings.yml: |
    rules:
      # Rule unique name, must be ended with `_rule`.
      # 1. Average service response time exceeded 1 second in 3 of the last 10 minutes.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 60
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
      # 2. Service success rate fell below 80% in 2 of the last 10 minutes.
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 60
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      # 3. Service response time percentiles (p50, p75, p90, p95, p99) exceeded 1000 ms in 3 of the last 10 minutes.
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 60
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
      # 4. Average response time of a service instance exceeded 1 second in 2 of the last 10 minutes.
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
      # Because the number of endpoint is much more than service and instance.
      # 5. Average endpoint response time exceeded 1 second in 2 of the last 10 minutes.
      endpoint_avg_rule:
        metrics-name: endpoint_avg
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
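Two conventions in this ConfigMap are easy to get wrong when adding rules: every rule name must end with `_rule`, and `service_sla` expresses 80% as 8000. The following Python sketch checks a set of rules for such slips before they are applied. It is a minimal illustration, not part of SkyWalking: the rules are mirrored by hand as plain dictionaries, and `check_rules`, `sla_threshold`, `REQUIRED_KEYS`, and `SLA_SCALE` are names and assumptions introduced here.

```python
# Pre-apply sanity check for alarm rules (illustrative sketch only).
# Assumption: the rules are mirrored here as dicts; nothing is read from the cluster.

SUPPORTED_OPS = {">", "<", "="}          # operators named in this article
REQUIRED_KEYS = {"metrics-name", "op", "threshold", "period", "count"}  # assumed minimum
SLA_SCALE = 100                          # assumption from the ConfigMap: 80% is written as 8000

RULES = {
    "service_resp_time_rule": {
        "metrics-name": "service_resp_time", "op": ">",
        "threshold": 1000, "period": 10, "count": 3, "silence-period": 60,
    },
    "service_sla_rule": {
        "metrics-name": "service_sla", "op": "<",
        "threshold": 8000, "period": 10, "count": 2, "silence-period": 60,
    },
}


def sla_threshold(percent):
    """Convert a success rate in percent to the scale used by service_sla."""
    return int(round(percent * SLA_SCALE))


def check_rules(rules):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for name, rule in rules.items():
        if not name.endswith("_rule"):
            problems.append("{}: rule name must end with _rule".format(name))
        missing = REQUIRED_KEYS - set(rule)
        if missing:
            problems.append("{}: missing keys {}".format(name, sorted(missing)))
        if "op" in rule and rule["op"] not in SUPPORTED_OPS:
            problems.append("{}: unsupported op {!r}".format(name, rule["op"]))
    return problems
```

Calling `check_rules(RULES)` on the two mirrored rules returns an empty list, and `sla_threshold(80)` gives the 8000 used by `service_sla_rule`.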


# Create the SkyWalking OAP Deployment. The container opens ports 11800 and 12800 as the gRPC and REST ports, and they are exposed to the internal network via NodePort so that hosts outside this K8s cluster can reach them.
# For convenience, Aliyun's Elasticsearch 7.7 cloud service is used directly as SkyWalking's storage. For other supported storage options, see https://github.com/apache/skywalking/tree/master/oap-server/server-storage-plugin
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: skywalking-oap-server
  template:
    metadata:
      labels:
        app: skywalking-oap-server
        devops: k8s-app
    spec:
      serviceAccountName: skywalking-oap-server
      containers:
        - name: skywalking-oap-server
          image: apache/skywalking-oap-server:latest
          imagePullPolicy: IfNotPresent
          livenessProbe:
            tcpSocket:
              port: 12800
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            tcpSocket:
              port: 12800
            initialDelaySeconds: 15
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
          ports:
            - name: grpc
              containerPort: 11800
            - name: rest
              containerPort: 12800
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "4Gi"
              cpu: 4
          env:
            - name: JAVA_OPTS
              value: "-Xmx2g -Xms2g"
            - name: SW_CLUSTER
              value: kubernetes
            - name: SW_CLUSTER_K8S_NAMESPACE
              value: monitoring
            - name: SW_CONFIGURATION
              value: k8s-configmap
            - name: SW_CONFIG_CONFIGMAP_PERIOD
              value: "60"
            - name: SKYWALKING_COLLECTOR_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
            - name: SW_STORAGE
              value: elasticsearch7
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: xxxxxxx.elasticsearch.aliyuncs.com:9200
            - name: SW_ES_USER
              value: elastic
            - name: SW_ES_PASSWORD
              value: xxxxx
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: alarm-settings
              mountPath: /skywalking/config/alarm-settings.yml
              readOnly: true
              subPath: alarm-settings.yml
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
        - name: alarm-settings
          configMap:
            name: alarm-settings
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
spec:
  selector:
    app: skywalking-oap-server
  ports:
    - name: grpcport
      port: 11800
      targetPort: 11800
      protocol: TCP
      nodePort: 31180
    - name: restport
      port: 12800
      targetPort: 12800
      protocol: TCP
      nodePort: 31280
  type: NodePort
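Since the point of the NodePort Service is to let hosts outside the cluster reach the OAP server, a quick reachability check from such a host can save debugging time later. The sketch below uses only Python's standard library; `port_open` and `check_oap` are illustrative names, and the node IP in the usage comment is a hypothetical placeholder. A successful TCP connect only shows that the port accepts connections, not that the OAP server behind it is healthy.

```python
import socket

# NodePorts taken from the Service definition above.
NODE_PORTS = {"grpc": 31180, "rest": 31280}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_oap(node_ip):
    """Map each exposed OAP port name to whether it is reachable on the given node."""
    return {name: port_open(node_ip, port) for name, port in NODE_PORTS.items()}


# Usage (hypothetical node IP, replace with one of your own nodes):
#   print(check_oap("192.168.0.10"))
```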


# Create the SkyWalking UI. Note that spec.template.spec.containers.env.SW_OAP_ADDRESS must match the name used in sky-deployment.yaml, followed by the REST port. The UI is exposed under a domain name through a traefik2 IngressRoute.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
        - name: skywalking-ui
          image: apache/skywalking-ui:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: page
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "3G"
              cpu: 2
          env:
            - name: SW_OAP_ADDRESS
              value: skywalking-oap-server:12800
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: monitoring
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: page
  selector:
    app: skywalking-ui
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  entryPoints:
    - http
  routes:
    - match: Host(`sw.domain.com`) && PathPrefix(`/`)
      kind: Rule
      priority: 10
      middlewares:
        - name: net-offical
          namespace: default
      services:
        - name: skywalking-ui
          namespace: monitoring
          port: 80

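Apply the manifests above in order with kubectl apply to deploy SkyWalking. Once the deployment completes, you can inspect the resulting SkyWalking resources.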

3 Using SkyWalking

Open sw.domain.com in a browser and the SkyWalking UI is ready, but since no service is connected yet, every view is blank.

Next, we prepare the SkyWalking Agent so that Java services can connect through it.

3.1 Preparing the SkyWalking Agent

# SkyWalking Agent Dockerfile
FROM alpine:3.8
LABEL maintainer=xiayun
ENV SKYWALKING_VERSION=8.3.0
ADD http://mirrors.tuna.tsinghua.edu.cn/apache/skywalking/${SKYWALKING_VERSION}/apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz /
RUN tar -zxvf /apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz && \
    mv apache-skywalking-apm-bin skywalking && \
    mv /skywalking/agent/optional-plugins/apm-trace-ignore-plugin* /skywalking/agent/plugins/ && \
    chmod -R 777 /skywalking/agent && \
    echo -e "\n# Ignore Path" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo "# see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}' >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}' >> /skywalking/agent/config/agent.config && \
    echo 'logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:1073741824}' >> /skywalking/agent/config/agent.config
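The three entries appended at the end of the Dockerfile use the `${NAME:default}` form: the value comes from the environment variable `NAME` when it is set, and from the text after the colon otherwise. That is what later allows the K8s manifest to tune the agent purely through `env`. The Python sketch below models this substitution in a simplified way so the effect of each variable is easy to see. It is not the agent's own code; `resolve_placeholders` and `parse_config_line` are names introduced for illustration, and edge cases such as empty variables may behave differently in the real agent.

```python
import re

# Matches ${NAME:default}; NAME is an environment variable, default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve_placeholders(value, env):
    """Replace each ${NAME:default} with env[NAME] if present, else the default."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


def parse_config_line(line, env):
    """Split a 'key=value' config line and resolve placeholders in the value."""
    key, _, raw_value = line.partition("=")
    return key.strip(), resolve_placeholders(raw_value.strip(), env)


# The lines appended by the Dockerfile above.
CONFIG_LINES = [
    "trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}",
    "agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}",
    "logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:1073741824}",
]

if __name__ == "__main__":
    import os
    for config_line in CONFIG_LINES:
        print(parse_config_line(config_line, dict(os.environ)))
```

With no variables set, the three keys resolve to `/health`, `default-namespace`, and `1073741824`; setting `SW_AGENT_TRACE_IGNORE_PATH` in the Pod's environment replaces only the first.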

3.2 Preparing the Java K8s manifest

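The Java service image starts the application with a command that picks up JAVA_OPTS:

CMD java ${JAVA_OPTS} -jar jar-name

Then, in the Java service's K8s manifest, add initContainers so that the SkyWalking agent is deployed alongside the service in K8s sidecar fashion.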

# Java K8s manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-name
  namespace: ENV
  labels:
    prometheus: ENV-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server-name
  template:
    metadata:
      labels:
        app: server-name
        prometheus: ENV-server
        devops: k8s-app
    spec:
      initContainers:
        - name: skywalking-agent
          image: skywalking-agent:r1.0
          securityContext:
            allowPrivilegeEscalation: false
          resources:
            limits:
              memory: 1Gi
            requests:
              memory: 100Mi
          command:
            - 'sh'
            - '-c'
            - 'set -ex;mkdir -p /vmskywalking/agent;cp -r /skywalking/agent/* /vmskywalking/agent'
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: sw-agent
              mountPath: /vmskywalking/agent
      containers:
        - name: server-name
          image: <BUILD_TAG>
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
          readinessProbe:
            tcpSocket:
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            tcpSocket:
              port: 8081
            initialDelaySeconds: 300
            periodSeconds: 5
          ports:
            - name: web
              protocol: TCP
              containerPort: 8081
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              memory: "MAXMEM"
          env:
            - name: JAVA_OPTS
              value: -javaagent:/usr/lib/agent/skywalking-agent.jar
            - name: SW_AGENT_NAME
              value: ENV-server-name
            - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
              value: skywalking-oap-server.monitoring.svc.cluster.local:11800
            - name: SW_LOGGING_LEVEL
              value: ERROR
            - name: SW_LOGGING_MAX_FILE_SIZE
              value: "1073741824"
            - name: SW_AGENT_NAMESPACE
              value: ENV
            - name: SW_MOUNT_FOLDERS
              value: plugins,activations
            - name: SW_AGENT_TRACE_IGNORE_PATH
              value: /health,/actuator/prometheus,/prometheus
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: app-logs
              mountPath: /home/admin/server-name/logs
            - name: fonts
              mountPath: /usr/share/fonts
              subPath: fonts
              readOnly: true
            - name: sw-agent
              mountPath: /usr/lib/agent
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
        - name: app-logs
          emptyDir: {}
        - name: sw-agent
          emptyDir: {}
        - name: fonts
          persistentVolumeClaim:
            claimName: fonts
---
apiVersion: v1
kind: Service
metadata:
  name: server-name-svc
  namespace: ENV
  labels:
    prometheus: ENV-server
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/actuator/prometheus"
spec:
  selector:
    app: server-name
  ports:
    - name: web
      port: 80
      targetPort: 8081

With the configuration in place, start the Java service. Let's take a look at the resulting SkyWalking architecture on K8s.


Aliyun Elasticsearch serves as SkyWalking's storage, the SkyWalking server and UI both run on K8s, and the SkyWalking agent is delivered in K8s sidecar mode, sharing container space with the microservice.

3.3 Using SkyWalking

Log in to the SkyWalking UI and click refresh in the upper-right corner; the newly added Java service now appears.


Under APM on the dashboard, you can see the Top 10 for Services Load, Slow Services, Un-Health Service, and Slow Endpoints.

If traces should ignore certain paths, such as the monitoring URIs /health, /actuator/prometheus, and /prometheus, set them in env.SW_AGENT_TRACE_IGNORE_PATH in the Java K8s manifest. For wildcard paths, follow the form trace.ignore_path=/your/path/1/**,/your/path/2/**. For details, see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md
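To get a feel for which request paths a given SW_AGENT_TRACE_IGNORE_PATH value would exclude, the comma-separated patterns can be tried out offline. The Python sketch below is a simplified approximation of wildcard path matching and not the plugin's implementation: it assumes `?` matches one character inside a path segment, `*` matches any run of characters inside a segment, and `**` matches across segments. `should_ignore` and `pattern_to_regex` are illustrative names, and edge cases (for example whether `/your/path/1/**` also matches `/your/path/1` itself) should be checked against the plugin documentation linked above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard path pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # assumed: crosses '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # assumed: stays inside one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # assumed: exactly one non-'/' character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_ignore(path, ignore_spec):
    """Return True if path fully matches any pattern in the comma-separated spec."""
    patterns = [p.strip() for p in ignore_spec.split(",") if p.strip()]
    return any(pattern_to_regex(p).fullmatch(path) for p in patterns)


if __name__ == "__main__":
    spec = "/health,/actuator/prometheus,/prometheus"
    for candidate in ("/health", "/actuator/prometheus", "/api/orders"):
        print(candidate, should_ignore(candidate, spec))
```

In this model the value used in the manifest excludes exactly the three monitoring URIs, while a business path such as `/api/orders` is still traced.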
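The Alarm view shows the trace alarm details for the current services. Alarm rules are configured in alarm-settings.yml, and alarms can be forwarded through WebHooks, for example Dingtalk Hook, WeChat Hook, Slack Chat Hook, or gRPCHook.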
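For a plain WebHook, the receiving side is just an HTTP endpoint that accepts a POST. The Python sketch below shows a minimal receiver built on the standard library. The payload shape is an assumption based on the SkyWalking alarm documentation (a JSON array of objects with fields such as `scope`, `name`, `ruleName`, `alarmMessage`, and `startTime` in epoch milliseconds) and should be verified against the version you run; `summarize_alarms`, `AlarmHandler`, and `serve` are illustrative names, and how the receiver's URL is registered depends on the alarm settings of your version.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alarms(body):
    """Turn a webhook request body (assumed JSON array of alarms) into log lines."""
    lines = []
    for alarm in json.loads(body):
        # Assumption: startTime is epoch milliseconds.
        started = datetime.fromtimestamp(alarm.get("startTime", 0) / 1000, tz=timezone.utc)
        lines.append("[{}] {} {} ({}): {}".format(
            started.isoformat(),
            alarm.get("scope", "?"),
            alarm.get("name", "?"),
            alarm.get("ruleName", "?"),
            alarm.get("alarmMessage", ""),
        ))
    return lines


class AlarmHandler(BaseHTTPRequestHandler):
    """Prints every received alarm and answers 200 so the sender sees success."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Block forever, serving the webhook receiver on the given port."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling `serve()` starts the receiver on port 8000 (an arbitrary choice); in practice the print call would be replaced by forwarding to a chat tool or ticketing system.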

Taking the first rule from alarm-settings.yml as an example:

rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 60
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.


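  • Rule name. The unique name shown in the alarm message. It must end with _rule. The rule to evaluate is specified separately from the rule name: it corresponds to an entry in the map of metrics available for alarms, see https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md#list-of-all-potential-metrics-name. Common examples are endpoint_percent_rule, an endpoint response percentage alarm, and service_percent_rule, a service response percentage alarm.
  • Metrics name. This is also the metric name in the OAL script. Only long, double, and int types are supported. See the list of all potential metrics names for details.
  • Include names. The entities this rule applies to, such as service names or endpoint names.
  • Threshold. The target value, evaluated together with metrics-name and the comparison operator below.
  • OP. The operator; >, <, and = are supported. Contributions of further operators are welcome. For example, metrics-name: endpoint_percent, threshold: 75, op: < means an alarm is sent when the value falls below 75%.
  • Period. The length of the window over which the rule is checked. This is a time window, aligned with the clock of the backend deployment environment.
  • Count. Within one Period window, an alarm is sent once the number of values beyond the Threshold (according to OP) reaches Count.
  • Silence period. After an alarm fires at time N, no alarm is sent during TN -> TN + period. By default it equals Period, which means the same alarm (same Id under the same Metrics name) fires only once within a Period.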
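The interplay of Period, Count, and Silence period is easier to follow with a small simulation. The Python sketch below feeds one metric value per minute into a rule and reports when an alarm would fire. It is a simplified model written for this article, not the OAP server's implementation; the `AlarmRule` class and its per-minute `check` method are assumptions made for illustration.

```python
import operator
from collections import deque

# Operators named in the list above.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


class AlarmRule:
    """Simplified model: one value per minute, sliding window of `period` minutes."""

    def __init__(self, op, threshold, period, count, silence_period=None):
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        # Keeps only the match results of the last `period` minutes.
        self.window = deque(maxlen=period)
        # Silence period defaults to Period, as described above.
        self.silence_period = period if silence_period is None else silence_period
        self.silence_left = 0

    def check(self, value):
        """Record this minute's value and return True if an alarm fires now."""
        self.window.append(self.matches(value, self.threshold))
        if self.silence_left > 0:
            self.silence_left -= 1      # still silenced after an earlier alarm
            return False
        if sum(self.window) >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


if __name__ == "__main__":
    # service_resp_time_rule: > 1000 ms, 3 matches within 10 minutes, silence 60.
    rule = AlarmRule(">", 1000, period=10, count=3, silence_period=60)
    for minute, value in enumerate([1200, 800, 1500, 900, 1300, 2000]):
        print(minute, value, rule.check(value))
```

With these inputs the alarm fires at minute 4, when the third value above 1000 ms arrives inside the window, and the value at minute 5 is suppressed by the silence period.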

