How Prometheus Service Discovery Works

Posted 2023-04-03 Reactor2020


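Service Discovery

Overview

As shown in the figure above, the core functions of Prometheus are service discovery, data collection (scraping), and data storage. The service discovery module is responsible solely for finding the targets that need to be monitored. The scrape module subscribes to that information; each target it receives carries the protocol (scheme), the host address and port (instance), the request path (metrics_path), the request parameters (params), and so on. From these the scrape module builds a complete HTTP request and periodically pulls monitoring samples from the target over HTTP. Finally, the collected samples are handed to the TSDB module for storage.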
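To make the "build a complete HTTP request" step concrete, here is a minimal Go sketch of how the four fields could be combined into a scrape URL. The `target` struct and the `scrapeURL` function are hypothetical names used only for illustration; they are not Prometheus types.

```go
package main

import (
	"fmt"
	"net/url"
)

// target is a hypothetical, simplified view of what service discovery
// hands to the scrape module.
type target struct {
	Scheme      string     // e.g. "http"
	Instance    string     // host:port
	MetricsPath string     // e.g. "/metrics"
	Params      url.Values // optional query parameters
}

// scrapeURL assembles the URL that a scrape request would be sent to.
func scrapeURL(t target) string {
	u := url.URL{
		Scheme:   t.Scheme,
		Host:     t.Instance,
		Path:     t.MetricsPath,
		RawQuery: t.Params.Encode(),
	}
	return u.String()
}

func main() {
	t := target{
		Scheme:      "http",
		Instance:    "localhost:9090",
		MetricsPath: "/metrics",
		Params:      url.Values{"module": []string{"http_2xx"}},
	}
	fmt.Println(scrapeURL(t)) // http://localhost:9090/metrics?module=http_2xx
}
```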

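Why is a service discovery module needed?

The situation is similar to microservices, where a registry component is introduced to manage the tangled call dependencies among many services. Whether a service is stopped deliberately, crashes unexpectedly, or is scaled out because of increased traffic, the registry shields callers from these dynamic changes in service data and state, which simplifies the caller's logic.

Likewise, Prometheus was originally designed for cloud-native applications. In cloud-native and container environments resources are used on demand, which for a monitoring system means there is no fixed set of monitoring targets: every monitored object (infrastructure, applications, services) changes dynamically. Prometheus solves this by introducing an intermediary that holds the access information of all current targets; Prometheus only has to ask this intermediary which targets exist. This pattern is called service discovery.

Prometheus supports a wide range of service discovery protocols. As of the version current at the time of writing (2.41), nearly thirty are supported: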

<azure_sd_config>
<consul_sd_config>
<digitalocean_sd_config>
<docker_sd_config>
<dockerswarm_sd_config>
<dns_sd_config>
<ec2_sd_config>
<openstack_sd_config>
<ovhcloud_sd_config>
<puppetdb_sd_config>
<file_sd_config>
<gce_sd_config>
<hetzner_sd_config>
<http_sd_config>
<ionos_sd_config>
<kubernetes_sd_config>
<kuma_sd_config>
<lightsail_sd_config>
<linode_sd_config>
<marathon_sd_config>
<nerve_sd_config>
<nomad_sd_config>
<serverset_sd_config>
<triton_sd_config>
<eureka_sd_config>
<scaleway_sd_config>
<uyuni_sd_config>
<vultr_sd_config>
<static_config>

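Parsing the Service Discovery Configuration

1. When the Prometheus server starts, it loads the prometheus.yml configuration file and parses it into a Config struct:

❝

The Config struct is the top-level configuration structure. It contains 6 fields, one for each of the 6 major parts of the Prometheus configuration.

❞

2. The scrape configuration part, ScrapeConfigs, is a slice of *ScrapeConfig. Each ScrapeConfig corresponds to one job under scrape_configs, and the service discovery configuration is held in its ServiceDiscoveryConfigs field:

3. discovery.Configs is a slice of Config:

type Configs []Config

A single job can therefore configure several service discovery protocols, for example: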

- job_name: 'prometheus'
  metrics_path: /metrics
  static_configs:
    - targets: ['124.222.45.207:9090']
  file_sd_configs:
    - files:
      - targets/t1.json
      - targets/t2.json
      refresh_interval: 5m

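4. Config is an interface:

Every service discovery protocol has its own implementation of the Config interface (see the figure below). The interface defines two methods:

1. Name() string: returns the type of the service discovery protocol, such as eureka or kubernetes;
2. NewDiscoverer(DiscovererOptions) (Discoverer, error): returns a Discoverer. Discoverer is itself an interface with a single method, Run. It encapsulates the runtime logic of a specific service discovery protocol, and Run is the uniform entry point.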
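The relationship between the two interfaces can be sketched in a self-contained way. The code below does not import Prometheus: `Group`, `staticConfig`, `staticDiscoverer`, and `discoverOnce` are hypothetical stand-ins, and the options argument of NewDiscoverer is omitted to keep the example short.

```go
package main

import (
	"context"
	"fmt"
)

// Group is a simplified stand-in for a discovered target group.
type Group struct {
	Source  string
	Targets []string
}

// Discoverer has a single Run method that pushes discovered groups into a
// channel until the context is cancelled.
type Discoverer interface {
	Run(ctx context.Context, up chan<- []*Group)
}

// Config names the protocol and builds its Discoverer.
type Config interface {
	Name() string
	NewDiscoverer() (Discoverer, error)
}

// staticConfig is a hypothetical protocol that returns a fixed target list.
type staticConfig struct{ targets []string }

func (c staticConfig) Name() string { return "static" }

func (c staticConfig) NewDiscoverer() (Discoverer, error) {
	return staticDiscoverer{targets: c.targets}, nil
}

type staticDiscoverer struct{ targets []string }

// Run sends the fixed group once, then waits for cancellation.
func (d staticDiscoverer) Run(ctx context.Context, up chan<- []*Group) {
	select {
	case up <- []*Group{{Source: "static/0", Targets: d.targets}}:
	case <-ctx.Done():
		return
	}
	<-ctx.Done()
}

// discoverOnce starts the Discoverer built from cfg and returns its first update.
func discoverOnce(cfg Config) ([]*Group, error) {
	d, err := cfg.NewDiscoverer()
	if err != nil {
		return nil, err
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	up := make(chan []*Group)
	go d.Run(ctx, up)
	return <-up, nil
}

func main() {
	groups, err := discoverOnce(staticConfig{targets: []string{"localhost:9090"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(groups[0].Source, groups[0].Targets)
}
```

The point of the split is that the caller only ever deals with Config and Discoverer, so adding a protocol means adding one more implementation without touching the calling code.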

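Core Principles of Service Discovery

Notes:

1. The entry point of the core service discovery logic is the ApplyConfig method of the Manager struct, which puts the service discovery configuration into effect.
2. ApplyConfig has four main steps:
  (1) Cancel service discovery: a configuration change also calls ApplyConfig, so the discovery routines started from the previous configuration must be cancelled and then rebuilt from the current configuration.
  (2) Clear: empty the discoverCancel, targets, and providers containers, because they are regenerated from the current configuration.
  (3) Register providers: a provider wraps a Discoverer. Every service discovery protocol implements the Config interface, and its NewDiscoverer method creates the Discoverer.
```
type provider struct {
	name   string
	d      Discoverer
	subs   []string
	config interface{}
}
```
❝
Each service discovery configuration under a job corresponds to one Discoverer.
❞
  Besides the Discoverer, a provider has three more fields:
  1. name: the provider name, formatted as fmt.Sprintf("%s/%d", typ, len(m.providers));
  2. subs: a string slice holding job names. Different jobs may contain identical service discovery configurations; in that case only one provider is created, and subs records the jobs that share it;
  3. config: the service discovery configuration.
  (4) Start providers: iterate over the providers slice of the Manager struct and start each provider. This step mainly starts two goroutines:
    - one runs the Run method of the Discoverer interface, so that the discovery logic starts working;
    - one runs the updater method.
3. The Run method of the Discoverer interface discovers targets according to the specific protocol and passes them through a channel to the updater logic, which processes them, stores them in the targets field of the Manager struct, and sends a signal on the triggerSend channel to indicate that the targets have changed.
4. The sender method of the Manager struct checks the triggerSend channel every 5 seconds and, after processing the targets field of the Manager struct, puts the result on the syncCh channel.
  ❝
  The sender method of the Manager struct is started in discoveryManagerScrape.Run() when Prometheus starts.
  ❞
5. The scrape module listens on the syncCh channel, receives the targets produced by service discovery, reloads, and starts scraping metrics from them.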
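The provider naming rule and the sharing of one provider between jobs with identical configurations can be illustrated with a small sketch. It assumes that configurations are compared with reflect.DeepEqual; `sdConfig`, `manager`, and `register` are hypothetical names, and the Discoverer field is left out.

```go
package main

import (
	"fmt"
	"reflect"
)

// sdConfig is a hypothetical stand-in for one service discovery config entry.
type sdConfig struct {
	Type  string
	Files []string
}

// provider keeps only the fields discussed above; the Discoverer is omitted.
type provider struct {
	name   string
	subs   []string
	config interface{}
}

type manager struct{ providers []*provider }

// register adds cfg for the given job. If an identical config is already
// registered, the existing provider is reused and the job is appended to subs.
func (m *manager) register(job string, cfg sdConfig) {
	for _, p := range m.providers {
		if reflect.DeepEqual(cfg, p.config) {
			p.subs = append(p.subs, job)
			return
		}
	}
	m.providers = append(m.providers, &provider{
		name:   fmt.Sprintf("%s/%d", cfg.Type, len(m.providers)),
		subs:   []string{job},
		config: cfg,
	})
}

func main() {
	m := &manager{}
	shared := sdConfig{Type: "file", Files: []string{"targets/t1.json"}}
	m.register("job-a", shared)
	m.register("job-b", shared) // identical config: provider file/0 is reused
	m.register("job-b", sdConfig{Type: "static"})
	for _, p := range m.providers {
		fmt.Println(p.name, p.subs)
	}
}
```

With these inputs only two providers exist: file/0, shared by job-a and job-b, and static/1, used by job-b alone.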

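At its core, Prometheus service discovery is the cooperation of three goroutines:

Goroutine 1: runs the Run method of the Discoverer interface and discovers targets according to the protocol;
Goroutine 2: writes the targets discovered by goroutine 1 into the map held in the targets field of the Manager struct;
Goroutine 3: sends the data in the targets field of the Manager struct to the scrape module through a channel.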
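The cooperation of the three goroutines can be sketched as a small channel pipeline. This is a simplified illustration under stated assumptions, not the Prometheus implementation: `manager`, `discoverer`, `snapshot`, and `run` are hypothetical, targets are plain strings, the tick interval is shortened from 5 seconds to 10 milliseconds, and the sender blocks until the consumer reads.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// manager is a hypothetical, trimmed-down stand-in for the discovery Manager.
type manager struct {
	mu          sync.Mutex
	targets     map[string][]string // provider name -> discovered addresses
	triggerSend chan struct{}
	syncCh      chan map[string][]string
}

// discoverer (goroutine 1) stands in for Discoverer.Run and emits one update.
func discoverer(ctx context.Context, up chan<- []string) {
	select {
	case up <- []string{"10.0.0.1:8080", "10.0.0.2:8080"}:
	case <-ctx.Done():
	}
}

// updater (goroutine 2) stores each update and signals that targets changed.
func (m *manager) updater(ctx context.Context, name string, up <-chan []string) {
	for {
		select {
		case <-ctx.Done():
			return
		case tgs := <-up:
			m.mu.Lock()
			m.targets[name] = tgs
			m.mu.Unlock()
			select {
			case m.triggerSend <- struct{}{}:
			default: // a signal is already pending
			}
		}
	}
}

// sender (goroutine 3) checks for a pending signal on every tick and, if there
// is one, forwards a snapshot of the targets to the consumer.
func (m *manager) sender(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			select {
			case <-m.triggerSend:
				select {
				case m.syncCh <- m.snapshot():
				case <-ctx.Done():
					return
				}
			default: // nothing changed since the last tick
			}
		}
	}
}

// snapshot copies the targets map so the consumer never shares it with updater.
func (m *manager) snapshot() map[string][]string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string][]string, len(m.targets))
	for k, v := range m.targets {
		out[k] = v
	}
	return out
}

// run wires the three goroutines together and returns the first synced snapshot.
func run() map[string][]string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	m := &manager{
		targets:     map[string][]string{},
		triggerSend: make(chan struct{}, 1),
		syncCh:      make(chan map[string][]string),
	}
	up := make(chan []string)
	go discoverer(ctx, up)
	go m.updater(ctx, "static/0", up)
	go m.sender(ctx, 10*time.Millisecond) // the article describes a 5s interval
	return <-m.syncCh
}

func main() {
	fmt.Println(run())
}
```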

How the scrape module collects data once it has the targets will be covered in a later analysis of the scrape module.

Monitoring Metrics

Prometheus service discovery exposes the following 5 general metrics, all defined in discovery/manager.go:

prometheus_sd_discovered_targets
prometheus_sd_failed_configs
prometheus_sd_received_updates_total
prometheus_sd_updates_delayed_total
prometheus_sd_updates_total

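1. Number of discovered targets

Service discovery finds scrape targets according to the protocol. The prometheus_sd_discovered_targets metric reports how many targets each job has discovered:

prometheus_sd_discovered_targets: gauge, the current number of discovered targets
config: the job name
name: either scrape or notify, which distinguishes service discovery for metric scraping from service discovery for alert notification
Example: prometheus_sd_discovered_targets{config="auth_es1", name="scrape"}  12

❝
This is the number of targets discovered through the protocol. They have not yet reached the scrape module, so the metric cannot tell whether a target is up or down.
❞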

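2. Service discovery failure metric

Service discovery creates one provider for each discovery configuration entry and runs each provider in its own goroutine. If creating a provider from a configuration entry fails, the failure is reported by the prometheus_sd_failed_configs metric:

prometheus_sd_failed_configs: gauge, the current number of service discovery configurations that failed to load
Configuration count: a job may contain several service discovery configurations, so it contributes several configuration entries
Example:
prometheus_sd_failed_configs{name="scrape"}  10
prometheus_sd_failed_configs{name="notify"} 5

A job may map to several service discovery configuration entries. The job below configures both static_configs and file_sd_configs, so it has two configuration entries and registers two providers, each running in its own goroutine: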

scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: 'test'
    static_configs:
    - targets: ['localhost:9090']
    file_sd_configs:
    - refresh_interval: 5m
      files:
      - targets/manual.*.json

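3. Goroutine interaction metrics

Service discovery involves three kinds of goroutines:

Discoverer goroutines (several): wrapped by providers, they discover targets according to the protocol. There may be many of them, one per provider;
updater goroutines (one per provider, started together with the Discoverer goroutine): when a Discoverer goroutine discovers targets, it passes them over a channel to its updater goroutine, which stores them in the targets field of the Manager struct and then writes to the triggerSend channel of the Manager struct to tell the sender goroutine that the targets have changed;
sender goroutine (1): every 5 seconds the sender goroutine checks the triggerSend channel. If an update is pending, it processes the targets held by the Manager struct and writes the result to the syncCh channel of the Manager struct. The scrape module watches that channel, which completes the hand-off from the service discovery module to the scrape module.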
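One consequence of signalling through a channel and checking it on a timer is that several target updates arriving between two checks collapse into a single notification. The sketch below isolates that behavior, assuming the trigger channel has a buffer of one and is written with a non-blocking send; `notify` is a hypothetical helper name.

```go
package main

import "fmt"

// notify performs a non-blocking send and reports whether a new signal was
// queued. If a signal is already pending, the new one is dropped.
func notify(trigger chan struct{}) bool {
	select {
	case trigger <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	trigger := make(chan struct{}, 1)

	queued := 0
	for i := 0; i < 5; i++ { // five target updates arrive between two ticks
		if notify(trigger) {
			queued++
		}
	}
	fmt.Println("signals queued:", queued, "pending:", len(trigger))

	<-trigger // the sender consumes the single pending signal on its next tick
	fmt.Println("pending after tick:", len(trigger))
}
```

Because the sender always reads the full current target set when it sees a signal, dropping the extra signals loses no information.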

Three metrics are involved here:

prometheus_sd_received_updates_total
prometheus_sd_updates_delayed_total
prometheus_sd_updates_total

[Cloud Native • Prometheus] How Prometheus Eureka Service Discovery Works


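Overview

The Eureka service discovery protocol retrieves the targets that Prometheus should monitor through the Eureka REST API. Prometheus periodically calls the Eureka REST API and creates one target for each application instance.

The Eureka service discovery protocol makes the following meta labels available for relabeling: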

__meta_eureka_app_name: the name of the app
__meta_eureka_app_instance_id: the ID of the app instance
__meta_eureka_app_instance_hostname: the hostname of the instance
__meta_eureka_app_instance_homepage_url: the homepage url of the app instance
__meta_eureka_app_instance_statuspage_url: the status page url of the app instance
__meta_eureka_app_instance_healthcheck_url: the health check url of the app instance
__meta_eureka_app_instance_ip_addr: the IP address of the app instance
__meta_eureka_app_instance_vip_address: the VIP address of the app instance
__meta_eureka_app_instance_secure_vip_address: the secure VIP address of the app instance
__meta_eureka_app_instance_status: the status of the app instance
__meta_eureka_app_instance_port: the port of the app instance
__meta_eureka_app_instance_port_enabled: whether the port of the app instance is enabled
__meta_eureka_app_instance_secure_port: the secure port of the app instance
__meta_eureka_app_instance_secure_port_enabled: whether the secure port of the app instance is enabled
__meta_eureka_app_instance_country_id: the country ID of the app instance
__meta_eureka_app_instance_metadata_<metadataname>: app instance metadata
__meta_eureka_app_instance_datacenterinfo_name: the datacenter name of the app instance
__meta_eureka_app_instance_datacenterinfo_<metadataname>: the datacenter metadata

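A typical eureka_sd_configs configuration looks like this:

- job_name: 'eureka'
  eureka_sd_configs:
    - server: http://localhost:8761/eureka # Eureka server address
      refresh_interval: 1m # refresh interval, 30s by default

The main eureka_sd_configs options listed in the official documentation are: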

server: <string>

basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Refresh interval to re-read the app instance list.
[ refresh_interval: <duration> | default = 30s ]

Eureka Protocol Implementation

The core logic of Eureka service discovery lives in the func (d *Discovery) refresh(ctx context.Context) ([]*targetgroup.Group, error) method in discovery/eureka/eureka.go:

func (d *Discovery) refresh(ctx context.Context) ([]*targetgroup.Group, error) {
	// Pull the metadata from Eureka through its REST API: http://ip:port/eureka/apps
	apps, err := fetchApps(ctx, d.server, d.client)
	if err != nil {
		return nil, err
	}

	tg := &targetgroup.Group{
		Source: "eureka",
	}

	for _, app := range apps.Applications { // iterate over the apps
		// targetsForApp() converts each instance of the app into a target
		targets := targetsForApp(&app)
		// merge the parsed targets into one group
		tg.Targets = append(tg.Targets, targets...)
	}
	return []*targetgroup.Group{tg}, nil
}

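The refresh method has two main steps:

1. fetchApps(): pulls the registered service information from the /eureka/apps endpoint of eureka-server;

2. targetsForApp(): iterates over the instances of an app, turns each instance into a target, and attaches a set of meta labels.

The following is an example of the registered service information pulled from the /eureka/apps endpoint of eureka-server: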

<applications>
    <versions__delta>1</versions__delta>
    <apps__hashcode>UP_1_</apps__hashcode>
    <application>
        <name>SERVICE-PROVIDER-01</name>
        <instance>
            <instanceId>localhost:service-provider-01:8001</instanceId>
            <hostName>192.168.3.121</hostName>
            <app>SERVICE-PROVIDER-01</app>
            <ipAddr>192.168.3.121</ipAddr>
            <status>UP</status>
            <overriddenstatus>UNKNOWN</overriddenstatus>
            <port enabled="true">8001</port>
            <securePort enabled="false">443</securePort>
            <countryId>1</countryId>
            <dataCenterInfo class="com.netflix.appinfo.InstanceInfo$DefaultDataCenterInfo">
                <name>MyOwn</name>
            </dataCenterInfo>
            <leaseInfo>
                <renewalIntervalInSecs>30</renewalIntervalInSecs>
                <durationInSecs>90</durationInSecs>
                <registrationTimestamp>1629385562130</registrationTimestamp>
                <lastRenewalTimestamp>1629385682050</lastRenewalTimestamp>
                <evictionTimestamp>0</evictionTimestamp>
                <serviceUpTimestamp>1629385562132</serviceUpTimestamp>
            </leaseInfo>
            <metadata>
                <management.port>8001</management.port>
                <scrape__enable>true</scrape__enable>
                <scrape.port>8080</scrape.port>
            </metadata>
            <homePageUrl>http://192.168.3.121:8001/</homePageUrl>
            <statusPageUrl>http://192.168.3.121:8001/actuator/info</statusPageUrl>
            <healthCheckUrl>http://192.168.3.121:8001/actuator/health</healthCheckUrl>
            <vipAddress>service-provider-01</vipAddress>
            <secureVipAddress>service-provider-01</secureVipAddress>
            <isCoordinatingDiscoveryServer>false</isCoordinatingDiscoveryServer>
            <lastUpdatedTimestamp>1629385562132</lastUpdatedTimestamp>
            <lastDirtyTimestamp>1629385562039</lastDirtyTimestamp>
            <actionType>ADDED</actionType>
        </instance>
    </application>
</applications>
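A response of this shape can be decoded with Go's encoding/xml package. The sketch below uses hypothetical, trimmed-down struct types that cover only a few of the elements shown above, and it reads the port as a string for simplicity; the real types contain many more fields.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// The types below are hypothetical shapes for this sketch only.
type applications struct {
	Applications []application `xml:"application"`
}

type application struct {
	Name      string     `xml:"name"`
	Instances []instance `xml:"instance"`
}

type instance struct {
	HostName string    `xml:"hostName"`
	Port     *port     `xml:"port"`
	Metadata *metadata `xml:"metadata"`
}

// port captures both the element text and its "enabled" attribute.
type port struct {
	Value   string `xml:",chardata"`
	Enabled bool   `xml:"enabled,attr"`
}

// metadata collects arbitrary child elements, whatever their names are.
type metadata struct {
	Items []tag `xml:",any"`
}

type tag struct {
	XMLName xml.Name
	Content string `xml:",chardata"`
}

// parseApps decodes an /eureka/apps style XML document.
func parseApps(data []byte) (*applications, error) {
	var apps applications
	if err := xml.Unmarshal(data, &apps); err != nil {
		return nil, err
	}
	return &apps, nil
}

const sample = `<applications>
  <application>
    <name>SERVICE-PROVIDER-01</name>
    <instance>
      <hostName>192.168.3.121</hostName>
      <port enabled="true">8001</port>
      <metadata>
        <scrape.port>8080</scrape.port>
      </metadata>
    </instance>
  </application>
</applications>`

func main() {
	apps, err := parseApps([]byte(sample))
	if err != nil {
		panic(err)
	}
	inst := apps.Applications[0].Instances[0]
	fmt.Println(apps.Applications[0].Name, inst.HostName, inst.Port.Value, inst.Port.Enabled)
	for _, m := range inst.Metadata.Items {
		fmt.Println(m.XMLName.Local, "=", m.Content)
	}
}
```

The `,any` tag is what lets free-form metadata keys such as scrape.port be captured without declaring a field for each one.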

Each instance is parsed into a scrape target:

func targetsForApp(app *Application) []model.LabelSet {
	targets := make([]model.LabelSet, 0, len(app.Instances))

	// Gather info about the app's 'instances'. Each instance is considered a task.
	for _, t := range app.Instances {
		var targetAddress string
		// __address__ is built from instance.hostname and the port; without a port, port 80 is used.
		if t.Port != nil {
			targetAddress = net.JoinHostPort(t.HostName, strconv.Itoa(t.Port.Port))
		} else {
			targetAddress = net.JoinHostPort(t.HostName, "80")
		}

		target := model.LabelSet{
			model.AddressLabel:  lv(targetAddress),
			model.InstanceLabel: lv(t.InstanceID),

			appNameLabel:                     lv(app.Name),
			appInstanceHostNameLabel:         lv(t.HostName),
			appInstanceHomePageURLLabel:      lv(t.HomePageURL),
			appInstanceStatusPageURLLabel:    lv(t.StatusPageURL),
			appInstanceHealthCheckURLLabel:   lv(t.HealthCheckURL),
			appInstanceIPAddrLabel:           lv(t.IPAddr),
			appInstanceVipAddressLabel:       lv(t.VipAddress),
			appInstanceSecureVipAddressLabel: lv(t.SecureVipAddress),
			appInstanceStatusLabel:           lv(t.Status),
			appInstanceCountryIDLabel:        lv(strconv.Itoa(t.CountryID)),
			appInstanceIDLabel:               lv(t.InstanceID),
		}

		if t.Port != nil {
			target[appInstancePortLabel] = lv(strconv.Itoa(t.Port.Port))
			target[appInstancePortEnabledLabel] = lv(strconv.FormatBool(t.Port.Enabled))
		}

		if t.SecurePort != nil {
			target[appInstanceSecurePortLabel] = lv(strconv.Itoa(t.SecurePort.Port))
			target[appInstanceSecurePortEnabledLabel] = lv(strconv.FormatBool(t.SecurePort.Enabled))
		}

		if t.DataCenterInfo != nil {
			target[appInstanceDataCenterInfoNameLabel] = lv(t.DataCenterInfo.Name)

			if t.DataCenterInfo.Metadata != nil {
				for _, m := range t.DataCenterInfo.Metadata.Items {
					ln := strutil.SanitizeLabelName(m.XMLName.Local)
					target[model.LabelName(appInstanceDataCenterInfoMetadataPrefix+ln)] = lv(m.Content)
				}
			}
		}

		if t.Metadata != nil {
			for _, m := range t.Metadata.Items {
				// Prometheus label names allow only [a-zA-Z0-9_]; every other character is replaced with an underscore.
				ln := strutil.SanitizeLabelName(m.XMLName.Local)
				target[model.LabelName(appInstanceMetadataPrefix+ln)] = lv(m.Content)
			}
		}

		targets = append(targets, target)

	}
	return targets
}
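The effect of the sanitization step on metadata keys can be sketched on its own. The `sanitize` function below is an illustrative stand-in for strutil.SanitizeLabelName, written here as a regular-expression replacement of everything outside [a-zA-Z0-9_]; the prefix constant is copied from the meta label names listed earlier.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidLabelChar matches every character that is not allowed in a label name.
var invalidLabelChar = regexp.MustCompile(`[^a-zA-Z0-9_]`)

const metadataPrefix = "__meta_eureka_app_instance_metadata_"

// sanitize replaces each disallowed character with an underscore.
func sanitize(name string) string {
	return invalidLabelChar.ReplaceAllString(name, "_")
}

func main() {
	// Metadata keys taken from the sample /eureka/apps response above.
	for _, key := range []string{"management.port", "scrape__enable", "scrape.port"} {
		fmt.Println(metadataPrefix + sanitize(key))
	}
}
```

For the three keys above this yields labels ending in management_port, scrape__enable, and scrape_port: dots become underscores, and existing underscores are left alone.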

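The parsing is straightforward and is not analyzed further here. The resulting labels are shown in the figure below:

Three points about the labels deserve special mention:

1. __address__: built from instance.hostname and port (80 by default), so make sure the hostname registered with Eureka is correct; otherwise the target may not be scrapable;

2. metadata-map entries are converted into labels of the form __meta_eureka_app_instance_metadata_<metadataname>. Relabeling in Prometheus usually works on metadata-map, which can carry a custom metrics_path, scrape port, and so on;

3. Prometheus label names may contain only [a-zA-Z0-9_]; any other character is replaced with an underscore, see strutil.SanitizeLabelName(m.XMLName.Local). However, when a metadata-map key contains an underscore, it becomes a double underscore once registered on eureka-server, as with the following configuration: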

eureka:
  instance:
    metadata-map:
      scrape_enable: true
      scrape.port: 8080

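Fetched through /eureka/apps, the scrape_enable key appears as scrape__enable, as in the <metadata> element of the sample response shown earlier.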

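Summary

The principle of Eureka-based service discovery is shown in the figure below:

A Discoverer is created from the eureka_sd_configs configuration, and its Run method runs in a goroutine. The core Eureka discovery logic is contained in the func (d *Discovery) refresh(ctx context.Context) ([]*targetgroup.Group, error) method in discovery/eureka/eureka.go.

The refresh method mainly calls two functions:

1. fetchApps: periodically pulls the metadata of the registered services from the /eureka/apps endpoint of the Eureka server;

2. targetsForApp: parses the metadata pulled in the previous step, iterates over the instances of each app, turns each instance into a target, and converts the remaining metadata into meta labels on the target that can be used in relabel_configs.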
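To illustrate how such meta labels might be used during relabeling, the sketch below rewrites __address__ from the instance IP and a scrape port taken from metadata-map. `replaceLabel` is a hypothetical, simplified model of a "replace" relabeling step (join the source label values with ";", match a fully anchored regular expression, expand the replacement); it is not the Prometheus relabel package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// replaceLabel joins the source label values with ";", matches the fully
// anchored expression against the result, and on a match writes the expanded
// replacement into the target label. Without a match, labels stay unchanged.
func replaceLabel(labels map[string]string, sources []string, expr, target, replacement string) error {
	re, err := regexp.Compile("^(?:" + expr + ")$")
	if err != nil {
		return err
	}
	vals := make([]string, 0, len(sources))
	for _, s := range sources {
		vals = append(vals, labels[s])
	}
	joined := strings.Join(vals, ";")
	idx := re.FindStringSubmatchIndex(joined)
	if idx == nil {
		return nil
	}
	labels[target] = string(re.ExpandString(nil, replacement, joined, idx))
	return nil
}

func main() {
	// Label values taken from the sample instance shown earlier.
	labels := map[string]string{
		"__address__":                                     "192.168.3.121:8001",
		"__meta_eureka_app_instance_ip_addr":              "192.168.3.121",
		"__meta_eureka_app_instance_metadata_scrape_port": "8080",
	}
	sources := []string{
		"__meta_eureka_app_instance_ip_addr",
		"__meta_eureka_app_instance_metadata_scrape_port",
	}
	if err := replaceLabel(labels, sources, `(.+);(\d+)`, "__address__", "$1:$2"); err != nil {
		panic(err)
	}
	fmt.Println(labels["__address__"]) // 192.168.3.121:8080
}
```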
