当前位置：首页 > news >正文

Prometheus字段解析

news 2025/7/5 0:31:53

官方文档：：Configuration | Prometheus
1、全局配置指定在所有其他配置上下文中有效的参数。它们还用作其他配置部分的默认值。
global:# How frequently to scrape targets by default.
    默认情况下，定期抓取目标的频率是多久一次？  [ scrape_interval: <duration> | default = 1m ]

抓取请求超时的时间# How long until a scrape request times out. [ scrape_timeout: <duration> | default = 10s ]

评估规则的频率

 # How frequently to evaluate rules.[ evaluation_interval: <duration> | default = 1m ]

与外部系统（联邦、远程存储、Alertmanager）通信时，要添加到时间序列或警报的标签。示例配置：

external_labels:
- labelname1: labelvalue1
- labelname2: labelvalue2
# 这里可以继续添加其他标签

上述配置中，您可以替换labelname1和labelvalue1以及labelname2和labelvalue2，以匹配您实际的标签名称和值。这些标签将与时间序列或警报一起传递到外部系统。

用于记录PromQL查询的文件。重新加载配置会重新打开该文件。您可以将<string>替换为要用于记录PromQL查询的文件的路径。

  # File to which PromQL queries are logged.# Reloading the configuration will reopen the file.[ query_log_file: <string> ]

"未经压缩的响应主体超过指定的字节数将导致抓取失败。0 表示没有限制。示例：100MB。这是一个实验性功能，其行为可能在将来发生更改或被移除。"

 # An uncompressed response body larger than this many bytes will cause the# scrape to fail. 0 means no limit. Example: 100MB.# This is an experimental feature, this behaviour could# change or be removed in the future.[ body_size_limit: <size> | default = 0 ]

"每次抓取的采样数量上限，将会被接受。如果在指标重标记之后存在超过这个数量的采样，整个抓取将被视为失败。0 表示没有限制。"

# Per-scrape limit on number of scraped samples that will be accepted.# If more than this number of samples are present after metric relabeling# the entire scrape will be treated as failed. 0 means no limit.[ sample_limit: <int> | default = 0 ]

"每次抓取的样本可接受的标签数量上限。如果在度量数据重标记后存在超过此数量的标签，整个抓取将被视为失败。0 表示没有限制。"

# Per-scrape limit on number of labels that will be accepted for a sample. If# more than this number of labels are present post metric-relabeling, the# entire scrape will be treated as failed. 0 means no limit.[ label_limit: <int> | default = 0 ]

"每次抓取中，样本标签名称的长度上限。如果度量数据重标记后标签名称超过此数字，整个抓取将被视为失败。0 表示没有限制。"

 # Per-scrape limit on length of labels name that will be accepted for a sample.# If a label name is longer than this number post metric-relabeling, the entire# scrape will be treated as failed. 0 means no limit.[ label_name_length_limit: <int> | default = 0 ]

"每次抓取中，样本标签值的长度上限。如果度量数据重标记后标签值超过此数字，整个抓取将被视为失败。0 表示没有限制。"

 # Per-scrape limit on length of labels value that will be accepted for a sample.# If a label value is longer than this number post metric-relabeling, the# entire scrape will be treated as failed. 0 means no limit.[ label_value_length_limit: <int> | default = 0 ]

每次抓取的配置中，可接受的唯一目标数量上限。如果在目标重标记后存在超过此数量的目标，Prometheus将标记这些目标为失败，而不进行抓取。0 表示没有限制。这是一个实验性功能，其行为可能在将来发生更改。"

 # Per-scrape config limit on number of unique targets that will be# accepted. If more than this number of targets are present after target# relabeling, Prometheus will mark the targets as failed without scraping them.# 0 means no limit. This is an experimental feature, this behaviour could# change in the future.[ target_limit: <int> | default = 0 ]

"每次抓取配置中，通过重标记而被丢弃的目标数量上限，这些目标将被保存在内存中。0 表示没有限制。"

  # Limit per scrape config on the number of targets dropped by relabeling# that will be kept in memory. 0 means no limit.[ keep_dropped_targets: <int> | default = 0 ]

"规则文件指定了一组通配符。规则和警报将从所有匹配的文件中读取。"

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:[ - <filepath_glob> ... ]

"抓取配置文件指定了一组通配符。抓取配置将从所有匹配的文件中读取，并追加到抓取配置列表中。"

# Scrape config files specifies a list of globs. Scrape configs are read from
# all matching files and appended to the list of scrape configs.
scrape_config_files:[ - <filepath_glob> ... ]

"抓取配置的列表。"

# A list of scrape configurations.
scrape_configs:[ - <scrape_config> ... ]

"Alerting指定与Alertmanager相关的设置。其中包括告警重标记配置以及Alertmanager配置的列表。"

# Alerting specifies settings related to the Alertmanager.
alerting:alert_relabel_configs:[ - <relabel_config> ... ]alertmanagers:[ - <alertmanager_config> ... ]

"与远程写入功能相关的设置。其中包括远程写入配置的列表。"

# Settings related to the remote write feature.
remote_write:[ - <remote_write> ... ]

"与远程读取功能相关的设置。这部分包括远程读取配置的列表。"

# Settings related to the remote read feature.
remote_read:[ - <remote_read> ... ]

"与存储相关的设置，可以在运行时重新加载。这包括了TSDB（时间序列数据库）和样本存储配置的部分。"

# Storage related settings that are runtime reloadable.
storage:[ tsdb: <tsdb> ][ exemplars: <exemplars> ]

"配置用于导出跟踪数据的设置。这一部分包括了跟踪配置的部分。"

# Configures exporting traces.
tracing:[ <tracing_config> ]

2、<scrape_config>

一个scrape_config部分指定一组目标和描述如何抓取它们的参数。在一般情况下，一个抓取配置指定一个作业。在高级配置中，这可能会改变。

目标可以通过static_configs参数静态配置，也可以使用支持的服务发现机制之一动态发现。

此外，relabel_configs允许在抓取之前对任何目标及其标签进行高级修改。

"默认情况下分配给抓取的指标的作业名称。"

# The job name assigned to scraped metrics by default.
job_name: <job_name>

"从此作业抓取目标的频率。默认值是全局配置中的抓取间隔（scrape_interval）。"

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

"在抓取此作业时，每次抓取的超时时间。默认值是全局配置中的抓取超时（scrape_timeout）。"

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

"是否抓取经典直方图，这些直方图同时也作为本机直方图公开（如果没有启用--enable-feature=native-histograms，则没有效果）。默认值为false。"

# Whether to scrape a classic histogram that is also exposed as a native
# histogram (has no effect without --enable-feature=native-histograms).
[ scrape_classic_histograms: <boolean> | default = false ]

"从目标获取指标的HTTP资源路径。默认为 /metrics。"

# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]

"honor_labels" 控制 Prometheus 如何处理已存在于抓取数据中的标签与 Prometheus 会附加在服务器端的标签之间的冲突（例如 "job" 和 "instance" 标签、手动配置的目标标签以及由服务发现实现生成的标签）。

如果 "honor_labels" 设置为 "true"，标签冲突将通过保留来自抓取数据的标签值并忽略冲突的服务器端标签来解决。
如果 "honor_labels" 设置为 "false"，标签冲突将通过将抓取数据中的冲突标签重命名为 "exported_<original-label>"（例如 "exported_instance"、"exported_job"）然后附加服务器端标签来解决。

将 "honor_labels" 设置为 "true" 对于联邦和抓取 Pushgateway 等用例很有用，其中目标中指定的所有标签都应该被保留。

# honor_labels controls how Prometheus handles conflicts between labels that are
# already present in scraped data and labels that Prometheus would attach
# server-side ("job" and "instance" labels, manually configured target
# labels, and labels generated by service discovery implementations).
#
# If honor_labels is set to "true", label conflicts are resolved by keeping label
# values from the scraped data and ignoring the conflicting server-side labels.
#
# If honor_labels is set to "false", label conflicts are resolved by renaming
# conflicting labels in the scraped data to "exported_<original-label>" (for
# example "exported_instance", "exported_job") and then attaching server-side
# labels.
#
# Setting honor_labels to "true" is useful for use cases such as federation and
# scraping the Pushgateway, where all labels specified in the target should be
# preserved.

请注意，全局配置中的 "external_labels" 不受此设置影响。在与外部系统通信时，它们总是仅在时间序列尚未具有给定标签时应用，并在其他情况下被忽略。

#
# Note that any globally configured "external_labels" are unaffected by this
# setting. In communication with external systems, they are always applied only
# when a time series does not have a given label yet and are ignored otherwise.
[ honor_labels: <boolean> | default = false ]

"honor_timestamps" 控制 Prometheus 是否尊重抓取数据中的时间戳。

如果 "honor_timestamps" 设置为 "true"，则将使用目标公开的指标的时间戳。

# honor_timestamps controls whether Prometheus respects the timestamps present
# in scraped data.
#
# If honor_timestamps is set to "true", the timestamps of the metrics exposed
# by the target will be used.
#

如果 "honor_timestamps" 设置为 "false"，则将忽略目标公开的指标的时间戳。

还有其他配置选项：

"scheme" 配置请求使用的协议方案，默认为 "http"。
"params" 允许配置可选的HTTP URL参数，参数为键值对的形式。

# If honor_timestamps is set to "false", the timestamps of the metrics exposed
# by the target will be ignored.
[ honor_timestamps: <boolean> | default = true ]# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]# Optional HTTP URL parameters.
params:[ <string>: [<string>, ...] ]

这部分配置用于设置在每次抓取请求中使用配置的用户名和密码来设置 Authorization 标头。

"username"：配置用户名。
"password"：配置密码（注意，此处的密码是一个保密值）。
"password_file"：配置密码文件，其中包含了密码的内容。注意，"password" 和 "password_file" 是互斥的，只能使用其中一个。

这通常用于进行基本身份验证（Basic Authentication）以访问受密码保护的资源。

# Sets the `Authorization` header on every scrape request with the
# configured username and password.
# password and password_file are mutually exclusive.
basic_auth:[ username: <string> ][ password: <secret> ][ password_file: <string> ]

这部分配置用于在每次抓取请求中使用配置的凭据来设置 Authorization 标头。

"type"：设置请求的身份验证类型，默认为 "Bearer"。
"credentials"：设置请求的凭据（注意，这里的凭据是一个保密值）。
"credentials_file"：设置请求的凭据，使用从配置的文件中读取的凭据。请注意，"credentials" 和 "credentials_file" 是互斥的，只能使用其中一个。

这通常用于进行身份验证以访问受保护的资源。

# Sets the `Authorization` header on every scrape request with
# the configured credentials.
authorization:# Sets the authentication type of the request.[ type: <string> | default: Bearer ]# Sets the credentials of the request. It is mutually exclusive with# `credentials_file`.[ credentials: <secret> ]# Sets the credentials of the request with the credentials read from the# configured file. It is mutually exclusive with `credentials`.[ credentials_file: <filename> ]

这部分包括了一些进一步的配置：

"oauth2"：用于配置OAuth 2.0的可选配置。请注意，它不能与 "basic_auth" 或 "authorization" 同时使用。
"follow_redirects"：配置抓取请求是否应该遵循HTTP的3xx重定向，默认为 true。
"enable_http2"：配置是否启用HTTP2，默认为 true。
"tls_config"：用于配置抓取请求的TLS（传输层安全性）设置的部分。

# Optional OAuth 2.0 configuration.
# Cannot be used at the same time as basic_auth or authorization.
oauth2:[ <oauth2> ]# Configure whether scrape requests follow HTTP 3xx redirects.
[ follow_redirects: <boolean> | default = true ]# Whether to enable HTTP2.
[ enable_http2: <boolean> | default: true ]# Configures the scrape request's TLS settings.
tls_config:[ <tls_config> ]

这部分包括了与代理配置相关的设置：

"proxy_url"：可选的代理URL。
"no_proxy"：逗号分隔的字符串，可以包含要排除在代理之外的IP、CIDR表示法、域名，包括可能包含端口号的IP和域名。
"proxy_from_environment"：是否使用环境变量指定的代理URL（如HTTP_PROXY、https_proxy、HTTPs_PROXY、https_proxy和no_proxy）。默认为 false。
"proxy_connect_header"：在连接请求（CONNECT requests）期间发送到代理的标头。可以配置多个标头，每个标头包括一个字符串键和一个保密值。

# Configures the scrape request's TLS settings.
tls_config:[ <tls_config> ]# Optional proxy URL.
[ proxy_url: <string> ]
# Comma-separated string that can contain IPs, CIDR notation, domain names
# that should be excluded from proxying. IP and domain names can
# contain port numbers.
[ no_proxy: <string> ]
# Use proxy URL indicated by environment variables (HTTP_PROXY, https_proxy, HTTPs_PROXY, https_proxy, and no_proxy)
[ proxy_from_environment: <boolean> | default: false ]
# Specifies headers to send to proxies during CONNECT requests.
[ proxy_connect_header:[ <string>: [<secret>, ...] ] ]

这部分包括了多种服务发现配置选项，用于从各种不同的源中发现目标，包括云服务提供商、容器管理系统、DNS、HTTP等。以下是其中一些配置的示例：

"azure_sd_configs"：Azure服务发现配置。
"consul_sd_configs"：Consul服务发现配置。
"docker_sd_configs"：Docker服务发现配置。
"dns_sd_configs"：DNS服务发现配置。
"ec2_sd_configs"：Amazon EC2服务发现配置。
"kubernetes_sd_configs"：Kubernetes服务发现配置。
"file_sd_configs"：文件服务发现配置，用于从静态文件中发现目标。
"static_configs"：静态配置，手动指定的目标列表。
"relabel_configs"：目标重标记配置，可用于修改或筛选目标标签。
"metric_relabel_configs"：度量数据重标记配置，可用于修改或筛选指标标签。

# List of Azure service discovery configurations.
azure_sd_configs:[ - <azure_sd_config> ... ]# List of Consul service discovery configurations.
consul_sd_configs:[ - <consul_sd_config> ... ]# List of DigitalOcean service discovery configurations.
digitalocean_sd_configs:[ - <digitalocean_sd_config> ... ]# List of Docker service discovery configurations.
docker_sd_configs:[ - <docker_sd_config> ... ]# List of Docker Swarm service discovery configurations.
dockerswarm_sd_configs:[ - <dockerswarm_sd_config> ... ]# List of DNS service discovery configurations.
dns_sd_configs:[ - <dns_sd_config> ... ]# List of EC2 service discovery configurations.
ec2_sd_configs:[ - <ec2_sd_config> ... ]# List of Eureka service discovery configurations.
eureka_sd_configs:[ - <eureka_sd_config> ... ]# List of file service discovery configurations.
file_sd_configs:[ - <file_sd_config> ... ]# List of GCE service discovery configurations.
gce_sd_configs:[ - <gce_sd_config> ... ]# List of Hetzner service discovery configurations.
hetzner_sd_configs:[ - <hetzner_sd_config> ... ]# List of HTTP service discovery configurations.
http_sd_configs:[ - <http_sd_config> ... ]# List of IONOS service discovery configurations.
ionos_sd_configs:[ - <ionos_sd_config> ... ]# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:[ - <kubernetes_sd_config> ... ]# List of Kuma service discovery configurations.
kuma_sd_configs:[ - <kuma_sd_config> ... ]# List of Lightsail service discovery configurations.
lightsail_sd_configs:[ - <lightsail_sd_config> ... ]# List of Linode service discovery configurations.
linode_sd_configs:[ - <linode_sd_config> ... ]# List of Marathon service discovery configurations.
marathon_sd_configs:[ - <marathon_sd_config> ... ]# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:[ - <nerve_sd_config> ... ]# List of Nomad service discovery configurations.
nomad_sd_configs:[ - <nomad_sd_config> ... ]# List of OpenStack service discovery configurations.
openstack_sd_configs:[ - <openstack_sd_config> ... ]# List of OVHcloud service discovery configurations.
ovhcloud_sd_configs:[ - <ovhcloud_sd_config> ... ]# List of PuppetDB service discovery configurations.
puppetdb_sd_configs:[ - <puppetdb_sd_config> ... ]# List of Scaleway service discovery configurations.
scaleway_sd_configs:[ - <scaleway_sd_config> ... ]# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:[ - <serverset_sd_config> ... ]# List of Triton service discovery configurations.
triton_sd_configs:[ - <triton_sd_config> ... ]# List of Uyuni service discovery configurations.
uyuni_sd_configs:[ - <uyuni_sd_config> ... ]# List of labeled statically configured targets for this job.
static_configs:[ - <static_config> ... ]# List of target relabel configurations.
relabel_configs:[ - <relabel_config> ... ]# List of metric relabel configurations.
metric_relabel_configs:[ - <relabel_config> ... ]

这两部分的配置用于控制抓取过程中的一些限制：

"body_size_limit"：未经压缩的响应主体大小上限，超过此大小的响应将导致抓取失败。0 表示没有限制。这是一个实验性功能，其行为可能在将来发生更改或被移除。
"sample_limit"：每次抓取允许接受的采样数量上限，如果经过度量数据重标记后存在超过此数量的采样，整个抓取将被视为失败。0 表示没有限制。

# An uncompressed response body larger than this many bytes will cause the
# scrape to fail. 0 means no limit. Example: 100MB.
# This is an experimental feature, this behaviour could
# change or be removed in the future.
[ body_size_limit: <size> | default = 0 ]# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabeling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]

这些配置项控制了在抓取过程中的多个限制和限额：

"label_limit"：每次抓取中，每个样本可接受的标签数量上限。如果度量数据重标记后存在超过此数量的标签，整个抓取将被视为失败。0 表示没有限制。
"label_name_length_limit"：每次抓取中，每个样本的标签名称长度上限。如果度量数据重标记后的标签名称超过此数字，整个抓取将被视为失败。0 表示没有限制。
"label_value_length_limit"：每次抓取中，每个样本的标签值长度上限。如果度量数据重标记后的标签值超过此数字，整个抓取将被视为失败。0 表示没有限制。
"target_limit"：每次抓取配置中，每个唯一目标数量上限。如果经过目标重标记后存在超过此数量的目标，Prometheus将标记这些目标为失败，而不进行抓取。0 表示没有限制。这是一个实验性功能，其行为可能在将来发生更改。
"keep_dropped_targets"：每次抓取配置中，通过重标记而被丢弃的目标数量上限，这些目标将被保存在内存中。0 表示没有限制。
"native_histogram_bucket_limit"：单个本机直方图中允许的正和负桶的总数上限。如果超过此限制，整个抓取将被视为失败。0 表示没有限制。

# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]# Per-scrape config limit on number of unique targets that will be
# accepted. If more than this number of targets are present after target
# relabeling, Prometheus will mark the targets as failed without scraping them.
# 0 means no limit. This is an experimental feature, this behaviour could
# change in the future.
[ target_limit: <int> | default = 0 ]# Per-job limit on the number of targets dropped by relabeling
# that will be kept in memory. 0 means no limit.
[ keep_dropped_targets: <int> | default = 0 ]# Limit on total number of positive and negative buckets allowed in a single
# native histogram. If this is exceeded, the entire scrape will be treated as
# failed. 0 means no limit.
[ native_histogram_bucket_limit: <int> | default = 0 ]

3、<alertmanager_config>

一个alertmanager_config部分指定 Prometheus 服务器向其发送警报的 Alertmanager 实例。它还提供参数来配置如何与这些警报管理器进行通信。

警报管理器可以通过static_configs参数静态配置，也可以使用支持的服务发现机制之一动态发现。

此外，relabel_configs允许从发现的实体中选择警报管理器，并对使用的 API 路径（通过标签公开）进行高级修改__alerts_path__。

"推送警报时，每个目标的Alertmanager超时时间。默认为10秒。"

# Per-target Alertmanager timeout when pushing alerts.
[ timeout: <duration> | default = 10s ]

Alertmanager的API版本。默认为 v2。"

# The api version of Alertmanager.
[ api_version: <string> | default = v2 ]

"用于推送警报的HTTP路径的前缀。默认为 /。"

# Prefix for the HTTP path alerts are pushed to.
[ path_prefix: <path> | default = / ]

"配置用于请求的协议方案。默认为 http。"

# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]

这部分包括了两个部分：

"static_configs"：静态配置，包括标记的手动配置的Alertmanager列表。
"relabel_configs"：Alertmanager重标记配置，可用于修改或筛选Alertmanager的标签。

# List of labeled statically configured Alertmanagers.
static_configs:[ - <static_config> ... ]# List of Alertmanager relabel configurations.
relabel_configs:[ - <relabel_config> ... ]

二、告警规则

这部分配置用于定义一组规则（alerts或者recording rules）：

"name"：规则组的名称，必须在文件内唯一。
"interval"：规则组中的规则被评估的频率。默认值是全局配置中的 "evaluation_interval"。
"limit"：限制告警规则和记录规则可以生成的警报数和时间序列数。0 表示没有限制。
"rules"：规则组中包括的规则的列表。每个规则都包括在 "- <rule> ..." 中。规则可以是告警规则或记录规则。规则定义了 Prometheus 如何处理指标数据以生成告警或创建新的时间序列。

# The name of the group. Must be unique within a file.
name: <string># How often rules in the group are evaluated.
[ interval: <duration> | default = global.evaluation_interval ]# Limit the number of alerts an alerting rule and series a recording
# rule can produce. 0 is no limit.
[ limit: <int> | default = 0 ]rules:[ - <rule> ... ]

这部分配置用于定义记录规则：

"record"：记录规则的名称，用于输出的时间序列。必须是有效的度量名称。
"expr"：PromQL表达式，用于在每次评估周期中计算。该表达式在当前时间评估，并将结果记录为具有由 'record' 指定的度量名称的新的时间序列。
"labels"：在存储结果之前要添加或覆盖的标签。标签以键-值对的形式定义，允许您为记录规则生成的时间序列添加自定义标签。

# The name of the time series to output to. Must be a valid metric name.
record: <string># The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string># Labels to add or overwrite before storing the result.
labels:[ <labelname>: <labelvalue> ]

警报规则的语法为：

这部分配置用于定义告警规则：

"alert"：告警规则的名称，必须是有效的标签值。
"expr"：PromQL表达式，用于在每次评估周期中计算。每次评估周期中，该表达式在当前时间进行计算，并生成所有结果时间序列作为待处理/触发的告警。
"for"：告警被视为触发后，它们持续触发的时间。在这段时间内，告警将被视为触发状态。默认值为0秒，表示告警立即触发。
"keep_firing_for"：告警规则的触发条件解除后，告警将继续触发的时间。在这段时间内，告警将保持在触发状态。默认值为0秒，表示一旦条件解除，告警将立即停止触发。
"labels"：为每个告警添加或覆盖的标签。这允许您为触发的告警添加自定义标签。
"annotations"：要添加到每个告警的注释。注释通常包含有关告警的额外信息，如描述或相关文档的链接。

# The name of the alert. Must be a valid label value.
alert: <string># The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string># Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]# How long an alert will continue firing after the condition that triggered it
# has cleared.
[ keep_firing_for: <duration> | default = 0s ]# Labels to add or overwrite for each alert.
labels:[ <labelname>: <tmpl_string> ]# Annotations to add to each alert.
annotations:[ <labelname>: <tmpl_string> ]

三、定义告警规则

定义警报规则

Prometheus 中配置警报规则的方式与记录规则相同。

带有警报的规则文件示例如下：

groups:
- name: examplerules:- alert: HighRequestLatencyexpr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5for: 10mlabels:severity: pageannotations:summary: High request latency

可选for子句使 Prometheus 在第一次遇到新的表达式输出向量元素和将警报计为触发该元素之间等待一定的持续时间。在这种情况下，Prometheus 将在每次评估期间检查警报是否继续处于活动状态 10 分钟，然后再触发警报。处于活动状态但尚未触发的元素处于挂起状态。没有该子句的警报规则for将在第一次评估时激活。

该labels子句允许指定一组附加到警报的附加标签。任何现有的冲突标签都将被覆盖。标签值可以模板化。

该annotations子句指定一组信息标签，可用于存储较长的附加信息，例如警报描述或 Runbook 链接。注释值可以被模板化。

四、警报管理器

这部分配置定义了 "ResolveTimeout"，它是Alertmanager的默认值，用于处理不包含 "EndsAt" 的告警。如果经过了指定的时间（由 "resolve_timeout" 指定），并且告警没有被更新，Alertmanager 可以将该告警视为已解决。这个配置对于Prometheus生成的告警没有影响，因为它们始终包含 "EndsAt"。默认情况下，"resolve_timeout" 的值为5分钟。

  # ResolveTimeout is the default value used by alertmanager if the alert does# not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.# This has no impact on alerts from Prometheus, as they always include EndsAt.[ resolve_timeout: <duration> | default = 5m ]

这部分配置用于定义抑制规则（inhibit rules）。抑制规则允许您定义在满足某些条件时，抑制或限制某些告警的触发。每个抑制规则包括在 "- <inhibit_rule> ..." 中定义。抑制规则允许您定义哪些告警可以抑制其他告警的触发，以便在某些情况下减少噪音或避免洪水式的告警。

# A list of inhibition rules.
inhibit_rules:[ - <inhibit_rule> ... ]

这部分配置涉及静音时间间隔，用于将路由静音。请注意，此部分中的配置已被标记为“DEPRECATED”（不推荐使用），建议使用下面的 "time_intervals" 配置。静音时间间隔允许您在指定的时间段内将某些路由静音，以防止在某些时间段触发不必要的通知。虽然 "mute_time_intervals" 仍然可用，但不建议使用，因为已经有了更好的配置选项。

# DEPRECATED: use time_intervals below.
# A list of mute time intervals for muting routes.
mute_time_intervals:[ - <mute_time_interval> ... ]

这部分配置用于定义时间间隔，以控制路由的静音（mute）和激活（activate）。您可以列出多个时间间隔，在这些时间段内，某些路由可以被静音或激活，以管理通知的触发。这是一种更灵活和推荐的方式，用于控制路由的状态。每个时间间隔通过 "- <time_interval> ..." 来定义。



# A list of time intervals for muting/activating routes.
time_intervals:[ - <time_interval> ... ]

路由相关设置允许配置警报如何根据时间路由、聚合、限制和静音。

`<route>`

路由块定义路由树中的节点及其子节点。如果未设置，其可选配置参数将从其父节点继承。

每个警报都会在配置的顶级路由处进入路由树，该路由必须匹配所有警报（即没有任何配置的匹配器）。然后遍历子节点。如果continue设置为 false，则在第一个匹配的子项之后停止。如果continue在匹配节点上为 true，则警报将继续与后续兄弟节点匹配。如果警报与节点的任何子节点都不匹配（没有匹配的子节点，或不存在），则根据当前节点的配置参数处理警报。

这部分配置定义了警报路由规则。以下是每个字段的含义：

"route"：根路由，具有所有参数，这些参数将被子路由继承，除非它们被覆盖。
"receiver"：默认接收器，用于根路由。当未匹配任何子路由时，将使用默认接收器。
"group_wait"：在将一组警报发送到接收器之前等待的时间。
"group_interval"：将具有相同标签的一组警报发送到接收器的间隔时间。
"repeat_interval"：在发送重复的警报之前等待的时间。
"group_by"：分组警报的标签。在根路由中定义的 "group_by" 标签也会应用于子路由。
"routes"：子路由列表，用于根路由。子路由根据匹配条件将警报发送到不同的接收器。
"receiver"：子路由的接收器。
"group_wait"：子路由的组等待时间。
"matchers"：匹配条件，根据这些条件来决定是否将警报发送到此子路由的接收器。
"group_by"：子路由的分组标签，覆盖根路由中的 "group_by" 标签。
"mute_time_intervals"：子路由的静音时间间隔，规定了何时将子路由静音，以防止触发通知。
"active_time_intervals"：子路由的激活时间间隔，规定了何时激活子路由，以允许通知的触发。
"continue"：如果设置为 true，匹配此子路由的警报将继续到下一个子路由。否则，它们将在此子路由中结束。

这些路由规则用于根据匹配条件将警报发送到不同的接收器，以实现灵活的通知管理。

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:receiver: 'default-receiver'group_wait: 30sgroup_interval: 5mrepeat_interval: 4hgroup_by: [cluster, alertname]# All alerts that do not match the following child routes# will remain at the root node and be dispatched to 'default-receiver'.routes:# All alerts with service=mysql or service=cassandra# are dispatched to the database pager.- receiver: 'database-pager'group_wait: 10smatchers:- service=~"mysql|cassandra"# All alerts with the team=frontend label match this sub-route.# They are grouped by product and environment rather than cluster# and alertname.- receiver: 'frontend-pager'group_by: [product, environment]matchers:- team="frontend"# All alerts with the service=inhouse-service label match this sub-route.# the route will be muted during offhours and holidays time intervals.# even if it matches, it will continue to the next sub-route- receiver: 'dev-pager'matchers:- service="inhouse-service"mute_time_intervals:- offhours- holidayscontinue: true# All alerts with the service=inhouse-service label match this sub-route# the route will be active only during offhours and holidays time intervals.- receiver: 'on-call-pager'matchers:- service="inhouse-service"active_time_intervals:- offhours- holidays

抑制允许根据另一组警报的存在来静音一组警报。这允许在系统或服务之间建立依赖关系，以便在中断期间仅发送一组互连警报中最相关的警报。

`<inhibit_rule>`

当存在与另一组匹配器匹配的警报（源）时，禁止规则会静音与一组匹配器匹配的警报（目标）。目标警报和源警报在列表中的标签名称必须具有相同的标签值equal。

从语义上讲，缺失标签和具有空值的标签是同一件事。equal因此，如果源警报和目标警报中都缺少列出的所有标签名称，则将应用禁止规则。

为了防止警报抑制自身，与规则的目标端和源端均匹配的警报不能被同样适用的警报（包括其自身）抑制。但是，我们建议选择目标和源匹配器，以使警报永远不会与双方匹配。推理起来要容易得多，并且不会触发这种特殊情况。

这部分配置涉及到警报的匹配条件以执行抑制（inhibition）。以下是每个字段的含义：

"target_match"：已不推荐使用，用于指定在警报中必须满足的匹配条件，以便进行抑制。
"target_match_re"：已不推荐使用，用于指定在警报中必须满足的正则表达式匹配条件，以便进行抑制。
"target_matchers"：一个匹配器列表，用于指定在目标警报中必须满足的匹配条件，以便执行抑制。
"source_match"：已不推荐使用，用于指定在警报中必须存在的匹配条件，以便执行抑制。
"source_match_re"：已不推荐使用，用于指定在警报中必须存在的正则表达式匹配条件，以便执行抑制。
"source_matchers"：一个匹配器列表，用于指定在源警报中必须满足的匹配条件，以便执行抑制。
"equal"：标签的列表，这些标签的值必须在源和目标警报中相等，以便执行抑制。

这些字段的作用是定义何时执行抑制操作，抑制操作允许您根据警报的匹配条件来控制通知的触发。建议使用 "target_matchers" 和 "source_matchers" 字段来指定匹配条件，而不是不再推荐使用的 "target_match" 和 "source_match" 字段。

# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:[ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:[ <labelname>: <regex>, ... ]# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:[ - <matcher> ... ]# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:[ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:[ <labelname>: <regex>, ... ]# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:[ - <matcher> ... ]# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]

Prometheus字段解析

定义警报规则

`<route>`

`<inhibit_rule>`

相关文章：