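ZooKeeper Monitor Guide
New Metrics System
The New Metrics System has been available since 3.6.0. It provides a rich set of metrics to help users monitor ZooKeeper, covering znodes, network, disk, quorum, leader election, clients, security, failures, watches/sessions, request processors, and so forth.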
Metrics
All the metrics are included in ServerMetrics.java.
Prometheus
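- Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
- Pre-requisites:
  - Enable the Prometheus MetricsProvider by setting metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider in zoo.cfg.
  - The port can also be configured by setting metricsProvider.httpPort (default: 7000).
  - Install Prometheus: go to the download page on the official website and download the latest release.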
- Set Prometheus's scraper to target the ZooKeeper cluster endpoints:

      cat > /tmp/test-zk.yaml <<EOF
      global:
        scrape_interval: 10s
      scrape_configs:
        - job_name: test-zk
          static_configs:
          - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
      EOF
      cat /tmp/test-zk.yaml
- Start Prometheus with that configuration:

      nohup /tmp/prometheus \
          --config.file /tmp/test-zk.yaml \
          --web.listen-address ":9090" \
          --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log 2>&1 &
- Prometheus will now scrape the ZooKeeper metrics every 10 seconds.
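The prerequisites and the scrape check above can be sketched as two small shell helpers. This is a minimal sketch rather than official tooling: the function names `enable_prometheus_provider` and `metric_value` are hypothetical, the zoo.cfg path depends on your installation, and the usage comment assumes the provider serves the Prometheus text format at `/metrics` on the configured port.

```shell
# Hypothetical helper: append the Prometheus provider settings to a zoo.cfg
# file unless a metricsProvider.className line is already present.
enable_prometheus_provider() {
    cfg="$1"
    port="${2:-7000}"   # 7000 is the documented default port
    if grep -q '^metricsProvider\.className=' "$cfg"; then
        return 0        # already configured; leave the file untouched
    fi
    {
        echo ''         # guard against a file that lacks a trailing newline
        echo 'metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider'
        echo "metricsProvider.httpPort=$port"
    } >> "$cfg"
}

# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the value of the first sample whose name matches exactly. Samples that carry
# labels (name{label="..."}) are not matched by this simple comparison.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example usage (assumes the server was restarted with the new zoo.cfg):
#   enable_prometheus_provider /path/to/conf/zoo.cfg
#   curl -s http://192.168.10.32:7000/metrics | metric_value znode_count
```

Checking one endpoint by hand like this is a quick way to confirm the provider is up before looking for the target in the Prometheus UI.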
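Alerting with Prometheus
- We recommend reading the official Prometheus alerting documentation to learn some alerting principles.
- We recommend using Prometheus Alertmanager, which makes it easier to receive alerts by email or instant message (via webhook).
- We provide an alerting example covering the metrics that deserve special attention. Note: this is for reference only; adjust it to your actual workload and resource environment.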
- Use ./promtool check rules rules/zk.yml to check the correctness of the config file:

      cat rules/zk.yml

      groups:
      - name: zk-alert-example
        rules:
        - alert: ZooKeeper server is down
          expr: up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
            description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."

        - alert: create too many znodes
          expr: znode_count > 1000000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many znodes"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."

        - alert: create too many connections
          expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many connections"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."

        - alert: znode total occupied memory is too big
          expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB(1 GB)
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
            description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."

        - alert: set too many watch
          expr: watch_count > 10000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} set too many watch"
            description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."

        - alert: a leader election happens
          expr: increase(election_time_count[5m]) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} a leader election happens"
            description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."

        - alert: open too many files
          expr: open_file_descriptor_count > 300
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} open too many files"
            description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."

        - alert: fsync time is too long
          expr: rate(fsynctime_sum[1m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} fsync time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."

        - alert: take snapshot time is too long
          expr: rate(snapshottime_sum[5m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} take snapshot time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."

        - alert: avg latency is too high
          expr: avg_latency > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} avg latency is too high"
            description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM memory filling up (instance {{ $labels.instance }})"
            description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n"
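A rule file only takes effect once the Prometheus configuration loads it. The sketch below extends the earlier scrape configuration with `rule_files` and `alerting` sections. The function name `write_zk_prometheus_config` is hypothetical, and the rule file path and the Alertmanager address `localhost:9093` are assumptions to adapt to your setup.

```shell
# Hypothetical helper: write a Prometheus config that scrapes the ZooKeeper
# endpoints, loads the alert rules, and forwards alerts to Alertmanager.
write_zk_prometheus_config() {
    out="$1"
    rules="${2:-rules/zk.yml}"            # assumed rule file location
    alertmanager="${3:-localhost:9093}"   # assumed Alertmanager address
    cat > "$out" <<EOF
global:
  scrape_interval: 10s
  evaluation_interval: 10s
rule_files:
  - $rules
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['$alertmanager']
scrape_configs:
  - job_name: test-zk
    static_configs:
      - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
EOF
}

# Example usage (assumes promtool sits next to the prometheus binary):
#   write_zk_prometheus_config /tmp/test-zk.yaml
#   ./promtool check config /tmp/test-zk.yaml
```

A relative `rule_files` entry is resolved against the directory of the configuration file, so with the paths above the rules would need to live at /tmp/rules/zk.yml.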
Grafana
- Grafana has built-in Prometheus support; just add a Prometheus data source:

      Name:   test-zk
      Type:   Prometheus
      Url:    http://localhost:9090
      Access: proxy
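- Then download and import the default ZooKeeper dashboard template and customize it.
- Users who have improvements to contribute can request a Grafana dashboard account by emailing dev@zookeeper.apache.org.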
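As an alternative to adding the data source through the UI, Grafana can also read data sources from provisioning files at startup. The sketch below writes such a file with the same settings as the data source listed above. The function name `write_grafana_datasource` is hypothetical, and the provisioning directory in the usage comment is an assumption that varies by installation.

```shell
# Hypothetical helper: write a Grafana data source provisioning file that
# mirrors the test-zk data source described above.
write_grafana_datasource() {
    out="$1"
    url="${2:-http://localhost:9090}"   # Prometheus address from this guide
    cat > "$out" <<EOF
apiVersion: 1
datasources:
  - name: test-zk
    type: prometheus
    access: proxy
    url: $url
EOF
}

# Example usage (directory is an assumption; check your Grafana install):
#   write_grafana_datasource /etc/grafana/provisioning/datasources/test-zk.yaml
```

Keeping the data source in a file makes the setup reproducible across Grafana instances; the dashboard template can still be imported by hand.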
InfluxDB
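InfluxDB is an open source time series database that is often used to store metrics from ZooKeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting metrics from your ZooKeeper clusters and storing them in your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configuration and a dashboard to get you set up right away.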
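A Telegraf configuration for this setup might look like the sketch below. It is illustrative only: the function name `write_telegraf_zk_config` is hypothetical, the server addresses reuse the example hosts from the Prometheus section with the default client port 2181, and the InfluxDB URL, organization, bucket, and the `INFLUX_TOKEN` environment variable are placeholders.

```shell
# Hypothetical helper: write a Telegraf config that polls ZooKeeper and
# writes the collected metrics to InfluxDB.
write_telegraf_zk_config() {
    out="$1"
    # The quoted 'EOF' keeps $INFLUX_TOKEN literal so Telegraf, not the
    # shell, substitutes it from the environment at startup.
    cat > "$out" <<'EOF'
# Poll each ZooKeeper server on its client port (addresses are assumptions).
[[inputs.zookeeper]]
  servers = ["192.168.10.32:2181", "192.168.10.33:2181", "192.168.10.34:2181"]

# Write to InfluxDB 2.x or InfluxDB Cloud; every value below is a placeholder.
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "zookeeper"
EOF
}

# Example usage:
#   write_telegraf_zk_config /tmp/telegraf-zk.conf
#   telegraf --config /tmp/telegraf-zk.conf
```

The Telegraf ZooKeeper input gathers its data through the `mntr` four letter word, so that command most likely needs to be allowed by `4lw.commands.whitelist` in zoo.cfg for the plugin to return data.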
JMX
More details can be found in the ZooKeeper JMX documentation.
Four letter words
More details can be found in the four letter words section of the ZooKeeper Administrator's Guide.
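For a quick look at a single server without any monitoring stack, the `mntr` four letter word can be queried directly. A minimal sketch, assuming `nc` is installed and `mntr` is allowed by `4lw.commands.whitelist` in zoo.cfg; the helper name `mntr_value` is hypothetical.

```shell
# Hypothetical helper: print the value of one key from mntr output on stdin.
# mntr prints one "key<TAB>value" pair per line.
mntr_value() {
    awk -F'\t' -v key="$1" '$1 == key { print $2; exit }'
}

# Example usage (host and port are assumptions):
#   echo mntr | nc localhost 2181 | mntr_value zk_avg_latency
```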