🧭 High-Level Categories

| Layer | Focus |
| --- | --- |
| Broker | Cluster health, request load |
| Producer | Message throughput, retries |
| Consumer | Lag, processing rates |
| Topic/Partition | Distribution, replication |
| JVM/System | Heap, GC, file descriptors |

🔧 1. Kafka Broker Metrics

These are crucial for cluster stability and performance:

| Metric | Description |
| --- | --- |
| kafka.server.BrokerTopicMetrics.MessagesInPerSec | Number of messages received per second |
| kafka.server.BrokerTopicMetrics.BytesInPerSec / BytesOutPerSec | Ingress and egress throughput in bytes per second |
| kafka.server.ReplicaManager.PartitionCount | Number of partitions hosted on the broker |
| kafka.server.ReplicaManager.UnderReplicatedPartitions ⚠️ | Should be 0; a non-zero value indicates replication problems |
| kafka.controller.KafkaController.ActiveControllerCount | Should sum to exactly 1 across the cluster |
| kafka.controller.ControllerStats.UncleanLeaderElectionsPerSec ⚠️ | Should be 0; unclean elections can cause data loss |
| kafka.network.RequestMetrics.RequestsPerSec (per request type) | Traffic volume for Produce, Fetch, etc. |
| kafka.server.ReplicaFetcherManager.MaxLag | Maximum lag, in messages, between follower and leader replicas |
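
The dotted names above are shorthand. Over JMX, Kafka exposes these metrics as MBeans named in the form `domain:type=...,name=...`. A minimal sketch of that translation follows; the helper `jmx_object_name` is hypothetical, and some MBeans carry extra tags (for example a version tag on request metrics in newer releases) that this sketch does not add on its own.

```python
def jmx_object_name(dotted: str, **tags: str) -> str:
    """Turn a dotted metric name into a JMX ObjectName string.

    The last two dot-separated parts are treated as the MBean type and
    name; everything before them is the JMX domain.
    """
    domain, mbean_type, name = dotted.rsplit(".", 2)
    parts = [f"type={mbean_type}", f"name={name}"]
    # Optional extra key properties, e.g. request="Produce".
    parts += [f"{key}={value}" for key, value in tags.items()]
    return f"{domain}:" + ",".join(parts)


print(jmx_object_name("kafka.server.BrokerTopicMetrics.MessagesInPerSec"))
print(jmx_object_name("kafka.network.RequestMetrics.RequestsPerSec", request="Produce"))
```

Strings in this form are what JMX-based collectors typically match against when selecting which MBeans to scrape.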

🧑‍💻 2. Producer Metrics

Monitored per client/application; key for write reliability and performance.

| Metric | Description |
| --- | --- |
| record-send-rate | Records sent per second |
| record-retry-rate ⚠️ | Retried record sends per second, usually caused by transient errors |
| record-error-rate ⚠️ | Record sends per second that resulted in errors |
| request-latency-avg | Average time to complete a request, in milliseconds |
| batch-size-avg | Average batch size in bytes |
| compression-rate-avg | Average ratio of compressed to uncompressed batch size (lower means better compression) |
| bufferpool-wait-time-total | Total time the producer has waited for buffer space |
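
Raw retry and error rates are easier to judge relative to the send rate. A small sketch, assuming a snapshot of producer metrics has already been collected into a plain dictionary keyed by the metric names above; the function name and the returned keys are illustrative, not part of any Kafka client API.

```python
def producer_ratios(metrics: dict[str, float]) -> dict[str, float]:
    """Express retry and error rates as a fraction of the send rate."""
    send_rate = metrics.get("record-send-rate", 0.0)
    if send_rate <= 0:
        # Nothing is being sent, so ratios are not meaningful; report zero.
        return {"retry_ratio": 0.0, "error_ratio": 0.0}
    return {
        "retry_ratio": metrics.get("record-retry-rate", 0.0) / send_rate,
        "error_ratio": metrics.get("record-error-rate", 0.0) / send_rate,
    }


snapshot = {
    "record-send-rate": 2000.0,
    "record-retry-rate": 20.0,
    "record-error-rate": 2.0,
}
print(producer_ratios(snapshot))
```

With this sample snapshot, 1% of sends are retried and 0.1% fail. What counts as an acceptable ratio depends on the workload.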

📥 3. Consumer Metrics

Vital for detecting lag, which indicates whether consumers are keeping up.

| Metric | Description |
| --- | --- |
| records-consumed-rate | Records consumed per second |
| fetch-latency-avg | Average time taken for a fetch request to the broker |
| fetch-rate | Number of fetch requests per second |
| records-lag ⚠️ | Number of records the consumer is behind the head of the partition |
| records-lag-max ⚠️ | Maximum lag across partitions |
| commit-latency-avg | Average time taken for an offset commit |
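
Lag is the distance between where a partition ends and where the consumer currently is. The sketch below computes it from two plain dictionaries; the `(topic, partition)` key shape and the sample offsets are assumptions for illustration, and treating a missing position as offset 0 is a simplification.

```python
TopicPartition = tuple[str, int]  # (topic, partition); hypothetical key shape


def partition_lag(
    end_offsets: dict[TopicPartition, int],
    positions: dict[TopicPartition, int],
) -> dict[TopicPartition, int]:
    """Per-partition lag: log end offset minus the consumer's position."""
    lag = {}
    for tp, end_offset in end_offsets.items():
        # Simplification: a partition with no known position counts from 0.
        position = positions.get(tp, 0)
        # Clamp at 0 in case the two snapshots were taken at different times.
        lag[tp] = max(end_offset - position, 0)
    return lag


end_offsets = {("orders", 0): 1500, ("orders", 1): 900, ("orders", 2): 40}
positions = {("orders", 0): 1450, ("orders", 1): 900}

lag = partition_lag(end_offsets, positions)
max_lag = max(lag.values(), default=0)  # analogous to records-lag-max
print(lag, max_lag)
```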

📊 4. Topic and Partition Metrics

These give insight into distribution, performance, and durability.

| Metric | Description |
| --- | --- |
| Partitions per topic | Should be balanced across brokers |
| Under-replicated partitions ⚠️ | Indicates a fault tolerance issue |
| ISR (in-sync replicas) size | Should equal replication factor |
| Log size per partition | Helps with storage forecasting |
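
The balance and ISR checks above can be sketched against a snapshot of partition metadata. The `PartitionState` class and the sample data are hypothetical; in practice the same fields would come from whatever admin tooling describes the topic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    topic: str
    partition: int
    replicas: tuple[int, ...]  # broker ids assigned to this partition
    isr: tuple[int, ...]       # broker ids currently in sync


def replicas_per_broker(partitions: list[PartitionState]) -> Counter:
    """Count how many partition replicas each broker hosts."""
    return Counter(broker for p in partitions for broker in p.replicas)


def under_replicated(partitions: list[PartitionState]) -> list[PartitionState]:
    """Partitions whose ISR is smaller than their replication factor."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


partitions = [
    PartitionState("orders", 0, replicas=(1, 2), isr=(1, 2)),
    PartitionState("orders", 1, replicas=(2, 3), isr=(2,)),
    PartitionState("orders", 2, replicas=(3, 1), isr=(3, 1)),
]

print(replicas_per_broker(partitions))
print([(p.topic, p.partition) for p in under_replicated(partitions)])
```

In this sample each broker hosts two replicas, so the topic is balanced, while partition 1 is under-replicated because broker 3 has fallen out of the ISR.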

☕ 5. JVM & System Metrics (All Kafka Services)

Kafka runs on the JVM, so monitor JVM health too:

| Metric | Description |
| --- | --- |
| Heap usage | Track MemoryPool metrics to catch memory pressure before an OOM |
| Garbage collection time (GC count/duration) | Watch for long pauses, especially G1 or OldGen collections |
| Thread count | A spike can indicate thread leaks or deadlocks |
| File descriptors used | Especially important on large broker nodes |
| Disk I/O and space usage | Ensure disk isn't bottlenecked or full |
| Network throughput | Helps correlate broker data with the infrastructure layer |

Common tools for collecting and visualizing these metrics:

  • Prometheus + Grafana (via JMX Exporter)
  • Confluent Control Center (for Confluent Kafka)
  • Datadog, New Relic, or Splunk (for enterprise observability)
  • Burrow (consumer lag monitoring) and Cruise Control (partition rebalancing)

🚨 Alerts You Should Definitely Set

| Alert | Why |
| --- | --- |
| UnderReplicatedPartitions > 0 | Risk of data loss |
| UncleanLeaderElections > 0 | Possible data loss; often follows broker failures |
| Consumer lag consistently increasing | Consumer may be unhealthy or unable to keep up |
| ActiveControllerCount != 1 | Cluster instability (the cluster-wide sum should be exactly 1) |
| Broker down | Availability risk |
| Heap usage > 85% or GC pause > 2s | Performance degradation |
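
These rules can be sketched as plain checks over a metrics snapshot. Everything below is illustrative: the snapshot keys, the five-sample window used for "consistently increasing", and the function names are assumptions, and a real setup would normally express the same conditions in the alerting system's own rule language.

```python
from typing import Sequence


def is_consistently_increasing(samples: Sequence[float], window: int = 5) -> bool:
    """True when each of the last `window` samples exceeds the one before it."""
    if len(samples) < window:
        return False
    recent = list(samples[-window:])
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def firing_alerts(snapshot: dict, lag_history: Sequence[float]) -> list[str]:
    """Return the names of the alerts whose condition currently holds."""
    checks = [
        ("UnderReplicatedPartitions > 0", snapshot["under_replicated_partitions"] > 0),
        ("UncleanLeaderElections > 0", snapshot["unclean_leader_elections"] > 0),
        ("Consumer lag consistently increasing", is_consistently_increasing(lag_history)),
        # Cluster-wide sum of ActiveControllerCount across all brokers.
        ("ActiveControllerCount != 1", snapshot["active_controller_count"] != 1),
        ("Broker down", snapshot["brokers_up"] < snapshot["brokers_expected"]),
        ("Heap usage > 85% or GC pause > 2s",
         snapshot["heap_used_ratio"] > 0.85 or snapshot["gc_pause_seconds"] > 2.0),
    ]
    return [name for name, firing in checks if firing]


snapshot = {
    "under_replicated_partitions": 3,
    "unclean_leader_elections": 0,
    "active_controller_count": 1,
    "brokers_up": 3,
    "brokers_expected": 3,
    "heap_used_ratio": 0.90,
    "gc_pause_seconds": 0.4,
}
print(firing_alerts(snapshot, lag_history=[100, 150, 220, 300, 410]))
```

Requiring several consecutive increases keeps a single lag spike from paging anyone; the window size is a tuning choice.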