Prometheus Metrics for On-prem Users
BuildBuddy exposes Prometheus metrics that allow monitoring the four golden signals: latency, traffic, errors, and saturation.
Prometheus metrics are exposed under the path /metrics on port 9090 by default.
To view these metrics in a live-updating dashboard, we recommend using a tool like Grafana.
Invocation build event metrics
All invocation metrics are recorded at the end of each invocation.
buildbuddy_invocation_count
(Counter)
The total number of invocations whose logs were uploaded to BuildBuddy.
Labels
- invocation_status: Invocation status: success, failure, disconnected, or unknown.
- bazel_exit_code: Exit code of a completed bazel command.
- bazel_command: Command provided to the Bazel daemon: run, test, build, coverage, mobile-install, ...
Examples
# Number of invocations per second by invocation status
sum by (invocation_status) (rate(buildbuddy_invocation_count[5m]))
# Invocation success rate
sum(rate(buildbuddy_invocation_count{invocation_status="success"}[5m]))
/
sum(rate(buildbuddy_invocation_count[5m]))
buildbuddy_invocation_duration_usec
(Histogram)
The total duration of each invocation, in microseconds.
Labels
- invocation_status: Invocation status: success, failure, disconnected, or unknown.
- bazel_command: Command provided to the Bazel daemon: run, test, build, coverage, mobile-install, ...
Examples
# Median invocation duration in the past 5 minutes
histogram_quantile(
0.5,
sum(rate(buildbuddy_invocation_duration_usec_bucket[5m])) by (le)
)
buildbuddy_invocation_open_streams
(Gauge)
Number of build event streams currently being handled by the server.
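A minimal sketch of how this gauge might be queried, assuming each app instance reports its own value so that the sum gives the total across the deployment:

```promql
# Total build event streams currently open across all app instances
sum(buildbuddy_invocation_open_streams)
```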
buildbuddy_invocation_build_event_count
(Counter)
Number of build events uploaded to BuildBuddy.
Labels
- status: Status code as defined by grpc/codes. This is a numeric value; any non-zero code indicates an error.
Examples
# Build events uploaded per second
sum(rate(buildbuddy_invocation_build_event_count[5m]))
# Approximate error rate of build event upload handler
sum(rate(buildbuddy_invocation_build_event_count{status!="0"}[5m]))
/
sum(rate(buildbuddy_invocation_build_event_count[5m]))
buildbuddy_invocation_stats_recorder_workers
(Gauge)
Number of invocation stats recorder workers currently running.
buildbuddy_invocation_stats_recorder_duration_usec
(Histogram)
How long it took to finalize an invocation's stats, in microseconds.
This includes the time required to wait for all BuildBuddy apps to flush their local metrics to Redis (if applicable) and then record the metrics to the DB.
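As a rough sketch, tail latency of stats finalization could be tracked as follows. The 5-minute window and the 0.99 quantile are illustrative choices; the _bucket suffix is the standard Prometheus histogram series.

```promql
# p99 time to finalize invocation stats, converted from microseconds to seconds
histogram_quantile(
  0.99,
  sum by (le) (rate(buildbuddy_invocation_stats_recorder_duration_usec_bucket[5m]))
) / 1e6
```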
buildbuddy_invocation_webhook_invocation_lookup_workers
(Gauge)
Number of webhook invocation lookup workers currently running.
buildbuddy_invocation_webhook_invocation_lookup_duration_usec
(Histogram)
How long it took to look up an invocation before posting to the webhook, in microseconds.
buildbuddy_invocation_webhook_notify_workers
(Gauge)
Number of webhook notify workers currently running.
buildbuddy_invocation_webhook_notify_duration_usec
(Histogram)
How long it took to post an invocation proto to the webhook, in microseconds.
Remote cache metrics
NOTE: Cache metrics are recorded at the end of each invocation, which means that these metrics provide approximate real-time signals.
buildbuddy_remote_cache_events
(Counter)
Number of cache events handled.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- cache_event_type: Cache event type: hit, miss, or upload.
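The two labels above can be combined to estimate a cache hit rate. The following is a sketch for the action cache, assuming that hits plus misses represent all lookups (uploads are excluded); the 5-minute window is an arbitrary choice.

```promql
# Approximate action cache hit rate over the past 5 minutes
sum(rate(buildbuddy_remote_cache_events{cache_type="action", cache_event_type="hit"}[5m]))
/
sum(rate(buildbuddy_remote_cache_events{cache_type="action", cache_event_type=~"hit|miss"}[5m]))
```

Because cache metrics are recorded at the end of each invocation, this ratio lags behind real-time behavior.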
buildbuddy_remote_cache_download_size_bytes
(Histogram)
Number of bytes downloaded from the remote cache in each download.
Use the _sum suffix to get the total downloaded bytes and the _count suffix to get the number of downloaded files.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- server_name: Describes the name of the server that handles a client request, such as "byte_stream_server" or "cas_server".
Examples
# Cache download rate (bytes per second)
sum(rate(buildbuddy_remote_cache_download_size_bytes_sum[5m]))
buildbuddy_remote_cache_download_duration_usec
(Histogram)
Download duration for each file downloaded from the remote cache, in microseconds.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
Examples
# Median download duration for content-addressable store (CAS)
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_cache_download_duration_usec_bucket{cache_type="cas"}[5m])) by (le)
)
buildbuddy_remote_cache_upload_size_bytes
(Histogram)
Number of bytes uploaded to the remote cache in each upload.
Use the _sum suffix to get the total uploaded bytes and the _count suffix to get the number of uploaded files.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
- server_name: Describes the name of the server that handles a client request, such as "byte_stream_server" or "cas_server".
Examples
# Cache upload rate (bytes per second)
sum(rate(buildbuddy_remote_cache_upload_size_bytes_sum[5m]))
buildbuddy_remote_cache_upload_duration_usec
(Histogram)
Upload duration for each file uploaded to the remote cache, in microseconds.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
Examples
# Median upload duration for content-addressable store (CAS)
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_cache_upload_duration_usec_bucket{cache_type="cas"}[5m])) by (le)
)
buildbuddy_remote_cache_disk_cache_last_eviction_age_usec
(Gauge)
The age of the item most recently evicted from the cache, in microseconds.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
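One possible way to read this gauge is as a lower bound on how long items survive in the cache. The sketch below takes the minimum across instances per partition and converts microseconds to hours; grouping by both labels is an illustrative choice.

```promql
# Youngest most-recent eviction per partition, in hours
min by (partition_id, cache_name) (
  buildbuddy_remote_cache_disk_cache_last_eviction_age_usec
) / 1e6 / 3600
```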
buildbuddy_remote_cache_disk_cache_eviction_age_msec
(Histogram)
Age of items evicted from the cache, in milliseconds.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_num_evictions
(Counter)
Number of items evicted.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_partition_size_bytes_evicted
(Counter)
Number of bytes in the partition evicted.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_partition_size_bytes
(Gauge)
Number of bytes in the partition.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_partition_capacity_bytes
(Gauge)
Maximum number of bytes the partition can hold.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
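The size and capacity gauges share the same labels, so they can be divided to estimate how full each partition is. This sketch assumes the capacity gauge reports the partition's configured maximum size, as its name suggests.

```promql
# Fraction of each disk cache partition currently in use (0-1), summed across instances
sum by (partition_id, cache_name) (buildbuddy_remote_cache_disk_cache_partition_size_bytes)
/
sum by (partition_id, cache_name) (buildbuddy_remote_cache_disk_cache_partition_capacity_bytes)
```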
buildbuddy_remote_cache_disk_cache_partition_num_items
(Gauge)
Number of items in the partition.
Labels
- partition_id: The ID of the disk cache partition this event applied to.
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
buildbuddy_remote_cache_disk_cache_duplicate_writes
(Counter)
Number of writes for digests that already exist.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_added_file_size_bytes
(Histogram)
Size of artifacts added to the file cache, in bytes.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_filesystem_total_bytes
(Gauge)
Total size of the underlying filesystem.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_disk_cache_filesystem_avail_bytes
(Gauge)
Available bytes in the underlying filesystem.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
Examples
# Total number of duplicate writes.
sum(buildbuddy_remote_cache_disk_cache_duplicate_writes)
buildbuddy_remote_cache_disk_cache_duplicate_writes_bytes
(Counter)
Number of bytes written that already existed in the cache.
Labels
- cache_name: Cache name: Custom name to describe the cache, like "pebble-cache".
buildbuddy_remote_cache_distributed_cache_peer_lookups
(Histogram)
Number of peers consulted (including the 'local peer') for a distributed cache read before returning a response.
For batch requests, one observation is recorded for each digest in the request.
Labels
- op: Distributed cache operation name, such as "FindMissing" or "Get".
- cache_status: Cache lookup result. One of: "hit", "miss", "partial" (for batched RPCs where part of a request was cached), or "uncacheable" (for example, encrypted resources).
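Since this is a histogram, the standard _sum and _count series can be divided to get an average. A sketch, with an arbitrary 5-minute window:

```promql
# Average number of peers consulted per distributed cache read, by operation
sum by (op) (rate(buildbuddy_remote_cache_distributed_cache_peer_lookups_sum[5m]))
/
sum by (op) (rate(buildbuddy_remote_cache_distributed_cache_peer_lookups_count[5m]))
```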
buildbuddy_remote_cache_migration_not_found_error_count
(Counter)
Number of not found errors from the destination cache during a cache migration.
Labels
- type: Describes the type of cache request
buildbuddy_remote_cache_migration_double_read_hit_count
(Counter)
Number of double reads where the source and destination caches hold the same digests during a cache migration.
Labels
- type: Describes the type of cache request
buildbuddy_remote_cache_migration_copy_chan_size
(Gauge)
Number of digests queued to be copied during a cache migration.
buildbuddy_remote_cache_migration_bytes_copied
(Counter)
Number of bytes copied from the source to destination cache during a cache migration.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
buildbuddy_remote_cache_migration_blobs_copied
(Counter)
Number of blobs copied from the source to destination cache during a cache migration.
Labels
- cache_type: Cache type: action for action cache, cas for content-addressable storage.
buildbuddy_remote_cache_tree_cache_lookup_count
(Counter)
Total number of TreeCache lookups.
Labels
- status: The TreeCache status: hit/miss/invalid_entry.
- level: TreeCache directory depth: 0 for the root dir, 1 for a direct child of the root dir, and so on.
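As an illustration, the status and level labels can be combined to see how the TreeCache hit rate varies with directory depth. The window is an arbitrary choice.

```promql
# TreeCache hit rate by directory depth
sum by (level) (rate(buildbuddy_remote_cache_tree_cache_lookup_count{status="hit"}[5m]))
/
sum by (level) (rate(buildbuddy_remote_cache_tree_cache_lookup_count[5m]))
```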
buildbuddy_remote_cache_tree_cache_split_lookup_count
(Counter)
Total number of TreeCache split lookups.
Labels
- status: The TreeCache split lookup status: hit/miss/failure
buildbuddy_remote_cache_tree_cache_split_write_count
(Counter)
Total number of splits written to TreeCache.
buildbuddy_remote_cache_tree_cache_set_count
(Counter)
Total number of TreeCache sets.
Labels
- status: The TreeCache set status: success/deadline_exceeded/other_error
buildbuddy_remote_cache_tree_cache_bytes_transferred
(Counter)
Number of bytes written to or read from the TreeCache.
Labels
- op: TreeCache operation "read" or "write"
buildbuddy_remote_cache_lookaside_cache_lookup_count
(Counter)
Total number of Lookaside Cache lookups.
Labels
- status: The Lookaside cache status: hit/miss.
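A sketch of a hit-rate query for the lookaside cache, assuming hit and miss are the only status values, as listed above:

```promql
# Lookaside cache hit rate over the past 5 minutes
sum(rate(buildbuddy_remote_cache_lookaside_cache_lookup_count{status="hit"}[5m]))
/
sum(rate(buildbuddy_remote_cache_lookaside_cache_lookup_count[5m]))
```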
buildbuddy_remote_cache_lookaside_cache_eviction_age_msec
(Histogram)
Age of items evicted from the cache, in milliseconds.
Labels
- eviction_reason: The reason an item was evicted from the lookaside cache. One of: "expired" or "size"
Remote execution metrics
buildbuddy_remote_execution_count
(Counter)
Number of actions executed remotely.
This only includes actions which reached the execution phase. If an action fails before execution (for example, if it fails authentication) then this metric is not incremented.
Labels
- exit_code: Process exit code of an executed action.
- status: Status code as defined by grpc/codes in human-readable format, such as "OK" or "NotFound".
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
Examples
# Total number of actions executed per second
sum(rate(buildbuddy_remote_execution_count[5m]))
buildbuddy_remote_execution_tasks_started_count
(Counter)
Number of tasks started remotely, but not necessarily completed.
Includes retry attempts of the same task.
buildbuddy_remote_execution_executed_action_metadata_durations_usec
(Histogram)
Time spent in each stage of action execution, in microseconds.
Queries should filter or group by the stage label, taking care not to aggregate different stages.
Labels
- stage: Executed action stage. Action execution is split into stages corresponding to the timestamps defined in ExecutedActionMetadata: queued, input_fetch, execution, and output_upload. An additional stage, worker, includes all stages during which a worker is handling the action, which is all stages except the queued stage.
- group_id: Group (organization) ID associated with the request.
Examples
# Median duration of all command stages
histogram_quantile(
0.5,
sum(rate(buildbuddy_remote_execution_executed_action_metadata_durations_usec_bucket[5m])) by (le, stage)
)
# p90 duration of just the command execution stage
histogram_quantile(
0.9,
sum(rate(buildbuddy_remote_execution_executed_action_metadata_durations_usec_bucket{stage="execution"}[5m])) by (le)
)
buildbuddy_remote_execution_task_pressure_stall_duration_fraction
(Histogram)
Linux PSI stall time as a fraction of each action's execution duration (0-1).
Labels
- resource: System resource: "cpu", "memory", or "io".
- stall_type: Pressure stall type: "some" (task is partially stalled on the resource) or "full" (task is completely stalled on the resource).
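To illustrate how the two labels select a single pressure signal, the sketch below looks at actions that were fully stalled on memory. The quantile and window are illustrative.

```promql
# p90 fraction of execution time that actions spent fully stalled on memory
histogram_quantile(
  0.9,
  sum by (le) (
    rate(buildbuddy_remote_execution_task_pressure_stall_duration_fraction_bucket{resource="memory", stall_type="full"}[5m])
  )
)
```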
buildbuddy_remote_execution_task_size_read_requests
(Counter)
Number of read requests to the task sizer, which estimates action resource usage based on historical execution stats.
Labels
- status: Status of the task size read request: hit, miss, or error.
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
- group_id: Group (organization) ID associated with the request.
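One way to use the status label is to measure how often the task sizer has historical stats available. This is a sketch; grouping by isolation is an arbitrary choice among the labels above.

```promql
# Task sizer read hit rate by isolation type
sum by (isolation) (rate(buildbuddy_remote_execution_task_size_read_requests{status="hit"}[5m]))
/
sum by (isolation) (rate(buildbuddy_remote_execution_task_size_read_requests[5m]))
```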
buildbuddy_remote_execution_task_size_write_requests
(Counter)
Number of write requests to the task sizer, which estimates action resource usage based on historical execution stats.
Labels
- status: Status of the task size write request: ok, missing_stats, or error.
- isolation: Effective workload isolation type used for an executed task, such as "docker", "podman", "firecracker", or "none".
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
- group_id: Group (organization) ID associated with the request.
buildbuddy_remote_execution_task_size_prediction_duration_usec
(Histogram)
Task size prediction model request duration in microseconds.
Labels
- status: Status code as defined by grpc/codes in human-readable format, such as "OK" or "NotFound".
buildbuddy_remote_execution_enqueued_task_milli_cpu
(Histogram)
Milli-CPU prediction of enqueued tasks.
buildbuddy_remote_execution_enqueued_task_memory_bytes
(Histogram)
Memory prediction of enqueued tasks.
buildbuddy_remote_execution_waiting_execution_result
(Gauge)
Number of execution requests for which the client is actively waiting for results.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Total number of execution requests with client waiting for result.
sum(buildbuddy_remote_execution_waiting_execution_result)
buildbuddy_remote_execution_requests
(Counter)
Number of execution requests received.
Labels
- group_id: Group (organization) ID associated with the request.
- os: OS associated with the request.
- arch: CPU architecture associated with the request.
buildbuddy_remote_execution_executor_registration_count
(Counter)
Number of executor registrations on the scheduler.
Labels
- version: Binary version. Example: v2.0.0.
Examples
# Rate of new execution requests by OS/Arch.
sum(rate(buildbuddy_remote_execution_requests[1m])) by (os, arch)
buildbuddy_remote_execution_merged_actions
(Counter)
Number of identical execution requests that have been merged.
Labels
- group_id: Group (organization) ID associated with the request.
buildbuddy_remote_execution_hedged_actions
(Counter)
Number of identical execution requests that were merged and for which a hedged execution was run in the background.
Labels
- group_id: Group (organization) ID associated with the request.
buildbuddy_remote_execution_merged_actions_per_execution
(Histogram)
Distribution of how many actions were submitted and merged against a single, canonical execution over the lifetime of that canonical execution.
Note that this metric is recorded once per merged-action, so distribution values are cumulative, or recorded n-times per canonical execution.
Labels
- group_id: Group (organization) ID associated with the request.
buildbuddy_remote_execution_merged_action_submit_time_offset_usec
(Histogram)
The offset, in microseconds of wall-time, between the time when a merged action was submitted to the execution server and when the original action was submitted to the execution server.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Rate of merged actions by group.
sum(rate(buildbuddy_remote_execution_merged_actions[1m])) by (group_id)
buildbuddy_remote_execution_queue_length
(Gauge)
Number of actions currently waiting in the executor queue.
Labels
- group_id: Group (organization) ID associated with the request.
Examples
# Median queue length across all executors
quantile(0.5, buildbuddy_remote_execution_queue_length)
buildbuddy_remote_execution_tasks_executing
(Gauge)
Number of tasks currently being executed by the executor.
Labels
- stage: Executed action stage. Action execution is split into stages corresponding to the timestamps defined in ExecutedActionMetadata: queued, input_fetch, execution, and output_upload. An additional stage, worker, includes all stages during which a worker is handling the action, which is all stages except the queued stage.
Examples
# Fraction of idle executors
count(buildbuddy_remote_execution_tasks_executing == 0)
/
count(buildbuddy_remote_execution_tasks_executing)
buildbuddy_remote_execution_assigned_ram_bytes
(Gauge)
Estimated RAM on the executor that is currently allocated for task execution, in bytes.
buildbuddy_remote_execution_assigned_and_queued_estimated_ram_bytes
(Gauge)
Estimated RAM on the executor that is currently allocated for queued or executing tasks, in bytes.
Note that this is a fuzzy estimate because there's no guarantee that tasks queued on a machine will be handled by that machine.
buildbuddy_remote_execution_assignable_ram_bytes
(Gauge)
Maximum total RAM that can be allocated for task execution, in bytes.
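The assigned and assignable gauges can be combined into a saturation signal. A minimal sketch, assuming both gauges are reported by every executor so that the fleet-wide sums are comparable:

```promql
# Fraction of assignable executor RAM currently allocated to executing tasks
sum(buildbuddy_remote_execution_assigned_ram_bytes)
/
sum(buildbuddy_remote_execution_assignable_ram_bytes)
```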
buildbuddy_remote_execution_assigned_milli_cpu
(Gauge)
Estimated CPU time on the executor that is currently allocated for task execution, in milliCPU (CPU-milliseconds per second).
buildbuddy_remote_execution_assigned_and_queued_estimated_milli_cpu
(Gauge)
Estimated CPU time on the executor that is currently allocated for queued or executing tasks, in milliCPU (CPU-milliseconds per second).
Note that this is a fuzzy estimate because there's no guarantee that tasks queued on a machine will be handled by that machine.
buildbuddy_remote_execution_assignable_milli_cpu
(Gauge)
Maximum total CPU time on the executor that can be allocated for task execution, in milliCPU (CPU-milliseconds per second).
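Including queued work in the numerator gives a rough view of demand relative to capacity. In this sketch, a value above 1 would suggest more estimated CPU demand than the executors can allocate at once; as noted above, the queued estimate is fuzzy, so treat it as a trend indicator rather than an exact figure.

```promql
# Estimated CPU demand (executing + queued) relative to assignable CPU
sum(buildbuddy_remote_execution_assigned_and_queued_estimated_milli_cpu)
/
sum(buildbuddy_remote_execution_assignable_milli_cpu)
```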
buildbuddy_remote_execution_cpu_utilization_milli_cpu
(Gauge)
Approximate current CPU utilization of tasks executing, in milliCPU (CPU-milliseconds per second).
This allows for much higher granularity than using a rate() on the used_milli_cpu metric.
buildbuddy_remote_execution_file_download_count
(Histogram)
Number of files downloaded during remote execution.
buildbuddy_remote_execution_file_download_size_bytes
(Histogram)
Total number of bytes downloaded during remote execution.
buildbuddy_remote_execution_file_download_duration_usec
(Histogram)
Per-file download duration during remote execution, in microseconds.
buildbuddy_remote_execution_file_upload_count
(Histogram)
Number of files uploaded during remote execution.
buildbuddy_remote_execution_file_upload_size_bytes
(Histogram)
Total number of bytes uploaded during remote execution.
buildbuddy_remote_execution_skipped_output_bytes
(Counter)
Total number of output bytes that weren't uploaded after remote execution.
buildbuddy_remote_execution_file_upload_duration_usec
(Histogram)
Per-file upload duration during remote execution, in microseconds.
buildbuddy_firecracker_stage_duration_usec
(Histogram)
The total duration of each firecracker stage, in microseconds.
Labels
- stage: Generic label to describe the stage the metric is capturing
Stage label values
- "init": Time for the VM to start up (either a new VM or from a snapshot)
- "exec": Time to run the command inside the container
- "task_lifecycle": Time from when the task is first assigned to the VM (beginning of init) until it has finished execution. This roughly represents how long a customer waits for the task to complete after it has been scheduled to a firecracker runner
- "pause": Time to pause the VM, save a snapshot, and clean up resources
Examples
# P95 workflow lifecycle duration in the past 5 minutes, grouped by group_id
histogram_quantile(
0.95,
sum by(le, group_id) (
rate(buildbuddy_firecracker_stage_duration_usec_bucket{job="executor-workflows", stage="task_lifecycle"}[5m])
)
)
buildbuddy_firecracker_exec_dial_duration_usec
(Histogram)
Time taken to dial the VM guest execution server after it has been started or resumed, in microseconds.
buildbuddy_firecracker_snapshot_remote_cache_upload_size_bytes
(Counter)
After a copy-on-write snapshot has been used, the total count of bytes dirtied.