From 0954d249c2be3bb952de1ebcd71bad56b529350f Mon Sep 17 00:00:00 2001
From: Marcin Rataj <lidel@lidel.org>
Date: Wed, 12 Nov 2025 03:24:43 +0100
Subject: [PATCH] docs: clarify provide stats metric types and calculations
 (#11041)

add "Understanding the Metrics" section explaining three types:
- per-worker rates (multiply by active workers for total throughput)
- per-region averages (do NOT multiply by worker count)
- system totals (cumulative across all workers)

enhance metric descriptions with:
- explicit calculation examples showing which worker counts to use
- warnings about when NOT to multiply by worker count
- cross-references to relevant sections

add "Capacity Planning" section with:
- step-by-step throughput capacity calculations
- diagnostic guidance for common scenarios
- worked examples for estimating required vs actual capacity

addresses confusion from PR #11034 comments about when to multiply
metrics by worker count and how to interpret per-worker rates
---
 docs/provide-stats.md | 95 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 86 insertions(+), 9 deletions(-)

diff --git a/docs/provide-stats.md b/docs/provide-stats.md
index cf9a9f057..d70438015 100644
--- a/docs/provide-stats.md
+++ b/docs/provide-stats.md
@@ -4,6 +4,34 @@ The `ipfs provide stat` command gives you statistics about your local provide
 system. This file provides a detailed explanation of the metrics reported by
 this command.
 
+## Understanding the Metrics
+
+The statistics are organized into three types of measurements:
+
+### Per-worker rates
+
+Metrics like "CIDs reprovided/min/worker" measure the throughput of a single
+worker processing one region. To estimate total system throughput, multiply by
+the number of active workers of that type (see [Workers stats](#workers-stats)).
+
+Example: If "CIDs reprovided/min/worker" shows 100 and you have 10 active
+periodic workers, your total reprovide throughput is approximately 1,000
+CIDs/min.
+
+### Per-region averages
+
+Metrics like "Avg CIDs/reprovide" measure properties of the work units (keyspace
+regions). These represent the average size or characteristics of a region, not a
+rate. Do NOT multiply these by worker count.
+
+Example: "Avg CIDs/reprovide: 250,000" means each region contains an average of
+250,000 CIDs that get reprovided together as a batch.
+
+### System totals
+
+Metrics like "Total CIDs provided" are cumulative counts since node startup.
+These aggregate all work across all workers over time.
+
 ## Connectivity
 
 ### Status
@@ -148,19 +176,31 @@ regions are automatically retried unless the node is offline.
 
 Average rate of initial provides per minute per worker during the last
 reprovide cycle (excludes reprovides). Each worker handles one keyspace region
-at a time, providing all CIDs in that region. This rate only counts active time
-(timer doesn't run when no initial provides are being processed). The overall
-provide rate can be higher when multiple workers are providing different
-regions concurrently.
+at a time, providing all CIDs in that region. This measures the throughput of a
+single worker only.
+
+To estimate total system provide throughput, multiply by the number of active
+burst workers shown in [Workers stats](#workers-stats) (Burst > Active).
+
+Note: This rate only counts active time when initial provides are being
+processed. With idle workers, actual throughput may fall below this estimate.
 
 ### CIDs reprovided/min/worker
 
 Average rate of reprovides per minute per worker during the last reprovide
 cycle (excludes initial provides). Each worker handles one keyspace region at a
-time, reproviding all CIDs in that region. The overall reprovide rate can be
-higher when multiple workers are reproviding different regions concurrently. To
-estimate total reprovide rate, multiply by the number of [periodic
-workers](./config.md#providedhtdedicatedperiodicworkers) in use.
+time, reproviding all CIDs in that region. This measures the throughput of a
+single worker only.
+
+To estimate total system reprovide throughput, multiply by the number of active
+periodic workers shown in [Workers stats](#workers-stats) (Periodic > Active).
+
+Example: If this shows 100 CIDs/min and you have 10 active periodic workers,
+your total reprovide throughput is approximately 1,000 CIDs/min.
+
+Note: This rate only counts active time when regions are being reprovided. If
+workers are idle due to network issues or queue exhaustion, actual throughput
+may be lower than this estimate.
 
 ### Region reprovide duration
 
@@ -170,6 +210,13 @@ Average time to reprovide all CIDs in a region during the last cycle.
 
 Average number of CIDs per region during the last reprovide cycle.
 
+This measures the average size of a region (how many CIDs are batched together),
+not a throughput rate. Do NOT multiply this by worker count.
+
+Combined with [Region reprovide duration](#region-reprovide-duration), this
+helps estimate per-worker throughput: dividing Avg CIDs/reprovide by Region
+reprovide duration (in minutes) gives CIDs/min/worker.
+
 ### Regions reprovided (last cycle)
 
 Number of regions reprovided in the last cycle.
@@ -189,11 +236,16 @@ Number of idle workers not reserved for periodic or burst tasks.
 Breakdown of worker status by type (periodic for scheduled reprovides, burst for
 initial provides). For each type:
 
-- **Active**: Currently processing operations
+- **Active**: Currently processing operations (use this count when calculating total throughput from per-worker rates)
 - **Dedicated**: Reserved for this type
 - **Available**: Idle dedicated workers + [free workers](#free-workers)
 - **Queued**: 0 or 1 (workers acquired only when needed)
 
+Total system throughput is the per-worker rate times the number of active
+workers. For example, if you have 10 active periodic workers, multiply
+[CIDs reprovided/min/worker](#cids-reprovidedminworker) by 10 to estimate total
+reprovide throughput.
+
 See [provide queue](#provide-queue) and [reprovide queue](#reprovide-queue) for
 regions waiting to be processed.
 
@@ -202,6 +254,31 @@ regions waiting to be processed.
 Maximum concurrent DHT server connections per worker when sending provider
 records for a region.
 
+## Capacity Planning
+
+### Estimating if your system can keep up with the reprovide schedule
+
+To check if your provide system has sufficient capacity:
+
+1. Calculate required throughput:
+   - Required CIDs/min = [CIDs scheduled](#cids-scheduled) / ([Reprovide interval](#reprovide-interval) in minutes)
+   - Example: 67M CIDs / (22 hours × 60 min) ≈ 50,758 CIDs/min needed
+
+2. Calculate actual throughput:
+   - Actual CIDs/min = [CIDs reprovided/min/worker](#cids-reprovidedminworker) × Active periodic workers
+   - Example: 100 CIDs/min/worker × 256 active workers = 25,600 CIDs/min
+
+3. Compare:
+   - If actual < required: System is underprovisioned; increase [MaxWorkers](./config.md#providedhtmaxworkers) or [DedicatedPeriodicWorkers](./config.md#providedhtdedicatedperiodicworkers)
+   - If actual > required: System has excess capacity
+   - If [Reprovide queue](#reprovide-queue) is growing: System is falling behind
+
+### Understanding worker utilization
+
+- High active workers with growing reprovide queue: More workers are needed, or network connectivity is limiting throughput
+- Low active workers with non-empty reprovide queue: Workers may be waiting for network or DHT operations
+- Check [Reachable peers](#reachable-peers) to diagnose network connectivity issues
+
 ## See Also
 
 - [Provide configuration reference](./config.md#provide)
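
The arithmetic in the Capacity Planning section of this patch can be cross-checked with a short script. This is only a sketch: the function names are hypothetical, and the inputs are assumed to be copied by hand from `ipfs provide stat` output rather than parsed from it. It reuses the example figures from the patch (67M CIDs, 22-hour interval, 100 CIDs/min/worker, 256 active periodic workers).

```python
def required_cids_per_min(cids_scheduled: int, reprovide_interval_hours: float) -> float:
    """CIDs that must be reprovided per minute to finish one cycle on time."""
    return cids_scheduled / (reprovide_interval_hours * 60)


def actual_cids_per_min(rate_per_worker: float, active_periodic_workers: int) -> float:
    """Estimated total reprovide throughput from a per-worker rate."""
    return rate_per_worker * active_periodic_workers


def per_worker_rate(avg_cids_per_reprovide: float, region_duration_min: float) -> float:
    """Per-worker rate derived from region size and region duration in minutes."""
    return avg_cids_per_reprovide / region_duration_min


# Example figures taken from the Capacity Planning section (hand-entered).
required = required_cids_per_min(67_000_000, 22)
actual = actual_cids_per_min(100, 256)
verdict = "underprovisioned" if actual < required else "sufficient"

print(f"required: {required:,.0f} CIDs/min")
print(f"actual:   {actual:,.0f} CIDs/min")
print(f"verdict:  {verdict}")
```

With these inputs the estimated throughput is roughly half of what the schedule requires, which matches the "actual < required" case in step 3. The `per_worker_rate` helper mirrors the Avg CIDs/reprovide note; it only gives CIDs/min/worker if the region duration is expressed in minutes.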