Mirror of https://github.com/tig-pool-nk/tig-monorepo.git (synced 2026-02-21 17:27:21 +08:00)
Update tig-benchmarker master and README.
This commit is contained in: parent a96d3e9def, commit 18a1f1db92
@@ -2,6 +2,46 @@

Benchmarker for TIG. The expected setup is a single master and multiple slaves on different servers.

## Overview

Benchmarking in TIG works as follows (a short illustrative sketch of the batching and Merkle-root steps follows this list):

1. Master submits a precommit to the TIG protocol, with details of what it will benchmark:
    * `block_id`: start of the benchmark
    * `player_id`: address of the benchmarker
    * `challenge_id`: challenge to benchmark
    * `algorithm_id`: algorithm to benchmark
    * `difficulty`: difficulty of instances to randomly generate
    * `num_nonces`: number of instances to benchmark

2. TIG protocol confirms the precommit and assigns it a random string

3. Master starts benchmarking:
    * polls TIG protocol for the confirmed precommit + random string
    * creates a benchmark job, splitting it into batches
    * slaves poll the master for batches to compute
    * slaves do the computation and send results back to the master

4. After benchmarking is finished, master submits the benchmark to TIG protocol:
    * `solution_nonces`: list of nonces for which a solution was found
    * `merkle_root`: Merkle root of the tree constructed using the results as leaves

5. TIG protocol confirms the benchmark and randomly samples nonces requiring proof

6. Master prepares the proof:
    * polls TIG protocol for the confirmed benchmark + sampled nonces
    * creates proof jobs
    * slaves poll the master for nonces requiring proof
    * slaves send Merkle branches back to the master

7. Master submits the proof to TIG protocol

8. TIG protocol confirms the proof, calculating the block from which the solutions will become "active" (eligible to earn rewards)
    * Verification is performed in parallel
    * Solutions become inactive 120 blocks from when the benchmark started
    * The delay is determined by the number of blocks between the start and when the proof was confirmed
    * Each block, active solutions which qualify earn rewards for the Benchmarker

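The sketch below illustrates steps 3–4: splitting a job's nonces into power-of-two batches and building a Merkle root over per-nonce results. It is a minimal, self-contained illustration; the hashing scheme and helper names are assumptions for this example, not the actual tig-benchmarker implementation.

```python
import math
import hashlib

def split_into_batches(num_nonces: int, batch_size: int) -> list[range]:
    """Split nonces 0..num_nonces-1 into batches of `batch_size` (a power of 2)."""
    assert batch_size & (batch_size - 1) == 0, "batch_size must be a power of 2"
    num_batches = math.ceil(num_nonces / batch_size)
    return [range(i * batch_size, min((i + 1) * batch_size, num_nonces))
            for i in range(num_batches)]

def merkle_root(leaves: list[bytes]) -> bytes:
    """Toy Merkle root over per-nonce result hashes (illustrative hashing only)."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Example: 40 nonces with batch_size 8 -> 5 batches (matching the example config later in this README)
batches = split_into_batches(num_nonces=40, batch_size=8)
print(len(batches))                          # 5
root = merkle_root([f"result-{n}".encode() for n in range(40)])
print(root.hex())
```
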
# Starting Your Master

Simply run:
@@ -41,9 +81,53 @@ See last section on how to find your player_id & api_key.
2. Delete the database: `rm -rf db_data`
3. Start your master

## Optimising your Master Config
## Master Config

See [docs.tig.foundation](https://docs.tig.foundation/benchmarking/benchmarker-config)
The master config defines how benchmarking jobs are selected, scheduled, and distributed to slaves. This config can be edited via the master UI or via API (`http://localhost:<MASTER_PORT>/update-config`).

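As a concrete example, the sketch below pushes a config file to that endpoint. The `/update-config` path comes from the text above; the use of HTTP POST, the port value, and the `config.json` filename are assumptions for illustration.

```python
import json
import requests

MASTER_PORT = 8080  # hypothetical value; substitute your actual <MASTER_PORT>

# Load a config like the one shown below and push it to the running master.
with open("config.json") as f:
    cfg = json.load(f)

resp = requests.post(f"http://localhost:{MASTER_PORT}/update-config", json=cfg)
resp.raise_for_status()  # non-2xx responses (e.g. an invalid batch_size) raise here
print(resp.status_code)
```
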
```json
{
    "player_id": "0x0000000000000000000000000000000000000000",
    "api_key": "00000000000000000000000000000000",
    "api_url": "https://mainnet-api.tig.foundation",
    "time_between_resubmissions": 60000,
    "max_concurrent_benchmarks": 4,
    "algo_selection": [
        {
            "algorithm_id": "c001_a001",
            "num_nonces": 40,
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
        },
        ...
    ],
    "time_before_batch_retry": 60000,
    "slaves": [
        {
            "name_regex": ".*",
            "algorithm_id_regex": ".*",
            "max_concurrent_batches": 1
        }
    ]
}
```

**Explanation:**
* `player_id`: Your wallet address (lowercase)
* `api_key`: See the last section on how to obtain your API key
* `api_url`: mainnet (https://mainnet-api.tig.foundation) or testnet (https://testnet-api.tig.foundation)
* `time_between_resubmissions`: Time in milliseconds to wait before resubmitting a benchmark/proof which has not been confirmed into a block
* `max_concurrent_benchmarks`: Maximum number of benchmarks that can be "in flight" at once (i.e., benchmarks whose proof has not been computed yet)
* `algo_selection`: List of algorithms that can be picked for benchmarking. Each entry has:
    * `algorithm_id`: ID of the algorithm (e.g., c001_a001). Tip: use the [list_algorithms](../scripts/list_algorithms.py) script to get the list of algorithm IDs
    * `num_nonces`: Number of instances to benchmark for this algorithm
    * `difficulty_range`: Bounds (0.0 = easiest, 1.0 = hardest) for random difficulty sampling. The full range is `[0.0, 1.0]`
    * `selected_difficulties`: A list of difficulties `[[x1,y1], [x2, y2], ...]`. If any of these difficulties are in the valid range, one is randomly selected instead of sampling from the difficulty range
    * `weight`: Selection weight. An algorithm is chosen proportionally to `weight / total_weight` (see the selection sketch after this list)
    * `batch_size`: Number of nonces per batch. Must be a power of 2. For example, if num_nonces = 40 and batch_size = 8, the benchmark is split into 5 batches

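A minimal sketch of how an algorithm and difficulty might be picked from `algo_selection` (illustrative only; the helper names, the `valid` difficulty set, and the exact sampling logic are assumptions, not the tig-benchmarker implementation):

```python
import random

def select_algorithm(algo_selection: list[dict]) -> dict:
    """Pick an entry proportionally to weight / total_weight."""
    weights = [entry["weight"] for entry in algo_selection]
    return random.choices(algo_selection, weights=weights, k=1)[0]

def select_difficulty(entry: dict, valid: set[tuple[int, int]]) -> object:
    """Prefer a selected difficulty that is still valid; otherwise sample from difficulty_range."""
    candidates = [tuple(d) for d in entry["selected_difficulties"] if tuple(d) in valid]
    if candidates:
        return list(random.choice(candidates))
    lo, hi = entry["difficulty_range"]
    return random.uniform(lo, hi)   # normalized: 0.0 = easiest, 1.0 = hardest

# Usage with the example entry above (the valid-difficulty set here is hypothetical):
entry = {
    "algorithm_id": "c001_a001", "num_nonces": 40,
    "difficulty_range": [0, 0.5], "selected_difficulties": [], "weight": 1,
}
picked = select_algorithm([entry])
print(picked["algorithm_id"], select_difficulty(picked, valid=set()))
```
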
# Connecting Slaves

@@ -64,11 +148,60 @@ See [docs.tig.foundation](https://docs.tig.foundation/benchmarking/benchmarker-c
```

**Notes:**
* If your master is on a different server to your slave, you need to add the option `--master <SERVER_IP>`
* To set the number of workers (threads), use the option `--workers <NUM_WORKERS>`
* To use a different port, use the option `--port <MASTER_PORT>`
* If your master is on a different server, add `--master <SERVER_IP>`
* Set a custom master port with `--port <MASTER_PORT>`
* To see all options, use `--help`

## Slave Config

You can control execution limits via a JSON config:

```json
{
    "max_workers": 100,
    "cpus": 8,
    "gpus": 0,
    "algorithms": [
        {
            "id_regex": ".*",
            "cpu_cost": 1.0,
            "gpu_cost": 0.0
        }
    ]
}
```

**Explanation:**
* `max_workers`: Maximum number of concurrent tig-runtime processes.
* `cpus` & `gpus`: Total compute limits available to the slave.
* `algorithms`: Rules for matching algorithms based on `id_regex`.
    * An algorithm can only be executed if the summed `cpu_cost` and `gpu_cost` of everything running stays within the slave's `cpus` and `gpus` limits.
    * The regex matches algorithm ids (e.g., `c004_a[\d3]` matches all vector_search algorithms).

**Example:**

This example limits c001/c002/c003 to 2 concurrent instances per CPU. It also limits c004/c005 to 4 concurrent instances per GPU:

```json
{
    "max_workers": 10,
    "cpus": 4,
    "gpus": 2,
    "algorithms": [
        {
            "id_regex": "c00[123].*",
            "cpu_cost": 0.5,
            "gpu_cost": 0.0
        },
        {
            "id_regex": "c00[45].*",
            "cpu_cost": 0.0,
            "gpu_cost": 0.25
        }
    ]
}
```

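A minimal sketch of the kind of cost-based admission check this config implies (an illustration of the rule described above, not the actual slave scheduler; the function and field names are assumptions):

```python
import re

def can_start(algorithm_id: str, running: list[dict], config: dict) -> bool:
    """Return True if starting `algorithm_id` keeps total CPU/GPU cost within the slave's limits."""
    def cost_of(algo_id: str) -> tuple[float, float]:
        # First matching rule wins; unmatched algorithms are treated as not runnable.
        for rule in config["algorithms"]:
            if re.match(rule["id_regex"], algo_id):
                return rule["cpu_cost"], rule["gpu_cost"]
        raise ValueError(f"no rule matches {algo_id}")

    cpu_used = sum(cost_of(r["algorithm_id"])[0] for r in running)
    gpu_used = sum(cost_of(r["algorithm_id"])[1] for r in running)
    cpu_new, gpu_new = cost_of(algorithm_id)
    return (
        len(running) < config["max_workers"]
        and cpu_used + cpu_new <= config["cpus"]
        and gpu_used + gpu_new <= config["gpus"]
    )

# With the example config above: 8 running c001 batches fully use the 4 CPUs,
# so a 9th c001 batch is rejected while a c004 batch (GPU-only cost) still fits.
config = {
    "max_workers": 10, "cpus": 4, "gpus": 2,
    "algorithms": [
        {"id_regex": "c00[123].*", "cpu_cost": 0.5, "gpu_cost": 0.0},
        {"id_regex": "c00[45].*", "cpu_cost": 0.0, "gpu_cost": 0.25},
    ],
}
running = [{"algorithm_id": "c001_a001"}] * 8
print(can_start("c001_a001", running, config))   # False
print(can_start("c004_a001", running, config))   # True
```
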
# Finding your API Key

## Mainnet

@@ -71,11 +71,6 @@ class ClientManager:
async def update_config(request: Request):
    logger.debug("Received config update")
    new_config = await request.json()
    for k, v in new_config["job_manager_config"]["default_batch_sizes"].items():
        if v == 0:
            raise HTTPException(status_code=400, detail=f"Batch size for {k} cannot be 0")
        if (v & (v - 1)) != 0:
            raise HTTPException(status_code=400, detail=f"Batch size for {k} must be a power of 2")
    for x in new_config["algo_selection"]:
        if x["batch_size"] == 0:
            raise HTTPException(status_code=400, detail=f"Batch size for {x['algorithm_id']} cannot be 0")

@@ -49,7 +49,7 @@ class DataFetcher:
algorithms_data, benchmarks_data, challenges_data = list(executor.map(_get, tasks))

algorithms = {a["id"]: Algorithm.from_dict(a) for a in algorithms_data["algorithms"]}
wasms = {w["algorithm_id"]: Binary.from_dict(w) for w in algorithms_data["binarys"]}
binarys = {w["algorithm_id"]: Binary.from_dict(w) for w in algorithms_data["binarys"]}

precommits = {b["benchmark_id"]: Precommit.from_dict(b) for b in benchmarks_data["precommits"]}
benchmarks = {b["id"]: Benchmark.from_dict(b) for b in benchmarks_data["benchmarks"]}

@@ -74,7 +74,7 @@ class DataFetcher:
self._cache = {
    "block": block,
    "algorithms": algorithms,
    "wasms": wasms,
    "binarys": binarys,
    "precommits": precommits,
    "benchmarks": benchmarks,
    "proofs": proofs,

@@ -24,7 +24,7 @@ class JobManager:
    proofs: Dict[str, Proof],
    challenges: Dict[str, Challenge],
    algorithms: Dict[str, Algorithm],
    wasms: Dict[str, Binary],
    binarys: Dict[str, Binary],
    **kwargs
):
    api_url = CONFIG["api_url"]

@@ -69,20 +69,20 @@ class JobManager:
}
hash_threshold = self.hash_thresholds[x.details.block_started][x.settings.challenge_id]

wasm = wasms.get(x.settings.algorithm_id, None)
if wasm is None:
    logger.error(f"no wasm found for algorithm_id {x.settings.algorithm_id}")
bin = binarys.get(x.settings.algorithm_id, None)
if bin is None:
    logger.error(f"batch {x.id}: no binary-blob found for {x.settings.algorithm_id}. skipping job")
    continue
if wasm.details.download_url is None:
    logger.error(f"no download_url found for wasm {wasm.algorithm_id}")
if bin.details.download_url is None:
    logger.error(f"batch {x.id}: no download_url found for {bin.algorithm_id}. skipping job")
    continue
batch_size = next(
    (s["batch_size"] for s in algo_selection if s["algorithm_id"] == x.settings.algorithm_id),
    None
)
if batch_size is None:
    batch_size = config["default_batch_size"][x.settings.challenge_id]
    logger.error(f"No batch size found for algorithm_id {x.settings.algorithm_id}, using default {batch_size}")
    logger.error(f"batch {x.id}: no batch size found for {x.settings.algorithm_id}. skipping job")
    continue
num_batches = math.ceil(x.details.num_nonces / batch_size)
atomic_inserts = [
    (

@@ -112,11 +112,11 @@ class JobManager:
    x.details.num_nonces,
    num_batches,
    x.details.rand_hash,
    json.dumps(block.config["benchmarks"]["runtime_configs"]["wasm"]),
    json.dumps(block.config["benchmarks"]["runtime_config"]),
    batch_size,
    c_name,
    a_name,
    wasm.details.download_url,
    bin.details.download_url,
    x.details.block_started,
    hash_threshold
)

@@ -11,25 +11,12 @@ from master.client_manager import CONFIG

logger = logging.getLogger(os.path.splitext(os.path.basename(__file__))[0])

@dataclass
class AlgorithmSelectionConfig(FromDict):
    algorithm: str
    base_fee_limit: PreciseNumber
    num_nonces: int
    weight: float

@dataclass
class PrecommitManagerConfig(FromDict):
    max_pending_benchmarks: int
    algo_selection: Dict[str, AlgorithmSelectionConfig]

class PrecommitManager:
    def __init__(self):
        self.last_block_id = None
        self.num_precommits_submitted = 0
        self.algorithm_name_2_id = {}
        self.challenge_name_2_id = {}
        self.curr_base_fees = {}

    def on_new_block(
        self,

@@ -43,11 +30,6 @@ class PrecommitManager:
):
    self.last_block_id = block.id
    self.num_precommits_submitted = 0
    self.curr_base_fees = {
        c.details.name: c.block_data.base_fee
        for c in challenges.values()
        if c.block_data is not None
    }
    benchmark_stats_by_challenge = {
        c.details.name: {
            "solutions": 0,

@@ -94,12 +76,11 @@ class PrecommitManager:
"""
)["count"]

config = CONFIG["precommit_manager_config"]
algo_selection = CONFIG["algo_selection"]

num_pending_benchmarks = num_pending_jobs + self.num_precommits_submitted
if num_pending_benchmarks >= config["max_pending_benchmarks"]:
    logger.debug(f"number of pending benchmarks has reached max of {config['max_pending_benchmarks']}")
if num_pending_benchmarks >= CONFIG["max_concurrent_benchmarks"]:
    logger.debug(f"number of pending benchmarks has reached max of {CONFIG['max_concurrent_benchmarks']}")
    return
logger.debug(f"Selecting algorithm from: {[(x['algorithm_id'], x['weight']) for x in algo_selection]}")
selection = random.choices(algo_selection, weights=[x["weight"] for x in algo_selection])[0]

@@ -98,15 +98,13 @@ class SlaveManager:

@app.route('/get-batches', methods=['GET'])
def get_batch(request: Request):
    config = CONFIG["slave_manager_config"]

    if (slave_name := request.headers.get('User-Agent', None)) is None:
        return "User-Agent header is required", 403
    if not any(re.match(slave["name_regex"], slave_name) for slave in config["slaves"]):
    if not any(re.match(slave["name_regex"], slave_name) for slave in CONFIG["slaves"]):
        logger.warning(f"slave {slave_name} does not match any regex. rejecting get-batch request")
        raise HTTPException(status_code=403, detail="Unregistered slave")

    slave = next((slave for slave in config["slaves"] if re.match(slave["name_regex"], slave_name)), None)
    slave = next((slave for slave in CONFIG["slaves"] if re.match(slave["name_regex"], slave_name)), None)

    concurrent = []
    updates = []

@@ -129,7 +127,7 @@ class SlaveManager:
if (
    b["slave"] is None or
    b["start_time"] is None or
    (now - b["start_time"]) > config["time_before_batch_retry"]
    (now - b["start_time"]) > CONFIG["time_before_batch_retry"]
):
    b["slave"] = slave_name
    b["start_time"] = now

@@ -278,7 +276,6 @@ class SlaveManager:

    return {"status": "OK"}

config = CONFIG["slave_manager_config"]
thread = Thread(target=lambda: uvicorn.run(app, host="0.0.0.0", port=5115))
thread.daemon = True
thread.start()

@@ -92,8 +92,6 @@ class SubmissionsManager:
    )

    def run(self, submit_precommit_req: Optional[SubmitPrecommitRequest]):
        config = CONFIG["submissions_manager_config"]

        now = int(time.time() * 1000)
        if submit_precommit_req is None:
            logger.debug("no precommit to submit")

@@ -128,7 +126,7 @@ class SubmissionsManager:
    INNER JOIN job_data B
    ON A.benchmark_id = B.benchmark_id
    """,
    (config["time_between_retries"],)
    (CONFIG["time_between_resubmissions"],)
)

if benchmark_to_submit:

@@ -171,7 +169,7 @@ class SubmissionsManager:
    INNER JOIN job_data B
    ON A.benchmark_id = B.benchmark_id
    """,
    (config["time_between_retries"],)
    (CONFIG["time_between_resubmissions"],)
)

if proof_to_submit:

@@ -106,21 +106,8 @@ SELECT '
    "player_id": "0x0000000000000000000000000000000000000000",
    "api_key": "00000000000000000000000000000000",
    "api_url": "https://mainnet-api.tig.foundation",
    "submissions_manager_config": {
        "time_between_retries": 60000
    },
    "job_manager_config": {
        "default_batch_sizes": {
            "c001": 8,
            "c002": 8,
            "c003": 8,
            "c004": 8,
            "c005": 8
        }
    },
    "precommit_manager_config": {
        "max_pending_benchmarks": 4
    },
    "time_between_resubmissions": 60000,
    "max_concurrent_benchmarks": 4,
    "algo_selection": [
        {
            "algorithm_id": "c001_a001",
@@ -128,8 +115,7 @@ SELECT '
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
            "batch_size": 8
        },
        {
            "algorithm_id": "c002_a001",
@@ -137,8 +123,7 @@ SELECT '
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
            "batch_size": 8
        },
        {
            "algorithm_id": "c003_a001",
@@ -146,8 +131,7 @@ SELECT '
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
            "batch_size": 8
        },
        {
            "algorithm_id": "c004_a001",
@@ -155,8 +139,7 @@ SELECT '
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
            "batch_size": 8
        },
        {
            "algorithm_id": "c005_a001",
@@ -164,19 +147,16 @@ SELECT '
            "difficulty_range": [0, 0.5],
            "selected_difficulties": [],
            "weight": 1,
            "batch_size": 8,
            "base_fee_limit": "10000000000000000"
            "batch_size": 8
        }
    ],
    "slave_manager_config": {
        "time_before_batch_retry": 60000,
        "slaves": [
            {
                "name_regex": ".*",
                "algorithm_id_regex": ".*",
                "max_concurrent_batches": 1
            }
        ]
    }
    "time_before_batch_retry": 60000,
    "slaves": [
        {
            "name_regex": ".*",
            "algorithm_id_regex": ".*",
            "max_concurrent_batches": 1
        }
    ]
}'
WHERE NOT EXISTS (SELECT 1 FROM config);

@@ -67,7 +67,7 @@ serializable_struct_with_getters! {
    lifespan_period: u32,
    min_per_nonce_fee: PreciseNumber,
    min_base_fee: PreciseNumber,
    runtime_configs: HashMap<String, RuntimeConfig>,
    runtime_config: RuntimeConfig,
    target_solution_rate: u32,
    hash_threshold_max_percent_delta: f64,
}