ceremonyclient/consensus
v2.1.0.2 (#442)
Cassandra Heart, commit 53f7c2b5c9, 2025-10-23

* v2.1.0.2

* restore tweaks to simlibp2p

* fix: nil ref on size calc

* fix: panic should induce shutdown from event_distributor

* fix: friendlier initialization that requires less manual kickstarting for test/devnets

* fix: fewer available shards than provers should choose shard length

* fix: update stored worker registry, improve logging for debug mode

* fix: shut the fuck up, peer log

* qol: log value should be snake cased

* fix: non-archive snap sync issues

* fix: separate X448/Decaf448 signed keys, add onion key to registry

* fix: overflow arithmetic on frame number comparison

* fix: worker registration should be idempotent if inputs are same, otherwise permit updated records

* fix: remove global prover state from size calculation

* fix: divide by zero case

* fix: eager prover

* fix: broadcast listener default

* qol: diagnostic data for peer authenticator

* fix: master/worker connectivity issue in sparse networks

tight coupling of peers and workers can sometimes interfere if the mesh is sparse, so give workers a pseudo-identity but publish messages with the proper peer key

* fix: reorder steps of join creation

* fix: join verify frame source + ensure domain is properly padded (unnecessary but good for consistency)

* fix: add delegate to protobuf <-> reified join conversion

* fix: preempt prover from planning with no workers

* fix: use the unallocated workers to generate a proof

* qol: underflow causes join fail in first ten frames on test/devnets

* qol: small logging tweaks for easier log correlation in debug mode

* qol: use fisher-yates shuffle to ensure prover allocations are evenly distributed when scores are equal

* qol: separate decisional logic on post-enrollment confirmation into consensus engine, proposer, and worker manager where relevant, refactor out scoring

* reuse shard descriptors for both join planning and confirm/reject decisions

* fix: add missing interface method and amend test blossomsub to use new peer id basis

* fix: only check allocations if they exist

* fix: pomw mint proof data needs to be hierarchically under global intrinsic domain

* staging temporary state under diagnostics

* fix: first phase of distributed lock refactoring

* fix: compute intrinsic locking

* fix: hypergraph intrinsic locking

* fix: token intrinsic locking

* fix: update execution engines to support new locking model

* fix: adjust tests with new execution shape

* fix: weave in lock/unlock semantics to liveness provider

* fix lock fallthrough, add missing allocation update

* qol: additional logging for diagnostics, also testnet/devnet handling for confirmations

* fix: establish grace period on halt scenario to permit recovery

* fix: support test/devnet defaults for coverage scenarios

* fix: nil ref on consensus halts for non-archive nodes

* fix: remove unnecessary prefix from prover ref

* add test coverage for fork choice behaviors and replay – once passing, blocker (2) is resolved

* fix: no fork replay on repeat for non-archive nodes, snap now behaves correctly

* rollup of pre-liveness check lock interactions

* ahead of tests, get the protobuf/metrics-related changes out so teams can prepare

* add test coverage for distributed lock behaviors – once passing, blocker (3) is resolved

* fix: blocker (3)

* Dev docs improvements (#445)

* Make install deps script more robust

* Improve testing instructions

* Worker node should stop upon OS SIGINT/SIGTERM signal (#447)

* move pebble close to Stop()

* move deferred Stop() to Start()

* add core id to worker stop log message

* create done os signal channel and stop worker upon message to it

---------

Co-authored-by: Cassandra Heart <7929478+CassOnMars@users.noreply.github.com>

---------

Co-authored-by: Daz <daz_the_corgi@proton.me>
Co-authored-by: Black Swan <3999712+blacks1ne@users.noreply.github.com>

Consensus State Machine

A generic, extensible state machine implementation for building Byzantine Fault Tolerant (BFT) consensus protocols. This library provides a framework for implementing round-based consensus algorithms with cryptographic proofs.

Overview

The state machine manages consensus engine state transitions through a well-defined set of states and events. It supports generic type parameters to allow different implementations of state data, votes, peer identities, and collected mutations.

Features

  • Generic Implementation: Supports custom types for state data, votes, peer IDs, and collected data
  • Byzantine Fault Tolerance: Provides BFT consensus tolerating fewer than 1/3 Byzantine nodes, with the flexibility to host other probabilistic BFT implementations
  • Round-based Consensus: Implements a round-based state transition pattern
  • Pluggable Providers: Extensible through provider interfaces for different consensus behaviors
  • Event-driven Architecture: State transitions triggered by events with optional guard conditions
  • Concurrent Safe: Thread-safe implementation with proper mutex usage
  • Timeout Support: Configurable timeouts for each state with automatic transitions
  • Transition Listeners: Observable state transitions for monitoring and debugging

Core Concepts

States

The state machine progresses through the following states:

  1. StateStopped: Initial state, engine is not running
  2. StateStarting: Engine is initializing
  3. StateLoading: Loading data and syncing with network
  4. StateCollecting: Collecting data/mutations for consensus round
  5. StateLivenessCheck: Checking peer liveness before proving
  6. StateProving: Generating cryptographic proof (leader only)
  7. StatePublishing: Publishing proposed state
  8. StateVoting: Voting on proposals
  9. StateFinalizing: Finalizing consensus round
  10. StateVerifying: Verifying and publishing results
  11. StateStopping: Engine is shutting down

Events

Events trigger state transitions:

  • EventStart, EventStop: Lifecycle events
  • EventSyncComplete: Synchronization finished
  • EventCollectionDone: Mutation collection complete
  • EventLivenessCheckReceived: Peer liveness confirmed
  • EventProverSignal: Leader selection complete
  • EventProofComplete: Proof generation finished
  • EventProposalReceived: New proposal received
  • EventVoteReceived: Vote received
  • EventQuorumReached: Voting quorum achieved
  • EventConfirmationReceived: State confirmation received
  • And more...

Type Constraints

All generic type parameters must implement the Unique interface:

type Unique interface {
    Identity() Identity  // Returns a unique string identifier
}
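
A minimal Unique implementation might look like the sketch below. FrameState is a hypothetical type used only for illustration; it follows the convention of the Basic Setup examples further down, which return a plain string and therefore assume Identity is string-backed. Whatever you return must be stable and collision-free, because the state machine keys its proposal and vote maps by Identity (see the provider interfaces below).

type FrameState struct {
    FrameNumber uint64
    Digest      string // e.g. a hex-encoded hash of the frame contents
}

func (f FrameState) Identity() string { return f.Digest }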

Provider Interfaces

SyncProvider

Handles initial state synchronization:

type SyncProvider[StateT Unique] interface {
    Synchronize(
        existing *StateT,
        ctx context.Context,
    ) (<-chan *StateT, <-chan error)
}
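
For a test or single-node network there may be nothing to synchronize. The sketch below is a toy provider under that assumption: it simply echoes the existing state back and closes both channels. It reuses the MyState type from the Basic Setup section and assumes the usual context import; a real provider would stream states fetched from peers instead.

type MySyncProvider struct{}

func (p *MySyncProvider) Synchronize(
    existing *MyState,
    ctx context.Context,
) (<-chan *MyState, <-chan error) {
    states := make(chan *MyState, 1)
    errs := make(chan error, 1)
    go func() {
        defer close(states)
        defer close(errs)
        // Nothing to fetch from peers in this toy provider: hand back the
        // state we were given unless the caller cancels first.
        select {
        case states <- existing:
        case <-ctx.Done():
            errs <- ctx.Err()
        }
    }()
    return states, errs
}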

VotingProvider

Manages the voting process:

type VotingProvider[StateT Unique, VoteT Unique, PeerIDT Unique] interface {
    SendProposal(proposal *StateT, ctx context.Context) error
    DecideAndSendVote(
        proposals map[Identity]*StateT,
        ctx context.Context,
    ) (PeerIDT, *VoteT, error)
    IsQuorum(votes map[Identity]*VoteT, ctx context.Context) (bool, error)
    FinalizeVotes(
        proposals map[Identity]*StateT,
        votes map[Identity]*VoteT,
        ctx context.Context,
    ) (*StateT, PeerIDT, error)
    SendConfirmation(finalized *StateT, ctx context.Context) error
}
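
The quorum rule is where the 2f+1 threshold mentioned under Best Practices lives. The sketch below implements only IsQuorum, assuming a fixed validator set of size n = 3f+1 and the MyVote type from Basic Setup; consensus.Identity assumes the package is imported as consensus, as in Basic Setup. The broadcast field is a hypothetical transport hook (not part of this package) reused in the Best Practices sketch; the remaining methods depend on your transport and fork-choice rules and are omitted here.

type MyVotingProvider struct {
    networkSize int             // total number of validators, assumed fixed
    broadcast   func(any) error // hypothetical transport hook, not part of this package
}

func (p *MyVotingProvider) IsQuorum(
    votes map[consensus.Identity]*MyVote,
    ctx context.Context,
) (bool, error) {
    // Classic BFT threshold: with n = 3f+1 validators, any 2f+1 votes are
    // guaranteed to include at least f+1 honest voters.
    f := (p.networkSize - 1) / 3
    return len(votes) >= 2*f+1, nil
}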

LeaderProvider

Handles leader selection and proof generation:

type LeaderProvider[
    StateT Unique,
    PeerIDT Unique,
    CollectedT Unique,
] interface {
    GetNextLeaders(prior *StateT, ctx context.Context) ([]PeerIDT, error)
    ProveNextState(
        prior *StateT,
        collected CollectedT,
        ctx context.Context,
    ) (*StateT, error)
}
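
Leader selection has to be deterministic: every honest node must derive the same leader set from the same prior state. The toy sketch below rotates through a fixed, identically ordered validator list and fills in a placeholder proof; it reuses the MyState, MyPeerID, and MyCollected types from Basic Setup and assumes the errors and fmt imports. A real ProveNextState would build the actual cryptographic proof.

type MyLeaderProvider struct {
    validators []MyPeerID // assumed identical and identically ordered on every node
}

func (p *MyLeaderProvider) GetNextLeaders(prior *MyState, ctx context.Context) ([]MyPeerID, error) {
    if len(p.validators) == 0 {
        return nil, errors.New("no validators configured")
    }
    // Round-robin on the prior round number. Any deterministic function of
    // the prior state (a hash, a score table) works, as long as every
    // honest node computes the same leader set.
    var round uint64
    if prior != nil {
        round = prior.Round
    }
    leader := p.validators[round%uint64(len(p.validators))]
    return []MyPeerID{leader}, nil
}

func (p *MyLeaderProvider) ProveNextState(
    prior *MyState,
    collected MyCollected,
    ctx context.Context,
) (*MyState, error) {
    // A real provider builds its cryptographic proof here; the toy version
    // just advances the round and records a placeholder digest.
    next := &MyState{Round: 1, Hash: fmt.Sprintf("%x", collected.Data)}
    if prior != nil {
        next.Round = prior.Round + 1
    }
    return next, nil
}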

LivenessProvider

Manages peer liveness checks:

type LivenessProvider[
    StateT Unique,
    PeerIDT Unique,
    CollectedT Unique,
] interface {
    Collect(ctx context.Context) (CollectedT, error)
    SendLiveness(prior *StateT, collected CollectedT, ctx context.Context) error
}
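
A LivenessProvider typically gathers whatever this node wants to contribute to the round and announces that it is alive. The sketch below drains an in-memory queue of pending mutations; publish is a hypothetical transport hook (not part of this package), the MyState and MyCollected types come from Basic Setup, and the sync import is assumed.

type MyLivenessProvider struct {
    mu      sync.Mutex
    pending []byte             // mutations queued since the last round
    publish func([]byte) error // hypothetical transport hook, not part of this package
}

func (p *MyLivenessProvider) Collect(ctx context.Context) (MyCollected, error) {
    p.mu.Lock()
    defer p.mu.Unlock()
    // Hand the queued mutations to the state machine and reset the queue.
    data := p.pending
    p.pending = nil
    return MyCollected{Data: data}, nil
}

func (p *MyLivenessProvider) SendLiveness(
    prior *MyState,
    collected MyCollected,
    ctx context.Context,
) error {
    // Tell peers this node is live and what it intends to contribute this
    // round; encoding and transport are left to the publish hook.
    return p.publish(collected.Data)
}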

Usage

Basic Setup

// Define your types implementing Unique
type MyState struct {
    Round uint64
    Hash  string
}
func (s MyState) Identity() string { return s.Hash }

type MyVote struct {
    Voter string
    Value bool
}
func (v MyVote) Identity() string { return v.Voter }

type MyPeerID struct {
    ID string
}
func (p MyPeerID) Identity() string { return p.ID }

type MyCollected struct {
    Data []byte
}
func (c MyCollected) Identity() string { return string(c.Data) }

// Implement providers
syncProvider := &MySyncProvider{}
votingProvider := &MyVotingProvider{}
leaderProvider := &MyLeaderProvider{}
livenessProvider := &MyLivenessProvider{}

// Create state machine
sm := consensus.NewStateMachine[MyState, MyVote, MyPeerID, MyCollected](
    MyPeerID{ID: "node1"},           // This node's ID
    &MyState{Round: 0, Hash: "genesis"}, // Initial state
    true,                            // shouldEmitReceiveEventsOnSends
    3,                              // minimumProvers
    syncProvider,
    votingProvider,
    leaderProvider,
    livenessProvider,
    nil,                            // Optional trace logger
)

// Add transition listener
sm.AddListener(&MyTransitionListener{})

// Start the state machine
if err := sm.Start(); err != nil {
    log.Fatal(err)
}

// Receive external events
sm.ReceiveProposal(peer, proposal)
sm.ReceiveVote(voter, vote)
sm.ReceiveLivenessCheck(peer, collected)
sm.ReceiveConfirmation(peer, confirmation)

// Stop the state machine
if err := sm.Stop(); err != nil {
    log.Fatal(err)
}
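
The v2.1.0.2 changelog above wires worker shutdown to OS signals; the same pattern fits here so that Stop() runs and the Stopping state gets its cleanup window. The sketch below uses only the standard library (os/signal and syscall are not part of this package):

sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
go func() {
    <-sigs
    // Trigger a graceful shutdown; see the 30-second Stopping timeout below.
    if err := sm.Stop(); err != nil {
        log.Println("stop:", err)
    }
}()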

Implementing Providers

See example/generic_consensus_example.go for a complete working example with mock provider implementations.

State Flow

The typical consensus flow:

  1. Start → Starting → Loading
  2. Loading: Synchronize with network
  3. Collecting: Gather mutations/changes
  4. LivenessCheck: Verify peer availability
  5. Proving: Leader generates proof
  6. Publishing: Leader publishes proposal
  7. Voting: All nodes vote on proposals
  8. Finalizing: Aggregate votes and determine outcome
  9. Verifying: Confirm and apply state changes
  10. Loop back to Collecting for next round

Configuration

Constructor Parameters

  • id: This node's peer ID
  • initialState: Starting state (can be nil)
  • shouldEmitReceiveEventsOnSends: Whether to emit receive events for own messages
  • minimumProvers: Minimum number of active provers required
  • traceLogger: Optional logger for debugging state transitions

State Timeouts

Each state can have a configured timeout that triggers an automatic transition:

  • Starting: 1 second → EventInitComplete
  • Loading: 10 minutes → EventSyncComplete
  • Collecting: 1 second → EventCollectionDone
  • LivenessCheck: 1 second → EventLivenessTimeout
  • Proving: 120 seconds → EventPublishTimeout
  • Publishing: 1 second → EventPublishTimeout
  • Voting: 10 seconds → EventVotingTimeout
  • Finalizing: 1 second → EventAggregationDone
  • Verifying: 1 second → EventVerificationDone
  • Stopping: 30 seconds → EventCleanupComplete

Thread Safety

The state machine is thread-safe. All public methods properly handle concurrent access through mutex locks. State behaviors run in separate goroutines with proper cancellation support.
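
In practice this means the Receive* methods can be called directly from your transport goroutines without any external locking. A sketch in the same fragmentary style as Basic Setup, where incomingVotes and incomingProposals are hypothetical channels of small envelope structs produced by your network layer:

// Votes and proposals may arrive on different goroutines; no external
// locking is needed because the state machine synchronizes internally.
go func() {
    for v := range incomingVotes {
        sm.ReceiveVote(v.Peer, v.Vote)
    }
}()
go func() {
    for p := range incomingProposals {
        sm.ReceiveProposal(p.Peer, p.Proposal)
    }
}()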

Error Handling

  • Provider errors are logged but don't crash the state machine
  • The state machine continues operating and may retry operations
  • Critical errors during state transitions are returned to callers
  • Use the TraceLogger interface for debugging

Best Practices

  1. Message Isolation: When implementing providers, always deep-copy data before sending to prevent shared state between the state machine and other handlers (see the sketch after this list)
  2. Nil Handling: Provider implementations should handle nil prior states gracefully
  3. Context Usage: Respect context cancellation in long-running operations
  4. Quorum Size: Set appropriate quorum size based on your network (typically 2f+1 for f failures)
  5. Timeout Configuration: Adjust timeouts based on network conditions and proof generation time
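
The sketch below illustrates the deep-copy rule from item 1, continuing the MyVotingProvider sketch from the VotingProvider section (broadcast is the hypothetical transport hook introduced there, and MyState comes from Basic Setup):

func (p *MyVotingProvider) SendProposal(proposal *MyState, ctx context.Context) error {
    // Copy before handing off: the state machine may keep working with its
    // own pointer after this call returns, and the transport may serialize
    // lazily on another goroutine.
    clone := *proposal // MyState holds only value fields, so a shallow copy
                       // is a deep copy here; types containing slices or
                       // maps need a real clone.
    return p.broadcast(&clone)
}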

Example

See example/generic_consensus_example.go for a complete working example demonstrating:

  • Mock provider implementations
  • Multi-node consensus network
  • Byzantine node behavior
  • Message passing between nodes
  • State transition monitoring

Testing

The package includes comprehensive tests in state_machine_test.go covering:

  • State transitions
  • Event handling
  • Concurrent operations
  • Byzantine scenarios
  • Timeout behavior