The problem
Performing a chain upgrade requires running pd migrate using the most recent, post-upgrade version of pd. If the pre-upgrade version of pd is used instead, the migrated state ends up with a different app hash from the rest of the network, and the validator cannot reach consensus on post-upgrade blocks.
Proposed solution
The pd binary is already aware of the protocol-level APP_VERSION value, stored in the app crate.
- On startup, pd start should compare its current APP_VERSION with that of the local state it’s accessing.
- If the state pd is accessing has no APP_VERSION recorded, then it should write its own APP_VERSION locally, specifically to non-consensus storage.
- If pd accesses state with APP_VERSION greater than its APP_VERSION, it should exit with an error identifying the mismatch.
- If pd accesses state with APP_VERSION less than its APP_VERSION, it should exit with an error, recommending migration via pd migrate.
- During pd migrate, the updated APP_VERSION should be written to the state post-upgrade, so that subsequent runs of pd start show matching versions.
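The startup rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: VersionCheck, check_app_version, and error_message are hypothetical names rather than existing pd code, APP_VERSION is assumed to be a u64, and the error wording is made up.

```rust
use std::cmp::Ordering;

/// Hypothetical outcome of the startup check; not an existing pd type.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    /// No version recorded yet: write our own to non-consensus storage.
    RecordOwn(u64),
    /// State and binary agree: continue starting up.
    Proceed,
    /// State was written by a newer pd: exit with a mismatch error.
    StateTooNew { state: u64, binary: u64 },
    /// State predates this binary: exit and recommend `pd migrate`.
    NeedsMigration { state: u64, binary: u64 },
}

/// Compare the binary's APP_VERSION (assumed u64) with the version
/// recorded in local state, if any.
pub fn check_app_version(binary: u64, state: Option<u64>) -> VersionCheck {
    match state {
        None => VersionCheck::RecordOwn(binary),
        Some(found) => match found.cmp(&binary) {
            Ordering::Equal => VersionCheck::Proceed,
            Ordering::Greater => VersionCheck::StateTooNew { state: found, binary },
            Ordering::Less => VersionCheck::NeedsMigration { state: found, binary },
        },
    }
}

/// Operator-facing error text for the two failing outcomes
/// (illustrative wording only).
pub fn error_message(check: &VersionCheck) -> Option<String> {
    match check {
        VersionCheck::RecordOwn(_) | VersionCheck::Proceed => None,
        VersionCheck::StateTooNew { state, binary } => Some(format!(
            "state has APP_VERSION {state}, but this pd has APP_VERSION {binary}; use a newer pd"
        )),
        VersionCheck::NeedsMigration { state, binary } => Some(format!(
            "state has APP_VERSION {state}, but this pd has APP_VERSION {binary}; run `pd migrate` first"
        )),
    }
}
```

Keeping the decision separate from storage access means the four cases can be unit-tested without a node directory. In this sketch, pd migrate would finish by recording the new version, so the next pd start sees the Proceed case.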
Crucially, storing the APP_VERSION in non-consensus storage means the value does not contribute to the app hash, so the change is fully compatible with the existing protocol. If the pd start check (including the initial write of APP_VERSION) is implemented in a point release, then future upgrades can take advantage of the new defensive logic, as long as the node has run that point release before upgrading and running pd migrate.
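To illustrate why non-consensus storage keeps the change protocol-compatible, here is a toy model in which the app hash is computed over consensus data only. It is not pd's actual storage layer: ToyStorage, its methods, the key names, and the use of the standard library's DefaultHasher in place of a real Merkle root are all assumptions made for the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Toy stand-in for node storage (hypothetical; not pd's real API).
#[derive(Default)]
pub struct ToyStorage {
    /// Data that every validator must agree on.
    consensus: BTreeMap<String, Vec<u8>>,
    /// Node-local data that is never hashed into the app hash.
    nonconsensus: BTreeMap<String, Vec<u8>>,
}

impl ToyStorage {
    /// Stand-in for the app hash: covers consensus data only.
    pub fn app_hash(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.consensus.hash(&mut hasher);
        hasher.finish()
    }

    /// Write a key into consensus storage (changes the app hash).
    pub fn put_consensus(&mut self, key: &str, value: Vec<u8>) {
        self.consensus.insert(key.to_string(), value);
    }

    /// Write a key into non-consensus storage (app hash unaffected).
    pub fn put_nonconsensus(&mut self, key: &str, value: Vec<u8>) {
        self.nonconsensus.insert(key.to_string(), value);
    }

    /// Read a key back from non-consensus storage.
    pub fn get_nonconsensus(&self, key: &str) -> Option<&Vec<u8>> {
        self.nonconsensus.get(key)
    }
}
```

In this model, recording the version with put_nonconsensus leaves app_hash() unchanged, so a node that records APP_VERSION still agrees with peers that have not yet adopted the check.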
Additional context
During the chain upgrade on mainnet to v0.80.0, at height 501975, there was confusion about app hash mismatches when the network resumed, due to operator error: one validator operator mistakenly ran the pd migrate command using the old version of pd, i.e. v0.79.x, when they should have used v0.80.0 instead. This resulted in a different app hash in that validator’s state, preventing the network from reaching consensus on the first post-upgrade block. Fortunately, the problem was quickly diagnosed, and the validator was able to rerun the migration from backed-up state, resolving the problem and allowing the chain to resume.
The fact that the mishap occurred at all is a clear indicator of a rough edge in the upgrade tooling. After the upgrade, a brief issue was filed suggesting that the pd migrate command should inspect state versions and refuse to run if the version of the state and the tool don’t match. That suggestion was the inspiration for this proposal, which includes additional details on the implementation.