The problem
Performing a chain upgrade requires running pd migrate using the most recent, post-upgrade version of pd. If the pre-upgrade version of pd is used instead, errors will occur and the validator will emit incorrect blocks.
Proposed solution
The pd binary is already aware of the protocol-level APP_VERSION value, stored in the app crate.
1. On startup, pd start should compare its current APP_VERSION with that of the local state it’s accessing.
   - If the state pd is accessing has no APP_VERSION recorded, then it should write its own APP_VERSION locally, specifically to non-consensus storage.
   - If pd accesses state with APP_VERSION greater than its APP_VERSION, it should exit with an error identifying the mismatch.
   - If pd accesses state with APP_VERSION less than its APP_VERSION, it should exit with an error, recommending migration via pd migrate.
2. During pd migrate, the updated APP_VERSION should be written to the state post-upgrade, so that subsequent runs of pd start show matching versions.
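The startup comparison described above can be sketched as a small pure function. This is only an illustration of the decision table, not the actual pd implementation: the names check_app_version, VersionCheck, and VersionError are hypothetical, and the sketch assumes APP_VERSION is a plain unsigned integer that has already been read from (or found missing in) local state.

```rust
use std::cmp::Ordering;
use std::fmt;

/// Successful outcomes of the startup check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum VersionCheck {
    /// No version was recorded in local state; the caller should write
    /// `binary` to non-consensus storage and continue.
    Initialize { binary: u64 },
    /// The recorded version equals the binary's version; startup may proceed.
    Match,
}

/// Failure outcomes of the startup check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum VersionError {
    /// The state was written by a newer binary than the one now running.
    StateTooNew { state: u64, binary: u64 },
    /// The state predates the binary and must be migrated first.
    MigrationRequired { state: u64, binary: u64 },
}

impl fmt::Display for VersionError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            VersionError::StateTooNew { state, binary } => write!(
                f,
                "state has APP_VERSION {} but this pd binary is at {}: the binary is older than the state",
                state, binary
            ),
            VersionError::MigrationRequired { state, binary } => write!(
                f,
                "state has APP_VERSION {} but this pd binary is at {}: run `pd migrate` before `pd start`",
                state, binary
            ),
        }
    }
}

impl std::error::Error for VersionError {}

/// Compare the binary's APP_VERSION with the one recorded in local state.
/// `state` is `None` when no version has been recorded yet.
pub fn check_app_version(binary: u64, state: Option<u64>) -> Result<VersionCheck, VersionError> {
    match state {
        None => Ok(VersionCheck::Initialize { binary }),
        Some(found) => match found.cmp(&binary) {
            Ordering::Equal => Ok(VersionCheck::Match),
            Ordering::Greater => Err(VersionError::StateTooNew { state: found, binary }),
            Ordering::Less => Err(VersionError::MigrationRequired { state: found, binary }),
        },
    }
}
```

Keeping the comparison free of I/O makes each of the four cases easy to unit test. In this sketch, pd start would read the recorded version, call the function, and either persist its own version, continue, or exit with the error message.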
Crucially, storing the APP_VERSION in non-consensus storage within local state ensures that the change is fully compatible with the existing protocol. If 1) is implemented in a point release, then future upgrades could take advantage of the new defensive logic, as long as the node had run the point-release version prior to upgrading and running pd migrate.
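To illustrate why non-consensus storage keeps the change protocol-compatible, here is a toy model in which only the consensus map feeds the app hash. Everything in it is an assumption made for illustration: ToyState, the key name app_version/local, and the use of the standard library hasher stand in for the real storage layer and Merkle root, which this sketch does not attempt to reproduce.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical key under which the local APP_VERSION is recorded.
const APP_VERSION_KEY: &str = "app_version/local";

/// Toy state with two key-value maps: one that feeds the app hash and
/// one that is purely node-local.
#[derive(Default)]
pub struct ToyState {
    consensus: BTreeMap<String, Vec<u8>>,
    nonconsensus: BTreeMap<String, Vec<u8>>,
}

impl ToyState {
    /// Stand-in for the app hash: it covers consensus data only.
    pub fn app_hash(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.consensus.hash(&mut hasher);
        hasher.finish()
    }

    /// Write a value that all validators must agree on.
    pub fn put_consensus(&mut self, key: &str, value: &[u8]) {
        self.consensus.insert(key.to_string(), value.to_vec());
    }

    /// Record the APP_VERSION locally, outside of consensus data.
    /// In this sketch, both the first `pd start` and `pd migrate` would call this.
    pub fn record_app_version(&mut self, version: u64) {
        self.nonconsensus
            .insert(APP_VERSION_KEY.to_string(), version.to_le_bytes().to_vec());
    }

    /// Read back the locally recorded APP_VERSION, if any.
    pub fn recorded_app_version(&self) -> Option<u64> {
        let bytes = self.nonconsensus.get(APP_VERSION_KEY)?;
        if bytes.len() != 8 {
            return None;
        }
        let mut array = [0u8; 8];
        array.copy_from_slice(bytes);
        Some(u64::from_le_bytes(array))
    }
}
```

Because the recorded version never enters the hashed map, a node that writes it computes the same app hash as a node that does not. That is the property that would allow the check to ship in a point release without a coordinated upgrade.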
Additional context
During the chain upgrade on mainnet to v0.80.0, at height 501975, there was confusion about app hash mismatches when the network resumed, due to operator error: one validator operator mistakenly ran the pd migrate command using the old version of pd, i.e. 0.79.x, when they should have used v0.80.0 instead. This resulted in a different app hash in that validator’s state, preventing the network from reaching consensus on the first post-upgrade block. Fortunately, the problem was quickly diagnosed, and the validator was able to rerun the migration from backed-up state, resolving the problem and allowing the chain to resume.
The fact that the mishap occurred at all is a clear indicator of a rough edge in the upgrade tooling. After the upgrade, a sparse issue was created suggesting that the pd migrate command should inspect state versions and refuse to run if the version of the state and the tool don’t match. That suggestion was the inspiration for this proposal, which includes additional details on the implementation.