Pre-UIP: version-aware migrations for chain upgrades

The problem

Performing a chain upgrade requires running pd migrate using the most recent, post-upgrade version of pd. If the pre-upgrade version of pd is used instead, the migration produces state with a different app hash than the rest of the network, and the validator will emit incorrect blocks.

Proposed solution

The pd binary is already aware of the protocol-level APP_VERSION value, stored in the app crate.

  1. On startup, pd start should compare its current APP_VERSION with that of the local state it’s accessing.
  2. If the state pd is accessing has no APP_VERSION recorded, then it should write its own APP_VERSION locally, specifically to non-consensus storage.
  3. If pd accesses state with APP_VERSION greater than its APP_VERSION, it should exit with an error identifying the mismatch.
  4. If pd accesses state with APP_VERSION less than its APP_VERSION, it should exit with an error, recommending migration via pd migrate.
  5. During pd migrate, the updated APP_VERSION should be written to the state post-upgrade, so that subsequent runs of pd start show matching versions.
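
The startup decision in steps 1 through 4 can be sketched as a small pure function. This is only an illustration, not pd's actual code: the names StartupAction and check_app_version are hypothetical, and the sketch assumes APP_VERSION is a plain integer and that a missing record in non-consensus storage is represented as None.

```rust
use std::cmp::Ordering;

/// Hypothetical result of comparing the binary's APP_VERSION with the
/// version recorded in local (non-consensus) state.
#[derive(Debug, PartialEq, Eq)]
pub enum StartupAction {
    /// Step 2: nothing recorded yet, so write our own version and continue.
    RecordVersion(u64),
    /// Versions match: continue starting up.
    Proceed,
    /// Step 3: state is newer than this binary; exit with a mismatch error.
    StateTooNew { state: u64, binary: u64 },
    /// Step 4: state is older than this binary; exit and recommend `pd migrate`.
    MigrationRequired { state: u64, binary: u64 },
}

/// Decide what `pd start` should do, given the binary's APP_VERSION and the
/// version (if any) found in local state. Assumes versions are integers.
pub fn check_app_version(binary: u64, recorded: Option<u64>) -> StartupAction {
    match recorded {
        None => StartupAction::RecordVersion(binary),
        Some(state) => match state.cmp(&binary) {
            Ordering::Equal => StartupAction::Proceed,
            Ordering::Greater => StartupAction::StateTooNew { state, binary },
            Ordering::Less => StartupAction::MigrationRequired { state, binary },
        },
    }
}
```

Keeping the comparison separate from storage access and process exit makes each of the four outcomes easy to unit test; the caller would map the two error variants to a non-zero exit with an appropriate message.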

Crucially, the use of non-consensus storage for storing the APP_VERSION in local state ensures that the change is fully compatible with the existing protocol. If step 1 is implemented in a point release, then future upgrades could take advantage of the new defensive logic, as long as the node had run the point-release version prior to upgrading and running pd migrate.

Additional context

During the chain upgrade on mainnet to v0.80.0, at height 501975, there was confusion about app hash mismatches when the network resumed. The cause was operator error: one validator operator mistakenly ran the pd migrate command using the old version of pd, i.e. v0.79.x, when they should have used v0.80.0 instead. This resulted in a different app hash in that validator’s state, preventing the network from reaching consensus on the first post-upgrade block. Fortunately, the problem was quickly diagnosed, and the validator was able to rerun the migration from backed-up state, resolving the problem and allowing the chain to resume.

The fact that the mishap occurred at all is a clear indicator of a rough edge in the upgrade tooling. After the upgrade, a sparse issue was created suggesting that the pd migrate command should inspect state versions and refuse to run if the version of the state and the tool don’t match. That suggestion was the inspiration for this proposal, which includes additional details on the implementation.

This seems like a good idea. As a general comment, I think that changes of this size and scope (small, non-consensus-affecting changes to pd) wouldn’t normally require a UIP, but a small informational one is useful in this case: it has an impact on node operators’ upgrade processes, and having a numbered item to refer to will be useful there.

We should also recommend that migrations check this value and, if it is present, make sure that it matches the app version they expect. In a point release, we could add such an assertion, so that running pd migrate post point-release after a halt would fail, since the current migration would expect the 0.79 app version and not the 0.80 app version.
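
A rough sketch of such an assertion, combined with the post-migration write from step 5 of the proposal, might look like the following. Everything here is an assumption made for illustration: the HashMap stands in for pd's non-consensus storage, and APP_VERSION_KEY, run_migration, and the integer version numbers are hypothetical names and values, not pd's real API.

```rust
use std::collections::HashMap;

/// Hypothetical key under which the app version is recorded.
const APP_VERSION_KEY: &str = "app_version";

/// Run a migration only if the recorded app version (when present) matches
/// the version this migration expects to start from. On success, record the
/// post-upgrade version so a later `pd start` sees matching versions.
pub fn run_migration(
    nonconsensus: &mut HashMap<String, u64>, // stand-in for non-consensus storage
    expected_pre: u64,
    post: u64,
) -> Result<(), String> {
    if let Some(&found) = nonconsensus.get(APP_VERSION_KEY) {
        if found != expected_pre {
            return Err(format!(
                "state has app version {found}, but this migration expects {expected_pre}"
            ));
        }
    }
    // If no version is recorded, the check is skipped ("if present").

    // ... the actual state migration would happen here ...

    nonconsensus.insert(APP_VERSION_KEY.to_string(), post);
    Ok(())
}
```

With this shape, rerunning the same migration against already-migrated state fails instead of silently producing a different app hash, because the recorded version no longer matches the expected starting version.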