Rinkeby consensus post-mortem
This post intends to shed light on the Rinkeby test network issues seen on September 5, 2018. There is no intent to place blame or to be critical. It is written strictly to explain to the community, from our perspective, what we saw and how we reached a resolution.
Rinkeby is an Ethereum proof-of-authority test network running the Clique consensus protocol. A proof-of-authority network relies on trusted nodes designated as “signers” that have the ability to create blocks. A majority of signers on the network are required to validate the chain. Rinkeby has seven designated signers maintained by the following groups:
At approximately 11am PST on September 5th, 2018, our Rinkeby infrastructure began to produce alerts. On closer inspection, we noticed that our endpoints were up and responding; however, no new blocks were being generated.
The majority of the network was stuck at block number 2,940,002 while a handful of nodes had proceeded to block 2,940,005 and were stuck.
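A stall of this kind, where endpoints respond but the chain head does not advance, can be detected by comparing successive block heights. The following is a minimal sketch, assuming heights are sampled with the standard `eth_blockNumber` JSON-RPC method, which returns a hex string. The function names and the three-sample window are hypothetical and purely illustrative; they do not describe the alerting that fired during the incident.

```python
from typing import Sequence


def parse_block_number(hex_quantity: str) -> int:
    """Convert an eth_blockNumber result such as '0x2cdc62' to an int."""
    return int(hex_quantity, 16)


def is_stalled(samples: Sequence[str], min_samples: int = 3) -> bool:
    """Return True when the last `min_samples` readings show no progress.

    `samples` holds hex block numbers in the order they were polled.
    The window size is an assumption; tune it to the polling interval
    and the network's block period.
    """
    if len(samples) < min_samples:
        return False  # not enough data to judge
    recent = [parse_block_number(s) for s in samples[-min_samples:]]
    return max(recent) == min(recent)


# Block 2,940,002 is 0x2cdc62 in hex.
print(is_stalled(["0x2cdc60", "0x2cdc62", "0x2cdc62", "0x2cdc62"]))
print(is_stalled(["0x2cdc62", "0x2cdc63", "0x2cdc64"]))
```

The first call reports a stall because the last three readings are identical; the second does not because the height is still increasing.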
Looking closer, we found that the nodes that had successfully moved to block 2,940,005 were running the latest version of Geth, v1.8.16. This included three of the five healthy signer nodes (two were running older versions, and two more were unhealthy). Therefore, a majority of the signers were running version v1.8.14 or higher. However, it turned out that a Clique-specific consensus bug had been introduced in v1.8.14.
We were in active communication with the Geth dev team, working to identify a plan to move forward. While we dug into the last block before the network hung, looking for clues, we also upgraded our signer node to Geth v1.8.16. With our signer on the other chain fork, that chain came into consensus and began producing blocks. This was short-lived as well: the fork only reached block 2,940,035 before it also became stuck with a consensus failure.
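The signer arithmetic behind this can be sketched as follows. As we understand the Clique specification (EIP-225), each signer may seal at most one block out of every floor(N/2) + 1 consecutive blocks, so a fork needs at least that many active signers to keep producing blocks. The function names below are hypothetical, and the sketch ignores in-turn and out-of-turn timing as well as the bug itself.

```python
def signer_limit(signer_count: int) -> int:
    """Window size from EIP-225: a signer may seal only one block
    out of every floor(N / 2) + 1 consecutive blocks."""
    return signer_count // 2 + 1


def can_progress(active_signers_on_fork: int, signer_count: int) -> bool:
    """A fork keeps producing blocks only if enough distinct signers
    are sealing on it to fill every window of `signer_limit` blocks."""
    return active_signers_on_fork >= signer_limit(signer_count)


TOTAL_SIGNERS = 7  # Rinkeby's designated signer count

print(signer_limit(TOTAL_SIGNERS))     # threshold for seven signers
print(can_progress(3, TOTAL_SIGNERS))  # three upgraded signers alone
print(can_progress(4, TOTAL_SIGNERS))  # after a fourth signer joins
```

With seven signers the threshold is four, so three signers on a fork fall short, while a fourth signer joining that fork meets it.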
At this point, the decision was made to roll back ALL signer nodes to v1.8.13. Once a majority of signers were running the rolled-back version, the network came back into consensus and, after a chain re-org, the entire network moved forward together on one Rinkeby chain.
The Geth development team will address the consensus bug in a future release. Nodes running versions later than v1.8.13 are able to participate on the network and DO NOT need to be rolled back. This bug is specific to the Clique consensus protocol, so it only impacts signer nodes and only affects Clique-based PoA networks such as Rinkeby. Private chains using Clique may also be affected.
We recognize the critical role the test networks play in the ongoing progress of the entire ecosystem and appreciate the community's patience as we worked through this. As always, we appreciate your constructive and respectful feedback!