Infura Mainnet Outage Post-Mortem 2020-11-11

Earlier today (2020-11-11) Infura experienced its most severe service interruption in our four years of operation. We realize that we are an important piece of infrastructure for many amazing products and projects. I’d like to apologize to all of our users and to the ecosystem. We recognize the faith that you place in us and we don’t ever take that lightly. I’d like to share the details of the incident with you so that there is transparency in what occurred and so that you can feel confident that our service will be better and even more resilient going forward.

07:13 UTC: Infura monitoring systems indicated that our core peering subsystem, which is the peer-to-peer ingress point into our infrastructure, began to lag behind the tip of the chain.

07:15 UTC: Our automated alerting systems detected a complete sync halt on several critical subsystems including our archive data subsystem and event log processing pipeline.

07:15 UTC: Automated alerts went out to our on-call engineers and triaging efforts began.

07:56 UTC: Additional engineers are called in to help with investigations.

08:04 UTC: Cause of the sync errors identified as a consensus failure at block 11234873.

08:25 UTC: An initial attempt at forcing our nodes past the bad block via debug modifications was ineffective.

09:45 UTC: We reached out to the go-ethereum team to report the consensus bug affecting clients v1.9.9 and v1.9.13.

09:55 UTC: Begin updating our infrastructure to a client version containing the consensus bug fix.

10:30 UTC: Subsystem upgrades are completed. Initial verification and compatibility testing begins.

10:38 UTC: Anomalous behavior detected in system integration tests.

11:20 UTC: Integration failure traced to a change in geth RPC path handling behavior between v1.9.9 and v1.9.19.

11:28 UTC: Hot-fix development begins.

12:00 UTC: Hot-fix rollout begins.

12:20 UTC: Hot-fix rollout of subsystems completes.

12:37 UTC: An increase in retry attempts from clients caused downstream service degradation in our permissioning subsystem.

12:38 UTC: API access for Core tier accounts is temporarily disabled to expedite the recovery of the permissioning subsystem.

13:07 UTC: JSON-RPC subsystem returns to nominal health. Core tier access is re-enabled. Archive data subsystem still operating in a degraded state.

14:28 UTC: Archive data subsystem returns to nominal health.

14:28 UTC: Incident marked as resolved.

What was the root cause?

A consensus bug affecting the Geth versions (v1.9.9 and v1.9.13) used for some internal systems caused block syncing to stall across several of those subsystems.
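For readers who run their own nodes, this kind of stall is visible from the outside as a head that stops advancing while the rest of the network moves on. The sketch below is illustrative only, not our internal monitoring: it compares a node's reported head against an independent reference endpoint over JSON-RPC, and both endpoint URLs are placeholders.

# Illustrative sketch (not Infura's internal monitoring): detect a stalled
# node by comparing its head block against an independent reference endpoint.
import json
import urllib.request

def block_number(endpoint):
    """Return the latest block number reported by a JSON-RPC endpoint."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.load(resp)["result"], 16)

LOCAL_NODE = "http://localhost:8545"              # placeholder: node being watched
REFERENCE = "https://reference-node.example/rpc"  # placeholder: independent endpoint

lag = block_number(REFERENCE) - block_number(LOCAL_NODE)
if lag > 5:  # the alert threshold here is arbitrary
    print(f"node is {lag} blocks behind the reference -- possible sync stall")
else:
    print("node is at or near the chain tip")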

Why was Infura running Geth v1.9.9 and v1.9.13 when the latest version is v1.9.23?

In the early days of Infura we would upgrade nodes as soon as the Geth or Parity teams cut a new release. We wanted the latest performance improvements, the latest API methods, and of course bug fixes. We stopped doing that because these changes sometimes brought instability or critical breaking issues that negatively impacted our users. Sometimes it was a syncing bug, a change in peering behavior that caused unforeseen issues within our infrastructure, or a slight modification to JSON-RPC behavior that forced developers to change their applications. No software is bug-free and not every release goes according to plan. We therefore decided that stability was more important than tracking the latest client version for new features and performance tweaks, and we became more conservative with our update schedule. We do our best to give developers a stable API to develop against, and we communicate any API changes well in advance to give our users time to make the necessary modifications to their applications.

We run a custom-patched version of Geth that we internally call “Omnibus”, which includes several performance, stability, and monitoring enhancements tailored to our cloud-native architecture. While this complicates the update process for us compared to running a vanilla Geth release, the benefits have been worthwhile, and we aim to be transparent about the version we run. It is visible both at https://forkmon.ethdevops.io and via our JSON-RPC API:

curl https://mainnet.infura.io/v3/[YOUR_PROJECT_ID] \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'

{"jsonrpc":"2.0","id":1,"result":"Geth/v1.9.19-omnibus-194c4769/linux-amd64/go1.14.11"}

Because of the concerns mentioned earlier about stability, backwards compatibility, and the complexity of patch management, we are very explicit and deliberate when we update our nodes. When there is a known consensus bug, we would of course update immediately. In this instance, however, we were not aware of a consensus issue with Geth v1.9.9 and v1.9.13.

One particularly painful aspect of this outage was that we were very close to updating to a client version that would have avoided this incident. We had scheduled an update for earlier this month, which we ended up postponing to give users more time to update and prepare for the changes and to ensure the stability of the upgrade.

How will this be prevented in the future?

We sincerely appreciate our partnership with the go-ethereum team: Péter, Martin, and Felix. Their quick responses during this incident greatly helped us reach a resolution. Our prior understanding was that consensus bugs, such as the one we encountered today, would be highlighted prominently in the release notes. Unfortunately, this was not the case, although we do understand the geth team’s reasoning behind it. While we have a good line of communication between our two teams, we have never expected any preferential treatment or inside access to information. No exchange, no miner, no API provider should be treated differently from a single node runner. Generally, I believe the way critical bug fixes like this are communicated to the community is something that should be discussed and improved. Specifically for Infura, we have three main action items:

  1. Knowing that consensus bugs may be fixed quietly in a new release, we will adjust our processes to track closer to the latest client releases while still balancing our responsibility to maintain stability and remain backwards compatible.
  2. We will increase our existing usage of other clients like OpenEthereum and re-introduce the Besu client into our infrastructure to add client diversity (a sketch of the kind of cross-client check this enables follows this list). We will also continue to track, test, and evaluate other clients such as Nethermind and Turbo-Geth, with the intention of introducing them into our stack if and when it makes sense.
  3. We will review our incident response procedures and determine whether there are any improvements we can implement to shorten our time to recovery.
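To illustrate the second action item, the sketch below shows the kind of cross-client check that client diversity enables: ask two nodes backed by different client implementations for the hash of the same block and flag any disagreement. This is illustrative only, not our production tooling, and the endpoint URLs are placeholders.

# Illustrative cross-client check (not Infura production tooling): compare
# the hash of the same block as reported by two different client
# implementations. Endpoint URLs below are placeholders.
import json
import urllib.request

def rpc(endpoint, method, params):
    """Send one JSON-RPC request and return its result."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    ).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["result"]

GETH_NODE = "http://geth.internal:8545"          # placeholder endpoint
OTHER_CLIENT_NODE = "http://besu.internal:8545"  # placeholder endpoint

# Compare at a height both nodes have reached, a few blocks behind the tip
# to avoid flagging ordinary short reorgs.
height = min(
    int(rpc(GETH_NODE, "eth_blockNumber", []), 16),
    int(rpc(OTHER_CLIENT_NODE, "eth_blockNumber", []), 16),
) - 6

hash_a = rpc(GETH_NODE, "eth_getBlockByNumber", [hex(height), False])["hash"]
hash_b = rpc(OTHER_CLIENT_NODE, "eth_getBlockByNumber", [hex(height), False])["hash"]

if hash_a != hash_b:
    print(f"clients disagree at block {height}: possible consensus split")
else:
    print(f"clients agree at block {height}")

A persistent disagreement at a settled height is exactly the kind of signal client diversity is meant to surface.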

This was an all-hands-on-deck event of a kind our team hasn’t seen since the network-wide attacks on Ethereum in 2016. I again want to thank our users and the community for your understanding and patience while we resolved this incident. We will continue to share more updates on our engineering improvements via our Community forum, blog, and newsletter. If you have any questions or concerns, please reach out to us at eg@infura.io, godsey@infura.io, or michael@infura.io.