Optimizing Performance and Cost: Infura Benchmark Analysis

As usage of the Ethereum network by real-world decentralized applications like CryptoKitties has exploded over the past year, the infrastructure demands to support this traffic have grown exponentially. Infura is committed to being a core infrastructure provider that ensures all of this request traffic, whether reading data or interacting with contracts on the Ethereum blockchain, is served reliably and as fast as possible.

Delivering infrastructure at this scale comes at a real price, though. If you have tried to run a full Ethereum node, you know that gone are the days when it was possible to run one on a laptop or a simple cloud virtual machine.

At Infura, as we continue to scale our infrastructure to serve the growth and needs of those who prefer to use our service as their back-end, we are constantly evaluating which infrastructure choices give us the best balance of performance and cost.

The following analysis represents some of the diligence and findings behind our infrastructure decisions given the current state of the Ethereum blockchain. The analysis focuses on AWS EC2 instances, but the findings can potentially inform decision-making on other public cloud providers as well.

Executive Summary

In this exercise, we will evaluate our existing infrastructure based on m5.xlarge instance types against storage-optimized i3.xlarge instance types.

The instance specifications follow.

m5.xlarge
4 vCPU
16 GiB memory
EBS storage only
$0.192 per hour

i3.xlarge
4 vCPU
30.5 GiB memory
1 x 950 GB NVMe SSD
$0.312 per hour

Prices based on the US East region.

The i3.xlarge instance type has approximately double the memory and provides locally attached NVMe SSD storage rather than EBS; however, it costs around 60% more than the m5.xlarge. Historically, Infura has relied on general-purpose instance types (the “m” series), but more and more, the real bottleneck to scaling rapidly and adding node capacity quickly has been disk performance. This led us to research the storage-optimized instance types (the “i” series).

After benchmarking the i3.xlarge EC2 instance and comparing it to an m5.xlarge instance, we conclude that one i3.xlarge node should be able to handle at least the same load as two m5.xlarge nodes, representing an overall cost reduction of 23% at the same level of performance.

Introduction

We want to explore the possibility of moving some (or all) of our RPC geth nodes to i3 EC2 instances, instead of the current armada of m5 (and m4/c5) instances. One nice thing about the i3 family is that it includes local SSD storage, whereas SSD storage on an mX node is usually backed by EBS. In theory, that switch should let us handle more I/O load per instance, and therefore save money.

Before making such a switch, though, we need a better understanding of how their performance compares, as well as of how we are going to keep the chaindata persistent across a node's lifetime. Once we have a grasp of those things, we can evaluate whether the i3 makes economic sense compared to using the m5 for geth nodes.

Benchmarking RPC and chaindata sync

These benchmarks were all run against an i3.xlarge node and an m5.xlarge node, both residing on the same subnet. The geth version under test on both nodes is geth-linux-amd64-56152b31.
 
The benchmarking is very noisy. In general it is clear that the i3.xlarge performs better than the m5.xlarge; it is a bit trickier to quantify by how much, though, as there is high variance between different RPC calls, and I'm not 100% familiar with what geth is doing at all times. For example, at times it appeared to be occupied with (I think) replaying the whole chain and validating its internal state.
 
Sometimes I cleared the filesystem caches, but not between every run; in general this was done after switching the type of test, or if I wanted to double-check something. The command used was:

systemctl stop geth && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start geth

The load incurred on the nodes was measured by looking at the CPU % and Disk Utilization % in New Relic, as well as occasionally glancing at dstat/iotop on the nodes in question.
 
Besides iterating and experimenting with various parameters and run lengths, to be able to compare the data I generally looked at the numbers at roughly the 5-minute and 10-minute marks. Sometimes I let a run go for 30 minutes or longer.

Benchmark of chaindata geth syncing

The sync benchmarks on both the m5.xlarge node and the i3.xlarge node were done while the nodes were far enough behind that they would not be able to synchronize fully within the duration of the test.

In general, I felt that the I/O was a bit spikier on the i3 during this test. The percentage figure represents approximately the same number of MB/s, but I could see times when the i3 was bursting upwards of 100 MB/s.
 
Just looking at this table, one could think that the i3 should be about as good as an m5 on the CPU-intensive operations, as it leaves about the same amount of CPU left in the tank to take care of the RPC traffic.
 
And looking at the I/O side, one could think that the i3 might be able to handle 2–3 times as much RPC traffic as the m5.

Benchmark of RPC traffic vs geth nodes

The benchmarking of RPC calls is mainly a combination of loads generated with wrk or locust, run from the opposite node: when benchmarking the i3 node, the m5 is the load generator, and vice versa.
 
To isolate the RPC load and remove syncing and communication with peers from the equation, I modified the systemd startup script to add --maxpeers 0 --nodiscover --netrestrict <private network cidr notation>. I noticed that this was not always enough, though, as geth sometimes started going through the chain and doing some kind of housekeeping and/or replaying of the chaindata to validate its state (I'm guessing here). I didn't find a good way of stopping those processes, which was a shame, as they took considerable resources.
 
Since I specifically want to test the I/O, I would prefer to hit geth with RPC calls requesting data from different parts of the blockchain's history. I also want to minimize the risk of the requested data already being in the cache, so that geth actually has to go to disk to find it. Without digging through the source code I'm not really sure how LevelDB lays out its data across its thousands upon thousands of files, so I will call a set of different RPC functions that are easy to automate: either the params to the RPC call don't change, or the call varies only by specifying a random block number. Hopefully that will give geth a mix of calls that increases the probability that it will need to go to disk to respond.
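
To make the shape of this request mix concrete, here is a minimal Python sketch of the two kinds of calls (the URL, chain-height constant, and helper function are illustrative assumptions, not the actual wrk/locust scripts used):

import random
import requests

RPC_URL = "http://10.0.0.10:8545"   # hypothetical private address of the geth node under test
LATEST_BLOCK = 5_000_000            # rough chain height assumption; adjust to the node's actual height

def rpc(method, params):
    """Send one JSON-RPC request and return the parsed response."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return requests.post(RPC_URL, json=payload).json()

# Static params: the same data every time, so the answer is likely served from cache.
print(rpc("eth_gasPrice", []))

# Random block number: a cache hit is less likely, so geth more often has to read from disk.
block = hex(random.randint(1, LATEST_BLOCK))
print(rpc("eth_getBlockByNumber", [block, False]))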

Step 1 — One RPC method with static params on the i3 node

First, I inspect the maximum number of requests per second we can get by running wrk against the eth_getBlockByNumber RPC method with a static block number. I use this approach to fine-tune the number of threads and connections for the other tests, and to get an approximate upper limit for the tests that follow.
 
With 2 wrk threads and 32 connections, we get about 31,000 requests/s with minimal latency (99% under 7 ms). With 3 wrk threads and 36 connections, we get about 34,000 requests/s, but latency noticeably degrades (99% under 15 ms).

Step 2 — Check for outliers in RPC method performance with static params on the i3 node

Next, we quickly want to check approximately how our different RPC methods stack up against each other. To do this, we loop over different wrk scripts that call different endpoints with different payloads (everything still static, so the data will be cached). We're not interested in maximum performance for each call, just in whether there are any big differences that are good to know about. For this test, we run with 1 thread and 16 connections.
 
We can see that, in general, with static data our calls can be grouped into two categories:

  • ~12,000 requests/s: eth_getCode
  • ~20,000 requests/s: eth_blockNumber, eth_call, eth_estimateGas, eth_gasPrice, eth_getBalance, eth_getBlockByHash, eth_getBlockByNumber, eth_getBlockTransactionCountByHash, eth_getBlockTransactionCountByNumber, eth_getLogs, eth_getStorageAt, eth_getTransactionByBlockHashAndIndex, eth_getTransactionByHash, eth_getTransactionCount, eth_getTransactionReceipt, eth_getUncleByBlockHashAndIndex, eth_getUncleByBlockNumberAndIndex, eth_getUncleCountByBlockHash, eth_getUncleCountByBlockNumber

Verification of the input data and the responses was not done for all the RPC methods. Some of the input data was simply copied and pasted from https://github.com/ethereum/wiki/wiki/JSON-RPC.

Step 3 — Benchmark each RPC method separately with locust

For a more thorough comparison, we switch over to locust, where it is easier to script the interaction and to choose which RPC methods to call each time. Each RPC method is called either with a random block number or with empty/static params; a minimal sketch of such a script is shown below.
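
As an illustration, a per-method locust task might look like the following (written against the locust 0.x API current at the time; the chain-height constant and task body are assumptions rather than the exact scripts used):

import random
from locust import HttpLocust, TaskSet, task

LATEST_BLOCK = 5_000_000  # rough chain height; an assumption

class GetBlockByNumber(TaskSet):
    @task
    def get_block_by_number(self):
        # A random historical block keeps repeated calls from always hitting the cache.
        block = hex(random.randint(1, LATEST_BLOCK))
        payload = {"jsonrpc": "2.0", "id": 1,
                   "method": "eth_getBlockByNumber", "params": [block, False]}
        # "name" groups all of these requests under one row in the locust statistics.
        self.client.post("/", json=payload, name="eth_getBlockByNumber")

class RpcUser(HttpLocust):
    task_set = GetBlockByNumber
    min_wait = 1  # milliseconds between tasks for each simulated worker
    max_wait = 1

The target host is supplied on the locust command line, so the same script can be pointed at either the i3 or the m5 node.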
 
We start by running locust on the m5 node and benchmarking geth on the i3 node, iterating over the different RPC methods in cycles. We start with a low number of locust workers and increase the count until we hit a ceiling in the number of requests we are able to process; for the i3 node this seems to be about 5,000 workers. We then turn around and hit the m5 node with the same number of workers to compare how the two handle it.

There are bound to be a number of measurement errors hidden in this data, not to mention that the load-generating host may not have been pushed hard enough (I did not try with more locust slave nodes). Although I'm skeptical about some of these numbers (I might have written them down incorrectly), there is at least a tendency for the methods to be either I/O-heavy or CPU-heavy. The general conclusion appears to be that, at least for these RPC methods and this amount of load, the i3 is at least twice as efficient.

Step 4 — Benchmark all RPC methods together with locust

Now we’ll run the locust benchmarks all together as one test suite. We instruct locust to call each RPC method with equal probability, and to have each worker issue calls as quickly as possible (roughly every 1 ms), as sketched below.
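
Extending the Step 3 sketch, the combined suite can be approximated by giving every method its own equally weighted task and a fixed ~1 ms wait (again an illustrative sketch against the locust 0.x API, not the exact script used):

import random
from locust import HttpLocust, TaskSet, task

LATEST_BLOCK = 5_000_000  # assumed chain height

class AllMethods(TaskSet):
    def rpc(self, method, params):
        payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
        self.client.post("/", json=payload, name=method)

    # Equal @task weights give every RPC method the same probability of being picked.
    @task(1)
    def gas_price(self):
        self.rpc("eth_gasPrice", [])

    @task(1)
    def get_block_by_number(self):
        self.rpc("eth_getBlockByNumber", [hex(random.randint(1, LATEST_BLOCK)), False])

    # ...one equally weighted task per remaining RPC method in the suite.

class RpcUser(HttpLocust):
    task_set = AllMethods
    min_wait = 1  # ~1 ms between tasks, i.e. as quickly as possible
    max_wait = 1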
 
We let these tests run for around 30 minutes with the same number of workers as in the previous step/tests (5000).

The I/O load on the i3 node seems promising. The CPU load is lower as well, although of course its lower bound will always be related to the slowest call in our collection of RPC methods. 
 
We continue with a couple more experiments using the full locust test suite. This time we increase the number of workers until we reach approximately 1,000 ms of latency. We take the measurements after approximately 10 minutes to give the host a chance to warm up properly.
 
We start by testing the m5 node:

One problem I noticed here is the high median latency, which also happens to swing across a very broad range. This is mainly because of the extremely high latency that eth_estimateGas sometimes exhibited. I'm not exactly sure why or when, but at times it really skewed the benchmarking, as its latency was measured in tens of seconds.
 
The volatility is due to random chance in the short run: the RPC methods can roughly be grouped into one category with high latency and one with low latency, and the methods we are testing are split about 50/50 between these:

  • High latency — eth_getTransactionCount, eth_getStorageAt, eth_getCode, eth_getBalance, eth_estimateGas
  • Low latency — eth_gasPrice, eth_getBlockTransactionCountByNumber, eth_getBlockByNumber, eth_getUncleCountByBlockNumber

Whichever group happens to have returned the most requests at a given moment will dominate the median.
 
Then we compare against the i3 node, still with 10-minute locust runs.

We get drastically lower I/O utilization (~80% lower) than with the m5, approximately 50% higher RPS for the same load, and CPU utilization that is about 40% lower.
 
This confirms the initial hypothesis that one i3 node should be able to handle at least as much load as two m5 nodes. Perhaps a bit more.

Benchmark of RPC traffic and syncing together

Now for the final benchmark. We start geth in normal mode so it begins syncing. Once we've confirmed the node is syncing, we start our locust test suite from before and iteratively increase the number of workers until we hit about 1,000 ms of latency. Tests run for about 10 minutes each.

First we test the i3 node, with eth_estimateGas completely removed from the test suite this time, as it was wreaking too much havoc on the results.

We can compare this to the results from the m5 node. We run these tests without eth_estimateGas as well:

We first notice again that the median latency is skewed by the random chance of which RPC methods we happen to be testing. The average latency gives a better idea of where a large portion of the latencies sit. For example, we were over the 1,000 ms threshold with 500 workers on the m5 node, but we didn't reach that threshold until about 750 workers on the i3 node.
 
We can further note that we get about 75% more throughput from the i3 node at lower latency, and with a lower impact on hardware resources. These findings further strengthen my thesis that an i3.xlarge node could, in general, handle the same load as two m5.xlarge nodes.

Cost questions

Cost difference between an i3 w/ 10 GB root + local SSD vs m5 w/ 100 GB root + 1 TB chaindata volume (archive nodes)

We could try to do detailed calculations, but I fear that would be somewhat unreliable, as benchmarks are noisy by nature (both these and in general, in my opinion). Instead I suggest looking at it more broadly (on-demand prices, US East):

  • An m5.xlarge costs $0.192/h
  • An i3.xlarge costs $0.312/h

If we accept that one i3 ~= two m5 then we will save $0.072/h = ~$50/month per two m5 nodes we replace.
 
An EBS-backed gp2 volume costs $0.10/GB per month. The change in EBS storage for each pair of m5 nodes we switch to an i3 is 10 − (2 × 1,100) = −2,190 GB. Per month, we would therefore save 2,190 × $0.10 = $219 per two nodes we replace with an i3.

In total this translates to a savings of $50 + $219 = $269/month per two m5.xlarge nodes we replace with one i3.xlarge.

Cost difference between an i3 w/ 10 GB root + local SSD vs an m5 / c5 w/ 250 GB root

In this case we would still have the savings of $50/month per two m5 nodes we replace. 
 
The EBS delta would be smaller, though: 10 − (2 × 250) = −490 GB. Per month, we would save 490 × $0.10 ≈ $50 on EBS storage.
 
In total that would translate to savings of ~$100/month per two m5 nodes we replace with one i3 node.
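
A quick back-of-the-envelope check of both scenarios, assuming roughly 730 instance-hours per month (the exact figures land within a few dollars of the rounded numbers above):

M5_HOURLY = 0.192        # $/h, on-demand US East
I3_HOURLY = 0.312        # $/h, on-demand US East
GP2_PER_GB_MONTH = 0.10  # $/GB-month, EBS gp2
HOURS_PER_MONTH = 730    # assumption used for the $/month conversion

def monthly_savings(ebs_gb_per_m5, ebs_gb_per_i3=10):
    """Savings per pair of m5.xlarge nodes replaced by one i3.xlarge."""
    instance = (2 * M5_HOURLY - I3_HOURLY) * HOURS_PER_MONTH        # ~$52/month
    ebs = (2 * ebs_gb_per_m5 - ebs_gb_per_i3) * GP2_PER_GB_MONTH
    return instance + ebs

print(monthly_savings(1_100))  # archive nodes (100 GB root + 1 TB chaindata): ~$271/month
print(monthly_savings(250))    # 250 GB root: ~$101/month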

What happens when the chain grows beyond the i3.xlarge’s local disk size?

The i3.xlarge has a 950 GB local disk. When the full archive chaindata grows larger than this, we would have to migrate to the i3.2xlarge, which has 1,900 GB of local storage.
 
The i3.2xlarge comes at double the price: $0.624/hour, or ~$450/month. Compared to the m5.xlarge, which today costs us ~$140/month, this is significant. Going by instance cost alone, it would barely be worth it even if one i3.2xlarge could decommission three m5.xlarge nodes ($450 − 3 × $140 = $30/month more expensive).
 
But we also have to remember the EBS savings from switching to local storage. As shown previously, that removes about $220/month per pair of m5 nodes we retire at today's data sizes. So the cost difference works out to $450 − 2 × $140 − $220 = −$50.
 
We would still save $50/month per two m5 nodes we replace with i3.2xlarge.

Additional tests to compare vs current footprint

The previous batch of JSON-RPC methods we tested didn't include some of the most prominent RPC methods in our current traffic. Specifically, those methods are:

  • estimategas
  • ethcall
  • getbalance
  • getlogs
  • gettxcount
  • gettxhash
  • sendrawtx

Of these, estimategas, getbalance, and gettxcount have been tested previously, which leaves a list of 4 methods that we're missing. I'm not sure how best to test sendRawTransaction, so that leaves us 3 additional methods to test:

  • eth_call
  • eth_getLogs
  • eth_getTransactionByHash

We bench these with locust as before, one at a time. Blockchain syncing is enabled, and we are running the geth-linux-amd64-dc78cb61 binary from our own repo (1.7.3 with the latest patch).
 
We ramp up the load for each test until we seem to stall out on RPS and/or the latency gets completely out of hand. At the same time we keep an eye on CPU and I/O utilization via New Relic to see how saturated the resources are. We run the benchmarks for 5 minutes each time and take note of the impact. We record the numbers when we reach max RPS and compare them in a similar way against the other instance type.
 
As usual, both locust and geth sometimes need restarting (and caches dropping as well), for example due to the high latency of some calls, which causes the processes to keep a large number of connections alive that interfere with the test once they are no longer needed.
 
A note about how these three RPC methods are tested: previous tests mostly picked a random block within the chain as the parameter (with some static input data here and there). In these extra benchmarks, we test as follows (a minimal payload sketch follows the list):

  • eth_call — calling the balanceOf function for the Golem token contract for an address that does have a balance in some blocks, and no balance in others. Each call to eth_call will ask for the balance in a randomly picked block
  • eth_getTransactionByHash  — when building the locust Docker container we download all transactions that went into the EOS crowdsale contract during a ~48 day / ~280 k block period. Each call to eth_getTransactionByHash will randomly pick one of these txs as input param.
  • eth_getLogs — experimented with different loads, as I'm not sure how people generally use it, but in the end standardized on:
    • one test calls the method with the EOS crowdsale contract address and asks for any associated logs in the latest block
    • another test calls the method with the same EOS crowdsale contract address and asks for the logs within a ~28 k block interval
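
To make the shapes of these requests concrete, the payloads look roughly like the following sketch (the contract, holder, and transaction-hash constants are placeholders to be filled in with the Golem token and EOS crowdsale data described above):

import random

LATEST_BLOCK = 5_000_000   # assumed chain height
GOLEM_TOKEN = "0x..."      # Golem token contract address (placeholder)
EOS_CROWDSALE = "0x..."    # EOS crowdsale contract address (placeholder)
HOLDER = "0x..."           # an address holding a balance in some blocks (placeholder)
EOS_TX_HASHES = ["0x..."]  # the pre-downloaded EOS crowdsale transaction hashes (placeholder)

def eth_call_balance_of():
    # balanceOf(address) has selector 0x70a08231; the address argument is left-padded to 32 bytes.
    data = "0x70a08231" + HOLDER[2:].rjust(64, "0")
    block = hex(random.randint(1, LATEST_BLOCK))
    return {"jsonrpc": "2.0", "id": 1, "method": "eth_call",
            "params": [{"to": GOLEM_TOKEN, "data": data}, block]}

def eth_get_transaction_by_hash():
    return {"jsonrpc": "2.0", "id": 1, "method": "eth_getTransactionByHash",
            "params": [random.choice(EOS_TX_HASHES)]}

def eth_get_logs_latest_block():
    return {"jsonrpc": "2.0", "id": 1, "method": "eth_getLogs",
            "params": [{"address": EOS_CROWDSALE, "fromBlock": "latest", "toBlock": "latest"}]}

def eth_get_logs_block_range(interval=28_000):
    return {"jsonrpc": "2.0", "id": 1, "method": "eth_getLogs",
            "params": [{"address": EOS_CROWDSALE,
                        "fromBlock": hex(LATEST_BLOCK - interval),
                        "toBlock": hex(LATEST_BLOCK)}]}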

With regard to eth_getLogs, I might also add that I found it peculiar that blockchain syncing had a tendency to time out and stop working after encountering a burst of calls for larger log intervals. I'm not sure whether geth sorts this out by itself in the long run, but I usually ended up restarting the process in these situations.

To this table we can also add some of the previous tests from Step 3 above, which included some of our other target groups. Note, though, that those tests were run without any syncing going on.

Even though one set is with syncing and the other without, the general tendency still seems to be that in these tests the i3 gives about 50–100% more RPS at approximately half the latency. This all comes with a hardware resource impact that is less than or equal to the impact on the m5. (More specifically, I/O hovers between the same impact and 2–3 times more efficient, whereas CPU is about the same or ~30% more efficient, especially when geth is busy syncing in the background.)

Conclusion

Based on the data and results from our testing, we conclude that, in the current state of the Ethereum network, the i3 instance type should allow us to achieve better performance and lower cost with a smaller overall footprint than the m5 instance types. We anticipate further improvements in the geth 1.8.x client and have already begun to see them in the latest releases. We are encouraged by the attention the client development teams are placing on reducing the infrastructure demands of running full nodes at scale.

Further Discussion

For further performance gains (relevant to both RPC and geth sync performance), the local SSD device could be partitioned to leave about 10% of its space unused, and/or TRIM could be enabled and run periodically. Both of these help the SSD controller find available space when it writes data. These factors have not been taken into account in this evaluation, though.

Special thanks to the MyCrypto.com team for their collaboration on the ideas and methodologies implemented in this effort.

Thanks to Eleazar Galano and Mike Godsey.