Optimizing EC2 Network Throughput On Multi-Core Servers
To scale and maintain these systems, our team works day in and day out to improve the latency and throughput of our software on the EC2 (Elastic Compute Cloud) servers. Here we’ll look at a time we experienced elevated MemcacheD latency, and the techniques we learned to scale networking performance in the cloud.
MemcacheD latency increases 200%
A few months ago, we noticed that several machines in our fleet were struggling under load. This was most notable on the memcache servers that we use to provide quick access to data from our web and API servers. During periods of high traffic, we would see elevated latency and increased timeouts on machines that were clients of our memcache servers. While the hosts were busy, the 50Mbps, 50,000 incoming packets per second transfer rate was well below the theoretical maximum on a gigabit network.
Since the network traffic was okay, our assumption was that the hosts were probably CPU constrained and memcache was simply unable to process additional requests. However a quick review of our Ganglia data showed that there was plenty of idle CPU on the quad-core machines that we use for our memcache servers.
What was surprising was the percentage of CPU cycles going to system time. Because we run on multi-core servers, we know that aggregate CPU statistics sometimes obscure the picture of what’s really going on.
To get a more detailed picture of what was happening, we ran mpstat to get a breakdown by each CPU core.
01:04:26 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
01:04:28 AM all 5.89 0.00 14.39 0.00 0.00 6.21 0.25 0.00 73.26
01:04:28 AM 0 17.00 0.00 45.50 0.00 0.00 37.50 0.00 0.00 0.00
01:04:28 AM 1 3.53 0.00 7.65 0.00 0.00 0.00 0.29 0.00 88.53
01:04:28 AM 2 4.64 0.00 8.12 0.00 0.00 0.29 0.29 0.00 86.67
01:04:28 AM 3 2.96 0.00 9.17 0.00 0.00 0.00 0.30 0.00 87.57
As you can see from the second line of statistics, core 0 is 100% utilized while the other cores are mostly idle.
Why was the single core being hammered?
The clue lies in the %soft column which shows the percentage of time being spent on software interrupts.
This leads to one of the compromises you make when running on virtualized hardware. On a dedicated physical server, you get benefits from hardware offloading of some operations (such as dealing with network traffic). However, on a paravirtualized Xen EC2 host network, traffic is managed by the hypervisor and packet data is mapped in and out of the user domain (a good explanation can be found here), meaning that the CPU has to do a significant amount of work for each packet the server receives.
We run Ubuntu Linux and by default all of these network interrupts are handled by a single core. So while we had plenty of idle CPU in aggregate, the one CPU handling the network interrupts was saturated.
How we fixed it
To see what was taking the most number of interrupt calls, we ran a simple watch command on /proc/interrupts. This lead us to see that eth0 was bound to one CPU in the system and that core was taking a large number of interrupts per second.
[root@mc:~ (master)]# cat /proc/interruptsCPU0 CPU1 CPU2 CPU3
275: 194352216 0 0 0 xen-dyn-event eth0
276: 417 12334089 0 0 xen-dyn-event blkif
Since we’re on EC2, we don’t have the ability to use multi-queue networking to split up the NIC buffers and bind them to different CPU cores. Looking at solutions to this problem, we came across Receive Packet Steering (RPS). RPS distributes the load of processing received packets and spreads them across multiple CPUs in software. We came across some blog posts about enabling RPS on linux with a simple command such as
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus. We decided to experiment with this change on a few of our caching servers to see if it helped unburden the load on the system. We could then look at the softirq’s to verify that the load was being balanced across all CPU cores.
[root@mc:~ (master)]# cat /proc/softirqsCPU0 CPU1 CPU2 CPU3 HI: 0 0 0 0 NET_TX: 3366377382 2498367535 2461198847 2413070765 NET_RX: 1114954147 1776433298 2034092667 1964220116
We then looked at CPU0 percent utilization, and it dropped from 47% system to utilization to around 10% utilization. System CPU utilization rose on the remaining cores to show more evenly balanced utilization. This increased the packets per second processing speed of the NIC.
We finalized the change, and pushed out the fix to the rest of our memcached server fleet. Since this is a dynamic kernel setting, we didn’t need to experience any downtime to enact the change. Soon after all servers in the caching pool converged on the new configuration change, the latency event abated. We then added this configuration option as a standard on all of our instances to help scale our network performance on EC2.
After dealing with this issue, we now look at packets per second load on memcached servers on EC2 as a factor in our capacity planning.
As a company that’s built its business in the cloud, it’s important we continue to optimize for efficiency so we can give pinners a fast experience!
Jeremy Carroll is an engineer and works on Site Reliability.