Go Down the Stack Young Man – Story of a Bug
August 8, 2012
Last week, I released a big improvement to the responsiveness of the Armitage team server on congested networks. This particular case of poor responsiveness was extremely difficult to reproduce. Despite repeated attempts to optimize around the real cause, I failed time and again to nail it. I thought I had solved the problem, until I received a “bug report” indicating otherwise. I’m certain I figured it out this time. Here’s the back story:
Armitage is a collaborative hacking tool built on the Metasploit Framework. The collaborative piece is made possible by a team server. This server acts as a proxy between the remote Armitage clients and the one Metasploit Framework server. Through this proxy, I’m able to deconflict multiple clients interacting with a session and offer additional APIs to my clients.
As long as I’ve had a team server, I’ve noticed that clients connecting from Windows 7 always felt slow *. As an experiment, I opted to disable Nagle’s algorithm. Nagle’s algorithm is built into most TCP stacks. It reduces network congestion by holding onto small packets and attempting to combine them into one larger packet. For protocols that generate small packets naturally (e.g., telnet), Nagle’s algorithm may add unnecessary latency. Most socket APIs include a means to disable it.
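For readers who have not touched this option before, here is a minimal sketch of what disabling Nagle’s algorithm looks like in Java. It is not Armitage’s code: the class name `NoDelayExample`, the helper `connectWithoutNagle`, and the loopback listener are made up for illustration. The only real API involved is `Socket.setTcpNoDelay`, which sets the TCP_NODELAY option.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelayExample {
    /**
     * Hypothetical helper: opens a client socket and turns Nagle's
     * algorithm off by enabling the TCP_NODELAY option.
     */
    static Socket connectWithoutNagle(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setTcpNoDelay(true); // true = small writes are sent without waiting to be combined
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // A throwaway loopback listener on an ephemeral port, used only so
        // the example runs on its own. A real client would connect to the
        // team server's address and port instead.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectWithoutNagle(
                     server.getInetAddress().getHostAddress(), server.getLocalPort())) {
            System.out.println("TCP_NODELAY enabled: " + client.getTcpNoDelay());
        }
    }
}
```

Passing `false` to `setTcpNoDelay` turns Nagle’s algorithm back on, which is the default for a new socket.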
Disabling Nagle’s algorithm resulted in a big responsiveness boost on Windows 7. Linux and Mac OS X clients connected to a team server were snappier too. My local unit tests also completed two minutes faster with Nagle’s algorithm disabled.
I was pleased with this change until I received a “report” three weeks ago. Some folks were using Armitage during an exercise and… apparently the collaboration piece was very slow for them.
I was about ready to tear my hair out when I received this report. Performance is the one thing I put the most time into. I went back and forth with the user to understand his configuration and environment. He had everything set up as I would have requested.
This report especially frustrated me because of the amount of testing I do. About once per quarter, I connect 12 Armitage clients to a node on Amazon’s Elastic Compute Cloud. I then populate the database with about 5,000 hosts’ worth of data. From this point, I carry out a simulated external engagement against my local test lab.
Here’s a screencast demonstrating this particular test:
So, what could the problem be?
I opted to do what I should have done a long time ago: I ran tcpdump to better understand how the team server looked on the network.
tcpdump -i eth2 | grep 55553 | grep -v "length 0"
With this running, I noticed that Armitage placed many small packets on the network, about 20-24 bytes consistently. I expected this because I had disabled Nagle’s algorithm. Anything small would go out immediately.
15:09:19.132609 IP 192.168.95.241.60153 > 192.168.95.241.55553: Flags [P.], seq 7961:7986, ack 5943, win 770, options [nop,nop,TS val 15794750 ecr 15794750], length 25
15:09:19.132714 IP 192.168.95.241.60153 > 192.168.95.241.55553: Flags [P.], seq 7986:8008, ack 5943, win 770, options [nop,nop,TS val 15794750 ecr 15794750], length 22
I then enabled Nagle’s algorithm and watched the same traffic dump. The result? All of the packets were the same size as before. With Nagle’s enabled, I was paying the penalty of having small packets with the additional latency of Nagle’s holding on to them. Great.
I scratched my head and decided to dig deeper into my code. None of this seemed right. As I dug through, I learned that there was no buffer between my SSL code and the code that serializes an object and writes it to a socket. The team server was serializing Java objects and writing them to the socket one byte at a time, rather than sending them as one byte buffer.
I updated my code to write serialized objects to a buffer before sending them to a socket. This reduced the number of packets by a factor of 10-20. I also re-enabled Nagle’s algorithm.
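To make the idea concrete, here is a rough sketch of the difference between serializing straight onto a stream and serializing into a buffer first. This is an illustration under assumptions, not the team server’s actual code: `BufferedSendExample`, `CountingStream`, `sendDirect`, and `sendBuffered` are hypothetical names, and `CountingStream` merely stands in for the SSL socket stream by counting how many separate writes reach it. How many packets each write becomes on a real connection depends on the SSL layer and the TCP stack.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.ArrayList;

public class BufferedSendExample {
    /** Hypothetical stand-in for a socket stream: counts write calls and bytes. */
    static class CountingStream extends OutputStream {
        int writes = 0;
        int bytes = 0;

        @Override
        public void write(int b) {
            writes++;
            bytes++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writes++;
            bytes += len;
        }
    }

    /** Serializes directly onto the stream; the serializer decides when to write. */
    static void sendDirect(OutputStream out, Serializable message) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(out);
        oos.writeObject(message);
        oos.flush();
    }

    /** Serializes into memory first, then hands the stream one contiguous byte array. */
    static void sendBuffered(OutputStream out, Serializable message) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(message);
        }
        out.write(buffer.toByteArray()); // a single write call for the whole object
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Made-up message contents, just to have something to serialize.
        ArrayList<String> message = new ArrayList<>();
        message.add("session 1");
        message.add("hello from a team server client");

        CountingStream direct = new CountingStream();
        sendDirect(direct, message);

        CountingStream buffered = new CountingStream();
        sendBuffered(buffered, message);

        System.out.println("direct:   " + direct.writes + " writes, " + direct.bytes + " bytes");
        System.out.println("buffered: " + buffered.writes + " writes, " + buffered.bytes + " bytes");
    }
}
```

Wrapping the socket stream in a `java.io.BufferedOutputStream` and flushing once per object is another way to get a similar effect. The in-memory buffer is used here only because it makes the "one object, one write" behavior easy to see.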
14:34:05.179461 IP 192.168.95.241.52289 > 192.168.95.241.55553: Flags [P.], seq 978092:978547, ack 804839, win 770, options [nop,nop,TS val 15266262 ecr 15266137], length 455
14:34:05.180174 IP 192.168.95.241.55553 > 192.168.95.241.52289: Flags [P.], seq 804839:805218, ack 978547, win 770, options [nop,nop,TS val 15266262 ecr 15266262], length 379
At this point, I tested on Windows 7 and noticed performance was good to go. I also ran my unit tests and noticed no performance change.
Here’s likely what happened. I play in a lot of exercises with Armitage. Exercise networks are usually congested. There’s a lot of activity happening. All of the team clients flooding the network with small packets probably made the congestion much worse.
I’m embarrassed that this problem slipped under my radar, but I’m happy that it’s finally fixed.
Lesson learned: when it comes to performance, I can’t treat the network as an invisible abstraction that delivers my data. I have to give my interaction with the network as much attention as I give to optimizing my software.
* Note: Armitage clients used to connect to both the Metasploit Framework and a team server. Only packets sent to the team server were victim to this problem. In May 2012, I changed Armitage’s collaboration setup to proxy everything through the team server. This made the problem noticeable and forced me to start looking at it. This is when I made the change to disable Nagle’s algorithm.