Thursday, December 06, 2012

CloudOpt: Unparalleled scaling for WAN Optimization.

For the next release of CloudOpt's cloudoptimizer, 1.3.0, the focus was on improving system scalability: supporting larger caches with less memory, more connections, more CPUs and faster WAN connections. First I investigate how well cache size scales against system resources, and then I look at cloudoptimizer throughput.

Cache Size

With previous versions of cloudoptimizer the throughput of the system dropped as the cache got bigger, which limited the practical size of the cache. The rate at which throughput dropped was also a function of the hardware's I/O sub-system. Re-working the cache index broke the link between throughput and cache size: throughput is now independent of cache size (O(1) in big O terms). As well as allowing bigger caches, the system now uses less memory.

To test the new cache index I used two large EC2 instances running Ubuntu 12.04. The instances were EBS-Optimized, which provides extra network bandwidth for EBS traffic. A large instance has the equivalent of two 2GHz cores and 7.5GB of RAM.

To each instance I attached two standard 1TB EBS volumes. The volumes were combined into a single 2TB logical volume using LVM, and an XFS filesystem was created on the logical volume.

As a data source I used an Apache Web server serving 2TB of 500GB files filled with random binary data, which makes the files incompressible. For a client I used wget to retrieve the files from the Web server.

First-pass throughput averaged around 200Mbps with a single client; with two clients it averaged around 300Mbps, peaking at 400Mbps. Since these are virtual machines running on shared hardware and a shared network, fluctuations are to be expected. Average throughput remained constant as the cache grew to 1.6TB.

On a large instance cloudoptimizer uses 4GB of RAM to support a 1.6TB cache, which is the recommended cache size limit for a large instance. Memory usage scales linearly with cache size in both directions: an 800GB cache uses 2GB of RAM, a 400GB cache uses 1GB, and a 3.2TB cache uses 8GB.

The contents of the cache are compressed, so a 1.6TB cache can hold up to 3.2TB of raw data. Version 1.3.0 also introduces a "shared cache": data is stored in the cache only once, even if several cloudoptimizer peers have sent the same data through the system. This increases the disk efficiency of the cache.

System Throughput

To improve the throughput of the system a number of changes were made to cloudoptimizer, including removing thread-contention hotspots, optimizing use of the TCP stack and reducing memory usage.
 
To measure system throughput I used two laptops on a Gigabit network. The first laptop ran an Apache Web server and a cloudoptimizer; the second ran a cloudoptimizer and wget to act as a client. The focus is on how fast the CPU on the first laptop can process data during the first pass, which is the most computationally intensive and therefore rate-limiting step. I used wget on the second laptop to retrieve a 250MB file of random binary data from the first laptop.

The CPU in the first laptop is an Intel P6100, with two cores running at 2GHz and 3MB of cache; the laptop also has 4GB of memory and a 500GB hard disk. The P6100 does not have the Intel AES-NI instructions. cloudoptimizer is designed to automatically use these instructions when they are available, and you can expect a significant performance improvement if your CPU supports them (nearly all modern CPUs do).
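As a rough illustration of how an application can check for AES-NI at run time, here is a generic sketch using GCC's cpuid helpers (this is not CloudOpt's actual detection code):

    /* Generic sketch: detect the AES-NI instruction set at run time
     * using GCC's <cpuid.h> helpers (CPUID leaf 1, ECX bit 25). */
    #include <cpuid.h>
    #include <stdio.h>

    static int has_aes_ni(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;                      /* CPUID leaf 1 not available */
        return (ecx & bit_AES) != 0;
    }

    int main(void)
    {
        printf("AES-NI is %s\n", has_aes_ni() ? "available" : "not available");
        return 0;
    }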

For the first test I used the default cloudoptimizer configuration: the connection between the two cloudoptimizers is encrypted with SSL, the data sent between them is compressed, de-duplication is applied and the data stored in the caches is also compressed. The throughput was 314Mbps.

I turned off encryption and throughput jumped to 442Mbps. Many WAN optimization products only encrypt data that arrived encrypted; data that was originally unencrypted is sent unencrypted. Many people also use a VPN to encrypt their traffic, in which case turning off encryption in cloudoptimizer makes sense.

Next I turned off cache compression, leaving the data between the cloudoptimizers compressed and de-duplicated; throughput jumped to 550Mbps.

As part of the throughput work a new chunking algorithm for de-duplication, the "fast-chunker", was developed. Unfortunately it is not compatible with the previous version, so it is not configured as the default. The fast-chunker should be used for new installations and is particularly useful for networks with 1Gbps or 10Gbps WAN connections.
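To give a flavour of what a chunker does (a generic sketch of content-defined chunking, not CloudOpt's fast-chunker): a hash is updated byte by byte, and a chunk boundary is declared whenever the low bits of the hash match a fixed pattern, so identical runs of data split into identical chunks wherever they appear in the stream.

    /* Generic sketch of content-defined chunking for de-duplication.
     * A toy byte-wise hash is updated as the buffer is scanned; when its
     * low 13 bits are all ones a chunk boundary is declared, giving an
     * average chunk size of roughly 8KB. */
    #include <stddef.h>
    #include <stdint.h>

    #define BOUNDARY_MASK 0x1FFFu   /* 13 bits -> ~8KB average chunks */

    /* Return the length of the first chunk found in buf[0..len). */
    size_t next_chunk(const uint8_t *buf, size_t len)
    {
        uint32_t hash = 0;
        for (size_t i = 0; i < len; i++) {
            hash = (hash << 1) + buf[i];             /* toy hash over recent bytes */
            if ((hash & BOUNDARY_MASK) == BOUNDARY_MASK)
                return i + 1;                        /* boundary after this byte */
        }
        return len;                                  /* no boundary found */
    }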

Using the fast-chunker for de-duplication and still compressing the data stream, I got 705Mbps.

The last step was to turn off compression of the data stream, so that only de-duplication is applied to the data; this lifted throughput to 880Mbps.

The final test measured the throughput of the system without cloudoptimizer in place, i.e. a plain HTTP transfer; in this scenario the throughput was 896Mbps. This means that, doing just de-duplication, a 2GHz core can run at very nearly the native speed of a Gigabit network.

Improving throughput is important in two cases: reducing the amount of resources required to optimize a WAN connection, and optimizing fast 1Gbps and 10Gbps WAN connections.

Conclusion

The new version of cloudoptimizer offers unparalleled scalability in terms of cache size and throughput. A system with 7.5GB of RAM can support a 1.6TB cache. A 2GHz core can sustain between 300Mbps and 880Mbps of first-pass throughput depending on configuration, and a modern 3GHz+ core with the AES-NI instruction set will support significantly higher throughput.

Wednesday, June 06, 2012

Getting data into and out of clouds faster....

Getting data into and out of Clouds faster is now what I do at a company called CloudOpt.

How do we do it? We intercept traffic going to the Cloud, whether it is sent by HTTP, scp, FTP, CIFS, etc., and apply data de-duplication, compression and some fancy protocol optimizations. The results are up to 60% data reduction on the first transfer, and 99% on the second. This means that if a file is transferred to the cloud using scp and then downloaded using HTTP, the HTTP transfer will only transfer 1% of the bytes!

Bill, one of the guys I work with, has done a nice write-up on creating a RightScale Server Template using Chef on the CloudOpt Blog.

Thursday, March 04, 2010

SIGPIPE

I was changing a send call to a writev and ran into an annoying problem. On the send call I had set the MSG_NOSIGNAL flag to stop SIGPIPE being raised if the socket I was writing to had been closed; unfortunately there is no way to set the same flag on writev! (I had originally switched from write to send precisely to be able to set the flag.) On some systems (BSD) you can set the SO_NOSIGPIPE socket option to get a similar effect, but Linux does not appear to support it. The other option is to ignore SIGPIPE, but there are issues with doing this naively, see Suppressing SIGPIPE in a library.
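The approach described in that article is, roughly: block SIGPIPE for the calling thread, do the write, consume any SIGPIPE the write generated while it is still pending, then restore the old signal mask. A minimal sketch of that idea wrapped around writev (the function name and the abbreviated error handling are mine):

    /* Sketch: suppress SIGPIPE around writev() on Linux by blocking it
     * for this thread and consuming any SIGPIPE the write raises.
     * A production version would also check whether SIGPIPE was already
     * pending before the write (see the linked article). */
    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <sys/uio.h>

    ssize_t writev_nosigpipe(int fd, const struct iovec *iov, int iovcnt)
    {
        sigset_t pipe_set, pending, old_set;
        ssize_t n;

        sigemptyset(&pipe_set);
        sigaddset(&pipe_set, SIGPIPE);
        pthread_sigmask(SIG_BLOCK, &pipe_set, &old_set);   /* hold SIGPIPE back */

        n = writev(fd, iov, iovcnt);

        if (n == -1 && errno == EPIPE) {
            sigpending(&pending);
            if (sigismember(&pending, SIGPIPE)) {
                int sig;
                sigwait(&pipe_set, &sig);   /* eat the pending SIGPIPE */
            }
            errno = EPIPE;                  /* restore errno in case the calls above changed it */
        }

        pthread_sigmask(SIG_SETMASK, &old_set, NULL);       /* restore old mask */
        return n;
    }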

Wednesday, March 03, 2010

iphones and people

I have noticed the number of iphones increasing among the people I know. What is particularly interesting is that a lot of non-geeks now have iphones, because the iphone is now available on cheaper mobile plans; the geeks were prepared to pay a premium to get the iphone early. I know of one person who now reads their e-mail because they have an iphone; previously they were too busy to sit down at a computer. This means the iphone is actually increasing the number of people who actively use the internet and the WWW!

Wednesday, February 24, 2010

ExtraHop

Some guys I worked with at F5 Networks created a start-up called ExtraHop. It was clear that whatever these guys did was going to be impressive and run very fast - they did not disappoint. They built an appliance to monitor your network: hang it off your network, spend 15 minutes configuring it, and it will learn all about your network and tell you when things are not running smoothly and why. It works at layer 7, so it knows when a database is not running properly, when a CIFS server is misbehaving, or when an HTTP server's transactions per second have dropped. The really amazing thing - it runs at 10GbE! Years of talking to BIG-IP customers were not wasted. They also have a cool service where you can upload your pcaps and their magic software will analyze them for you!

There are also some videos of their stuff in action.

Tuesday, February 23, 2010

A History of malloc

I was reading up on malloc. I had naively assumed malloc was a system call; it's not. Under the covers malloc uses brk or sbrk to request memory from the kernel. However, malloc does not always have to use brk/sbrk: if its "free list" contains memory that can fulfil the request, no system call is needed. So you do not always pay the price of a system call when you use malloc.

Another interesting thing about malloc is that it cannot return memory to the kernel unless the memory that is freed is at the top of the heap. If you malloc some memory at the start of your program, malloc some more later and then free the original memory, malloc/free cannot return the original chunk of memory to the kernel until the second piece of memory is freed.
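A small demonstration of both points (a sketch; the exact behaviour depends on the malloc implementation - glibc, for example, services large requests with mmap instead of brk and may trim the heap on the final free):

    /* Sketch: watch the program break move as malloc grows the heap.
     * With glibc defaults these small requests come from the heap via
     * brk rather than mmap. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char *start = sbrk(0);              /* current top of the heap */

        char *a = malloc(64 * 1024);
        char *b = malloc(64 * 1024);
        printf("after two mallocs the break grew by %ld bytes\n",
               (char *)sbrk(0) - start);

        free(a);                            /* a is not at the top of the heap... */
        printf("after free(a) the break has moved by %ld bytes\n",
               (char *)sbrk(0) - start);    /* ...so the break typically stays put */

        free(b);                            /* the freed space is now contiguous at the top */
        return 0;
    }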

The first malloc was written in "The Old Testament" - K&R - and is about 200 lines of code. The free list is managed by storing the bookkeeping in a union overlaid on the freed memory itself - this saved space, an important requirement when the amount of memory available was very limited.
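From memory, the K&R block header looks roughly like this; the union forces alignment, the struct holds the free-list pointer and block size, and the whole thing lives inside the block it describes:

    /* The K&R malloc block header, roughly as it appears in the book:
     * the bookkeeping lives in the block itself, so a free block costs
     * no extra space beyond its header. */
    typedef long Align;              /* forces alignment to the strictest type */

    union header {
        struct {
            union header *ptr;       /* next block on the free list */
            unsigned size;           /* size of this block, in header units */
        } s;
        Align x;                     /* never used; only forces alignment */
    };

    typedef union header Header;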

Poul-Henning Kamp re-wrote malloc for FreeBSD 2.2 and documented it in Malloc(3) Revisited; this malloc is known as phkmalloc. By this time systems were using virtual memory, which meant that with the K&R approach a chunk of memory on the free list could be paged out to disk. Because the free list was embedded in those chunks, when malloc went looking for memory on the free list it had to page it all back in, killing performance! Kamp's version of malloc was 1136 lines of code and had a good reputation for performance.

Then came fast multi-processor machines with large memories, and another re-write of malloc. Jason Evans re-wrote malloc for FreeBSD; his version is known as jemalloc, and he wrote about it in A Scalable Concurrent Malloc(3) Implementation for FreeBSD. Now the issues are less about paging to disk and more about fast locking and NUMA (trying to allocate memory close to the CPU that will be using it). Firefox is attempting to use jemalloc internally for its memory management.

There are quite a few malloc implementations out there; Google has one called tcmalloc in its perftools bundle. It is very easy to swap the malloc your code uses: all you have to do is link against the library that provides the new malloc. Though this can sometimes lead to trouble :-)

Friday, February 19, 2010

Ubuntu + D-Link + linkedin == Trouble

Hit this bug yesterday when trying to access Linkedin. Weirdly, nearly all the other web sites I visit don't exhibit this problem! I went with the "set MTU to 1360" hack. If I have time I will look into this some more - network bugs can be very weird...