Thursday, December 06, 2012

CloudOpt: Unparalleled scaling for WAN Optimization.

For the next release of CloudOpt's cloudoptimizer, 1.3.0, the focus was on improving system scalability: supporting larger caches with less memory, more connections, more CPUs, and faster WAN connections. First I investigate how well cache size scales against system resources, and then I look at cloudoptimizer throughput.

Cache Size

With previous versions of cloudoptimizer the throughput of the system dropped as the cache grew, which limited the practical cache size. The rate at which throughput dropped was also a function of the IO sub-system of the hardware. Reworking the cache index broke the link between throughput and cache size: throughput is now independent of cache size (O(1) in big-O notation). As well as allowing bigger caches, the system now uses less memory.

To test the new cache index of cloudoptimizer I used two large EC2 instances running Ubuntu 12.04. I used EBS-Optimized instances as they have extra network bandwidth. A large instance has the equivalent of two 2GHz cores and 7.5GB of RAM.

To each instance I attached two standard 1TB EBS volumes. The volumes were combined into a single 2TB logical volume using LVM, and an XFS filesystem was created on the logical volume.
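
For anyone wanting to reproduce this layout, the sketch below just shells out to the standard LVM and XFS tools from Python. The device names, volume-group name and mount point are placeholders for illustration, not the exact names used in the test; substitute whatever your instance reports (e.g. via lsblk).

    import subprocess

    # Assumed device names for the two attached 1TB EBS volumes.
    DEVICES = ["/dev/xvdf", "/dev/xvdg"]
    VG, LV, MOUNT = "cachevg", "cachelv", "/mnt/cache"   # arbitrary names

    def run(*cmd):
        # Run a command and fail loudly on a non-zero exit status.
        subprocess.check_call(cmd)

    # Register both volumes with LVM and pool them into one volume group.
    run("pvcreate", *DEVICES)
    run("vgcreate", VG, *DEVICES)

    # Create a single logical volume spanning all free space (~2TB).
    run("lvcreate", "-l", "100%FREE", "-n", LV, VG)

    # Put an XFS filesystem on the logical volume and mount it.
    run("mkfs.xfs", "/dev/%s/%s" % (VG, LV))
    run("mkdir", "-p", MOUNT)
    run("mount", "/dev/%s/%s" % (VG, LV), MOUNT)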

As a data source I used an Apache Web server serving 2TB of 500GB files filled with random binary data, which makes the files incompressible. For a client I used wget to retrieve the files from the Web server.
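
A quick way to generate this kind of incompressible test data is to write files from os.urandom. The file count and the /var/www path below are assumptions for illustration, not the exact layout used in the test.

    import os

    CHUNK = 1024 * 1024            # write in 1MB chunks
    FILE_SIZE = 500 * 1024**3      # 500GB per file
    NUM_FILES = 4                  # placeholder count; scale to fill ~2TB

    def write_random_file(path, size, chunk=CHUNK):
        # os.urandom produces high-entropy bytes, so the resulting files
        # are effectively incompressible.
        with open(path, "wb") as f:
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(os.urandom(n))
                remaining -= n

    for i in range(NUM_FILES):
        write_random_file("/var/www/random-%02d.bin" % i, FILE_SIZE)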

First-pass throughput averaged around 200Mbps using a single client; with two clients throughput averaged around 300Mbps, peaking at 400Mbps. Since these are virtual machines running on shared hardware and a shared network, fluctuations are to be expected. Average throughput remained constant as the cache grew to 1.6TB.

On a large instance cloudoptimizer uses 4GB of RAM to support a 1.6TB cache, which is the recommended cache size limit for a large instance. Memory usage scales linearly in both directions: an 800GB cache uses 2GB of RAM, a 400GB cache uses 1GB, and a 3.2TB cache uses 8GB.
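
The relationship is linear, roughly 1GB of RAM per 400GB of cache, so a back-of-the-envelope estimate is easy. The helper below simply restates the figures quoted above; it is not part of cloudoptimizer.

    # Roughly 1GB of RAM per 400GB of cache, per the figures above.
    RAM_GB_PER_CACHE_TB = 2.5

    def ram_needed_gb(cache_tb):
        """Estimated RAM needed for a cache of the given size (in TB)."""
        return cache_tb * RAM_GB_PER_CACHE_TB

    for size in (0.4, 0.8, 1.6, 3.2):
        print("%.1fTB cache -> ~%.0fGB RAM" % (size, ram_needed_gb(size)))
    # 0.4TB -> ~1GB, 0.8TB -> ~2GB, 1.6TB -> ~4GB, 3.2TB -> ~8GB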

The contents of the cache are compressed, so a 1.6TB cache can contain up to 3.2TB of raw data. Version 1.3.0 of cloudoptimizer also introduces a "shared cache": data is stored in the cache only once, even when a number of cloudoptimizer peers have sent the same data through the system. This increases the disk-usage efficiency of the cache.
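
Conceptually a shared cache behaves like a content-addressed store: entries are keyed by a hash of their contents, so the same data arriving from different peers maps to the same entry. The toy sketch below illustrates that idea only; it is not CloudOpt's actual cache format.

    import hashlib

    class SharedCache:
        """Toy content-addressed store: identical data is kept only once."""

        def __init__(self):
            self._chunks = {}          # digest -> chunk bytes
            self.duplicate_hits = 0

        def put(self, data):
            key = hashlib.sha256(data).hexdigest()
            if key in self._chunks:
                self.duplicate_hits += 1   # same data seen again, e.g. from another peer
            else:
                self._chunks[key] = data
            return key

        def unique_chunks(self):
            return len(self._chunks)

    cache = SharedCache()
    cache.put(b"payload from peer A")
    cache.put(b"payload from peer A")      # sent again by peer B, stored once
    print(cache.unique_chunks(), cache.duplicate_hits)   # 1 1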

System Throughput

To increase the throughput of the system a number of improvements were made to cloudoptimizer, including removing thread-contention hotspots, tuning the TCP stack, and reducing memory usage.
 
To measure system throughput I used two laptops on a Gigabit network. The first laptop ran an Apache Web server and a cloudoptimizer; the second ran a cloudoptimizer and wget acting as a client. The focus is on how fast the CPU on the first laptop can process data during the first-pass phase, which is the most computationally intensive step and thus the rate-limiting step. I used wget on the second laptop to retrieve a 250MB file of random binary data from the first laptop.
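
Measuring throughput in this setup is just a matter of timing the transfer. A minimal sketch, assuming wget is on the PATH; the URL is a placeholder for wherever the test file is served.

    import subprocess
    import time

    URL = "http://192.168.1.10/random-250MB.bin"   # placeholder address and file
    SIZE_BITS = 250 * 1024 * 1024 * 8              # 250MB expressed in bits

    start = time.time()
    subprocess.check_call(["wget", "-q", "-O", "/dev/null", URL])
    elapsed = time.time() - start

    print("throughput: %.0f Mbps" % (SIZE_BITS / elapsed / 1e6))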

The CPU in the laptop is an Intel P6100, which has two cores running at 2GHz and 3MB of cache. The laptop also has 4GB of memory and a 500GB hard disk. This CPU does not have the Intel AES-NI instructions. cloudoptimizer is designed to use these instructions automatically if they are available, and you can expect a significant performance improvement if your CPU supports them (nearly all modern CPUs do).
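
On Linux you can check whether your CPU exposes AES-NI by looking for the "aes" flag in /proc/cpuinfo, for example:

    def has_aes_ni(cpuinfo_path="/proc/cpuinfo"):
        # The kernel lists "aes" among the CPU flags when AES-NI is available.
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "aes" in line.split(":", 1)[1].split()
        return False

    print("AES-NI available:", has_aes_ni())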

For the first test I used the default cloudoptimizer configuration. This means the connection between the two cloudoptimizers is encrypted with SSL, the data sent between them is compressed, de-duplication is applied, and data stored in the caches is also compressed. The throughput was 314Mbps.

I turned off encryption and throughput jumped to 442Mbps. Many WAN optimization products only encrypt data that was already encrypted; data that was originally unencrypted is left unencrypted. Also, many people use VPNs to encrypt their traffic, in which case turning off encryption in cloudoptimizer makes sense.

Next I turned off cache compression; the data sent between the cloudoptimizers is still compressed and de-duplication is still applied. Throughput jumped to 550Mbps.

As part of the throughput optimizations a new chunking algorithm for de-duplication, called the "fast-chunker", was developed. Unfortunately, this chunking algorithm is not compatible with the previous version, so it is not configured as the default. The fast-chunker should be used for new installations and is particularly useful for networks with 1Gbps or 10Gbps WAN connections.
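
The internals of the fast-chunker are not published, but content-defined chunking in general works by running a rolling hash over the stream and cutting a chunk whenever the hash hits a boundary pattern, so identical data produces identical chunks regardless of where it sits in a file. A generic sketch of that idea (not CloudOpt's algorithm, and using a deliberately crude hash):

    def chunk_stream(data, mask=0x1FFF, min_size=2048, max_size=65536):
        """Generic content-defined chunking with a crude shift-and-add hash.

        A boundary is declared when the low bits of the hash are all zero
        (on average every mask+1 bytes), or when max_size is reached.
        Real implementations use a proper rolling hash such as a Rabin
        fingerprint; this is only an illustration of the principle.
        """
        chunks = []
        start = 0
        h = 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            length = i - start + 1
            if (length >= min_size and (h & mask) == 0) or length >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
                h = 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    import os
    chunks = chunk_stream(os.urandom(1 << 20))
    print(len(chunks), sum(len(c) for c in chunks))   # chunk count, total bytes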

Using the fast-chunker for de-duplication and compressing the data stream, I got 705Mbps.

The last step was to turn off compression of the data stream, so that only de-duplication is applied to the data. This lifted the throughput to 880Mbps.

The final test was to measure the throughput of the system without cloudoptimizer in place, i.e. a plain HTTP transfer; in this scenario the throughput was 896Mbps. This means that doing just de-duplication, a 2GHz core can run at very nearly the native speed of a Gigabit network.
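
Putting the numbers from these runs side by side makes the trade-off clear. The snippet below simply tabulates the figures quoted above against the 896Mbps plain-HTTP baseline.

    baseline = 896  # Mbps, plain HTTP with no cloudoptimizer in the path

    results = {
        "default (SSL, compression, dedup, cache compression)": 314,
        "encryption off": 442,
        "encryption and cache compression off": 550,
        "fast-chunker with stream compression": 705,
        "fast-chunker, de-duplication only": 880,
    }

    for config, mbps in results.items():
        print("%-55s %4dMbps  %5.1f%% of line rate"
              % (config, mbps, 100.0 * mbps / baseline))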

Improving throughput is important for two reasons: it reduces the resources required to optimize a WAN connection, and it makes it possible to optimize fast 1Gbps and 10Gbps WAN connections.

Conclusion

The new version of cloudoptimizer offers unparalleled scalability in terms of cache size and throughput. A system with 7.5GB of RAM can support a 1.6TB cache. A 2GHz core can sustain between 300Mbps and 880Mbps of first-pass throughput depending on configuration, and a modern 3GHz+ core with the AES-NI instruction set will support significantly higher throughput.