iSCSI before and after

27 Jan 2010

I originally posted this at http://blogs.sun.com/brendan/entry/iscsi_before_and_after.

I've blogged many 7410 performance results for the NFSv3, NFSv4, and CIFS protocols. Now it's iSCSI's turn.

When the Sun Storage 7000 series first launched, iSCSI was provided by a user-land process: iscsitgtd. While it worked, performance wasn't as good as the kernel-based services such as NFS. iSCSI performance was to be improved by making it a kernel service as part of the COMSTAR project. This was delivered in the 7000 series in the 2009 Q3 software release, and iSCSI performance was indeed greatly improved.

Here I'll show iSCSI performance test results which highlight the before and after effects of both COMSTAR and the 7410 hardware update. It would of course be better to split the results showing the COMSTAR and hardware update effects separately, but I only have the time to blog this right now (and I wouldn't be blogging anything without Joel's help to collect results). I would guess that the hardware update is responsible for up to a 2x improvement, as it was for NFS. The rest is COMSTAR.

OLD: Sun Storage 7410 (Barcelona, 2008 Q4 software)

To gather this data I powered up an original 7410 with the original software release from the product launch.

Testing max throughput

To do this I created 20 luns of 100 Gbytes each, which should entirely cache on the 7410. The point of this test is not to measure disk performance, but instead the performance of the iSCSI codepath. To optimize for max throughput, each lun used a 128 Kbyte block size. Since iSCSI is implemented as a user-land daemon in this release, performance is expected to suffer due to the extra work of copying data between the kernel and the iscsitgtd process.

I stepped up clients, luns, and thread count to find a maximum where adding more stopped improving throughput. I reached this with only 8 clients, 2 sequential read threads per client (128 Kbyte reads), and 1 lun per client; a sketch of the per-thread workload is below.
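To make the client workload concrete, here is a minimal sketch of the kind of sequential read loop each client thread ran. This is not the actual load generator; the device path, read size, and thread count are assumptions for illustration only.

    #!/usr/bin/env python
    # Minimal sketch of a per-client sequential read load generator (illustrative
    # only; /dev/sdb is a hypothetical client-side device path for the iSCSI lun).
    import os
    import threading

    DEVICE = "/dev/sdb"      # hypothetical device path for the iSCSI lun
    READ_SIZE = 128 * 1024   # 128 Kbyte reads, to optimize for throughput
    THREADS = 2              # 2 sequential read threads per client

    def sequential_reader():
        fd = os.open(DEVICE, os.O_RDONLY)
        try:
            while True:
                data = os.read(fd, READ_SIZE)
                if not data:                      # end of lun: wrap to the start
                    os.lseek(fd, 0, os.SEEK_SET)
        finally:
            os.close(fd)

    if __name__ == "__main__":
        workers = [threading.Thread(target=sequential_reader) for _ in range(THREADS)]
        for t in workers:
            t.daemon = True
            t.start()
        for t in workers:
            t.join()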

Network throughput was 311 Mbytes/sec. Actual iSCSI payload throughput will be a little lower due to network and iSCSI headers (jumbo frames were used, so the headers shouldn't add too much).

This 311 Mbytes/sec result is the "bad" result, before COMSTAR. However, is it really that bad? How many people are still on 1 Gbit Ethernet? A 1 GbE link tops out at roughly 125 Mbytes/sec, so 311 Mbytes/sec is plenty to saturate it, and 1 GbE may be all you have connected.

The CPU utilization for this test was 69%, suggesting that more headroom may be available. I wasn't able to consume it with my client test farm or workload.

Testing max IOPS

To test max IOPS, I repeated a similar test but used a 512 byte read size for the client threads instead of 128 Kbytes. This time 10 threads per client were run, on 8 clients with 1 lun per client.
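Relative to the throughput sketch above, only two parameters of the per-client workload change (values again illustrative):

    READ_SIZE = 512   # 512 byte reads, to optimize for IOPS rather than throughput
    THREADS = 10      # 10 read threads per client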

iSCSI IOPS were 37,056. For IOPS I only count served reads (and writes): the I/O part of IOPS.

NEW: Sun Storage 7410 (Istanbul, 2009 Q3+ software)

This isn't quite the 2009 Q3 release, it's our current development version. While it may be a little faster or slower than the actual 2009 Q3 release, it still reflects the magnitude of the improvement that COMSTAR and the Istanbul CPUs have made.

Testing max throughput

The result was a 60 second average of 2.75 Gbytes/sec network throughput – impressive, and close to the 3.06 Gbytes/sec I measured for NFSv3 on the same software and hardware. For this software version, Analytics included iSCSI payload bytes, which showed it actually moved 2.70 Gbytes/sec (the extra 0.05 Gbytes/sec was iSCSI and TCP/IP headers).

That's nearly 9x faster throughput. I would guess that up to 2x of this is due to the Istanbul CPUs, which still leaves over 4x due to COMSTAR.

Since this version of iSCSI could handle much more load, 47 clients were used with 5 luns and 5 threads per client. Four 10 GbE ports on the 7410 were configured to serve the data.

Testing max IOPS

The average for this test was over 318,000 read IOPS: more than 8x the original iSCSI result.

Configuration

Both 7410s were max head node configurations with max DRAM (128 and 256 Gbytes respectively) and max CPUs (4 sockets of Barcelona, and 4 sockets of Istanbul). Tests were performed over 10 GbE ports from two cards. This is the performance from single head nodes – they were not part of a cluster.

Writes

The above tests showed iSCSI read performance. Writes are processed differently: they are made synchronous unless the "write cache enabled" property is set on the lun (found under Shares->lun->Protocols). The description for this setting is:

This setting controls whether the LUN caches writes. With this setting off, all writes are synchronous and if no log device is available, write performance suffers significantly. Turning this setting on can therefore dramatically improve write performance, but can also result in data corruption on unexpected shutdown unless the client application understands the semantics of a volatile write cache and properly flushes the cache when necessary. Consult your client application documentation before turning this on.

For this reason, it's recommended to use Logzillas with iSCSI to improve write performance instead.
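To illustrate what "properly flushes the cache" means from the client side, here is a minimal sketch of a write followed by an explicit flush, assuming a Linux client with the lun attached as a hypothetical /dev/sdb. A real application would normally rely on its filesystem or database to issue these flushes at the right transactional points.

    # Minimal sketch: write to the iSCSI lun, then flush so the data isn't left
    # only in a volatile write cache (/dev/sdb is a hypothetical device path).
    import os

    fd = os.open("/dev/sdb", os.O_WRONLY)
    try:
        os.write(fd, b"important data")
        os.fsync(fd)   # flush cached writes through to stable storage
    finally:
        os.close(fd)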

Conclusion

As a kernel service, iSCSI performance is similar to the other kernel-based protocols on the current 7410. For cached reads it reaches 2.70 Gbytes/sec of payload throughput, and over 300,000 IOPS.