Reflex Benchmarks at Intel fasterLAB - May 2013

Description / Use Case
In many areas of financial data processing, such as consuming high-speed market data feeds directly from the exchanges, the lowest possible latency is desirable. Lower latency leaves more time to make a decision and increases the likelihood of capturing good trades by being the first to act. This test was designed to measure the industry-standard "half-round-trip" latency of Reflex, which is known to be a reliable indicator of expected production performance.

The test measures the total latency of the combined Mellanox ConnectX-3/VMA/Reflex stack, including the latency contributed by Reflex itself.

Measurements are of half-round-trip latency between two Reflex instances, one running a sender program and the other running an echo program. Timestamps were taken at the sender just before sending a source message and immediately upon receipt of its echo, using the clock_gettime(CLOCK_MONOTONIC_RAW, ...) system call to eliminate clock skew effects. The difference between the two timestamps was then divided by two to give a half-round-trip time (1/2 RTT) measurement. These results were graphed as box plots in which the center line of the box indicates the mean latency. The upper and lower sides of the box indicate a quartile of data above and below the mean, respectively. The outer "whiskers" of the box plot indicate the 99th percentile of the measurements. The indicated events-per-second figure is the rate of outgoing source transmissions. The total number of events processed by Reflex is actually double this source rate because of the simultaneous receipt of echo messages.
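The measurement procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual code: the helper names (now_ns, half_rtt_ns, summarize, total_event_rate) are hypothetical, the send/echo step is passed in as a callable, and the box edges are assumed to be the first and third quartiles of the samples.

```python
import statistics
import time


def now_ns():
    """Read a monotonic clock in nanoseconds on the sender only."""
    # CLOCK_MONOTONIC_RAW is platform-specific; fall back to time.monotonic_ns().
    clock = getattr(time, "CLOCK_MONOTONIC_RAW", None)
    if clock is not None:
        return time.clock_gettime_ns(clock)
    return time.monotonic_ns()


def half_rtt_ns(send_and_wait_for_echo):
    """Time one send/echo cycle and return half of the round trip."""
    t_send = now_ns()            # just before the source message is sent
    send_and_wait_for_echo()     # hypothetical: blocks until the echo arrives
    t_echo = now_ns()            # immediately upon receipt of the echo
    return (t_echo - t_send) / 2


def summarize(samples_ns):
    """Values needed for a box plot: mean, quartiles, 99th percentile."""
    q1, _q2, q3 = statistics.quantiles(samples_ns, n=4)
    p99 = statistics.quantiles(samples_ns, n=100)[98]
    return {
        "mean": statistics.fmean(samples_ns),
        "q1": q1,
        "q3": q3,
        "p99": p99,
    }


def total_event_rate(source_rate_per_s):
    """Each source message is matched by one received echo."""
    return 2 * source_rate_per_s
```

Because both timestamps come from the same clock on the sender, no clock synchronization between the two machines is needed. A run would collect many half_rtt_ns samples per configuration and pass them to summarize.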

Measurements were taken for the following transmission modes of Reflex:

  • TCP_NORMAL - Mode allowing batching of TCP transmissions
  • TCP_URGENT - Latency critical TCP transmissions
  • UDP_NORMAL - Mode allowing batching of UDP transmissions
  • UDP_URGENT - Latency critical UDP transmissions
Each mode was tested at payload sizes of 28 bytes and 1460 bytes, both with and without VMA kernel-bypass technology.
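Taken together, the modes, payload sizes, and VMA settings form a test matrix of 4 x 2 x 2 = 16 configurations. The sketch below simply enumerates that matrix; the function name and dictionary keys are hypothetical and not part of Reflex or VMA.

```python
from itertools import product

MODES = ("TCP_NORMAL", "TCP_URGENT", "UDP_NORMAL", "UDP_URGENT")
PAYLOAD_BYTES = (28, 1460)
VMA_ENABLED = (True, False)


def build_test_matrix():
    """Return every combination of mode, payload size, and VMA setting."""
    return [
        {"mode": mode, "payload_bytes": size, "vma": vma}
        for mode, size, vma in product(MODES, PAYLOAD_BYTES, VMA_ENABLED)
    ]
```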

The test bed consisted of two machines connected back-to-back (without a switch), each with the following configuration:
  • Industry-standard x86 architecture
  • 16-core Intel(R) Xeon(R) CPU E5-2680 @ 2.70GHz
  • Mellanox CX354A - ConnectX-3 QSFP (MCX354A-FCBT)
  • Reflex v1.1
  • VMA 6.4.5-0 (Development Snapshot built on May 22 2013 11:15:33)
  • OFED Version: MLNX_OFED_LINUX-2.0-2.0.0
  • Red Hat Enterprise Linux 6.4
  • Linux Kernel 2.6.32-358.el6.x86_64
  • Kernel arguments: "intel_idle.max_cstate=0 mce=ignore_ce isolcpus=4-15"
  • MTU set to 9000
  • scaling_governor set to "performance"
  • Unnecessary services were stopped
  • Reflex environment variables: RF_WORK_MODE=2 RF_NEVER_WAIT=1
  • VMA environment variables: VMA_RX_POLL=-1 VMA_MTU=9000
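When reproducing a setup like this, it can help to confirm that the kernel arguments listed above are actually in effect. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the kernel command line is taken as a plain string (on Linux it can be read from /proc/cmdline).

```python
EXPECTED_KERNEL_ARGS = {
    "intel_idle.max_cstate": "0",
    "mce": "ignore_ce",
    "isolcpus": "4-15",
}


def parse_kernel_args(cmdline):
    """Split a kernel command line into a {key: value} dictionary."""
    args = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        args[key] = value if sep else True   # bare flags map to True
    return args


def expand_cpu_list(spec):
    """Expand a CPU list such as '4-15' or '0,2-3' into explicit CPU ids."""
    cpus = []
    for part in spec.split(","):
        low, sep, high = part.partition("-")
        cpus.extend(range(int(low), int(high if sep else low) + 1))
    return cpus


def mismatched_kernel_args(cmdline, expected=EXPECTED_KERNEL_ARGS):
    """Return the expected arguments that are missing or set differently."""
    actual = parse_kernel_args(cmdline)
    return {
        key: actual.get(key)
        for key, value in expected.items()
        if actual.get(key) != value
    }
```

With isolcpus=4-15, expand_cpu_list yields CPUs 4 through 15, that is, 12 of the 16 cores are kept away from the general scheduler.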

Result plots (one box plot per configuration):
  • UDP_URGENT Mellanox, Payload 28 bytes
  • UDP_NORMAL Mellanox, Payload 28 bytes
  • TCP_URGENT Mellanox, Payload 28 bytes
  • TCP_NORMAL Mellanox, Payload 28 bytes
  • UDP_URGENT Mellanox, Payload 1460 bytes
  • UDP_NORMAL Mellanox, Payload 1460 bytes
  • TCP_URGENT Mellanox, Payload 1460 bytes
  • TCP_NORMAL Mellanox, Payload 1460 bytes