The code sample suggests it's a network interface at localhost, but it's not 2M syscalls or connections. It's just one big buffered write and it doesn't look like they're waiting for responses to those messages - so essentially this might just be a (pretty good?) stream processing app.
It is not one big write, but optimizations around msgs/write using buffering are used in clients, with obvious care to balance latency and throughput. In the benchmark, the write buffer is 16k, so it is flushed automatically via Go's bufio when it hits that mark, and then I flush it again when the loop is complete, flushing the remainder of the outbound buffer. I then use a PING/PONG, which is part of the NATS protocol, to only stop timing when the PONG returns, and I know all messages have been processed. NATS does have a verbose protocol flag that has all protocol frames ack'd with either +OK or -ERR.
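For readers who want to see the shape of that loop, here is a minimal sketch of it, not the actual benchmark code. It assumes a toy stand-in server (the hypothetical `serve` function) that only counts PUB frames and answers PING with PONG; a real NATS server also sends an INFO line on connect and does much more. The subject `foo`, the payload, and the function names are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// serve is a toy stand-in for a server (NOT a real NATS server): it counts
// PUB frames and answers PING with PONG. It assumes payloads contain no newline.
func serve(ln net.Listener, counted chan<- int) {
	conn, err := ln.Accept()
	if err != nil {
		return
	}
	defer conn.Close()
	r := bufio.NewReaderSize(conn, 16*1024)
	n := 0
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return
		}
		switch {
		case strings.HasPrefix(line, "PUB "):
			// The payload follows on the next line.
			if _, err := r.ReadString('\n'); err != nil {
				return
			}
			n++
		case strings.HasPrefix(line, "PING"):
			counted <- n // report how many messages arrived before the PING
			conn.Write([]byte("PONG\r\n"))
		}
	}
}

// publishAndWait sends msgs messages through a 16k bufio.Writer and stops
// the clock only when the PONG comes back.
func publishAndWait(msgs int) (int, time.Duration, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()
	counted := make(chan int, 1)
	go serve(ln, counted)

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer conn.Close()
	w := bufio.NewWriterSize(conn, 16*1024) // written to the socket each time 16k fills up
	r := bufio.NewReader(conn)

	start := time.Now()
	for i := 0; i < msgs; i++ {
		if _, err := w.WriteString("PUB foo 5\r\nhello\r\n"); err != nil {
			return 0, 0, err
		}
	}
	if err := w.Flush(); err != nil { // push out the remainder of the buffer
		return 0, 0, err
	}
	if _, err := w.WriteString("PING\r\n"); err != nil {
		return 0, 0, err
	}
	if err := w.Flush(); err != nil {
		return 0, 0, err
	}
	reply, err := r.ReadString('\n')
	if err != nil {
		return 0, 0, err
	}
	elapsed := time.Since(start)
	if reply != "PONG\r\n" {
		return 0, 0, fmt.Errorf("unexpected reply %q", reply)
	}
	return <-counted, elapsed, nil
}

func main() {
	n, elapsed, err := publishAndWait(1000000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d msgs in %v\n", n, elapsed)
}
```

The point of the sketch is that the number of socket writes is governed by the buffer size rather than the message count, while the final PING/PONG round trip ensures the timer covers processing on the other side and not just the sends.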
Explain? ~600k tps on 1Gb NICs is very achievable. 10-15M tps on 10Gb interfaces is also reasonable. In both cases I'm thinking of UDP packets in/out measured on the wire. With modern hardware I'd buy similar numbers for TCP.
Write a C program that creates a single TCP connection and an infinite loop that just calls write(2) or send(2), and then do the same for UDP. I'd be very surprised if you can cross 60k calls/sec. Given that, how can you send or receive 2M messages a sec if your bottleneck can only handle 60k? I could be missing an important piece here, but there isn't much information provided in the link either.
Check out netmap for an API without these issues. I'm not sure what API the original author is using to exchange data with the wire but bulk/scatter/gather approaches are typical in high performance messaging systems.
If you have 100Mb/s bandwidth, your theoretical limit (regardless of language) is 25600 packets/sec [1] and at roughly 290 (16b) req / chunked-pipelined packet, you are looking at 7.4M+ req/s.
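The arithmetic behind those figures can be checked in a few lines. One assumption to flag: the footnote is not reproduced here, so the reading that 25600 packets/sec comes from 100 MiB/s split into 4 KiB packets is an inference from the numbers, not something the comment states. The helper names are made up for illustration.

```go
package main

import "fmt"

// packetsPerSecond returns how many packets of packetBytes fit into bytesPerSec.
func packetsPerSecond(bytesPerSec, packetBytes int) int {
	return bytesPerSec / packetBytes
}

// requestsPerSecond multiplies the packet rate by the requests pipelined per packet.
func requestsPerSecond(packetsPerSec, reqPerPacket int) int {
	return packetsPerSec * reqPerPacket
}

func main() {
	// ASSUMPTION: 100 MiB/s and 4 KiB packets, inferred from the 25600 figure.
	pps := packetsPerSecond(100*1024*1024, 4096)
	fmt.Println(pps) // 25600

	// The quoted ~290 requests per packet gives the 7.4M+ figure.
	fmt.Println(requestsPerSecond(pps, 290)) // 7424000

	// For comparison, a 4096-byte packet holds 256 requests of exactly 16 bytes.
	fmt.Println(requestsPerSecond(pps, 4096/16)) // 6553600
}
```

Either way the estimate lands in the millions of requests per second, which is the point being made: pipelining many small requests into each packet decouples the request rate from the packet rate.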
Of course, latency will suffer. You can't have both max throughput and min latency. It's a choice to be made.
It's clear that the MBA is throttling the calls at the kernel level, since nothing at the protocol or hardware level would necessitate something as slow as 60k calls/sec. While that speed is certainly not bad, especially for a consumer laptop, and may beat out-of-the-box Linux distributions, it doesn't mean that the bottleneck isn't hardware dependent.
Ultimately, if you want real speed doing something really simple, you're going to want to use FPGAs anyway, which would cost a trivial consulting fee to implement compared to a development effort that will go deep into your kernel, hardware drivers and probably protocol tuning as well.
I wrote a system that tried to push as many messages through a socket as possible, whether it was tcp, udp, unix domain, or posix message queue. The goal was to determine which IPC mechanism was best suited for highest concurrency rather than highest throughput, but I think the results are interesting for both.
I wanted to measure the amount of time it took to do the same amount of work (i.e. enqueue and dequeue one million items) for each IPC mechanism and how concurrency affected the performance. The system uses a single producer with a varying number of consumer processes.
The stream socket implementations (tcp, unix domain stream socket) actually perform a minimum of two write()'s per queue item - once for the length and once for the actual content. Both of these are wrapped in loops to ensure the full content is written, so occasionally more than two write()'s might occur.
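As an illustration of that framing, here is a small sketch in Go (the project itself is C++, so this is not its code). It assumes a 4-byte big-endian length prefix, which is a guess since the width is not stated above. The hypothetical `chunkWriter` deliberately accepts only a few bytes per call to imitate short writes from write(2); Go's own `net.Conn.Write` already loops internally, so the explicit loop in `writeFull` only mirrors what a C or C++ caller has to do.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// chunkWriter imitates write(2) short writes: it accepts at most max bytes
// per call. (Hypothetical helper; it bends the io.Writer contract on purpose.)
type chunkWriter struct {
	dst   bytes.Buffer
	max   int
	calls int
}

func (c *chunkWriter) Write(p []byte) (int, error) {
	c.calls++
	if len(p) > c.max {
		p = p[:c.max]
	}
	return c.dst.Write(p)
}

// writeFull keeps calling Write until every byte of p has been accepted.
func writeFull(w io.Writer, p []byte) error {
	for len(p) > 0 {
		n, err := w.Write(p)
		if err != nil {
			return err
		}
		if n == 0 {
			return io.ErrShortWrite // avoid spinning forever on a stuck writer
		}
		p = p[n:]
	}
	return nil
}

// writeFrame sends the length first, then the content: at least two writes.
// ASSUMPTION: 4-byte big-endian length prefix.
func writeFrame(w io.Writer, payload []byte) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if err := writeFull(w, hdr[:]); err != nil {
		return err
	}
	return writeFull(w, payload)
}

// readFrame reads one length-prefixed item back.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

func main() {
	cw := &chunkWriter{max: 3}
	if err := writeFrame(cw, []byte("hello world")); err != nil {
		fmt.Println("error:", err)
		return
	}
	got, err := readFrame(&cw.dst)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q sent in %d Write calls\n", got, cw.calls)
}
```

With a writer that never comes up short, `writeFrame` makes exactly two calls per item, which matches the "minimum of two write()'s" described above.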
On one of the Linux hosts, the TCP implementation can push 1 million messages through in roughly 0.66 seconds using a single producer thread. That is about 1.5 million messages/second, which corresponds pretty closely to the 2 million messages/second claim for NATS, and since each message takes at least two write()'s, it is roughly 3 million write() calls per second. The POSIX message queue can do it in 0.48 seconds, which corresponds to more than 2 million messages/second, but POSIX message queues are a datagram implementation that requires only one mq_send() per message.
I think this shows that 2M syscalls/second is indeed possible. I made no special effort to optimize the C++ code. If you'd like to review the code or run the tests yourself, feel free to check out the code at https://github.com/adamonduty/queueable . Anyone can run the tests and submit results to view on the web interface.
Oh, and interestingly, the MacBook Pro I used to generate OS X results was the most recent hardware but among the slowest in wall-clock performance. OS X also shows a zig-zag effect as message size increases, similar to SunOS. Not sure why.