x86/x64 Delphi, Lazarus, C#, and C++ Compared in Raw Computational Power

As part of my research into the amount of effort it would take to invent a new Object Pascal compiler and language variant,  I decided to take some time to dig into the ASM code that gets generated when performing some really simple computations in Delphi and compare the speed of the operations. I then expanded my tests to include competing compilers.  

Setup

I used a basic prime number evaluation function and translated it to the various languages.  Each program measured how long it took to evaluate whether $7FFFFFFF (2,147,483,647, the largest signed 32-bit integer) is prime (which it is) under various conditions. Besides Delphi, I tested the same function using the FreePascal (Lazarus) compiler under -O3 optimization and translated it to Microsoft’s Visual C# and Visual C++.  I specifically used VS2010, because that just happened to be what I had on my i7 laptop.

I chose the prime test because it would be easy to disassemble, pick apart, and observe exactly what was happening to the code, and also because it benchmarks a function that essentially boils down to a simple while loop performing a bunch of integer arithmetic.  This test is integer-computationally intensive and is unhindered by I/O or memory bandwidth.

The basic code has lines that I switched on/off and changed between runs, but it looks something like this:

Here is the C++ version of the code:

and….finally… here is the C# version of the code:

I ran all tests with int64/int32 interchanged (in Visual C++ that is “long” vs. “long long”) as well as at various optimization levels.
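For anyone who wants something concrete to compile, here is a minimal C++ sketch of this kind of test. It is not the exact benchmarked code: the loop bound, the names `is_prime`, `TimedResult`, and `time_is_prime`, and the use of a template parameter to swap int32/int64 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Naive trial-division primality test. The integer type is a template
// parameter so the same loop can be built with 32-bit or 64-bit variables.
// Assumption: the loop bound (t < n) is a guess, not the benchmarked one.
template <typename Int>
bool is_prime(Int n) {
    if (n < 2) return false;
    Int t = 2;
    while (t < n) {
        if (n % t == 0) return false;  // one integer division per iteration
        ++t;
    }
    return true;
}

// Hypothetical result holder: the answer plus the wall-clock time taken.
struct TimedResult {
    bool prime;
    double seconds;
};

// Times a single call to is_prime<Int>(n) with a monotonic clock.
template <typename Int>
TimedResult time_is_prime(Int n) {
    const auto start = std::chrono::steady_clock::now();
    const bool prime = is_prime<Int>(n);
    const auto stop = std::chrono::steady_clock::now();
    return TimedResult{prime, std::chrono::duration<double>(stop - start).count()};
}
```

Calling `time_is_prime<std::int32_t>(0x7FFFFFFF)` and `time_is_prime<std::int64_t>(0x7FFFFFFF)` gives the two variable-size variants of the same loop; the timings will of course depend on the compiler, flags, and machine.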

Results

The results are interesting as well as perplexing.

Platform                        Var type   Time (s)
C++ x64 (Speed Optimization)    int64      2.012
C++ x64 (Speed Optimization)    int32      2.028
C++ x86 (Speed Optimization)    int32      4.773
C# (x86 Target – Optimized)     int32      4.778
C++ x64 (No Optimization)       int32      4.790
C++ x86 (No Optimization)       int32      4.805
Delphi-x86                      int32      5.350
Delphi-x64                      int32      5.350
Lazarus-x86                     int32      5.504
C# (x64 Target – Optimized)     int32      5.876
Delphi-x86                      int64      8.534
C# (x86 Target – Optimized)     int64      10.260
Delphi-x64                      int64      10.965
Lazarus-x64                     int32      11.23
C++ x64 (No Optimization)       int64      11.887
Lazarus-x64                     int64      12.10
C++ x86 (Speed Optimization)    int64      12.186
C++ x86 (No Optimization)       int64      13.603
C# (x64 Target – Optimized)     int64      14.940
Lazarus-x86                     int64      19.11

 

To make the results a bit more readable, here are the results isolated into various categories:

x86 Results Isolated

Platform                        Var type   Time (s)
C++ x86 (Speed Optimization)    int32      4.773
C# (x86 Target – Optimized)     int32      4.778
C++ x86 (No Optimization)       int32      4.805
Delphi-x86                      int32      5.350
Lazarus-x86                     int32      5.504
Delphi-x86                      int64      8.534
C# (x86 Target – Optimized)     int64      10.260
C++ x86 (Speed Optimization)    int64      12.186
C++ x86 (No Optimization)       int64      13.603
Lazarus-x86                     int64      19.11

 

x64 Results Isolated

Platform                        Var type   Time (s)
C++ x64 (Speed Optimization)    int64      2.012
C++ x64 (Speed Optimization)    int32      2.028
C++ x64 (No Optimization)       int32      4.790
Delphi-x64                      int32      5.350
C# (x64 Target – Optimized)     int32      5.876
Delphi-x64                      int64      10.965
Lazarus-x64                     int32      11.23
C++ x64 (No Optimization)       int64      11.887
Lazarus-x64                     int64      12.10
C# (x64 Target – Optimized)     int64      14.940

 

Int32 Results Isolated

Platform                        Var type   Time (s)
C++ x64 (Speed Optimization)    int32      2.028
C++ x86 (Speed Optimization)    int32      4.773
C# (x86 Target – Optimized)     int32      4.778
C++ x64 (No Optimization)       int32      4.790
C++ x86 (No Optimization)       int32      4.805
Delphi-x86                      int32      5.350
Delphi-x64                      int32      5.350
Lazarus-x86                     int32      5.504
C# (x64 Target – Optimized)     int32      5.876
Lazarus-x64                     int32      11.23

 

Int64 Results Isolated

Platform                        Var type   Time (s)
C++ x64 (Speed Optimization)    int64      2.012
Delphi-x86                      int64      8.534
C# (x86 Target – Optimized)     int64      10.260
Delphi-x64                      int64      10.965
C++ x64 (No Optimization)       int64      11.887
Lazarus-x64                     int64      12.10
C++ x86 (Speed Optimization)    int64      12.186
C++ x86 (No Optimization)       int64      13.603
C# (x64 Target – Optimized)     int64      14.940
Lazarus-x86                     int64      19.11

Conclusions

1) VC++ x64 really shines and seems to be the only compiler I tested that actually succeeds in optimizing for x64 processors.  All the other compilers, including VC++ x86, performed at less than half the speed of VC++ x64.

2) A lot of tests fall into what I’d call a “respectable” performance range, which I’d put at roughly 4.7-5.9 seconds.  You’ll notice, however, that only one test involving 64-bit integers finished in under 5.9 seconds, and that was the C++ x64 test, completing in a stunning 2.012 seconds.  Keep in mind that fully-optimized C++ x86, which I’d treat as the current industry-standard baseline, achieved a time of 4.773 seconds.  When working with 32-bit numbers, Delphi, Lazarus-x86, C#, and C++ all produced respectable results.  The only compiler that failed to achieve “respectable” performance for 32-bit numbers was the Lazarus-x64 compiler, coming in at a lackluster 11.23 seconds.

3) When switching to 64-bit numbers, however, the results were all over the board.  [C++ x64] actually delivered on the performance promises of x64, but if you look at the 64-bit integer isolated table, no other compiler came anywhere close to touching the C++ x64 numbers. In fact, Delphi-x86 came in 2nd place, strangely, but was still over 4x slower than C++ x64.  Aside from the C++ x64 test, virtually every other test took a performance hit on 64-bit numbers.

4) Delphi, Lazarus, and C# could not compete with good old-fashioned C++ when optimized for speed, with one notable exception: [C++ x86] was not very good at dealing with Int64s.  This could be different with another compiler, such as GCC, or maybe with some library changes. A 32-bit x86 target has to build Int64 operations out of 32-bit instructions and helper routines, so I imagine performance can vary widely from one implementation to the next.

5) We would expect x64 code to outperform x86 (yet only the C++ compiler seems to accomplish this).

6) We would expect int32 performance to be slightly poorer than int64 performance on x64 platforms. It is possibly within the margin of error, but only the C++ compiler showed this artifact.

7) Delphi and Lazarus were outperformed by [C# x86] for 32-bit operations.  I wouldn’t use this to conclude that C# is ultimately faster because I don’t believe that to be the case whenever objects are involved (subject of another set of tests).  However, in this simple, raw, math test, Delphi was bested by C# x86.

8) For int32 math, [C# x86] with “optimization” enabled performed roughly on par with [C++ x86] without any optimization at all. Most compilers performed roughly on par with unoptimized C++ when dealing with 32-bit numbers, but performance varied greatly for 64-bit numbers.

9) Omitting the C++ tests, and with the exception of the Lazarus results, the #1 determining factor in run-time is the size of the data types being processed, regardless of the target.  All int32 tests ran much faster than int64 tests.

10) I am rather surprised to find that [Delphi-x86] outperformed all other 32-bit compilers in 64-bit calculations.  It even outperformed C++ (ignoring the C++ 64-bit compiler)!  Don’t go flying any “Mission Accomplished” banners just yet though; this is just one scenario of many and frankly, I’m mostly interested in a 64-bit compiler operating on 64-bit numbers.

11) The disassembled Delphi 64-bit code appears to be, on the surface, far less complex than the 32-bit code but still runs much slower.  This is despite the fact that the 32-bit code involves a lot more stack pushes and pops and calls to functions to divide 64-bit integers, whereas the 64-bit code involves native ASM instructions.
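To make points 4 and 11 a little more concrete, here is a rough C++ sketch of what a 32-bit target has to do for even the simplest 64-bit operation, an addition. The type `U64Parts` and the functions `add64`, `split`, and `join` are hypothetical names for illustration; a real compiler emits an add/add-with-carry instruction pair rather than code like this, and 64-bit division or modulus is far more involved, which is why it typically ends up as a call into a runtime helper.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit unsigned value held as two 32-bit halves, the way a 32-bit
// target has to keep it (a register pair or two stack slots).
struct U64Parts {
    std::uint32_t lo;
    std::uint32_t hi;
};

// 64-bit addition built from two 32-bit additions plus a carry.
inline U64Parts add64(U64Parts a, U64Parts b) {
    U64Parts r;
    r.lo = a.lo + b.lo;                                   // wraps modulo 2^32
    const std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // wrapped => carry out
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Conversion helpers so the result can be checked against native 64-bit math.
inline U64Parts split(std::uint64_t v) {
    return U64Parts{static_cast<std::uint32_t>(v),
                    static_cast<std::uint32_t>(v >> 32)};
}

inline std::uint64_t join(U64Parts p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}
```

Every int64 step in the prime loop (increment, compare, modulus) pays some version of this overhead on x86, which fits the general pattern in the x86 table where the int64 rows trail the int32 rows.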

I compared the disassembled code to the FreePascal version, which  actually contains less code, but this code must involve instructions that are slower than the Delphi instructions.  I don’t know how deep into this I plan on digging.  But for the record…

Here’s the Delphi-x64 generated code

And here’s the Lazarus-x64 code

C++ was the champion.  The champion compiler is rather difficult to get intelligible disassembly from, because when you compile C or C++ with full optimization enabled, it often re-orders your code, sometimes transforming it into something completely different and unrecognizable. C++ language optimization has been the study of many a thesis over the years, and I think if you were to build a Delphi compiler that optimized to this level, you might as well compile down to C++ instead of ASM to simply take advantage of all the hard work people have been putting into this challenge over the years.
Here is the best I could come up with for what was generated for IsPrime(). It appears to have been inlined and I don’t even see any “idiv” instructions in it (is it doing a multiply/compare?). I’ll have to study this some more.  [UPDATE: Upon further study, I now believe that the C++ compiler achieved greater performance through the use of algebraic transformations.  It was smart enough to understand that modulus operations are expensive, and determined that it could achieve the same comparison without division/modulus.  A more efficient way to test for a divisor in this scenario is to check whether (x*t)==0x7fffffff for some integer x rather than whether (0x7fffffff%t)==0.  It is a very smart compiler, indeed!  Through other independent testing, I observed that integer division on Intel processors still runs at only around 25% of the speed of multiplication.  This would totally explain the 4X performance difference.  I plan to follow up with future blogs that test other types of performance bottlenecks.]

The C++ x64 Code:

It is probably the subject of a different blog, but I also ran some tests in Delphi and Lazarus where I tested their ability to eliminate what are clearly, obviously redundant instructions.  I did this by creating a “useless property” of an object.  In the setter of the object I basically just set a field.  A great compiler would be able to peek at what CPU registers were affected by this function and then use that information to optimize the code in the calling function.

Results of this test under disassembly showed that both the 32-bit and 64-bit compilers for Delphi and Lazarus are rather terrible at optimizing function calls. For example the 64-bit disassembly is as follows:

It is rather annoying to me that there is a redundant use of mov rcx,rbx (which I assume is to set “self”) before calling the function.

The function call itself does not manipulate rcx, so rcx should already be set to “self” after the first call and therefore that extra instruction could be eliminated if the optimizer were as good as it could possibly be.  This becomes a real problem when your app grows in complexity with lots of objects and is one of the main reasons I want to replace the compiler. Maybe that’s one real difference between some of the highly optimized C++ compilers vs. these less-optimized Pascal compilers, although there are heaps of studies I should probably pore through. Maybe I’ll implement this same test in C++ and compare the results.
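A C++ version of that test might look something like the sketch below. The class `Widget` and the function `call_setter_twice` are hypothetical names, and this is only an assumption about how the test would translate. One thing worth keeping in mind when reading the disassembly: under the Microsoft x64 calling convention RCX carries the first argument (here, the object pointer) and is a volatile register, so a callee is allowed to overwrite it. A compiler can only skip the reload if it can actually see that this particular callee leaves RCX alone.

```cpp
#include <cassert>

// Hypothetical stand-in for the "useless property" object: the setter
// does nothing except store its argument in a field.
class Widget {
public:
    void set_value(int v);  // declared here, defined out of line below
    int value() const { return value_; }

private:
    int value_ = 0;
};

// Out-of-line definition. In a single source file an optimizer will often
// inline this anyway; moving it to a separate translation unit (without
// link-time optimization) forces real calls.
void Widget::set_value(int v) { value_ = v; }

// Two back-to-back setter calls on the same object. The thing to look for
// in the disassembly is whether the object pointer is reloaded into its
// argument register before the second call.
int call_setter_twice(Widget& w, int a, int b) {
    w.set_value(a);
    w.set_value(b);
    return w.value();
}
```

If the compiler inlines the setter, both calls disappear entirely and the first store may be dropped as dead, which is the kind of result the Pascal compilers did not produce in my test.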

It was suggested to me to look at FreePascal as an alternative before going full-force on writing my own compiler, but my initial finding is that FPC is well below standard on the optimization side.  I’ll probably continue to run more tests though, because, after all, not every function you write is going to be computing prime numbers.  I’ll try to design some tests that benchmark it using methods that are more in line with what I’m seeking to do with it in the real world (iSCSI storage systems). FPC held up on the Int32/Win32 tests, but in all other tests it was slower than the Delphi compiler. The C++ tests have me hungry to get back what we’re all missing, and the C# tests make me question why on earth I would be writing code in this inferior language when C# can outperform it.

That is all I have time to write for today.  If you find my information useful, even though I can go off on the occasional rant, please come back and subscribe (if that even works).  I program professionally, but I’m not a professional blogger… I don’t get paid to do this part. 😉