As part of my research into the amount of effort it would take to invent a new Object Pascal compiler and language variant, I decided to take some time to dig into the ASM code that gets generated when performing some really simple computations in Delphi and compare the speed of the operations. I then expanded my tests to include competing compilers.
Setup
I used a basic prime number evaluation function and translated it to the various languages. Each program measured the amount of time it took to evaluate whether $7FFFFFFF is prime (which it is) under various conditions. I also tested the same function using the FreePascal (Lazarus) compiler under -O3 optimization and translated it to Microsoft’s Visual C# and Microsoft C++. I specifically used VS2010, because that just happened to be what I had on my i7 Laptop.
I chose the prime test because it would be easy to disassemble and pick apart and observe what was exactly happening to the code and also because it would benchmark a function that can essentially be boiled down to a simple while loop that performs a bunch of integer arithmetic. This test is integer-computationally intensive and is unhindered by I/O or memory bandwidth.
The basic bit of code contains lines are switched on/off and changed, but it looks something like this:
procedure Tcmd_IsPrime.DoExecute; var t: int64; x: int64; s: int64; begin inherited; t := 2; s := self.Subject; x := s div 2; while (t < x) do begin if (s mod t) = 0 then begin result := false; exit; end; inc(t); end; result := true; end;
Here is the C++ version of the code:
// UT_PrimeCPP.cpp : Defines the entry point for the console application. // //#include "Afx.h" #include "stdafx.h" #include <Windows.h> typedef long long use_type_t; static bool IsPrime() { use_type_t t,x,s; t = 2; s = 0x7fffffff; x = s / 2; while (t<x) { if ((t%x) == 0) { return false; } t++; } return true; } int _tmain(int argc, _TCHAR* argv[]) { DWORD tmStart, tmEnd; _tprintf(_T("Hello SizeOf(use_type_t) = %d "), sizeof(use_type_t)); tmStart = GetTickCount(); bool b = IsPrime(); tmEnd = GetTickCount(); //Note that if you don't actually use "b" IsPrime() won't even get called in C++. The optimizer correctly determines that it is a worthless waste of time. if (b) { _tprintf(_T(" Prime!")); } _tprintf(_T(" Time: %d"), tmEnd-tmStart); Sleep(4000); }
and….finally… here is the C# version of the code:
using System; using System.Collections.Generic; using System.Text; namespace ConsoleApplication1 { class Program { static bool IsPrime() { Int64 t,x,s; t = 2; s = 0x7fffffff; x = s / 2; while (t<x) { if ((t%x) == 0) { return false; } t++; } return true; } static void Main(string[] args) { Int64 tmStart, tmEnd; tmStart = MiscLib.Time.GetTickCountInMilliseconds(); IsPrime(); tmEnd = MiscLib.Time.GetTickCountInMilliseconds(); Console.WriteLine("IsPrime() took "+(tmEnd-tmStart).ToString()); Console.ReadKey(); } } }
I ran all tests with int64/int32 interchanged (In C++ that is “long” vs. “long long”) as well as at various optimization levels.
Results
The results are interesting as well as perplexing.
platform | var type | Time |
---|---|---|
C++ x64 (Speed Optimization) | int64 | 2.012 |
C++ x64 (Speed Optimization) | int32 | 2.028 |
C++ x86 (Speed Optimization) | int32 | 4.773 |
C# (x86 Target – Optimized) | int32 | 4.778 |
C++ x64 (No Optimization) | int32 | 4.790 |
C++ x86 (No Optimization) | int32 | 4.805 |
Delphi-x86 | int32 | 5.350 |
Delphi-x64 | int32 | 5.350 |
Lazarus-x86 | int32 | 5.504 |
C# (x64 Target – Optimized) | int32 | 5.876 |
Delphi-x86 | int64 | 8.534 |
C# (x86 Target – Optimized) | int64 | 10.260 |
Delphi-x64 | int64 | 10.965 |
Lazarus-x64 | int32 | 11.23 |
C++ x64 (No Optimization) | int64 | 11.887 |
Lazarus-x64 | int64 | 12.10 |
C++ x86 (Speed Optimization) | int64 | 12.186 |
C++ x86 (No Optimization) | int64 | 13.603 |
C# (x64 Target – Optimized) | int64 | 14.940 |
Lazarus-x86 | int64 | 19.11 |
To make the results a bit more readable, here are the results isolated into various categories:
x86 Results Isolated
platform | var type | Time |
---|---|---|
C++ x86 (Speed Optimization) | int32 | 4.773 |
C# (x86 Target – Optimized) | int32 | 4.778 |
C++ x86 (No Optimization) | int32 | 4.805 |
Delphi-x86 | int32 | 5.350 |
Lazarus-x86 | int32 | 5.504 |
Delphi-x86 | int64 | 8.534 |
C# (x86 Target – Optimized) | int64 | 10.260 |
C++ x86 (Speed Optimization) | int64 | 12.186 |
C++ x86 (No Optimization) | int64 | 13.603 |
Lazarus-x86 | int64 | 19.11 |
x64 Results Isolated
platform | var type | Time |
---|---|---|
C++ x64 (Speed Optimization) | int64 | 2.012 |
C++ x64 (Speed Optimization) | int32 | 2.028 |
C++ x64 (No Optimization) | int32 | 4.790 |
Delphi-x64 | int32 | 5.350 |
C# (x64 Target – Optimized) | int32 | 5.876 |
Delphi-x64 | int64 | 10.965 |
Lazarus-x64 | int32 | 11.23 |
C++ x64 (No Optimization) | int64 | 11.887 |
Lazarus-x64 | int64 | 12.10 |
C# (x64 Target – Optimized) | int64 | 14.940 |
Int32 results Isolated
platform | var type | Time |
---|---|---|
C++ x64 (Speed Optimization) | int32 | 2.028 |
C++ x86 (Speed Optimization) | int32 | 4.773 |
C# (x86 Target – Optimized) | int32 | 4.778 |
C++ x64 (No Optimization) | int32 | 4.790 |
C++ x86 (No Optimization) | int32 | 4.805 |
Delphi-x86 | int32 | 5.350 |
Delphi-x64 | int32 | 5.350 |
Lazarus-x86 | int32 | 5.504 |
C# (x64 Target – Optimized) | int32 | 5.876 |
Lazarus-x64 | int32 | 11.23 |
Int64 Results
platform | var type | Time |
---|---|---|
C++ x64 (Speed Optimization) | int64 | 2.012 |
Delphi-x86 | int64 | 8.534 |
C# (x86 Target – Optimized) | int64 | 10.260 |
Delphi-x64 | int64 | 10.965 |
C++ x64 (No Optimization) | int64 | 11.887 |
Lazarus-x64 | int64 | 12.10 |
C++ x86 (Speed Optimization) | int64 | 12.186 |
C++ x86 (No Optimization) | int64 | 13.603 |
C# (x64 Target – Optimized) | int64 | 14.940 |
Lazarus-x86 | int64 | 19.11 |
Conclusions
1) VC++ x64 really shines and seems to the only compiler I tested that actually succeeds in optimization for x64 processors. All the other compilers, including VC++ x86 performed at less than half the speed of VC++ x64.
2) There were a lot of tests that I’d say fall in to a “respectable” performance range. Tests that completed in the 4.7-5.8 second range I’d consider “respectable”, however, you’ll notice that there is only one test involving 64-bit integers that beat a 5.8 second time, and that was the C++ x64 test completing in a stunning 2.012 seconds. Keep in mind that fully-optimized C++ x86 achieved a time of 4.773 seconds which is currently the industry standard. When working with 32-bit numbers, Delphi, Lazarus-x86, C#, and C++ all produced respectable results. The only compiler that failed to achieve “respectable” performance for 32-bit numbers was the Lazarus-x64 compiler, coming in at a lackluster 11.23 Seconds.
3) When switching to 64-bit numbers, however, the results were all over the board. [C++ x64] actually delivered on the performance promises of x64… but…. if you look at the 64-bit integer isolated table, no compiler came anywhere close to touching the C++ numbers… in fact, Delphi-x86 came in 2nd place, strangely, but was still over 4x slower than C++. Aside from the C++ x64 test, virtually every other test took a performance hit on 64-bit numbers.
4) Delphi, Lazarus, and C# could not compete with good old-fashioned C++ when optimized for speed with only one notable exception being that [C++ x86] was not very good at dealing with Int64s. This could be different with another compiler, such as GCC or maybe some library changes… Int64s are simulated in software on x86, so they can vary widely I imagine.
5) We would expect x64 code to outperform x86 (yet only the C++ compiler seems to be accomplish this)
6) We would expect int32 performance to be slightly poorer than int64 performance on x64 platforms. It is possibly within the margin of error, but only the C++ compiler showed this artifact.
7) Delphi and Lazarus were outperformed by [C# x86] for 32-bit operations. I wouldn’t use this to conclude that C# is ultimately faster because I don’t believe that to be the case whenever objects are involved (subject of another set of tests). However, in this simple, raw, math test, Delphi was bested by C# x86.
8) For int32 math, [C# x86] with “optimization” enabled performed roughly equivalent to [C++ x86] without any optimization at all. Most compilers performed mostly on-par with unoptimized C++ when dealing with 32-bit numbers but performance was varied greatly for 64-bit numbers.
9) Omitting the C++ tests, and with the exception of the Lazarus results, The #1 determining factor in run-time is the size of the data types being processed regardless of the target. All Int32 tests ran much faster than int64 tests .
10) I am rather surprised to find that [Delphi-x86] outperformed all other 32-bit compilers in 64-bit calculations It even outperformed C++ (ignoring the C++ 64-bit compiler)! Don’t go flying any “Mission Accomplished” banners just yet though, this is just one scenario of many and frankly, I’m mostly interested in a 64-bit compiler operating on 64-bit numbers.
11) The disassembled Delphi 64-bit code appears to be, on the surface, far less complex than the 32-bit code but still runs much slower. This is despite the fact that the 32-bit code involves a lot more stack pushes and pops and calls to functions to divide 64-bit integers, whereas the 64-bit code involves native ASM instructions.
I compared the disassembled code to the FreePascal version, which actually contains less code, but this code must involve instructions that are slower than the Delphi instructions. I don’t know how deep into this I plan on digging. But for the record…
Here’s the Delphi-x64 generated code
UnitTest_Commands.pas.118: begin sub rsp,$28 UnitTest_Commands.pas.121: t := 2; mov r8d,$00000002 UnitTest_Commands.pas.122: s := self.Subject; mov r10,[rcx+$00000190] UnitTest_Commands.pas.123: x := s div 2; mov r9d,$00000002 mov rax,r10 cqo idiv r9 mov r9,rax jmp Tcmd_IsPrime.DoExecute + $41 UnitTest_Commands.pas.126: if (s mod t) = 0 then begin mov rax,r10 cqo idiv r8 test rdx,rdx jnz Tcmd_IsPrime.DoExecute + $3D UnitTest_Commands.pas.127: result := false; xor rdx,rdx call {CommandTypes}TFunctionCommand.SetREsult UnitTest_Commands.pas.128: exit; jmp Tcmd_IsPrime.DoExecute + $4D UnitTest_Commands.pas.130: inc(t); add r8,$01 UnitTest_Commands.pas.125: while (t < x) do begin cmp r8,r9 jl Tcmd_IsPrime.DoExecute + $26 UnitTest_Commands.pas.132: result := true; mov dl,$01 call {CommandTypes}TFunctionCommand.SetREsult UnitTest_Commands.pas.135: end; add rsp,$28
And here’s the Lazarus-x64 code
umain.pas:60 t := 2; mov $0x2,%r10d umain.pas:61 s := $7FFFFFFF; mov $0x7fffffff,%r8d umain.pas:62 x := s div 2; movslq %r8d,%rcx mov %rcx,%rax sar $0x3f,%rax and $0x1,%rax add %rax,%rcx sar %rcx umain.pas:63 while (t < x) do begin jmp 0x42e880 <TFORM1__ISPRIME+64> data32 data32 xchg %ax,%ax xchg %ax,%ax umain.pas:64 if (s mod t) = 0 then begin movslq %r8d,%rax movslq %r10d,%r9 cqto idiv %r9 test %rdx,%rdx jne 0x42e87d <TFORM1__ISPRIME+61> umain.pas:65 result := false; mov $0x0,%r11b umain.pas:66 exit; jmp 0x42e88a <TFORM1__ISPRIME+74> umain.pas:68 inc(t); inc %r10d
C++ was the champion. The champion compiler is rather difficult to get intelligible disassembly from, because when you compile C or C++ with full optimization enabled, it often re-orders your code, sometimes transforming it into something completely different and unrecognizable. C++ language optimization has been the study of many a thesis over the years, and I think if you were to build a Delphi compiler that optimized to this level, you might as well compile down to C++ instead of ASM to simply take advantage of all the hard work people have been putting into this challenge over the years.
Here is the best I could come up with for what was generated for IsPrime()… it appears to have been inlined and I don’t even see any “idiv” instructions in it… (is it doing a multiply/compare?) I’ll have to study this some more. [UPDATE: Upon further study, I now believe that the C++ compiler achieved greater performance through the use of algebraic transformations. It was smart enough to understand that MODULUS operations are expensive, and determined that it could achieve the same comparison without division/modulus. A more efficient way to test if a number is prime in this scenario, is to check if (x*t)==0x7fffffff rather than (0x7fffffff%t)==0. It is a very smart compiler, indeed! Through other independent testing, I observed that integer division on Intel processors is still around a meager 25% of the speed of multiplication. This would totally explain the 4X performance difference. I plan to follow up with future blogs that test other types of performance bottlenecks.]
The C++ x64 Code:
// bool b = IsPrime(); mov ecx,2 mov esi,eax mov r8,8000000200000009h nop dword ptr [rax+rax] mov rax,r8 imul rcx add rdx,rcx sar rdx,1Dh mov rax,rdx shr rax,3Fh add rdx,rax imul rdx,rdx,3FFFFFFFh cmp rcx,rdx je main+73h (13F8110D3h) inc rcx cmp rcx,3FFFFFFFh jl main+40h (13F8110A0h) mov bl,1 jmp main+75h (13F8110D5h) xor bl,bl
It is probably the subject of a different blog, but I also ran some tests in Delphi and Lazarus where I tested their ability to eliminate what are clearly, obviously redundant instructions. I did this by creating a “useless property” of an object. In the setter of the object I basically just set a field. A great compiler would be able to peek at what CPU registers were affected by this function and then use that information to optimize the code in the calling function.
Results of this test under disassembly showed that both the 32-bit and 64-bit compilers for Delphi and Lazarus are rather terrible at optimizing function calls. For example the 64-bit disassembly is as follows:
mov rcx, rbx mov edx, esi call Tcmd_IsPrime.SetUseLessProperty mov rcx, rbx mov edx, edi call Tcmd_IsPrime.SetUseLessProperty
It is rather annoying to me that there is a redundant use of mov rcx,rbx (which I assume is to set “self”) before calling the function.
The function call itself does not manipulate rcx, so rcx should already be set to “self” after the first call and therefore that extra instruction could be eliminated if the optimizer were as good as it could possibly be. This becomes a real problem when your app grows in complexity with lots of objects and is one of the main reasons I want to replace the compiler. Maybe that’s one real difference between some of the highly optimized C++ compilers vs. these less-optimized pascal compilers, although there are heaps of studies I should probably pour through. Maybe I’ll implement this same test in C++ and compare the results.
It was suggested to me to look at FreePascal as an alternative before going full-force on writing my own compiler, but my initial finding is that FPC is very below standard on the optimization side. I’ll probably continue to run more tests though, because, after all not every function you write is going to be computing prime numbers. I’ll try and design some tests that benchmark it using some methods that are more in-line with what I’m seeking to do with it in the real world (iSCSI storage systems). FPC held up on the Int32/Win32 tests, but all other tests, it was slower than the Delphi compiler. The C++ tests have me hungry to get back what we’re all missing and the C# tests make me question why on earth I would be writing code in this inferior language when C# can outperform it.
That is all I have time to write for today. If you find my information useful, even though I can go off on the occasional rant, please come back subscribe (if that even works). I program professionally, but I’m not a professional blogger… I don’t get paid to do this part. 😉
Can you just help enhance and “fix” the Free Pascal? It’s open source, after all.
I actually just now saw your comment on my other blog. I should compare SmartPascal and some of the other compilers if possible. I am considering contributing to FreePascal… but where I am at right now is trying to determine the proper starting point. I believe that future modern compilers are better off compiling to C instead of ASM. So part of the idea of making Pascal as fast as C involve compiling to C in the intermediate steps. Otherwise I have to redo a lot of work that has been the subject of many years of computer-science study. I believe that an Object Pascal variant is the most logical successor to C++… but in order for it to succeed, it needs to be everything that C++ is and more…. right now it has some advantages over C++, but is also has some shortcomings.
What version of FPC are you using? FPC 2.7.1 does have improvements when it comes to 64 bit code.
2.6.4 is what was recently packaged with Lazarus. I am aware that newer versions exist and FPC is constantly being developed. I’ll update my tests at some point and try to include 2.7.1 or later. Last time I tried to use the latest FPC with Lazarus it didn’t integrate with the IDE very well and there were some frustrations with the maze of features that are/aren’t available with all the $MODE settings etc. I’m sure I can get it to perform this simpler test though.
It seems to me that your Delphi code is not quite equivalent code in C ++. Therefore, the comparison may not be entirely correct.
Could you give us a full text of Delphi code?
Yeah, I second that. A side-by-side code comparison would be sweet. Kinda unfair to diss Delphi without it.
Agreed, context matters. Code equivalency ensures a fair test.
Exactly. Let’s see the full Delphi code for comparison.