x86/x64 Delphi, Lazarus, C#, and C++ Compared in Raw Computational power

As part of my research into the amount of effort it would take to invent a new Object Pascal compiler and language variant,  I decided to take some time to dig into the ASM code that gets generated when performing some really simple computations in Delphi and compare the speed of the operations. I then expanded my tests to include competing compilers.  

Setup

I used a basic prime number evaluation function and translated it to the various languages.  Each program measured the amount of time it took to evaluate whether $7FFFFFFF is prime (which it is) under various conditions. I also tested the same function using the FreePascal (Lazarus) compiler under -O3 optimization and translated it to Microsoft’s Visual C# and Microsoft C++.  I specifically used VS2010, because that just happened to be what I had on my i7 Laptop.

I chose the prime test because it would be easy to disassemble and pick apart and observe what was exactly happening to the code and also because it would benchmark a function that can essentially be boiled down to a simple while loop that performs a bunch of integer arithmetic.  This test is integer-computationally intensive and is unhindered by I/O or memory bandwidth.

The basic bit of code contains lines are switched on/off and changed, but it looks something like this:

procedure Tcmd_IsPrime.DoExecute;
var
  t: int64;
  x: int64;
  s: int64;
begin
  inherited;

  t := 2;
  s := self.Subject;
  x := s div 2;

  while (t < x) do begin
    if (s mod t) = 0 then begin
      result := false;
      exit;
    end;
    inc(t);
  end;
  result := true;
end;

Here is the C++ version of the code:

// UT_PrimeCPP.cpp : Defines the entry point for the console application.
//

//#include "Afx.h"

#include "stdafx.h"
#include <Windows.h>


typedef 
	long long use_type_t;

static bool IsPrime()
{
	use_type_t t,x,s;
	t = 2;
	s = 0x7fffffff;
	x = s / 2;

	while (t<x)
	{
		if ((t%x) == 0)
		{
			return false;
		}

		t++;
	}
            
	return true;  
}   

int _tmain(int argc, _TCHAR* argv[])
{
	DWORD tmStart, tmEnd;
		
	_tprintf(_T("Hello SizeOf(use_type_t) = %d "), sizeof(use_type_t));
	tmStart = GetTickCount();
    bool b = IsPrime();
    tmEnd = GetTickCount();
	
//Note that if you don't actually use "b" IsPrime() won't even get called in C++.  The optimizer correctly determines that it is a worthless waste of time.
	if (b) {
		_tprintf(_T(" Prime!"));
	}
	_tprintf(_T(" Time: %d"), tmEnd-tmStart);
	Sleep(4000);
	
}

and….finally… here is the C# version of the code:

using System;
using System.Collections.Generic;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static bool IsPrime()
        {
            Int64 t,x,s;
            t = 2;
            s = 0x7fffffff;
            x = s / 2;

            while (t<x)
            {
                if ((t%x) == 0)
                {
                    return false;
                }

                t++;
            }
            
            return true;  
        }
  

  
        static void Main(string[] args)
        {

            Int64 tmStart, tmEnd;
            tmStart = MiscLib.Time.GetTickCountInMilliseconds();
            IsPrime();
            tmEnd = MiscLib.Time.GetTickCountInMilliseconds();

            Console.WriteLine("IsPrime() took "+(tmEnd-tmStart).ToString());
            Console.ReadKey();
        }
    }
}

I ran all tests with int64/int32 interchanged (In C++ that is “long” vs. “long long”) as well as at various optimization levels.

Results

The results are interesting as well as perplexing.

platform var type Time
C++ x64 (Speed Optimization) int64 2.012
C++ x64 (Speed Optimization) int32 2.028
C++ x86 (Speed Optimization) int32 4.773
C# (x86 Target – Optimized) int32 4.778
C++ x64 (No Optimization) int32 4.790
C++ x86 (No Optimization) int32 4.805
Delphi-x86 int32 5.350
Delphi-x64 int32 5.350
Lazarus-x86 int32 5.504
C# (x64 Target – Optimized) int32 5.876
Delphi-x86 int64 8.534
C# (x86 Target – Optimized) int64 10.260
Delphi-x64 int64 10.965
Lazarus-x64 int32 11.23
C++ x64 (No Optimization) int64 11.887
Lazarus-x64 int64 12.10
C++ x86 (Speed Optimization) int64 12.186
C++ x86 (No Optimization) int64 13.603
C# (x64 Target – Optimized) int64 14.940
Lazarus-x86 int64 19.11

 

To make the results a bit more readable, here are the results isolated into various categories:

x86 Results Isolated

platform var type Time
C++ x86 (Speed Optimization) int32 4.773
C# (x86 Target – Optimized) int32 4.778
C++ x86 (No Optimization) int32 4.805
Delphi-x86 int32 5.350
Lazarus-x86 int32 5.504
Delphi-x86 int64 8.534
C# (x86 Target – Optimized) int64 10.260
C++ x86 (Speed Optimization) int64 12.186
C++ x86 (No Optimization) int64 13.603
Lazarus-x86 int64 19.11

 

x64 Results Isolated

platform var type Time
C++ x64 (Speed Optimization) int64 2.012
C++ x64 (Speed Optimization) int32 2.028
C++ x64 (No Optimization) int32 4.790
Delphi-x64 int32 5.350
C# (x64 Target – Optimized) int32 5.876
Delphi-x64 int64 10.965
Lazarus-x64 int32 11.23
C++ x64 (No Optimization) int64 11.887
Lazarus-x64 int64 12.10
C# (x64 Target – Optimized) int64 14.940

 

Int32 results Isolated

platform var type Time
C++ x64 (Speed Optimization) int32 2.028
C++ x86 (Speed Optimization) int32 4.773
C# (x86 Target – Optimized) int32 4.778
C++ x64 (No Optimization) int32 4.790
C++ x86 (No Optimization) int32 4.805
Delphi-x86 int32 5.350
Delphi-x64 int32 5.350
Lazarus-x86 int32 5.504
C# (x64 Target – Optimized) int32 5.876
Lazarus-x64 int32 11.23

 

Int64 Results

platform var type Time
C++ x64 (Speed Optimization) int64 2.012
Delphi-x86 int64 8.534
C# (x86 Target – Optimized) int64 10.260
Delphi-x64 int64 10.965
C++ x64 (No Optimization) int64 11.887
Lazarus-x64 int64 12.10
C++ x86 (Speed Optimization) int64 12.186
C++ x86 (No Optimization) int64 13.603
C# (x64 Target – Optimized) int64 14.940
Lazarus-x86 int64 19.11

Conclusions

1) VC++ x64 really shines and seems to the only compiler I tested that actually succeeds in optimization for x64 processors.  All the other compilers, including VC++ x86 performed at less than half the speed of VC++ x64.

2) There were a lot of tests that I’d say fall in to a “respectable” performance range.  Tests that completed in the 4.7-5.8 second range I’d consider “respectable”, however, you’ll notice that there is only one test involving 64-bit integers that beat a 5.8 second time, and that was the C++ x64 test completing in a stunning 2.012 seconds.  Keep in mind that fully-optimized C++ x86 achieved a time of 4.773 seconds which is currently the industry standard.  When working with 32-bit numbers, Delphi, Lazarus-x86, C#, and C++ all produced respectable results.  The only compiler that failed to achieve “respectable” performance for 32-bit numbers was the Lazarus-x64 compiler, coming in at a lackluster 11.23 Seconds.

3) When switching to 64-bit numbers, however, the results were all over the board.  [C++ x64] actually delivered  on the performance promises of x64… but…. if you look at the 64-bit integer isolated table, no compiler came anywhere close to touching the C++ numbers… in fact, Delphi-x86 came in 2nd place, strangely, but was still over 4x slower than C++.  Aside from the C++ x64 test, virtually every other test took a performance hit on 64-bit numbers.

4) Delphi, Lazarus, and C# could not compete with good old-fashioned C++ when optimized for speed with only one notable exception being that [C++ x86] was not very good at dealing with Int64s.  This could be different with another compiler, such as GCC or maybe some library changes… Int64s are simulated in software on x86, so they can vary widely I imagine.

5) We would expect x64 code to outperform x86 (yet only the C++ compiler seems to be accomplish this)

6) We would expect int32 performance to be slightly poorer than int64 performance on x64 platforms. It is possibly within the margin of error, but only the C++ compiler showed this artifact.

7) Delphi and Lazarus were outperformed by [C# x86] for 32-bit operations.  I wouldn’t use this to conclude that C# is ultimately faster because I don’t believe that to be the case whenever objects are involved (subject of another set of tests).  However, in this simple, raw, math test, Delphi was bested by C# x86.

8) For int32 math, [C# x86] with “optimization” enabled performed roughly equivalent to [C++ x86] without any optimization at all. Most compilers performed mostly on-par with unoptimized C++ when dealing with 32-bit numbers but performance was varied greatly for 64-bit numbers.

9) Omitting the C++ tests, and with the exception of the Lazarus results, The #1 determining factor in run-time is the size of the data types being processed regardless of the target.  All Int32 tests ran much faster than int64 tests .

10) I am rather surprised to find that [Delphi-x86] outperformed  all other 32-bit compilers in 64-bit calculations  It even outperformed C++ (ignoring the C++ 64-bit compiler)!  Don’t go flying any “Mission Accomplished” banners just yet though, this is just one scenario of many and frankly, I’m mostly interested in a 64-bit compiler operating on 64-bit numbers.

11) The disassembled Delphi 64-bit code appears to be, on the surface, far less complex than the 32-bit code but still runs much slower.  This is despite the fact that the 32-bit code involves a lot more stack pushes and pops and calls to functions to divide 64-bit integers, whereas the 64-bit code involves native ASM instructions.

I compared the disassembled code to the FreePascal version, which  actually contains less code, but this code must involve instructions that are slower than the Delphi instructions.  I don’t know how deep into this I plan on digging.  But for the record…

Here’s the Delphi-x64 generated code

UnitTest_Commands.pas.118: begin
    sub rsp,$28
UnitTest_Commands.pas.121: t := 2;
    mov r8d,$00000002
UnitTest_Commands.pas.122: s := self.Subject;
    mov r10,[rcx+$00000190]
UnitTest_Commands.pas.123: x := s div 2;
    mov r9d,$00000002
    mov rax,r10
    cqo
    idiv r9
    mov r9,rax
    jmp Tcmd_IsPrime.DoExecute + $41
UnitTest_Commands.pas.126: if (s mod t) = 0 then begin
    mov rax,r10
    cqo
    idiv r8
    test rdx,rdx
    jnz Tcmd_IsPrime.DoExecute + $3D
UnitTest_Commands.pas.127: result := false;
    xor rdx,rdx
    call {CommandTypes}TFunctionCommand.SetREsult
UnitTest_Commands.pas.128: exit;
    jmp Tcmd_IsPrime.DoExecute + $4D
UnitTest_Commands.pas.130: inc(t);
    add r8,$01
UnitTest_Commands.pas.125: while (t < x) do begin
    cmp r8,r9
    jl Tcmd_IsPrime.DoExecute + $26
UnitTest_Commands.pas.132: result := true;
    mov dl,$01
    call {CommandTypes}TFunctionCommand.SetREsult
UnitTest_Commands.pas.135: end;
    add rsp,$28

And here’s the Lazarus-x64 code

umain.pas:60              t := 2;
    mov    $0x2,%r10d
umain.pas:61              s := $7FFFFFFF;
    mov    $0x7fffffff,%r8d
umain.pas:62              x := s div 2;
    movslq %r8d,%rcx
    mov    %rcx,%rax
    sar    $0x3f,%rax
    and    $0x1,%rax
    add    %rax,%rcx
    sar    %rcx
umain.pas:63              while (t < x) do begin
    jmp    0x42e880 <TFORM1__ISPRIME+64>
    data32 data32 xchg %ax,%ax
    xchg   %ax,%ax
umain.pas:64              if (s mod t) = 0 then begin
    movslq %r8d,%rax
    movslq %r10d,%r9
    cqto   
    idiv   %r9
    test   %rdx,%rdx
    jne    0x42e87d <TFORM1__ISPRIME+61>
umain.pas:65              result := false;
    mov    $0x0,%r11b
umain.pas:66              exit;
    jmp    0x42e88a <TFORM1__ISPRIME+74>
umain.pas:68              inc(t);
    inc    %r10d

C++ was the champion.  The champion compiler is rather difficult to get intelligible disassembly from, because when you compile C or C++ with full optimization enabled, it often re-orders your code, sometimes transforming it into something completely different and unrecognizable. C++ language optimization has been the study of many a thesis over the years, and I think if you were to build a Delphi compiler that optimized to this level, you might as well compile down to C++ instead of ASM to simply take advantage of all the hard work people have been putting into this challenge over the years.
Here is the best I could come up with for what was generated for IsPrime()… it appears to have been inlined and I don’t even see any “idiv” instructions in it… (is it doing a multiply/compare?) I’ll have to study this some more.  [UPDATE: Upon further study, I now believe that the C++ compiler achieved greater performance through the use of algebraic transformations.   It was smart enough to understand that MODULUS operations are expensive, and determined that it could achieve the same comparison without division/modulus.   A more efficient way to test if a number is prime in this scenario, is to check if (x*t)==0x7fffffff rather than (0x7fffffff%t)==0.  It is a very smart compiler, indeed!  Through other independent testing, I observed that integer division on Intel processors is still around a meager 25% of the speed of multiplication.  This would totally explain the 4X performance difference.    I plan to follow up with future blogs that test other types of performance bottlenecks.]

The C++ x64 Code:

//    bool b = IsPrime();
mov         ecx,2  
mov         esi,eax  
mov         r8,8000000200000009h  
nop         dword ptr [rax+rax]  
mov         rax,r8  
imul        rcx  
add         rdx,rcx  
sar         rdx,1Dh  
mov         rax,rdx  
shr         rax,3Fh  
add         rdx,rax  
imul        rdx,rdx,3FFFFFFFh  
cmp         rcx,rdx  
je          main+73h (13F8110D3h)  
inc         rcx  
cmp         rcx,3FFFFFFFh  
jl          main+40h (13F8110A0h)  
mov         bl,1  
jmp         main+75h (13F8110D5h)  
xor         bl,bl

It is probably the subject of a different blog, but I also ran some tests in Delphi and Lazarus where I tested their ability to eliminate what are clearly, obviously redundant instructions.  I did this by creating a “useless property” of an object.  In the setter of the object I basically just set a field.  A great compiler would be able to peek at what CPU registers were affected by this function and then use that information to optimize the code in the calling function.

Results of this test under disassembly showed that both the 32-bit and 64-bit compilers for Delphi and Lazarus are rather terrible at optimizing function calls. For example the 64-bit disassembly is as follows:

mov rcx, rbx
mov edx, esi
call Tcmd_IsPrime.SetUseLessProperty
mov rcx, rbx
mov edx, edi
call Tcmd_IsPrime.SetUseLessProperty

It is rather annoying to me that there is a redundant use of mov rcx,rbx (which I assume is to set “self”) before calling the function.

The function call itself does not manipulate rcx, so rcx should already be set to “self” after the first call and therefore that extra instruction could be eliminated if the optimizer were as good as it could possibly be.  This becomes a real problem when your app grows in complexity with lots of objects and is one of the main reasons I want to replace the compiler. Maybe that’s one real difference between some of the highly optimized C++ compilers vs. these less-optimized pascal compilers, although there are heaps of studies I should probably pour through. Maybe I’ll implement this same test in C++ and compare the results.

It was suggested to me to look at FreePascal as an alternative before going full-force on writing my own compiler, but my initial finding is that FPC is very below standard on the optimization side.  I’ll probably continue to run more tests though, because, after all not every function you write is going to be computing prime numbers.  I’ll try and design some tests that benchmark it using some methods that are more in-line with what I’m seeking to do with it in the real world (iSCSI storage systems). FPC held up on the Int32/Win32 tests, but all other tests, it was slower than the Delphi compiler. The C++ tests have me hungry to get back what we’re all missing and the C# tests make me question why on earth I would be writing code in this inferior language when C# can outperform it.

That is all I have time to write for today.  If you find my information useful, even though I can go off on the occasional rant, please come back subscribe (if that even works).  I program professionally, but I’m not a professional blogger… I don’t get paid to do this part. 😉

5 Replies to “x86/x64 Delphi, Lazarus, C#, and C++ Compared in Raw Computational power”

    1. I actually just now saw your comment on my other blog. I should compare SmartPascal and some of the other compilers if possible. I am considering contributing to FreePascal… but where I am at right now is trying to determine the proper starting point. I believe that future modern compilers are better off compiling to C instead of ASM. So part of the idea of making Pascal as fast as C involve compiling to C in the intermediate steps. Otherwise I have to redo a lot of work that has been the subject of many years of computer-science study. I believe that an Object Pascal variant is the most logical successor to C++… but in order for it to succeed, it needs to be everything that C++ is and more…. right now it has some advantages over C++, but is also has some shortcomings.

    1. 2.6.4 is what was recently packaged with Lazarus. I am aware that newer versions exist and FPC is constantly being developed. I’ll update my tests at some point and try to include 2.7.1 or later. Last time I tried to use the latest FPC with Lazarus it didn’t integrate with the IDE very well and there were some frustrations with the maze of features that are/aren’t available with all the $MODE settings etc. I’m sure I can get it to perform this simpler test though.

  1. It seems to me that your Delphi code is not quite equivalent code in C ++. Therefore, the comparison may not be entirely correct.

    Could you give us a full text of Delphi code?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.