The HLL Performance Gap

Geoffrey Broadwell geoff at broadwell.org
Tue Jul 21 23:31:20 UTC 2009


THE PROBLEM:

In today's #parrotsketch, Allison stated that Parrot 2.0 has an explicit
goal of having not only pure-PIR code in production, but HLL code as
well.  Right now that's not feasible, partly because HLL code runs VERY
slowly.
This is not the minor kind of slow that can be resolved by iterating
on fixes that each gain a few percent until they add up to a 2-3x
improvement.  We're talking Real Slow -- Rakudo, for example,
is 1-2 orders of magnitude too slow for production use on my various
workloads.  While (non-trivial) pure PIR code doesn't exactly scream, it
is at least fast enough to do *some* production tasks; the vast
difference between PIR and HLL performance levels is what bothers me for
now.

(Yes, making non-trivial PIR execute 1-2 orders of magnitude faster
would do the trick for several of my use cases, because it would carry
the HLL performance "along for the ride".  I doubt that jump will happen
until Parrot truly goes JIT with a vengeance like the current browser
JavaScript engines do.  Methinks that's not a reasonable 2.0 goal.
Plus, there'd still be all that extra HLL overhead wasting performance
that could instead be winning us kudos and karma.)


THE ANALYSIS:

A few weeks ago, I was able to spend some time trying to track down why
this huge performance gap existed, at least for my code running in
Rakudo.  I suspect that other HLLs face similar problems, and I'd love
to see what the rest of you have found.

In any case, I found a number of issues, summarized in this list:

1. Every Perl 6 scope becomes a PIR sub.
2. Rakudo scopes are extra-heavy.
3. PCT has no optimization passes.


THE DETAILS:

1. Every Perl 6 scope becomes a PIR sub.

This turns inner loops into massive numbers of PIR sub calls, which
kills performance.  Colloquial Perl 6 code has a LOT of small scopes
and implicit loops, making this problem even worse.
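
To make the shape of the problem concrete, here's a hand-simplified
sketch of the pattern -- not actual Rakudo output (the real thing is
far more involved, and the '!block' name is made up) -- showing how a
one-line Perl 6 loop body becomes its own sub, invoked once per
iteration:

    # Perl 6 source:  for 1..3 { say $_ }

    .sub 'main' :main
        .local pmc i
        i = box 1
      loop:
        $I0 = i
        if $I0 > 3 goto done
        '!block'(i)          # one full Parrot sub call per iteration
        inc i
        goto loop
      done:
    .end

    # The one-line block body gets a sub all to itself.
    .sub '!block'
        .param pmc topic
        .lex '$_', topic     # the block's fresh lexical $_
        say topic
    .end

Even stripped down this far, every trip through the loop pays a full
calling-convention round trip just to execute a single say.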

It is my understanding that this is fundamentally a limitation of
Parrot's lexical scope handling; scopes and subs are currently
inseparable.  If this remains the case (hopefully for sound
architectural reasons), then it may be valuable to detect cases when a
scope can be optimized away.  This is especially the case because:

2. Rakudo scopes are extra-heavy.

In theory, every Perl 6 scope introduces a new lexical $_, $!, and $/,
and must perform some exception housekeeping as well.  A clever compiler
could optimize these away in some cases, and luckily the easiest
places to find those optimizations are exactly where they matter most
-- in tight loops.

In practice, since Rakudo does not do these optimizations yet, the PIR
that implements the visible Perl 6 code in a small scope is often
dwarfed by the boilerplate surrounding it.  This boilerplate could be
optimized in a number of ways; for example, a new op
could be written that creates all three special lexicals at once.
Still, there's quite a bit of boilerplate there; it's unlikely to become
truly smokin' fast using micro-optimization alone.
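
To give a feel for the ratio, here's a hand-written approximation of
the per-scope pattern -- not Rakudo's actual output, just its general
shape.  The single say is the only line the 'user' wrote; everything
else is boilerplate:

    .sub 'scope_sketch'
        .local pmc topic, err, match, exception
        # boilerplate: fresh $_, $!, and $/ for this scope
        topic = new 'Undef'
        .lex '$_', topic
        err = new 'Undef'
        .lex '$!', err
        match = new 'Undef'
        .lex '$/', match
        # boilerplate: exception housekeeping
        push_eh handler
        # --- the code the user actually wrote ---
        say 'hello'
        # --- end of user code ---
        pop_eh
        .return ()
      handler:
        .get_results (exception)
        store_lex '$!', exception
        .return ()
    .end

An op that created all three special lexicals at once would shrink
this, but the push_eh/pop_eh dance and the three PMC allocations would
still be paid on every scope entry.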

3. PCT has no optimization passes.

Right now, PCT follows a notionally simple set of rules to convert PAST
to POST to PIR, and does not attempt significant optimization at any
point.  The result is PIR code that a human would regard as silly --
consecutive loads of the same lexical, boxing and unboxing of
constants, loop-invariant operations recomputed inside loops instead
of hoisted out of them, and so on.
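
As a concrete instance of the first two items, here's the kind of code
in question -- hand-written to mimic the pattern, not actual PCT output
-- followed by what one simple pass over it could produce:

    .sub 'main' :main
        'before'()
        'after'()
    .end

    # The pattern as emitted today (illustrative):
    .sub 'before'
        .local pmc x
        x = box 0
        .lex '$x', x
        .local int n
        n = 0
      loop:
        if n >= 3 goto done
        find_lex $P0, '$x'   # '$x' reloaded on every iteration
        $P1 = box 1          # constant re-boxed on every iteration
        add $P0, $P1
        store_lex '$x', $P0
        inc n
        goto loop
      done:
        say x                # prints 3
    .end

    # After redundant-load elimination and loop-invariant hoisting:
    .sub 'after'
        .local pmc x
        x = box 0
        .lex '$x', x
        .local int n
        n = 0
        find_lex $P0, '$x'   # loaded once
        $P1 = box 1          # boxed once
      loop:
        if n >= 3 goto done
        add $P0, $P1
        inc n
        goto loop
      done:
        say x                # prints 3
    .end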

I had a couple of paragraphs here arguing why this should be solved with
optimization passes added specifically to PCT and not the HLLs or the
PIR compiler, but Tene assures me that was already the long-term plan.
The important part of my message then becomes:

It's time to get started.  By 2.0, we should have some optimizations
ready to go.  It's not necessary to implement the most cutting-edge
optimization algorithms -- joining the 80's ought to be sufficient.  :-)


THE TASKS:

Collecting the thoughts above, here are some individual tasks that
could be turned into tickets and/or release goals.

1. Add past_optimize and post_optimize passes to PCT.  Start with just a
single useful optimization in each.  (A rough sketch of the wiring
follows this list.)

2. Make it easy to plug in additional optimizations in a way that allows
people to work in parallel on different optimizations without blocking
on each other unduly.

3. Analyze the PIR output from HLL compiles of a number of scripts in
different task domains, with different coding styles, and of various
lengths.  Collect a list of common targets for optimization (e.g.
"redundant loads of same data"), and create tickets for implementing
optimization plugins for them.  Potentially coder-parallel.

4. Implement the tickets created in task #3.  This should parallelize
easily across multiple Parrot hackers.

5. Analyze and micro-optimize Rakudo's per-scope boilerplate.  Focus
first on optimizations that other HLLs may be able to use for their own
particular boilerplate hell.

6. Reduce the overhead of PIR sub calls.  I believe Allison is already
working on refactoring that will help with this, but I suspect there
will be more improvement possible even after her work is merged.

7. Allow lexical scopes to be separate from PIR subs.  If sub calls
become really cheap, this will be unnecessary.  If sub calls remain
expensive, this should be considered.
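
As promised in task #1, here's a rough sketch of how the wiring might
look, assuming the stage keeps PCT's current tree-in/tree-out shape and
that PCT::HLLCompiler's addstage hook is available (if your PCT
predates addstage, this is where such a hook would go).  All the MyHLL
names are placeholders:

    .sub 'onload' :anon :load :init
        load_bytecode 'PCT.pbc'
        .local pmc compiler
        compiler = new ['PCT';'HLLCompiler']
        compiler.'language'('MyHLL')
        # splice the new stage in right after PAST generation
        compiler.'addstage'('past_optimize', 'after' => 'past')
    .end

    # A stage is just a method on the compiler: tree in, tree out.
    .namespace ['PCT';'HLLCompiler']
    .sub 'past_optimize' :method
        .param pmc past
        .param pmc adverbs :slurpy :named
        # Start with one optimization here (say, constant folding via
        # a PAST tree walk); returning the tree unchanged is a valid
        # no-op first step.
        .return (past)
    .end

Keeping each optimization a self-contained tree walk like this is also
what makes task #2 fall out naturally: people can add passes in
parallel without stepping on each other.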


-'f



