Buffered output to a file is painfully slow

Andrew Whitworth wknight8111 at gmail.com
Mon Feb 9 19:21:52 UTC 2009


On Mon, Feb 9, 2009 at 1:11 PM, chromatic <chromatic at wgz.org> wrote:
> I wanted to measure the overhead of the calls to the FileHandle PMC's puts()
> method, so I changed the relevant ops (print_p_i and print_p_sc) to call
> Parrot_io_write_buffer() directly.  The benchmark runs over five times faster.

On a tangential matter, maybe we need to consider changing "puts" to
be a VTABLE instead of a METHOD? It's a common-enough operation, and
the immediate performance win would be large in these cases. Not only
can we use it for file handles and sockets, but we could add "line
buffered" puts to things like string arrays, and I'm sure HLLs are
going to be writing their own subclasses too. Alternatively, maybe we
could use the existing push_string VTABLE to serve the same purpose.
Of course, this does skip the root of the problem: METHODs are
inherently slow, and simply avoiding them (or converting them into an
ever-growing list of VTABLEs) is not a solution. Which brings me
to....

> The culprit, as usual, is that converting between C calling conventions and
> Parrot's calling conventions is slow.  The unmodified benchmark generates
> 3,049,533 new PMCs.  The modified benchmark generates 2,655 new PMCs.  For
> reference, "Hello, world!" in PIR generates 1,454 new PMCs.

I still can't really understand where all these PMCs are coming from.
It seems like an unbelievably huge number for such a simple benchmark.
I'm sure there are a few places where we could be doing in situ PMC
reuse instead of allocating them fresh and hoping the GC will deal
with the wreckage. There are a few places where we could be avoiding
creating PMCs entirely, maybe using some more primitive structures to
manage small amounts of data if needed. Some things that we currently
calculate eagerly can be deferred; for instance, type tuple PMCs need
not be created unless we are calling a multi.
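
To make the "only build it if we need it" idea concrete, here is a
minimal sketch in plain C. Every name in it (CallArgs, TypeTuple,
call_args_type_tuple, invoke) is made up for illustration and is not
Parrot's actual API; it only shows the shape of deferring an
allocation until a multi call asks for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_ARGS 4

/* Hypothetical stand-in for a type tuple PMC. */
typedef struct TypeTuple {
    int    types[MAX_ARGS];
    size_t count;
} TypeTuple;

/* Hypothetical call record; type_tuple stays NULL until requested. */
typedef struct CallArgs {
    int        arg_types[MAX_ARGS];
    size_t     argc;
    TypeTuple *type_tuple;
} CallArgs;

/* Counts how many tuples were really allocated. */
size_t tuples_allocated = 0;

void call_args_init(CallArgs *args, const int *types, size_t argc)
{
    size_t i;
    assert(args != NULL && argc <= MAX_ARGS);
    for (i = 0; i < argc; i++)
        args->arg_types[i] = types[i];
    args->argc       = argc;
    args->type_tuple = NULL;   /* nothing allocated up front */
}

/* Builds the tuple on first request and caches it afterwards. */
const TypeTuple *call_args_type_tuple(CallArgs *args)
{
    if (args->type_tuple == NULL) {
        TypeTuple *t = malloc(sizeof *t);
        size_t i;
        if (t == NULL)
            return NULL;
        for (i = 0; i < args->argc; i++)
            t->types[i] = args->arg_types[i];
        t->count = args->argc;
        args->type_tuple = t;
        tuples_allocated++;
    }
    return args->type_tuple;
}

void call_args_destroy(CallArgs *args)
{
    free(args->type_tuple);
    args->type_tuple = NULL;
}

/* Only the multi path touches the tuple; a plain call allocates
 * nothing. Returns the tuple size for a multi, 0 for a plain call,
 * and -1 if allocation failed. */
int invoke(CallArgs *args, int is_multi)
{
    if (is_multi) {
        const TypeTuple *t = call_args_type_tuple(args);
        return t ? (int)t->count : -1;
    }
    return 0;
}
```

With this shape, a loop of non-multi calls never allocates a tuple at
all, and a multi call pays for it once per call record.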

Looking at the generated C code for the puts method, I can immediately
see a few things that are suspect, which brings a lot of questions to
mind: Why are we generating a _params_sig PMC here when it seems to
serve no purpose whatsoever? Why are we creating a _returns_sig too?
Don't we have the signature for the method already generated as part
of Parrot_PCCINVOKE? It doesn't look like the _params_sig PMC is
actually being used anywhere either. Also, why do we create a
RetContinuation PMC here? What is its purpose? Parrot_PCCINVOKE and
its variants create
a new context for the method when it's called, although
NCI.pmc:invoke() doesn't. So in some cases we appear to be creating
two contexts for a method call instead of just one.

> (The problem isn't in the garbage collector, however.  The GC runs 454 times
> for the modified benchmark and 2793 times for the unmodified benchmark -- only
> 6.5 times more for the slow benchmark, despite there being 2500 times more
> garbage to collect.  This benchmark is actually really easy on the GC, because
> it's a tight loop where almost everything is garbage -- walking the anchored
> set is easy because it's *tiny*.)

It is refreshing to hear that the GC isn't the source of the problem,
although I'm sure there is room for improvement here too. Even though
it's got a small set of PMCs to mark, it still has a huge set of PMCs
to sweep. Each one needs to be touched and examined too. If most PMCs
are very short-term garbage, there is a case to be made to implement a
semi-space collector or an aggressive compacting collector instead of
an incremental mark-and-sweep like we've been looking at. Anything we
could do to kill large contiguous batches of dead PMCs in a single
operation would be better than sweeping through the arenas and
killing them off one-by-one.
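
As a rough illustration of the "kill a whole batch at once" idea,
here is a tiny bump-pointer region in C. It is not a semi-space
collector (nothing gets copied, and it assumes every object in the
region is dead at reset time), and Region, region_alloc and
region_reset are names I made up, not anything in Parrot. The point
is only that releasing everything is one assignment, however many
objects were allocated.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_BYTES 4096

/* Hypothetical fixed-size region; the union members only exist to
 * give the byte buffer a conservative alignment. */
typedef struct Region {
    union {
        long double   align_ld;
        void         *align_ptr;
        unsigned char bytes[REGION_BYTES];
    } storage;
    size_t used;   /* offset of the next free byte */
} Region;

void region_init(Region *r)
{
    assert(r != NULL);
    r->used = 0;
}

/* Allocation is a bounds check plus a pointer bump.
 * Returns NULL when the region is full. */
void *region_alloc(Region *r, size_t size)
{
    const size_t align = sizeof(long double);
    size_t rounded;
    void *p;

    assert(r != NULL);
    if (size > REGION_BYTES)
        return NULL;                    /* also avoids overflow below */
    rounded = (size + align - 1) / align * align;
    if (rounded > REGION_BYTES - r->used)
        return NULL;
    p = r->storage.bytes + r->used;
    r->used += rounded;
    return p;
}

/* Frees every object in the region in one step, with no per-object
 * sweep. Only safe if nothing in the region is still live. */
void region_reset(Region *r)
{
    assert(r != NULL);
    r->used = 0;
}
```

A real collector would still have to evacuate or otherwise account
for the few survivors before resetting, which is where the
semi-space copy would come in.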

There are lots of areas here for improvement; we just need to decide
how to fix the problems and what to tackle first.

--Andrew Whitworth

