jonathan at jnthn.net
Sat Sep 3 23:06:47 UTC 2011
Compiling the Rakudo setting creates a lot of objects (the parse tree,
the PAST tree, the model and so forth). Not only that, but since it's
building up a complete view of the thing it's compiling, those objects
tend to stay around for a long time. Innevitably, that makes for a
pretty large working set. Of course, there's lots of very short lived
things (CallContext, for example) that come and go.
From the profiles I've been looking at, in situations where we make
lots of objects but they are short lived, the GC seems to handle
reasonably well. In theory, since it's generational, it'd continue to do
so when we're doing things like compiling the setting, since the tree
nodes we're keeping around survive a collection and end up in an older
generation, and the regular collections just grab the short-lived
things. However, the actual profile results are not so rosy.
Tonight I dug the profile data I was seeing, and found that
gc_gms_is_pmc_ptr seems to be a hot path. A little analysis later, I
think I know what's going on. When we do a GC run, we have to find all
pointers to PMCs and STRINGs, and that includes scanning the stack.
Since we can't be precise there, we have to validate the things we've
seen really are string and PMC pointers. gc_gms_is_pmc_ptr is the
function that does that. However, we'd not expect this to be so costly -
the C stack isn't really going to be that deep. Yet it shows up high in
the profile. What's going on?
The PMC headers are allocated from a pool. A pool is a linked list of
arenas. And arena is a chunk of memory holding a bunch of PMC headers.
Thus if we have a PMC pointer, it must have an address at a valid offset
somewhere between the start address and the end address of one of the
arenas in the pool. The way that we determine this is "the obvious":
walk the linked list, and check if the pointer lies within the bounds of
the arena, and at a valid offset.
Unfortunatley, I suspect this is actually quite expensive. An arena is
4096 bytes. On a 32 bit machine, given that a PMC is 16 bytes and
there's an extra 4 bytes for a pointer used by the GC, we get about 200
PMCs in an arena. On a 64 bit machine I guess that drops off to about
100. So if we had, say, 100,000 PMCs around in memory, that's 500/1000
linked list follows to determine it's not a pointer, and average 250/500
if it is (maybe temporal locality biases the search one way or the
other). And we probably have many more PMCs than that. Also note that we
do this per "candidate" pointer that we see in the memory block.
That would be less than ideal, but there's another issue: we're likely
jumping all over the place in memory as we visit these arenas. We're
probably getting a ton of CPU cache misses as we do so, which is going
to make walking the linked list more expensive than it may otherwise be.
We go through this a bunch of times, and given the arenas probably line
up well with pages, it's possible that we hit unfortunate overlap issues
with the cache too (IIUC, set associative ones can do weird things in
One obvious fix would be to just keep an extent list: an array of
(start,end) pairs, hanging off the pool header, which we can very
cheaply scan through (since it's just a bunch of values laid out
contiguously in memory). That way, we don't go chasing all over RAM, and
clobber the cache (though since we're doing a GC mark we may well be
about to anyway...)
It'd be interesting to know if anybody else sees things like this when
profile Rakudo (nom branch) setting compilation, and of course a patch
would be interested to see too (and, in theory, shouldn't be too hard).
More information about the parrot-dev