fast MMD as Hamming distance over hypercube graph

Allison Randal allison at parrot.org
Sun Nov 29 14:35:23 UTC 2009


Andrew Whitworth wrote:
>>
>> The cost is in setting up the hypercube graph for the inheritance hierarchy
>> of a particular class. Need to dig into it some more to see how cheap that
>> could be. The graph could at least be cached as a set of integer "parent
>> ids" in the child class after it's constructed the first time.
> 
> This does sound like a pretty neat idea, but I share your worry that
> generating the hypercube is going to be expensive. Combine that with
> the fact that we would have to completely recompute the hypercube
> every time we modified the hierarchy at all, and we couldn't lazily
> reuse hypercubes computed for class B when we assign B as a parent of
> class A, etc. Of course, creating new classes and modifying existing
> hierarchies is likely an infrequent-enough operation for most cases
> that the expense will be amortized by fast subsequent lookups.

Actually, I'm not so much concerned that generating the hypercube will 
be expensive, more that it'll be a choice between "cheaper than what we 
have now", "cheap", and "can we push it hard to make it really cheap?". 
But, I haven't spent much time on it yet. A simple first take on how to 
do it would be:

- Set the child ID to 0 (i.e. 0000000000)

- Iterate over the immediate parents.
   - Set each parent ID to one bit different from child
     (i.e. 0000000001, 0000000010, 0000000100, 0000001000)
   - Recursively iterate over parents of parents
     - Set parent-of-parent ID to one bit different from parent
       (i.e. 0000001001, 0000001010, 0000001100)

Could be simpler if you don't bother with unique IDs and just mark the 
level (i.e. immediate parents are all 0000000001, next level is all 
0000000011, etc). There are some trade-offs there. Using an integer 
means we're limited to the size of the integer, 4,294,967,295 unique 
parents or 32 levels in a 32-bit integer.

We already invalidate some caches in the Class on modification, so that 
part isn't difficult. Some choices would need to be made on how to store 
the cache of IDs in the child Class for: the smallest possible storage, 
the quickest possible lookup, and easily invalidated (so, likely all in 
one place, perhaps a struct pointer in the PMC struct). And low-level 
types can't be modified anyway, so one hypercube graph for all of them 
could be generated at compile-time.

> How does Parrot currently calculate the manhattan distance between a
> child and parent type?

Slowly. The horror is mmd_distance in src/multidispatch.c (called by 
Parrot_mmd_sort_candidates). That function and many of the functions it 
calls are remnants from the old MMD system, left in place for backward 
compatibility.

It iterates through integer type arrays of both the call and the 
multisub (first creating the arrays if they don't exist), performs a 
series of manual checks on the types (which it repeats every time a 
dispatch is made on a particular type), and iterates through all the 
parents of each call argument (which it repeats every time a dispatch is 
made).

So, the step of "iterating over the parents" has to be done anyway. The 
hypercube graph ads some bit math, plus cache storage and retrieval.

Bonus points if we can eliminate most of the manual checks (building 
core types into the hypergraph) and eliminate creating fixed integer 
arrays for the sub and the call.

>> Some relevant work, there's likely more to be found:
>> http://portal.acm.org/citation.cfm?id=301378
> 
> /me really needs to get a proper ACM subscription eventually to read
> all the cool papers there.
> 
> Any chance I could get a copy of this one?

Yes, ACM has a sane policy about redistribution of papers. I'll email 
you a copy. (Also to anyone else who wants it.)

Allison


More information about the parrot-dev mailing list