fast MMD as Hamming distance over hypercube graph
allison at parrot.org
Sun Nov 29 14:35:23 UTC 2009
Andrew Whitworth wrote:
>> The cost is in setting up the hypercube graph for the inheritance hierarchy
>> of a particular class. Need to dig into it some more to see how cheap that
>> could be. The graph could at least be cached as a set of integer "parent
>> ids" in the child class after it's constructed the first time.
> This does sound like a pretty neat idea, but I share your worry that
> generating the hypercube is going to be expensive. Combine that with
> the fact that we would have to completely recompute the hypercube
> every time we modified the hierarchy at all, and we couldn't lazily
> reuse hypercubes computed for class B when we assign B as a parent of
> class A, etc. Of course, creating new classes and modifying existing
> hierarchies is likely an infrequent-enough operation for most cases
> that the expense will be amortized by fast subsequent lookups.
Actually, I'm not so much concerned that generating the hypercube will
be expensive, more that it'll be a choice between "cheaper than what we
have now", "cheap", and "can we push it hard to make it really cheap?".
But, I haven't spent much time on it yet. A simple first take on how to
do it would be:
- Set the child ID to 0 (i.e. 0000000000)
- Iterate over the immediate parents.
- Set each parent ID to one bit different from child
(i.e. 0000000001, 0000000010, 0000000100, 0000001000)
- Recursively iterate over parents of parents
- Set parent-of-parent ID to one bit different from parent
(i.e. 0000001001, 0000001010, 0000001100)
Could be simpler if you don't bother with unique IDs and just mark the
level (i.e. immediate parents are all 0000000001, next level is all
0000000011, etc). There are some trade-offs there. Using an integer
means we're limited to the size of the integer, 4,294,967,295 unique
parents or 32 levels in a 32-bit integer.
We already invalidate some caches in the Class on modification, so that
part isn't difficult. Some choices would need to be made on how to store
the cache of IDs in the child Class for: the smallest possible storage,
the quickest possible lookup, and easily invalidated (so, likely all in
one place, perhaps a struct pointer in the PMC struct). And low-level
types can't be modified anyway, so one hypercube graph for all of them
could be generated at compile-time.
> How does Parrot currently calculate the manhattan distance between a
> child and parent type?
Slowly. The horror is mmd_distance in src/multidispatch.c (called by
Parrot_mmd_sort_candidates). That function and many of the functions it
calls are remnants from the old MMD system, left in place for backward
It iterates through integer type arrays of both the call and the
multisub (first creating the arrays if they don't exist), performs a
series of manual checks on the types (which it repeats every time a
dispatch is made on a particular type), and iterates through all the
parents of each call argument (which it repeats every time a dispatch is
So, the step of "iterating over the parents" has to be done anyway. The
hypercube graph ads some bit math, plus cache storage and retrieval.
Bonus points if we can eliminate most of the manual checks (building
core types into the hypergraph) and eliminate creating fixed integer
arrays for the sub and the call.
>> Some relevant work, there's likely more to be found:
> /me really needs to get a proper ACM subscription eventually to read
> all the cool papers there.
> Any chance I could get a copy of this one?
Yes, ACM has a sane policy about redistribution of papers. I'll email
you a copy. (Also to anyone else who wants it.)
More information about the parrot-dev