[RFC] clone semantics

Thu Mar 11 19:11:42 UTC 2010

On Thu, Mar 11, 2010 at 12:37 PM, Andrew Whitworth
<wknight8111 at gmail.com> wrote:
> On Thu, Mar 11, 2010 at 11:28 AM, Allison Randal <allison at parrot.org> wrote:
>> One important thing to keep in mind is that there is no global uniform
>> implementation of 'clone': the clone op just calls the clone vtable
>> function, and each PMC is free to implement it however it chooses. (This is
>> a good thing, different languages have different semantics for cloning.)

I'm a little wary of letting types define their own policy for
cloning. Cloning policy is just as important (if not more so) to
calling code, which must be crafted differently for different
semantics. If I expect a deep clone but get a shallow clone, that could
give me hard-to-understand bugs. If we let calling code decide the
cloning policy, there is less chance of this type of misunderstanding.
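
To make the shallow/deep mismatch concrete, here is a small illustration
in Python rather than PIR, using Python's standard copy module. It is
only an analogy for the kind of surprise I mean, not a statement about
how any Parrot PMC behaves; the variable names are made up.

```python
import copy

original = {"name": "config", "paths": ["lib", "src"]}

shallow = copy.copy(original)    # new dict, but "paths" is the same list object
deep = copy.deepcopy(original)   # new dict and an independent "paths" list

# The caller now mutates the nested list through the original.
original["paths"].append("t")

# A caller who expected a deep clone but got the shallow one sees the change.
shallow_sees_change = "t" in shallow["paths"]
deep_sees_change = "t" in deep["paths"]
```

The two copies look identical until something nested is mutated, which
is why the bug tends to show up far from the clone call.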

> This almost makes the case that we should provide both deep_clone and
> shallow_clone vtables, since the two operations really are wildly
> different and different people need to make use of them differently.
> I'm not necessarily suggesting this, but they really are two different
> operations and the one name for them is quite confusing in this
> regard.

While I want the ability to do both deep and shallow clones, I think
the responsibility for deep cloning is not something we want to push
into PMCs if we don't have to. TT #1015 is caused by Hash PMC not
being able to handle that responsibility.

I also cannot stress enough that we already have a generic "recurse
over yourself" vtable entry: "visit". Having more than one "recurse
over yourself" vtable simply allows PMCs to get some or all of them
wrong in probably subtle ways. Also, it is possible to concoct a deep
version of an arbitrary operation by combining it with the use of the
visit vtable.
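
As a rough sketch of what I mean by combining an operation with visit
(in Python, not Parrot C, and with hypothetical names such as Node and
deep_apply that do not exist in Parrot): each type implements only one
"recurse over yourself" hook, and a generic driver turns any shallow
operation into a deep one.

```python
class Node:
    """Hypothetical container type; stands in for an aggregate PMC."""

    def __init__(self, value):
        self.value = value
        self.children = []
        self.read_only = False

    def visit(self, callback):
        # The single "recurse over yourself" hook: hand each child to the caller.
        for child in self.children:
            callback(child)


def deep_apply(root, operation):
    """Apply a shallow operation to every node reachable from root, once each."""
    seen = set()          # ids of nodes already handled in this traversal
    pending = [root]
    count = 0
    while pending:
        node = pending.pop()
        if id(node) in seen:
            continue      # already handled; this also breaks cycles
        seen.add(id(node))
        operation(node)
        count += 1
        node.visit(pending.append)
    return count


def mark_read_only(node):
    # A shallow operation: it touches only the node it is given.
    node.read_only = True


# A small cyclic structure: a -> b -> a, and a -> c.
a, b, c = Node("a"), Node("b"), Node("c")
a.children = [b, c]
b.children = [a]

visited = deep_apply(a, mark_read_only)
```

The point is that Node only has to get visit right; the deep behaviour
lives in one place outside the type.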

>> As chromatic pointed out in the ticket, there's a substantial advantage to
>> (eventually) solving the problem of unique traversals once for all potential
>> uses, including deep cloning, freezing, marking RO or shared variables, some
>> kinds of iterators, and potentially GC marking.
>>
>> A simple registry is a good idea, but it should be more general than a
>> "clone registry". A better approach would be a "seen registry" marking nodes
>> that have been visited in a particular traversal (whatever is being done by
>> that traversal). Storing the registry in interp is unsafe, because you may
>> have multiple traversals happening simultaneously. (I'm not even talking
>> about concurrency, where global state is the plague. Simple sequential code
>> may traverse part-way through a data structure, then call some other code
>> that traverses through some other data structure.) Storing it in the PMC struct
>> of the item being traversed is better, but still unsafe because you may have
>> multiple processes iterating over the same PMC at the same time. What you
>> really want is data storage for the registry that's unique to each
>> traversal.
>
> Peter was using a new registry type in his fixes for freeze/thaw
> recursive traversals. He is trying to use that same type to fix the
> recursive cloning issue. So this requirement is already being
> accounted for.
>
> A good registry PMC (and I am not sure about the specifics of the one
> Peter is using) would be able to return a boolean "Have I seen this
> one already?", and also be able to return some sort of tag "I have
> seen this one, and here is the associated data". In a freeze you
> should be able to get access to the already-frozen STRING buffer, in
> thaw you want to get access to the already-thawed PMC, and in clone
> you want to get a pointer to the cloned PMC/STRING to faithfully
> recreate cyclic structures. Assuming his PMC type does this, I think
> we have our general solution.

The three visitor API PMCs that have already been created (ImageIO and
ImageIOSize from trunk, VisitClone from the tt1015 branch) use a hash
(called 'seen') to keep track of this for PMCs, so they can store
arbitrary information about already visited elements.
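
For illustration only, this is roughly how a 'seen' hash that maps each
original to its clone lets a deep clone reproduce cyclic structures. It
is a Python sketch with invented names (Box, clone_graph), not the code
in VisitClone.

```python
class Box:
    """Hypothetical aggregate; stands in for a PMC holding references."""

    def __init__(self, label):
        self.label = label
        self.refs = []


def clone_graph(root):
    """Deep-clone everything reachable from root, preserving sharing and cycles."""
    seen = {}  # id(original) -> clone; lives only for this one traversal

    def clone(node):
        key = id(node)
        if key in seen:
            # Already cloned: hand back the existing clone instead of recursing.
            return seen[key]
        copy_ = Box(node.label)
        seen[key] = copy_  # register before recursing so cycles resolve
        copy_.refs = [clone(ref) for ref in node.refs]
        return copy_

    return clone(root)


# a and b refer to each other, forming a cycle.
a, b = Box("a"), Box("b")
a.refs = [b]
b.refs = [a]

a2 = clone_graph(a)
b2 = a2.refs[0]
```

Because the registry is created inside clone_graph, two traversals that
happen to overlap in time each get their own 'seen' state.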

Eliminating duplicate STRING buffers is not something that is being
done ATM, but would probably be fairly easy to add. AFAIK, you cannot
communicate between components at a PIR/HLL level by modifying a
shared string buffer because these are COW / immutable. So this would
only be an optimization, not a functionality change.

> Where we store the traversal, and how we keep track of it is a
> different issue. We could keep track of it by passing references, but
> that would require signature changes to vtables that used it (clone,
> freeze, thaw, maybe mark, etc.). I'm in favor of this, but it requires a
> deprecation cycle.

freeze/thaw/visit already take a parameter for the registry: PMC *info
(also known as visit_info, visit).

These PMCs are expected to provide a certain vtable-based API (I
should document that somewhere), but are free to provide that as they
see fit. For example, freezing and cloning traversals can keep their
state in a "seen" hash; a marking traversal (which probably shouldn't
allocate while traversing, but is hopefully unique per interpreter at
a given time) might keep its state in PObj flags.
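
A minimal sketch of that idea, again in Python with hypothetical names
(Item, SeenHashVisitor, FlagVisitor, first_visit), and not the actual
vtable API these PMCs provide: the traversal code only asks the visitor
"is this the first visit?", and each visitor decides where that state
lives.

```python
class Item:
    """Hypothetical traversable object; 'marked' stands in for a flag bit."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.marked = False


class SeenHashVisitor:
    """Keeps traversal state in its own hash, so each traversal is independent."""

    def __init__(self):
        self.seen = {}

    def first_visit(self, item):
        if id(item) in self.seen:
            return False
        self.seen[id(item)] = item
        return True


class FlagVisitor:
    """Keeps traversal state on the objects themselves; no allocation per node."""

    def first_visit(self, item):
        if item.marked:
            return False
        item.marked = True
        return True


def traverse(root, visitor):
    """Depth-first walk that relies only on the visitor's first_visit answer."""
    order = []
    stack = [root]
    while stack:
        item = stack.pop()
        if visitor.first_visit(item):
            order.append(item.name)
            stack.extend(reversed(item.children))
    return order


a, b, c = Item("a"), Item("b"), Item("c")
a.children = [b, c]
b.children = [c, a]  # shared child and a cycle back to a

hash_first = traverse(a, SeenHashVisitor())
hash_second = traverse(a, SeenHashVisitor())  # fresh state, same result
flag_first = traverse(a, FlagVisitor())
flag_second = traverse(a, FlagVisitor())      # flags still set from the last run
```

The second flag-based traversal visits nothing because the flags are
shared state on the objects, which is why that style only works when
there is a single such traversal at a time (and the flags get reset).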