[DRAFT] [GSoC] NFG and single-representation strings for the Parrot Virtual Machine

Daniel Arbelo arbelo at gmail.com
Sat Apr 3 22:34:52 UTC 2010

Title: NFG and single-representation strings for the Parrot Virtual Machine.

Student: Daniel Arbelo Arrocha

This proposal aims to implement NFG for Parrot strings, as specified
on PDD 28 "Strings". This was explicitly designed as a way to get
strings that can represent the full Unicode character repertoire
without resorting to variable-byte encodings and prevent expensive
look ahead en every operation.


Name: Daniel Arbelo Arrocha <arbelo at gmail.com>

Benefits to the Perl/Open Source Community
   The beneficiary of this work is Parrot itself, grapheme normalized
strings are a feature that has been specified for a long time but
never implemented, even though it is necessary for Rakudo and is
likely to benefit other Unicode-aware HLLs. Also the nature of NFG
makes it possible for Unicode unaware HLLs to handle it transparently,
with zero (HLL-side) coding overhead.

   This proposal seeks to deliver a usable implementation of NFG
normalization for parrot strings and evaluate it's suitability as the
one internal parrot string representation.

Project Details
   The Strings PDD already explains NFG and it's trade-offs in a
better way than I ever could, and provides a convincing rationale for
it. There is however a point in which this proposal deviates from it.
To quote the relevant section: "Parrot was designed from the outset to
support multiple string formats: multiple character sets and multiple
encodings. We don't standardize on Unicode internally, converting all
strings to Unicode strings, because for the majority of use cases it's
still far more efficient to deal with whatever input data the user
sends us. "
   Since this writing, experience has shown that it might be a net win
to have one unified string representation. Benchmarks have shown that
the cost of operating on variable-width encodings is superior to the
cost of one-time conversion. Also the problem of determining string
equality (For instance in hash keys) over multiple character encodings
has caused trouble in the past. The drive now is for unification on
the internal representation and fixed-width character encodings. I
believe that NFG not only is compatible with this revised strategy,
but provides a clear way forward that can carry us through this shift
in focus with minimal pain. And, in the event that an unified internal
representation turns out to be the wrong thing, NFG still contains
value as a feature regardless of the path taken in the future.
   One detail left out of the PDD is the interface or scope of the
grapheme table. The function of the table argues for global, or at
least interpreter wide scope, and though it referred to as a table the
functionality argues for an array (direct look-up) paired with a hash
(reverse look-up). The simplest (but rather inefficient)
implementation, which leverages the existing parrot toolset, is to
write a Singleton PMC and make all table access go through VTABLEs.
Evaluating the suitability of this approach over a dedicated 'c' data
structure is a task that will need to be performed in the community
bonding period.

Project Schedule
   The schedule takes a two stage approach, first concentrating on the
NFG implementation and standardizing on them internally only after NFG
is working and proven viable. A tentative schedule of the best-case,
all-according-to-plan delivery dates, including a 'buffer week' right
after the midterms. Past GSoC experience shows that such a good
progression is unlikely, but I find it better to plan for the best and
deal with setbacks as they happen, rather than guess at how could it
possibly go wrong at every step of the way. With that in mind, should
the schedule slip in more than two full weeks the final "stretch" goal
of moving parrot internals to all-NFG can be cut out of the project
without putting the rest of the deliverables in jeopardy. Also, given
parrot's TDD philosophy, all feature milestones imply tests for that
feature, even if they're not explicitly listed.

Community bonding period: String interfaces cleanup, remove remaining
encapsulation violations and review strings usage on OS-level
interfaces. Get started on the Grapheme Table. Determine the
suitability of a Singleton PMC.
Week 1: More grapheme table work.
Week 2: Normalization algorithm and conversion functions via Unicode
fully-decomposed form.
Week 3: Add NFG character set related functions.
Week 4: Basic encoding-related functions, *_codepoint() and *_byte().
Week 5: Iterator support functions.
Week 6: Make mixed-encoding operations auto-promote to NFG.
Week 7: Benchmarks of 'auto-promotion' for strings in variable-byte
encoding. Midterms deadline.
Week 8: "Buffer" week to deal with unexpected issues that might cause
schedule slippage.
Week 9: Proper NFG conversion/normalization, without using Unicode as
a half-way format.
Week 10: Stretch goal: Move parrot internals to all-NFG. Convert
to/from NFG on all input/output paths.
Week 11: Remove support for other encodings from strings, except where
necessary for in/out conversion.
Week 12: Last pass of code cleanup and refactoring after the encoding
removals. Final evaluation deadline.

References and Likely Mentors
   So far I have discussed the basics of proposal with chromatic and
others on IRC. The overall response to the idea was good, but no
mentors have stepped forward yet.

   All code will be under the artistic license 2.0 (the same terms as
as Parrot itself).

   I have been a committer for almost a year by now (I got my commit
bit shortly after the end of lat year's GSoC), and have worked on
several subsystems including gc, freeze/thaw, frame-builder,
Configure, and others, I am de facto platform champion for OpenBSD and
have done some work on the Intel C++ and Sun cc builds. I have worked
on parrot's strings before, disentangling them from the
freezing/thawing logic and adding encapsulation in several places. I
also worked to remove the much-abused 'strstart' struct member before
we discovered it was necessary to maintain performance in sub-string
   I participated in the 2009 Summer of Code, under parrot. I
successfully completed the work on my proposal (Decimal arithmetic
PMCs), but had to do it outside of parrot's repo due to a licensing
issue with the decNumber library.

   Since my academic standing hasn't realy changed from last year, I
am stll an eligible student meeting Google's legal requirements.
Currently, the relevant paperwork is being emitted by the appropriate
parties and will be available well before the relevant deadlines, same
as last year.

[0] http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html
- Parrot Design Document for the string subsystem and the main NFG
[1] http://plan9.bell-labs.com/sources/plan9/sys/doc/utf.ps - An
explanation of Plan 9's Runes, a similar in intent to NFG strings.
[2] http://www.unicode.org/reports/tr15/ - The Unicode Consortium's
explanation of different normalization forms.
[3] http://unicode.org/reports/tr29/ - "grapheme clusters" in the
Unicode Standard Annex
[4] http://site.icu-project.org/ - International Components for
Unicode, the library behind parrot's Unicode support.

More information about the parrot-dev mailing list