[svn:parrot] r47010 - trunk/docs/pdds

mikehh at svn.parrot.org
Wed May 26 03:44:33 UTC 2010


Author: mikehh
Date: Wed May 26 03:44:33 2010
New Revision: 47010
URL: https://trac.parrot.org/parrot/changeset/47010

Log:
remove lines erroneously added in merge

Modified:
   trunk/docs/pdds/pdd28_strings.pod

Modified: trunk/docs/pdds/pdd28_strings.pod
==============================================================================
--- trunk/docs/pdds/pdd28_strings.pod	Wed May 26 02:47:08 2010	(r47009)
+++ trunk/docs/pdds/pdd28_strings.pod	Wed May 26 03:44:33 2010	(r47010)
@@ -1,28 +1,3 @@
-NFG has been specified as a feature parrot wants for a long time, it's been in the Parrot Design Document for strings since before I had a commit bit, or any involvement in the project come to that.  Something that has gone that long unimplemented can't be <em>that</em> important, right? I mean, we clearly have survived without it.  Turns out it is important, but it takes some background to realize why.
-
-<!--break-->
-
-If you look at the parrot_string_t struct in the parrot source there is something that might jump out at you.  There's two pointers there, one for the string's encoding and the other for the charset.  It's a bit more heavyweight than you might initially expect, and it's been suggested in the past that one pointer would do just fine, but the charset/encoding separation provides for a rather neat separation of concerns and some code reuse.
-
-The charsets handle all of the <em>character</em> level information: Case, composition, character classes and comparisons.  The way this characters are stored is a matter for the encoding to handle. The size of characters, iterators, and all "bytes vs chars" issues are for the encoding to handle.  Of course, some encodings are more versatile than others.  The UTF-8 encoding is only used by the UTF-8 charset, true, but the Fixed_8 encoding is shared by ASCII, ISO 8859-1, and the 'binary' not-encoding.
-
-Now, let's say you have a long-ish string that, through no fault of your own, came in some form of Unicode.  Okay, that's no problem, parrot can handle that.  Except, of course, that depending on what the encoding is there won't be any O(1) random access for you.  That's simply the nature of things when you have a consistent size for your characters.
-
-This is hardly news, it first came up with UTF-8 right when it was invented.  The solution adopted by the people who invented UTF-8, on the operating system they invented it for, was to avoid the thing as much as possible.  Any program that was expected to run faster than molasses was adapted to use 'Runes', trading memory inefficiency for O(1) random access.  A fine compromise if you ask me.
-
-Of course, once you've solved that, you have to face the next problem.  Even once you bite the bullet and allow
-
-
-
-
-
-
-
-
-
-
-
-
 # Copyright (C) 2008-2010, Parrot Foundation.
 # $Id$
 
@@ -93,58 +68,6 @@
 characters.  Because graphemes are the highest-level abstract idea of a
 "character", they're useful for converting between character sets.
 
-
-
-
-
-
-
-
-
-NFG has been specified as a feature parrot wants for a long time, it's been in the Parrot Design Document for strings since before I had a commit bit, or any involvement in the project come to that.  Something that has gone that long unimplemented can't be <em>that</em> important, right? I mean, we clearly have survived without it.  Turns out it is important, but it takes some background to realize why.
-
-<!--break-->
-
-If you look at the parrot_string_t struct in the parrot source there is something that might jump out at you.  There's two pointers there, one for the string's encoding and the other for the charset.  It's a bit more heavyweight than you might initially expect, and it's been suggested in the past that one pointer would do just fine, but the charset/encoding separation provides for a rather neat separation of concerns and some code reuse.
-
-The charsets handle all of the <em>character</em> level information: Case, composition, character classes and comparisons.  The way this characters are stored is a matter for the encoding to handle. The size of characters, iterators, and all "bytes vs chars" issues are for the encoding to handle.  Of course, some encodings are more versatile than others.  The UTF-8 encoding is only used by the UTF-8 charset, true, but the Fixed_8 encoding is shared by ASCII, ISO 8859-1, and the 'binary' not-encoding.
-
-Now, let's say you have a long-ish string that, through no fault of your own, came in some form of Unicode.  Okay, that's no problem, parrot can handle that.  Except, of course, that depending on what the encoding is there won't be any O(1) random access for you.  That's simply the nature of things when you have a consistent size for your characters.
-
-This is hardly news, it first came up with UTF-8 right when it was invented.  The solution adopted by the people who invented UTF-8, on the operating system they invented it for, was to avoid the thing as much as possible.  Any program that was expected to run faster than molasses was adapted to use 'Runes', trading memory inefficiency for O(1) random access.  A fine compromise if you ask me.
-
-Of course, once you've solved that, you have to face the next problem.  Even if you bite the bullet and pay the storage space for O(1) access you still get bitten by the fact that encoding "all the scripts in the word" is an ugly business with more edge cases than you thought were even possible.
-
-Your next problem are 'composing characters', and boy is it <em>fun<em>.  Let's say you want to do something really easy, something that could not possibly go wrong. Like compare two strings for equality. Yes, that's easy, and efficient, so long as your system's memcmp() knows that the sequence "{A, tilde, acute, dot_below}" mus compare equal to "{A, tilde, dot_below, acute}"
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
 =head3 Normalization Form
 
 A normalization form standardizes the representation of a string by
@@ -238,7 +161,7 @@
 code must always assume a variable-byte encoding, and use expensive
 lookaheads. The cost is incurred on every operation, though the particular
 string operated on might not contain combining characters. It's particularly
-noticeable in parsing and regular expression matches, where backtracking
+noticeable in parsing and regular expres699sion matches, where backtracking
 operations may re-traverse the characters of a simple string hundreds of
 times.
 
@@ -373,7 +296,7 @@
 may be variably sized.}}
 
 =item hashval
-
+699
 A cache of the hash value of the string, for rapid lookups when the string is
 used as a hash key.
 

