[svn:parrot] r49100 - in branches/gc_massacre/docs: . book/pir pdds
bacek at svn.parrot.org
Fri Sep 17 23:45:55 UTC 2010
Author: bacek
Date: Fri Sep 17 23:45:55 2010
New Revision: 49100
URL: https://trac.parrot.org/parrot/changeset/49100
Log:
Update string API documentation
Modified:
branches/gc_massacre/docs/book/pir/ch04_variables.pod
branches/gc_massacre/docs/embed.pod
branches/gc_massacre/docs/pdds/pdd19_pir.pod
branches/gc_massacre/docs/pdds/pdd23_exceptions.pod
branches/gc_massacre/docs/pdds/pdd28_strings.pod
Modified: branches/gc_massacre/docs/book/pir/ch04_variables.pod
==============================================================================
--- branches/gc_massacre/docs/book/pir/ch04_variables.pod Fri Sep 17 23:45:32 2010 (r49099)
+++ branches/gc_massacre/docs/book/pir/ch04_variables.pod Fri Sep 17 23:45:55 2010 (r49100)
@@ -966,17 +966,17 @@
ways to represent various charsets in memory and on disk.
Every string in Parrot has an associated encoding and character set. The default
-charset is 8-bit ASCII, which is almost universally supported. Double-quoted
-string constants can have an optional prefix specifying the string's encoding
-and charset.N<As you might suspect, single-quoted strings do not support this.>
+format is 8-bit ASCII, which is almost universally supported. Double-quoted
+string constants can have an optional prefix specifying the string's
+format.N<As you might suspect, single-quoted strings do not support this.>
Parrot tracks information about encoding and charset internally, and
automatically converts strings when necessary to preserve these
-characteristics. Strings constants may have prefixes of the form C<encoding:charset:>.
+characteristics. Strings constants may have prefixes of the form C<format:>.
=begin PIR_FRAGMENT
- $S0 = utf8:unicode:"Hello UTF-8 Unicode World!"
- $S1 = utf16:unicode:"Hello UTF-16 Unicode World!"
+ $S0 = utf8:"Hello UTF-8 Unicode World!"
+ $S1 = utf16:"Hello UTF-16 Unicode World!"
$S2 = ascii:"This is 8-bit ASCII"
$S3 = binary:"This is raw, unformatted binary data"
@@ -987,11 +987,10 @@
X<UCS-2 encoding>
X<UTF-8 encoding>
X<UTF-16 encoding>
-Parrot supports the character sets C<ascii>, C<binary>, C<iso-8859-1>
-(Latin 1), and C<unicode> and the encodings C<fixed_8>, C<ucs2>,
-C<utf8>, and C<utf16>.
+Parrot supports the formats C<ascii>, C<binary>, C<iso-8859-1>
+(Latin 1), C<utf8>, C<utf16>, C<ucs2>, and C<ucs4>.
-The C<binary> charset treats the string as a buffer of raw unformatted
+The C<binary> format treats the string as a buffer of raw unformatted
binary data. It isn't really a string per se, because binary data
contains no readable characters. This exists to support libraries which
manipulate binary data that doesn't easily fit into any other primitive
Modified: branches/gc_massacre/docs/embed.pod
==============================================================================
--- branches/gc_massacre/docs/embed.pod Fri Sep 17 23:45:32 2010 (r49099)
+++ branches/gc_massacre/docs/embed.pod Fri Sep 17 23:45:55 2010 (r49100)
@@ -559,18 +559,6 @@
=item C<Parrot_char_digit_value>
-=item C<Parrot_charset_c_name>
-
-=item C<Parrot_charset_name>
-
-=item C<Parrot_charset_number>
-
-=item C<Parrot_charset_number_of_str>
-
-=item C<Parrot_charsets_encodings_deinit>
-
-=item C<Parrot_charsets_encodings_init>
-
=item C<Parrot_clear_debug>
=item C<Parrot_clear_flag>
@@ -643,8 +631,6 @@
=item C<Parrot_cx_send_message>
-=item C<Parrot_default_charset>
-
=item C<Parrot_default_encoding>
=item C<Parrot_del_timer_event>
@@ -691,14 +677,8 @@
=item C<Parrot_ex_throw_from_op_args>
-=item C<Parrot_find_charset>
-
-=item C<Parrot_find_charset_converter>
-
=item C<Parrot_find_encoding>
-=item C<Parrot_find_encoding_converter>
-
=item C<Parrot_ns_find_current_namespace_global>
=item C<Parrot_find_global_k>
@@ -737,8 +717,6 @@
=item C<Parrot_gc_mark_PObj_alive>
-=item C<Parrot_get_charset>
-
=item C<Parrot_get_ctx_HLL_namespace>
=item C<Parrot_get_ctx_HLL_type>
@@ -917,8 +895,6 @@
=item C<Parrot_load_bytecode>
-=item C<Parrot_load_charset>
-
=item C<Parrot_load_encoding>
=item C<Parrot_load_language>
@@ -931,8 +907,6 @@
=item C<Parrot_make_cb>
-=item C<Parrot_make_default_charset>
-
=item C<Parrot_make_default_encoding>
=item C<Parrot_ns_make_namespace_autobase>
@@ -955,8 +929,6 @@
=item C<Parrot_new_cb_event>
-=item C<Parrot_new_charset>
-
=item C<Parrot_new_encoding>
=item C<Parrot_new_string>
@@ -1445,10 +1417,6 @@
=item C<Parrot_regenerate_HLL_namespaces>
-=item C<Parrot_register_charset>
-
-=item C<Parrot_register_charset_converter>
-
=item C<Parrot_register_encoding>
=item C<Parrot_register_HLL>
@@ -1521,14 +1489,10 @@
=item C<Parrot_str_byte_length>
-=item C<Parrot_str_change_charset>
-
=item C<Parrot_str_change_encoding>
=item C<Parrot_str_chopn>
-=item C<Parrot_str_chopn_inplace>
-
=item C<Parrot_str_compare>
=item C<Parrot_str_compose>
@@ -1539,8 +1503,6 @@
=item C<Parrot_str_downcase>
-=item C<Parrot_str_downcase_inplace>
-
=item C<Parrot_str_equal>
=item C<Parrot_str_escape>
@@ -1579,8 +1541,6 @@
=item C<Parrot_str_new_constant>
-=item C<Parrot_str_new_COW>
-
=item C<Parrot_str_new_init>
=item C<Parrot_str_new_noinit>
@@ -1593,20 +1553,12 @@
=item C<Parrot_str_replace>
-=item C<Parrot_str_resize>
-
-=item C<Parrot_str_reuse_COW>
-
-=item C<Parrot_str_set>
-
=item C<Parrot_str_split>
=item C<Parrot_str_substr>
=item C<Parrot_str_titlecase>
-=item C<Parrot_str_titlecase_inplace>
-
=item C<Parrot_str_to_cstring>
=item C<Parrot_str_to_hashval>
@@ -1621,10 +1573,6 @@
=item C<Parrot_str_upcase>
-=item C<Parrot_str_upcase_inplace>
-
-=item C<Parrot_str_write_COW>
-
=item C<Parrot_sub_new_from_c_func>
=item C<Parrot_test_debug>
@@ -1665,20 +1613,14 @@
=item C<PObj_custom_mark_SET>
-=item C<string_capacity>
-
=item C<string_chr>
=item C<string_make>
-=item C<string_make_from_charset>
-
=item C<string_max_bytes>
=item C<string_ord>
-=item C<string_primary_encoding_for_representation>
-
=item C<string_rep_compatible>
=item C<string_to_cstring_nullable>
Modified: branches/gc_massacre/docs/pdds/pdd19_pir.pod
==============================================================================
--- branches/gc_massacre/docs/pdds/pdd19_pir.pod Fri Sep 17 23:45:32 2010 (r49099)
+++ branches/gc_massacre/docs/pdds/pdd19_pir.pod Fri Sep 17 23:45:55 2010 (r49100)
@@ -134,9 +134,9 @@
=item "double-quoted string constants"
Are delimited by double-quotes (C<">). A C<"> inside a string must be escaped
-by C<\>. The default encoding for a double-quoted string constant is 7-bit
+by C<\>. The default format for a double-quoted string constant is 7-bit
ASCII, other character sets and encodings must be marked explicitly using a
-charset or encoding flag.
+format flag.
=item <<"heredoc", <<'heredoc'
@@ -190,11 +190,18 @@
=end PIR_FRAGMENT_TODO
-=item charset:"string constant"
+=item format:"string constant"
-Like above with a character set attached to the string. Valid character
-sets are currently: C<ascii> (the default), C<binary>, C<unicode>
-(with UTF-8 as the default encoding), and C<iso-8859-1>.
+Like above with a format attached to the string. Valid formats are
+currently: C<ascii> (the default), C<binary>, C<iso-8859-1>, C<utf8>,
+C<utf16>, C<ucs2>, and C<ucs4>.
+
+The format is attached to the string constant, and
+adopted by any string container the constant is assigned to.
+
+The standard escape sequences are honored within strings with an
+alternate format, so you can include a particular Unicode character
+as either a literal sequence of bytes, or as an escape sequence.
=back
@@ -212,20 +219,6 @@
=over 4
-=item encoding:charset:"string constant"
-
-Like above with an extra encoding attached to the string. For example:
-
- set S0, utf8:unicode:"«"
-
-The encoding and charset are attached to the string constant, and
-adopted by any string container the constant is assigned to.
-
-The standard escape sequences are honored within strings with an
-alternate encoding, so in the example above, you can include a
-particular Unicode character as either a literal sequence of bytes, or
-as an escape sequence.
-
=item numeric constants
Both integers (C<42>) and numbers (C<3.14159>) may appear as constants.
Modified: branches/gc_massacre/docs/pdds/pdd23_exceptions.pod
==============================================================================
--- branches/gc_massacre/docs/pdds/pdd23_exceptions.pod Fri Sep 17 23:45:32 2010 (r49099)
+++ branches/gc_massacre/docs/pdds/pdd23_exceptions.pod Fri Sep 17 23:45:55 2010 (r49100)
@@ -310,16 +310,6 @@
argument or a string index that's outside the length of the string. Payload
is an array, first element being the string 'ord'.
-The C<find_charset> opcode throws C<exception;domain> if the charset name it's
-looking up doesn't exist. Payload is an array: [0] string 'find_charset', [1]
-charset name that was not found.
-
-The C<trans_charset> opcode throws C<exception;domain> on "information loss"
-(presumably, this means when one charset doesn't have a one-to-one
-correspondence in the other charset). Payload is an array: [0] string
-'trans_charset', [1] source charset name, [2] destination charset name, [3]
-untranslatable code point.
-
The C<find_encoding> opcode throws C<exception;domain> if the encoding name
it's looking up doesn't exist. Payload is an array: [0] string
'find_encoding', [1] encoding name that was not found.
Modified: branches/gc_massacre/docs/pdds/pdd28_strings.pod
==============================================================================
--- branches/gc_massacre/docs/pdds/pdd28_strings.pod Fri Sep 17 23:45:32 2010 (r49099)
+++ branches/gc_massacre/docs/pdds/pdd28_strings.pod Fri Sep 17 23:45:55 2010 (r49100)
@@ -266,7 +266,6 @@
UINTVAL strlen;
UINTVAL hashval;
const struct _encoding *encoding;
- const struct _charset *charset;
};
The fields are:
@@ -302,23 +301,14 @@
=item encoding
-How the data is encoded (e.g. fixed 8-bit characters, UTF-8, or UTF-32). Note
-that this specifies encoding only -- it's valid to encode EBCDIC characters
-with the UTF-8 algorithm. Silly, but valid.
+What sort of string data is in the buffer, for example ASCII, ISO-8859-1,
+UTF-8 or UTF-16.
The encoding structure specifies the encoding (by index number and by name,
for ease of lookup), the maximum number of bytes that a single character will
occupy in that encoding, as well as functions for manipulating strings with
that encoding.
-=item charset
-
-What sort of string data is in the buffer, for example ASCII, EBCDIC, or
-Unicode.
-
-The charset structure specifies the character set (by index number and by
-name) and provides functions for transcoding to and from that character set.
-
=back
{{DEPRECATION NOTE: the enum C<parrot_string_representation_t> will be removed
@@ -352,32 +342,9 @@
Parrot's external API will be renamed for the standard "Parrot_*" naming
conventions.
-=head4 Parrot_str_set (was string_set)
-
-Set one string to a copy of the value of another string.
-
-=head4 Parrot_str_new_COW (was Parrot_make_COW_reference)
-
-Create a new copy-on-write string. Creating a new string header, clone the
-struct members of the original string, and point to the same string buffer as
-the original string.
-
-=head4 Parrot_str_reuse_COW (was Parrot_reuse_COW_reference)
-
-Create a new copy-on-write string. Clone the struct members of the original
-string into a passed in string header, and point the reused string header to
-the same string buffer as the original string.
-
-=head4 Parrot_str_write_COW (was Parrot_unmake_COW)
-
-If the specified Parrot string is copy-on-write, copy the string's contents
-to a new string buffer and clear the copy-on-write flag.
-
=head4 Parrot_str_concat (was string_concat)
-Concatenate two strings. Takes three arguments: two strings, and one integer
-value of flags. If both string arguments are null, return a new string created
-according to the integer flags.
+Concatenate two strings. Takes two strings as arguments.
=head4 Parrot_str_new (was string_from_cstring)
@@ -397,11 +364,10 @@
Returns a new string of the requested encoding, character set, and
normalization form, initializing the string value to the value passed in. The
-five arguments are a C string (C<char *>), an integer length of the string
-argument in bytes, and struct pointers for encoding, character set, and
-normalization form structs. If the C string (C<char *>) value is not passed,
-returns an empty string. If the encoding, character set, or normalization form
-are passed as null values, default values are used.
+three arguments are a C string (C<char *>), an integer length of the string
+argument in bytes, and a struct pointer for the encoding struct. If the C
+string (C<char *>) value is not passed, returns an empty string. If the
+encoding is passed as null value, a default value is used.
{{ NOTE: the crippled version of this function, C<string_make>, used to accept
a string name for the character set. This behavior is no longer supported, but
@@ -414,13 +380,6 @@
*>) as an argument, the value of the constant string. The length of the C
string is calculated internally.
-=head4 Parrot_str_resize (was string_grow)
-
-Resize the string buffer of the given string adding the number of bytes passed
-in the integer argument. If the argument is negative, remove the given number
-of bytes. Throws an exception if shrinking the string buffer size will
-truncate the string (if C<strlen> will be longer than C<buflen>).
-
=head4 Parrot_str_length (was string_compute_strlen)
Returns the number of characters in the string. Combining characters are each
@@ -505,11 +464,6 @@
Chop the requested number of characters off the end of a string without
modifying the original string.
-=head4 Parrot_str_chopn_inplace (was string_chopn_inplace).
-
-Chop the requested number of characters off the end of a string, modifying the
-original string.
-
=head4 Parrot_str_grapheme_chopn
Chop the requested number of graphemes off the end of a string without
@@ -545,6 +499,10 @@
Compare two strings using NFG normalization, return 1 if they are equal, 0 if
they are not equal.
+=head4 Parrot_str_split
+
+Splits the string C<str> at the delimiter C<delim>.
+
=head3 Internal String Functions
The following functions are used internally and are not part of the public
@@ -560,6 +518,10 @@
Terminate and clean up Parrot's string subsystem, including string allocation
and garbage collection.
+=head3 Deprecated String Functions
+
+The following string functions are slated to be deprecated.
+
=head4 string_max_bytes
Calculate the number of bytes needed to hold a given number of characters in a
@@ -568,10 +530,6 @@
{{NOTE: pretty primitive and not very useful. May be deprecated.}}
-=head3 Deprecated String Functions
-
-The following string functions are slated to be deprecated.
-
=head4 string_primary_encoding_for_representation
Not useful, it only ever returned ASCII.
@@ -618,10 +576,6 @@
Unsafe, and behavior handled by Parrot_str_to_cstring.
-=head4 Parrot_str_split
-
-Splits the string C<str> at the delimiter C<delim>.
-
=head4 Parrot_str_free (was string_free)
Unsafe and unuseful, let the garbage collector take care.