Charset/encoding merge
Nick Wellnhofer
wellnhofer at aevum.de
Wed Sep 1 13:39:02 UTC 2010
I created a new branch called charset_massacre that contains my proposed
charset/encoding merge. All the string function pointers now live in a
single string vtable. The following convenience macros call the string
functions:
STRING_length
STRING_byte_length
STRING_max_bytes_per_codepoint
STRING_equal
STRING_compare
STRING_index
STRING_rindex
STRING_hash
STRING_validate
STRING_scan
STRING_ord
STRING_substr
STRING_is_cclass
STRING_find_cclass
STRING_find_not_cclass
STRING_get_grapemes // typo, will be fixed
STRING_compose
STRING_decompose
STRING_upcase
STRING_downcase
STRING_titlecase
STRING_upcase_first
STRING_downcase_first
STRING_titlecase_first
STRING_ITER_INIT
STRING_iter_get
STRING_iter_skip
STRING_iter_get_and_advance
STRING_iter_set_and_advance
STRING_iter_set_position
These macros replace the old CHARSET_* and ENCODING_* macros. I also
renamed some of the functions to match the corresponding Parrot opcodes.
My longer-term plan is to switch a lot of Parrot_str_* calls to those
macros. Another notable change to the string API is that the charset
argument has been removed from Parrot_str_new_init.
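For readers unfamiliar with the pattern, here is a minimal sketch of a
single string vtable with macro dispatch. It is not Parrot's actual code:
every name starting with demo_ or DEMO_ is hypothetical, the struct
layout is an assumption, and only three of the functions are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types. These are NOT Parrot's definitions. */
struct demo_string;

typedef struct demo_str_vtable {
    const char *name;
    size_t (*length)(const struct demo_string *s);      /* codepoints */
    size_t (*byte_length)(const struct demo_string *s); /* bytes */
    int    (*equal)(const struct demo_string *a,
                    const struct demo_string *b);
} demo_str_vtable;

typedef struct demo_string {
    const demo_str_vtable *encoding; /* one vtable instead of two */
    const char            *buf;
    size_t                 bufused;  /* number of bytes in buf */
} demo_string;

/* Convenience macros in the style of STRING_length and friends. */
#define DEMO_STRING_length(s)      ((s)->encoding->length(s))
#define DEMO_STRING_byte_length(s) ((s)->encoding->byte_length(s))
#define DEMO_STRING_equal(a, b)    ((a)->encoding->equal((a), (b)))

/* Fixed-width 8-bit encodings: one byte per codepoint. */
static size_t fixed8_length(const demo_string *s) { return s->bufused; }

static size_t any_byte_length(const demo_string *s) { return s->bufused; }

/* Simplification: compares raw bytes, so it is only meaningful when
 * both strings use the same encoding. */
static int bytes_equal(const demo_string *a, const demo_string *b)
{
    return a->bufused == b->bufused
        && memcmp(a->buf, b->buf, a->bufused) == 0;
}

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx).
 * Assumes the buffer holds valid UTF-8. */
static size_t utf8_length(const demo_string *s)
{
    size_t n = 0;
    for (size_t i = 0; i < s->bufused; i++)
        if (((unsigned char)s->buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}

static const demo_str_vtable demo_ascii =
    { "ascii", fixed8_length, any_byte_length, bytes_equal };
static const demo_str_vtable demo_utf8 =
    { "utf8", utf8_length, any_byte_length, bytes_equal };

/* Usage: callers go through the macros and never name a vtable slot. */
int demo_usage(void)
{
    demo_string a = { &demo_ascii, "abc", 3 };
    demo_string u = { &demo_utf8, "h\xC3\xA9", 3 }; /* 'h' + U+00E9 */

    assert(DEMO_STRING_length(&a) == 3);
    assert(DEMO_STRING_length(&u) == 2);
    assert(DEMO_STRING_byte_length(&u) == 3);
    assert(DEMO_STRING_equal(&a, &a));
    return 0;
}
```

The point of the sketch is that a caller only needs the string itself to
reach the right implementation, so there is no separate charset object to
pass around.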
The charset has also been removed from the packfile. I'm not sure what
this entails.
The API of the ByteBuffer PMC has changed a little. The get_string and
build_string methods no longer have a charset argument.
Another minor issue that affected some tests is that trans_charset to
"unicode" still works, but the resulting strings will have a charsetname
of "utf8".
I also removed the interactive charset and encoding configuration step.
Parrot doesn't work with only a subset of charsets or encodings. It
probably wouldn't even compile.
The following opcodes can be deprecated:
- charset
- charsetname
- find_charset
- trans_charset
Any code that uses these opcodes should replace them with the
corresponding encoding opcodes. The list of supported encodings is:
- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4
If code currently calls both trans_charset and trans_encoding, only the
trans_encoding call is needed.
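When porting code, it can help to check names against the list of
supported encodings above. The following is only a sketch: the function
and array names are hypothetical and not part of Parrot's API, and the
match is assumed to be exact and case-sensitive.

```c
#include <stddef.h>
#include <string.h>

/* The encoding names listed in this message. */
static const char *const demo_supported_encodings[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/* Returns 1 if name is one of the listed encodings, 0 otherwise.
 * Hypothetical helper: exact, case-sensitive comparison. */
int demo_is_supported_encoding(const char *name)
{
    size_t count = sizeof demo_supported_encodings
                 / sizeof demo_supported_encodings[0];

    for (size_t i = 0; i < count; i++)
        if (strcmp(name, demo_supported_encodings[i]) == 0)
            return 1;
    return 0;
}
```

Note that "unicode" is not in the list, so this check rejects it; code
that used it as a charset name would use "utf8" here instead.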
Especially if you are a language implementer, it would be nice if you
could test your implementation with the new branch.
Nick