Charset/encoding merge

Nick Wellnhofer wellnhofer at aevum.de
Wed Sep 1 13:39:02 UTC 2010


I created a new branch called charset_massacre that contains my proposed 
charset/encoding merge. All the string function pointers now live in a 
single string vtable. The following convenience macros call the string 
functions:

STRING_length
STRING_byte_length
STRING_max_bytes_per_codepoint

STRING_equal
STRING_compare
STRING_index
STRING_rindex
STRING_hash
STRING_validate

STRING_scan
STRING_ord
STRING_substr

STRING_is_cclass
STRING_find_cclass
STRING_find_not_cclass

STRING_get_grapemes // typo, will be fixed
STRING_compose
STRING_decompose

STRING_upcase
STRING_downcase
STRING_titlecase
STRING_upcase_first
STRING_downcase_first
STRING_titlecase_first

STRING_ITER_INIT
STRING_iter_get
STRING_iter_skip
STRING_iter_get_and_advance
STRING_iter_set_and_advance
STRING_iter_set_position

These macros replace the old CHARSET_* and ENCODING_* macros. I also 
renamed some of the functions to match the corresponding Parrot opcodes. 
My longer-term plan is to switch a lot of Parrot_str_* calls to these 
macros. Another notable change to the string API is that the charset 
argument has been removed from Parrot_str_new_init.
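
To illustrate the idea of a single string vtable with convenience macros, 
here is a minimal sketch in C. It is not the code from the branch: the 
demo_* types, the DEMO_STRING_* macros and the two example encodings are 
hypothetical names made up for this illustration, and the real Parrot 
STRING structure and vtable have many more members.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types; NOT Parrot's real STRING or vtable. */
struct demo_string;

/* One vtable per encoding holds all string function pointers. */
typedef struct demo_str_vtable {
    const char *name;
    size_t (*length)(const struct demo_string *s);  /* length in codepoints */
    int    (*equal)(const struct demo_string *a,
                    const struct demo_string *b);
} demo_str_vtable;

typedef struct demo_string {
    const demo_str_vtable *encoding;  /* single pointer, no separate charset */
    const char            *bytes;
    size_t                 bufused;   /* length in bytes */
} demo_string;

/* Convenience macros in the style of STRING_length and STRING_equal. */
#define DEMO_STRING_length(s)   ((s)->encoding->length(s))
#define DEMO_STRING_equal(a, b) ((a)->encoding->equal((a), (b)))

/* Fixed-width 8-bit encodings: one byte per codepoint. */
static size_t fixed8_length(const demo_string *s)
{
    return s->bufused;
}

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_length(const demo_string *s)
{
    size_t i, n = 0;
    for (i = 0; i < s->bufused; i++) {
        if (((unsigned char)s->bytes[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Byte-wise comparison; assumes both strings share the same encoding. */
static int bytes_equal(const demo_string *a, const demo_string *b)
{
    return a->bufused == b->bufused
        && memcmp(a->bytes, b->bytes, a->bufused) == 0;
}

static const demo_str_vtable demo_ascii = { "ascii", fixed8_length, bytes_equal };
static const demo_str_vtable demo_utf8  = { "utf8",  utf8_length,  bytes_equal };
```

A caller only ever writes DEMO_STRING_length(s); which function runs 
depends on the vtable the string points to, so no separate charset 
lookup is needed.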

The charset has also been removed from the packfile. I'm not sure what 
this entails.

The API of the ByteBuffer PMC has changed a little. The get_string and 
build_string methods no longer have a charset argument.

Another minor issue that affected some tests is that trans_charset to 
"unicode" still works, but the resulting strings will have a charsetname 
of "utf8".

I also removed the interactive charset and encoding configuration step. 
Parrot doesn't work with only a subset of charsets or encodings. It 
probably wouldn't even compile.

The following opcodes can be deprecated:

- charset
- charsetname
- find_charset
- trans_charset

Any code that uses these opcodes should replace them with the 
corresponding encoding opcodes. The list of supported encodings is:

- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4

Where code currently calls both trans_charset and trans_encoding, the 
trans_encoding call alone is sufficient.
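
As a rough picture of how a name lookup over this list could treat the 
old "unicode" charset name, here is a small sketch in C. The function 
demo_canonical_encoding and the table demo_encodings are hypothetical 
and are not part of Parrot; the only facts taken from the branch are the 
seven encoding names and that "unicode" ends up as "utf8".

```c
#include <stddef.h>
#include <string.h>

/* The encodings listed above. */
static const char *const demo_encodings[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/*
 * Hypothetical helper: map a charset or encoding name to the name of a
 * supported encoding. Returns NULL if the name is not recognized.
 */
static const char *demo_canonical_encoding(const char *name)
{
    size_t i;
    const size_t n = sizeof demo_encodings / sizeof demo_encodings[0];

    /* Legacy charset name: still accepted, but reported as utf8. */
    if (strcmp(name, "unicode") == 0)
        return "utf8";

    for (i = 0; i < n; i++) {
        if (strcmp(name, demo_encodings[i]) == 0)
            return demo_encodings[i];
    }
    return NULL;
}
```

With a mapping like this, code that still asks for "unicode" keeps 
working, while anything that inspects the resulting name sees "utf8".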

Especially if you are a language implementer, it would be nice if you 
could test your implementation with the new branch.

Nick

