[RFC] Merge charsets and encodings

Nick Wellnhofer wellnhofer at aevum.de
Wed Aug 25 12:44:03 UTC 2010


I think the separation of charsets and encodings in the string code 
doesn't make sense. The way I see it, the only charset that's used in 
Parrot is Unicode. ASCII and ISO-8859-1 are subsets of Unicode, so they 
could be treated as encodings, just like the UTF and UCS ones.

Currently, you have to use trans_charset (to_charset in C) to convert a 
string to ISO-8859-1, but trans_encoding (to_encoding) to convert to 
UTF-16. That looks arbitrary and confusing to me. The encoding:charset 
combinations right now are:

- fixed8:ascii
- fixed8:iso-8859-1
- fixed8:binary
- utf8:unicode
- utf16:unicode
- ucs2:unicode
- ucs4:unicode

My proposal is to merge all the charset and encoding functions into a 
single kind of string vtable, eliminating duplicates like hash and 
find_cclass. I would keep the name "encoding", so there would be seven 
encodings:

- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4

The fixed8 and unicode encodings would still share many of their 
functions but it would be much easier to add specialisations. The string 
code would be simplified and the charset pointer in the string header 
could be removed.
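
To make the idea a bit more concrete, here is a rough sketch of what a 
single merged table could look like. All names in it (str_vtable, 
validate, find_merged_encoding, ...) are made up for illustration and 
are not the existing Parrot structs or functions. It only shows one 
function slot, with ascii getting its own specialisation while 
iso-8859-1 and binary share an implementation; the Unicode entries are 
left unimplemented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical merged vtable; these are NOT the real Parrot structs. */
typedef struct str_vtable {
    const char *name;
    unsigned    bytes_per_unit;   /* size of one code unit in bytes */
    /* Returns 1 if buf is valid in this encoding; NULL = not sketched. */
    int       (*validate)(const unsigned char *buf, size_t len);
} str_vtable;

/* Specialisation: ASCII only allows bytes 0x00..0x7F. */
static int validate_ascii(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}

/* Shared by iso-8859-1 and binary: every byte value is acceptable. */
static int validate_any_byte(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 1;
}

/* One flat list of seven encodings, no separate charset table. */
static const str_vtable all_encodings[] = {
    { "ascii",      1, validate_ascii    },
    { "iso-8859-1", 1, validate_any_byte },
    { "binary",     1, validate_any_byte },
    { "utf8",       1, NULL },
    { "utf16",      2, NULL },
    { "ucs2",       2, NULL },
    { "ucs4",       4, NULL },
};

/* Look up a merged encoding by name; returns NULL if unknown. */
const str_vtable *find_merged_encoding(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof all_encodings / sizeof all_encodings[0]; i++)
        if (strcmp(all_encodings[i].name, name) == 0)
            return &all_encodings[i];
    return NULL;
}
```

With a table like this, a string header would only need one pointer to 
its str_vtable entry instead of an encoding pointer plus a charset 
pointer.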

Then the charset opcodes "charset", "charsetname", "find_charset", and 
"trans_charset" could go away. We can also keep them for a while and map 
them to the encoding opcodes for backwards compatibility.
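
The backwards-compatible mapping could be as simple as an alias table 
like the sketch below. This is only an illustration: the op_alias type 
and canonical_op_name function are hypothetical, and apart from 
trans_encoding the encoding-side opcode names are assumed by analogy 
with the charset ones.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias entry: deprecated opcode name -> replacement. */
typedef struct op_alias {
    const char *old_name;
    const char *new_name;
} op_alias;

/* Right-hand names other than trans_encoding are assumed by analogy. */
static const op_alias charset_op_aliases[] = {
    { "charset",       "encoding"       },
    { "charsetname",   "encodingname"   },
    { "find_charset",  "find_encoding"  },
    { "trans_charset", "trans_encoding" },
};

/* Return the replacement for a deprecated charset opcode name,
 * or the name itself if it is not one of the aliases. */
const char *canonical_op_name(const char *name)
{
    size_t i;
    size_t n = sizeof charset_op_aliases / sizeof charset_op_aliases[0];
    for (i = 0; i < n; i++)
        if (strcmp(charset_op_aliases[i].old_name, name) == 0)
            return charset_op_aliases[i].new_name;
    return name;
}
```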

We can also keep the encoding:charset:"string" syntax for string 
literals and simply try to look up both the encoding name and the 
charset name, for full backwards compatibility.
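
One possible way to do that lookup is sketched below. The function 
names are hypothetical, and the rule itself (accept the pair when 
exactly one of the two names is a merged encoding, reject it 
otherwise) is just my assumption about how the ambiguity could be 
handled, not something that exists in the code today.

```c
#include <stddef.h>
#include <string.h>

/* The seven merged encoding names from the proposal. */
static const char *const merged_names[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/* Return the canonical name if it is a merged encoding, else NULL. */
const char *lookup_merged_name(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof merged_names / sizeof merged_names[0]; i++)
        if (strcmp(merged_names[i], name) == 0)
            return merged_names[i];
    return NULL;
}

/* Resolve the two prefixes of an old-style literal such as
 * fixed8:ascii:"..." or utf8:unicode:"..." to one merged encoding.
 * Returns NULL if neither name is known or the two names conflict. */
const char *resolve_literal_prefix(const char *enc, const char *cs)
{
    const char *a = lookup_merged_name(enc);
    const char *b = lookup_merged_name(cs);

    if (a && !b) return a;        /* e.g. utf8:unicode -> utf8  */
    if (b && !a) return b;        /* e.g. fixed8:ascii -> ascii */
    if (a && a == b) return a;    /* same name given twice      */
    return NULL;                  /* unknown or conflicting     */
}
```

So fixed8:iso-8859-1 would resolve to iso-8859-1 and ucs2:unicode to 
ucs2, while something like utf8:ascii would be rejected as ambiguous.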

Nick

