[RFC] Merge charsets and encodings
Nick Wellnhofer
wellnhofer at aevum.de
Wed Aug 25 12:44:03 UTC 2010
I think the separation of charsets and encodings in the string code
doesn't make sense. The way I see it, the only charset that's used in
Parrot is Unicode. ASCII and ISO-8859-1 are subsets of Unicode, so they
could be treated like the other UTF and UCS encodings.
Currently, you have to use trans_charset (to_charset in C) to convert a
string to ISO-8859-1, but you have to use trans_encoding (to_encoding) to
convert it to UTF-16. That looks arbitrary and confusing to me. The
encoding:charset combinations right now are:
- fixed8:ascii
- fixed8:iso-8859-1
- fixed8:binary
- utf8:unicode
- utf16:unicode
- ucs2:unicode
- ucs4:unicode
My proposal is to merge all the charset and encoding functions into a
single kind of string vtable, eliminating duplicates like hash and
find_cclass. I would keep the name "encoding", so there would be seven
encodings:
- ascii
- iso-8859-1
- binary
- utf8
- utf16
- ucs2
- ucs4
The fixed8 and unicode encodings would still share many of their
functions but it would be much easier to add specialisations. The string
code would be simplified and the charset pointer in the string header
could be removed.
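
To make the merged vtable idea concrete, here is a minimal sketch in C. It
is not Parrot's actual code: the struct, the function names and the
sketch_ prefix are all hypothetical, only four of the seven encodings are
filled in, and the UTF-8 validator checks byte structure only (it does not
reject overlong forms or surrogates). It shows one table per encoding, with
the 8-bit encodings sharing a length function while ascii gets its own
validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical merged vtable: one struct per encoding, no charset part. */
typedef struct sketch_encoding {
    const char *name;
    size_t (*length)(const unsigned char *buf, size_t bytes);  /* code points */
    int    (*validate)(const unsigned char *buf, size_t bytes); /* 1 if valid */
} sketch_encoding;

/* Shared by all 8-bit encodings: one byte is one code point. */
static size_t fixed8_length(const unsigned char *buf, size_t bytes)
{
    (void)buf;
    return bytes;
}

/* iso-8859-1 and binary accept every byte value. */
static int fixed8_validate(const unsigned char *buf, size_t bytes)
{
    (void)buf;
    (void)bytes;
    return 1;
}

/* Specialisation: ascii only accepts bytes below 0x80. */
static int ascii_validate(const unsigned char *buf, size_t bytes)
{
    for (size_t i = 0; i < bytes; i++)
        if (buf[i] > 0x7F)
            return 0;
    return 1;
}

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx). */
static size_t utf8_length(const unsigned char *buf, size_t bytes)
{
    size_t n = 0;
    for (size_t i = 0; i < bytes; i++)
        if ((buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Structural check only: lead byte followed by the right number of
 * continuation bytes. */
static int utf8_validate(const unsigned char *buf, size_t bytes)
{
    size_t i = 0;
    while (i < bytes) {
        unsigned char b = buf[i];
        size_t extra;
        if (b < 0x80)                extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return 0;               /* stray continuation or invalid lead */
        if (extra > bytes - i - 1)
            return 0;                /* sequence truncated */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}

static const sketch_encoding sketch_encodings[] = {
    { "ascii",      fixed8_length, ascii_validate  },
    { "iso-8859-1", fixed8_length, fixed8_validate },
    { "binary",     fixed8_length, fixed8_validate },
    { "utf8",       utf8_length,   utf8_validate   },
};

/* Look up an encoding by name; returns NULL if it is unknown. */
const sketch_encoding *sketch_find_encoding(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof sketch_encodings / sizeof sketch_encodings[0]; i++)
        if (strcmp(sketch_encodings[i].name, name) == 0)
            return &sketch_encodings[i];
    return NULL;
}
```

With a layout like this, a string header would only need a single pointer
to one of these tables, and a conversion to ISO-8859-1 or to UTF-16 would
go through the same kind of lookup.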
Then the charset opcodes "charset", "charsetname", "find_charset", and
"trans_charset" could go away. We can also keep them for a while and map
them to the encoding opcodes for backwards compatibility.
We can also keep the encoding:charset:"string" syntax for string
literals and simply try to look up both the encoding and charset for full
backwards compatibility.
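
The backwards-compatible lookup could work roughly as follows. This is
only a sketch under assumptions: sketch_resolve_legacy and the name table
are hypothetical, not existing Parrot functions. It takes the two parts of
a legacy encoding:charset pair and tries each one as a merged encoding
name. Since "fixed8" and "unicode" would no longer be valid names, the
other half of the pair decides the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The seven merged encoding names from the proposal. */
static const char *const sketch_merged_names[] = {
    "ascii", "iso-8859-1", "binary", "utf8", "utf16", "ucs2", "ucs4"
};

/* Return the canonical name if it is a merged encoding, else NULL. */
static const char *sketch_find_merged(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof sketch_merged_names / sizeof sketch_merged_names[0]; i++)
        if (strcmp(sketch_merged_names[i], name) == 0)
            return sketch_merged_names[i];
    return NULL;
}

/* Resolve a legacy (encoding, charset) pair to a merged encoding name.
 * Tries the encoding part first, then the charset part. Either argument
 * may be NULL when that part is absent. Returns NULL if neither part
 * names a merged encoding. */
const char *sketch_resolve_legacy(const char *encoding, const char *charset)
{
    const char *hit = NULL;
    if (encoding != NULL)
        hit = sketch_find_merged(encoding);
    if (hit == NULL && charset != NULL)
        hit = sketch_find_merged(charset);
    return hit;
}
```

For example, the pair fixed8:iso-8859-1 would resolve to iso-8859-1 and
utf16:unicode to utf16. The old charset opcodes could be mapped onto the
encoding opcodes through the same kind of lookup.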
Nick