FileHandle.read and multi-byte encodings
Nick Wellnhofer
wellnhofer at aevum.de
Fri Jan 7 18:25:27 UTC 2011
The FileHandle.read method accepts a byte size argument, but it is also
supposed to work with multi-byte encodings. At the moment, this is
solved by returning a string with more bytes than requested if there
happens to be a partial multi-byte character at the end of the buffer.
This can be surprising and is rather tricky to do correctly.
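To make the current behaviour concrete, here is a minimal Python sketch (not Parrot code) of a byte-sized read that pulls in extra bytes to complete a trailing partial character. The names read_at_least and utf8_seq_len are hypothetical, and the sketch assumes valid UTF-8 input and a stream positioned on a character boundary.

```python
import io

def utf8_seq_len(lead: int) -> int:
    """Length of a UTF-8 sequence, derived from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def read_at_least(stream, nbytes: int) -> str:
    """Read nbytes, plus whatever is needed to finish the last character."""
    buf = bytearray(stream.read(nbytes))
    # Walk back over continuation bytes (10xxxxxx) to the last lead byte.
    i = len(buf) - 1
    while i >= 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    if i >= 0:
        missing = utf8_seq_len(buf[i]) - (len(buf) - i)
        if missing > 0:
            # This is the surprising part: more bytes than were requested.
            buf += stream.read(missing)
    return buf.decode("utf-8")

stream = io.BytesIO("aé€".encode("utf-8"))   # 1 + 2 + 3 bytes
first = read_at_least(stream, 2)             # asked for 2 bytes
second = read_at_least(stream, 1)            # asked for 1 byte
```

Here the first call asks for 2 bytes but consumes 3, and the second asks for 1 byte but consumes 3, which is the mismatch a caller converting the result back to raw bytes would see.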
I also don't see many use cases for reading a minimum number of bytes
from a handle with a multi-byte encoding. It would be more useful to
read a certain number of characters. This can be implemented easily on
top of my recent Unicode readline improvements.
I tried simply changing the read method to accept character sizes in
branch nwellnhof/read_chars, but that turned out to break Rakudo. AFAICS
Rakudo calls the read method in only one place [1] and immediately
converts the result to a ByteBuffer regardless of the current encoding.
(This might return a larger buffer than requested if the encoding is set
to the default utf8 for the reasons outlined above, which could be
considered a bug.)
To support that use case, I propose a new method 'read_bytes' that takes
a byte size argument and returns a ByteBuffer. Once this is implemented,
Rakudo and possibly other HLLs can switch over, and we can change the 'read'
method to accept character counts. Alternatively, we could introduce a
new method 'read_chars', but the old 'read' method would be pretty much
useless then.
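The proposed split can be sketched in Python as follows. This is only an illustration under stated assumptions, not Parrot's FileHandle: the Handle class is hypothetical, bytes stands in for ByteBuffer, and the character read assumes a stateless encoding such as UTF-8.

```python
import codecs
import io

class Handle:
    """Hypothetical stand-in for a file handle with the proposed methods."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw            # underlying binary stream
        self._encoding = encoding

    def read_bytes(self, nbytes):
        """Return at most nbytes raw bytes, ignoring the encoding."""
        return self._raw.read(nbytes)

    def read(self, nchars):
        """Return at most nchars decoded characters."""
        decoder = codecs.getincrementaldecoder(self._encoding)()
        chars = []
        while len(chars) < nchars:
            byte = self._raw.read(1)
            if not byte:
                # End of input; raises UnicodeDecodeError if a
                # character was cut off in the middle.
                chars.extend(decoder.decode(b"", final=True))
                break
            # Yields "" until a full character has been fed in.
            chars.extend(decoder.decode(byte))
        return "".join(chars)

handle = Handle(io.BytesIO("aé€z".encode("utf-8")))
two_chars = handle.read(2)          # character count, consumes 3 bytes
three_bytes = handle.read_bytes(3)  # byte count, exactly 3 bytes
```

With this split, a caller that wants raw data gets exactly the number of bytes it asked for, and a caller that wants text never has to reason about partial characters.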
I have no idea how this would affect other HLLs, so comments from HLL
developers are especially welcome.
Nick
[1] https://github.com/rakudo/rakudo/blob/master/src/core/IO.pm#L82