FileHandle.read and multi-byte encodings

Nick Wellnhofer wellnhofer at aevum.de
Fri Jan 7 18:25:27 UTC 2011


The FileHandle.read method accepts a byte size argument, but it is also
supposed to work with multi-byte encodings. At the moment, this is
solved by returning a string with more bytes than requested if there
happens to be a partial multi-byte character at the end of the buffer.
This can be surprising and is rather tricky to do correctly.
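
To illustrate the behaviour described above, here is a minimal Python
sketch. It is not Parrot's implementation; the function name
read_at_least and the use of io.BytesIO are assumptions made purely for
illustration. It reads the requested number of bytes and then keeps
reading until the data ends on a complete UTF-8 character.

```python
import io


def read_at_least(stream, nbytes):
    """Hypothetical stand-in for the current 'read' semantics.

    Reads nbytes bytes, then up to 3 more so that the result does not
    end in the middle of a UTF-8 character.
    """
    data = stream.read(nbytes)
    # A UTF-8 character is at most 4 bytes long, so a truncated
    # character needs at most 3 additional bytes to be completed.
    for _ in range(3):
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            extra = stream.read(1)
            if not extra:
                break
            data += extra
    # Raises UnicodeDecodeError if the input is invalid or truncated.
    return data.decode("utf-8")


stream = io.BytesIO("a\u00e9\u20ac".encode("utf-8"))  # 1 + 2 + 3 bytes
first = read_at_least(stream, 2)   # asks for 2 bytes, consumes 3
second = read_at_least(stream, 1)  # asks for 1 byte, consumes 3
```

In this sketch the caller asks for 2 bytes and gets back a string that
occupies 3, which is the surprising part.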

I also don't see many use cases for reading a minimum number of bytes
from a handle with a multi-byte encoding. It would be more useful to
read a certain number of characters. This can be implemented easily on
top of my recent Unicode readline improvements.
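
For comparison, reading by character count could look roughly like the
following Python sketch. It does not reflect the readline code
mentioned above; read_chars is a hypothetical helper, and "characters"
here means Unicode code points. It feeds bytes to an incremental decoder
until enough characters have been produced.

```python
import codecs
import io


def read_chars(stream, nchars, encoding="utf-8"):
    """Hypothetical helper: read up to nchars characters from a byte stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    out = ""
    while len(out) < nchars:
        byte = stream.read(1)
        if not byte:
            # End of input: flush the decoder. This raises
            # UnicodeDecodeError if a character was cut off.
            out += decoder.decode(b"", final=True)
            break
        # Returns "" until the bytes of one character are complete.
        out += decoder.decode(byte)
    return out


stream = io.BytesIO("a\u00e9\u20acb".encode("utf-8"))
head = read_chars(stream, 2)  # two characters, three bytes consumed
tail = read_chars(stream, 5)  # fewer than 5 characters remain
```

The caller never sees a partial character, and the number of bytes
consumed is simply whatever the requested characters happen to need.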

I tried to simply change the read method to accept character sizes in 
branch nwellnhof/read_chars but that turned out to break Rakudo. AFAICS 
Rakudo calls the read method only in one place [1] and immediately 
converts the result to a ByteBuffer regardless of the current encoding. 
(This might return a larger buffer than requested if the encoding is set 
to the default utf8 for the reasons outlined above, which could be 
considered a bug.)

To support that use case I propose a new method 'read_bytes' that takes
a byte size argument and returns a ByteBuffer. Once this is implemented,
Rakudo and possibly other HLLs can switch over, and we can then change
the 'read' method to accept character counts. Alternatively, we could
introduce a new method 'read_chars', but the old 'read' method would be
pretty much useless then.
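
The proposed split can be sketched in Python as follows. This is only
an illustration of the intended semantics, not Parrot code: the class
name HandleSketch is made up, Python bytes stand in for a ByteBuffer,
and the sketch does no buffering.

```python
import codecs
import io


class HandleSketch:
    """Hypothetical handle with the two proposed methods."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = raw
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read_bytes(self, nbytes):
        # Returns at most nbytes raw bytes, ignoring the encoding.
        return self._raw.read(nbytes)

    def read(self, nchars):
        # Returns up to nchars decoded characters (code points).
        out = ""
        while len(out) < nchars:
            byte = self._raw.read(1)
            if not byte:
                break
            out += self._decoder.decode(byte)
        return out


handle = HandleSketch(io.BytesIO("\u00e9\u20acxyz".encode("utf-8")))
text = handle.read(2)         # two characters, five bytes consumed
chunk = handle.read_bytes(2)  # exactly two bytes
```

With this split, a caller that wants raw data, as Rakudo does, gets
exactly the number of bytes it asked for (or fewer at end of input),
even if that cuts a multi-byte character in half.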

I have no idea how this would affect other HLLs, so comments from HLL 
developers are especially welcome.

Nick

[1] https://github.com/rakudo/rakudo/blob/master/src/core/IO.pm#L82

