FileHandle.read and multi-byte encodings

Andrew Whitworth wknight8111 at gmail.com
Fri Jan 7 18:47:45 UTC 2011


Is that how Rakudo reads from all files, by reading in binary mode
directly into an encoding-less byte buffer? If that's the case, it
seems like they could explicitly set the encoding on that filehandle
to binary and continue to read bytes, while the read method could be
updated in place to take character counts instead.

I'm strongly in favor of some kind of read() method that respects
encodings. If we can do it in-place I would prefer that, but adding a
new method as necessary would be an acceptable second option.
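
To make the distinction concrete, here is a rough sketch in Python
rather than PIR. The names read_bytes and read_chars mirror the ones
proposed in this thread, but the signatures, the default encoding
argument and the byte-at-a-time loop are assumptions for illustration
only, not Parrot's actual implementation.

```python
import codecs
import io


def read_bytes(stream, nbytes):
    """Return up to nbytes raw bytes, ignoring any encoding."""
    return stream.read(nbytes)


def read_chars(stream, nchars, encoding="utf-8"):
    """Return up to nchars decoded characters from a binary stream.

    Hypothetical helper: feeds the decoder one byte at a time so it
    never consumes more input than the requested characters need.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    out = ""
    while len(out) < nchars:
        byte = stream.read(1)
        if not byte:
            # End of input; a truncated trailing character is
            # silently dropped in this sketch.
            break
        # decode() returns "" until a full character is available.
        out += decoder.decode(byte)
    return out


data = "h\u00e9llo".encode("utf-8")      # b'h\xc3\xa9llo'
print(read_bytes(io.BytesIO(data), 2))   # b'h\xc3' (splits the e-acute)
print(read_chars(io.BytesIO(data), 2))   # two whole characters
```

The byte-oriented call can cut a character in half, which is fine for
a caller that wants a raw buffer; the character-oriented call always
returns whole characters at the cost of not knowing the byte count in
advance.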

--Andrew Whitworth



On Fri, Jan 7, 2011 at 1:25 PM, Nick Wellnhofer <wellnhofer at aevum.de> wrote:
> The FileHandle.read method accepts a byte size argument but it is also
> supposed to work with multi-byte encodings. At the moment, this is solved
> by returning a string with more bytes than requested if there happens to be
> a partial multi-byte character at the end of the buffer. This can be
> surprising and is rather tricky to do correctly.
>
> I also don't see many use cases for reading a minimum number of bytes from a
> handle with a multi-byte encoding. It would be more useful to read a certain
> number of characters. This can be implemented easily on top of my recent
> Unicode readline improvements.
>
> I tried to simply change the read method to accept character sizes in branch
> nwellnhof/read_chars but that turned out to break Rakudo. AFAICS Rakudo
> calls the read method only in one place [1] and immediately converts the
> result to a ByteBuffer regardless of the current encoding. (This might
> return a larger buffer than requested if the encoding is set to the default
> utf8 for the reasons outlined above, which could be considered a bug.)
>
> To support that use case I propose a new method 'read_bytes' that takes a
> byte size argument and returns a ByteBuffer. Once this is implemented,
> Rakudo and possibly other HLLs can switch over, and we change the 'read'
> method to accept character counts. Alternatively, we could introduce a new
> method 'read_chars', but the old 'read' method would be pretty much useless
> then.
>
> I have no idea how this would affect other HLLs, so comments from HLL
> developers are especially welcome.
>
> Nick
>
> [1] https://github.com/rakudo/rakudo/blob/master/src/core/IO.pm#L82
> _______________________________________________
> http://lists.parrot.org/mailman/listinfo/parrot-dev
>
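
The "more bytes than requested" behaviour described in the quoted
message can be mimicked with a short Python sketch. This is only an
approximation under stated assumptions: read_padded is a hypothetical
name, and the logic is modelled on the description above, not taken
from Parrot's source.

```python
import codecs
import io


def read_padded(stream, nbytes, encoding="utf-8"):
    """Read nbytes, then keep reading until no character is left partial.

    Returns the decoded text and the number of bytes actually consumed,
    which may exceed nbytes.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    raw = stream.read(nbytes)
    text = decoder.decode(raw)
    # getstate()[0] holds input bytes buffered for an incomplete character.
    while decoder.getstate()[0]:
        extra = stream.read(1)
        if not extra:
            break  # input ended in the middle of a character
        raw += extra
        text += decoder.decode(extra)
    return text, len(raw)


data = "a\u00e9".encode("utf-8")          # 3 bytes: b'a\xc3\xa9'
text, consumed = read_padded(io.BytesIO(data), 2)
print(len(text), consumed)                # 2 characters, 3 bytes consumed
```

A caller that asked for 2 bytes and then converts the result to a
byte buffer ends up with 3, which is the surprise described above.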

