UTF8 performance

Nick Wellnhofer wellnhofer at aevum.de
Wed Jan 6 22:12:03 UTC 2010


On 06/01/10 04:23, Vasily Chekalkin wrote:
> Nick Wellnhofer wrote:
>> It seems that all ways to iterate over the characters in a UTF8 string
>> have quadratic running time. See the attached test. I would expect
>> that for keyed access and 'substr', but iterator access and 'split'
>> should have better performance. I had a look at the string iterator
>> PMC code and it doesn't use the iterators that the underlying string
>> API provides.
>>
>> I can offer to write a patch to fix this if no one else is working on
>> this.
>
> Good idea! Patches welcome!

Here is a preliminary patch.
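
As a standalone illustration of the problem (this is not the patch and not
Parrot code; every function name below is hypothetical), the following sketch
contrasts indexed access with iterator access on a UTF-8 buffer. It assumes
valid UTF-8 input and does no error checking.

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence that starts with lead byte c.
 * Assumes valid input; no error checking. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Decode the code point at byte offset *pos and advance *pos past it. */
static unsigned long utf8_next(const unsigned char *s, size_t *pos)
{
    /* Mask for the payload bits of the lead byte, indexed by length. */
    static const unsigned char lead_mask[] = { 0x00, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t        len = utf8_seq_len(s[*pos]);
    unsigned long cp  = s[*pos] & lead_mask[len];
    size_t        i;

    for (i = 1; i < len; i++)
        cp = (cp << 6) | (s[*pos + i] & 0x3F);
    *pos += len;
    return cp;
}

/* Indexed access: rescans from the start, so one call costs O(idx). */
static unsigned long utf8_char_at(const unsigned char *s, size_t idx)
{
    size_t pos = 0;

    while (idx-- > 0)
        pos += utf8_seq_len(s[pos]);
    return utf8_next(s, &pos);
}

/* Visiting every character by index: O(n^2) in total. */
static unsigned long sum_indexed(const unsigned char *s, size_t nchars)
{
    unsigned long sum = 0;
    size_t        i;

    for (i = 0; i < nchars; i++)
        sum += utf8_char_at(s, i);
    return sum;
}

/* Visiting every character with a running byte offset: O(n) in total. */
static unsigned long sum_iterated(const unsigned char *s, size_t nbytes)
{
    unsigned long sum = 0;
    size_t        pos = 0;

    while (pos < nbytes)
        sum += utf8_next(s, &pos);
    return sum;
}
```

Both loops produce the same result. The difference is that sum_indexed
restarts from byte 0 for every character, while sum_iterated keeps its byte
offset between steps.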

I would also suggest moving the iterator function pointers from struct
string_iterator_t to struct encoding_t and introducing new macros similar
to ENCODING_ITER_INIT, as I did in my patch. If that's OK, I can convert
the rest of the string iterator users.
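
A rough sketch of the layout I have in mind, using a trivial fixed-width
8-bit encoding to keep it short. All demo_ and DEMO_ names are hypothetical
and exist only for this illustration; they are not Parrot's actual structs
or macros, and the patch itself may differ.

```c
#include <stddef.h>

typedef struct demo_iter demo_iter;

/* Hypothetical per-encoding table. The iterator functions live here, so
 * each iterator only needs a pointer to its encoding instead of carrying
 * its own copy of every function pointer. */
typedef struct demo_encoding {
    const char    *name;
    unsigned long (*iter_get_and_advance)(demo_iter *it);
    void          (*iter_set_position)(demo_iter *it, size_t charpos);
} demo_encoding;

/* Hypothetical iterator state: data and positions only. */
struct demo_iter {
    const unsigned char *bytes;
    size_t               bytepos;
    size_t               charpos;
    const demo_encoding *enc;
};

/* Macros that dispatch through the encoding table. */
#define DEMO_ITER_INIT(it, s, e)                          \
    do {                                                  \
        (it)->bytes   = (s);                              \
        (it)->bytepos = 0;                                \
        (it)->charpos = 0;                                \
        (it)->enc     = (e);                              \
    } while (0)
#define DEMO_ITER_GET_AND_ADVANCE(it) \
    ((it)->enc->iter_get_and_advance(it))
#define DEMO_ITER_SET_POSITION(it, p) \
    ((it)->enc->iter_set_position((it), (p)))

/* Fixed-width 8-bit encoding: one byte per character. */
static unsigned long fixed8_get_and_advance(demo_iter *it)
{
    it->charpos++;
    return it->bytes[it->bytepos++];
}

static void fixed8_set_position(demo_iter *it, size_t charpos)
{
    it->bytepos = charpos;
    it->charpos = charpos;
}

static const demo_encoding demo_fixed8 = {
    "fixed_8",
    fixed8_get_and_advance,
    fixed8_set_position
};
```

A caller would initialize an iterator with DEMO_ITER_INIT(&it, bytes,
&demo_fixed8) and then step through the string with
DEMO_ITER_GET_AND_ADVANCE(&it), without knowing which encoding is behind it.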

It would also be helpful to remove the const qualifier from the 'str' 
member of struct string_iterator_t. Or is it important?

Nick
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: string-iter.diff
URL: <http://lists.parrot.org/pipermail/parrot-dev/attachments/20100106/907a2e63/attachment-0001.diff>

