GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

David Chisnall-7
Hello the list,

I am testing out a new version of the compiler / runtime that is producing NSConstantString instances with UTF-16 data.  I have currently disabled a lot of the NSConstantString optimisations, on the basis of ‘make it work then make it fast’ and I’m still seeing quite a lot of test failures.  The most recent ones seem to come from the fact that GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: calls rangeOfSequence_u(), which returns a different range to NSString’s implementation.

I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) from the NSString/test00.m. If I call -getCharacters:range: on both, then I get the same set of characters for [indianLong length] characters.  This is as expected.  When searching for indianLong in ls, it is not found.  Sticking in a lot of debugging code, I eventually tracked it down to this disagreement and when I comment out GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass implementation then this test passes.

Please can someone who understands these bits of exciting unicode logic take a look and see if there’s any reason for the disagreement?

I’m now hitting a failure in the unichar_const tests, because for some reason a GSMutableString and a (UTF-16) NSConstantString are not comparing equal, in spite of having the same hash, the same length, and the same values for both characters...

David


_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
Reply | Threaded
Open this post in threaded view
|

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

David Chisnall-7
On 7 Apr 2018, at 19:51, David Chisnall <[hidden email]> wrote:
>
> I’m now hitting a failure in the unichar_const tests, because for some reason a GSMutableString and a (UTF-16) NSConstantString are not comparing equal, in spite of having the same hash, the same length, and the same values for both characters...

Ignore that bit - I’d made the same fix in GSMutableString, but hadn’t recompiled before testing.  I’m now passing all of the NSString tests, with both rangeOfSequence_c and rangeOfSequence_u disabled and the subclasses that use it falling back to the superclass implementation.  I don’t know if the NSString or the GSString implementation is the correct one, but at least one of them is wrong, and the NSString one at least works with other NSString subclasses that implement the required primitive methods of NSString.

David


_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
Reply | Threaded
Open this post in threaded view
|

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

Fred Kiefer
In reply to this post by David Chisnall-7


> Am 07.04.2018 um 20:51 schrieb David Chisnall <[hidden email]>:
>
> I am testing out a new version of the compiler / runtime that is producing NSConstantString instances with UTF-16 data.  I have currently disabled a lot of the NSConstantString optimisations, on the basis of ‘make it work then make it fast’ and I’m still seeing quite a lot of test failures.  The most recent ones seem to come from the fact that GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: calls rangeOfSequence_u(), which returns a different range to NSString’s implementation.
>
> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) from the NSString/test00.m. If I call -getCharacters:range: on both, then I get the same set of characters for [indianLong length] characters.  This is as expected.  When searching for indianLong in ls, it is not found.  Sticking in a lot of debugging code, I eventually tracked it down to this disagreement and when I comment out GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass implementation then this test passes.
>
> Please can someone who understands these bits of exciting unicode logic take a look and see if there’s any reason for the disagreement?

I am surely no expert here, but I had a quick look at the code and the two algorithms seem to be very similar. The only difference is the set of code points that the characters get compared to. NSString uses [NSCharacterSet nonBaseCharacterSet], which looks correct to me. On the other hand GSString uses uni_isnonsp(), which I would read as "non spacing“ but is never explained. The code here is as follows:

BOOL
uni_isnonsp(unichar u)
{
  /*
   * Treating upper surrogates as non-spacing is a convenient solution
   * to a number of issues with UTF-16
   */
  if ((u >= 0xdc00) && (u <= 0xdfff))
    return YES;

// FIXME check is uni_cop good for this
  if (GSPrivateUniCop(u))
    return YES;
  else
    return NO;
}

As a side effect this should handle the upper surrogates correctly, but not the lower and I have no idea what GSPrivateUniCop does, even after looking at the code various times. OK, it is a binary search on uni_cop_table, but what is in that table?



_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
Reply | Threaded
Open this post in threaded view
|

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

Richard Frith-Macdonald-9


> On 8 Apr 2018, at 09:38, Fred Kiefer <[hidden email]> wrote:
>
>
>
>> Am 07.04.2018 um 20:51 schrieb David Chisnall <[hidden email]>:
>>
>> I am testing out a new version of the compiler / runtime that is producing NSConstantString instances with UTF-16 data.  I have currently disabled a lot of the NSConstantString optimisations, on the basis of ‘make it work then make it fast’ and I’m still seeing quite a lot of test failures.  The most recent ones seem to come from the fact that GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: calls rangeOfSequence_u(), which returns a different range to NSString’s implementation.
>>
>> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) from the NSString/test00.m. If I call -getCharacters:range: on both, then I get the same set of characters for [indianLong length] characters.  This is as expected.  When searching for indianLong in ls, it is not found.  Sticking in a lot of debugging code, I eventually tracked it down to this disagreement and when I comment out GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass implementation then this test passes.
>>
>> Please can someone who understands these bits of exciting unicode logic take a look and see if there’s any reason for the disagreement?
>
> I am surely no expert here, but I had a quick look at the code and the two algorithms seem to be very similar. The only difference is the set of code points that the characters get compared to. NSString uses [NSCharacterSet nonBaseCharacterSet], which looks correct to me. On the other hand GSString uses uni_isnonsp(), which I would read as "non spacing“ but is never explained. The code here is as follows:
>
> BOOL
> uni_isnonsp(unichar u)
> {
>  /*
>   * Treating upper surrogates as non-spacing is a convenient solution
>   * to a number of issues with UTF-16
>   */
>  if ((u >= 0xdc00) && (u <= 0xdfff))
>    return YES;
>
> // FIXME check is uni_cop good for this
>  if (GSPrivateUniCop(u))
>    return YES;
>  else
>    return NO;
> }
>
> As a side effect this should handle the upper surrogates correctly, but not the lower and I have no idea what GSPrivateUniCop does, even after looking at the code various times. OK, it is a binary search on uni_cop_table, but what is in that table?

I don't think we have an expert on this (I didn't write the original unicode stuff), but I was able to find out what GSPrivateUniCop() is ... it's checking for combining characters (diacriticvals etc), so it's use makes sense here if the idea is to detect runs of 16bit values which are part of the same 'user-perceived character' (important where the 16bit components of that character cross a range boundary).

I suspect neither uni_isnonsp() nor nonBaseCharacterSet is the correct mechanism though.

One thing that I do know from looking at cop.h is that the uni_isnonsp[() code, depending on cop.h is using out of date information;
The NSCharacterSet info is periodically regenerated from the latest unicode standard (I re-did that quite recently), but the headers in Source/Additions/unicode are not.  The cop.h file is definitely missing information and I'd therefore consider them other headers untrustworthy.  It would be good to ensure that we machine-generate all unicode related information rather than depending on header fiels which were (presumably) built by hand at some point.

Also, reading the unicode documentation it's clear that, even of the cop.h data was complete, it's only a subset of the characters in the nonBaseCharacterSet, and those other categories should be included ... so we should not be using GSPrivateUniCop(), and we should be using nonBaseCharacterSet to get all the follow-on characters in a multipart user-perceived character.

On the other hand, the earlier comment also looks correct:
  /*
   * Treating upper surrogates as non-spacing is a convenient solution
   * to a number of issues with UTF-16
   */

I checked, and nonBaseCharacterSet does not include the surrogate pair characters.

So I think we actually need to fix *both*  cases to
a. check for the second half of a surrogate pair and
b. check for a non-base character

Probably the best thing to do would be to fix uni_isnonsp() to use nonBaseCharacterSet and have the NSString code call uni_insnonsp().

Incidentally, I agree that we want ChangeLog entries (git seems very unfriendly/poor at presenting such information) ...
If it's trivial for David to generate them from git commit entries, I'd rather have the time spent on writing a few comments be devoted to that trivial change instead.


_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
Reply | Threaded
Open this post in threaded view
|

Re: GSUnicodeString and NSString disagree on rangeOfComposedCharacterSequenceAtIndex:

David Chisnall-7
In reply to this post by Fred Kiefer
On 8 Apr 2018, at 09:38, Fred Kiefer <[hidden email]> wrote:

>
>
>
>> Am 07.04.2018 um 20:51 schrieb David Chisnall <[hidden email]>:
>>
>> I am testing out a new version of the compiler / runtime that is producing NSConstantString instances with UTF-16 data.  I have currently disabled a lot of the NSConstantString optimisations, on the basis of ‘make it work then make it fast’ and I’m still seeing quite a lot of test failures.  The most recent ones seem to come from the fact that GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: calls rangeOfSequence_u(), which returns a different range to NSString’s implementation.
>>
>> I have ls (an GSUnicodeString) and indianLong (a UTF-16 NSConstantString) from the NSString/test00.m. If I call -getCharacters:range: on both, then I get the same set of characters for [indianLong length] characters.  This is as expected.  When searching for indianLong in ls, it is not found.  Sticking in a lot of debugging code, I eventually tracked it down to this disagreement and when I comment out GSUnicodeString’s implementation of rangeOfComposedCharacterSequenceAtIndex: so that it uses the superclass implementation then this test passes.
>>
>> Please can someone who understands these bits of exciting unicode logic take a look and see if there’s any reason for the disagreement?
>
> I am surely no expert here, but I had a quick look at the code and the two algorithms seem to be very similar. The only difference is the set of code points that the characters get compared to. NSString uses [NSCharacterSet nonBaseCharacterSet], which looks correct to me. On the other hand GSString uses uni_isnonsp(), which I would read as "non spacing“ but is never explained. The code here is as follows:
>
> BOOL
> uni_isnonsp(unichar u)
> {
>  /*
>   * Treating upper surrogates as non-spacing is a convenient solution
>   * to a number of issues with UTF-16
>   */
>  if ((u >= 0xdc00) && (u <= 0xdfff))
>    return YES;
>
> // FIXME check is uni_cop good for this
>  if (GSPrivateUniCop(u))
>    return YES;
>  else
>    return NO;
> }
>
> As a side effect this should handle the upper surrogates correctly, but not the lower and I have no idea what GSPrivateUniCop does, even after looking at the code various times. OK, it is a binary search on uni_cop_table, but what is in that table?

Thanks.  I co-mentored a GSoC student a couple of years ago working on unicode collation and discovered quite how little I understand about unicode. I am reminded of that now:

[NSCharacterSet nonBaseCharacterSet] contains the characters in the Mark sets, which are mark sets, which appear to be the combining diacritics.  I don’t believe that the current code correctly handles other combining sets, such as the emoji flag sequences, which are not in the Mark sets.  A simple test demonstrates this:

```objc
#import <Foundation/Foundation.h>

int main(void)
{
        @autoreleasepool {
        uint32_t utf32[] = {0x1F1E7, 0x1F1EA};
        NSString *str = [[NSString alloc] initWithBytes: utf32 length: sizeof(utf32) encoding: NSUTF32LittleEndianStringEncoding];
        NSLog(@"%@, %ld", str, (unsigned long)[str rangeOfComposedCharacterSequenceAtIndex: 0].length);
        }
}
```

On macOS, this prints a Belgian flag and tells me that it is 4 UTF-16 code units long.  On GNUstep for me it prints the flag but tells me that the length is 1 with the version from NSString and 2 with the version from GSString.

For NSRegularExpression, I wrote a pair of adaptors classes that allow you to expose either the NSString or ICU UText interfaces using the other’s representation.  I wonder if we should consider moving to something like that on platforms that support ICU?  I’m not sure where the UText interface comes from, but it seems to now be increasingly widely used inside ICU and could have been designed specifically to support the abstractions that NSString exposes (it requires that you be able to access the string as UTF-16 code units and works by providing a buffer that is filled using a mechanism very similar to -getCharacters:range:).  I have implemented a simple function that uses ICU to DTRT:

```objc
NSRange rangeOfComposedCharacterSequenceAtIndex(NSString *str, NSUInteger idx)
{
        UText *txt = UTextInitWithNSString(NULL, str);
        UErrorCode e = 0;
        UBreakIterator *it = ubrk_open(UBRK_CHARACTER, NULL, NULL, 0, &e);
        ubrk_setUText(it, txt, &e);
        if (U_FAILURE(e))
        {
                NSLog(@"Error handling goes here");
        }
        ubrk_preceding(it, idx);
        uint32_t origin = ubrk_current(it);
        uint32_t next = ubrk_next(it);
        if (next == idx)
        {
                origin = next;
                next = ubrk_next(it);
        }
        if (next == UBRK_DONE)
        {
                [NSException raise: NSRangeException format:@"Invalid location."];
        }
        return NSMakeRange(origin, next - origin);
}
```

Creating a new UBreakIterator each call is expensive, but we could cache it in TLS.  Similarly, we can allocate the UText object (and its associated buffer) on the stack.  I’d expect that to make this operation fairly cheap, but I don’t have any Objective-C code where unicode string rendering is on the critical path so I can’t usefully measure it.  I’ve tested this with a flag and with an a + ogonek + acute (U+0061, U+0328, U+0301) and it gives the correct range in both cases.

The existing implementation appears to have another bug, in that it always gives you a range that starts from the index that you give it, even if that’s in the middle of a composed sequence.  This version does the right thing in all cases.

I’m also wondering if we should consider replacing GSString with an extended version of GSICUString on platforms where ICU is available?

David

Full test program:

#import <Foundation/Foundation.h>
#import "Source/GSICUString.h"
#include <unicode/ubrk.h>

NSRange rangeOfComposedCharacterSequenceAtIndex(NSString *str, NSUInteger idx)
{
        UText *txt = UTextInitWithNSString(NULL, str);
        UErrorCode e = 0;
        UBreakIterator *it = ubrk_open(UBRK_CHARACTER, NULL, NULL, 0, &e);
        ubrk_setUText(it, txt, &e);
        if (U_FAILURE(e))
        {
                NSLog(@"Error handling goes here");
        }
        ubrk_preceding(it, idx);
        uint32_t origin = ubrk_current(it);
        uint32_t next = ubrk_next(it);
        if (next == idx)
        {
                origin = next;
                next = ubrk_next(it);
        }
        if (next == UBRK_DONE)
        {
                [NSException raise: NSRangeException format:@"Invalid location."];
        }
        return NSMakeRange(origin, next - origin);
}

int main(void)
{
        @autoreleasepool {
                uint32_t utf32[] = {0x1F1E7, 0x1F1EA, 0x20, 0x061, 0x0328, 0x0301};
                NSString *str = [[NSString alloc] initWithBytes: utf32 length: sizeof(utf32) encoding: NSUTF32LittleEndianStringEncoding];
                NSLog(@"%@, %ld", str, (unsigned long)[str rangeOfComposedCharacterSequenceAtIndex: 0].length);
                str = [str stringByAppendingString: @"foo bar wibble!"];
                for (int i=0 ; i<=[str length] ; i++)
                {
                        NSLog(@"new: %d: %@", i, NSStringFromRange(rangeOfComposedCharacterSequenceAtIndex(str, i)));
                        NSLog(@"old: %d: %@", i, NSStringFromRange([str rangeOfComposedCharacterSequenceAtIndex: i]));
                }
        }
}


_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev