New ABI NSConstantString

David Chisnall-7
Hello the list,

I have nearly finished the ELF version of the new Objective-C ABI.  It is able to pass the same tests that the previous one did in -base, with a smaller binary and better reflection metadata.  The last piece left is whether to improve NSConstantString.

The new ABI includes some breaking changes, so will require a complete recompile.  This gives us an opportunity to improve constant strings.  I think we have three options:

1) Use the existing NSConstantString structure.
2) Simply adopt CFConstantString.
3) Do something new.

I don’t think 1 is a very good idea.  -base includes some horrible hacks to go and replace NSConstantString instances with NSString instances on initialisation because NSConstantString -hash requires that it be computed dynamically.

Option 2 would simplify some Apple interoperability.  It allows UTF-8 and UTF-16 strings (is this useful?  Anyone in CJK locales want it?) but doesn’t really help with the hash issue.  

Option 3 would be to implement a structure something like:

{
        id isa;
        const char *str; // UTF8 or UTF16 string
        NSUInteger  hash;
        NSUInteger  flags;
}

The flags would store, at a minimum:

 - Whether this is UTF-8 or UTF-16.
 - What hash algorithm the compiler used.

If -base later decides to use a different hash algorithm, the implementation of -hash can then check the flags and if the compiler-provided hash is not the version being used, it can lazily set the hash ivar to something different.
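
The lazy re-hash described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the flag names, the bit positions, and the stand-in hash function (FNV-1a) are hypothetical and are not taken from -base or the compiler, and `void *` / `uintptr_t` stand in for `id` / `NSUInteger`.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag layout: bit 0 selects the encoding, bits 1-3 name the
 * hash algorithm that produced the stored hash. */
enum {
    CS_FLAG_UTF16     = 1u << 0,
    CS_HASH_ALG_SHIFT = 1,
    CS_HASH_ALG_MASK  = 0x7u << CS_HASH_ALG_SHIFT
};

/* Hypothetical algorithm identifiers. */
enum { CS_HASH_ALG_OLD = 0, CS_HASH_ALG_CURRENT = 1 };

struct const_string {
    void       *isa;    /* stands in for `id isa` */
    const char *str;    /* UTF-8 or UTF-16 data */
    uintptr_t   hash;   /* stands in for NSUInteger */
    uintptr_t   flags;
};

/* Placeholder for whatever algorithm the runtime currently uses.
 * FNV-1a over the bytes of a NUL-terminated 8-bit string, for illustration. */
static uintptr_t current_hash(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns the stored hash, recomputing it first if the compiler used a
 * different algorithm from the one the runtime expects. */
static uintptr_t const_string_hash(struct const_string *s)
{
    uintptr_t alg = (s->flags & CS_HASH_ALG_MASK) >> CS_HASH_ALG_SHIFT;
    if (alg != CS_HASH_ALG_CURRENT) {
        s->hash  = current_hash(s->str);
        s->flags = (s->flags & ~(uintptr_t)CS_HASH_ALG_MASK)
                 | ((uintptr_t)CS_HASH_ALG_CURRENT << CS_HASH_ALG_SHIFT);
    }
    return s->hash;
}
```

Note that the update writes to the object, so this only works if the constant string structure itself is emitted into writable data.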

Another alternative is to set isa to different things for UTF8 and UTF16, so we can just provide NSUTF8ConstantString and NSUTF16ConstantString as subclasses of NSConstantString.

Does anyone have any strong opinions in either direction?

David


_______________________________________________
Gnustep-dev mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/gnustep-dev

Re: New ABI NSConstantString

Fred Kiefer
On 01.04.2018 at 11:52, David Chisnall <[hidden email]> wrote:

>
> Does anyone have any strong opinions in either direction?

Wouldn’t the most useful structure be the one we already use for GSString?

@interface GSString : NSString
{
@public
  GSCharPtr _contents;
  unsigned int _count;
  struct {
    unsigned int wide: 1; // 16-bit characters in string?
    unsigned int owned: 1; // Set if the instance owns the
                                        // _contents buffer
    unsigned int unused: 2;
    unsigned int hash: 28;
  } _flags;
}
@end

Of course constant strings won’t require the hidden reference count that comes with all ObjC objects. But apart from that it seems to be a more useful structure. Storing the length with the string should speed up some common operations and 28 bits of hash should still be enough. There are even two unused bits in the flags that could encode the specific hash function.



Re: New ABI NSConstantString

David Chisnall-7
On 1 Apr 2018, at 11:36, Fred Kiefer <[hidden email]> wrote:
>
> Wouldn’t the most useful structure be the one we already use for GSString?

That’s certainly a good starting point!

>
> @interface GSString : NSString
> {
> @public
>  GSCharPtr _contents;
>  unsigned int _count;

Is this the number of bytes or the number of characters?  I imagine that both are useful.

>  struct {
>    unsigned int wide: 1; // 16-bit characters in string?
>    unsigned int owned: 1; // Set if the instance owns the
> // _contents buffer

Owned is presumably redundant for constant strings.

>    unsigned int unused: 2;
>    unsigned int hash: 28;
>  } _flags;
> }
> @end
>
> Of course constant strings won’t require  the hidden reference count that come with all ObjC objects. But apart from that it seems to be a more useful structure. Storing the length with the string should speed up some common operations and 28 bit of hash should still be enough. There are even two unused bits in the flags that could encode the specific hash function.

I’d like to have more than 2 bits spare for future expansion.  The current NXConstantString structure is now 30 years old, and I think there have been several times in the past when it would have been nice to add other things to it if we’d had a good way of maintaining compatibility.

This structure does have the advantage that it doesn’t need padding on any 32- or 64-bit architectures.

Do we have any measurements to tell us that 28 bits is enough for the hash?  The -hash method returns an NSUInteger, which is 64 bits on most platforms, so we’re not using much of the available range.  At some point, I’d like to move the hash implementation for NSString to MurmurHash3, which should give better distribution and is very fast on modern hardware.

I’m also a bit nervous about using C bitfields in static data structures, because their layout is ABI dependent (and on some platforms can change between compiler versions).
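
One way to avoid depending on bitfield layout is to define the same fields with explicit shifts and masks on a fixed-width word. A minimal sketch, assuming (hypothetically) that `wide` is bit 0, `owned` is bit 1, bits 2-3 are unused and the 28-bit hash occupies the top bits; the macro and function names are made up for illustration:

```c
#include <stdint.h>

/* Hypothetical explicit layout replacing the GSString bitfield. */
#define GS_FLAG_WIDE   (UINT32_C(1) << 0)      /* 16-bit characters in string */
#define GS_FLAG_OWNED  (UINT32_C(1) << 1)      /* instance owns the buffer */
#define GS_HASH_SHIFT  4                       /* bits 2-3 stay unused */
#define GS_HASH_MASK   UINT32_C(0x0FFFFFFF)    /* 28 bits of hash */

/* Packs the flags and the low 28 bits of `hash` into one 32-bit word. */
static inline uint32_t gs_pack(int wide, int owned, uint32_t hash)
{
    return (wide  ? GS_FLAG_WIDE  : 0)
         | (owned ? GS_FLAG_OWNED : 0)
         | ((hash & GS_HASH_MASK) << GS_HASH_SHIFT);
}

/* Extracts the 28-bit hash from a packed word. */
static inline uint32_t gs_hash(uint32_t word)
{
    return word >> GS_HASH_SHIFT;
}

/* Tests the `wide` flag of a packed word. */
static inline int gs_is_wide(uint32_t word)
{
    return (word & GS_FLAG_WIDE) != 0;
}
```

With this form the compiler and the runtime only need to agree on the constants, not on how a particular C compiler lays out bitfields.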

I’m also tempted to teach the compiler about GSTinyString for 64-bit platforms, though so far that’s not been part of the ABI.  That gives us strings of up to eight 7-bit ASCII characters plus a 5-bit length.  The hash for them needs computing dynamically, but they fit into a 64-bit pointer directly.
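
To illustrate how eight 7-bit characters and a 5-bit length fit in a 64-bit word (56 + 5 = 61 bits, leaving 3 low bits for a tag), here is a rough C sketch. The exact bit positions and the tag value are assumptions made for this example and may not match what -base or the runtime actually use; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed layout: bits 0-2 tag, bits 3-7 length, characters packed from the
 * top of the word downwards, 7 bits each. */
#define TINY_TAG        UINT64_C(4)     /* hypothetical tag value */
#define TINY_TAG_MASK   UINT64_C(7)
#define TINY_LEN_SHIFT  3
#define TINY_LEN_MASK   UINT64_C(0x1F)

/* Packs up to 8 ASCII characters into a word; returns 0 if the string is too
 * long or contains a non-ASCII byte and so needs a real object instead. */
static uint64_t tiny_string_make(const char *s, size_t len)
{
    if (len > 8) {
        return 0;
    }
    uint64_t word = TINY_TAG | ((uint64_t)len << TINY_LEN_SHIFT);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c > 0x7F) {
            return 0;
        }
        word |= (uint64_t)c << (57 - 7 * i);
    }
    return word;
}

/* Reads the 5-bit length field. */
static size_t tiny_string_length(uint64_t word)
{
    return (size_t)((word >> TINY_LEN_SHIFT) & TINY_LEN_MASK);
}

/* Reads character i (0-based); the caller must ensure i < length. */
static char tiny_string_char_at(uint64_t word, size_t i)
{
    return (char)((word >> (57 - 7 * i)) & 0x7F);
}
```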

David



Re: New ABI NSConstantString

Richard Frith-Macdonald-9


> On 1 Apr 2018, at 12:21, David Chisnall <[hidden email]> wrote:
>
> On 1 Apr 2018, at 11:36, Fred Kiefer <[hidden email]> wrote:
>>
>> Wouldn’t the most useful structure be the one we already use for GSString?
>
> That’s certainly a good starting point!
>
>>
>> @interface GSString : NSString
>> {
>> @public
>> GSCharPtr _contents;
>> unsigned int _count;
>
> Is this the number of bytes or the number of characters?  I imagine that both are useful.

That's the character count.

>> struct {
>>   unsigned int wide: 1; // 16-bit characters in string?
>>   unsigned int owned: 1; // Set if the instance owns the
>> // _contents buffer
>
> Owned is presumably redundant for constant strings.

Yep.  In a constant string you could just consider it a bit reserved for mutable strings.

>>   unsigned int unused: 2;
>>   unsigned int hash: 28;
>> } _flags;
>> }
>> @end
>>
>> Of course constant strings won’t require  the hidden reference count that come with all ObjC objects. But apart from that it seems to be a more useful structure. Storing the length with the string should speed up some common operations and 28 bit of hash should still be enough. There are even two unused bits in the flags that could encode the specific hash function.
>
> I’d like to have more than 2 bits spare for future expansion.  The current NXConstantString structure is now 30 years old, and I think there have been several times in the past when it would have been nice to add other things to it if we’d had a good way of maintaining compatibility.
>
> This structure does have the advantage that it doesn’t need padding on any 32- or 64-bit architectures.


> Do we have any measurements to tell us that 28 bits is enough for the hash?

I don't think so, but with a good hash that gets us over a hundred million strings held efficiently in a set/dictionary, which seems plenty for now.
However, if the idea is to future-proof things in the ABI, perhaps 28 bits is not enough.

> At some point, I’d like to move the hash implementation for NSString to MurmurHash3, which should give better distribution and is very fast on modern hardware.

Yes.  GNUstep-base has MurmurHash3 support, and perhaps it's time it was made the default.

> I’m also a bit nervous about using C bitfields in static data structures, because their layout is ABI dependent (and on some platforms can change between compiler versions).

I wasn't aware of that ... it would make sense for your new ABI, when individual bits are needed, to have them specified as particular bits rather than as a bitfield, avoiding the possibility of problems with different compilers.

I don't think you should feel constrained to follow the current layout ... IMO the current one is good for years yet but probably not for decades.
However, I do think that it's more sensible to have pointer, count, hash, and flags similar to the current GNUstep layout than to follow Apple (and to bear in mind that it's convenient for mutable strings to share a layout with constant ones).





Re: New ABI NSConstantString

David Chisnall-7
On 1 Apr 2018, at 14:06, Richard Frith-Macdonald <[hidden email]> wrote:
>
>
> I wasn't aware of that ... it would make sense for your new ABI, when individual bits, to have them specified as particular bits rather than as a bitfield, avoiding the possibility of problems with different compilers.
>
> I don't think you should feel constrained to follow the current layout ... IMO the current one is good for years yet but probably not for decades.
> However, I do think that it's more sensible to have pointer, count, hash, and flags similar to the current GNUstep layout than to follow Apple (and to bear in mind that its convenient for mutable strings to share a layout with constant ones).

How about this:

struct {
        // Class pointer
        id isa;
        // Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated
        const char *data;
        // Number of characters, not including the null terminator
        long count;
        // Number of bytes in the encoding, not including the null terminator.
        long length;
        // Murmur 3 hash
        uint32_t hash;
        // Flags bitfield:
        // Low 2 bits, enum with values:
        //   0: ASCII string
        //   1: UTF-8 but not ASCII string
        //   2: UTF-16 string
        //   3: Reserved for future encodings
        // (1<<2): has murmur3 hash
        // (1<<3) to (1<<15): Reserved for future compiler-defined flags
        // (1<<16) to (1<<31): Reserved for use by the constant string class
}

I think that this should give everything that we need, plus room for easy future expansion.
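
For reference, a compilable C rendering of the layout above might look like the sketch below. It makes two assumptions that are not in the message: the flags word is declared as a `uint32_t` directly after `hash` (the comments describe it but no field is declared), and `void *` stands in for `id`. The enum and helper names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Flag values as described in the comments of the proposal. */
enum {
    CS_ENC_ASCII    = 0,       /* ASCII string */
    CS_ENC_UTF8     = 1,       /* UTF-8 but not ASCII */
    CS_ENC_UTF16    = 2,       /* UTF-16 */
    CS_ENC_RESERVED = 3,       /* reserved for future encodings */
    CS_ENC_MASK     = 0x3,     /* low 2 bits hold the encoding */
    CS_HAS_MURMUR3  = 1 << 2   /* hash field holds a Murmur3 hash */
};

struct constant_string {
    void       *isa;     /* class pointer (stands in for `id`) */
    const char *data;    /* null-terminated buffer in read-only data */
    long        count;   /* number of characters, excluding the terminator */
    long        length;  /* number of bytes, excluding the terminator */
    uint32_t    hash;    /* Murmur3 hash */
    uint32_t    flags;   /* assumed field; see the flag values above */
};

/* Returns the encoding stored in the low two flag bits. */
static inline unsigned cs_encoding(const struct constant_string *s)
{
    return s->flags & CS_ENC_MASK;
}

/* Returns non-zero if the compiler stored a Murmur3 hash. */
static inline int cs_has_murmur3(const struct constant_string *s)
{
    return (s->flags & CS_HAS_MURMUR3) != 0;
}
```

With 4-byte pointers and longs this structure is 24 bytes; with 8-byte pointers and longs it is 40 bytes, with no padding in either case.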

David



Re: New ABI NSConstantString

Ivan Vučica-2
Layman question: does it make sense to optimize for space, too, and have a smaller structure for tiny constant strings?

For 32bit ptrs and longs, this would be 20 bytes without the string itself. I don't think that's a lot, but I thought I'd ask.


Re: New ABI NSConstantString

Stefan Bidigaray
In reply to this post by David Chisnall-7
Hi David,
I forgot to make a comment when you originally posted the idea, and I think this would be a great time to add my 2 cents.

Regarding the structure:
* Would it not be better to add the flags bit field immediately after the isa pointer? My thought here is that it can be checked for if different versions of the structure exist. This is important for CoreBase since it does not have the luxury of real classes.
* Would it be possible to make the hash variable an NSUInteger? The output of -hash is an NSUInteger, and that would allow the value to be expanded in the future.
* Why have both count and length? Would it not make more sense to keep a single variable here called count and define it as, "The count/number of code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would be the # of 16-bit codes. The Apple documentation states "The number of UTF-16 code units in the receiver", making at least the ASCII and UTF-16 numbers correct. The way I understand the current implementation, the value for length would return the UTF-32 # of characters, which is inconsistent with the docs.
* I would also think that it makes more sense to have the length/count variable before the data pointer. I don't have a strong opinion about this one, but it just makes more sense in my head.

Regarding the hash function:
Why are we using Murmur3 hash? I know it is significantly more efficient than our current one-at-a-time approach, but how much better is it than competing hash functions? Is there a benchmark out there comparing some of the major ones? For example, how does it compare with lookup3 or SpookyHash? If we are storing the hash in the string structure, the speed of calculating the hash is not as important as the spread. Additionally, Murmur3 seems ill suited if NSUInteger is used to store the hash value since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words in the case of lookup3), as well.

I'm late for work, so I have to wrap up.

Stefan


Re: New ABI NSConstantString

David Chisnall-7
In reply to this post by Ivan Vučica-2
On 5 Apr 2018, at 17:01, Ivan Vučica <[hidden email]> wrote:
>
> Layman question: does it make sense to optimize for space, too, and have a smaller structure for tiny constant strings?

With the new ABI, we get much better deduplication across compilation units for selectors and protocols, which should extend to constant strings.

At run time, on 64-bit platforms, we generate GSTinyString instances, which are 64 bits and are hidden inside a pointer.  I’m tempted to make the compiler generate those directly.

> For 32bit ptrs and longs, this would be 20 bytes without the string itself. I don't think that's a lot, but I thought I'd ask.

20 bytes isn’t too bad, 36 (for 64-bit platforms) is a bit more.  On a CHERI-like platform, it grows to 52 bytes, which starts to feel a bit excessive.

The absolute minimum structure is an isa pointer immediately followed by the character data, with a null terminator.  That’s not a great idea, because the isa pointer needs to be mutable, which would make the constant string also accidentally mutable.

The next smallest would be an isa pointer and a null-terminated string pointer, so 8 / 16 / 32 bytes on the respective architectures.

The cost of recomputing the hash is sufficiently expensive that it’s probably worth using at least the 28 bits that we provide already for string hashes.  

I’ve done some measurements in -base.  In the compiled binary, we have a total of 84976 bytes of strings, in 3307 strings, so an average of just under 26 bytes per string, so 36 bytes of overhead seems quite a lot, and even 20 is quite noticeable.  If we exclude strings of 8 or fewer characters, this gives us 81637 bytes in 2586 strings, so an average length of just under 32 bytes, so 36 bytes is still more than 100% overhead and adds up to about 90KB in the final binary.  

With the current encoding, each constant string is 24 bytes, so that adds up to about 60KB (excluding the string data itself) on 64-bit platforms.  That’s about 0.5% of the total binary size, so I’m not too worried about making it bigger.  Even making it 80KB is a lot of overhead per string (roughly 100%), but isn’t that much of the total binary size.
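
The arithmetic behind these figures can be checked with a few lines of C. The string counts and byte totals are the ones quoted above; the helper names are made up for this example. The "about 60KB" and "about 90KB" totals appear to correspond to the 2586 strings longer than 8 characters.

```c
#include <stddef.h>

/* Average payload size in bytes for n strings totalling total_bytes. */
static double average_length(long total_bytes, long n)
{
    return (double)total_bytes / (double)n;
}

/* Total bytes spent on per-string headers, excluding the character data. */
static long header_overhead(long n, long header_bytes)
{
    return n * header_bytes;
}

/* Figures quoted in the message. */
enum {
    ALL_STRINGS        = 3307,
    ALL_STRING_BYTES   = 84976,
    LONG_STRINGS       = 2586,    /* strings longer than 8 characters */
    LONG_STRING_BYTES  = 81637
};
```

For example, `average_length(ALL_STRING_BYTES, ALL_STRINGS)` is about 25.7 and `average_length(LONG_STRING_BYTES, LONG_STRINGS)` is about 31.6, while `header_overhead(LONG_STRINGS, 24)` is 62064 bytes and `header_overhead(LONG_STRINGS, 36)` is 93096 bytes.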


David



Re: New ABI NSConstantString

David Chisnall-7
In reply to this post by Stefan Bidigaray
On 5 Apr 2018, at 17:27, Stefan Bidigaray <[hidden email]> wrote:
>
> Hi David,
> I forgot to make a comment when you originally posted the idea, and I think this would be a great time to add my 2 cents.
>
> Regarding the structure:
> * Would it not be better to add the flags bit field immediately after the isa pointer? My thought here is that it can be checked for if different versions of the structure exist. This is important for CoreBase since it does not have the luxury of real classes.

I’m concerned with structure padding here.  Even on a 64-bit platform, we either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of padding.  With 128-bit pointers (which are probably coming sooner than you expect) we will end up with 12 bytes of padding if we have a 32-bit flags field followed by a pointer.
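
The padding effect can be made visible with `offsetof`. This is a small sketch with hypothetical struct names, comparing a 32-bit flags word placed right after the isa pointer with one placed at the end next to a 32-bit hash; `void *` stands in for `id`.

```c
#include <stddef.h>
#include <stdint.h>

/* Flags directly after isa: the following pointer must be pointer-aligned,
 * so the compiler inserts padding after the 32-bit field. */
struct flags_first {
    void       *isa;
    uint32_t    flags;
    const char *data;
};

/* Flags at the end, paired with a 32-bit hash: the two 32-bit fields share
 * one 8-byte slot on 64-bit targets. */
struct flags_last {
    void       *isa;
    const char *data;
    uint32_t    hash;
    uint32_t    flags;
};

/* Bytes of padding between `flags` and `data` in struct flags_first. */
#define FLAGS_FIRST_PADDING \
    (offsetof(struct flags_first, data) - \
     (offsetof(struct flags_first, flags) + sizeof(uint32_t)))
```

On a typical 64-bit target `FLAGS_FIRST_PADDING` is 4, and on a target with 4-byte pointers it is 0; with 16-byte pointers the same expression would give 12.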

> * Would it be possible to make the hash variable a NSUInterger? The output of -hash is an NSUInterger, and that would allow the value to be expanded in the future.

We can, though that would again increase the size quite noticeably.  I think I’m happy with a 32-bit hash because, as rfm points out, with a decent hash algorithm that basically gives us unique hashes.

> * Why have both count and length? Would it not make more sense to keep a single variable here called count and define it as, "The count/number of code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would be the # of 16-bit codes. The Apple documentation states "The number of UTF-16 code units in the receiver", making at least the ASCII and UTF-16 numbers correct. The way I understand the current implementation, the value for length would return the UTF-32 # of characters, which is inconsistent with the docs.

If a UTF-8 string contains multi-byte sequences, then the length of the buffer and the number of UTF-16 code units will be different.  If we know the number of bytes, then we can use more efficient C standard library functions for things like comparisons, though that may not be important.
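
To make the difference concrete, the following C sketch counts the UTF-16 code units that a UTF-8 buffer would occupy. It assumes the input is already valid UTF-8 and does no validation; the function name is made up for this example.

```c
#include <stddef.h>

/* Counts the UTF-16 code units needed to represent a valid UTF-8 buffer.
 * Each sequence contributes one unit, except 4-byte sequences (code points
 * above U+FFFF), which need a surrogate pair and so contribute two. */
static size_t utf16_units_in_utf8(const unsigned char *s, size_t nbytes)
{
    size_t units = 0;
    for (size_t i = 0; i < nbytes; i++) {
        unsigned char b = s[i];
        if ((b & 0xC0) == 0x80) {
            continue;                   /* continuation byte, already counted */
        }
        units += (b >= 0xF0) ? 2 : 1;   /* lead byte of a 4-byte sequence? */
    }
    return units;
}
```

For pure ASCII the two numbers agree, but "héllo" is 6 bytes of UTF-8 and 5 UTF-16 code units, and a single character outside the Basic Multilingual Plane is 4 bytes and 2 code units.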

> * I would also think that it makes more sense to have the length/count variable before the data pointer. I don't have a strong opinion about this one, but it just makes more sense in my head.

Again, this gives us more padding in the structure.

>
> Regarding the hash function:
> Why are we using Murmur3 hash? I know it is significantly more efficient than our current one-at-a-time approach, but how much better is it to competing hash functions? Is there a bench mark out there comparing some of the major ones? For example, how does it compare with lookup3 or SpookyHash. If we are storing the hash in the string structure, the speed of calculating the hash is not as important as the spread. Additionally, Murmur3 seems ill suited if NSUInteger is used to store the hash value since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words in the case of lookup3), as well.

The size of the type doesn’t necessarily give us the range.  We are completely free to give only a 32-bit or even 28-bit range within an NSUInteger (which is what we do now), provided we have good coverage.  A good hash function has even distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to make the hash value 8 bytes on 64-bit platforms if this seems like a good use of bits.
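
As an illustration of truncation, here is a sketch of the 32-bit MurmurHash3 variant (x86_32) with its result masked down to 28 bits. The seed of 0 and the `hash28` helper are assumptions for this example, not what -base or the compiler necessarily use; input bytes are assembled little-endian so the result does not depend on host byte order.

```c
#include <stdint.h>
#include <stddef.h>

static inline uint32_t rotl32(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

/* MurmurHash3, x86_32 variant. */
static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed)
{
    const uint32_t c1 = 0xcc9e2d51u, c2 = 0x1b873593u;
    uint32_t h = seed;
    size_t nblocks = len / 4;

    /* Body: mix one 4-byte block at a time. */
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t k = (uint32_t)key[4 * i]
                   | (uint32_t)key[4 * i + 1] << 8
                   | (uint32_t)key[4 * i + 2] << 16
                   | (uint32_t)key[4 * i + 3] << 24;
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;  h = rotl32(h, 13); h = h * 5 + 0xe6546b64u;
    }

    /* Tail: mix the remaining 0-3 bytes. */
    const uint8_t *tail = key + nblocks * 4;
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= (uint32_t)tail[2] << 16; /* fall through */
    case 2: k ^= (uint32_t)tail[1] << 8;  /* fall through */
    case 1: k ^= tail[0];
            k *= c1; k = rotl32(k, 15); k *= c2;
            h ^= k;
    }

    /* Finalisation: force all bits of the state to avalanche. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Truncates the 32-bit hash to the 28-bit range currently stored by GSString. */
static uint32_t hash28(const uint8_t *key, size_t len)
{
    return murmur3_32(key, len, 0) & 0x0FFFFFFFu;
}
```

Masking simply discards the top four bits; if the hash mixes well, the remaining 28 bits should be as evenly distributed as the full value.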

I’m not wedded to the idea of Murmur3.  We do need to use the same hash for constant and non-constant strings, so execution speed is important.  I’m somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with SSE and newer CPUs have full hardware offload for it.  That said, the goal is not to mandate the use of the compiler-generated hash for constant strings, it’s to provide a space to store one that the compiler initialises to something sensible.

Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming space to improve performance.

David

Re: New ABI NSConstantString

Ivan Vučica-2
In reply to this post by David Chisnall-7
Thank you, this was very informative!


Re: New ABI NSConstantString

Stefan Bidigaray
In reply to this post by David Chisnall-7
On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <[hidden email]> wrote:
On 5 Apr 2018, at 17:27, Stefan Bidigaray <[hidden email]> wrote:
>
> Hi David,
> I forgot to make a comment when you originally posted the idea, and I think this would be a great time to add my 2 cents.
>
> Regarding the structure:
> * Would it not be better to add the flags bit field immediately after the isa pointer? My thought here is that it can be checked for if different versions of the structure exist. This is important for CoreBase since it does not have the luxury of real classes.

I’m concerned with structure padding here.  Even on a 64-bit platform, we either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of padding.  With 128-bit pointers (which are probably coming sooner than you expect) we will end up with 12 bytes of padding if we have a 32-bit flags field followed by a pointer.

Well, I was hoping there is a way we can define this structure so that it can be used directly in CoreBase, without having to call the toll-free bridging mechanism. If a 32-bit hash is used, could it be combined with the "flags" variable (see the structure I included at the end of this email)? I'm hoping to be able to have use the same constant strings without having to call the bridging mechanism. It's pretty slow and cumbersome.

By the way, I noticed there was not uint32_t flags in your original structure, making it 24 bytes in 32-bit CPUs.

> * Would it be possible to make the hash variable a NSUInterger? The output of -hash is an NSUInterger, and that would allow the value to be expanded in the future.

We can, though that would again increase the size quite noticeably.  I think I’m happy with a 32-bit hash, because as rfm points out with a decent hash algorithm that basically gives us unique hashes.

Sounds reasonable.
 
> * Why have both count and length? Would it not make more sense to keep a single variable here called count and define it as, "The count/number of code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would be the # of 16-bit codes. The Apple documentation states "The number of UTF-16 code units in the receiver", making at least the ASCII and UTF-16 numbers correct. The way I understand the current implementation, the value for length would return the UTF-32 # of characters, which is inconsistent with the docs.

If a UTF-8 string contains multi-byte sequences, then the length of the buffer and the number if UTF-16 code units will be different.  If we know the number of bytes, then we can use more efficient C standard library functions for things like comparisons, though that may not be important.

I guess I'm still a bit confused about the meaning and/or different of the variables count and length.

I know this is probably going to be rejected, but how about making constant string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this would increase the byte count for most European languages using Latin characters, but I don't see the point of maintaining both UTF-8 and UTF-16 encoding. Everything that can be done with UTF-16 can be encoded in UTF-8 (and vise-versa), so how would the compiler pick between the two? Additionally, wouldn't sticking to just 1 of the 2 encoding simplify the code significantly?

> * I would also think that it makes more sense to have the length/count variable before the data pointer. I don't have a strong opinion about this one, but it just makes more sense in my head.

Again, this gives us more padding in the structure.

Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit architectures (except WIN64)? I thought a long would not be padded any more than a pointer for most applications.

>
> Regarding the hash function:
> Why are we using Murmur3 hash? I know it is significantly more efficient than our current one-at-a-time approach, but how much better is it than competing hash functions? Is there a benchmark out there comparing some of the major ones? For example, how does it compare with lookup3 or SpookyHash? If we are storing the hash in the string structure, the speed of calculating the hash is not as important as the spread. Additionally, Murmur3 seems ill suited if NSUInteger is used to store the hash value since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words in the case of lookup3), as well.

The size of the type doesn’t necessarily give us the range.  We are completely free to give only a 32-bit or even 28-bit range within an NSUInteger (which is what we do now), as long as we have good coverage.  A good hash function has even distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to make the hash value 8 bytes on 64-bit platforms if this seems like a good use of bits.
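
To make the truncation point concrete, here is a minimal sketch. The function is the widely published Jenkins one-at-a-time hash; whether it matches the exact variant -base uses today is an assumption, since the thread only names the algorithm. `truncate_hash28` is a hypothetical helper that keeps the low 28 bits, as a 28-bit field would.

```c
#include <stddef.h>
#include <stdint.h>

/* Jenkins one-at-a-time hash over a byte buffer (published algorithm;
   not necessarily identical to the variant used in -base). */
static uint32_t one_at_a_time(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Hypothetical helper: keep only the low 28 bits of a 32-bit hash. */
static uint32_t truncate_hash28(uint32_t h)
{
    return h & 0x0FFFFFFFu;
}
```

If the hash mixes its bits well, any 28 of the 32 bits are as good as any other, so masking loses range but not quality.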

I’m not wedded to the idea of Murmur3.  We do need to use the same hash for constant and non-constant strings, so execution speed is important.  I’m somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with SSE and newer CPUs have full hardware offload for it.  That said, the goal is not to mandate the use of the compiler-generated hash for constant strings, it’s to provide a space to store one that the compiler initialises to something sensible.

Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming space to improve performance.

I agree.

So how about a structure like:

struct {
        id isa; /* Class pointer */
        uint64_t flags;
        /* Flags bitfield:
           Low 2 bits, enum with values:
           0: ASCII string
           1: UTF-16 string
           2 and 3: Reserved for future encodings
           (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3 reserved for future hashes
           (1<<4) to (1<<15): Reserved for future compiler-defined flags
           (1<<16) to (1<<31): Reserved for use by the constant string class (I'm hoping this could hold the CFTypeID of a constant string so it can be identified by corebase)
           (1<<32) to (1<<63): hash
        */
        const char *data; /* Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated */
        long count;  /* Number of UTF-16 code units, not including the null terminator */
}

It's 20 bytes on 32-bit CPUs and 32 bytes on 64-bit CPUs.
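
For illustration, the bit assignments in the comment above can be read back with plain shifts and masks. This is only a sketch of the proposal as written; all of the names below are hypothetical and nothing here is an agreed ABI.

```c
#include <stdint.h>

/* Encodings stored in the low 2 bits (values from the proposal above). */
enum { STR_ASCII = 0, STR_UTF16 = 1 };
/* Hash algorithm stored in bits 2-3. */
enum { HASH_ONE_AT_A_TIME = 0, HASH_MURMUR = 1 };

static unsigned flags_encoding(uint64_t f)   { return (unsigned)(f & 0x3u); }
static unsigned flags_hash_alg(uint64_t f)   { return (unsigned)((f >> 2) & 0x3u); }
/* Bits 16-31: reserved for the constant string class (e.g. a CFTypeID). */
static uint16_t flags_class_bits(uint64_t f) { return (uint16_t)((f >> 16) & 0xFFFFu); }
/* Bits 32-63: the 32-bit hash. */
static uint32_t flags_hash(uint64_t f)       { return (uint32_t)(f >> 32); }

/* Build a flags word; bits 4-15 (compiler-reserved) are left as zero. */
static uint64_t flags_make(unsigned enc, unsigned alg, uint16_t cls, uint32_t hash)
{
    return (uint64_t)(enc & 0x3u)
         | ((uint64_t)(alg & 0x3u) << 2)
         | ((uint64_t)cls << 16)
         | ((uint64_t)hash << 32);
}
```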


David

Stefan

Re: New ABI NSConstantString

David Chisnall-7
This might be slightly confusing, because your mail client doesn’t seem to do anything sane for quoting:

On 5 Apr 2018, at 20:09, Stefan Bidigaray <[hidden email]> wrote:

>
> On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <[hidden email]> wrote:
> On 5 Apr 2018, at 17:27, Stefan Bidigaray <[hidden email]> wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I think this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after the isa pointer? My thought here is that it can be checked for if different versions of the structure exist. This is important for CoreBase since it does not have the luxury of real classes.
>
> I’m concerned with structure padding here.  Even on a 64-bit platform, we either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of padding.  With 128-bit pointers (which are probably coming sooner than you expect) we will end up with 12 bytes of padding if we have a 32-bit flags field followed by a pointer.
>
> Well, I was hoping there is a way we can define this structure so that it can be used directly in CoreBase, without having to call the toll-free bridging mechanism. If a 32-bit hash is used, could it be combined with the "flags" variable (see the structure I included at the end of this email)? I'm hoping to be able to use the same constant strings without having to call the bridging mechanism. It's pretty slow and cumbersome.

Can you explain why CoreBase needs to store the hash as anything other than a 32-bit value that it can zero extend when returning a 64-bit value?  If the CoreFoundation and Foundation implementations of hash are compatible, then it will currently be returning a 28-bit value in a 64-bit register, so I don’t understand the issue here.

>
> By the way, I noticed there was not uint32_t flags in your original structure, making it 24 bytes in 32-bit CPUs.
>
> > * Would it be possible to make the hash variable an NSUInteger? The output of -hash is an NSUInteger, and that would allow the value to be expanded in the future.
>
> We can, though that would again increase the size quite noticeably.  I think I’m happy with a 32-bit hash, because as rfm points out with a decent hash algorithm that basically gives us unique hashes.
>
> Sounds reasonable.
>  
> > * Why have both count and length? Would it not make more sense to keep a single variable here called count and define it as, "The count/number of code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would be the # of 16-bit codes. The Apple documentation states "The number of UTF-16 code units in the receiver", making at least the ASCII and UTF-16 numbers correct. The way I understand the current implementation, the value for length would return the UTF-32 # of characters, which is inconsistent with the docs.
>
> If a UTF-8 string contains multi-byte sequences, then the length of the buffer and the number of UTF-16 code units will be different.  If we know the number of bytes, then we can use more efficient C standard library functions for things like comparisons, though that may not be important.
>
> I guess I'm still a bit confused about the meaning of, and the difference between, the variables count and length.

One tells you the logical number of characters, the other the length of the buffer in bytes.  A lot of byte-scanning functions are far more efficient if they know the length up front, because they can then process one word at a time until the last word.
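
As a rough illustration of why the two numbers differ for UTF-8 data, the sketch below computes both for a NUL-terminated buffer. It assumes the input is valid UTF-8, and the function names are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Number of UTF-16 code units needed to represent a valid,
   NUL-terminated UTF-8 string. */
static size_t utf16_units_in_utf8(const char *s)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                  /* continuation byte, already counted */
        units += (*p >= 0xF0) ? 2 : 1; /* 4-byte sequence needs a surrogate pair */
    }
    return units;
}

/* Length of the buffer in bytes, not including the terminator. */
static size_t utf8_byte_length(const char *s)
{
    return strlen(s);
}
```

For pure ASCII the two values are equal; for "café" encoded as UTF-8 the buffer is 5 bytes but holds 4 UTF-16 code units.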

> I know this is probably going to be rejected, but how about making constant strings either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this would increase the byte count for most European languages using Latin characters, but I don't see the point of maintaining both UTF-8 and UTF-16 encodings. Everything that can be encoded in UTF-16 can be encoded in UTF-8 (and vice versa), so how would the compiler pick between the two? Additionally, wouldn't sticking to just 1 of the 2 encodings simplify the code significantly?

There’s also the issue that -UTF8String is one of the most commonly used methods on NSString, so if we represent something as UTF-16 internally then it needs converting and returning in an autoreleased buffer, whereas with a UTF-8 string it can just return the pointer.  On non-Windows platforms, -UTF8String is the way of getting a string that you pass to pretty much any OS function.

>
> > * I would also think that it makes more sense to have the length/count variable before the data pointer. I don't have a strong opinion about this one, but it just makes more sense in my head.
>
> Again, this gives us more padding in the structure.
>
> Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit architectures (except WIN64)? I thought a long would not be padded any more than a pointer for most applications.

Not on Win64, and not on anything with pointers larger than 64 bits.

> >
> > Regarding the hash function:
> > Why are we using Murmur3 hash? I know it is significantly more efficient than our current one-at-a-time approach, but how much better is it than competing hash functions? Is there a benchmark out there comparing some of the major ones? For example, how does it compare with lookup3 or SpookyHash? If we are storing the hash in the string structure, the speed of calculating the hash is not as important as the spread. Additionally, Murmur3 seems ill suited if NSUInteger is used to store the hash value since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words in the case of lookup3), as well.
>
> The size of the type doesn’t necessarily give us the range.  We are completely free to give only a 32-bit or even 28-bit range within an NSUInteger (which is what we do now), as long as we have good coverage.  A good hash function has even distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and truncating it is fine.  That said, I’m happy to make the hash value 8 bytes on 64-bit platforms if this seems like a good use of bits.
>
> I’m not wedded to the idea of Murmur3.  We do need to use the same hash for constant and non-constant strings, so execution speed is important.  I’m somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with SSE and newer CPUs have full hardware offload for it.  That said, the goal is not to mandate the use of the compiler-generated hash for constant strings, it’s to provide a space to store one that the compiler initialises to something sensible.
>
> Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming space to improve performance.
>
> I agree.
>
> So how about a structure like:
>
> struct {
>         id isa; /* Class pointer */
>         uint64_t flags;
>         /* Flags bitfield:
>            Low 2 bits, enum with values:
>            0: ASCII string
>            1: UTF-16 string
>            2 and 3: Reserved for future encodings
>            (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3 reserved for future hashes
>            (1<<4) to (1<<15): Reserved for future compiler-defined flags
>            (1<<16) to (1<<31): Reserved for use by the constant string class (I'm hoping this could hold the CFTypeID of a constant string so it can be identified by corebase)
>            (1<<32) to (1<<63): hash
>         */
>         const char *data; /* Pointer to the buffer.  ro_data section, so immutable.  NULL-terminated */
>         long count;  /* Number of UTF-16 code units, not including the null terminator */
> }

I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts and I don’t like the ordering (it will be annoying to have to order the fields differently on 128-bit pointer platforms).  I’m not convinced that it’s worth omitting the length to save 8 bytes per string.  It’s probably also not actually worth using longs for the length on 64-bit platforms, so both of these should probably be 32 bits.  4GB of string literal seems a bit excessive (for one thing, I doubt the compiler will be entirely happy with it, and I don’t know how happy linkers are with 4GB symbols…).
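
One practical difference between a single uint64_t and a pair of uint32_t fields is byte order: which half of a packed 64-bit word sits first in memory depends on the host. The sketch below is only an illustration with hypothetical names. It packs two values into one word and reads back the first four bytes, which is what code expecting a 32-bit field at that offset would see.

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit values into one 64-bit word:
   info in the low half, hash in the high half. */
static uint64_t pack_info_hash(uint32_t info, uint32_t hash)
{
    return ((uint64_t)hash << 32) | info;
}

/* The 32-bit value stored in the first four bytes of the packed word. */
static uint32_t first_word_in_memory(uint64_t packed)
{
    uint32_t first;
    memcpy(&first, &packed, sizeof first);
    return first;
}

/* Runtime probe for the host byte order. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char low_byte;
    memcpy(&low_byte, &probe, 1);
    return low_byte == 1;
}
```

On a little-endian host the first four bytes hold the low half (info); on a big-endian host they hold the high half (hash). Two separate uint32_t fields have the same order everywhere.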

David


Re: New ABI NSConstantString

Stefan Bidigaray
I use the gmail web interface, which is not great. I'll just comment without quoting.

The thing I'm trying to address is the fact that all CF objects must start with:
struct {
        void *isa;
        uint32_t info;
};
That 32-bit info value includes the CFTypeID (a 16-bit value) and 16-bit for general/restricted use.

If that 32-bit (or it could be 64-bit) field could be the same for constant strings, it would allow CFString functions to work directly with ObjC constant strings, instead of having to call the toll-free bridging mechanism. That would be much more efficient for container objects in corebase.

Just to be clear, the CFString structure is currently:
struct {
        void *isa;
        uint32_t info;
        char *data;
        long count;
        long hash;
        void *allocator;
};

If the ObjC constant string structure and the CFString structure were similar, they could be used interchangeably in corebase and base.

So my proposal was to arrange the topmost portion of the new constant string structure as:
struct {
        void *isa;
        uint64_t info; /* includes both info and hash */
        char *data;
        long count;
};

If I modified the corebase version to match, these structures, with a little help from libobjc, could be exactly the same.

Re: New ABI NSConstantString

David Chisnall-7
On 6 Apr 2018, at 00:25, Stefan Bidigaray <[hidden email]> wrote:
>
> I use the gmail web interface, which is not great. I'll just comment without quoting.
>
> The thing I'm trying to address is the fact that all CF objects must start with:
> struct {
>         void *isa;
>         uint32_t info;
> };
> That 32-bit info value includes the CFTypeID (a 16-bit value) and 16-bit for general/restricted use.

Which 16 bits are the CFTypeID and which are spare?  Apple (from their open source release) appears to use a 12-bit TypeID (which indexes into a 10-bit table, so leaves two bits spare) and uses the rest for the ref count.

> If that 32-bit (or it could be 64-bit) field could be the same for constant strings, it would allow CFString functions to work directly with ObjC constant strings, instead of having to call the toll-free bridging mechanism. That would be much more efficient for container objects in corebase.
>
> Just to be clear, the CFString structure is currently:
> struct {
>         void *isa;
>         uint32_t info;
>         char *data;
>         long count;
>         long hash;
>         void *allocator;
> };
>
> If the ObjC constant string structure and the CFString structure were similar, they could be used interchangeably in corebase and base.
>
> So my proposal was to arrange the topmost portion of the new constant string structure as:
> struct {
>         void *isa;
>         uint64_t info; /* includes both info and hash */
>         char *data;
>         long count;
> };
>
> If I modified the corebase version to match, these structures, with a little help from libobjc, could be exactly the same.

I’d prefer not to pack too many unrelated things into a uint64_t (particularly because that will break things on big-endian platforms), so how about:

struct
{
        Class isa;
        uint32_t flags;
        uint32_t count;
        uint32_t length;
        uint32_t hash;
        const char *data;
};

That gives us 24 bytes on 32-bit, 32 bytes on 64-bit, and 48 bytes on 128-bit, with no padding on any architecture.
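
The no-padding claim can be checked mechanically. In this sketch `const void *` stands in for `Class` so that it compiles as plain C without the Objective-C runtime headers; the struct tag and helper name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the proposed layout. */
struct const_string {
    const void *isa;
    uint32_t    flags;
    uint32_t    count;
    uint32_t    length;
    uint32_t    hash;
    const char *data;
};

/* Sum of the member sizes; this equals sizeof(struct const_string)
   only when the compiler inserted no padding. */
static size_t const_string_packed_size(void)
{
    return sizeof(const void *) + 4 * sizeof(uint32_t) + sizeof(const char *);
}
```

The four 32-bit fields add up to 16 bytes, a multiple of the pointer size on 32-bit and 64-bit targets, so the trailing pointer starts on an aligned offset without any filler. The total is 16 bytes plus two pointers.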

Does CoreBase have any issues using GSTinyStrings?  Presumably it has to put up with the fact that they might be generated at run time and handle them already?

David


Re: New ABI NSConstantString

David Chisnall-7
In reply to this post by Fred Kiefer
On 1 Apr 2018, at 11:36, Fred Kiefer <[hidden email]> wrote:

>
> Wouldn’t the most useful structure be the one we already use for GSString?
>
> @interface GSString : NSString
> {
> @public
>  GSCharPtr _contents;
>  unsigned int _count;
>  struct {
>    unsigned int wide: 1; // 16-bit characters in string?
>    unsigned int owned: 1; // Set if the instance owns the
> // _contents buffer
>    unsigned int unused: 2;
>    unsigned int hash: 28;
>  } _flags;
> }
> @end
>
> Of course constant strings won’t require the hidden reference count that comes with all ObjC objects. But apart from that it seems to be a more useful structure. Storing the length with the string should speed up some common operations and 28 bits of hash should still be enough. There are even two unused bits in the flags that could encode the specific hash function.

It would probably help catch more bugs if we made use of NSString’s class-cluster nature more in -base.  I have just fixed a bug in GSString where we were checking one object matched a particular class before dereferencing the _flags ivar of the other.  I caught this because the other was a GSTinyString, which is almost never a valid pointer.

Prior to this, we were checking whatever data happened to be in the wide byte and, if the other string happened to have the _contents array in the same place we were doing something that probably wouldn’t crash but may or may not give the correct answer.

I don’t know if we have other bugs of this nature hidden by the fact that 99% of the time we’re using strings with the same structure.

David


Re: New ABI NSConstantString

Stefan Bidigaray
In reply to this post by David Chisnall-7
Hi David,
I'm not entirely sure how Apple handles the type ID field. At the time I thought 16 bits was a decent size and ran with it.

As for which portion is used for the type id, I currently have it split into 2 uint16_t. But I was planning on doing the following:
struct {
       void *isa;
       uint32_t info;
       /* (1<<0) to (1<< 15): TypeID
          (1<<16) to (1<<31): reserved info
       */
};
I do not have to keep this order, so if you would like to use the lower 16 bits for string info, I'm good with that, too. Whatever happens, I'm going to have to modify the corebase code, anyway. That whole thing needs some TLC.
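
As a sketch of the split described above (low 16 bits for the TypeID, high 16 bits reserved), the accessors could look like this. The function names are hypothetical and the bit order is only the one planned in this message, not a settled layout.

```c
#include <stdint.h>

/* Bits 0-15 of the info word: the TypeID. */
static uint16_t info_type_id(uint32_t info)
{
    return (uint16_t)(info & 0xFFFFu);
}

/* Bits 16-31 of the info word: reserved info. */
static uint16_t info_reserved(uint32_t info)
{
    return (uint16_t)(info >> 16);
}

/* Combine a TypeID and reserved bits into one 32-bit info word. */
static uint32_t info_make(uint16_t type_id, uint16_t reserved)
{
    return ((uint32_t)reserved << 16) | type_id;
}
```

Swapping the two halves would only change the shift and mask constants, which is why the order is easy to renegotiate.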

As for the structure you put up, I'm OK with that.

Corebase has no issues with any ObjC object, including GSTinyString. The first thing the functions do is call object_getClass(), and compare the result with the classes registered for toll-free bridging. So a GSTinyString would come back as a non-bridged class, and the related ObjC method would be called. In the future, I would like to handle tiny strings directly.
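
The dispatch decision described above can be sketched in plain C. Everything here is hypothetical: the registry, the function names and the dummy class tokens are stand-ins for illustration and do not reflect corebase's actual data structures.

```c
#include <stddef.h>

#define MAX_BRIDGED 8

/* Hypothetical registry of class pointers that are toll-free bridged. */
static const void *bridged_classes[MAX_BRIDGED];
static size_t bridged_count;

/* Dummy stand-ins for real class pointers. */
static const int constant_string_class_token = 0;
static const int tiny_string_class_token = 0;

/* Returns 1 on success, 0 if the registry is full. */
static int register_bridged_class(const void *cls)
{
    if (bridged_count == MAX_BRIDGED)
        return 0;
    bridged_classes[bridged_count++] = cls;
    return 1;
}

/* Returns 1 if the class is bridged (the CF code may read the structure
   directly), 0 if the call should go to the Objective-C method instead. */
static int is_bridged_class(const void *cls)
{
    for (size_t i = 0; i < bridged_count; i++)
        if (bridged_classes[i] == cls)
            return 1;
    return 0;
}
```

A class that was never registered, such as the tiny string stand-in here, simply falls through to the message-send path.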

Re: New ABI NSConstantString

David Chisnall-7
In reply to this post by Stefan Bidigaray
On 5 Apr 2018, at 20:09, Stefan Bidigaray <[hidden email]> wrote:
>
> I know this is probably going to be rejected, but how about making constant strings either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this would increase the byte count for most European languages using Latin characters, but I don't see the point of maintaining both UTF-8 and UTF-16 encodings. Everything that can be encoded in UTF-16 can be encoded in UTF-8 (and vice versa), so how would the compiler pick between the two? Additionally, wouldn't sticking to just 1 of the 2 encodings simplify the code significantly?

I am leaning in this direction.  The APIs all want UTF-16 code units.  In ASCII, each character is precisely one UTF-16 code unit.  In UTF-16, every two-byte value is a UTF-16 code unit.  In UTF-8, a UTF-16 code unit corresponds to somewhere between 1 and 3 bytes and the mapping is complicated.  It’s a shame that in the 64-bit transition Apple didn’t make unichar 32 bits and make it a unicode character, so we’re stuck in the same situation as Windows with a hasty s/UCS2/UTF-16/ and an attempt to make the APIs keep working.

My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide later if we want to support generating UTF-8 and UTF-32.  I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.

David


Re: New ABI NSConstantString

Ivan Vučica-2
On Sat, Apr 7, 2018, 09:50 David Chisnall <[hidden email]> wrote:


My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide later if we want to support generating UTF-8 and UTF-32.  I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.

Emojis don't fit UTF-16. Even if one dismisses CJK, ancient scripts etc, constant strings are not absolutely unlikely to contain emojis.

Not supporting UTF-8 for internal storage may be reasonable, but not supporting UTF-32 for strings that require it seems like a bug.


Re: New ABI NSConstantString

David Chisnall-7
On 7 Apr 2018, at 10:21, Ivan Vučica <[hidden email]> wrote:
>
> On Sat, Apr 7, 2018, 09:50 David Chisnall <[hidden email]> wrote:
>
>
> My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide later if we want to support generating UTF-8 and UTF-32.  I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.
>
> Emojis don't fit UTF-16. Even if one dismisses CJK, ancient scripts etc, constant strings are not absolutely unlikely to contain emojis.
>
> Not supporting UTF-8 for internal storage may be reasonable, but not supporting UTF-32 for strings that require it seems like a bug.

UTF-32 is not more expressive than UTF-16, and it’s not even more efficient than UTF-16: every Unicode character can be expressed in either one or two UTF-16 code units, so in the worst case you need the same number of bytes to express a Unicode character in UTF-16 as in UTF-32, and in the best case you need half as many.  The only advantage that UTF-32 has is of being a fixed-length encoding, but that isn’t actually very helpful when the APIs all refer to UTF-16 code units (and UTF-32 is not a fixed-length encoding of UTF-16 code units).
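
The size argument can be checked with a small C sketch that totals the storage needed for a sequence of code points in each encoding. The function names are hypothetical and the input is assumed to contain only valid Unicode scalar values.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n code points as UTF-16: 2 bytes for anything in the
 * Basic Multilingual Plane, 4 bytes (a surrogate pair) for anything above it. */
size_t utf16_storage_bytes(const uint32_t *cps, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        bytes += (cps[i] > 0xFFFF) ? 4 : 2;
    }
    return bytes;
}

/* Bytes needed to store n code points as UTF-32: always 4 bytes each. */
size_t utf32_storage_bytes(const uint32_t *cps, size_t n)
{
    (void)cps; /* contents do not matter for a fixed-width encoding */
    return n * 4;
}
```

For any input the UTF-16 total is at most the UTF-32 total, and the two are equal only when every code point lies outside the Basic Multilingual Plane.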

David



Re: New ABI NSConstantString

Richard Frith-Macdonald-9
In reply to this post by Ivan Vučica-2


> On 7 Apr 2018, at 10:21, Ivan Vučica <[hidden email]> wrote:
>
> On Sat, Apr 7, 2018, 09:50 David Chisnall <[hidden email]> wrote:
>
>
> My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide later if we want to support generating UTF-8 and UTF-32.  I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.
>
> Emojis don't fit UTF-16. Even if one dismisses CJK, ancient scripts etc, constant strings are not absolutely unlikely to contain emojis.
>
> Not supporting UTF-8 for internal storage may be reasonable, but not supporting UTF-32 for strings that require it seems like a bug.

Everything fits in UTF-16 (or UTF-8 for that matter).  However, it's true that many/most emojis don't fit in a *single* 16-bit value and require two 16-bit UTF-16 values (or multiple 8-bit UTF-8 values) to encode them.
Since the NSString APIs assume a 16-bit character width, an emoji will generally be treated as two characters as far as they are concerned, but that's not really a problem, and current gnustep-base can/does work for emojis (for instance, sending UTF-16 to mobile phones).
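
For reference, the two-value encoding works as sketched below in C. The function name is hypothetical, and the code assumes its argument is a valid Unicode scalar value (at most U+10FFFF and not itself a surrogate).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes one or two 16-bit values to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                 /* fits in a single 16-bit value */
        return 1;
    }
    cp -= 0x10000;                             /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));  /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1F600 (a grinning-face emoji) encodes as the pair 0xD83D 0xDE00, so a string holding only that emoji reports a length of 2 through APIs that count 16-bit values.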

