Note: Because some browsers do not support all of these emoji, every emoji in this article is presented as a picture, and some of those in the code samples may not be displayed correctly. If you are curious how they render in your browser, open Full Emoji Data, v4.0 — Beta for the full emoji list.

When I got the newest beta of iOS 10, I found that Apple had redrawn many emoji, and that some of them now come in versions for both genders. For example, the emoji I usually use now has a male version.

I wondered how it would be displayed on a device running iOS 9, so I sent it to someone. Interestingly, on the iOS 9 device the new emoji was displayed as two characters: the original emoji followed by ‘♂’.

…well

Both the String and Character types in Swift store Unicode, so I knew this had to have something to do with Unicode.

With some basic googling, I found a blog post on the Unicode Consortium’s site:

“A proposed update of UTR #51, Unicode Emoji (Version 4.0) is available for public review and feedback. This new version covers a total of 2,243 emoji, an increase from the 1,788 in Version 3.0.

There are several important changes in the proposed update. Three existing symbols have been newly classified as emoji: U+2640 FEMALE SIGN, U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS. These are used in sequences to represent additional professions and to make gender distinctions among emoji.”

Excerpt From: Unicode.org, Proposed Update UTR #51, Unicode Emoji (Version 4.0)

So this symbol ‘♂’ should be U+2642, the male sign.

It looks like the new male version is indeed a combination of two Unicode symbols. So a Character is not a character anymore?

Character

This is the definition of Character in Swift. It stores something called an ‘extended grapheme cluster’.

public struct Character : ExpressibleByExtendedGraphemeClusterLiteral, Hashable {
  // ...
}

“Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.”

Excerpt From: Apple Inc. “The Swift Programming Language (Swift 3 beta).” iBooks. https://itun.es/us/k5SW7.l
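As a small sketch of what that quote means in practice (the constant names are my own, not from Apple’s book), a single Character can be made of more than one Unicode scalar:

```swift
// One Character built from two Unicode scalars:
// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
let accented: Character = "e\u{0301}"

// Wrap the Character in a String to look at the scalars it is made of.
let scalarValues = String(accented).unicodeScalars.map { $0.value }

print(scalarValues.count)  // 2
print(scalarValues)        // [101, 769]
```

The two numbers printed are simply 0x65 and 0x301 in decimal.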

Now for a brief introduction to how Unicode handles emoji (after this, the Character in your mind will no longer be the character you expected).

Unicode

“Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.”

Excerpt From: Wikipedia.org, Unicode

So how does Unicode handle those characters?

Code Point

Unicode assigns a unique code point to each character. Every code point is an integer value, normally written in the form “U+hex”, such as U+0061 for the character ‘a’ or U+01F64B for the emoji ‘🙋’. Code points currently range from U+0000 to U+10FFFF, so each one fits in 21 bits.
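To see those integer values for yourself, here is a minimal sketch. The helper name codePointLabel is made up for this illustration and is not part of the Swift standard library.

```swift
// Format a Unicode scalar in the conventional "U+hex" notation,
// padded to at least four hex digits.
func codePointLabel(_ scalar: UnicodeScalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padding = String(repeating: "0", count: max(0, 4 - hex.utf8.count))
    return "U+" + padding + hex
}

for scalar in "a\u{1F64B}".unicodeScalars {
    print(codePointLabel(scalar))
}
// U+0061
// U+1F64B

// Values above U+10FFFF are not code points, so the failable initializer returns nil.
print(UnicodeScalar(0x10FFFF as UInt32) != nil)  // true
print(UnicodeScalar(0x110000 as UInt32) == nil)  // true
```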

Sequence

Besides using a single code point for a character, Unicode also uses sequences of two or more code points to represent some abstract characters.

Each code point in a sequence plays its own role: Base, Variation, Modifier, ZWJ, or Combining Mark.

Base

The base character in a sequence is the one that the subsequent modifier acts on. Not all characters can be modified in the same way.

Variation

Some characters have two different presentations: one looks like an emoji, and the other looks more like a text character. For example, ‘☘︎’ and ‘☘️’ are just different presentations of the same U+2618 character.

"\u{2618}\u{FE0E}" ==> "☘︎"
"\u{2618}\u{FE0F}" ==> "☘️"

// \u{hex} is how Swift writes a Unicode code point

This code sample is our first sequence. Each hex number inside \u{} is a code point, and the second code point determines which presentation of the first one is used.

U+FE0E VARIATION SELECTOR-15 (VS15) for a text presentation
U+FE0F VARIATION SELECTOR-16 (VS16) for an emoji presentation

Excerpt From: Unicode.org, Unicode® Technical Report #51

The prettier of the two symbols is the so-called ‘emoji presentation’.
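The same idea as a small sketch: both presentations share one base scalar and differ only in the selector that follows it. The function name presentation(of:selector:) is a hypothetical helper for this illustration.

```swift
let shamrock: UnicodeScalar = "\u{2618}"
let vs15: UnicodeScalar = "\u{FE0E}"  // VARIATION SELECTOR-15, text presentation
let vs16: UnicodeScalar = "\u{FE0F}"  // VARIATION SELECTOR-16, emoji presentation

// Build a two-scalar sequence: the base followed by a variation selector.
func presentation(of base: UnicodeScalar, selector: UnicodeScalar) -> String {
    var scalars = String.UnicodeScalarView()
    scalars.append(base)
    scalars.append(selector)
    return String(scalars)
}

let textStyle = presentation(of: shamrock, selector: vs15)
let emojiStyle = presentation(of: shamrock, selector: vs16)

print(textStyle.unicodeScalars.count)   // 2
print(emojiStyle.unicodeScalars.count)  // 2
print(textStyle.unicodeScalars.first == emojiStyle.unicodeScalars.first)  // true
```

Which glyph actually appears still depends on the fonts and the renderer; the string only carries the request.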

Modifier

The main modifier at present is the Emoji Fitzpatrick Type Modifier.

The Fitzpatrick scale is a classification of skin colors. This modifier is what gives some emoji their different skin tones.

Let me show you in code:

"\u{01F64B}"           ==> "🙋"
"\u{01F64B}\u{01F3FB}" ==> "🙋🏻"
"\u{01F64B}\u{01F3FC}" ==> "🙋🏼"
"\u{01F64B}\u{01F3FD}" ==> "🙋🏽"
"\u{01F64B}\u{01F3FE}" ==> "🙋🏾"
"\u{01F64B}\u{01F3FF}" ==> "🙋🏿"

Yes, these are not different emoji, as many people would expect; they are in fact the same emoji with different modifiers.

There is really just one code point for the emoji itself: U+01F64B. The modifier that follows it changes the skin tone in the presentation, so every such emoji is a sequence of two Unicode code points. Every system that supports Unicode needs to handle such sequences and display the correct emoji picture.
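Here is a hedged sketch of building those sequences programmatically rather than typing each literal. The function toned(_:tone:) and the toneValues array are names I made up for this example.

```swift
// The five Fitzpatrick modifier code points, U+1F3FB through U+1F3FF.
let toneValues: [UInt32] = [0x1F3FB, 0x1F3FC, 0x1F3FD, 0x1F3FE, 0x1F3FF]

// Append a modifier scalar to a base emoji.
// Returns nil if the value is not a valid Unicode scalar.
func toned(_ base: String, tone: UInt32) -> String? {
    guard let modifier = UnicodeScalar(tone) else { return nil }
    var scalars = base.unicodeScalars
    scalars.append(modifier)
    return String(scalars)
}

let base = "\u{1F64B}"
for value in toneValues {
    if let variant = toned(base, tone: value) {
        // Every variant is the same base scalar plus one modifier scalar.
        print(variant, variant.unicodeScalars.count)
    }
}
```

Note that the function does not check whether the base is an Emoji Modifier Base; it only concatenates scalars, and rendering is left to the system.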

On systems that don’t support this modifier, people will see something like this:

"\u{01F64B}\u{01F3FD}" ==> "🙋‌🏽"

The Fitzpatrick type modifier is displayed as a separate color block, in the color of the skin tone it represents. So when you see an emoji followed by such a block, it means the author wanted to show you that emoji in that skin tone, but your system or app doesn’t support the modifier yet.

The Unicode Consortium specifies that all emoji with visible skin should support the Fitzpatrick type modifier. These emoji are called Emoji Modifier Bases. Only an Emoji Modifier Base can be combined with the modifier, which is why, when you type emoji, only those offer a mini-palette of skin tones.

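Here are the supported Fitzpatrick types from Unicode® Technical Report #51:

U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6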

ZWJ

U+200D ZERO WIDTH JOINER (ZWJ) is a special code point in Unicode. It can be placed between the elements of a sequence to indicate that they should be combined into a single glyph, and it can be used multiple times in one sequence. To better illustrate the point, here is an example that could blow your mind.

"\u{01F468}\u{200D}\u{002764}\u{00FE0F}\u{200D}\u{01F48B}\u{200D}\u{01F468}"
 ==> "👨‍❤️‍💋‍👨"

"\u{01F468}\u{200D}\u{01F469}\u{200D}\u{01F467}\u{200D}\u{01F466}"
 ==> "👨‍👩‍👧‍👦"

"\u{01F469}\u{200D}\u{01F680}"
 ==> "👩‍🚀"

"\u{01F469}\u{200D}\u{01F3A8}"
 ==> "👩‍🎨"

"\u{01F46E}\u{200D}\u{2642}\u{FE0F}"
 ==> "👮‍♂️"

"\u{01F46E}\u{200D}\u{2640}\u{FE0F}"
 ==> "👮‍♀️"

Those are called ZWJ sequences.

Yes, they are certainly not ‘characters’ at all; this is a kind of ‘description language’.

👨 + ❤️ + 💋 + 👨 = 👨‍❤️‍💋‍👨
👨 + 👩 + 👧 + 👦 = 👨‍👩‍👧‍👦

(The ❤️ here is in emoji presentation. Remember U+FE0F from the Variation section? It is used here too, because we are combining an ‘emoji’.)

Furthermore, we now have ‘descriptions’ for different professions.

👩 + 🚀 = 👩‍🚀
👩 + 🎨 = 👩‍🎨

U+2640 and U+2642, mentioned at the beginning of this article, are likewise joined by ZWJ to the emoji they modify.

👮 + ♂️ = 👮‍♂️
👮 + ♀️ = 👮‍♀️

And you can imagine more.

So with ZWJ we basically have something like an ‘emoji description language’. We can expect more and more new emoji to be ‘described’ without adding new code points, and more and more ‘normal’ characters or emoji to be combined for new purposes.
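A minimal sketch of that ‘description language’ in code: the elements are joined with ZWJ, and filtering the joiners back out recovers them. The constant names are my own.

```swift
let zwj = "\u{200D}"  // ZERO WIDTH JOINER

// Man + woman + girl + boy, glued together with ZWJ.
let family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].joined(separator: zwj)
print(family.unicodeScalars.count)  // 7 (4 people + 3 joiners)

// Dropping the joiners recovers the individual elements.
let elements = family.unicodeScalars.filter { $0.value != 0x200D }
print(elements.map { String($0.value, radix: 16, uppercase: true) })
// ["1F468", "1F469", "1F467", "1F466"]
```

Whether the joined string is drawn as one glyph depends on the system having a picture for that particular sequence; otherwise the elements are shown one after another.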

Combining Marks

Most combining marks are actually diacritical marks; see Combining character for more information.

"\u{65}"             ==> "e"
"\u{0301}"           ==> " ‌́"
"\u{65}\u{0301}"     ==> "é"

They are not really emoji-related elements, but Unicode allows them to be applied to any character, so we can sometimes use them with emoji for special effects, like this:

"\u{01F436}"         ==> "🐶"
"\u{20E0}"           ==> "‌⃠"
"\u{01F436}\u{20E0}" ==> "🐶⃠"

Counting a String, a.k.a 4 + 1 = 4

OK, by now I think you have a deeper understanding of why modern languages always say ‘don’t count a string by the memory it consumes’.

Here is a note from Apple’s ‘The Swift Programming Language’, talking about the count of the characters in a string:

“Extended grapheme clusters can be composed of one or more Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift do not each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string cannot be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the characters property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.

The count of the characters returned by the characters property is not always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.”

Excerpt From: Apple Inc. “The Swift Programming Language (Swift 3 beta).” iBooks. https://itun.es/us/k5SW7.l

And here is a simple test in the Playground:

let cafe = "cafe"
let accent = "\u{0301}"

cafe.characters.count   // 4
accent.characters.count // 1

let café = cafe + accent
café.characters.count   // 4 + 1 = 4
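To make the point about memory concrete, here is a small sketch that compares the other views of a string. It uses only the unicodeScalars, utf16 and utf8 views; the constant names are my own.

```swift
let precomposed = "caf\u{E9}"      // é as the single scalar U+00E9
let decomposed = "cafe\u{0301}"    // e followed by U+0301 COMBINING ACUTE ACCENT

// Swift compares strings by canonical equivalence, so these are equal...
print(precomposed == decomposed)          // true

// ...even though they are stored differently.
print(precomposed.unicodeScalars.count)   // 4
print(decomposed.unicodeScalars.count)    // 5
print(precomposed.utf8.count)             // 5
print(decomposed.utf8.count)              // 6
print(decomposed.utf16.count)             // 5

// A single emoji scalar outside the Basic Multilingual Plane
// needs two UTF-16 code units and four UTF-8 bytes.
let emoji = "\u{1F64B}"
print(emoji.unicodeScalars.count)  // 1
print(emoji.utf16.count)           // 2
print(emoji.utf8.count)            // 4
```

The utf16 count is the figure that corresponds to the length of an NSString, as the quote above describes.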

Strangely, though, an emoji ZWJ sequence is counted as multiple characters, even though it is displayed as one. I don’t know whether that is because I used a beta version of Xcode 8.

"👨‍❤️‍💋‍👨".characters.count   // 4

And there are more …

Unicode is very complex indeed, and there are many other interesting things in it. For example, the flags in emoji are actually built from pairs of Regional Indicator characters, one for each letter of a two-letter region code.
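As a rough sketch of how a flag is put together, each letter A to Z maps to a Regional Indicator Symbol starting at U+1F1E6. The function flag(for:) is a hypothetical helper, and it assumes the region code is given in uppercase ASCII.

```swift
// Map an uppercase two-letter region code to its Regional Indicator pair.
// Returns nil if any character is outside A...Z.
func flag(for regionCode: String) -> String? {
    let indicatorA: UInt32 = 0x1F1E6  // REGIONAL INDICATOR SYMBOL LETTER A
    let asciiA: UInt32 = 65           // "A"
    var result = String.UnicodeScalarView()
    for scalar in regionCode.unicodeScalars {
        guard scalar.value >= asciiA, scalar.value <= asciiA + 25,
              let indicator = UnicodeScalar(indicatorA + scalar.value - asciiA)
        else { return nil }
        result.append(indicator)
    }
    return String(result)
}

if let us = flag(for: "US") {
    // Two scalars: U+1F1FA (U) and U+1F1F8 (S).
    print(us, us.unicodeScalars.count)
}
```

The function does not check that the code names a real region; a pair with no flag picture is typically shown as two boxed letters.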

I’ll stop here. For anybody interested in learning more, here are the links for easy reference.

  1. Proposed Update UTR #51, Unicode Emoji (Version 4.0)
  2. Unicode® Technical Report #51 (This one is a relatively comprehensive resource on emoji.)
  3. Unicode® Emoji Charts v4.0 — Beta
  4. Full Emoji Data, v4.0 — Beta
  5. Text vs Emoji, v4.0 — Beta
  6. Draft data files for UTR #51 Unicode Emoji, Version 4.0
  7. Wikipedia: Unicode
  8. Wikipedia: Combining character
  9. The Swift Programming Language (Swift 3 beta)