But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5

From time to time, someone shows that in JavaScript the .length of a string containing an emoji results in a number greater than 1 (typically 2) and then proceeds to the conclusion that haha JavaScript is so broken—and is rewarded with many likes. In this post, I will try to convince you that ridiculing JavaScript for this is less insightful than it first appears and that Swift’s approach to string length isn’t unambiguously the best one. Python 3’s approach is unambiguously the worst one, though.

What’s Going on with the Title?

"🤦🏼‍♂️".length == 7 evaluates to true as JavaScript. Let’s try JavaScript console in Firefox:

"🤦🏼‍♂️".length == 7
true

Haha, right? Well, you’ve been told that the Python community suffered the Python 2 vs. Python 3 split, among other things, to Get Unicode Right. Let’s try Python 3:

$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("🤦🏼‍♂️") == 5
True
>>> 

OK, then. Now, Rust has the benefit of learning from languages that came before it. Let’s try Rust:

$ cargo new -q length
$ cd length
$ echo 'fn main() { println!("{}", "🤦🏼‍♂️".len() == 17); }' > src/main.rs
$ cargo run -q
true

That’s better!

What?

The string contains a single emoji consisting of five Unicode scalar values:

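Unicode scalar                              | UTF-32 code units | UTF-16 code units | UTF-8 code units | UTF-32 bytes | UTF-16 bytes | UTF-8 bytes
--------------------------------------------|-------------------|-------------------|------------------|--------------|--------------|------------
U+1F926 FACE PALM                           | 1                 | 2                 | 4                | 4            | 4            | 4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3   | 1                 | 2                 | 4                | 4            | 4            | 4
U+200D ZERO WIDTH JOINER                    | 1                 | 1                 | 3                | 4            | 2            | 3
U+2642 MALE SIGN                            | 1                 | 1                 | 3                | 4            | 2            | 3
U+FE0F VARIATION SELECTOR-16                | 1                 | 1                 | 3                | 4            | 2            | 3
Total                                       | 5                 | 7                 | 17               | 20           | 14           | 17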

The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skin tone modifier that changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.

Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings store Unicode code points each of which is stored as one code unit by CPython 3, so the string occupies 5 code units. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. We’ll come back to the actual storage as opposed to semantics later.
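As a rough check of the table above, the per-scalar widths can be summed with nothing but the Rust standard library. This is only a sketch: the helper name code_unit_lengths is made up for illustration, and the emoji is written with escapes so that the source stays ASCII.

```rust
/// Returns (scalar values, UTF-16 code units, UTF-8 code units) for `s`,
/// summing the per-scalar widths the same way the table does.
/// (`code_unit_lengths` is a hypothetical helper name.)
fn code_unit_lengths(s: &str) -> (usize, usize, usize) {
    let mut scalars = 0;
    let mut utf16 = 0;
    let mut utf8 = 0;
    for c in s.chars() {
        scalars += 1;
        utf16 += c.len_utf16(); // 1 or 2 sixteen-bit units
        utf8 += c.len_utf8(); // 1 to 4 bytes
    }
    (scalars, utf16, utf8)
}

fn main() {
    // FACE PALM, skin tone modifier, ZWJ, MALE SIGN, VARIATION SELECTOR-16.
    let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    for c in s.chars() {
        println!(
            "U+{:04X}: {} UTF-16 unit(s), {} UTF-8 unit(s)",
            c as u32,
            c.len_utf16(),
            c.len_utf8()
        );
    }
    let (scalars, utf16, utf8) = code_unit_lengths(s);
    println!("Total: {} {} {}", scalars, utf16, utf8);
}
```

The totals correspond to the 5, 7, and 17 that Python 3, JavaScript, and Rust report, respectively.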

Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought! With the way the example string was constructed in Python 3, the Python 3 string happens to match the valid UTF-32 representation of the string, so it is still illustrative of UTF-32, but the rest of the article has been slightly edited to avoid claiming that Python 3 used UTF-32.
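For contrast, here is a small sketch of what "guaranteed-valid" means on the Rust side, where surrogates cannot be materialized in a char or a String. Only standard library calls are used; the helper name is_valid_utf16 is an assumption made for this example.

```rust
/// True if `units` is valid UTF-16, i.e. contains no unpaired surrogates.
/// (`is_valid_utf16` is a hypothetical helper name.)
fn is_valid_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

fn main() {
    // A surrogate code point is not a Unicode scalar value, so there is
    // no `char` for it.
    println!("{:?}", char::from_u32(0xD83E));

    // A proper surrogate pair decodes to one scalar value (U+1F926).
    let pair = String::from_utf16(&[0xD83E, 0xDD26]).expect("valid pair");
    println!("{}", pair.chars().count());

    // An unpaired surrogate is rejected instead of being stored.
    println!("{}", is_valid_utf16(&[0xD83E]));
}
```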

But I Want the Length to Be 1!

There’s a language for that. The following used Swift 4.2.3, which was the latest release when I was researching this, on Ubuntu 18.04:

$ mkdir swiftlen
$ cd swiftlen/
$ swift package init --type executable
Creating executable package: swiftlen
Creating Package.swift
Creating README.md
Creating .gitignore
Creating Sources/
Creating Sources/swiftlen/main.swift
Creating Tests/
Creating Tests/LinuxMain.swift
Creating Tests/swiftlenTests/
Creating Tests/swiftlenTests/swiftlenTests.swift
Creating Tests/swiftlenTests/XCTestManifests.swift
$ echo 'print("🤦🏼‍♂️".count == 1)' > Sources/swiftlen/main.swift 
$ swift run swiftlen 2>/dev/null
true

(Not using the Swift REPL for the example, because it does not appear to accept non-ASCII input on Ubuntu! Swift 5.0.3 prints the same and the REPL is still broken.)

OK, so we’ve found a language that thinks the string contains one countable unit. But what is that countable unit? It’s an extended grapheme cluster. (“Extended” to distinguish from the older attempt at defining grapheme clusters now called legacy grapheme clusters.) The definition is in Unicode Standard Annex #29 (UAX #29).

The Lengths Seen So Far

We’ve seen four different lengths so far:

  • Number of UTF-8 code units (17 in this case)
  • Number of UTF-16 code units (7 in this case)
  • Number of UTF-32 code units or Unicode scalar values (5 in this case)
  • Number of extended grapheme clusters (1 in this case)

Given a valid Unicode string and a version of Unicode, all of the above are well-defined and it holds that each item higher on the list is greater than or equal to the items lower on the list.

One of these is not like the others, though: The first three numbers have an unchanging definition for any valid Unicode string whether it contains currently assigned scalar values or whether it is from the future and contains unassigned scalar values as far as software written today is aware. Also, computing the first three lengths does not involve lookups from the Unicode database. However, the last item depends on the Unicode version and involves lookups from the Unicode database. If a string contains scalar values that are unassigned as far as the copy of the Unicode database that the program is using is aware, the program will potentially overcount extended grapheme clusters in the string compared to a program whose copy of the Unicode database is newer and has assignments for those scalar values (and some of those assignments turn out to be combining characters).
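To illustrate that the scalar value and UTF-16 lengths need no Unicode database, here is a sketch that derives both from the bit patterns of UTF-8 bytes alone. It assumes the input is already valid UTF-8, and the function names are made up for this example.

```rust
/// Counts Unicode scalar values in valid UTF-8 by counting the bytes that
/// are not continuation bytes (continuation bytes look like 0b10xxxxxx).
fn scalar_count(utf8: &[u8]) -> usize {
    utf8.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

/// Counts the UTF-16 code units that valid UTF-8 would convert to: a
/// four-byte sequence (lead byte 0xF0 or above) becomes a surrogate pair,
/// every other sequence becomes a single unit.
fn utf16_len(utf8: &[u8]) -> usize {
    utf8.iter()
        .filter(|&&b| (b & 0xC0) != 0x80)
        .map(|&b| if b >= 0xF0 { 2 } else { 1 })
        .sum()
}

fn main() {
    let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    let bytes = s.as_bytes();
    println!("{} {} {}", bytes.len(), utf16_len(bytes), scalar_count(bytes));
}
```

Nothing here depends on which scalar values are assigned, so the result is the same under any past or future Unicode version. No comparable shortcut exists for extended grapheme clusters.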

More Than One Length per Programming Language

It is not the case that a given programming language has to choose only one of the above. If we run this Swift program:

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

it prints:

1
5
7
17

Let’s try Rust with unicode-segmentation = "1.3.0" in Cargo.toml:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", s.graphemes(true).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

2
5
7
17

That’s unexpected! It turns out that unicode-segmentation does not implement the latest version of the Unicode segmentation rules, so it gives the ZERO WIDTH JOINER generic treatment (break right after ZWJ) instead of the newer refinement in the emoji context.

Let’s try again, but this time with unic-segment = "0.9.0" in Cargo.toml:

use unic_segment::Graphemes;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", Graphemes::new(s).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

1
5
7
17

In the Rust case, strings (here mere string slices) know the number of UTF-8 code units they contain. The len() method call just returns this number that has been stored since the creation of the string (in this case, compile time). In the other cases, what happens is the creation of an iterator and then instead of actually examining the values (string slices corresponding to extended grapheme clusters, Unicode scalar values or UTF-16 code units) that the iterator would yield, the count() method just consumes the iterator and returns the number of items that were yielded by the iteration. The count isn’t stored anywhere on the string (slice) afterwards. If we wanted to later know the counts again, we’d have to iterate over the string again.

Know in Advance or Compute When Needed?

This introduces a notable question in the design space: Should a given type of length quantity be eagerly computed when the string is created? Or should the length be computed when someone asks for it? Or should it be computed when someone asks for it and then automatically stored on the string object so that it’s available immediately if someone asks for it again?

The answer Rust has is that the length in the code units of the Unicode Encoding Form of the language is stored upon string creation, and the rest are computed when someone asks for them (and then forgotten and not stored on the string).

Swift is a higher-level language and doesn’t document the exact nature of its string internals as part of the API contract. In fact, the internal representation of Swift strings changed substantially between Swift 4.2 and Swift 5.0. It’s not documented if different views to the string are held onto once created, for example. The documentation does say that strings are copy-on-write, so the first mutation may involve copying the string’s storage.

Notably, the design space includes not remembering anything. The C programming language is a prominent example of this case. C strings don’t even remember their number of code units. To find out the number of code units, you have to iterate over the string until a sentinel value. In the case of C, the sentinel is the code unit for U+0000, so it excludes one Unicode scalar value from the possible string contents. However, that’s not a strictly necessary property of a sentinel-based design that doesn’t remember any lengths. 0xFF does not occur as a code unit in any valid UTF-8 string and 0xFFFFFFFF does not occur in any valid UTF-32 string, so they could be used as sentinels for UTF-8 and UTF-32 storage, respectively, without excluding a scalar value from the Unicode value space. There is no 16-bit value that never occurs in a valid UTF-16 string. However, a valid UTF-16 string does not contain unpaired surrogates, so an unpaired low surrogate could, in principle, be used as a sentinel in a design that wanted to use guaranteed-valid UTF-16 strings that don’t remember their code unit length.
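The 0xFF idea can be sketched as follows. This is a hypothetical design for illustration only, not something an existing language does: the buffer is scanned for the sentinel the way C's strlen scans for a zero byte, and U+0000 stays available as ordinary content.

```rust
/// Length in UTF-8 code units of a hypothetical 0xFF-terminated string,
/// found by scanning for the sentinel. Returns None if there is none.
fn sentinel_len(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == 0xFF)
}

fn main() {
    // "a", U+0000, "b", then the sentinel. U+0000 is ordinary content here.
    let buf = [b'a', 0x00, b'b', 0xFF];
    let len = sentinel_len(&buf).expect("missing sentinel");
    let text = std::str::from_utf8(&buf[..len]).expect("invalid UTF-8");
    println!("{} code units, {} scalar values", len, text.chars().count());
}
```

As with C strings, finding the length is O(n) every time it is needed.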

Knowing the Storage-Native Code Unit Length is Extremely Reasonable

The length of the string as counted in code units of its storage-native Unicode Encoding Form (i.e. whichever of UTF-8, UTF-16, and UTF-32 the programming language has chosen for its string semantics) is not like the other lengths. It is the length that the implementation cannot avoid having to know at the time of creating a new string, because it is the length that is required to be known in order to be able to allocate storage for a string. Even C, which promptly forgets about the code unit length in the storage-native Unicode Encoding Form after the string has been created, has to know this length when allocating storage for a new string.

That is, the design decision is about whether to remember this length. It is not about whether to compute it eagerly. You just have to have it at string creation time—i.e. eagerly.

Considering that remembering this quantity makes string concatenation, which is a common operation, substantially faster to implement compared to not remembering this quantity, remembering this quantity is fundamentally reasonable. Also, it means that you don’t need to maintain a sentinel value, which means that a substring operation can yield results that share the buffer with the original string instead of having to copy in order to be able to insert a sentinel. (Note that you can easily foil this benefit if you wish to eagerly maintain zero-termination for the sake of C string compatibility.)
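Both benefits can be seen in a short Rust sketch. The helper names concat and shares_buffer are made up for this example; the standard library already offers concatenation, and the pointer comparison is only there to make the buffer sharing visible.

```rust
/// Concatenates two strings with at most one allocation, which is possible
/// because both operands already know their length in UTF-8 code units.
fn concat(a: &str, b: &str) -> String {
    let mut out = String::with_capacity(a.len() + b.len());
    out.push_str(a);
    out.push_str(b);
    out
}

/// True if `part` lies inside the buffer of `whole`, i.e. no copy was made.
fn shares_buffer(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}

fn main() {
    let s = concat("face", "palm");
    // A substring is just a pointer plus a length; no sentinel is needed.
    let tail = &s[4..];
    println!("{} {} {}", s, tail, shares_buffer(&s, tail));
}
```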

What About Knowing the Other Lengths?

Even if we’ve established that it makes sense for a string implementation to remember the storage length of the string in code units of the storage-native Unicode encoding form, it doesn’t answer whether a string implementation should also remember other lengths or which kind of length should be offered in the most ergonomic API. (As we see above, Swift makes the number of extended grapheme clusters more ergonomic to obtain than the code unit or scalar value length.)

Also, if any other length is to be remembered, there is the question of whether it should be eagerly computed at string creation time or lazily computed the first time someone asks for it. It is easy to see why at least the latter does not make sense for a multi-threaded systems-programming language like Rust. If some properties of an object are lazily initialized, in a multi-threaded case you also need to solve synchronization of these computations. Furthermore, you need to allocate space at least for a pointer to auxiliary information if you want to be able to add auxiliary information later or you need to have a hashtable of auxiliary information where the string the information is about is the key, so auxiliary information, even when not present, has storage implications or implications of having to have global state in a run-time system. Finally, for systems programming, it may be more desirable to know the time complexity of a given operation clearly even if it means “always O(n)” instead of “possibly O(n) but sometimes O(1)”. Even if the latter looks strictly better, it is less predictable.
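To make the cost concrete, here is a sketch of what a lazily cached scalar value count could look like in Rust. CachingString is a hypothetical type invented for this example, not something the standard library provides; it uses std::sync::OnceLock to handle the synchronization.

```rust
use std::sync::OnceLock;

/// A hypothetical string type that lazily caches its scalar value count.
struct CachingString {
    text: String,
    // Extra storage on every string, whether or not anyone ever asks,
    // plus synchronization state inside the OnceLock.
    scalar_count: OnceLock<usize>,
}

impl CachingString {
    fn new(text: &str) -> Self {
        CachingString {
            text: text.to_string(),
            scalar_count: OnceLock::new(),
        }
    }

    /// O(n) on the first call, O(1) afterwards; callable from several threads.
    fn scalar_count(&self) -> usize {
        *self.scalar_count.get_or_init(|| self.text.chars().count())
    }
}

fn main() {
    let s = CachingString::new("\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}");
    println!("{}", s.scalar_count());
    let extra = std::mem::size_of::<CachingString>() - std::mem::size_of::<String>();
    println!("{} extra bytes per string", extra);
}
```

Every string pays for the extra field, and the cost of the first scalar_count() call differs from the cost of later calls, which is the predictability concern described above.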

For a higher-level language, arguments from space requirements or synchronization issues might not be decisive. It’s more relevant to consider what a given length quantity is used for. This is often forgotten in Internet debates that revolve around what length is the most “correct” or “logical” one. So for the lengths that don’t map to the size of storage allocation, what are they good for?

It turns out that in the Firefox code base there are two places where someone wants to know the number of Unicode scalar values in a string that is not being stored as UTF-32 and attention is not paid to what the scalar values actually are. The IETF specification for Session Traversal Utilities for NAT (STUN) used for WebRTC has the curious property that it places length limits on certain protocol strings such that the limits are expressed as number of Unicode scalar values but the strings are transmitted in UTF-8. Firefox validates these limits. (The limit looks like an arbitrary power-of-two (128 scalar values). The spec has remarks about the possible resulting byte length, which was wrong according to the IETF UTF-8 RFC that was current and already nearly five years old at the time of publication of the STUN RFC. Specifically, the STUN RFC repeatedly says that 128 characters as UTF-8 may be as long as 763 bytes. To arrive at that number, you have to assume that a UTF-8 character can be up to six bytes long, as opposed to up to 4 bytes long as in the prevailing UTF-8 RFC and in the Unicode Standard, and that the last character of the 128 is a zero terminator and, therefore, known to take just one byte.) In this case, the reason for wishing to know a non-storage length is to impose a limit. The other case is reporting the column number for the source location of JavaScript errors.
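A limit check of this kind, and the arithmetic behind the 763 figure, can be sketched as follows. This is not Firefox's actual code; both function names are made up for illustration, and worst_case_bytes simply encodes the assumptions described above (one-byte terminator, every other character at the maximum width).

```rust
/// True if `s` has at most `limit` Unicode scalar values. Stops counting
/// as soon as the limit is exceeded instead of walking the whole string.
fn within_scalar_limit(s: &str, limit: usize) -> bool {
    s.chars().take(limit + 1).count() <= limit
}

/// Worst-case UTF-8 byte length of `n` characters (n >= 1) when the last
/// one is a one-byte zero terminator and the others take up to
/// `max_bytes` bytes each.
fn worst_case_bytes(n: usize, max_bytes: usize) -> usize {
    (n - 1) * max_bytes + 1
}

fn main() {
    // 128 scalar values that occupy 512 bytes are still within the limit.
    let s = "\u{1F926}".repeat(128);
    println!("{} bytes, ok: {}", s.len(), within_scalar_limit(&s, 128));
    // Six-byte assumption: 127 * 6 + 1.
    println!("{}", worst_case_bytes(128, 6));
    // Four-byte UTF-8 under the same terminator assumption: 127 * 4 + 1.
    println!("{}", worst_case_bytes(128, 4));
}
```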

Length limits, which we’ll come back to, probably aren’t a frequent enough use case to justify making strings know a particular kind of length as opposed to such length being possible to compute when asked for. Neither are error messages.

Another use case for asking for a length is iterating by index and using the length as the loop termination condition 1990s Java style. Like this:

for (int i = 0; i < s.length(); i++) {
    // Do something with s.charAt(i)
}

In this case, it’s actually important for the length to be a precomputed number on the string object. This use case is coupled with the requirement that indexing into the string to find the nth unit corresponding to the count of units that the “length” represents should be a fast operation.

The above pattern is a lot less conclusive in terms of what lengths should be precomputed (and what the indexing unit should be) than it first appears. The above loop doesn’t do random access by index. It sequentially uses every index from zero up to, but not including, length. Indeed, especially when iterating over a string by Unicode scalar value, typically when you examine the contents of a string, you iterate over the string in order. Programming languages these days provide an iterator facility for this, and e.g. to iterate over a UTF-8 string by scalar value, the iterator does not need to know the number of scalar values up front. E.g. in Rust, you can do this in O(n) time despite string slices not knowing their number of Unicode scalar values:

for c in s.chars() {
    // Do something with c
}

(Note that char is an 8-bit code unit (possibly UTF-8 code unit) in C and C++, char is a UTF-16 code unit in Java, char is a Unicode scalar value in Rust, and Character is an extended grapheme cluster in Swift.)

A programming language together with its library ecosystem should provide iteration over a string by Unicode scalar value and by extended grapheme cluster, but it does not follow that strings would need to know the scalar value length or the extended grapheme cluster length up front. Unlike the code unit storage length, those quantities aren’t useful for accelerating operations like concatenation that don’t care about the exact content of the string.

Which Unicode Encoding Form Should a Programming Language Choose?

The observation that having strings know their code unit length in their storage-native Unicode encoding form is extremely reasonable does not answer how many bits wide the code units should be.

The usual way to approach this question is to argue that UTF-32 is the best, because it provides O(1) indexing by “character” in the sense of a character meaning a Unicode scalar value, or the argument focuses on whether UTF-8 is unfair to some languages relative to UTF-16. I think these are bad ways to approach this question.

First of all, the argument that the answer should be UTF-32 is bad on two counts. First, it assumes that random access by scalar value is important, but in practice it isn’t. It’s reasonable to want to have a capability to iterate over a string by scalar value, but random access by scalar value is in the YAGNI department. Second, arguments in favor of UTF-32 typically come at a point where the person making the argument has learned about surrogate pairs in UTF-16 but has not yet learned about extended grapheme clusters being even larger things that the user perceives as a unit. That is, if you escape the variable-width nature of UTF-16 to UTF-32, you pay by doubling the memory requirements and extended grapheme clusters are still variable-width.

I’ll come back to the length fairness issue later, but I think a different argument is much more relevant in practice for the choice of in-memory Unicode encoding form. The more relevant argument is this: Implementations that choose UTF-8 actually accept the UTF-8 storage requirements. When wider-unit semantics are chosen for a language that doesn’t provide raw memory access and, therefore, has the opportunity to tweak string storage, the implementations try to come up with ways to avoid actually paying the cost of the wider units in some situations.

JavaScript and Java strings have the semantics of potentially-invalid UTF-16. SpiderMonkey and V8 implement an optimization for omitting the leading zeros of each code unit in a string, i.e. storing the string as ISO-8859-1 (the actual ISO-8859-1, not the Web notion of “ISO-8859-1” as a label of windows-1252), when all code units in the string have zeros in the most-significant half. The HotSpot JVM also implements this optimization, though enabling it is optional. Swift 4.2 implements a slightly different variant of the same idea, where ASCII-only strings are stored as 8-bit units and everything else is stored as UTF-16. CPython since 3.3 makes the same idea three-level with code point semantics: Strings are stored with 32-bit code units if at least one code point has a non-zero bit above the low 16 bits. Else if a string has non-zero bits above the low 8 bits for at least one code point, the string is stored as 16-bit units. Otherwise, the string is stored as 8-bit units (Latin1).
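The three-level rule can be sketched in a few lines. This only mimics the width selection described above and is not CPython's implementation; the function names are invented for this example, and object headers are ignored.

```rust
/// Bytes per code unit that a three-level representation of the kind
/// described above would pick: 1 if every code point fits in 8 bits,
/// 2 if every code point fits in 16 bits, otherwise 4.
fn unit_width(s: &str) -> usize {
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    if max <= 0xFF {
        1
    } else if max <= 0xFFFF {
        2
    } else {
        4
    }
}

/// Payload size in bytes under that scheme (ignoring object headers).
fn compact_size(s: &str) -> usize {
    unit_width(s) * s.chars().count()
}

fn main() {
    let samples = [
        "na\u{EF}ve",                                     // Latin1 range only
        "\u{65E5}\u{672C}",                               // two CJK ideographs
        "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}",     // the example emoji
    ];
    for s in samples {
        println!("{} byte(s) per unit, {} bytes total", unit_width(s), compact_size(s));
    }
}
```

Note that a single astral scalar value widens every unit of the string, which is why the example emoji takes 20 bytes under this scheme versus 17 as UTF-8.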

I think the unwillingness of implementations of languages that have chosen UTF-16 or UTF-32 (or UTF-32-ish as in the case of Python 3) string semantics to actually use UTF-16 or UTF-32 storage when they can get away with not using actual UTF-16 or UTF-32 storage is the clearest indictment against UTF-16 or UTF-32 (and other wide-unit semantics like what Python 3 uses).

Languages that choose UTF-8, on the other hand, stick to actual UTF-8 for the purpose of storing Unicode scalar values. When languages that choose UTF-8 deviate from UTF-8, they do so in order to represent values that are not Unicode scalar values for compatibility with external constraints. Rust uses a representation called WTF-8 for file system paths on Windows. All UTF-8 strings are WTF-8 strings, but WTF-8 can also represent unpaired surrogates for compatibility with Windows file paths being sequences of 16-bit units that can contain unpaired surrogates. Perl 6 uses an internal representation called UTF-8 Clean-8 (or UTF8-C8), which represents strings that consist of Unicode scalar values in Unicode Normalization Form C the same way as UTF-8 but represents non-NFC content differently and can represent sequences of bytes that are not valid UTF-8.

UTF-8 is the only one of the Unicode encoding forms that is also a Unicode encoding scheme, and of the Unicode encoding schemes, UTF-8 has clearly won for interchange. (Unicode encoding forms are what you have in RAM, so UTF-16 consists of native-endian, two-byte-aligned 16-bit code units. Unicode encoding schemes are what can be used for byte-oriented interchange, so e.g. UTF-16LE consists of 8-bit code units every pair of which form a potentially-unaligned little-endian 16-bit number, which in turn may form a surrogate pair.) When UTF-8 is used as the in-RAM representation, input and output operations are less expensive than with UTF-16 or UTF-32. UTF-16 or UTF-32 in RAM requires conversion from UTF-8 when reading input and conversion to UTF-8 when writing output. A system that guarantees UTF-8 validity internally, such as Rust, needs only to validate UTF-8 upon reading input and no conversion is needed when writing output. (Go takes a garbage in, garbage out approach to UTF-8: input is not validated at input time and output is written without conversion. However, iteration by scalar value can yield REPLACEMENT CHARACTERs when iterating over invalid UTF-8. That is, the input step is less expensive than in Rust, but iterating by scalar value is marginally more expensive. The output step is less correct.)
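The validate-once approach looks roughly like this in Rust. The wrapper names read_validated and read_lossy are made up for this sketch; the lossy variant only imitates the kind of REPLACEMENT CHARACTER substitution described for Go and is not Go's API.

```rust
use std::borrow::Cow;

/// Rust-style input: validate once, then borrow the same bytes as a
/// string without copying or converting them.
fn read_validated(input: &[u8]) -> Option<&str> {
    std::str::from_utf8(input).ok()
}

/// Lossy decoding: invalid sequences become U+FFFD REPLACEMENT CHARACTER.
fn read_lossy(input: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(input)
}

fn main() {
    let good = "\u{1F926}".as_bytes();
    let bad = [b'a', 0xFF, b'b'];

    let s = read_validated(good).expect("valid UTF-8");
    // The validated string is the input buffer itself, not a converted copy.
    println!("{}", s.as_ptr() == good.as_ptr());

    println!("{:?}", read_validated(&bad));
    println!("{:?}", read_lossy(&bad));
}
```

Writing output is then just a matter of handing s.as_bytes() to the output sink.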

Finally, in terms of nudging developers to write correct code, UTF-8 has the benefit of being blatantly variable-width, so even with languages such as English, Somali, and Swahili, as soon as you have a dash or a smart quote, the variable-width nature of UTF-8 shows up. In this context, extended grapheme clusters are just extending the variable-width nature. Meanwhile, UTF-16 allows programmers to get too far while pretending to be working with something where the units they need to care about are fixed-width. Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea, because if you want to write correct software, you still need to deal with variable-width extended grapheme clusters.
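A tiny sketch of that nudge: with a single smart quote, the UTF-8 length already diverges from the character count, while the UTF-16 length keeps up the fixed-width illusion until an astral character appears. The helper name lengths is made up for this example.

```rust
/// (UTF-8 code units, UTF-16 code units, scalar values) for `s`.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // ASCII apostrophe: all three lengths agree.
    println!("{:?}", lengths("it's"));
    // U+2019 RIGHT SINGLE QUOTATION MARK: only UTF-8 shows variable width.
    println!("{:?}", lengths("it\u{2019}s"));
    // U+1F926 FACE PALM: UTF-16 finally diverges from the scalar count.
    println!("{:?}", lengths("\u{1F926}"));
}
```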

The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing. The choice of UTF-16 is a matter of early-adopter legacy from the time when Unicode was expected to be capped to 16 bits of code space. Once UTF-16 has been committed to, not breaking compatibility with already-written programs is important and justifies the continued use of UTF-16. But if you aren’t bound by that legacy and are designing a new language, you should go with UTF-8. Occasionally even systems that appear to be bound by the UTF-16 legacy can break free. Even though Swift is committed to interoperability with Cocoa, which uses UTF-16 strings, Swift 5 switched to UTF-8 for Swift-native strings. Similarly, PyPy has gone UTF-8 despite Python 3 having code point semantics.

Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?

Even if we accept that the storage should be UTF-8 and that the string implementation should maintain knowledge of the string length in UTF-8 code units, if the blatant variable-widthness of UTF-8 is argued to be a nudge toward dealing with the variable-widthness of extended grapheme clusters, shouldn’t the Swift approach of making extended grapheme cluster access and count the view that takes the least ceremony to use be the thing that every language should do?

Swift is still too young to draw definitive conclusions from. It’s easy to believe that the Swift approach nudges programmers to write more extended grapheme cluster-correct code and that the design makes sense for a language meant primarily for UI programming on a largely evergreen platform (iOS). It isn’t clear, though, that the Swift approach is the best for everyone.

Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

in Swift 4.2.3 on Ubuntu 14.04 prints:

3
5
7
17

So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!

Swift 4 delegates Unicode segmentation to operating system-provided ICU, and “Long-Term Support” in the Ubuntu case means security patches but does not mean rolling forward the Unicode version that the system copy of ICU knows about. In the case of iOS, delegating to system ICU is probably OK and will not lead to too high a probability of the text being from the future from the point of view of the OS copy of ICU, since the iOS ecosystem stays exceptionally well up-to-date. However, delegating to system ICU is not such a great match for the idea of using Swift on the server side if the server side means running an old LTS distro.

(Swift 5 appears to no longer use system ICU for this. That is, Swift 5.0.3 on Ubuntu 14.04 sees one extended grapheme cluster in the string. I haven’t investigated what Swift 5 uses, but I assume that the switch to UTF-8 string representation necessitated using something other than ICU, which is heavily UTF-16-oriented. However, the result with Swift 4.2.3 nicely illustrates the issue related to using extended grapheme clusters.)

If you are doing things that have to be extended grapheme cluster-aware, there just is no way around the issue of not being able to correctly segment text that comes from the future relative to the Unicode segmentation implementation that your program is using. This is not a reason to avoid extended grapheme clusters for tasks that require awareness of extended grapheme clusters.

However, pushing extended grapheme clusters onto tasks that do not really require the use of extended grapheme clusters introduces failure modes arising from the Unicode version dependency where such a dependency isn’t strictly necessary. For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates.

Let’s consider other languages a bit.

C++ is often deployed such that the application developer doesn’t ship the standard library with the program. Most obviously, relying on GNU libstdc++ provided by an LTS Linux distribution presents problems similar to Swift 4 relying on ICU provided by an LTS Linux distribution. This isn’t a Linux-specific issue. Old supported branches of Windows generally don’t get new system-level Unicode data, either. Even though there is some movement towards individual applications shipping their own copy of LLVM libc++ with the application and the increased pace of C++ standard development starting with C++11 has made using a system-provided C++ standard library more problematic even ignoring Unicode considerations, it doesn’t seem like a good idea for C++ to develop a tight coupling with extended grapheme clusters for operations that don’t strictly necessitate it as long as stuck-in-the-past system libraries (whether the C++ standard library itself or another library that it delegates to) are a significant part of the C++ standard library distribution practice.

There’s a proposal to expose extended grapheme cluster segmentation to JavaScript programs. The main problem with this proposal is the implication on APK sizes on Android and the effect of APK sizes on browser competition on Android. But if we ignore that for the moment and imagine this was part of the Web Platform, it would still be problematic to build this dependency into operations for which working on extended grapheme clusters isn’t strictly necessary. While the most popular browsers are evergreen, there’s still a long tail of browser instances that aren’t on the latest engine versions. When JavaScript executes on such browsers, there’d be effects similar to running Swift 4 on Ubuntu 14.04.

In contrast to C++ or JavaScript, the current Rust approach is to statically link all Rust library code, including the standard library, into the executable program. This means that the application distributor is in control of library versions and doesn’t need to worry about the program executing in the context of out-of-date Rust libraries. The flip side is concerns about the size of the executable. People already (rightly or wrongly) complain about the sizes of Rust executables. Pulling in a lot of Unicode data due to baking extended grapheme cluster processing into programs whose problem domain doesn’t strictly require working with extended grapheme clusters would be problematic in embedded contexts where the executable size is a real problem and not just a perceived problem—and would obviously make the perceived problem worse, too. Furthermore, in order to avoid problems similar to those involved in relying on system libraries, baking tight coupling with Unicode data into the standard library necessitates the organizational capability of keeping up with new Unicode versions in this area where not only data in the tables keeps changing but the format of the tables and, therefore, the associated algorithms have still been changing recently. Right now of the two extended grapheme cluster crates outside the Rust standard library, the one that’s organizationally closer to the standard library is the one that’s out of date.

Why Do We Want to Know Anyway?

“String length is about as meaningful a measurement as string height” – @qntm

Being able to allocate memory for strings gives a legitimate use case for knowing the storage length. However, in cases of Unicode scalar values or extended grapheme clusters, you typically want to iterate over them and look at each one instead of just knowing the count. So why do people want to know the count? As far as I can tell, there are two broad categories: Placing a quota limit that is fuzzy enough that it doesn’t need to be strictly tied to storage and trying to estimate how much text fits for display. Let’s look at the issue of estimating how much display space text takes, because it involves introducing yet another measurement of string length.
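As a minimal sketch of that distinction in Rust (the helper name `shout` is hypothetical and only for illustration): the storage length sizes the buffer, while the scalar values are visited by iteration without anyone asking how many there are.

```rust
/// Hypothetical helper: uppercases ASCII letters and copies everything else.
fn shout(input: &str) -> String {
    // The storage length (UTF-8 bytes) is exactly what allocation needs.
    let mut out = String::with_capacity(input.len());
    // The scalar values are iterated; their count is never needed.
    for c in input.chars() {
        out.push(c.to_ascii_uppercase());
    }
    out
}

fn main() {
    let s = shout("caf\u{E9} \u{1F926}");
    println!("{} ({} bytes)", s, s.len());
}
```

ASCII uppercasing preserves the byte length, so the buffer allocated up front never has to grow.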

Display Space

Simply looking at the Latin letters i and m should make it clear that the display size of a string depends on the font and on the specific characters in the string. From this observation, the whole notion of estimating display space by counting characters seems folly. Indeed, if you want to know exactly how much text fits into a given space, you need to run a typesetting algorithm with a specific font, which may have a complex relationship between scalar values and glyphs, to actually see where the overflow starts. Yet, even in the case of the Latin script that has letters such as i and m, e.g. magazine editors can find character counts useful enough for estimating how many print pages an article of a given character count length is going to fill.

As for computer user interfaces, character terminal user interfaces use a monospaced font where both i and m take up one character cell on a grid. With a monospaced font, the extended grapheme cluster count of Latin-script text corresponds directly to the display space taken. The same obviously applies to the Greek and Cyrillic scripts, which are so close to the Latin script that fonts even intend to reuse glyphs across these scripts. In contrast, CJK ideographs, Japanese kana, and Hangul syllables take two cells of a terminal grid. From the CJK perspective, these are full-width characters and the ASCII characters are half-width characters. There also exist half-width katakana characters, which fit into an 8-bit encoding with ASCII and take one cell on the terminal grid and, therefore, are technically easier to fit to Latin script-oriented terminal systems. The display width on a terminal also corresponds to the byte count in the legacy CJK encodings: ASCII takes one byte, while a CJK ideograph, a full-width kana, or a Hangul syllable takes two bytes. In the case of Shift_JIS, half-width katakana takes one byte per character.

This brings us to the concept of East Asian Width. ASCII and half-width katakana characters are narrow. CJK ideographs, full-width kana, and Hangul syllables are wide. However, even in a worldview that is split into Latin, Greek, and Cyrillic on the one hand and Chinese, Japanese, and Korean on the other, there are ambiguities. From the perspective of European legacy encodings, Greek and Cyrillic (as well as accented Latin) characters are as wide as ASCII. However, in legacy CJK encodings, Greek and Cyrillic characters take two bytes. This means that in terms of East Asian Width, a string can have a general-purpose width, which resolves these ambiguous characters as narrow, or legacy CJK-context width, which resolves these ambiguous characters as wide.
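To make the two resolutions concrete, here is a toy sketch in Rust. The `Ambiguous` enum and the `cells` function are hypothetical names, and the handful of hard-coded ranges is a deliberate simplification for illustration; a real implementation would consult the East Asian Width data of the Unicode database instead.

```rust
/// How to resolve characters whose East Asian Width is ambiguous.
#[derive(Clone, Copy)]
enum Ambiguous {
    Narrow, // general-purpose context
    Wide,   // legacy CJK context
}

/// Toy cell-count estimate covering only a few hand-picked ranges.
fn cells(s: &str, ambiguous: Ambiguous) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            // Hiragana and katakana letters, CJK Unified Ideographs,
            // and Hangul syllables: wide.
            0x3041..=0x3096 | 0x30A1..=0x30FA | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3 => 2,
            // Basic Greek and Cyrillic letters: ambiguous.
            0x0391..=0x03A1 | 0x03A3..=0x03A9 | 0x03B1..=0x03C1 | 0x03C3..=0x03C9
            | 0x0410..=0x044F => match ambiguous {
                Ambiguous::Narrow => 1,
                Ambiguous::Wide => 2,
            },
            // Everything else, including ASCII and half-width katakana: narrow.
            _ => 1,
        })
        .sum()
}

fn main() {
    for text in &["im", "\u{65E5}\u{672C}\u{8A9E}", "\u{3B1}\u{3B2}\u{3B3}"] {
        println!(
            "{}: {} cells (general-purpose), {} cells (legacy CJK context)",
            text,
            cells(text, Ambiguous::Narrow),
            cells(text, Ambiguous::Wide)
        );
    }
}
```

Only the Greek sample changes between the two modes; the Latin and Han samples come out the same either way.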

So is the general-purpose variant (that resolves Greek and Cyrillic characters as narrow) of East Asian Width the one true string length measure? Well, no.

First of all, the concept ignores all scripts that are geographically and in Unicode order between Latin, Greek, and Cyrillic on one hand and CJK on the other (even though some other scripts that are structurally similar to the Latin, Greek, and Cyrillic scripts and make sense for a monospaced font, such as Armenian and the Georgian scripts, fit this concept, too, despite not having a history in pre-Unicode CJK context). As it happens, though, emoji do fit into the concept, except for weird errors in the Unicode database. After all, emoji originate from Japan and were two bytes each when represented using the private use area of Shift_JIS.

Second, the concept assumes that there is one-to-one correspondence between scalar values and extended grapheme clusters. If we run this Rust program:

use unicode_width::UnicodeWidthStr;

fn main() {
    println!("{}", "🤦🏼‍♂️".width());
}

it prints:

5

This is because the base emoji is wide (2), the combining skin tone modifier is also wide (2), the male sign is counted as narrow (1), and the zero-width joiner and the variation selector are treated as control characters that don’t count towards width. Obviously, this is not the answer that we want. The answer we want is 2. Ideas that come to mind immediately, such as only counting the width of the first character in an extended grapheme cluster or taking the width of the widest character in an extended grapheme cluster, don’t work, because flag emoji consist of two regional indicator symbol letter characters both of which have East Asian Width of Neutral (i.e. they are counted as narrow but are not marked as narrow, because they are considered to exist outside the domain of East Asian typography). I’m not aware of any official Unicode definition that would reliably return 2 as the width of every kind of emoji. 😭
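The arithmetic behind that 5 can be spelled out with the standard library alone. In this sketch the per-scalar widths are copied from the explanation above rather than looked up in the Unicode database, and the names `PARTS` and `summed_width` are hypothetical.

```rust
/// The title emoji's scalar values paired with the widths quoted in the text.
const PARTS: [(char, usize); 5] = [
    ('\u{1F926}', 2), // FACE PALM: wide
    ('\u{1F3FC}', 2), // EMOJI MODIFIER FITZPATRICK TYPE-3: wide
    ('\u{200D}', 0),  // ZERO WIDTH JOINER: not counted
    ('\u{2642}', 1),  // MALE SIGN: narrow
    ('\u{FE0F}', 0),  // VARIATION SELECTOR-16: not counted
];

/// Naive width: add up the per-scalar widths, ignoring clustering.
fn summed_width(parts: &[(char, usize)]) -> usize {
    parts.iter().map(|&(_, w)| w).sum()
}

fn main() {
    // Reassemble the emoji from its scalar values.
    let emoji: String = PARTS.iter().map(|&(c, _)| c).collect();
    println!("{} sums to {} cells", emoji, summed_width(&PARTS));
}
```

Summing per scalar value is what yields 5 even though the cluster occupies what a reader would call one wide cell pair.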

If you really must estimate display size without running text layout with a font, whether the extended grapheme cluster count or the East Asian Width of the string works better depends on context.

Arbitrary but Fair Quotas

In some cases there is a desire to impose a length limit that doesn’t arise from a strict storage limitation. For example, in the STUN protocol given earlier, presumably there is a desire to make it so that human-readable error messages cannot make protocol messages arbitrarily long. Similarly, in the case of Twitter, tweets being short is a core part of the type of expression that Twitter is about, so some definition of “short” is needed. In the case of string-based browser localStorage, there is a need to have some limit, but the limit is necessarily arbitrary and does not need to strictly map to bytes on disk.

In cases like this, there seems to be some concern that the limit should be internationally fair. Observations that UTF-8 and UTF-16 take a different amount of storage per character depending on the character superficially suggest that the UTF-8 length or the UTF-16 length might be unfair internationally.
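The per-character storage differences behind that concern are easy to observe. A small Rust sketch (the function name `lengths` is hypothetical) that reports all three counts for a string:

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, scalar values) for a string.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // One ASCII letter, one accented Latin letter, one Han character, one emoji.
    for sample in &["a", "\u{E9}", "\u{4EBA}", "\u{1F926}"] {
        let (utf8, utf16, scalars) = lengths(sample);
        println!("{}: UTF-8 {}, UTF-16 {}, scalar values {}", sample, utf8, utf16, scalars);
    }
}
```

Each sample is a single scalar value, yet the UTF-8 length ranges from one to four bytes and the UTF-16 length from one to two code units.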

What’s fair, though? The usual concern goes that UTF-8 favors English, because English takes one byte per character, and disfavors CJK, because Chinese, Japanese, and Korean take three bytes per character, so UTF-8 is unfair to CJK. This kind of analysis ignores how much information is conveyed per character. To assess what lengths we get for different languages when the amount of information conveyed is kept constant, I looked at the counts for the translations of the Universal Declaration of Human Rights. This is a document for which translation of the same content is available in particularly many languages, which is why I used it as the measurement corpus.

Unfortunately, not all translations contain the same text, so one needs to be careful when preparing the data for comparison. Some translations are incomplete, in some cases, very incomplete. For this reason, I included only translations in stage 4 or stage 5 along the 5-stage scale. Some translations carry the preamble with the recitals, but some do not. Some also carry historical notes. To make the length comparable, the preamble, notes, and whitespace-only text nodes were omitted. The rest of the XML text nodes were concatenated and normalized to Unicode Normalization Form C before counting. (Source code is available.)
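The concatenation step might look roughly like the following. This is an illustrative sketch, not the published source code: the function name `concat_nodes` is hypothetical, XML parsing is left out, and the input is assumed to be in Normalization Form C already, because the Rust standard library has no normalizer.

```rust
/// Concatenates text nodes, skipping the whitespace-only ones.
/// Assumes the nodes are already in Normalization Form C.
fn concat_nodes(nodes: &[&str]) -> String {
    nodes
        .iter()
        .filter(|node| !node.trim().is_empty())
        .copied()
        .collect()
}

fn main() {
    let nodes = ["All human beings", "\n    ", " are born free"];
    let text = concat_nodes(&nodes);
    println!("{:?} has UTF-8 length {}", text, text.len());
}
```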

Let’s look at the result. The table at the end of this document is sortable and is initially sorted by UTF-8 length. Each Δ% column shows how much the count in the column to its left deviates from the median count for that measure. (A note about color-coding. Coloring longer than median as red should not be taken to imply that those languages are somehow bad. It’s meant to imply that a length quota treats those languages badly.) In the table, the name of each language links to the translation in that language hosted on the site of the Unicode Consortium. The linked HTML versions may include the preamble and/or notes.
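For readers who want to recompute a Δ% value, a sketch of the calculation follows. The function names are hypothetical, the sample counts are made up rather than taken from the table, and averaging the two middle values for an even number of rows is an assumption about how the median was taken.

```rust
/// Median of a non-empty slice of counts.
/// Assumption: for an even number of counts, the two middle values are averaged.
fn median(counts: &[u32]) -> f64 {
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
    } else {
        sorted[mid] as f64
    }
}

/// Percentage by which `count` deviates from `median`.
fn delta_percent(count: u32, median: f64) -> f64 {
    (count as f64 - median) / median * 100.0
}

fn main() {
    // Made-up counts, not values from the table.
    let counts = [4, 12, 8, 10, 6];
    let m = median(&counts);
    for &count in &counts {
        println!("{:>3} {:>6.1}", count, delta_percent(count, m));
    }
}
```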

The CJK concern is alleviated when considering information conveyed. When measuring UTF-8 length, Mandarin using traditional characters is the shortest of the languages that have global name recognition! This should be expected, since the Han script pretty obviously packs more information per character than e.g. alphabetic scripts. (The globally less-known languages whose UTF-8 length is shorter than Mandarin’s (using traditional characters) are African and American Latin-script languages with a relatively small native speaker population for each—only one with a native speaker population exceeding a million and many whose native speaker population is smaller than 100 000, which explains why you might not recognize their names.)

Korean is also shorter than median in UTF-8 length. This also makes sense, since Hangul syllables pack three or two alphabetic jamo into one three-byte character. The UTF-8 length of Japanese is over median but only by 4.1%. The Japanese version of the text is 48% kanji and 52% hiragana. Japanese Wikipedia has almost the same kana to kanji ratio, though different kana: 46% kanji and the rest almost evenly split between hiragana and katakana, so we may assume the Universal Declaration of Human Rights to be representative of Japanese text in terms of kana to kanji ratio.

When sorting by UTF-16 code unit count, UTF-32 / scalar value count, or extended grapheme cluster count, CJK are the shortest. While it’s true that UTF-8 takes more bytes for CJK than UTF-16, the notion of UTF-8 being particularly disfavorable to CJK is not true relative to other languages. Rather, UTF-16 is particularly favorable to CJK. In particular, the Han script is so information-dense that even when sorting by East Asian Width, which effectively doubles the length of CJK but not other languages, Han-script languages stay clustered at the start of the table. Korean and Japanese move further but remain below median.

The language with the longest UTF-8 length is Shan, which uses the Burmese script. The Burmese language, also using the Burmese script, is the second-longest in UTF-8 length. There are a number of other Brahmic-script languages among the ones with the longest UTF-8 length. They use three bytes per character but don’t have CJK-like per-character information density. These languages are below median in extended grapheme cluster count. In scalar value count, they intermingle with alphabetic languages.

It’s not clear if the concepts of median and mean (average) are meaningful. Does it make sense for a language with tens of millions of native speakers to count as an equal data point as a language with tens of thousands of native speakers? Since this is about writing, should the numbers of writers be considered instead? (I.e. should literacy rates be taken into account?) In the hope that with a large number of languages in the table, median hand-wavily sorts out this kind of issue, I chose to compare with median. At least the Han-script languages have comparable numbers of native speakers as the Brahmic-script languages and provide a counter-weight at the other end of the spectrum of UTF-8 length. In any case, for measures other than UTF-8 length, median and mean are very close to each other.

Saying that Brahmic-script languages intermingle with alphabetic languages in character count is rather meaningless, though. In character count, after CJK (and Han-script Vietnamese and Yi-script Nuosu), the language with the smallest character count is a Latin-script language (Waama). Also, the language with the largest character count is a Latin-script language (Ashéninka, Pichis). (I find it odd that in UTF-8 length Ashéninka Perené is the second-shortest but Ashéninka, Pichis is long enough to reach the Brahmic cluster. I don’t know what the relation of these two languages is and what explains two languages whose name suggests close relation ending up in opposite extremes in length. Update: It has been pointed out to me that the supposed Ashéninka Perené translation is a mislabeled duplicate of the Cashinahua translation.)

One might hypothesize that the Latin script has just been put to so many uses that some of the uses have to be far from what it has been optimized for. Yet, when considering language-specific alphabets, the character counts for Greek and Georgian are above median. It just is the case that languages are different. In that sense, the whole notion of trying to find a simple length measure that is fair across languages seems folly.

Let’s look at the factor between the minimum and maximum of each measure, i.e. the factor with which the minimum needs to be multiplied to get the maximum. Let’s even ignore the outlier for maximum for each measure and use the second largest value instead of the largest value for each count. (Otherwise, Ashéninka, Pichis alone would skew the numbers a lot.) We get these factors:

UTF-8 8.6
UTF-16 7.9
UTF-32 7.9
EGC 7.9
EAW 4.3
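The factor computation described above can be sketched as follows. The function name `spread_factor` is hypothetical and the sample counts are made up; the real inputs are the columns of the table at the end.

```rust
/// Ratio of the second-largest count to the smallest count.
/// Returns None for fewer than two counts or when the minimum is zero.
fn spread_factor(counts: &[u32]) -> Option<f64> {
    if counts.len() < 2 {
        return None;
    }
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let min = sorted[0];
    // Skip the single largest value, treating it as an outlier.
    let second_largest = sorted[sorted.len() - 2];
    if min == 0 {
        return None;
    }
    Some(second_largest as f64 / min as f64)
}

fn main() {
    // Made-up counts: 5000 plays the role of the outlier.
    let counts = [100, 250, 400, 5000];
    println!("{:?}", spread_factor(&counts));
}
```

With the outlier included the ratio for these made-up counts would be 50, which shows how much a single extreme row can dominate the measure.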

UTF-16, UTF-32, and extended grapheme clusters aren’t distinguished by this measure, because the languages at the extremes use characters from the Basic Multilingual Plane with one character per grapheme cluster. Considering that there are supplementary-plane scripts, arguably the UTF-32 count would be fairer than the UTF-16 count even though this factor doesn’t show the difference. It’s not clear that counting extended grapheme clusters would be particularly fair compared to counting characters: It favors scripts that are visually joining over scripts that aren’t visually joining even if there’s no logical difference. While looking at just the factor, East Asian Width makes the gap the smallest, but it’s a rather imprecise fairness solution. It just counts CJK as double. Even after this, the Han-script languages are still among the ones with the smallest counts. On the other hand, it seems unfair to recognize Hangul syllables and kana as carrying more information than an alphabetic character while not giving the same treatment to other syllabaries, such as the Ethiopic script, Ge’ez.

Twitter counts each CJK character (including three-jamo Hangul syllables; i.e. it is not decomposing Hangul and treating it as alphabetic) as consuming 2 units of the quota (as when counting East Asian Width), counts emoji as consuming two units (even when East Asian Width of the cluster would be more), and, unlike East Asian Width, counts each Ethiopic syllable as consuming two units of the quota. What Twitter does seems fairer than just applying East Asian Width, but the result is still that the amount of information that can be packed in a tweet can vary four-fold depending on language. That still doesn’t seem exactly fair across languages.

In closing:

  • There is no simple measure of string length that would be fair in terms of how much information can be conveyed within a length quota regardless of language.
  • Of solutions that don’t depend on the Unicode database and, therefore, the Unicode version and that don’t ad-hoc hard-code character ranges according to a particular version of Unicode, counting characters aka. scalar values i.e. UTF-32 length is the best that can be done. It’s still wildly unfair leading to almost eight-fold differences in how much information can be conveyed. This is not a flaw of Unicode but arises from differences in languages and writing systems.
  • While counting scalar values is fairer than just counting UTF-8 or UTF-16 code units, the factor between minimum and maximum UTF-8 length is so close to the factor between minimum and maximum UTF-32 length, both of which are pretty large, that instead of putting thought into using the scalar value length instead of the UTF-8 length or the UTF-16 length, it’s probably better to put the thought into reconsidering if you really need to impose such a limit.
  • Unicode doesn’t provide a good database-based definition that would improve upon the character count in terms of normalizing the amount of information conveyed. While East Asian Width brings minimum and maximum closer, it unfairly singles out Hangul syllables and kana without considering other syllabaries, because normalizing length for information conveyed is not the purpose of East Asian Width.
  • Even if per-script (possibly non-integer) weights assigned to characters could make things fairer, it wouldn’t work well for the Latin script, which is all over the place in terms of language-dependent length.
Name UTF-8 Δ% UTF-16 Δ% UTF-32 Δ% EGC Δ% EAW Δ% Script
Cashinahua 4170 -57.6 4135 -53.0 4135 -52.9 4135 -52.3 4135 -52.4 Latn
Ashéninka Perené 4170 -57.6 4135 -53.0 4135 -52.9 4135 -52.3 4135 -52.4 Latn
Waama 4293 -56.3 4011 -54.4 4011 -54.4 4007 -53.8 4007 -53.9 Latn
Chickasaw 4850 -50.6 4685 -46.7 4685 -46.7 4587 -47.1 4587 -47.2 Latn
Bulu 4919 -49.9 4808 -45.3 4808 -45.3 4808 -44.5 4808 -44.7 Latn
Kulango, Bouna 5286 -46.2 4164 -52.6 4164 -52.6 4164 -51.9 4164 -52.1 Latn
Zapotec, Miahuatlán 5464 -44.4 5433 -38.2 5433 -38.2 5433 -37.3 5433 -37.5 Latn
Nyamwezi 5750 -41.5 5686 -35.3 5686 -35.3 5686 -34.4 5686 -34.6 Latn
Kaonde 5972 -39.2 5972 -32.1 5972 -32.0 5972 -31.1 5972 -31.3 Latn
Mixtec, Metlatónoc 6023 -38.7 5630 -36.0 5630 -35.9 5611 -35.2 5611 -35.4 Latn
Makonde 6100 -37.9 5946 -32.4 5946 -32.3 5946 -31.4 5946 -31.6 Latn
Sharanahua 6165 -37.3 6162 -29.9 6162 -29.9 6162 -28.9 6162 -29.1 Latn
Serer-Sine 6166 -37.3 6079 -30.9 6079 -30.8 6079 -29.8 6079 -30.0 Latn
Dinka, Northeastern 6214 -36.8 5815 -33.9 5815 -33.8 5775 -33.4 5775 -33.5 Latn
Okiek 6272 -36.2 6272 -28.7 6272 -28.6 6272 -27.6 6271 -27.8 Latn
Jola-Fonyi 6299 -35.9 6122 -30.4 6122 -30.3 6122 -29.3 6122 -29.5 Latn
Maninkakan, Eastern 6372 -35.2 5867 -33.3 5867 -33.2 5867 -32.3 5867 -32.5 Latn
Chinantec, Ojitlán 6463 -34.2 5957 -32.3 5957 -32.2 5957 -31.3 5957 -31.4 Latn
Soninke 6496 -33.9 6430 -26.9 6430 -26.8 6430 -25.8 6430 -26.0 Latn
Chokwe (Angola) 6596 -32.9 6565 -25.3 6565 -25.3 6565 -24.2 6565 -24.4 Latn
Chinese, Mandarin (Traditional) 6606 -32.8 2202 -75.0 2202 -74.9 2202 -74.6 4404 -49.3 Hant
Otomi, Mezquital 6614 -32.7 6438 -26.8 6438 -26.7 6379 -26.4 6379 -26.6 Latn
Chinese, Mandarin (Simplified) 6708 -31.7 2278 -74.1 2278 -74.1 2278 -73.7 4493 -48.3 Hans
Quechua (Unified Quichua, old Hispanic orthography) 6713 -31.7 6670 -24.2 6670 -24.1 6670 -23.0 6670 -23.2 Latn
Shilluk 6798 -30.8 6036 -31.4 6036 -31.3 6036 -30.3 6036 -30.5 Latn
Colorado 6798 -30.8 6797 -22.7 6797 -22.7 6796 -21.6 6794 -21.8 Latn
Dendi 6823 -30.6 6327 -28.1 6327 -28.0 6325 -27.0 6325 -27.2 Latn
Chinese, Jinyu 6848 -30.3 2284 -74.0 2284 -74.0 2284 -73.6 4566 -47.4 Hans
Chinese, Min Nan 6887 -29.9 2297 -73.9 2297 -73.9 2297 -73.5 4592 -47.1 Hans
Chinese, Gan 6889 -29.9 2297 -73.9 2297 -73.9 2297 -73.5 4593 -47.1 Hans
Vietnamese (Han nom) 6910 -29.7 2564 -70.8 2224 -74.7 2224 -74.3 4397 -49.4 Hani
Chinese, Hakka 6929 -29.5 2311 -73.7 2311 -73.7 2311 -73.3 4620 -46.8 Hans
Lunda 6968 -29.1 6968 -20.8 6968 -20.7 6968 -19.6 6968 -19.8 Latn
Chinese, Yue 6973 -29.0 2325 -73.6 2325 -73.5 2325 -73.2 4648 -46.5 Hani
Pular 6991 -28.9 6991 -20.5 6991 -20.4 6991 -19.3 6991 -19.5 Latn
Limba, West-Central 7007 -28.7 6257 -28.8 6257 -28.8 6257 -27.8 6257 -28.0 Latn
Naga, Ao 7019 -28.6 6729 -23.5 6729 -23.4 6729 -22.3 6729 -22.5 Latn
Mazahua Central 7052 -28.2 6750 -23.2 6750 -23.2 6517 -24.8 6517 -25.0 Latn
Chinese, Wu 7082 -27.9 2362 -73.1 2362 -73.1 2362 -72.7 4722 -45.6 Hans
Kpelle, Guinea 7139 -27.4 6136 -30.2 6136 -30.2 6136 -29.2 6136 -29.4 Latn
Amis 7206 -26.7 7206 -18.1 7206 -18.0 7206 -16.8 7206 -17.1 Latn
Baatonum 7255 -26.2 6788 -22.8 6788 -22.8 6779 -21.8 6779 -22.0 Latn
Tetun 7280 -25.9 7280 -17.2 7280 -17.2 7280 -16.0 7280 -16.2 Latn
Chinantec, Chiltepec 7304 -25.7 6468 -26.4 6468 -26.4 6262 -27.7 6262 -27.9 Latn
(Maiunan) 7312 -25.6 7312 -16.9 7312 -16.8 7312 -15.6 7312 -15.8 Latn
Tetun Dili 7357 -25.1 7225 -17.8 7225 -17.8 7225 -16.6 7225 -16.8 Latn
(Minjiang, written) 7366 -25.0 7363 -16.3 7363 -16.2 7363 -15.0 7363 -15.3 Latn
Quechua, Cusco 7369 -25.0 7309 -16.9 7309 -16.8 7309 -15.6 7309 -15.9 Latn
(Mijisa) 7393 -24.8 7393 -15.9 7393 -15.9 7393 -14.7 7392 -14.9 Latn
Drung 7412 -24.6 7412 -15.7 7412 -15.7 7412 -14.5 7412 -14.7 Latn
Mazatec, Ixcatlán 7442 -24.3 7261 -17.4 7261 -17.4 7261 -16.2 7261 -16.4 Latn
Rwanda 7456 -24.1 7456 -15.2 7456 -15.2 7456 -14.0 7456 -14.2 Latn
(Minjiang, spoken) 7512 -23.6 7509 -14.6 7509 -14.6 7509 -13.3 7509 -13.6 Latn
Sukuma 7532 -23.4 7452 -15.3 7452 -15.2 7452 -14.0 7452 -14.2 Latn
Makhuwa 7562 -23.0 7398 -15.9 7398 -15.8 7398 -14.6 7398 -14.8 Latn
Aymara, Central 7568 -23.0 7363 -16.3 7363 -16.2 7363 -15.0 7363 -15.3 Latn
Ido 7580 -22.9 7580 -13.8 7580 -13.7 7580 -12.5 7580 -12.8 Latn
Záparo 7591 -22.8 7583 -13.8 7583 -13.7 7583 -12.5 7583 -12.7 Latn
Bamanankan 7597 -22.7 6890 -21.7 6890 -21.6 6890 -20.5 6890 -20.7 Latn
Nyankore 7628 -22.4 7628 -13.3 7628 -13.2 7628 -12.0 7628 -12.2 Latn
Ndebele 7659 -22.1 7659 -12.9 7659 -12.8 7659 -11.6 7659 -11.8 Latn
Sãotomense 7712 -21.5 6956 -20.9 6956 -20.8 6956 -19.7 6956 -19.9 Latn
Pijin 7716 -21.5 7716 -12.3 7716 -12.2 7716 -11.0 7716 -11.2 Latn
Latin 7747 -21.2 7747 -11.9 7747 -11.8 7747 -10.6 7747 -10.8 Latn
Susu 7757 -21.1 7310 -16.9 7310 -16.8 7310 -15.6 7310 -15.9 Latn
Oroqen 7768 -21.0 7761 -11.7 7761 -11.7 7761 -10.4 7761 -10.7 Latn
Lozi 7825 -20.4 7825 -11.0 7825 -11.0 7825 -9.7 7825 -9.9 Latn
Latin (1) 7869 -19.9 7869 -10.5 7869 -10.5 7869 -9.2 7869 -9.4 Latn
Otuho 7890 -19.7 7890 -10.3 7890 -10.2 7801 -10.0 7712 -11.2 Latn
Achuar-Shiwiar (1) 7893 -19.7 7842 -10.8 7842 -10.8 7785 -10.2 7728 -11.0 Latn
Huastec (Veracruz) 7911 -19.5 7882 -10.4 7882 -10.3 7882 -9.0 7882 -9.3 Latn
Umbundu (011) 7941 -19.2 7910 -10.1 7910 -10.0 7910 -8.7 7910 -9.0 Latn
Nuosu 7953 -19.1 2663 -69.7 2663 -69.7 2663 -69.3 5308 -38.9 Yiii
Even 7969 -18.9 4320 -50.9 4320 -50.8 4320 -50.1 4320 -50.3 Cyrl
Q'eqchi' 7981 -18.8 7981 -9.2 7981 -9.2 7981 -7.9 7981 -8.1 Latn
Moba 7985 -18.7 7726 -12.1 7726 -12.1 7726 -10.8 7726 -11.1 Latn
Mam, Northern 7994 -18.7 7994 -9.1 7994 -9.0 7994 -7.7 7994 -8.0 Latn
Kabiyé 7997 -18.6 6193 -29.6 6193 -29.5 6193 -28.5 6193 -28.7 Latn
Kanuri, Central 8077 -17.8 7621 -13.3 7621 -13.3 7621 -12.0 7621 -12.3 Latn
Esperanto 8095 -17.6 7930 -9.8 7930 -9.8 7930 -8.5 7930 -8.7 Latn
Serbian (Latin) 8102 -17.6 7876 -10.4 7876 -10.4 7876 -9.1 7876 -9.3 Latn
Urarina 8127 -17.3 8125 -7.6 8125 -7.5 8125 -6.2 8125 -6.5 Latn
Kurdish, Central 8163 -16.9 7462 -15.1 7462 -15.1 7462 -13.9 7462 -14.1 Latn
Kurdish, Northern 8163 -16.9 7462 -15.1 7462 -15.1 7462 -13.9 7462 -14.1 Latn
Huitoto, Murui 8179 -16.8 7523 -14.5 7523 -14.4 7523 -13.2 7523 -13.4 Latn
Croatian 8201 -16.5 7996 -9.1 7996 -9.0 7996 -7.7 7996 -8.0 Latn
Bemba 8206 -16.5 8206 -6.7 8206 -6.6 8206 -5.3 8206 -5.5 Latn
Waorani 8209 -16.5 8137 -7.5 8137 -7.4 8052 -7.1 7967 -8.3 Latn
Gonja 8215 -16.4 7579 -13.8 7579 -13.8 7579 -12.5 7579 -12.8 Latn
Scots 8224 -16.3 8224 -6.5 8224 -6.4 8224 -5.1 8224 -5.3 Latn
Ndonga 8239 -16.2 8239 -6.3 8239 -6.2 8239 -4.9 8239 -5.2 Latn
Garifuna 8243 -16.1 7721 -12.2 7721 -12.1 7721 -10.9 7721 -11.1 Latn
Bosnian (Latin) 8259 -16.0 8049 -8.5 8049 -8.4 8049 -7.1 8049 -7.4 Latn
Twi (Akuapem) 8264 -15.9 7653 -13.0 7653 -12.9 7653 -11.7 7653 -11.9 Latn
Zulu 8265 -15.9 8261 -6.1 8261 -6.0 8261 -4.7 8261 -4.9 Latn
Guarayu 8280 -15.7 8098 -7.9 8098 -7.9 8098 -6.5 8098 -6.8 Latn
Swahili 8315 -15.4 8315 -5.4 8315 -5.4 8315 -4.0 8315 -4.3 Latn
Zhuang, Yongbei 8318 -15.4 8316 -5.4 8316 -5.4 8316 -4.0 8316 -4.3 Latn
Wolof 8321 -15.3 7940 -9.7 7940 -9.6 7940 -8.4 7940 -8.6 Latn
Zapotec, Güilá 8364 -14.9 8328 -5.3 8328 -5.2 8328 -3.9 8328 -4.1 Latn
Oromo, Borana-Arsi-Guji 8381 -14.7 8381 -4.7 8381 -4.6 8381 -3.3 8381 -3.5 Latn
Welsh 8382 -14.7 8247 -6.2 8247 -6.2 8247 -4.8 8247 -5.1 Latn
Tok Pisin 8399 -14.5 8393 -4.6 8393 -4.5 8393 -3.1 8393 -3.4 Latn
Awa-Cuaiquer 8405 -14.5 8391 -4.6 8391 -4.5 8309 -4.1 8227 -5.3 Latn
Luvale 8411 -14.4 8411 -4.4 8411 -4.3 8411 -2.9 8411 -3.2 Latn
Crioulo, Upper Guinea (008) 8414 -14.4 8225 -6.5 8225 -6.4 8225 -5.1 8225 -5.3 Latn
Afrikaans 8427 -14.2 8365 -4.9 8365 -4.8 8365 -3.5 8365 -3.7 Latn
Faroese 8454 -14.0 7854 -10.7 7854 -10.6 7854 -9.4 7854 -9.6 Latn
Fulfulde, Nigerian 8455 -14.0 8135 -7.5 8135 -7.4 8135 -6.1 8135 -6.4 Latn
Norwegian, Nynorsk 8461 -13.9 8268 -6.0 8268 -5.9 8268 -4.6 8268 -4.8 Latn
Yagua 8468 -13.8 8432 -4.1 8432 -4.1 8432 -2.7 8432 -2.9 Latn
Rundi 8498 -13.5 8498 -3.4 8498 -3.3 8498 -1.9 8498 -2.2 Latn
Norwegian, Bokmål 8500 -13.5 8360 -4.9 8360 -4.9 8360 -3.5 8360 -3.8 Latn
Umbundu 8503 -13.5 8415 -4.3 8415 -4.2 8415 -2.9 8415 -3.1 Latn
English 8565 -12.8 8555 -2.7 8555 -2.7 8555 -1.3 8555 -1.5 Latn
Yao 8574 -12.8 8574 -2.5 8574 -2.4 8574 -1.1 8574 -1.3 Latn
Nomatsiguenga 8575 -12.7 8432 -4.1 8432 -4.1 8432 -2.7 8432 -2.9 Latn
Mapudungun 8585 -12.6 8366 -4.9 8366 -4.8 8366 -3.5 8366 -3.7 Latn
Fijian 8586 -12.6 8584 -2.4 8584 -2.3 8584 -0.9 8584 -1.2 Latn
Tamazight, Central Atlas 8587 -12.6 8226 -6.5 8226 -6.4 8226 -5.1 8226 -5.3 Latn
Nyanja (Chinyanja) 8590 -12.6 8590 -2.3 8590 -2.3 8590 -0.9 8590 -1.1 Latn
Yapese 8635 -12.1 8473 -3.7 8473 -3.6 8473 -2.2 8473 -2.5 Latn
Crioulo, Upper Guinea 8636 -12.1 8632 -1.8 8632 -1.8 8632 -0.4 8632 -0.6 Latn
Secoya 8651 -12.0 8155 -7.3 8155 -7.2 8137 -6.1 8137 -6.3 Latn
Wayuu 8664 -11.8 8077 -8.2 8077 -8.1 8077 -6.8 8077 -7.0 Latn
Lingala 8668 -11.8 8654 -1.6 8654 -1.5 8654 -0.1 8654 -0.4 Latn
Haitian Creole French (Kreyol) 8680 -11.7 8535 -2.9 8535 -2.9 8535 -1.5 8535 -1.8 Latn
Tonga 8685 -11.6 8685 -1.2 8685 -1.2 8685 0.2 8685 -0.0 Latn
Seselwa Creole French 8706 -11.4 8697 -1.1 8697 -1.0 8697 0.4 8697 0.1 Latn
Mende 8707 -11.4 8010 -8.9 8010 -8.9 8010 -7.6 8010 -7.8 Latn
Nyanja (Chechewa) 8725 -11.2 8725 -0.8 8725 -0.7 8725 0.7 8725 0.4 Latn
Hani 8767 -10.8 8767 -0.3 8767 -0.2 8767 1.2 8767 0.9 Latn
Slovenian 8772 -10.7 8520 -3.1 8520 -3.0 8520 -1.7 8520 -1.9 Latn
Hmong, Southern Qiandong 8792 -10.5 8792 -0.0 8792 0.0 8792 1.5 8792 1.2 Latn
Chokwe 8808 -10.4 8808 0.2 8808 0.2 8808 1.7 8808 1.4 Latn
Pipil 8831 -10.1 8825 0.4 8825 0.4 8825 1.8 8825 1.6 Latn
(Bizisa) 8847 -10.0 8847 0.6 8847 0.7 8847 2.1 8847 1.8 Latn
Quechua, Cajamarca 8858 -9.9 8851 0.6 8851 0.7 8851 2.1 8851 1.9 Latn
Kasem 8868 -9.8 8445 -4.0 8445 -3.9 8445 -2.5 8445 -2.8 Latn
Romani, Balkan 8875 -9.7 8606 -2.1 8606 -2.1 8606 -0.7 8606 -0.9 Latn
Turkish 8877 -9.7 8225 -6.5 8225 -6.4 8225 -5.1 8225 -5.3 Latn
Fante 8898 -9.5 8229 -6.4 8229 -6.4 8229 -5.0 8229 -5.3 Latn
Basque 8907 -9.4 8907 1.3 8907 1.4 8907 2.8 8907 2.5 Latn
Ganda 8962 -8.8 8962 1.9 8962 2.0 8962 3.4 8962 3.2 Latn
Occitan 8963 -8.8 8661 -1.5 8661 -1.4 8661 -0.0 8661 -0.3 Latn
Xhosa 8969 -8.7 8881 1.0 8881 1.1 8881 2.5 8881 2.2 Latn
Breton 8982 -8.6 8661 -1.5 8661 -1.4 8661 -0.0 8661 -0.3 Latn
Veps 8985 -8.6 8428 -4.2 8428 -4.1 8428 -2.7 8428 -3.0 Latn
Quechua, Arequipa-La Unión 8988 -8.5 8969 2.0 8969 2.1 8969 3.5 8969 3.2 Latn
Friulian 9003 -8.4 8688 -1.2 8688 -1.1 8688 0.3 8688 0.0 Latn
Swedish 9008 -8.3 8612 -2.1 8612 -2.0 8612 -0.6 8612 -0.9 Latn
Danish 9010 -8.3 8831 0.4 8831 0.5 8831 1.9 8831 1.6 Latn
Aromanian 9020 -8.2 8694 -1.1 8694 -1.1 8694 0.3 8694 0.1 Latn
Madura 9023 -8.2 9023 2.6 9023 2.7 9023 4.1 9023 3.9 Latn
Romani, Balkan (1) 9035 -8.1 8739 -0.6 8739 -0.6 8739 0.9 8739 0.6 Latn
Chayahuita 9065 -7.8 8639 -1.8 8639 -1.7 8639 -0.3 8639 -0.6 Latn
Icelandic 9070 -7.7 8249 -6.2 8249 -6.1 8249 -4.8 8249 -5.1 Latn
Krio 9086 -7.5 8139 -7.4 8139 -7.4 8139 -6.1 8139 -6.3 Latn
Estonian 9093 -7.5 8800 0.1 8800 0.1 8800 1.6 8800 1.3 Latn
Aja 9099 -7.4 8077 -8.2 8077 -8.1 8069 -6.9 8069 -7.1 Latn
Sorbian, Upper 9108 -7.3 8442 -4.0 8442 -3.9 8442 -2.6 8442 -2.8 Latn
Sotho, Southern 9136 -7.0 9136 3.9 9136 4.0 9136 5.4 9136 5.2 Latn
Catalan-Valencian-Balear 9141 -7.0 8823 0.3 8823 0.4 8823 1.8 8823 1.6 Latn
Luba-Kasai 9143 -7.0 9143 4.0 9143 4.0 9143 5.5 9143 5.2 Latn
Minangkabau 9175 -6.6 9167 4.2 9167 4.3 9167 5.8 9167 5.5 Latn
Bari 9178 -6.6 8555 -2.7 8555 -2.7 8555 -1.3 8555 -1.5 Latn
Portuguese (Brazil) 9219 -6.2 8887 1.1 8887 1.1 8887 2.6 8887 2.3 Latn
Huastec (San Luís Potosí) 9222 -6.2 8826 0.4 8826 0.4 8826 1.9 8826 1.6 Latn
Czech 9225 -6.1 8126 -7.6 8126 -7.5 8126 -6.2 8126 -6.5 Latn
Purepecha 9234 -6.0 9082 3.3 9082 3.3 9082 4.8 9082 4.5 Latn
Fon 9244 -5.9 7952 -9.6 7952 -9.5 7943 -8.3 7943 -8.6 Latn
Twi (Asante) 9246 -5.9 8374 -4.8 8374 -4.7 8374 -3.4 8374 -3.6 Latn
Papiamentu 9249 -5.9 9237 5.0 9237 5.1 9237 6.6 9237 6.3 Latn
Slovak 9266 -5.7 8378 -4.7 8378 -4.7 8378 -3.3 8378 -3.6 Latn
Malagasy, Plateau 9272 -5.6 9272 5.4 9272 5.5 9272 7.0 9272 6.7 Latn
Romansch (Vallader) 9300 -5.4 9048 2.9 9048 3.0 9048 4.4 9048 4.1 Latn
Ladin 9324 -5.1 8740 -0.6 8740 -0.5 8740 0.9 8740 0.6 Latn
Mbundu 9327 -5.1 9317 5.9 9317 6.0 9317 7.5 9317 7.2 Latn
Occitan (Auvergnat) 9330 -5.1 8642 -1.7 8642 -1.7 8642 -0.3 8642 -0.5 Latn
Lithuanian 9339 -5.0 8794 0.0 8794 0.1 8794 1.5 8794 1.2 Latn
Ladino 9348 -4.9 9345 6.3 9345 6.3 9345 7.8 9345 7.6 Latn
Mískito 9353 -4.8 9345 6.3 9345 6.3 9345 7.8 9345 7.6 Latn
Assyrian Neo-Aramaic 9363 -4.7 5186 -41.0 5186 -41.0 5127 -40.8 5127 -41.0 Syrc
Waray-Waray 9387 -4.5 9387 6.7 9387 6.8 9387 8.3 9387 8.0 Latn
Korean 9391 -4.4 3856 -56.2 3856 -56.1 3856 -55.5 6623 -23.8 Hang
Somali 9403 -4.3 9403 6.9 9403 7.0 9403 8.5 9403 8.2 Latn
Finnish 9404 -4.3 9023 2.6 9023 2.7 9023 4.1 9023 3.9 Latn
Romansch (Sursilvan) 9421 -4.1 9300 5.8 9300 5.8 9300 7.3 9300 7.0 Latn
Chin, Tedim 9441 -3.9 9431 7.2 9431 7.3 9431 8.8 9431 8.6 Latn
Latvian 9447 -3.9 8582 -2.4 8582 -2.3 8582 -1.0 8582 -1.2 Latn
Romansch (Grischun) 9449 -3.8 9293 5.7 9293 5.7 9293 7.2 9293 7.0 Latn
Gagauz 9451 -3.8 8510 -3.2 8510 -3.2 8510 -1.8 8510 -2.0 Latn
Dagbani 9458 -3.8 8896 1.2 8896 1.2 8896 2.7 8896 2.4 Latn
Finnish, Kven 9464 -3.7 9123 3.7 9123 3.8 9123 5.3 9123 5.0 Latn
Corsican 9475 -3.6 8922 1.5 8922 1.5 8922 3.0 8922 2.7 Latn
Koongo (Angola) 9486 -3.5 9416 7.1 9416 7.1 9416 8.7 9356 7.7 Latn
Ditammari 9487 -3.5 7867 -10.5 7867 -10.5 7748 -10.6 7748 -10.8 Latn
Portuguese (Portugal) 9501 -3.3 9154 4.1 9154 4.2 9154 5.6 9154 5.4 Latn
Manx 9504 -3.3 9440 7.3 9440 7.4 9440 8.9 9440 8.7 Latn
Chamorro 9506 -3.3 9504 8.1 9504 8.1 9504 9.7 9504 9.4 Latn
Galician 9510 -3.2 9223 4.9 9223 4.9 9223 6.4 9223 6.2 Latn
Occitan (Languedocien) 9522 -3.1 9364 6.5 9364 6.6 9364 8.1 9364 7.8 Latn
Romansch (Puter) 9538 -2.9 9303 5.8 9303 5.9 9303 7.4 9303 7.1 Latn
Ligurian 9557 -2.7 8942 1.7 8942 1.8 8942 3.2 8942 2.9 Latn
Quechua, Huaylas Ancash 9563 -2.7 9471 7.7 9471 7.8 9471 9.3 9471 9.0 Latn
Mizo 9576 -2.6 9489 7.9 9489 8.0 9489 9.5 9489 9.2 Latn
Tiv 9585 -2.5 9490 7.9 9490 8.0 9490 9.5 9490 9.2 Latn
Interlingua 9588 -2.4 9588 9.0 9588 9.1 9588 10.7 9588 10.4 Latn
Koongo 9596 -2.4 9596 9.1 9596 9.2 9596 10.7 9596 10.5 Latn
Pohnpeian 9603 -2.3 9603 9.2 9603 9.3 9603 10.8 9603 10.5 Latn
Polish 9613 -2.2 9111 3.6 9111 3.7 9111 5.1 9111 4.9 Latn
Ga 9614 -2.2 8262 -6.0 8262 -6.0 8257 -4.7 8257 -5.0 Latn
Kituba 9630 -2.0 9630 9.5 9630 9.6 9630 11.1 9630 10.8 Latn
Palauan 9654 -1.8 9654 9.8 9654 9.9 9654 11.4 9654 11.1 Latn
Guaraní, Paraguayan 9658 -1.7 8956 1.8 8956 1.9 8956 3.4 8956 3.1 Latn
Frisian, Western 9660 -1.7 9495 8.0 9495 8.0 9495 9.6 9495 9.3 Latn
Albanian, Tosk 9703 -1.3 8972 2.0 8972 2.1 8972 3.5 8972 3.3 Latn
Italian 9739 -0.9 9674 10.0 9674 10.1 9674 11.6 9674 11.3 Latn
Marshallese 9758 -0.7 9758 11.0 9758 11.0 9758 12.6 9758 12.3 Latn
Spanish 9759 -0.7 9574 8.9 9574 8.9 9574 10.5 9574 10.2 Latn
Venetian 9764 -0.6 9083 3.3 9083 3.4 9083 4.8 9083 4.5 Latn
Romansch (Sutsilvan) 9764 -0.6 9459 7.6 9459 7.6 9459 9.2 9459 8.9 Latn
Huastec (Sierra de Otontepec) 9778 -0.5 9430 7.2 9430 7.3 9430 8.8 9430 8.5 Latn
Comorian, Ngazidja 9783 -0.4 9783 11.2 9783 11.3 9783 12.9 9783 12.6 Latn
Lamnso' 9792 -0.4 7828 -11.0 7828 -10.9 7648 -11.7 7648 -12.0 Latn
Hawaiian 9812 -0.2 8588 -2.3 8588 -2.3 8588 -0.9 8588 -1.2 Latn
Romansch (Surmiran) 9827 0.0 9662 9.9 9662 9.9 9662 11.5 9662 11.2 Latn
German, Standard (1996) 9828 0.0 9696 10.3 9696 10.3 9696 11.9 9696 11.6 Latn
Mixe, Totontepec 9829 0.0 8351 -5.0 8351 -5.0 8351 -3.6 8351 -3.9 Latn
German, Standard (1901) 9830 0.0 9692 10.2 9692 10.3 9692 11.9 9692 11.6 Latn
Talysh 9836 0.1 8180 -7.0 8180 -6.9 8180 -5.6 8180 -5.8 Latn
Aceh 9845 0.2 9729 10.6 9729 10.7 9729 12.3 9729 12.0 Latn
Maltese 9846 0.2 9198 4.6 9198 4.7 9198 6.2 9198 5.9 Latn
Chin, Matu 9854 0.3 9840 11.9 9840 12.0 9840 13.6 9840 13.3 Latn
Asturian 9858 0.3 9636 9.6 9636 9.6 9636 11.2 9636 10.9 Latn
Gaelic, Scottish 9859 0.3 9646 9.7 9646 9.8 9646 11.3 9646 11.0 Latn
Chuukese 9878 0.5 9878 12.3 9878 12.4 9878 14.0 9878 13.7 Latn
Nyemba 9882 0.6 9881 12.4 9881 12.4 9881 14.0 9881 13.7 Latn
Amarakaeri 9917 0.9 9499 8.0 9499 8.1 9086 4.9 9086 4.6 Latn
Candoshi-Shapra 9918 0.9 9862 12.1 9862 12.2 9862 13.8 9862 13.5 Latn
Siona 9933 1.1 9161 4.2 9161 4.2 8826 1.9 8748 0.7 Latn
Dangme 9936 1.1 8796 0.0 8796 0.1 8779 1.3 8779 1.0 Latn
Shona 9943 1.2 9943 13.1 9943 13.1 9943 14.7 9943 14.4 Latn
Páez 9980 1.6 9869 12.2 9869 12.3 9869 13.9 9869 13.6 Latn
Romansch 10003 1.8 9866 12.2 9866 12.3 9866 13.9 9866 13.6 Latn
Pampangan 10005 1.8 10005 13.8 10005 13.8 10005 15.5 10005 15.2 Latn
Cebuano 10008 1.8 10008 13.8 10008 13.9 10008 15.5 10008 15.2 Latn
Tagalog 10013 1.9 10013 13.9 10013 13.9 10013 15.6 10013 15.3 Latn
Romagnolo 10029 2.1 9511 8.2 9511 8.2 9511 9.8 9511 9.5 Latn
French 10030 2.1 9598 9.1 9598 9.2 9598 10.8 9598 10.5 Latn
Sotho, Northern 10036 2.1 9771 11.1 9771 11.2 9771 12.8 9771 12.5 Latn
Indonesian 10059 2.4 10059 14.4 10059 14.5 10059 16.1 10059 15.8 Latn
Tswana 10067 2.4 10047 14.2 10047 14.3 10047 15.9 10047 15.6 Latn
Bugis 10070 2.5 10070 14.5 10070 14.6 10070 16.2 10070 15.9 Latn
Sunda 10071 2.5 10071 14.5 10071 14.6 10071 16.2 10071 15.9 Latn
Uzbek, Northern (Latin) 10088 2.7 9836 11.8 9836 11.9 9836 13.5 9836 13.2 Latn
Gaelic, Irish 10114 2.9 9591 9.1 9591 9.1 9591 10.7 9591 10.4 Latn
Hindustani, Sarnami 10116 2.9 9963 13.3 9963 13.4 9963 15.0 9963 14.7 Latn
Tzeltal, Oxchuc 10119 3.0 9780 11.2 9780 11.3 9780 12.9 9780 12.6 Latn
Turkmen (Latin) 10124 3.0 9185 4.4 9185 4.5 9185 6.0 9185 5.7 Latn
Dagaare, Southern 10141 3.2 9495 8.0 9495 8.0 9477 9.4 9477 9.1 Latn
Igbo 10151 3.3 8653 -1.6 8653 -1.5 8653 -0.1 8653 -0.4 Latn
Picard 10151 3.3 9175 4.3 9175 4.4 9175 5.9 9175 5.6 Latn
Micmac 10162 3.4 9234 5.0 9234 5.1 9234 6.6 9234 6.3 Latn
Uyghur (Latin) 10186 3.7 9999 13.7 9999 13.8 9999 15.4 9999 15.1 Latn
Malay (Latin) 10189 3.7 10188 15.9 10188 15.9 10188 17.6 10188 17.3 Latn
Azerbaijani, North (Latin) 10198 3.8 8717 -0.9 8717 -0.8 8717 0.6 8717 0.3 Latn
Japanese 10227 4.1 3437 -60.9 3437 -60.9 3437 -60.3 6832 -21.4 Jpan
Bislama 10233 4.1 10233 16.4 10233 16.4 10233 18.1 10233 17.8 Latn
Bali 10235 4.2 10235 16.4 10235 16.5 10235 18.1 10235 17.8 Latn
Occitan (Francoprovençal, Savoie) 10240 4.2 8665 -1.5 8665 -1.4 8665 0.0 8665 -0.3 Latn
Themne 10244 4.2 8323 -5.4 8323 -5.3 8323 -3.9 8323 -4.2 Latn
Karelian 10245 4.3 9874 12.3 9874 12.4 9761 12.6 9648 11.0 Latn
Dutch 10247 4.3 10246 16.5 10246 16.6 10246 18.2 10246 17.9 Latn
Bamun 10248 4.3 8744 -0.6 8744 -0.5 8744 0.9 8744 0.6 Latn
Edo 10262 4.4 10260 16.7 10260 16.8 10260 18.4 10260 18.1 Latn
Bicolano, Central 10263 4.4 10263 16.7 10263 16.8 10263 18.4 10263 18.1 Latn
Tsonga (Mozambique) 10274 4.5 10047 14.2 10047 14.3 10047 15.9 10047 15.6 Latn
Quechua, Ayacucho 10295 4.8 10273 16.8 10273 16.9 10273 18.6 10273 18.2 Latn
Mina 10300 4.8 9085 3.3 9085 3.4 9040 4.3 9040 4.1 Latn
Romanian (2006) 10303 4.8 9683 10.1 9683 10.2 9683 11.7 9683 11.5 Latn
Luxembourgeois 10306 4.9 9998 13.7 9998 13.8 9998 15.4 9998 15.1 Latn
Romanian (1993) 10311 4.9 9691 10.2 9691 10.3 9691 11.8 9691 11.5 Latn
Romanian (1953) 10317 5.0 9691 10.2 9691 10.3 9691 11.8 9691 11.5 Latn
Mozarabic 10317 5.0 10184 15.8 10184 15.9 10184 17.5 10184 17.2 Latn
Sardinian, Logudorese 10323 5.0 10195 15.9 10195 16.0 10195 17.7 10195 17.3 Latn
Haitian Creole French (Popular) 10339 5.2 10103 14.9 10103 15.0 10103 16.6 10103 16.3 Latn
Hiligaynon 10405 5.9 10405 18.3 10405 18.4 10405 20.1 10405 19.8 Latn
Shor 10414 6.0 5724 -34.9 5724 -34.9 5724 -33.9 5724 -34.1 Cyrl
Sango 10428 6.1 8644 -1.7 8644 -1.6 8644 -0.2 8644 -0.5 Latn
Ilocano 10429 6.1 10429 18.6 10429 18.7 10429 20.4 10429 20.0 Latn
Occitan (Francoprovençal, Fribourg) 10439 6.2 9226 4.9 9226 5.0 9226 6.5 9226 6.2 Latn
Niue 10444 6.3 10443 18.8 10443 18.8 10443 20.5 10443 20.2 Latn
Comorian, Maore 10458 6.4 10340 17.6 10340 17.7 10340 19.3 10340 19.0 Latn
Chin, Falam 10467 6.5 10467 19.0 10467 19.1 10467 20.8 10467 20.5 Latn
Ibibio 10468 6.5 10467 19.0 10467 19.1 10467 20.8 10467 20.5 Latn
Lingala (tones) 10476 6.6 8990 2.2 8990 2.3 8760 1.1 8760 0.8 Latn
Hebrew 10502 6.9 5822 -33.8 5822 -33.8 5822 -32.8 5822 -33.0 Hebr
Saxon, Low 10539 7.2 10318 17.3 10318 17.4 10318 19.1 10318 18.8 Latn
Venda 10620 8.1 10106 14.9 10106 15.0 10106 16.6 10106 16.3 Latn
Mòoré 10621 8.1 9427 7.2 9427 7.3 9427 8.8 9427 8.5 Latn
Quichua, Chimborazo Highland 10651 8.4 10549 20.0 10549 20.0 10436 20.4 10323 18.8 Latn
Saami, North 10654 8.4 9944 13.1 9944 13.2 9944 14.8 9944 14.5 Latn
Occitan (Francoprovençal, Valais) 10662 8.5 9413 7.0 9413 7.1 9413 8.6 9413 8.3 Latn
Walloon 10714 9.0 9785 11.3 9785 11.3 9785 12.9 9785 12.6 Latn
Hungarian 10718 9.1 9783 11.2 9783 11.3 9783 12.9 9783 12.6 Latn
Nzema 10740 9.3 9439 7.3 9439 7.4 9439 8.9 9439 8.6 Latn
Tsonga (Zimbabwe) 10758 9.5 10546 19.9 10546 20.0 10546 21.7 10546 21.4 Latn
Quechua, North Junín 10765 9.5 10756 22.3 10756 22.4 10756 24.1 10756 23.8 Latn
Hmong, Northern Qiandong 10801 9.9 10801 22.8 10801 22.9 10801 24.7 10801 24.3 Latn
Khasi 10810 10.0 10605 20.6 10605 20.7 10605 22.4 10605 22.1 Latn
K'iche', Central 10817 10.1 10817 23.0 10817 23.1 10817 24.8 10817 24.5 Latn
Javanese (Latin) 10863 10.5 10863 23.5 10863 23.6 10863 25.4 10863 25.0 Latn
Occitan (Francoprovençal, Vaud) 10885 10.8 9757 11.0 9757 11.0 9757 12.6 9757 12.3 Latn
Shuar 10930 11.2 10533 19.8 10533 19.9 10533 21.6 10533 21.2 Latn
Baoulé 10946 11.4 10204 16.0 10204 16.1 10204 17.8 10204 17.4 Latn
Totonac, Papantla 10955 11.5 10955 24.6 10955 24.7 10955 26.4 10955 26.1 Latn
Evenki 10962 11.5 5948 -32.4 5948 -32.3 5776 -33.3 5776 -33.5 Cyrl
Kabuverdianu 10971 11.6 10334 17.5 10334 17.6 10334 19.3 10325 18.8 Latn
Jula 11038 12.3 8719 -0.9 8719 -0.8 8719 0.6 8719 0.4 Latn
Éwé 11107 13.0 9967 13.3 9967 13.4 9950 14.8 9950 14.5 Latn
Asháninka 11167 13.6 11164 27.0 11164 27.0 11164 28.8 11164 28.5 Latn
Hmong Njua 11179 13.8 11179 27.1 11179 27.2 11179 29.0 11179 28.7 Latn
Mbundu (009) 11200 14.0 11133 26.6 11133 26.7 11133 28.5 11133 28.1 Latn
Arabic, Standard 11214 14.1 6183 -29.7 6183 -29.6 6166 -28.8 6166 -29.0 Arab
Samoan 11231 14.3 11231 27.7 11231 27.8 11231 29.6 11231 29.3 Latn
Quechua, Margos-Yarowilca-Lauricocha 11260 14.6 11108 26.3 11108 26.4 11108 28.2 11108 27.9 Latn
Achuar-Shiwiar 11299 15.0 11296 28.5 11296 28.5 11296 30.4 11296 30.0 Latn
Tojolabal 11465 16.7 10173 15.7 10173 15.8 10173 17.4 10173 17.1 Latn
Bushi 11487 16.9 10980 24.9 10980 24.9 10980 26.7 10980 26.4 Latn
Osetin 11528 17.3 6370 -27.6 6370 -27.5 6370 -26.5 6370 -26.7 Cyrl
Tzotzil (Chamula) 11558 17.6 10703 21.7 10703 21.8 10703 23.5 10703 23.2 Latn
Rarotongan 11562 17.7 11527 31.1 11527 31.2 11527 33.0 11527 32.7 Latn
Maya, Yucatán 11732 19.4 10675 21.4 10675 21.5 10675 23.2 10675 22.9 Latn
Quechua, Northern Conchucos Ancash 11786 19.9 11782 34.0 11782 34.1 11782 36.0 11782 35.6 Latn
Yanomamö 11913 21.2 10470 19.1 10470 19.1 10470 20.8 10470 20.5 Latn
Aguaruna 11918 21.3 11854 34.8 11854 34.9 11854 36.8 11854 36.4 Latn
Hausa (Niger) 12078 22.9 11831 34.5 11831 34.6 11831 36.5 11831 36.2 Latn
Hausa (Nigeria) 12078 22.9 11863 34.9 11863 35.0 11863 36.9 11863 36.5 Latn
Vietnamese 12182 24.0 8877 0.9 8877 1.0 8877 2.4 8877 2.2 Latn
Chin, Haka 12231 24.5 12231 39.1 12231 39.2 12231 41.2 12231 40.8 Latn
Quechua, Ambo-Pasco 12327 25.4 12181 38.5 12181 38.6 12181 40.6 12181 40.2 Latn
Cashibo-Cacataibo 12349 25.7 11514 30.9 11514 31.0 11514 32.9 11514 32.5 Latn
Tem 12418 26.4 8878 1.0 8878 1.0 8246 -4.8 8246 -5.1 Latn
Ojibwa, Northwestern 12419 26.4 4775 -45.7 4775 -45.7 4775 -44.9 4775 -45.0 Cans
Pidgin, Nigerian 12424 26.4 12424 41.3 12424 41.4 12424 43.4 12424 43.0 Latn
Tahitian 12449 26.7 12244 39.2 12244 39.3 12244 41.3 12244 40.9 Latn
Amahuaca 12530 27.5 12530 42.5 12530 42.6 12530 44.6 12530 44.2 Latn
Lobi 12645 28.7 10435 18.7 10435 18.7 10435 20.4 10435 20.1 Latn
Cree, Swampy 12705 29.3 4849 -44.9 4849 -44.8 4849 -44.0 4849 -44.2 Cans
Navajo 12835 30.6 9981 13.5 9981 13.6 9803 13.1 9803 12.8 Latn
Quechua, South Bolivian 12924 31.5 12902 46.7 12902 46.8 12902 48.9 12902 48.5 Latn
Kaqchikel, Central 12943 31.7 12616 43.5 12616 43.6 12616 45.6 12616 45.2 Latn
Maori 12994 32.2 12993 47.7 12993 47.8 12993 49.9 12993 49.6 Latn
Seraiki 13020 32.5 7303 -17.0 7303 -16.9 7302 -15.7 7302 -16.0 Arab
Ticuna 13137 33.7 10508 19.5 10508 19.6 9886 14.1 9886 13.8 Latn
Arabela 13256 34.9 13255 50.7 13255 50.8 13255 53.0 13255 52.6 Latn
Swati 13372 36.1 13320 51.5 13320 51.6 13320 53.7 13320 53.3 Latn
Komi-Permyak 13499 37.4 7378 -16.1 7378 -16.0 7378 -14.9 7378 -15.1 Cyrl
Farsi, Western 13597 38.4 7537 -14.3 7537 -14.2 7460 -13.9 7460 -14.1 Arab
Yukaghir, Northern 13618 38.6 7366 -16.2 7366 -16.2 7366 -15.0 7366 -15.2 Cyrl
Dari 13669 39.1 7607 -13.5 7607 -13.4 7561 -12.7 7561 -13.0 Arab
Pintupi-Luritja 13736 39.8 13736 56.2 13736 56.3 13736 58.5 13736 58.1 Latn
Urdu 13859 41.0 7768 -11.7 7768 -11.6 7733 -10.8 7733 -11.0 Arab
Panjabi, Western 13996 42.4 7904 -10.1 7904 -10.1 7893 -8.9 7893 -9.2 Arab
Tongan 14017 42.6 12453 41.6 12453 41.7 12453 43.7 12453 43.3 Latn
Yoruba 14059 43.1 10238 16.4 10238 16.5 9276 7.1 9276 6.8 Latn
Inuktitut, Greenlandic 14067 43.1 14067 60.0 14067 60.1 14067 62.3 14067 61.9 Latn
Serbian (Cyrillic) 14090 43.4 7740 -12.0 7740 -11.9 7740 -10.7 7740 -10.9 Cyrl
Urdu (2) 14108 43.6 7904 -10.1 7904 -10.1 7868 -9.2 7868 -9.4 Arab
Nanai 14148 44.0 7666 -12.8 7666 -12.8 7636 -11.9 7636 -12.1 Cyrl
Caquinte 14250 45.0 14246 62.0 14246 62.1 14246 64.4 14246 64.0 Latn
Tigrigna 14270 45.2 5502 -37.4 5502 -37.4 5502 -36.5 5502 -36.7 Ethi
Bosnian (Cyrillic) 14404 46.6 7906 -10.1 7906 -10.0 7906 -8.8 7906 -9.0 Cyrl
Malay (Arabic) 14410 46.6 7899 -10.2 7899 -10.1 7899 -8.8 7899 -9.1 Arab
Konjo 14620 48.8 14620 66.2 14620 66.4 14620 68.7 14620 68.3 Latn
Pashto, Northern 14727 49.9 8276 -5.9 8276 -5.8 8274 -4.5 8274 -4.8 Arab
Bora 14934 52.0 11819 34.4 11819 34.5 11659 34.6 11659 34.2 Latn
Quechua, Huamalíes-Dos de Mayo Huánuco 14973 52.4 14772 68.0 14772 68.1 14772 70.5 14772 70.0 Latn
Toba 15250 55.2 14672 66.8 14672 67.0 14672 69.3 14672 68.9 Latn
Nahuatl, Central 15460 57.3 15457 75.8 15457 75.9 15457 78.4 15457 77.9 Latn
Vai 15555 58.3 6931 -21.2 6931 -21.1 6931 -20.0 6931 -20.2 Vaii
Tatar 15601 58.8 8493 -3.4 8493 -3.4 8493 -2.0 8493 -2.2 Cyrl
Tajiki 15606 58.8 8594 -2.3 8594 -2.2 8594 -0.8 8594 -1.1 Cyrl
Macedonian 15843 61.2 8704 -1.0 8704 -1.0 8704 0.5 8704 0.2 Cyrl
Ukrainian 16109 63.9 8785 -0.1 8785 -0.0 8785 1.4 8785 1.1 Cyrl
Azerbaijani, North (Cyrillic) 16117 64.0 8733 -0.7 8733 -0.6 8733 0.8 8733 0.5 Cyrl
Orok 16118 64.0 8696 -1.1 8696 -1.0 8251 -4.8 8251 -5.0 Cyrl
Amharic 16144 64.3 5382 -38.8 5382 -38.8 5382 -37.9 5382 -38.1 Ethi
Kazakh 16273 65.6 8791 -0.0 8791 0.0 8791 1.5 8791 1.2 Cyrl
Mongolian, Halh (Cyrillic) 16295 65.8 8837 0.5 8837 0.6 8837 2.0 8837 1.7 Cyrl
Tamazight, Standard Moroccan 16301 65.9 6371 -27.6 6371 -27.5 6371 -26.5 6371 -26.7 Tfng
Turkmen (Cyrillic) 16438 67.3 8826 0.4 8826 0.4 8826 1.9 8826 1.6 Cyrl
Altai, Southern 16508 68.0 8865 0.8 8865 0.9 8865 2.3 8865 2.0 Cyrl
Shipibo-Conibo 16674 69.7 16391 86.4 16391 86.5 16391 89.2 16391 88.7 Latn
Bulgarian 16844 71.4 9228 4.9 9228 5.0 9228 6.5 9228 6.2 Cyrl
Armenian 16853 71.5 9038 2.8 9038 2.8 9038 4.3 9038 4.0 Armn
Chachi 17042 73.4 16912 92.3 16912 92.4 16911 95.2 16910 94.6 Latn
Belarusan 17117 74.2 9307 5.8 9307 5.9 9307 7.4 9307 7.1 Cyrl
Tai Dam 17301 76.1 7181 -18.3 7181 -18.3 6423 -25.9 6423 -26.1 Tavt
Abkhaz 17318 76.2 9280 5.5 9280 5.6 9280 7.1 9280 6.8 Cyrl
Yaneshaʼ 17336 76.4 15851 80.2 15851 80.4 15238 75.9 15238 75.4 Latn
Uzbek, Northern (Cyrillic) 17394 77.0 9364 6.5 9364 6.6 9364 8.1 9364 7.8 Cyrl
Adyghe 17432 77.4 9483 7.8 9483 7.9 9483 9.4 9483 9.2 Cyrl
Kirghiz 17490 78.0 9390 6.8 9390 6.9 9390 8.4 9390 8.1 Cyrl
Nganasan 17527 78.4 9336 6.2 9336 6.2 9336 7.7 9336 7.5 Cyrl
Yiddish, Eastern 17589 79.0 9593 9.1 9593 9.2 8621 -0.5 8621 -0.8 Hebr
Yakut 17615 79.3 9470 7.7 9470 7.8 9470 9.3 9470 9.0 Cyrl
Khakas 17616 79.3 9554 8.6 9554 8.7 9554 10.3 9554 10.0 Cyrl
Tuva 17717 80.3 9572 8.8 9572 8.9 9572 10.5 9572 10.2 Cyrl
Russian 17750 80.6 9605 9.2 9605 9.3 9605 10.8 9605 10.6 Cyrl
Matsés 17788 81.0 17336 97.1 17336 97.3 17336 100.1 17336 99.5 Latn
Kabardian 17879 81.9 9633 9.5 9633 9.6 9633 11.2 9633 10.9 Cyrl
Inuktitut, Eastern Canadian 17910 82.3 6456 -26.6 6456 -26.5 6456 -25.5 6456 -25.7 Cans
Magahi 17920 82.4 6950 -21.0 6950 -20.9 5090 -41.3 6052 -30.3 Deva
Uyghur (Arabic) 18323 86.5 9826 11.7 9826 11.8 9826 13.4 9826 13.1 Arab
Greek (monotonic) 18324 86.5 10017 13.9 10017 14.0 10017 15.6 10017 15.3 Grek
Cherokee (cased) 18759 90.9 7245 -17.6 7245 -17.6 7245 -16.4 7245 -16.6 Cher
Cherokee (uppercase) 18759 90.9 7245 -17.6 7245 -17.6 7245 -16.4 7245 -16.6 Cher
Bhojpuri 18930 92.6 7294 -17.1 7294 -17.0 5217 -39.8 6274 -27.8 Deva
Greek (polytonic) 19555 99.0 10039 14.2 10039 14.2 10039 15.9 10039 15.6 Grek
Maithili 20047 104.0 7500 -14.7 7500 -14.7 5382 -37.9 6435 -25.9 Deva
Nepali 20816 111.8 7720 -12.2 7720 -12.2 5338 -38.4 6615 -23.9 Deva
Bengali 21349 117.2 7871 -10.5 7871 -10.4 5318 -38.6 7061 -18.7 Beng
Thai (2) 21694 120.8 7390 -16.0 7390 -15.9 5896 -32.0 5950 -31.5 Thai
Thai 21873 122.6 7479 -15.0 7479 -14.9 5992 -30.8 6043 -30.4 Thai
Gujarati 21890 122.8 8184 -6.9 8184 -6.9 5586 -35.5 6978 -19.7 Gujr
Ashéninka, Pichis 22298 126.9 22163 152.0 22163 152.2 22163 155.8 22163 155.1 Latn
Panjabi, Eastern 22584 129.8 8788 -0.1 8788 0.0 6181 -28.7 7470 -14.0 Guru
Sanskrit 22717 131.2 8171 -7.1 8171 -7.0 5186 -40.2 6544 -24.7 Deva
Sinhala 22785 131.9 8519 -3.1 8519 -3.1 6061 -30.1 6853 -21.1 Sinh
Khün 23411 138.2 8047 -8.5 8047 -8.4 4655 -46.3 5140 -40.8 Lana
Kannada 23429 138.4 8463 -3.8 8463 -3.7 5580 -35.6 6989 -19.6 Knda
Hindi 23466 138.8 8962 1.9 8962 2.0 6187 -28.6 7632 -12.2 Deva
Lao 24128 145.5 8340 -5.2 8340 -5.1 6365 -26.5 6447 -25.8 Laoo
Telugu 24993 154.3 9145 4.0 9145 4.1 6027 -30.4 7156 -17.6 Telu
Khmer, Central 25053 154.9 8619 -2.0 8619 -1.9 5511 -36.4 6791 -21.8 Khmr
Malayalam 25115 155.6 8907 1.3 8907 1.4 5286 -39.0 6762 -22.2 Mlym
Marathi 25231 156.8 9345 6.3 9345 6.3 6241 -28.0 7939 -8.6 Deva
Javanese (Javanese) 26155 166.2 8741 -0.6 8741 -0.5 5207 -39.9 6786 -21.9 Java
Georgian 26534 170.0 9742 10.8 9742 10.9 9742 12.4 9742 12.1 Geor
Chakma 27301 177.8 14231 61.8 7696 -12.4 4883 -43.6 5313 -38.8 Cakm
Pular (Adlam) 28460 189.6 14951 70.0 8233 -6.3 7435 -14.2 7435 -14.4 Adlm
Maldivian 28469 189.7 15030 70.9 15030 71.0 8449 -2.5 8449 -2.8 Thaa
Dzongkha 28504 190.1 9650 9.7 9650 9.8 7620 -12.1 7620 -12.3 Tibt
Mon 28674 191.8 10016 13.9 10016 14.0 5751 -33.6 6233 -28.3 Mymr
Sanskrit (Grantha) 29914 204.4 15418 75.3 8241 -6.2 5244 -39.5 8173 -5.9 Gran
Tamil 30208 207.4 10824 23.1 10824 23.2 6894 -20.4 9273 6.7 Taml
Tamil (Sri Lanka) 30213 207.4 10825 23.1 10825 23.2 6893 -20.5 9275 6.8 Taml
Tibetan, Central 30411 209.5 10243 16.5 10243 16.6 7958 -8.2 7958 -8.4 Tibt
Burmese 35846 264.8 12572 43.0 12572 43.1 7695 -11.2 8630 -0.7 Mymr
Shan 36130 267.7 12550 42.7 12550 42.8 8327 -3.9 8604 -1.0 Mymr
Min 4170 -57.6 2202 -75.0 2202 -74.9 2202 -74.6 4007 -53.9
Median 9827 8794 8788 8665 8688
Mean 11315 15.1 8833 0.4 8787 -0.0 8567 -1.1 8700 0.1
Max (ignoring outlier) 35846 264.8 17336 97.1 17336 97.3 17336 100.1 17336 99.5
Max 36130 267.7 22163 152.0 22163 152.2 22163 155.8 22163 155.1
