Strings, RegExp and Template Literals

Strings, Regular Expressions and Template Literals – Part 1 | Understanding ES6

Strings are undoubtedly one of the most important data types in any programming language.

Strings appear in almost every programming language, and learning to use them effectively is a basic necessity for every developer. To work with strings effectively, a developer also needs to understand regular expressions, because they provide much of the power to search and manipulate strings. With ECMAScript 6, strings and regular expressions gain new features, including functionality that other programming languages have long had.

In this post I will go through some of the new string features and methods in ES6:

UTF-16 Code Points

Until ECMAScript 6, JavaScript strings were based on 16-bit character encoding. All string properties and methods, like the length property and the charAt() method, worked on these 16-bit code units. 16 bits used to be enough to contain any character, but that is no longer true now that Unicode has expanded its character set beyond 65,536 code points.

The first 2^16 (65,536) code points in UTF-16 are represented as single 16-bit code units. This range is called the Basic Multilingual Plane (BMP). Everything beyond it belongs to one of the supplementary planes, where code points can no longer be represented in just 16 bits. To solve this problem, UTF-16 introduced surrogate pairs, in which a single code point is represented by two 16-bit code units. That means any single character in a string can be either one code unit for BMP characters, giving a total of 16 bits, or two code units for supplementary plane characters, giving a total of 32 bits.

Because all string operations in ECMAScript 5 work on 16-bit code units, you may get unexpected results from UTF-16 encoded strings that contain surrogate pairs:
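The arithmetic behind a surrogate pair can be sketched as follows. The helper name toSurrogatePair is made up for illustration and is not part of any built-in API; it only handles code points above the BMP.

```javascript
// Hypothetical helper: split a supplementary-plane code point into its
// UTF-16 surrogate pair (high code unit first, low code unit second).
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError("not a supplementary-plane code point");
    }
    var offset = codePoint - 0x10000;       // leaves a 20-bit value
    var high = 0xD800 + (offset >> 10);     // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
    return [high, low];
}

// U+20BB7 is the character 𠮷
console.log(toSurrogatePair(0x20BB7));      // [55362, 57271]
```

The two numbers are the code units that charCodeAt() reports for the character 𠮷.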

var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(text.charAt(0));        // "\uD842" (lone high surrogate, not printable)
console.log(text.charAt(1));        // "\uDFB7" (lone low surrogate, not printable)
console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271

The single Unicode character 𠮷 is represented using a surrogate pair, so the JavaScript string operations treat it as two 16-bit characters. That means:

  • The length of text is 2, when it should be 1.
  • A regular expression that tries to match a single character fails, because it sees two characters.
  • The charAt() method is unable to return a valid character string, because neither set of 16 bits corresponds to a printable character.
  • The charCodeAt() method can’t identify the character properly either; it just returns the 16-bit number for each code unit.

ES6, on the other hand, enforces UTF-16 string encoding to address these types of problems. Standardizing string operations on this character encoding means that JavaScript can support functionality designed to work specifically with surrogate pairs.
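One example of such functionality is the ES6 string iterator, which walks a string by code point rather than by code unit. The sketch below uses it to count characters; codePointLength is a hypothetical helper name, not a built-in.

```javascript
var text = "𠮷a";

// Hypothetical helper: count code points instead of code units.
function codePointLength(str) {
    var count = 0;
    for (var ch of str) {   // the string iterator yields whole code points
        count++;
    }
    return count;
}

console.log(text.length);               // 3
console.log(codePointLength(text));     // 2
```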

The codePointAt() Method

To fully support UTF-16, ES6 adds a new method named codePointAt(). This method retrieves the Unicode code point that maps to a given position in a string. It accepts a code unit position rather than a character position and returns an integer.

var text = "𠮷a";

console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271
console.log(text.charCodeAt(2));    // 97

console.log(text.codePointAt(0));   // 134071
console.log(text.codePointAt(1));   // 57271
console.log(text.codePointAt(2));   // 97

The codePointAt() method returns the same value as charCodeAt() unless it operates on a non-BMP character. Here the first character in text is non-BMP and is made up of two code units, which is why the length property is 3 rather than 2. At position 0, charCodeAt() returns only the first code unit, whereas codePointAt() returns the full code point even though it spans two code units. Both methods return the same value for positions 1 and 2.

You can also write a function that determines whether a character is represented by one code unit or two:

function is32Bit(c) {
    return c.codePointAt(0) > 0xFFFF;
}

console.log(is32Bit("𠮷"));         // true
console.log(is32Bit("a"));          // false

Note: 0xFFFF is the hexadecimal upper bound of what a single 16-bit code unit can hold, so any code point greater than it must be represented by two code units.
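The same check can be used to walk a whole string by code point. This is a minimal sketch; toCodePoints is a hypothetical helper name introduced here for illustration.

```javascript
// Hypothetical helper: collect every code point in a string.
function toCodePoints(str) {
    var result = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        result.push(cp);
        if (cp > 0xFFFF) {
            i++;    // skip the low surrogate that belongs to this code point
        }
    }
    return result;
}

console.log(toCodePoints("𠮷a"));   // [134071, 97]
```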

The String.fromCodePoint() Method

The String.fromCodePoint() method is the opposite of codePointAt(). Where codePointAt() returns the code point for a given position in a string, String.fromCodePoint() produces a single-character string from a given code point.

console.log(String.fromCodePoint(134071));  // "𠮷"

ECMAScript 5 already had the String.fromCharCode() method, which works fine for BMP characters. String.fromCodePoint() gives the same result for those characters; the two differ only when a non-BMP code point is passed.
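The difference between the two methods can be seen in a short comparison, using the same character as in the earlier examples:

```javascript
var text = "𠮷";

// Round trip: string -> code point -> string
console.log(String.fromCodePoint(text.codePointAt(0)) === text);    // true

// fromCharCode() keeps only the low 16 bits of each argument,
// so a non-BMP code point produces a different character
console.log(String.fromCharCode(134071) === text);                  // false

// Passing both surrogate code units explicitly does work
console.log(String.fromCharCode(55362, 57271) === text);            // true
```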

The normalize() Method

In Unicode, different sequences of characters may be considered equivalent for the purpose of sorting or other comparison-based operations. There are two ways to define this relationship: 1) canonical equivalence and 2) compatibility.

  • Canonical equivalence means two sequences of code points are considered interchangeable in all respects.
  • Compatibility means two sequences of code points look different but can be used interchangeably in specific cases.

For example, two strings representing the same text may contain different code point sequences: "é" can be stored as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301. The first is a one-character string and the second is a two-character string. They render identically, but they do not compare as equal unless both are normalized to the same form.

ES6 supports Unicode normalization by giving strings a normalize() method. This method optionally accepts a single string parameter indicating which of the following Unicode normalization forms to apply:

  • Normalization Form Canonical Composition ("NFC"), which is the default
  • Normalization Form Canonical Decomposition ("NFD")
  • Normalization Form Compatibility Composition ("NFKC")
  • Normalization Form Compatibility Decomposition ("NFKD")
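The effect of these forms can be seen in a small sketch. The sample characters (an accented "é" and the "ﬁ" ligature, U+FB01) are chosen here purely for illustration.

```javascript
var composed = "\u00E9";        // "é" as a single code point
var decomposed = "e\u0301";     // "e" followed by a combining acute accent

console.log(composed === decomposed);                                       // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC"));     // true
console.log(composed.normalize("NFD").length);                              // 2

// The compatibility forms also fold characters such as the "ﬁ" ligature
var ligature = "\uFB01";
console.log(ligature.normalize("NFC") === "fi");                            // false
console.log(ligature.normalize("NFKC") === "fi");                           // true
```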

You can read more about these forms on the Unicode website.
The one thing to keep in mind is that when comparing strings, both strings must be normalized to the same form.

var normalized = values.map(function(text) {
    return text.normalize();
});

normalized.sort(function(first, second) {
    if (first < second) {
        return -1;
    } else if (first === second) {
        return 0;
    } else {
        return 1;
    }
});

The code above converts the strings in the values array into a normalized form so that the array can be sorted appropriately. Here the strings are normalized up front with the default form (NFC), but you can also normalize inside the comparator and use any of the other forms:

values.sort(function(first, second) {
    var firstNormalized = first.normalize("NFD"),
        secondNormalized = second.normalize("NFD");

    if (firstNormalized < secondNormalized) {
        return -1;
    } else if (firstNormalized === secondNormalized) {
        return 0;
    } else {
        return 1;
    }
});
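To see the difference normalization makes, here is a self-contained run. The sample array and the compareNormalized helper are assumptions made up for this illustration.

```javascript
// Sample data (made up for illustration): two spellings of "é" plus ASCII
var sample = ["z", "e\u0301", "a", "\u00E9"];

// Hypothetical comparator: compare two strings after NFC normalization
function compareNormalized(first, second) {
    var a = first.normalize(),
        b = second.normalize();

    if (a < b) {
        return -1;
    } else if (a === b) {
        return 0;
    } else {
        return 1;
    }
}

var sortedPlain = sample.slice().sort();
var sortedNormalized = sample.slice().sort(compareNormalized);

// Without normalization, "z" ends up between the two spellings of "é"
console.log(sortedPlain.indexOf("z"));          // 2
// With normalization, both spellings sort together after "z"
console.log(sortedNormalized.indexOf("z"));     // 1
```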

The normalize() method is very useful when comparing strings, but you probably won't need it unless you work on an internationalized application.

Identify Substrings

Up to ECMAScript 5, the indexOf() method was the usual way to find a substring within a string. ECMAScript 6 adds three new methods that simplify finding substrings:
1) includes(): Returns true if the given text is found anywhere in the string, otherwise false.
2) startsWith(): Returns true if the given text is found at the start of the string, otherwise false.
3) endsWith(): Returns true if the given text is found at the end of the string, otherwise false.

Each of the three methods accepts two arguments: the text to search for and an optional index. For startsWith() and includes(), the search begins at that index. For endsWith(), the index is treated as the length of the string, so the match must end just before that position. If the second argument is omitted, includes() and startsWith() search from the beginning of the string, and endsWith() matches against the end of the whole string.

var msg = "Hello world!";

console.log(msg.startsWith("Hello"));       // true
console.log(msg.endsWith("!"));             // true
console.log(msg.includes("o"));             // true

console.log(msg.startsWith("o"));           // false
console.log(msg.endsWith("world!"));        // true
console.log(msg.includes("x"));             // false

console.log(msg.startsWith("o", 4));        // true
console.log(msg.endsWith("o", 8));          // true
console.log(msg.includes("o", 8));          // false

As you can see from the examples above, all three methods return a boolean result. If you need the actual position of the text within the string, use the indexOf() and lastIndexOf() methods instead.
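For comparison, here is how the position-based methods behave on the same message. The sketch also shows one difference worth knowing: the three new methods throw a TypeError when given a regular expression, whereas indexOf() converts its argument to a string.

```javascript
var msg = "Hello world!";

console.log(msg.indexOf("o"));          // 4
console.log(msg.lastIndexOf("o"));      // 7
console.log(msg.indexOf("x"));          // -1

// includes(), startsWith() and endsWith() reject regular expressions
try {
    msg.includes(/o/);
} catch (e) {
    console.log(e instanceof TypeError);    // true
}

// indexOf() converts the regular expression to the text "/o/" and searches for that
console.log(msg.indexOf(/o/));          // -1
```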

The repeat() Method

ECMAScript 6 also adds a repeat() method to strings, which accepts a number as its parameter. It returns a new string containing the original string repeated that number of times.

console.log("Hello".repeat(2));             // "HelloHello"
console.log("World".repeat(3));             // "WorldWorldWorld"
console.log("Hello World".repeat(3));       // "Hello WorldHello WorldHello World"

This method is especially handy when manipulating text, for example when generating indentation:

// indent using a specified number of spaces
var indent = " ".repeat(4),
    indentLevel = 0;

// whenever you increase the indent
var newIndent = indent.repeat(++indentLevel);

The first repeat() call creates a string of four spaces, and the indentLevel variable keeps track of the indent level. You can then call repeat() with an incremented indentLevel to change the number of spaces.

That covers most of the new string features. I will explain regular expressions and template literals in the next two posts.
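Building on that idea, a small helper can indent every line of a block of text. This is only a sketch: indentLines is a hypothetical name, and the four-space indent width is an assumption carried over from the snippet above.

```javascript
// Hypothetical helper: prefix every line of text with `level` indents
function indentLines(text, level) {
    var prefix = " ".repeat(4).repeat(level);   // four spaces per level
    return text.split("\n").map(function(line) {
        return prefix + line;
    }).join("\n");
}

// Each line gains four leading spaces
console.log(indentLines("first\nsecond", 1));

// A count of 0 yields an empty string
console.log("abc".repeat(0) === "");            // true
```

Note that repeat() throws a RangeError when the count is negative, so a helper like this should only be called with a level of 0 or more.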

Web/ UI & Front-end developer based in Ahmedabad, GJ, India. Here to help/ discuss community to spread web awareness.
