utf8

This library provides basic support for UTF-8 encoding. It does not provide any support for Unicode beyond handling the encoding itself. Any operation that needs the meaning of a character, such as character classification, is outside its scope.

Unless stated otherwise, all functions that expect a byte position as a parameter assume that the given position is either the start of a byte sequence or one plus the length of the subject string. As in the string library, negative indices count from the end of the string.

You can find a large catalog of usable UTF-8 characters here.

Summary

Functions

char(codepoints : Tuple<number>):string
Converts zero or more codepoints to UTF-8 byte sequences.
codes(str : string):function,string,number
Returns an iterator function that iterates over all codepoints in a given string.
codepoint(str : string,i : number,j : number):Tuple<number>
Returns the codepoints (as integers) of all characters in a given string that start between byte positions i and j.
len(s : string,i : number,j : number):number
Returns the number of UTF-8 codepoints in a given string.
offset(s : string,n : number,i : number):number?
Returns the position (in bytes) where the encoding of the n‑th codepoint of s (counting from byte position i) starts.
graphemes(str : string,i : number,j : number):function
Returns an iterator function that iterates over the grapheme clusters of a given string.
nfcnormalize(str : string):string
Converts the input string to Normal Form C.
nfdnormalize(str : string):string
Converts the input string to Normal Form D.

Properties

charpattern:string
The pattern "[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*", which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string.

Functions

char

string

Receives zero or more codepoints as integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.

Parameters

codepoints: Tuple<number>

Returns

string
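
As a brief illustration of the behavior described above, the following sketch builds a string from four codepoints of increasing encoded width. The specific codepoints are chosen only for the example.

```lua
-- Codepoints: "H" (72), "ä" (228), "€" (8364), "😀" (128512)
local s = utf8.char(72, 228, 8364, 128512)

print(s)             -- Hä€😀
print(#s)            -- 10: the byte length is 1 + 2 + 3 + 4
print(#utf8.char())  -- 0: calling with no codepoints yields an empty string
```

Note that `#s` counts bytes, not characters, which is why the result is 10 rather than 4.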

codes

function

Returns an iterator function so that the construction:


for position, codepoint in utf8.codes(str) do
	-- body
end

will iterate over all codepoints in string str. It raises an error if it meets any invalid byte sequence.

Parameters

str: string

The string to iterate over.

Returns

function

string

number
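
A small sketch of the loop described above, using a sample string built with utf8.char. The variable names (`seen`, `ok`) are only illustrative. It also shows the error case by wrapping the iteration in pcall.

```lua
local str = utf8.char(97, 233, 8364) -- "aé€": 1-, 2-, and 3-byte sequences

-- Record each byte position together with the codepoint that starts there.
local seen = {}
for position, codepoint in utf8.codes(str) do
	seen[#seen + 1] = position .. ":" .. codepoint
end
print(table.concat(seen, " ")) -- 1:97 2:233 4:8364

-- The byte 0xFF never appears in valid UTF-8, so iterating raises an error.
local ok = pcall(function()
	for _ in utf8.codes("\xFF") do end
end)
print(ok) -- false
```

The positions jump from 2 to 4 because "é" occupies two bytes.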

codepoint

Tuple<number>

Returns the codepoints (as integers) of all characters in the provided string (str) that start between byte positions i and j (both inclusive). The default for i is 1 and for j is i. It raises an error if it meets any invalid byte sequence.

Parameters

str: string

i: number

The byte position at which the first codepoint to fetch starts.

Default Value: 1

j: number

The last byte position at which a returned codepoint may start. If omitted, this defaults to the value of i, so a single codepoint is returned.

Default Value: i

Returns

Tuple<number>
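
The following sketch shows how i and j act as byte positions rather than character indices. The sample string is an assumption made for the example.

```lua
local str = utf8.char(97, 233, 8364) -- "aé€"; bytes: a = 1, é = 2..3, € = 4..6

print(utf8.codepoint(str))        -- 97 (i defaults to 1, j defaults to i)
print(utf8.codepoint(str, 2))     -- 233
print(utf8.codepoint(str, 1, -1)) -- 97 233 8364

-- Byte 3 is the middle of "é", not the start of a sequence, so this raises an error.
print(pcall(utf8.codepoint, str, 3)) -- false, followed by an error message
```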

len

number

Returns the number of UTF-8 codepoints in the string s that start between byte positions i and j (both inclusive). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, it returns nil plus the position of the first invalid byte.

Parameters

s: string

i: number

The starting position.

Default Value: 1

j: number

The ending position.

Default Value: -1

Returns

number
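
To illustrate the difference between byte length and codepoint count, and the two-value result on invalid input, here is a minimal sketch with an assumed sample string.

```lua
local str = utf8.char(97, 233, 8364) -- "aé€"

print(#str)             -- 6 (bytes)
print(utf8.len(str))    -- 3 (codepoints)
print(utf8.len(str, 2)) -- 2 (counting starts at byte 2, the start of "é")

-- On invalid input, the result is nil plus the position of the first bad byte.
local count, badPosition = utf8.len("ab\xFFcd")
print(count, badPosition) -- nil 3
```

Because the failure case returns nil rather than raising an error, utf8.len can double as a validity check for untrusted input.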

offset

number?

Returns the position (in bytes) where the encoding of the n‑th codepoint of s (counting from byte position i) starts. A negative n gets characters before position i. The default for i is 1 when n is non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the offset of the n‑th character from the end of the string. If the specified character is neither in the subject nor right after its end, the function returns nil.

Parameters

s: string

n: number

i: number

Default Value: 1 when n is non-negative, #s + 1 otherwise

Returns

number?
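
The sketch below walks through several values of n on an assumed sample string, including the position just past the end and the nil case. The final lines show one common use: slicing out the n-th character with string.sub.

```lua
local str = utf8.char(97, 233, 8364) -- "aé€"; bytes: a = 1, é = 2..3, € = 4..6

print(utf8.offset(str, 1))  -- 1
print(utf8.offset(str, 2))  -- 2
print(utf8.offset(str, 3))  -- 4
print(utf8.offset(str, 4))  -- 7 (right after the end of the string)
print(utf8.offset(str, 5))  -- nil (no such character)
print(utf8.offset(str, -1)) -- 4 (start of the last character)

-- Extract the second character by slicing between two consecutive offsets.
local first = utf8.offset(str, 2)
local nextStart = utf8.offset(str, 3)
local second = string.sub(str, first, nextStart - 1)
print(second == utf8.char(233)) -- true
```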

graphemes

function

Returns an iterator function so that


for first, last in utf8.graphemes(str) do
	local grapheme = str:sub(first, last)
	-- body
end

will iterate the grapheme clusters of the string.
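
As a rough sketch of why grapheme clusters differ from codepoints, the example below builds "é" from a base letter plus a combining accent. It assumes that first and last are byte positions, as the str:sub call in the loop above implies. The call is guarded because utf8.graphemes is not part of the standard Lua utf8 library, so the snippet does nothing on runtimes that lack it.

```lua
-- "e" followed by U+0301 (combining acute accent), then "a"
local str = "e" .. utf8.char(0x301) .. "a"

print(utf8.len(str)) -- 3 codepoints

if utf8.graphemes then
	local clusters = {}
	for first, last in utf8.graphemes(str) do
		clusters[#clusters + 1] = str:sub(first, last)
	end
	-- The base letter and its combining accent should form a single cluster,
	-- so fewer clusters than codepoints are expected here.
	print(#clusters)
end
```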

Parameters

str: string

i: number

j: number

Returns

function

nfcnormalize

string

Converts the input string to Normal Form C, which tries to convert decomposed characters into composed characters.

Parameters

str: string

Returns

string

nfdnormalize

string

Converts the input string to Normal Form D, which tries to break up composed characters into decomposed characters.

Parameters

str: string

The string to convert.

Returns

string
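
The two normal forms can be compared with a short sketch. It relies on the standard Unicode decomposition of "é" (U+00E9) into "e" plus U+0301. The calls are guarded because nfcnormalize and nfdnormalize are not part of the standard Lua utf8 library.

```lua
local composed = utf8.char(0xE9)          -- "é" as a single codepoint
local decomposed = utf8.char(0x65, 0x301) -- "e" + combining acute accent

-- The two strings render the same but differ byte for byte.
print(composed == decomposed) -- false
print(#composed, #decomposed) -- 2 3

if utf8.nfcnormalize and utf8.nfdnormalize then
	-- Normalizing both sides to the same form makes them comparable.
	print(utf8.nfcnormalize(decomposed) == composed) -- true
	print(utf8.nfdnormalize(composed) == decomposed) -- true
end
```

Normalizing before comparison is one way to treat visually identical user input as equal.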

Properties

charpattern

string

The pattern "[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*", which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string.
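
One way to use this pattern is with string.gmatch, to split a valid UTF-8 string into its individual characters. This is a minimal sketch with an assumed sample string.

```lua
local str = utf8.char(97, 233, 8364) -- "aé€"

-- Each match is the full byte sequence of one character.
local chars = {}
for ch in string.gmatch(str, utf8.charpattern) do
	chars[#chars + 1] = ch
end

print(#chars)     -- 3
print(#chars[3])  -- 3: "€" is a three-byte sequence
```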
