mcfonts.utils.unicode

Functions for working with Unicode characters, codepoints, and surrogate pairs.

Module Contents

Functions

character_to_surrogates(character)

Convert one character into 2 integers of its UTF-16 surrogate codepoints.

is_character_invisible(character)

Return if character would be invisible (might not have glyph info).

is_character_private_use(character)

Return if character is in a Private Use Area (PUA).

is_codepoint_surrogate(codepoint)

Return if a codepoint is part of a high or low surrogate pair.

pretty_print_character(character)

Return relevant info about a character as a string, following U+<codepoint>: <name> <character>.

str_to_tags(string)

Return a version of string with all alphanumeric characters changed into Tags.

surrogates_to_character(surrogates)

Given a tuple of two surrogate codepoints, return the single character they combine to.

Attributes

INVISIBLE_CHARACTERS

A set of characters that do not have a visual representation under most fonts.

mcfonts.utils.unicode.INVISIBLE_CHARACTERS

A set of characters that do not have a visual representation under most fonts.

mcfonts.utils.unicode.character_to_surrogates(character)

Convert one character into 2 integers of its UTF-16 surrogate codepoints.

A surrogate pair is two UTF-16 code units that together represent one character. Since a single UTF-16 code unit only stores values from 0 to 0xFFFF, characters past 0xFFFF need to be split into two codepoints below 0xFFFF.

This is useful even in plaintext Unicode notation: a \u escape reads only four hexadecimal digits, so \u1D105 is not a single character but two (ᴐ5, not 𝄅).

Parameters:
character : str

A single character.

Returns:

The surrogate pair, given as the codepoints of the two surrogates.

Return type:

tuple[int, int]
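
The split described above follows the standard UTF-16 arithmetic. The following is a minimal sketch of that arithmetic, not this module's actual implementation; the name to_surrogates and the choice to raise ValueError for characters at or below U+FFFF are assumptions made for illustration.

```python
def to_surrogates(character: str) -> tuple[int, int]:
    """Split a character above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper: rejecting characters at or below U+FFFF is an
    assumption, since such characters have no surrogate pair.
    """
    codepoint = ord(character)
    if codepoint <= 0xFFFF:
        raise ValueError("characters at or below U+FFFF have no surrogate pair")
    offset = codepoint - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogates("\U0001D105")
print(f"{high:04X} {low:04X}")  # D834 DD05
```

The two printed values are the code units that a \uXXXX\uXXXX escape pair would need in order to spell U+1D105.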

mcfonts.utils.unicode.is_character_invisible(character)

Return if character would be invisible (might not have glyph info).

A character is "invisible" if it:

  • Is in one of these Unicode general categories: Cf, Cc, Zl, Zs, Zp.

  • Is one of these codepoints: U+2800, U+034F, U+115F, U+1160, U+17B4, U+17B5, U+3164, U+FFA0, U+1D159, U+1D174, U+1D176, U+1D177, U+1D178, U+1D17A.

  • Is private use.

You can visit https://invisible-characters.com/ if you would like to see the list.

Warning

"Invisibility" is not a standard Unicode property. For standardization purposes, do not use this outside of this library.

Parameters:
character : str

A single character.

Returns:

If character is invisible.

Return type:

bool
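
The three rules listed for this function can be approximated with the standard unicodedata module. This is only a sketch under assumptions: the names looks_invisible, INVISIBLE_CATEGORIES, and EXTRA_INVISIBLE are hypothetical, and the private-use rule is approximated here by the general category Co rather than by whatever check the library performs.

```python
import unicodedata

# General categories treated as invisible, copied from the list above.
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zl", "Zs", "Zp"}

# Individual codepoints treated as invisible, copied from the list above.
EXTRA_INVISIBLE = {
    0x2800, 0x034F, 0x115F, 0x1160, 0x17B4, 0x17B5, 0x3164, 0xFFA0,
    0x1D159, 0x1D174, 0x1D176, 0x1D177, 0x1D178, 0x1D17A,
}


def looks_invisible(character: str) -> bool:
    """Hypothetical re-creation of the documented rules, not the library code."""
    category = unicodedata.category(character)
    if category in INVISIBLE_CATEGORIES:
        return True
    if ord(character) in EXTRA_INVISIBLE:
        return True
    # Assumption: "private use" is approximated by general category Co.
    return category == "Co"


print(looks_invisible(" "), looks_invisible("\u2800"), looks_invisible("A"))
# True True False
```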

mcfonts.utils.unicode.is_character_private_use(character)

Return if character is in a Private Use Area (PUA).

A PUA is one of these codepoint ranges:

  • U+E000 to U+F8FF

  • U+F0000 to U+FFFFD

  • U+100000 to U+10FFFD

Parameters:
character : str

A single character.

Returns:

If character is in a Private Use Area.

Return type:

bool
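
The three ranges above translate directly into a membership test. The sketch below assumes nothing about how the library implements the check; in_private_use_area and PRIVATE_USE_RANGES are hypothetical names.

```python
# The three Private Use Area ranges listed above; range() excludes its end,
# so each upper bound is written as "last codepoint + 1".
PRIVATE_USE_RANGES = (
    range(0xE000, 0xF8FF + 1),
    range(0xF0000, 0xFFFFD + 1),
    range(0x100000, 0x10FFFD + 1),
)


def in_private_use_area(character: str) -> bool:
    """Hypothetical helper: True if the character falls in any listed range."""
    codepoint = ord(character)
    return any(codepoint in pua for pua in PRIVATE_USE_RANGES)


print(in_private_use_area("\ue000"), in_private_use_area("\uf900"))
# True False
```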

mcfonts.utils.unicode.is_codepoint_surrogate(codepoint)

Return if a codepoint is part of a high or low surrogate pair.

Parameters:
codepoint : int

An integer of the character's codepoint.

Returns:

If it's within 0xD800..=0xDFFF.

Return type:

bool
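
For reference, the Unicode standard reserves U+D800 to U+DBFF for high surrogates and U+DC00 to U+DFFF for low surrogates. A minimal sketch of a check built on those ranges follows; the function names are hypothetical and the library's own bounds may differ.

```python
def is_high_surrogate(codepoint: int) -> bool:
    """True for the first half of a pair (U+D800 to U+DBFF)."""
    return 0xD800 <= codepoint <= 0xDBFF


def is_low_surrogate(codepoint: int) -> bool:
    """True for the second half of a pair (U+DC00 to U+DFFF)."""
    return 0xDC00 <= codepoint <= 0xDFFF


def is_surrogate(codepoint: int) -> bool:
    """Hypothetical combined check covering both halves."""
    return is_high_surrogate(codepoint) or is_low_surrogate(codepoint)


print(is_surrogate(0xD834), is_surrogate(0xDD05), is_surrogate(0xE000))
# True True False
```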

mcfonts.utils.unicode.pretty_print_character(character)

Return relevant info about a character as a string, following U+<codepoint>: <name> <character>.

>>> pretty_print_character('\u2601')
'U+2601: CLOUD ☁'
>>> pretty_print_character('\ue000')
'U+E000: <PRIVATE USE> \ue000'
>>> pretty_print_character('\U0001f400')
'U+1F400: RAT 🐀'
>>> pretty_print_character('\b')
'U+0008: BACKSPACE ␈'
Parameters:
character : str

A single character.

Returns:

The pretty character string.

Return type:

str
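
A rough approximation of this format can be built with the standard unicodedata module. It is only a sketch: describe is a hypothetical name, the "<UNNAMED>" fallback is a placeholder invented here, and the sketch does not reproduce the special handling of private use and control characters shown in the examples above.

```python
import unicodedata


def describe(character: str) -> str:
    """Hypothetical formatter: U+<codepoint>: <name> <character>."""
    # "<UNNAMED>" is a placeholder for characters without a Unicode name
    # (private use, controls); the library evidently handles these differently.
    name = unicodedata.name(character, "<UNNAMED>")
    return f"U+{ord(character):04X}: {name} {character}"


print(describe("\u2601"))  # U+2601: CLOUD ☁
```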

mcfonts.utils.unicode.str_to_tags(string)

Return a version of string with all alphanumeric characters changed into Tags.

Given string, which should contain only ASCII characters, return the same string with every character replaced by its counterpart in the Tags block.

See https://en.wikipedia.org/wiki/Tags_(Unicode_block).

Parameters:
string : str

Any string; it should contain only ASCII characters.

Returns:

string but with ASCII characters replaced with their Tag equivalents.

Return type:

str
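
In the Tags block, U+E0020 to U+E007E mirror printable ASCII at a fixed offset of 0xE0000, so the conversion can be sketched as a simple shift. The sketch below is an illustration rather than the library's code: to_tags and TAG_OFFSET are hypothetical names, and leaving characters outside printable ASCII unchanged is an assumption.

```python
TAG_OFFSET = 0xE0000  # distance from an ASCII character to its Tag counterpart


def to_tags(string: str) -> str:
    """Hypothetical helper: shift printable ASCII into the Tags block."""
    return "".join(
        # Assumption: anything outside U+0020..U+007E is passed through as-is.
        chr(TAG_OFFSET + ord(char)) if 0x20 <= ord(char) <= 0x7E else char
        for char in string
    )


tagged = to_tags("a1")
print([f"U+{ord(char):X}" for char in tagged])  # ['U+E0061', 'U+E0031']
```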

mcfonts.utils.unicode.surrogates_to_character(surrogates)

Given a tuple of two surrogate codepoints, return the single character they combine to.

Parameters:
surrogates : tuple[int, int]

A tuple of two surrogate codepoints.

Returns:

The single character that the two surrogates combine to.

Return type:

str
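
Recombining a pair reverses the standard UTF-16 split: each surrogate contributes 10 bits, and 0x10000 is added back. The following sketch shows that arithmetic under assumptions; from_surrogates is a hypothetical name, the (high, low) ordering is assumed, and raising ValueError on an invalid pair is a choice made here, not documented library behavior.

```python
def from_surrogates(surrogates: tuple[int, int]) -> str:
    """Hypothetical helper: combine (high, low) surrogates into one character."""
    high, low = surrogates
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        # Assumption: an invalid pair is treated as an error.
        raise ValueError("expected a (high, low) surrogate pair")
    # Each surrogate carries 10 bits; add back the 0x10000 removed when splitting.
    codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(codepoint)


print(f"U+{ord(from_surrogates((0xD834, 0xDD05))):X}")  # U+1D105
```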


Last update: 2023 November 30