mcfonts.unicode

Functions for working with Unicode characters, codepoints, and surrogate pairs.

Module Contents

blocks(characters: collections.abc.Iterable[str], /) dict[str, int]

Return a report of the Unicode blocks covered by characters.

Automatically sorted by block.

Parameters:
characters : collections.abc.Iterable[str]

Iterable of characters to report on.

Returns:

Dictionary of {block name: count}.

Return type:

dict[str, int]
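
As a rough illustration of the kind of report described above, the sketch below counts characters per block with a `Counter`. The names `SAMPLE_BLOCKS`, `block_of`, and `count_blocks` are hypothetical and not part of mcfonts; the three-entry table is a stand-in for the full Unicode block list, and sorting by block name is an assumption about what "sorted by block" means.

```python
from collections import Counter
from collections.abc import Iterable

# Hypothetical, deliberately tiny block table: (first, last, name).
# A complete version would need every range from the Unicode Blocks.txt data file.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
]


def block_of(character: str) -> str:
    """Return the block name for one character, or "No_Block" if it is not in the table."""
    codepoint = ord(character)
    for first, last, name in SAMPLE_BLOCKS:
        if first <= codepoint <= last:
            return name
    return "No_Block"


def count_blocks(characters: Iterable[str]) -> dict[str, int]:
    """Count characters per block; keys are ordered by block name (an assumption)."""
    counts = Counter(block_of(character) for character in characters)
    return dict(sorted(counts.items()))


# "a", U+00E9, space, U+03B1, U+03B2
report = count_blocks("a\u00e9 \u03b1\u03b2")
```

Here `report` maps "Basic Latin" to 2, "Greek and Coptic" to 2, and "Latin-1 Supplement" to 1.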

character_to_surrogates(character: str, /) tuple[int, int]

Convert one character into the two integer codepoints of its UTF-16 surrogate pair.

A surrogate pair is two code units that together represent one character. Since a single UTF-16 code unit only holds values from 0 to 0xFFFF, characters past 0xFFFF need to be split into two codepoints below 0xFFFF.

This is useful even in plaintext Unicode notation, because \u1D105 is not a single character; it is two (ᴐ5, not 𝄅).

Parameters:
character : str

The character.

Returns:

The surrogate pair, as a tuple of the two surrogate codepoints.

Return type:

tuple[int, int]
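
The split described above follows the standard UTF-16 arithmetic, sketched below. The function name `to_surrogates` is hypothetical, and raising `ValueError` for characters at or below U+FFFF is an assumption; this reference does not say how mcfonts treats such input.

```python
def to_surrogates(character: str) -> tuple[int, int]:
    """Split one character above U+FFFF into (high, low) surrogate codepoints."""
    codepoint = ord(character)
    if codepoint <= 0xFFFF:
        # Assumption: characters that fit in one code unit are rejected.
        raise ValueError("character fits in one UTF-16 code unit")
    offset = codepoint - 0x10000      # 20-bit value to distribute
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


pair = to_surrogates("\U0001D105")
```

For U+1D105 this gives `(0xD834, 0xDD05)`, which matches the bytes produced by `"\U0001D105".encode("utf-16-be")`.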

is_character_invisible(character: str, /) bool

Return if character would be invisible.

A character is "invisible" if it:

  • Is in these categories: Cf, Cc, Zl, Zs, Zp.

  • Is equal to these codepoints: 2800, 034F, 115F, 1160, 17B4, 17B5, 3164, FFA0, 1D159, 1D174, 1D176, 1D177, 1D178, 1D17A.

  • Is private use.

You can visit https://invisible-characters.com/ if you would like to see the list.

Warning

"Invisibility" is not a valid Unicode standard property. For standardization purposes, do not use this outside of this library.

Parameters:
character : str

The character.

Returns:

If character would appear invisible.

Return type:

bool
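
The three rules listed above can be approximated with the standard library alone. This is a sketch, not the library's code: `looks_invisible` and the two constants are hypothetical names, and testing for the general category `Co` is used here as a stand-in for the private use rule.

```python
import unicodedata

# General categories named in the first rule.
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zl", "Zs", "Zp"}

# Individual codepoints named in the second rule.
INVISIBLE_CODEPOINTS = {
    0x2800, 0x034F, 0x115F, 0x1160, 0x17B4, 0x17B5, 0x3164, 0xFFA0,
    0x1D159, 0x1D174, 0x1D176, 0x1D177, 0x1D178, 0x1D17A,
}


def looks_invisible(character: str) -> bool:
    """Return True if the character matches any of the three documented rules."""
    category = unicodedata.category(character)
    return (
        category in INVISIBLE_CATEGORIES
        or ord(character) in INVISIBLE_CODEPOINTS
        or category == "Co"  # "Co" is the Private_Use general category
    )
```

With this sketch, a space, a zero width space (U+200B), and U+E000 all count as invisible, while "a" does not.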

is_character_private_use(character: str, /) bool

Return if character is in a Private Use Area (PUA).

A PUA is one of these codepoint ranges:

  • U+E000 to U+F8FF

  • U+F0000 to U+FFFFD

  • U+100000 to U+10FFFD

Parameters:
character : str

The character.

Returns:

If character is in a Private Use Area.

Return type:

bool
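
A minimal sketch of the range check described above, using the three documented ranges directly. The names `PRIVATE_USE_RANGES` and `in_private_use` are hypothetical and not taken from mcfonts.

```python
# The three Private Use Area ranges; range() excludes its stop value, hence the + 1.
PRIVATE_USE_RANGES = (
    range(0xE000, 0xF8FF + 1),
    range(0xF0000, 0xFFFFD + 1),
    range(0x100000, 0x10FFFD + 1),
)


def in_private_use(character: str) -> bool:
    """Return True if the character's codepoint falls in any PUA range."""
    codepoint = ord(character)
    return any(codepoint in pua for pua in PRIVATE_USE_RANGES)
```

Membership tests on `range` objects are constant-time for integers, so the check stays cheap even for the two large supplementary ranges.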

is_character_surrogate(character: str, /) bool

Return if a character is a high or low surrogate, that is, one half of a surrogate pair.

Parameters:
character : str

The character.

Returns:

If it's within U+D800 to U+DC00.

Return type:

bool

pprint_char(character: str, /) str

Return relevant info about a character as a string, following U+<codepoint>: <name> <character>.

>>> pprint_char('\u2601')
'U+2601: CLOUD ☁'
>>> pprint_char('\ue000')
'U+E000: <PRIVATE USE> \ue000'
>>> pprint_char('\U0001f400')
'U+1F400: RAT 🐀'
>>> pprint_char('\b')
'U+0008: BACKSPACE ␈'

Parameters:
character : str

The character.

Returns:

A "pretty" character string.

Return type:

str
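
The output format can be approximated with `unicodedata`, as in the sketch below. The function name `describe` and the `<UNNAMED>` placeholder are hypothetical. Unlike the examples above, this sketch does not give control characters such as '\b' a name or a substitute symbol, because `unicodedata.name` has no names for them.

```python
import unicodedata


def describe(character: str) -> str:
    """Format a character as 'U+<codepoint>: <name> <character>'."""
    codepoint = ord(character)
    if unicodedata.category(character) == "Co":
        # Private use characters have no standard name.
        name = "<PRIVATE USE>"
    else:
        # Assumption: fall back to a placeholder when no name is defined.
        name = unicodedata.name(character, "<UNNAMED>")
    # At least four uppercase hex digits, as in the U+XXXX convention.
    return f"U+{codepoint:04X}: {name} {character}"


cloud = describe("\u2601")
```

The `04X` format pads to four digits but does not truncate, so U+1F400 is still written with all five digits.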

str_to_tags(string: str, /) str

Return a version of string with all alphanumeric characters changed into Tags.

Given string, which should contain only ASCII characters, return the same string with every character replaced by its Tag counterpart.

See https://en.wikipedia.org/wiki/Tags_(Unicode_block).

Parameters:
string : str

Any string; it should have only ASCII characters.

Returns:

Input string but with ASCII characters replaced with their Tag equivalents.

Return type:

str
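
In the Tags block, each tag character sits at the ASCII codepoint plus 0xE0000, so the conversion can be sketched as a fixed offset. The name `to_tags` is hypothetical, and both the choice to convert the printable range U+0020 to U+007E and the choice to leave other characters unchanged are assumptions, not documented behavior.

```python
TAG_OFFSET = 0xE0000  # Tags block base: U+E0041 is the tag form of "A" (U+0041)


def to_tags(string: str) -> str:
    """Replace printable ASCII characters with their Tag counterparts."""
    return "".join(
        # Assumption: anything outside U+0020..U+007E passes through untouched.
        chr(TAG_OFFSET + ord(character)) if 0x20 <= ord(character) <= 0x7E else character
        for character in string
    )


tagged = to_tags("mc")
```

Here `tagged` is the two-character string U+E006D U+E0063, which most fonts render as nothing at all.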

surrogates_to_character(surrogates: tuple[int, int], /) str

Given a tuple of surrogate codepoints, return the single character they represent.

Parameters:
surrogates : tuple[int, int]

The tuple of 2 surrogate codepoints.

Returns:

The single character that the surrogate pair encodes.

Return type:

str
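
Recombining a pair is the standard UTF-16 arithmetic in reverse, sketched below. The name `from_surrogates` is hypothetical, and validating the two halves against the standard high and low ranges is an assumption about error handling that this reference does not describe.

```python
def from_surrogates(surrogates: tuple[int, int]) -> str:
    """Combine (high, low) surrogate codepoints into the character they encode."""
    high, low = surrogates
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        # Assumption: malformed pairs are rejected rather than passed through.
        raise ValueError("not a valid high/low surrogate pair")
    # Each half carries 10 bits; the result is offset by 0x10000.
    codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(codepoint)


character = from_surrogates((0xD834, 0xDD05))
```

The pair `(0xD834, 0xDD05)` yields U+1D105.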

symbol_for(character: str, /) str
Parameters:
character : str

Return type:

str

INVISIBLE_CHARACTERS : Final[set[str]]

A set of characters that do not have a visual representation under most fonts.

SURROGATE_END : Final[int] = 56320
SURROGATE_START : Final[int] = 55296
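
The two constants are easier to recognise in hexadecimal. The snippet below only reformats the documented values; the variable names `start_hex` and `end_hex` are hypothetical.

```python
SURROGATE_START = 55296  # value as documented above
SURROGATE_END = 56320    # value as documented above

# Render both in the usual U+XXXX notation.
start_hex = f"U+{SURROGATE_START:04X}"
end_hex = f"U+{SURROGATE_END:04X}"
```

This gives U+D800 and U+DC00. In the Unicode Standard, U+D800 is the first high surrogate and U+DC00 is the first low surrogate, with low surrogates continuing up to U+DFFF.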