Use the regex library to get Unicode property escapes in Python
I was writing some code to detect and replace emoji.
Maybe you can do this using the builtin re
module by defining character classes (“all Unicode characters in range X–Y”) but that looks tedious and fiddly. I had a vague memory of “named Unicode groups” – what I was thinking of were Unicode property escapes (which I’ve previously used in JavaScript).
You can use those with the regex
library, which is a dropin replacement for the standard library re
module.
Here’s the set of examples from JavaScript: note the difference between Emoji
and Extended_Pictographic
.
>>> import regex
>>> regex.search(r'\p{Emoji}', 'flowers')
>>> regex.search(r'\p{Emoji}', 'flowers 🌼🌺🌸')
<regex.Match object; span=(8, 9), match='🌼'>
>>> regex.search(r'\p{Emoji}', 'flowers 123')
<regex.Match object; span=(8, 9), match='1'>
>>> regex.search(r'\p{Extended_Pictographic}', 'flowers')
>>> regex.search(r'\p{Extended_Pictographic}', 'flowers 🌼🌺🌸')
<regex.Match object; span=(8, 9), match='🌼'>
>>> regex.search(r'\p{Extended_Pictographic}', 'flowers 123')