Text Sanitization For Discord
Text sanitization suitable for Discord bots.
import discordtextsanitizer as dts

# If using a library which already handles raw @everyone and @here mentions
discord_safeish = dts.preprocess_text(unsafe_content)

# If interacting directly
discord_safer = dts.sanitize_mass_mentions(unsafe_content, run_preprocess=True)

# If you're taking in content from users and not services, you may want to use:
discord_even_safer = dts.sanitize_mass_mentions(
    unsafe_content, run_preprocess=True, agressive=True
)

# or even
discord_safest = dts.sanitize_mass_mentions(
    unsafe_content, run_preprocess=True, users=True
)
# This may insert more characters, but is still the safest option until Discord
# fully documents their sanitization.

# Want to clean up HTML tags and replace entities?
# (included for fuller sanitization of web-fetched content for Discord)
via_lib = dts.preprocess_text(unsafe_content, strip_html=True)

# or
direct_interaction = dts.sanitize_mass_mentions(
    unsafe_content, strip_html=True, run_preprocess=True
)
Discord sanitizes text, silently changing messages.
The process they use isn't fully documented, and their sanitizer has not been disclosed or open sourced.
As a result, otherwise correct solutions for filtering mass mentions do not work the way people would expect.
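The failure mode can be illustrated with a small, self-contained sketch. It does not use this package or Discord's real behaviour: `naive_filter` and `simulate_discord` are hypothetical helpers, and the `SILENTLY_DROPPED` set is a stand-in that uses the right-to-left override (U+202E) purely for illustration.

```python
import re

# Hypothetical stand-in for the characters Discord drops silently.
# U+202E (right-to-left override) is used here only as an example.
SILENTLY_DROPPED = {"\u202e"}

MASS_MENTION = re.compile(r"@(everyone|here)")


def naive_filter(text: str) -> str:
    """Break @everyone/@here by inserting a zero-width space after the @."""
    return MASS_MENTION.sub("@\u200b\\1", text)


def simulate_discord(text: str) -> str:
    """Simulate a server-side step that silently removes some characters."""
    return "".join(ch for ch in text if ch not in SILENTLY_DROPPED)


# A crafted message hides a droppable character inside the mention.
crafted = "@\u202eeveryone"
filtered = naive_filter(crafted)        # the pattern does not match, nothing changes
delivered = simulate_discord(filtered)  # the mention re-forms after the silent drop
```

In this sketch the filter itself is correct for plain input, yet `delivered` ends up as a working mass mention because the text changed after filtering. That is why preprocessing has to happen before the mention filter runs.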
Why not use this?
If you are only sending embeds, or only re-sending text taken from message content, you probably don't need this. In the first case, embeds don't cause pings; at worst you might get some malformed messages. In the second, you are reading input which has already been through the undocumented sanitization.
So how does this work without a documented set of steps from Discord?
After some trial and error, I have a list of characters which Discord removes consistently.
There were many characters dropped inconsistently.
Originally, following Discord's misleading documentation, I found that NFC-normalized Unicode text lost nothing other than the characters which were dropped consistently. (Note: this was short-lived, and a counterexample has since been found.) However, the dropped characters include right-to-left overrides, which may be useful for globally sourced content.
Rather than reimplementing NFC normalization and directional override removal, this uses two well-supported libraries which handle them, then removes any remaining characters which Discord is known to drop silently.
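The three steps can be sketched roughly as follows. This is not the package's implementation: it uses only the standard library rather than the two libraries the package depends on, `preprocess_sketch` is a hypothetical name, and `KNOWN_DROPPED` is left empty as a placeholder because the actual list of dropped characters is not reproduced here.

```python
import re
import unicodedata

# Explicit directional formatting characters: embeddings and overrides
# (U+202A to U+202E) and isolates (U+2066 to U+2069).
DIRECTIONAL_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")

# Placeholder: the real set of characters Discord drops is an assumption
# the caller would have to supply; it is intentionally empty here.
KNOWN_DROPPED = frozenset()


def preprocess_sketch(text: str, known_dropped=KNOWN_DROPPED) -> str:
    """Normalize to NFC, strip directional controls, drop known characters."""
    text = unicodedata.normalize("NFC", text)
    text = DIRECTIONAL_CONTROLS.sub("", text)
    return "".join(ch for ch in text if ch not in known_dropped)
```

Running a step like this before the mass-mention filter means the filter sees text closer to what Discord will eventually deliver. As noted above, NFC normalization alone is known not to be sufficient, which is why the explicit removal list is a separate final step.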
What to do if you find something this doesn't handle.
Open an issue with details, or a PR with a fix and a sample of text it fixes. I'll be happy to include it.
I'd prefer this not be necessary at all, but until that is the case, cooperation among developers who may be impacted by this is great.