Text Sanitization For Discord

discord-text-sanitizer

Text sanitization suitable for Discord bots.

Quick Start

import discordtextsanitizer as dts

# If using a library which already handles raw @everyone and @here mentions
discord_safeish = dts.preprocess_text(unsafe_content)

# If interacting with the Discord API directly
discord_safer = dts.sanitize_mass_mentions(unsafe_content, run_preprocess=True)

# If you're taking in content from users and not services, you may want to use:
discord_even_safer = dts.sanitize_mass_mentions(
    unsafe_content, run_preprocess=True, agressive=True
)
# or even
discord_safest = dts.sanitize_mass_mentions(
    unsafe_content, run_preprocess=True, users=True
)
# This may insert more characters, but is still the safest option until Discord
# fully documents their sanitization.

# Want to clean up HTML tags and replace entities?
# (included for fuller sanitization of web-fetched content for Discord)

via_lib = dts.preprocess_text(unsafe_content, strip_html=True)
# or
direct_interaction = dts.sanitize_mass_mentions(
    unsafe_content, strip_html=True, run_preprocess=True
)
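
For readers wondering what "clean up HTML tags and replace entities" amounts to, here is a rough standard-library sketch. It is not this library's implementation, which may behave differently; `strip_html_sketch` and `_TextExtractor` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only text nodes; tags are discarded (hypothetical helper)."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &amp; into plain text
        # before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html_sketch(text: str) -> str:
    """Drop tags and decode entities. A sketch, not the library's method."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


cleaned = strip_html_sketch("<b>Tom &amp; Jerry</b>")
```

Output like this would still need the mention sanitization shown above before being sent to Discord.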

Why?

Discord sanitizes text, silently changing messages.

The process they use isn't fully documented, and their sanitizer has not been disclosed or open sourced.

As a result, otherwise correct solutions for filtering mass mentions don't work the way people would expect.
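
To make the failure concrete, here is a minimal sketch. It assumes, for illustration only, that Discord silently drops U+202E (a right-to-left override) from message text. `naive_filter` and `simulated_discord_drop` are hypothetical helpers, not part of this library and not Discord's actual sanitizer.

```python
import re

# Assumption for illustration: this character is silently dropped by Discord.
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
ZWSP = "\u200b"  # ZERO WIDTH SPACE, commonly inserted to break mentions


def naive_filter(text: str) -> str:
    """Break literal @everyone / @here by inserting a zero-width space."""
    return re.sub(r"@(everyone|here)", "@" + ZWSP + r"\1", text)


def simulated_discord_drop(text: str) -> str:
    """Hypothetical stand-in for Discord's undisclosed sanitization step."""
    return text.replace(RLO, "")


crafted = "@" + RLO + "everyone"

# The filter sees no literal "@everyone", so it leaves the text unchanged.
filtered = naive_filter(crafted)

# After the silent drop, the mass mention is back.
delivered = simulated_discord_drop(filtered)
```

The filter itself is correct for the text it was given; the problem is that the text changes after the filter has run.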

Why not use this?

If you are only sending text in embeds, or only re-sending text taken from message content, you probably don't need this. In the first case, embeds don't cause pings; at worst you might get some malformed messages. In the second, you are reading input which has already been through the undocumented sanitization.

So how does this work without a documented set of steps from Discord?

After some trial and error, I have a list of characters which Discord removes consistently.

Many other characters were dropped inconsistently.

Originally, following Discord's misleading documentation, I found that with NFC-normalized Unicode text, I couldn't get Discord to drop anything other than the characters which were dropped consistently. (Note: this was short-lived, and a counterexample has since been found.) However, the consistently dropped characters include right-to-left overrides, which may be useful for globally sourced content.

Rather than reimplement NFC normalization and directional override removal, this uses two well-supported libraries which handle them, then removes any remaining characters which Discord is known to drop silently.
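
The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the two libraries it actually uses are not named here, `preprocess_sketch` is a hypothetical name, and `HYPOTHETICAL_DROPPED` is a placeholder rather than the real list of characters Discord drops.

```python
import unicodedata

# Unicode explicit directional formatting characters:
# embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
DIRECTIONAL_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

# Placeholder only: NOT the project's real list of silently dropped characters.
HYPOTHETICAL_DROPPED = frozenset("\ufeff")


def preprocess_sketch(text: str) -> str:
    """NFC-normalize, then remove directional controls and dropped characters."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(
        ch
        for ch in normalized
        if ch not in DIRECTIONAL_CONTROLS and ch not in HYPOTHETICAL_DROPPED
    )


# "e" + COMBINING ACUTE ACCENT composes to a single code point under NFC.
composed = preprocess_sketch("e\u0301")

# The override is removed before any mention filtering would run.
exposed = preprocess_sketch("@\u202eeveryone")
```

Running a step like this before filtering mass mentions means the filter sees roughly the same text Discord will end up displaying.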

What to do if you find something this doesn't handle.

Open an issue with details, or a PR with a fix and a sample of text it fixes. I'll be happy to include it.

I'd prefer this not be necessary at all, but until that's the case, cooperation among developers who may be impacted by this is great.

Download files

Files for discord-text-sanitizer, version 0.0.10
Filename                                          Size    File type  Python version
discord_text_sanitizer-0.0.10-py3-none-any.whl    5.5 kB  Wheel      py3
discord-text-sanitizer-0.0.10.tar.gz              4.0 kB  Source     None
