Clean Reddit Text Data
Project description
redditcleaner
Cleans Reddit Text Data 📜 🧹 🧼 🧽
Installation
About
Reddit uses some characters in the raw text of comments and submission selftexts that may need to be removed if just the plain natural text is required for NLP/Data Science tasks. This Python module cleans this text data.
Usage
import redditcleaner
text_raw = <Reddit text>
text_cleaned = redditcleaner.clean(text_raw)
Input
If Reddit's or Pushshift's API is used to retrieve comments or submissions, the raw comment bodies or submission self texts may look like this:
Normal text\n\n**Bold**\n\n*Italic*\n\n[Link](https://fsf.org)\n\n
~~Strike-through~~\n\n`Code`\n\n^(Superscript)
\n\n>!Spoiler!<\n\n# Heading\n\nBullet list:\n\n* Item 1\n* Item 2
\n\nNumbered list:\n\n1. Item 1\n2. Item 2\n\n>Quote\n\n
Code block\n\nTable:\n\n|Cell 1.1|Cell 1.2|\n|:-|:-|\n|Cell 2.1|Cell 2.2|
\n * Find &#x200B; > "\> the "> hidden\ntext [fsf](http://fsf.org)...
This & that in a normal sentence. "manual quote"
These characters stem from (Reddit-specific) Markdown formatting. See here how the first bit looks like on Reddit.
Output
This Python module removes these characters and returns the cleaned text. Using the example above, the output would be:
Normal text Bold Italic
Strike-through Code Superscript
Spoiler Heading Bullet list: Item 1 Item 2
Numbered list: 1. Item 1 2. Item 2 Quote
Code block Table: Cell 1.1 Cell 1.2 Cell 2.1 Cell 2.2
Find the hidden text ... This & that in a normal sentence. "manual quote"
:warning: Common punctuation, numbers, parentheses, quotation marks, emojis, etc. are deliberately not removed, as this data cleaning task pertains to Reddit-specific characters only. An additional round of data cleaning can be applied manually to clean these common items or apply lowercasing, or whatever else is needed.
Advanced Usage
The clean
function supports optional arguments and it can be used as a lambda to be applied on e.g. Pandas DataFrames.
Optional Arguments
Specific removals of characters can be disabled with optional arguments passed to the clean
function. Everything is on by default, but setting one to False
disables it.
def clean(text, newline=True, quote=True, bullet_point=True,
link=True, strikethrough=True, spoiler=True,
code=True, superscript=True, table=True, heading=True)
E.g.
redditcleaner.clean(text, heading=False)
Pandas Usage
This simulates a common format used when retrieving this type of data via the Reddit API:
# Put "retrieved" text into Pandas Dataframe
test_body_1 = "\n * Find &#x200B; > \"\\> the \"> hidden\ntext [fsf](http://fsf.org)... This & that in a normal sentence. \"manual quote\""
test_body_2 = "Normal text\n\n**Bold**\n\n*Italic*\n\n[Link](https://fsf.org)\n\n~~Strike-through~~\n\n`Code`\n\n^(Superscript)\n\n>!Spoiler!<\n\n# Heading\n\nBullet list:\n\n* Item 1\n* Item 2\n\nNumbered list:\n\n1. Item 1\n2. Item 2\n\n>Quote\n\n Code block\n\nTable:\n\n|Cell 1.1|Cell 1.2|\n|:-|:-|\n|Cell 2.1|Cell 2.2|"
import pandas as pd
df = pd.DataFrame([['asdf', 'test_a', test_body_1],
['fdsa', 'test_b', test_body_2]],
columns=list(['id', 'author', 'body']))
# Prepare redditcleaner
import redditcleaner
# Apply (map) the function on all body column entries
df['body'] = df['body'].map(redditcleaner.clean)
df
id | author | body | |
---|---|---|---|
0 | asdf | testa | Find the hidden text ... This & that in a normal sentence. "manual quote" |
1 | fdsa | testb | Normal text Bold Italic Strike-through Code Superscript Spoiler Heading Bullet list: Item 1 Item 2 Numbered list: 1. Item 1 2. Item 2 Quote Code block Table: Cell 1.1 Cell 1.2 Cell 2.1 Cell 2.2 |
Contributing
If I missed any characters that should also be removed, please let me know or feel free to create a PR yourself! :heart:
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for redditcleaner-1.1.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | bd07bd225fdf1a8e64a6588470180e53b82fde614aba1740f586569f41bb9368 |
|
MD5 | ba19f361a6bc1ce34bc2a20a2d2099bc |
|
BLAKE2b-256 | f98a7491757daaf8f3381f736473018880c9e89defd44b9ebbf48a83c172e5ff |