Clean Reddit Text Data

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.8

Project description

redditcleaner

Cleans Reddit Text Data 📜 🧹 🧼 🧽

Installation

pip install redditcleaner

About

The raw text of Reddit comments and submission selftexts contains Markdown and escape characters that need to be removed when only the plain natural-language text is wanted, for example in NLP or data science tasks. This Python module strips those characters.

Usage

import redditcleaner
text_raw = "Normal text\n\n**Bold**"  # any raw Reddit comment body or selftext
text_cleaned = redditcleaner.clean(text_raw)

Input

If Reddit's or Pushshift's API is used to retrieve comments or submissions, the raw comment bodies or submission selftexts may look like this:

Normal text\n\n**Bold**\n\n*Italic*\n\n[Link](https://fsf.org)\n\n
~~Strike-through~~\n\n`Code`\n\n^(Superscript)
\n\n&gt;!Spoiler!&lt;\n\n# Heading\n\nBullet list:\n\n* Item 1\n* Item 2
\n\nNumbered list:\n\n1. Item 1\n2. Item 2\n\n&gt;Quote\n\n
    Code block\n\nTable:\n\n|Cell 1.1|Cell 1.2|\n|:-|:-|\n|Cell 2.1|Cell 2.2|

\n * Find &amp;#x200B; &gt; "\&gt; the "&gt; hidden\ntext [fsf](http://fsf.org)...
This & that in a normal sentence. "manual quote"

These characters stem from (Reddit-specific) Markdown formatting.
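Besides the Markdown markers, the sample also contains HTML entities such as &gt;, &lt; and &amp;, which stand for >, < and &. The sketch below uses only the standard library's html.unescape to make such raw text easier to read while inspecting it. It is an illustration, not a description of how redditcleaner handles these entities internally, and the sample string is a shortened excerpt of the input above.

```python
import html

raw = "&gt;Quote\n\n&gt;!Spoiler!&lt;\n\nFind &amp;#x200B; this"

# One pass turns &gt; / &lt; / &amp; into > / < / &.
once = html.unescape(raw)

# "&amp;#x200B;" is double-escaped: the first pass leaves "&#x200B;",
# and a second pass turns that into the zero-width space U+200B.
twice = html.unescape(once)

print(repr(once))
print(repr(twice))
```

After unescaping, the remaining markers (>, >!...!<) are ordinary Reddit Markdown for quotes and spoilers.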

Output

This Python module removes these characters and returns the cleaned text. Using the example above, the output would be:

Normal text Bold Italic
Strike-through Code Superscript
Spoiler Heading Bullet list: Item 1 Item 2 
Numbered list: 1. Item 1 2. Item 2 Quote
Code block Table: Cell 1.1 Cell 1.2 Cell 2.1 Cell 2.2

Find the hidden text ... This & that in a normal sentence. "manual quote"

⚠️ Common punctuation, numbers, parentheses, quotation marks, and similar characters are deliberately left in place, because this module targets Reddit-specific characters only. Apply a further cleaning pass yourself to remove them, lowercase the text, or do whatever else your task needs.
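As one possible follow-up pass, the sketch below lowercases already-cleaned text, drops ASCII punctuation, and collapses whitespace using only the standard library. The function name normalize is hypothetical and not part of redditcleaner; adjust the steps to whatever your task actually requires.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # str.translate with a deletion table removes every ASCII punctuation mark.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() without arguments splits on any whitespace run.
    return " ".join(text.split())

cleaned = 'Find the hidden text ... This & that in a normal sentence. "manual quote"'
print(normalize(cleaned))
```

Note that string.punctuation covers ASCII only, so typographic quotes or other Unicode punctuation would survive this pass.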

Advanced Usage

The clean function accepts optional arguments, and it can be passed directly (or wrapped in a lambda) to apply it to collections such as Pandas DataFrame columns.

Optional Arguments

Each type of removal can be disabled individually through an optional argument of the clean function. All removals are enabled by default; setting an argument to False disables that removal.

def clean(text, newline=True, quote=True, bullet_point=True,
          link=True, strikethrough=True, spoiler=True,
          code=True, superscript=True, table=True, heading=True):
    ...  # implementation omitted

For example, to leave heading markers untouched:

redditcleaner.clean(text, heading=False)
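When the same set of removals should stay disabled across many calls, the flags can be bound once. The sketch below does this with functools.partial; the helper name make_cleaner and the OPTIONS set are hypothetical, and the option names are copied from the clean signature shown above. The cleaning function is passed in as an argument, so the block itself does not depend on redditcleaner being installed.

```python
from functools import partial

# Option names copied from the clean() signature shown above.
OPTIONS = {"newline", "quote", "bullet_point", "link", "strikethrough",
           "spoiler", "code", "superscript", "table", "heading"}

def make_cleaner(clean_fn, keep=()):
    """Return clean_fn with the removals named in `keep` switched off."""
    unknown = set(keep) - OPTIONS
    if unknown:
        # Fail early on typos instead of passing a bad keyword through.
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return partial(clean_fn, **{name: False for name in keep})
```

With the package installed, make_cleaner(redditcleaner.clean, keep=["heading", "table"]) would return a callable that behaves like redditcleaner.clean(text, heading=False, table=False).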

Pandas Usage

This simulates a common format used when retrieving this type of data via the Reddit API:

# Put "retrieved" text into a Pandas DataFrame
test_body_1 = "\n * Find &amp;#x200B; &gt; \"\\&gt; the \"&gt; hidden\ntext [fsf](http://fsf.org)... This & that in a normal sentence. \"manual quote\""
test_body_2 = "Normal text\n\n**Bold**\n\n*Italic*\n\n[Link](https://fsf.org)\n\n~~Strike-through~~\n\n`Code`\n\n^(Superscript)\n\n&gt;!Spoiler!&lt;\n\n# Heading\n\nBullet list:\n\n* Item 1\n* Item 2\n\nNumbered list:\n\n1. Item 1\n2. Item 2\n\n&gt;Quote\n\n    Code block\n\nTable:\n\n|Cell 1.1|Cell 1.2|\n|:-|:-|\n|Cell 2.1|Cell 2.2|"

import pandas as pd
import redditcleaner

# Build a small DataFrame that mimics retrieved comments
df = pd.DataFrame([['asdf', 'test_a', test_body_1],
                   ['fdsa', 'test_b', test_body_2]],
                  columns=['id', 'author', 'body'])

# Apply (map) the cleaner to every value in the body column
df['body'] = df['body'].map(redditcleaner.clean)
df

	id	author	body
0	asdf	test_a	Find the hidden text ... This & that in a normal sentence. "manual quote"
1	fdsa	test_b	Normal text Bold Italic Strike-through Code Superscript Spoiler Heading Bullet list: Item 1 Item 2 Numbered list: 1. Item 1 2. Item 2 Quote Code block Table: Cell 1.1 Cell 1.2 Cell 2.1 Cell 2.2

Contributing

If I missed any characters that should also be removed, please let me know or feel free to create a PR yourself! ❤️


Release history

- 1.1.2: Apr 14, 2020
- 1.1.1: Apr 11, 2020
- 1.1: Apr 11, 2020
- 1.0: Apr 10, 2020 (this version)

Download files


Source Distribution

redditcleaner-1.0.tar.gz (3.7 kB)

Uploaded Apr 10, 2020 Source

Built Distribution


redditcleaner-1.0-py3-none-any.whl (4.5 kB)

Uploaded Apr 10, 2020 Python 3

File details

Details for the file redditcleaner-1.0.tar.gz.

File metadata

Download URL: redditcleaner-1.0.tar.gz
Upload date: Apr 10, 2020
Size: 3.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.2

File hashes

Hashes for redditcleaner-1.0.tar.gz
Algorithm	Hash digest
SHA256	`1c21c8a1c1891c32a98171bd4f5643594313bff7a68b7299bb5c98f267f25432`
MD5	`eaff9c24b59ee53d347dcf94ffe3dd36`
BLAKE2b-256	`f8ff4397549da820b69e897fd8f9dfa8c1776e4186563853cef7fd6b3d112b2d`
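A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses only the standard library's hashlib; the helper names sha256_of and matches are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib

# SHA256 digest listed above for redditcleaner-1.0.tar.gz
EXPECTED_SHA256 = "1c21c8a1c1891c32a98171bd4f5643594313bff7a68b7299bb5c98f267f25432"

def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches("redditcleaner-1.0.tar.gz") should return True for an unmodified copy of the source distribution saved under that name in the current directory.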


File details

Details for the file redditcleaner-1.0-py3-none-any.whl.

File metadata

Download URL: redditcleaner-1.0-py3-none-any.whl
Upload date: Apr 10, 2020
Size: 4.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.2

File hashes

Hashes for redditcleaner-1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`19bd2258baf10724a67280c7d1ecb9fff00a28e8eb84fafb0f166406c710bb0f`
MD5	`10ac4e3aef2d3c7e87357152d3b07540`
BLAKE2b-256	`cf2ef1b6bbee03cd58c9c664d87484d5674076b392350686b92cccb645328fef`

