A stopword filter for Japanese
JaStopwordFilter
JaStopwordFilter is a lightweight Python library that filters stopwords from Japanese text based on customizable rules. It preprocesses Japanese text for natural language processing (NLP) tasks, with support for common stopword removal techniques and user-defined customization.
Features
- Preloaded Stopwords: Includes a comprehensive list of Japanese stopwords from SlothLib.
- Customizable Rules:
  - Remove tokens based on length.
  - Filter dates in common Japanese formats (e.g., 2024年11月).
  - Exclude numbers, symbols, spaces, and emojis.
- Custom Wordlist: Add your own stopwords to the filter.
- Flexible Usage: Use only the rules you need by enabling or disabling them during initialization.
Installation
Clone the repository and install the dependencies:
git clone https://github.com/your-username/ja-stopword-filter.git
cd ja-stopword-filter
pip install -r requirements.txt
Usage
Example Code
from ja_stopword_filter import JaStopwordFilter

# Example token list
tokens = ["2024年11月", "こんにちは", "123", "!", "😊", "スペース", "短い", "custom"]

# Custom wordlist
custom_wordlist = ["custom", "スペース"]

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter)
stopword_filter = JaStopwordFilter(
    use_slothlib=True,  # Use SlothLib stopwords
    use_length=True,  # Filter tokens with length <= 1
    use_date=True,  # Filter Japanese date formats
    use_numbers=True,  # Filter numeric tokens
    use_symbols=True,  # Filter symbolic tokens
    use_spaces=True,  # Filter empty or whitespace-only tokens
    use_emojis=True,  # Filter tokens containing emojis
    custom_wordlist=custom_wordlist,  # Add custom stopwords
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '短い']
Parameters
The JaStopwordFilter class supports the following parameters during initialization:
Parameter | Type | Default | Description |
---|---|---|---|
use_slothlib | bool | True | Whether to use the SlothLib stopword list. |
use_length | bool | False | Remove tokens with a length of 1 character or less. |
use_date | bool | False | Remove tokens that match Japanese date formats. |
use_numbers | bool | False | Remove numeric tokens. |
use_symbols | bool | False | Remove symbolic tokens (e.g., !, @). |
use_spaces | bool | False | Remove tokens that are empty or consist only of spaces. |
use_emojis | bool | False | Remove tokens containing emojis. |
custom_wordlist | list | None | A list of user-defined stopwords to remove. |
Stopword Sources
SlothLib Stopwords
If use_slothlib is set to True, the filter loads stopwords from a slothlib.txt file. Ensure this file is in the same directory as the script, or adjust the file path in the get_stopwords function.
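As a rough illustration of that loading step, the sketch below reads a stopword file that sits next to the script. The helper name load_stopwords and the one-word-per-line UTF-8 layout are assumptions made for this example; the library's own get_stopwords function may work differently.

```python
from pathlib import Path


def load_stopwords(path=None):
    """Hypothetical helper: read one stopword per line, skipping blank lines."""
    if path is None:
        # Assumed default: slothlib.txt in the same directory as this script.
        path = Path(__file__).resolve().parent / "slothlib.txt"
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and drop empty lines.
    return {line.strip() for line in text.splitlines() if line.strip()}
```

Passing an explicit path, for example load_stopwords("data/slothlib.txt"), covers the case where the file lives somewhere else.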
Custom Wordlist
You can pass a list of custom stopwords using the custom_wordlist parameter. These are merged with the SlothLib stopwords when use_slothlib is enabled; otherwise they are used on their own.
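The merge can be pictured as a plain set union. The sketch below is only an illustration: build_stopword_set is a hypothetical helper, and slothlib_sample is a two-word stand-in for the real SlothLib list.

```python
def build_stopword_set(base_stopwords=None, custom_wordlist=None):
    """Hypothetical helper: union of an optional base list and custom words."""
    merged = set(base_stopwords or [])
    merged.update(custom_wordlist or [])
    return merged


slothlib_sample = ["これ", "それ"]  # stand-in for the SlothLib list
custom_wordlist = ["custom", "スペース"]  # user-defined stopwords

stopwords = build_stopword_set(slothlib_sample, custom_wordlist)
tokens = ["これ", "custom", "こんにちは"]
print([t for t in tokens if t not in stopwords])  # ['こんにちは']
```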
Rules
The filter applies the following rules if they are enabled:
- Length Filtering: Tokens with one or fewer characters are removed.
- Date Filtering: Matches Japanese date patterns like:
  - YYYY年MM月
  - MM月DD日
  - YYYY年MM月DD日
- Number Filtering: Removes numeric tokens (123, 2024).
- Symbol Filtering: Removes punctuation and special symbols.
- Space Filtering: Removes tokens that are empty or consist only of spaces.
- Emoji Filtering: Detects and removes tokens containing emojis.
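To make the rules above concrete, here is a minimal standalone sketch of one way such checks could be written. It is not the library's implementation: the regular expressions, the emoji code point ranges, and every function name (is_short, is_date, remove_tokens, and so on) are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed date patterns: YYYY年MM月, YYYY年MM月DD日, and MM月DD日.
# Note that \d matches any Unicode decimal digit, including full-width digits.
DATE_RE = re.compile(r"(\d{4}年\d{1,2}月(\d{1,2}日)?|\d{1,2}月\d{1,2}日)")

# Assumed emoji ranges; a rough approximation, not a full Unicode emoji table.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def is_short(token):
    return len(token) <= 1


def is_date(token):
    return DATE_RE.fullmatch(token) is not None


def is_number(token):
    return token.isdigit()


def is_symbol(token):
    # True when every character is Unicode punctuation (P*) or a symbol (S*).
    # Emoji are category So, so emoji-only tokens also count as symbols here.
    return bool(token) and all(unicodedata.category(ch)[0] in "PS" for ch in token)


def is_space(token):
    # Empty or whitespace-only, including the full-width space U+3000.
    return token.strip() == ""


def has_emoji(token):
    return EMOJI_RE.search(token) is not None


RULES = {
    "length": is_short,
    "date": is_date,
    "number": is_number,
    "symbol": is_symbol,
    "space": is_space,
    "emoji": has_emoji,
}


def remove_tokens(tokens, enabled):
    """Drop every token that matches at least one enabled rule."""
    checks = [RULES[name] for name in enabled]
    return [t for t in tokens if not any(check(t) for check in checks)]


tokens = ["2024年11月", "こんにちは", "123", "!", "😊", " ", "短い"]
print(remove_tokens(tokens, list(RULES)))  # ['こんにちは', '短い']
```

Keeping each rule as a separate predicate mirrors the per-rule switches on the constructor: enabling a rule simply adds its check to the list.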
Contributing
Contributions are welcome! If you find a bug or have a feature request, feel free to open an issue or submit a pull request.