jp-stopword-filter
jp-stopword-filter is a lightweight Python library designed to filter stopwords from Japanese text based on customizable rules. It provides an efficient way to preprocess Japanese text for natural language processing (NLP) tasks, with support for common stopword removal techniques and user-defined customization.
Features
- Preloaded Stopwords: Includes a comprehensive list of Japanese stopwords from SlothLib.
- Customizable Rules:
- Remove tokens based on length.
- Filter dates in common Japanese formats (e.g., 2024年11月).
- Exclude numbers, symbols, spaces, and emojis.
- Custom Wordlist: Add your own stopwords to the filter.
- Custom Filters: Define your own filtering logic with a custom_filter function.
- Flexible Usage: Use only the rules you need by enabling or disabling them during initialization.
Installation
Install via PyPI:
pip install jp-stopword-filter
Alternatively, clone the repository and install dependencies:
git clone https://github.com/your-username/ja-stopword-filter.git
cd ja-stopword-filter
pip install -r requirements.txt
Usage
Basic Usage (String Tokens Only)
In this example, we filter tokens represented as strings. A custom wordlist is provided for additional filtering.
from ja_stopword_filter import JaStopwordFilter
# Define a token list
tokens = ["2024年11月", "こんにちは", "123", "!", "😊", "スペース", "短い", "custom"]
# Custom wordlist
custom_wordlist = ["custom", "スペース"]
# Initialize the filter
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,       # Convert full-width characters to half-width
    use_slothlib=True,               # Include SlothLib stopwords
    filter_length=1,                 # Filter tokens with length <= 1
    use_date=True,                   # Remove tokens matching date patterns
    use_numbers=True,                # Remove numeric tokens
    use_symbols=True,                # Remove tokens with symbols
    use_spaces=True,                 # Remove empty or whitespace-only tokens
    use_emojis=True,                 # Remove emoji-containing tokens
    custom_wordlist=custom_wordlist, # Add custom stopwords
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '短い']
Advanced Usage (Using Token Class and custom_filter)
This example demonstrates how to use the Token class and a custom_filter function to define custom filtering logic. Additionally, you can extend the Token class to include custom attributes and design corresponding custom_filter functions to suit specific use cases.
Example with Basic Token Class
from ja_stopword_filter import JaStopwordFilter, Token
# Define a list of Token objects
tokens = [
    Token("2024年11月", "名詞"),
    Token("こんにちは", "動詞"),
    Token("123", "名詞"),
    Token("短い", "形容詞"),
    Token("custom", "名詞"),
]
# Define a custom filter function
def custom_filter(token: Token) -> bool:
    # Remove tokens where the part of speech (pos) is "名詞"
    return token.pos == "名詞"
# Initialize the filter
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,   # Convert full-width characters to half-width
    custom_filter=custom_filter, # Apply custom filtering logic
    use_numbers=True,            # Remove numeric tokens
    use_emojis=True,             # Remove tokens with emojis
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']
Example with Extended Token Class
You can extend the Token class to include additional attributes, such as frequency, is_special, or context. These attributes allow you to design more complex filtering logic.
from ja_stopword_filter import JaStopwordFilter, Token
# Extend the Token class to include custom attributes
class ExtendedToken(Token):
    def __init__(self, surface: str, pos: str, frequency: int, is_special: bool) -> None:
        super().__init__(surface, pos)
        self.frequency = frequency    # Frequency of the token in the text
        self.is_special = is_special  # Whether the token is marked as special
# Define a list of ExtendedToken objects
tokens = [
    ExtendedToken("2024年11月", "名詞", 10, False),
    ExtendedToken("こんにちは", "動詞", 5, True),
    ExtendedToken("123", "名詞", 2, False),
    ExtendedToken("短い", "形容詞", 15, False),
    ExtendedToken("custom", "名詞", 3, True),
]
# Define a custom filter function
def custom_filter(token: ExtendedToken) -> bool:
    # Remove tokens that are "名詞" or whose frequency is less than 5
    return token.pos == "名詞" or token.frequency < 5
# Initialize the filter
stopword_filter = JaStopwordFilter(
    custom_filter=custom_filter, # Apply custom filtering logic
    use_numbers=True,            # Remove numeric tokens
    use_symbols=True,            # Remove tokens with symbols
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']
Key Points:
- Custom Attributes: Add attributes like frequency, is_special, or others depending on your requirements.
- Flexible Filtering: Use the custom_filter parameter to define logic based on the extended attributes.
- Use Cases:
- Filter tokens by their frequency in the text.
- Exclude special or flagged tokens.
- Combine part of speech filtering with additional attribute-based rules.
This flexibility allows jp-stopword-filter to adapt to a wide variety of text preprocessing tasks beyond stopword removal.
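The attribute-based filtering above boils down to a predicate over token attributes. As a library-independent sketch, here Tok is a hypothetical stand-in for the library's Token class, and should_remove mirrors the custom_filter shown earlier:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Hypothetical stand-in for the library's Token class."""
    surface: str
    pos: str
    frequency: int

def should_remove(t: Tok) -> bool:
    # Remove nouns and rare tokens, mirroring the custom_filter above
    return t.pos == "名詞" or t.frequency < 5

tokens = [Tok("こんにちは", "動詞", 5), Tok("123", "名詞", 2), Tok("短い", "形容詞", 15)]
kept = [t.surface for t in tokens if not should_remove(t)]
print(kept)  # ['こんにちは', '短い']
```

The same predicate shape works for any attribute you add to the token type, which is what makes the custom_filter hook composable.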
Parameters
JaStopwordFilter supports the following parameters for customization:
| Parameter | Type | Default | Description |
|---|---|---|---|
| convert_full_to_half | bool | True | Convert full-width characters to half-width. |
| use_slothlib | bool | True | Use the SlothLib stopword list. |
| filter_length | int | 0 | Remove tokens with length ≤ this value (0 disables it). |
| use_date | bool | False | Remove tokens matching Japanese date patterns. |
| use_numbers | bool | False | Remove numeric tokens. |
| use_symbols | bool | False | Remove tokens containing symbols. |
| use_spaces | bool | False | Remove tokens that are empty or consist only of spaces. |
| use_emojis | bool | False | Remove tokens containing emojis. |
| custom_wordlist | list[str] | None | Add custom stopwords. |
| custom_filter | Callable[[Token], bool] | None | Apply a custom filter function to Token objects. |
Filtering Rules
Preloaded Stopwords
By default, JaStopwordFilter uses stopwords from SlothLib, a comprehensive list of commonly used Japanese stopwords.
In addition to the SlothLib stopwords, users can provide their own custom wordlist through the custom_wordlist parameter. This allows for further customization of the filtering process by adding domain-specific or task-specific stopwords.
Example: Adding a Custom Wordlist
from ja_stopword_filter import JaStopwordFilter
# Define a custom wordlist
custom_wordlist = ["example", "特定単語", "custom_stopword"]
# Initialize the filter with a custom wordlist
stopword_filter = JaStopwordFilter(
    use_slothlib=True,               # Include SlothLib stopwords
    custom_wordlist=custom_wordlist, # Add user-defined stopwords
)

# Define a list of tokens
tokens = ["こんにちは", "特定単語", "example", "custom_stopword", "一般単語"]

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '一般単語']
Key Points:
- SlothLib Stopwords: Automatically included when use_slothlib=True (the default).
- Custom Wordlist: Use the custom_wordlist parameter to add your own stopwords.
- Combining Stopwords: SlothLib and custom wordlists are applied together, so removal covers both general and domain-specific stopwords.
This lets users adapt the filtering process to specific domains or projects.
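Conceptually, combining the two lists amounts to a set union. A minimal sketch, using a tiny made-up excerpt in place of the full SlothLib list:

```python
# Tiny illustrative excerpt; the real SlothLib list contains hundreds of entries.
slothlib_words = {"あそこ", "あたり", "ここ"}
custom_words = {"特定単語", "example"}

# The effective stopword set is the union of both lists.
stopwords = slothlib_words | custom_words

tokens = ["こんにちは", "特定単語", "example", "一般単語"]
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['こんにちは', '一般単語']
```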
Rule Descriptions
- Length Filtering: Removes tokens with length ≤ the specified value.
- Date Filtering: Matches and removes Japanese date patterns such as YYYY年MM月, MM月DD日, and YYYY年MM月DD日.
- Number Filtering: Removes numeric tokens like 123 or 2024.
- Symbol Filtering: Removes punctuation and special characters.
- Space Filtering: Removes empty or whitespace-only tokens.
- Emoji Filtering: Detects and removes tokens containing emojis.
- Custom Filter: Apply custom logic to filter tokens based on user-defined rules.
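For illustration, date filtering of this kind can be implemented with a regular expression. This is a hypothetical sketch, not the library's actual pattern:

```python
import re

# Hypothetical pattern covering YYYY年MM月, MM月DD日, and YYYY年MM月DD日.
DATE_RE = re.compile(r"^(\d{1,4}年\d{1,2}月(\d{1,2}日)?|\d{1,2}月\d{1,2}日)$")

def looks_like_date(token: str) -> bool:
    # A token is dropped only when the whole string matches a date pattern.
    return DATE_RE.fullmatch(token) is not None

print(looks_like_date("2024年11月"))  # True
print(looks_like_date("11月3日"))     # True
print(looks_like_date("こんにちは"))  # False
```

Anchoring the pattern (fullmatch) matters: tokens that merely contain a date substring are left alone.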
Contributing
Contributions are welcome! If you find a bug or have a feature request, feel free to open an issue or submit a pull request.
File details
Details for the file jp_stopword_filter-0.2.0.tar.gz.
File metadata
- Download URL: jp_stopword_filter-0.2.0.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.20.1 CPython/3.9.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 11256fb6420746e42f5acb753e3b538ae0817a624d2888e0d25946ae4d062400 |
| MD5 | 8d6c251bc76afa5c89297021ec6cf2a7 |
| BLAKE2b-256 | f910b5bb18e75172656678e90e346b513b90e371c1ae3b4498aa32f28b41a2ab |
File details
Details for the file jp_stopword_filter-0.2.0-py3-none-any.whl.
File metadata
- Download URL: jp_stopword_filter-0.2.0-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.20.1 CPython/3.9.6 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 85a02a08d76ec10c5d90600ffbbcd6d191ff05d5cc90d4d075b99af521472ce1 |
| MD5 | 6c24cd2d559622aaac74a2a04939a13e |
| BLAKE2b-256 | f0fa265184e085b75dd945e1c76488df471eadfa8a362e0acc50e5483143ff2e |