
A lightweight Python library designed to filter stopwords for Japanese text.

jp-stopword-filter

jp-stopword-filter is a lightweight Python library designed to filter stopwords from Japanese text based on customizable rules. It provides an efficient way to preprocess Japanese text for natural language processing (NLP) tasks, with support for common stopword removal techniques and user-defined customization.

Features

  • Preloaded Stopwords: Includes a comprehensive list of Japanese stopwords from SlothLib.
  • Customizable Rules:
    • Remove tokens based on length.
    • Filter dates in common Japanese formats (e.g., 2024年11月).
    • Exclude numbers, symbols, spaces, and emojis.
  • Custom Wordlist: Add your own stopwords to the filter.
  • Custom Filters: Define your own filtering logic with a custom_filter function.
  • Flexible Usage: Use only the rules you need by enabling or disabling them during initialization.

Installation

Install via PyPI:

pip install jp-stopword-filter

Alternatively, clone the repository and install dependencies:

git clone https://github.com/your-username/ja-stopword-filter.git
cd ja-stopword-filter
pip install -r requirements.txt

Usage

Basic Usage (String Tokens Only)

In this example, we filter tokens represented as strings. A custom wordlist is provided for additional filtering.

from ja_stopword_filter import JaStopwordFilter

# Define a token list
tokens = ["2024年11月", "こんにちは", "123", "!", "😊", "スペース", "短い", "custom"]

# Custom wordlist
custom_wordlist = ["custom", "スペース"]

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter())
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,        # Convert full-width characters to half-width
    use_slothlib=True,                # Include SlothLib stopwords
    filter_length=1,                  # Remove tokens with length <= 1
    use_date=True,                    # Remove tokens matching date patterns
    use_numbers=True,                 # Remove numeric tokens
    use_symbols=True,                 # Remove tokens containing symbols
    use_spaces=True,                  # Remove empty or whitespace-only tokens
    use_emojis=True,                  # Remove tokens containing emojis
    custom_wordlist=custom_wordlist,  # Add custom stopwords
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '短い']

Advanced Usage (Using Token Class and custom_filter)

This example demonstrates how to use the Token class and a custom_filter function to define custom filtering logic. Additionally, you can extend the Token class to include custom attributes and design corresponding custom_filter functions to suit specific use cases.

Example with Basic Token Class

from ja_stopword_filter import JaStopwordFilter, Token

# Define a list of Token objects
tokens = [
    Token("2024年11月", "名詞"),
    Token("こんにちは", "感動詞"),
    Token("123", "名詞"),
    Token("短い", "形容詞"),
    Token("custom", "名詞"),
]

# Define a custom filter function
def custom_filter(token: Token) -> bool:
    # Remove tokens where the part of speech (pos) is "名詞"
    return token.pos == "名詞"

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter())
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,    # Convert full-width characters to half-width
    custom_filter=custom_filter,  # Apply custom filtering logic
    use_numbers=True,             # Remove numeric tokens
    use_emojis=True,              # Remove tokens containing emojis
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']

Example with Extended Token Class

You can extend the Token class to include additional attributes, such as frequency, is_special, or context. These attributes allow you to design more complex filtering logic.

from ja_stopword_filter import JaStopwordFilter, Token

# Extend the Token class to include custom attributes
class ExtendedToken(Token):
    def __init__(self, surface: str, pos: str, frequency: int, is_special: bool) -> None:
        super().__init__(surface, pos)
        self.frequency = frequency  # Frequency of the token in the text
        self.is_special = is_special  # Whether the token is marked as special

# Define a list of ExtendedToken objects
tokens = [
    ExtendedToken("2024年11月", "名詞", 10, False),
    ExtendedToken("こんにちは", "感動詞", 5, True),
    ExtendedToken("123", "名詞", 2, False),
    ExtendedToken("短い", "形容詞", 15, False),
    ExtendedToken("custom", "名詞", 3, True),
]

# Define a custom filter function
def custom_filter(token: ExtendedToken) -> bool:
    # Remove tokens if they are "名詞" or their frequency is less than 5
    return token.pos == "名詞" or token.frequency < 5

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter())
stopword_filter = JaStopwordFilter(
    custom_filter=custom_filter,  # Apply custom filtering logic
    use_numbers=True,             # Remove numeric tokens
    use_symbols=True,             # Remove tokens containing symbols
)

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']

Key Points:

  • Custom Attributes: Add attributes like frequency, is_special, or others depending on your requirements.
  • Flexible Filtering: Use the custom_filter parameter to define logic based on the extended attributes.
  • Use Cases:
    • Filter tokens by their frequency in the text.
    • Exclude special or flagged tokens.
    • Combine part of speech filtering with additional attribute-based rules.

This flexibility allows jp-stopword-filter to adapt to a wide variety of text preprocessing tasks beyond stopword removal.
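The use cases above can be combined in a single custom_filter. The following is a minimal, standalone sketch: DemoToken and any_of are hypothetical stand-ins written for illustration and are not part of jp-stopword-filter. It assumes, as in the examples above, that a custom filter returns True for tokens that should be removed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DemoToken:
    """Hypothetical stand-in for an extended Token (not the library's class)."""
    surface: str
    pos: str
    frequency: int
    is_special: bool


Predicate = Callable[[DemoToken], bool]


def any_of(*predicates: Predicate) -> Predicate:
    """Build a predicate that is True when at least one given predicate is True."""
    def combined(token: DemoToken) -> bool:
        return any(p(token) for p in predicates)
    return combined


def is_noun(token: DemoToken) -> bool:
    return token.pos == "名詞"


def is_rare(token: DemoToken) -> bool:
    return token.frequency < 5


def is_flagged(token: DemoToken) -> bool:
    return token.is_special


# True means "remove this token", matching the convention used above.
custom_filter = any_of(is_noun, is_rare, is_flagged)

tokens = [
    DemoToken("2024年11月", "名詞", 10, False),
    DemoToken("こんにちは", "感動詞", 5, True),
    DemoToken("123", "名詞", 2, False),
    DemoToken("短い", "形容詞", 15, False),
]
kept = [t.surface for t in tokens if not custom_filter(t)]
print(kept)  # ['短い']
```

Keeping each rule as its own small function makes it easy to switch individual rules on or off before passing the combined predicate as custom_filter.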

Parameters

JaStopwordFilter supports the following parameters for customization:

| Parameter | Type | Default | Description |
|---|---|---|---|
| convert_full_to_half | bool | True | Convert full-width characters to half-width. |
| use_slothlib | bool | True | Use the SlothLib stopword list. |
| filter_length | int | 0 | Remove tokens with length ≤ this value (0 disables it). |
| use_date | bool | False | Remove tokens matching Japanese date patterns. |
| use_numbers | bool | False | Remove numeric tokens. |
| use_symbols | bool | False | Remove tokens containing symbols. |
| use_spaces | bool | False | Remove tokens that are empty or consist only of spaces. |
| use_emojis | bool | False | Remove tokens containing emojis. |
| custom_wordlist | list[str] | None | Add custom stopwords. |
| custom_filter | Callable[[Token], bool] | None | Apply a custom filter function for Token objects. |
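To illustrate what converting full-width characters to half-width means in practice, here is a minimal sketch using Unicode NFKC normalization from the standard library. This is an assumption about one common way to do the conversion; the library's convert_full_to_half option may be implemented differently, and to_half_width is a hypothetical helper name.

```python
import unicodedata


def to_half_width(text: str) -> str:
    """Fold full-width ASCII letters, digits, and punctuation to half-width (NFKC)."""
    return unicodedata.normalize("NFKC", text)


print(to_half_width("２０２４年１１月"))  # 2024年11月
print(to_half_width("ＡＢＣ！"))          # ABC!
```

Normalizing first means that later rules such as number or date matching see the same characters regardless of how the text was typed. Note that NFKC also applies other compatibility mappings, for example turning half-width katakana into full-width katakana.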

Filtering Rules

Preloaded Stopwords

By default, JaStopwordFilter uses stopwords from SlothLib, a comprehensive list of commonly used Japanese stopwords.

In addition to the SlothLib stopwords, users can provide their own custom wordlist through the custom_wordlist parameter. This allows for further customization of the filtering process by adding domain-specific or task-specific stopwords.

Example: Adding a Custom Wordlist

from ja_stopword_filter import JaStopwordFilter

# Define a custom wordlist
custom_wordlist = ["example", "特定単語", "custom_stopword"]

# Initialize the filter with a custom wordlist
# (named stopword_filter to avoid shadowing the built-in filter())
stopword_filter = JaStopwordFilter(
    use_slothlib=True,                # Include SlothLib stopwords
    custom_wordlist=custom_wordlist,  # Add user-defined stopwords
)

# Define a list of tokens
tokens = ["こんにちは", "特定単語", "example", "custom_stopword", "一般単語"]

# Filter tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '一般単語']

Key Points:

  • SlothLib Stopwords: Automatically included when use_slothlib=True (default setting).
  • Custom Wordlist: Use the custom_wordlist parameter to add your own stopwords.
  • Combining Stopwords: When both are enabled, the SlothLib list and the custom wordlist are applied together, so a token listed in either one is removed.

This lets you adapt the filtering process to a specific domain or project.
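Conceptually, combining a base stopword list with a custom wordlist amounts to a set union followed by a membership test. The sketch below is standalone and illustrative only: build_stopwords is a hypothetical helper, and base_list holds placeholder entries standing in for the SlothLib list rather than its actual contents.

```python
from typing import Iterable, Optional


def build_stopwords(base: Iterable[str],
                    custom: Optional[Iterable[str]] = None) -> frozenset:
    """Merge a base stopword list with optional user-defined stopwords."""
    stopwords = set(base)
    if custom:
        stopwords.update(custom)
    return frozenset(stopwords)


base_list = ["これ", "それ"]  # placeholder entries, not the real SlothLib list
stopwords = build_stopwords(base_list, ["example", "特定単語"])

tokens = ["こんにちは", "特定単語", "example", "これ", "一般単語"]
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['こんにちは', '一般単語']
```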

Rule Descriptions

  1. Length Filtering: Removes tokens with length ≤ the specified value.
  2. Date Filtering: Matches and removes Japanese date patterns such as:
    • YYYY年MM月
    • MM月DD日
    • YYYY年MM月DD日
  3. Number Filtering: Removes numeric tokens like 123 or 2024.
  4. Symbol Filtering: Removes punctuation and special characters.
  5. Space Filtering: Removes empty or whitespace-only tokens.
  6. Emoji Filtering: Detects and removes tokens containing emojis.
  7. Custom Filter: Apply custom logic to filter tokens based on user-defined rules.
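The rules above can be sketched as small predicates. This is a standalone illustration under stated assumptions, not the library's implementation: the regular expressions, the helper names, and the choice to treat a token as a symbol only when every character is punctuation or a symbol are all hypothetical.

```python
import re
import unicodedata

# Hypothetical patterns for YYYY年MM月DD日, YYYY年MM月, and MM月DD日.
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}年\d{1,2}月|\d{1,2}月\d{1,2}日")
NUMBER_RE = re.compile(r"\d+")


def is_short(token: str, max_length: int) -> bool:
    """Length rule: a max_length of 0 disables the check."""
    return max_length > 0 and len(token) <= max_length


def is_date(token: str) -> bool:
    return DATE_RE.fullmatch(token) is not None


def is_number(token: str) -> bool:
    return NUMBER_RE.fullmatch(token) is not None


def is_blank(token: str) -> bool:
    return token.strip() == ""


def is_symbol(token: str) -> bool:
    """Assumption: every character is in a Unicode punctuation (P*) or symbol (S*) category."""
    return token != "" and all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def should_remove(token: str, max_length: int = 1) -> bool:
    return (is_short(token, max_length) or is_date(token) or is_number(token)
            or is_blank(token) or is_symbol(token))


tokens = ["2024年11月", "11月5日", "こんにちは", "123", "!", " ", "短い"]
kept = [t for t in tokens if not should_remove(t)]
print(kept)  # ['こんにちは', '短い']
```

Using fullmatch rather than search keeps longer tokens that merely contain a date or a digit, which is one possible reading of the rules; a stricter or looser match may suit other tokenizers.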

Contributing

Contributions are welcome! If you find a bug or have a feature request, feel free to open an issue or submit a pull request.
