A lightweight Python library designed to filter stopwords for Japanese text.

jp-stopword-filter

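jp-stopword-filter is a lightweight Python library for filtering Japanese stopwords based on customizable rules. It provides an efficient way to preprocess Japanese text for natural language processing (NLP) tasks, supporting both common stopword-removal techniques and user-defined customization.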

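Features

Preloaded stopwords: includes the Japanese stopword list from SlothLib.
Customizable rules:
- Remove tokens based on character count.
- Filter Japanese-format dates (e.g., 2024年11月).
- Exclude numbers, symbols, spaces, and emojis.
Custom wordlist: add your own stopwords to the filter.
Custom filter: define your own filtering logic with a custom_filter function.
Flexible usage: enable or disable only the rules you need at initialization.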

Installation

Install from PyPI:

pip install jp-stopword-filter

Note: the distribution name is jp-stopword-filter, but the import name is ja_stopword_filter:

from ja_stopword_filter import JaStopwordFilter
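Because the distribution name and the import name differ, it can help to confirm which one a given tool expects. The following is a minimal sketch using only the standard library: `importlib.metadata` looks packages up by distribution name (jp-stopword-filter), not by import name. The helper name `installed_version` is an illustrative assumption and is not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "jp-stopword-filter") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # Metadata lookups use the distribution name, not the import name.
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version())
```

Passing the import name ja_stopword_filter here would be expected to return None, since no distribution is published under that name.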

Development

This project uses uv for everything: environment setup, dependency management, building, and publishing.

git clone https://github.com/BrambleXu/jp-stopword-filter.git
cd jp-stopword-filter

uv sync          # create .venv and install runtime + development dependencies
uv run pytest    # run the tests
make lint        # ruff + codespell

Build and publish:

uv build         # generate the sdist + wheel in dist/
uv publish       # upload to PyPI (or leave it to trusted publishing in the Release workflow)

Usage

Basic usage (string tokens only)

This example filters tokens represented as plain strings. A custom wordlist is supplied for additional filtering.

from ja_stopword_filter import JaStopwordFilter

# Define the token list
tokens = ["２０２４年１１月", "こんにちは", "１２３", "！", "😊", "スペース", "短い", "custom"]

# Custom wordlist
custom_wordlist = ["custom", "スペース"]

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter)
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,  # convert full-width characters to half-width
    use_slothlib=True,          # use the SlothLib stopwords
    filter_length=1,            # remove tokens of length 1 or less
    use_date=True,              # remove date-format tokens
    use_numbers=True,           # remove numeric tokens
    use_symbols=True,           # remove tokens containing symbols
    use_spaces=True,            # remove whitespace-only tokens
    use_emojis=True,            # remove tokens containing emojis
    custom_wordlist=custom_wordlist,  # add custom stopwords
)

# Filter the tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '短い']

Advanced usage (using the `Token` class and `custom_filter`)

This example shows how to define custom filtering logic with the Token class and a custom_filter function. You can also extend the Token class with custom attributes and design a matching custom_filter function.

Basic `Token` class example

from ja_stopword_filter import JaStopwordFilter, Token

# Define a list of Token objects
tokens = [
    Token("２０２４年１１月", "名詞"),
    Token("こんにちは", "動詞"),
    Token("１２３", "名詞"),
    Token("短い", "形容詞"),
    Token("custom", "名詞"),
]

# Define the custom filter function
def custom_filter(token: Token) -> bool:
    # Remove the token when its part of speech (pos) is "名詞" (noun)
    return token.pos == "名詞"

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter)
stopword_filter = JaStopwordFilter(
    convert_full_to_half=True,    # convert full-width characters to half-width
    custom_filter=custom_filter,  # apply the custom filtering logic
    use_numbers=True,             # remove numeric tokens
    use_emojis=True,              # remove tokens containing emojis
)

# Filter the tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']

Extended `Token` class example

You can extend the Token class with additional attributes such as frequency, is_special, and context, and use them to build more complex filtering logic.

from ja_stopword_filter import JaStopwordFilter, Token

# Extend the Token class with custom attributes
class ExtendedToken(Token):
    def __init__(self, surface: str, pos: str, frequency: int, is_special: bool) -> None:
        super().__init__(surface, pos)
        self.frequency = frequency    # token frequency
        self.is_special = is_special  # special flag

# Define a list of ExtendedToken objects
tokens = [
    ExtendedToken("２０２４年１１月", "名詞", 10, False),
    ExtendedToken("こんにちは", "動詞", 5, True),
    ExtendedToken("１２３", "名詞", 2, False),
    ExtendedToken("短い", "形容詞", 15, False),
    ExtendedToken("custom", "名詞", 3, True),
]

# Define the custom filter function
def custom_filter(token: ExtendedToken) -> bool:
    # Remove tokens whose part of speech is "名詞" (noun) or whose frequency is below 5
    return token.pos == "名詞" or token.frequency < 5

# Initialize the filter (named stopword_filter to avoid shadowing the built-in filter)
stopword_filter = JaStopwordFilter(
    custom_filter=custom_filter,  # apply the custom filtering logic
    use_numbers=True,             # remove numeric tokens
    use_symbols=True,             # remove tokens containing symbols
)

# Filter the tokens
filtered_tokens = stopword_filter.remove(tokens)
filtered_surfaces = [t.surface for t in filtered_tokens]
print(filtered_surfaces)  # Output: ['こんにちは', '短い']
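As custom filters grow, it may be easier to build them from small predicates than to write one long boolean expression. The sketch below is one possible way to do that. To stay runnable without the library installed it uses a hypothetical stand-in class, `SketchToken`, rather than the library's Token; the helpers `any_of`, `pos_in`, and `rarer_than` are likewise illustrative names, not part of jp-stopword-filter. It follows the convention shown in the examples above: the predicate returns True for tokens that should be removed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SketchToken:
    """Hypothetical stand-in for a token with surface, pos, and frequency."""
    surface: str
    pos: str
    frequency: int


Predicate = Callable[[SketchToken], bool]


def pos_in(tags: Iterable[str]) -> Predicate:
    """Build a predicate that is True when the token's pos is one of `tags`."""
    tag_set = frozenset(tags)
    return lambda token: token.pos in tag_set


def rarer_than(min_frequency: int) -> Predicate:
    """Build a predicate that is True when frequency is below `min_frequency`."""
    return lambda token: token.frequency < min_frequency


def any_of(*predicates: Predicate) -> Predicate:
    """Combine predicates: remove the token if any single predicate says so."""
    def combined(token: SketchToken) -> bool:
        return any(predicate(token) for predicate in predicates)
    return combined


# Same rule as the example above: remove nouns or tokens with frequency < 5.
custom_filter = any_of(pos_in({"名詞"}), rarer_than(5))

tokens = [
    SketchToken("２０２４年１１月", "名詞", 10),
    SketchToken("こんにちは", "動詞", 5),
    SketchToken("１２３", "名詞", 2),
    SketchToken("短い", "形容詞", 15),
    SketchToken("custom", "名詞", 3),
]

kept = [t.surface for t in tokens if not custom_filter(t)]
print(kept)  # ['こんにちは', '短い']
```

A combined predicate of this shape should be usable as the custom_filter argument, provided the tokens passed to the filter expose the attributes the predicates read.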

Parameters

JaStopwordFilter supports the following parameters:

Parameter	Type	Default	Description
`convert_full_to_half`	`bool`	`True`	Converts full-width characters to half-width.
`use_slothlib`	`bool`	`True`	Uses the SlothLib stopword list.
`filter_length`	`int`	`0`	Removes tokens whose length is at most the given number of characters.
`use_date`	`bool`	`False`	Removes tokens that match Japanese date formats.
`use_numbers`	`bool`	`False`	Removes numeric tokens.
`use_symbols`	`bool`	`False`	Removes tokens containing symbols.
`use_spaces`	`bool`	`False`	Removes whitespace-only tokens.
`use_emojis`	`bool`	`False`	Removes tokens containing emojis.
`custom_wordlist`	`list[str]`	`None`	Adds custom stopwords.
`custom_filter`	`Callable[[Token], bool]`	`None`	Applies a custom filter function to `Token` objects.
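To make the effect of `convert_full_to_half` concrete, the sketch below shows one common way to fold full-width characters to half-width with the standard library, using Unicode NFKC normalization. This is an assumption for illustration only: the library's own conversion may be implemented differently, and NFKC also applies other compatibility mappings (for example, it turns half-width katakana into full-width katakana). The function name `to_half_width` is hypothetical.

```python
import unicodedata


def to_half_width(text: str) -> str:
    """Fold full-width ASCII variants (digits, letters, punctuation) via NFKC."""
    return unicodedata.normalize("NFKC", text)


print(to_half_width("２０２４年１１月"))  # 2024年11月
print(to_half_width("１２３"))            # 123
print(to_half_width("！"))                # !
```

Normalizing first means that later rules, such as date or number matching, only need to handle ASCII digits.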

Filtering rules

Preloaded stopwords

By default, JaStopwordFilter uses the SlothLib stopwords, a broad list of common Japanese stopwords.

You can also supply your own wordlist through the custom_wordlist parameter, which lets you tailor stopword filtering to a particular domain or task.

Example: adding a custom wordlist

from ja_stopword_filter import JaStopwordFilter

# Define the custom wordlist
custom_wordlist = ["example", "特定単語", "custom_stopword"]

# Initialize the filter with the custom wordlist
# (named stopword_filter to avoid shadowing the built-in filter)
stopword_filter = JaStopwordFilter(
    use_slothlib=True,                # include the SlothLib stopwords
    custom_wordlist=custom_wordlist,  # add user-defined stopwords
)

# Define the token list
tokens = ["こんにちは", "特定単語", "example", "custom_stopword", "一般単語"]

# Filter the tokens
filtered_tokens = stopword_filter.remove(tokens)
print(filtered_tokens)  # Output: ['こんにちは', '一般単語']

Key points:

SlothLib stopwords: included automatically with use_slothlib=True (the default).
Custom wordlist: add your own stopwords through the custom_wordlist parameter.
Combined stopwords: the SlothLib list and the custom wordlist are applied together, so stopword removal covers both the general list and your own additions.

This lets you adapt the filtering process to a specific domain or project.
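In practice a domain-specific wordlist is often kept in a text file rather than hard-coded. The following is a minimal sketch of loading such a file into the `list[str]` that custom_wordlist expects. The file format (one word per line, blank lines and lines starting with `#` ignored) and the helper name `load_wordlist` are assumptions made for this example; the library itself does not define them.

```python
import tempfile
from pathlib import Path


def load_wordlist(path: Path) -> list[str]:
    """Read one stopword per line, skipping blank lines and '#' comments."""
    words = []
    for line in path.read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and not word.startswith("#"):
            words.append(word)
    return words


# Demo with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    wordlist_path = Path(tmp) / "stopwords.txt"
    wordlist_path.write_text("# domain terms\nexample\n\n特定単語\n", encoding="utf-8")
    custom_wordlist = load_wordlist(wordlist_path)

print(custom_wordlist)  # ['example', '特定単語']
```

The resulting list could then be passed as `custom_wordlist=custom_wordlist` when constructing JaStopwordFilter.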

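Rule descriptions

Length filtering: removes tokens whose length is less than or equal to the specified value.
Date filtering: matches and removes Japanese date patterns such as:
- YYYY年MM月
- MM月DD日
- YYYY年MM月DD日
Number filtering: removes numeric tokens such as 123 or 2024.
Symbol filtering: removes punctuation and special characters.
Space filtering: removes tokens that are empty or consist only of spaces.
Emoji filtering: detects and removes tokens containing emojis.
Custom filter: applies user-defined logic to decide which tokens to remove.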
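The rules above can be approximated with a few standard-library predicates. The sketch below is not the library's implementation: the regular expression, the Unicode-category test for symbols, and the emoji code point ranges are all assumptions chosen for illustration, and the emoji ranges in particular are rough. It also assumes tokens have already been converted to half-width, so only ASCII digits are matched.

```python
import re
import unicodedata

# Assumed patterns: YYYY年MM月, YYYY年MM月DD日, and MM月DD日 (ASCII digits only).
DATE_PATTERN = re.compile(
    r"[0-9]{4}年[0-9]{1,2}月(?:[0-9]{1,2}日)?|[0-9]{1,2}月[0-9]{1,2}日"
)
# Rough emoji ranges for illustration; real emoji coverage is wider.
EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))


def is_date(token: str) -> bool:
    return DATE_PATTERN.fullmatch(token) is not None


def is_number(token: str) -> bool:
    return re.fullmatch(r"[0-9]+", token) is not None


def has_symbol(token: str) -> bool:
    # Unicode general categories starting with P (punctuation) or S (symbol).
    return any(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def is_blank(token: str) -> bool:
    # True for empty tokens and tokens made only of whitespace.
    return token.strip() == ""


def contains_emoji(token: str) -> bool:
    return any(lo <= ord(ch) <= hi for ch in token for lo, hi in EMOJI_RANGES)


def should_remove(token: str, max_length: int = 1) -> bool:
    return (
        len(token) <= max_length
        or is_date(token)
        or is_number(token)
        or has_symbol(token)
        or is_blank(token)
        or contains_emoji(token)
    )


tokens = ["2024年11月", "こんにちは", "123", "!", "😊", " ", "短い"]
kept = [t for t in tokens if not should_remove(t)]
print(kept)  # ['こんにちは', '短い']
```

Note that a category-based symbol test also flags characters such as the underscore, so a token like custom_stopword would be removed under this particular sketch.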

Contributing

If you find a bug or have a feature request, please open an Issue or submit a Pull Request!

Project details

Maintainers: BrambleXu

Release history

0.2.2 (this version): Jun 20, 2026
0.2.0: Nov 26, 2024
0.1.0: Nov 19, 2024
