A powerful toolkit for web scraping and data normalization.

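WebDetaKit: Web Data Extraction & Normalization Toolkit

WebDetaKit is a Python toolkit for efficiently extracting information from websites and normalizing it into structured data (CSV, JSON, Pandas DataFrame). Its simple API covers basic web scraping needs.

Key features

HTML fetching: Retrieves the HTML content of a web page from a given URL.
Data extraction: Uses CSS selectors to extract specific text or attribute values (e.g. link URLs, image sources) from HTML.
Data normalization: Combines multiple extracted data lists and shapes them into a Pandas DataFrame.
File saving: Saves the resulting DataFrame to a file in CSV or JSON format.

Installation

WebDetaKit can be installed with pip.

pip install webdetakit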
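
To confirm that the install step worked, one option is to ask the standard library which version is installed. The sketch below uses only `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of WebDetaKit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent.

    Hypothetical helper for illustration; not part of WebDetaKit.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("webdetakit"))
```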

Usage
1. Fetching HTML content

Python
from webdetakit.core import fetch_html

url = "https://www.example.com"
html_content = fetch_html(url)

if html_content:
    print("Fetched the HTML content.")
else:
    print("Failed to fetch the HTML content.")
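
The example above treats a falsy return value as a failed fetch. As a rough illustration of that pattern, here is a minimal standard-library sketch that returns the page text on success and `None` on failure. It is not WebDetaKit's implementation: the name `fetch_html_sketch`, the `timeout` parameter, and the UTF-8 fallback are all assumptions made for this example.

```python
from typing import Optional
from urllib.request import urlopen


def fetch_html_sketch(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the response body as text, or None if the request fails.

    Hypothetical stand-in for illustration; not WebDetaKit's fetch_html.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            # Fall back to UTF-8 when no charset is declared (an assumption).
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # OSError covers URLError, HTTPError, and socket timeouts;
        # ValueError covers malformed URLs; LookupError an unknown charset.
        return None
```

Returning `None` instead of raising keeps the caller's `if html_content:` check simple, at the cost of hiding why the request failed.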

2. Extracting data (text)

Python
from webdetakit.core import extract_text

# Use HTML content as input:
# html_content is the value returned by fetch_html.
# Dummy HTML for testing works as well.
dummy_html = """
<html>
<body>
    <h1>Main Title</h1>
    <p class="summary">This is a summary paragraph.</p>
    <div>
        <span class="item-name">Item A</span>
        <span class="price">1000 yen</span>
    </div>
    <div>
        <span class="item-name">Item B</span>
        <span class="price">2000 yen</span>
    </div>
</body>
</html>
"""

# Extract the text of the h1 tag
titles = extract_text(dummy_html, "h1")
print(f"Titles: {titles}")

# Extract the text of p tags with the 'summary' class
summaries = extract_text(dummy_html, "p.summary")
print(f"Summaries: {summaries}")

# Extract the text of all span tags with the 'item-name' class
item_names = extract_text(dummy_html, "span.item-name")
print(f"Item names: {item_names}")
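
To make the idea of selector-based text extraction concrete, the following sketch does something similar with only the standard library's `html.parser`. It is a deliberately simplified stand-in, not WebDetaKit's code: it understands only `tag` and `tag.class` selectors, and the names `extract_text_sketch` and `_TextCollector` are hypothetical. Returning a list of strings is also an assumption about the shape of the result.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextCollector(HTMLParser):
    """Collect the text of elements matching a tag name and optional class."""

    def __init__(self, tag: str, css_class: Optional[str]) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.results: List[str] = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track same-named tags nested inside the current match.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (self.css_class is None or self.css_class in classes):
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_text_sketch(html: str, selector: str) -> List[str]:
    """Return the text of every element matching 'tag' or 'tag.class'."""
    tag, _, css_class = selector.partition(".")
    collector = _TextCollector(tag.lower(), css_class or None)
    collector.feed(html)
    collector.close()
    return collector.results


sample = '<div><span class="item-name">Item A</span><span class="price">1000</span></div>'
print(extract_text_sketch(sample, "span.item-name"))  # ['Item A']
```

A full CSS selector engine handles descendant combinators, IDs, attribute selectors, and more, so real scraping code normally relies on a dedicated HTML parsing library rather than a hand-rolled parser like this one.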

3. Extracting data (attribute values)

Python
from webdetakit.core import extract_attribute

# Example HTML content
dummy_html_links = """
<html>
<body>
    <a href="/page1.html">Page 1</a>
    <img src="/images/pic1.jpg" alt="Photo 1">
    <a href="https://www.google.com">Google</a>
</body>
</html>
"""

# Extract the href attribute of a tags
links = extract_attribute(dummy_html_links, "a", "href")
print(f"Link URLs: {links}")

# Extract the src attribute of img tags
image_sources = extract_attribute(dummy_html_links, "img", "src")
print(f"Image sources: {image_sources}")
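
Attribute values such as `/page1.html` are relative to the page they came from, so a common follow-up step is to resolve them against the page URL. The sketch below does this with the standard library's `urllib.parse.urljoin`; the helper name `absolutize` and the base URL are assumptions for illustration, and nothing here is part of WebDetaKit.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def absolutize(base_url: str, links: Iterable[str]) -> List[str]:
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


links = ["/page1.html", "https://www.google.com"]
print(absolutize("https://www.example.com/", links))
# ['https://www.example.com/page1.html', 'https://www.google.com']
```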

4. Normalizing and saving data

Python
from webdetakit.core import normalize_to_dataframe, save_dataframe_to_csv, save_dataframe_to_json
import pandas as pd

# Example extracted data (all lists must have the same length)
data_to_normalize = {
    "Item name": ["Item A", "Item B", "Item C"],
    "Price": ["1000 yen", "2000 yen", "3000 yen"],
    "URL": ["/a.html", "/b.html", "/c.html"]
}

# Normalize into a Pandas DataFrame
df = normalize_to_dataframe(data_to_normalize)
print("\nNormalized DataFrame:")
print(df)

# Save as a CSV file
save_dataframe_to_csv(df, "products.csv")
# Result: products.csv is created

# Save as a JSON file
save_dataframe_to_json(df, "products.json")
# Result: products.json is created
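
The example notes that all lists must have the same length. How `normalize_to_dataframe` reacts to mismatched lengths is not documented here, so one cautious option is to check the lengths before calling it. The sketch below is a standard-library illustration of that check; the helper name `columns_to_rows` is hypothetical, and it returns plain row dictionaries rather than a DataFrame.

```python
from typing import Dict, List


def columns_to_rows(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn {"col": [v1, v2]} into [{"col": v1}, {"col": v2}].

    Raises ValueError if the column lists differ in length.
    Hypothetical helper for illustration; not part of WebDetaKit.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    names = list(columns)
    # zip(*...) walks the columns in parallel, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = columns_to_rows({"name": ["A", "B"], "url": ["/a.html", "/b.html"]})
print(rows)
# [{'name': 'A', 'url': '/a.html'}, {'name': 'B', 'url': '/b.html'}]
```

Failing early with a message that lists each column's length makes it easier to see which selector returned fewer matches than expected.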

Developer information
Setting up the development environment

After cloning the project, run the following commands inside the webdetakit_project directory to install the dependencies required for development.

Bash
git clone https://github.com/Kongou173/webdetakit_project.git  # if the repository is available on GitHub
cd webdetakit_project
pip install -e ".[dev]"

Running tests

To run the tests during development, use pytest from the project root directory.

Bash
cd webdetakit_project
pytest

Contributing
Contributions to this project are welcome! Feel free to submit bug reports, feature requests, and pull requests.

License
This project is released under the MIT License. See the LICENSE file for details.

Release history

0.1.1 (this version): Jul 18, 2025
0.1.0: Jul 18, 2025

Download files

Source Distribution

webdetakit-0.1.1.tar.gz (6.1 kB)

Uploaded Jul 18, 2025 Source

Built Distribution

webdetakit-0.1.1-py3-none-any.whl (6.6 kB)

Uploaded Jul 18, 2025 Python 3

File details

Details for the file webdetakit-0.1.1.tar.gz.

File metadata

Download URL: webdetakit-0.1.1.tar.gz
Upload date: Jul 18, 2025
Size: 6.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for webdetakit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`4afa6c5b911632f3984d398cbf4a7cc9a0f097f6825ff8899d8f95d9e475b43d`
MD5	`87f5f196ddffc8c56476277e13528810`
BLAKE2b-256	`6c6e0eaa948232e6f0864fc664031d65d4ee6e3df7968d3ecc8d44d7f1318b7d`


File details

Details for the file webdetakit-0.1.1-py3-none-any.whl.

File metadata

Download URL: webdetakit-0.1.1-py3-none-any.whl
Upload date: Jul 18, 2025
Size: 6.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for webdetakit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`46310c3d88884836f543e8d86a5d9b5f2200992b0d21d965f4ce6613ddf13f96`
MD5	`49fcd37163e19175d5b105c94543c853`
BLAKE2b-256	`292527326ccd0fa7aaeb77b660b642d427a06d61282d333b9c5a63676482b69f`

