
nuki

Overview

nuki is a scraping utility library built on Patchright and selectolax.

Requirements

  • Python 3.12 or higher
  • Libraries: patchright, selectolax, pandas, camoufox (installed automatically)
  • To use write_parquet, pandas needs a Parquet engine: pyarrow (or fastparquet).
  • Browser binaries (must be installed separately)
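If you are unsure whether a Parquet engine is available in your environment, a quick stdlib check (a sketch; the two engine names are simply those pandas supports) is:

```python
import importlib.util

# Find the first available pandas Parquet engine, if any.
engine = next(
    (name for name in ("pyarrow", "fastparquet") if importlib.util.find_spec(name)),
    None,
)
print(engine)  # e.g. 'pyarrow', or None if neither is installed
```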

Installation

pip

pip install nuki

uv (recommended)

uv add nuki

The browser binaries must be installed separately:

Patchright(Chromium)

pip

python -m patchright install chromium

uv (recommended)

uv run patchright install chromium

Camoufox(Firefox)

pip

camoufox fetch

uv (recommended)

uv run camoufox fetch

Methods

  • patchright_page(user_data_dir) … context manager. Opens a Page with Patchright (Chrome channel, persistent context) and yields it inside the with block.
    user_data_dir is a string such as 'C:\Users\you\...\User Data' (you can find yours at chrome://version/).

  • camoufox_page(locale=...) … likewise opens a Page with Camoufox (Firefox). Intended for sites with strict bot detection.
    Example: with camoufox_page(locale='en-US,en') as page:
    The locale defaults to 'ja-JP,ja'.
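The usage examples below wrap each Page in npage and call p.abs_url(...) to turn scraped relative hrefs into absolute URLs. Assuming abs_url resolves against the current page URL (the library does not document its internals here), the underlying behavior is what urllib.parse.urljoin does:

```python
from urllib.parse import urljoin

# Resolving relative hrefs against the current page URL
# (presumably what abs_url does, with the page as the base).
base = "https://www.foobarbaz1.jp/pref/tokyo/"
print(urljoin(base, "../osaka/"))    # https://www.foobarbaz1.jp/pref/osaka/
print(urljoin(base, "/school/123"))  # https://www.foobarbaz1.jp/school/123
```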

Basic Usage

from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

user_data_dir = r'C:\Users\your-username\AppData\Local\Google\Chrome\User Data'
with patchright_page(user_data_dir) as page:
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    pref_urls = [p.abs_url(e.url()) for e in p.ss('li.item > ul > li > a')]

    classroom_urls = []
    for i, url in enumerate(pref_urls, 1):
        print(f'{i}/{len(pref_urls)} pref_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        links = [p.abs_url(e.url()) for e in p.ss('.school-area h4 a')]
        classroom_urls.extend(links)

    for i, url in enumerate(classroom_urls, 1):
        print(f'{i}/{len(classroom_urls)} classroom_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        append_csv(fh('csv/out.csv'), {
            'URL': page.url,
            '教室名': p.s('h1 .text01').text_content(),
            '住所': p.s('.item .mapText').text_content(),
            '電話番号': p.s('.item .phoneNumber').text_content(),
            'HP': p.s_re('th', 'ホームページ').next().s('a').url(),
        })
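append_csv above writes one row per call, which keeps partial results on disk even if the run is interrupted. The helper's exact behavior is not documented here, but the pattern it suggests can be sketched with only the stdlib csv module (the function name below is hypothetical):

```python
import csv
from pathlib import Path

def append_csv_sketch(path, row: dict) -> None:
    """Append one dict as a CSV row, writing the header only on first use."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Because each call opens and closes the file, a crash mid-run loses at most the row being written.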

Save HTML while scraping

from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

with camoufox_page() as page:
    ctx = {}
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    ctx['アイテムURLs'] = [p.abs_url(e.url()) for e in p.ss('ul.items > li > a')]

    for i, url in enumerate(ctx['アイテムURLs'], 1):
        print(f"{i}/{len(ctx['アイテムURLs'])} アイテムURLs")
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        if p.wait('#logo', timeout=10000).unwrap() is None:
            continue
        file_name = f'{hash_name(url)}.html'
        if not save_html(fh('html') / file_name, page.content()):
            continue
        append_csv(fh('outurlhtml.csv'), {
            'URL': url,
            'HTML': file_name,
        })
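hash_name(url) above turns a URL into a stable file name, so the CSV can map URLs to saved HTML files. Assuming it hashes the URL string (the actual algorithm is not documented here), the idea looks like this hypothetical sketch:

```python
import hashlib

def hash_name_sketch(url: str) -> str:
    """Deterministic, filesystem-safe name derived from a URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

file_name = f"{hash_name_sketch('https://www.foobarbaz1.jp/item/1')}.html"
```

The same URL always maps to the same name, so re-running the scraper overwrites the saved page instead of duplicating it.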

Scrape from saved HTML files and write to Parquet

import pandas as pd

from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

df = pd.read_csv(fh('outurlhtml.csv'))
results = []
for i, (url, path) in enumerate(zip(df['URL'], df['HTML']), 1):
    print(i)
    if not (parser := parse_html(fh('html') / path)):
        continue
    p = nparser(parser)
    results.append({
        'URL': url,
        '教室名': p.s('h1 .text02').text(),
        '住所': p.s('.item .mapText').text(),
        '所在地': p.s_re('dt', r'所在地').next('dd').text(),
    })
write_parquet(fh('outhtml.parquet'), results)
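write_parquet takes a path and a list of dicts. Assuming it wraps pandas (which the Requirements section implies), the equivalent with plain pandas is roughly:

```python
import pandas as pd

# Rows shaped like the `results` records above (hypothetical values).
results = [
    {"URL": "https://example.com/a", "教室名": "A", "住所": "X"},
    {"URL": "https://example.com/b", "教室名": "B", "住所": "Y"},
]

df = pd.DataFrame(results)
# df.to_parquet("outhtml.parquet")  # needs pyarrow or fastparquet installed
```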

License

MIT
