
nuki

Overview

nuki is a scraping utility library built on Patchright and selectolax.

Import the DOM/parser wrappers from nuki, browser launchers from nuki.browser, and peripheral helpers (CSV, logging, etc.) from nuki.utils.

Requirements

  • Python 3.12 or higher
  • Libraries: patchright, selectolax, pandas, camoufox (installed automatically)
  • write_parquet requires pyarrow (or fastparquet) as the pandas Parquet engine.
  • Browser binaries (must be installed separately)
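If you plan to use write_parquet, you also need a Parquet engine for pandas; for example, pyarrow can be added like this:

```shell
# pip
pip install pyarrow

# uv
uv add pyarrow
```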

Installation

pip

pip install nuki

uv (recommended)

uv add nuki

Install the browser binaries separately.

Patchright(Chromium)

pip

python -m patchright install chromium

uv (recommended)

uv run patchright install chromium

Camoufox(Firefox)

pip

camoufox fetch

uv (recommended)

uv run camoufox fetch

Methods

nuki.browser

  • patchright_page(user_data_dir) … a context manager. Opens a Page with Patchright (Chrome channel, persistent context) and yields it inside the with block.
    user_data_dir is a string such as 'C:\Users\you\...\User Data' (you can find it at chrome://version/).

  • camoufox_page(locale=...) … likewise opens a Page with Camoufox (Firefox). Intended for sites with strict bot detection.
    Example: with camoufox_page(locale='en-US,en') as page:
    The locale defaults to 'ja-JP,ja'.

nuki.utils

Logging, relative paths, CSV, Parquet, HTML saving, and more (see each function's docstring / the source).
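The examples below lean on from_here; judging from its usage (fh('log/scraping.log'), fh('html') / file_name), it returns a callable that resolves paths against the calling script's directory. A minimal stdlib sketch of that behavior (an inference from usage, not nuki's actual implementation):

```python
from pathlib import Path


def from_here(file: str):
    """Return a resolver anchored at *file*'s directory.

    Sketch of what nuki's from_here appears to do, inferred from usage;
    the real implementation may differ.
    """
    base = Path(file).resolve().parent

    def fh(relative: str) -> Path:
        # fh('log/x.log') -> <script dir>/log/x.log (a Path, so `/` joins work)
        return base / relative

    return fh


fh = from_here(__file__)
print(fh('html') / 'page.html')  # a path next to this script
```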

Basic Usage

from nuki import npage
from nuki.browser import patchright_page
from nuki.utils import add_log_file, append_csv, from_here, random_sleep

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

user_data_dir = r'C:\Users\your_username\AppData\Local\Google\Chrome\User Data'
with patchright_page(user_data_dir) as page:
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    pref_urls = p.ss('li.item > ul > li > a').abs_urls()

    classroom_urls = []
    for i, url in enumerate(pref_urls, 1):
        print(f'{i}/{len(pref_urls)} pref_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        classroom_urls.extend(p.ss('.school-area h4 a').abs_urls())

    for i, url in enumerate(classroom_urls, 1):
        print(f'{i}/{len(classroom_urls)} classroom_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        append_csv(fh('csv/out.csv'), {
            'URL': page.url,
            '教室名': p.s('h1 .text01').text_content(),
            '住所': p.s('.item .mapText').text_content(),
            '電話番号': p.s('.item .phoneNumber').text_content(),
            'HP': p.ss('th').re('ホームページ').first().next().s('a').url(),
        })

Save HTML while scraping

from nuki import npage
from nuki.browser import camoufox_page
from nuki.utils import add_log_file, append_csv, from_here, hash_name, random_sleep, save_html

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

with camoufox_page() as page:
    ctx = {}
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    ctx['アイテムURLs'] = p.ss('ul.items > li > a').abs_urls()

    for i, url in enumerate(ctx['アイテムURLs'], 1):
        print(f"{i}/{len(ctx['アイテムURLs'])} アイテムURLs")
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        if p.wait('#logo', timeout=10000).unwrap() is None:
            continue
        file_name = f'{hash_name(url)}.html'
        if not save_html(fh('html') / file_name, page.content()):
            continue
        append_csv(fh('outurlhtml.csv'), {
            'URL': url,
            'HTML': file_name,
        })
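hash_name turns each URL into a stable file name so the same page always maps to the same HTML file. One common way to get that behavior (an assumed implementation; nuki's actual hash scheme may differ):

```python
import hashlib


def hash_name(url: str) -> str:
    """Derive a stable, filesystem-safe stem from a URL.

    Sketch only: nuki's real hash function may use a different
    algorithm or length.
    """
    return hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]


print(hash_name('https://example.com/page'))  # same URL -> same name
```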

Scrape from local HTML files and write to Parquet

import pandas as pd

from nuki import nparser
from nuki.utils import add_log_file, from_here, parse_html, write_parquet

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

df = pd.read_csv(fh('outurlhtml.csv'))
results = []
for i, (url, path) in enumerate(zip(df['URL'], df['HTML']), 1):
    print(i)
    if not (parser := parse_html(fh('html') / path)):
        continue
    p = nparser(parser)
    results.append({
        'URL': url,
        '教室名': p.s('h1 .text02').text(),
        '住所': p.s('.item .mapText').text(),
        '所在地': p.ss('dt').re(r'所在地').first().next('dd').text(),
    })
write_parquet(fh('outhtml.parquet'), results)

License

MIT
