# nuki
## Overview

nuki (抜き) is a scraping utility library built on Patchright and selectolax. Import the DOM/parser wrappers from `nuki`, the browser launchers from `nuki.browser`, and peripheral helpers (CSV, logging, etc.) from `nuki.utils`.
## Requirements

- Python 3.12 or higher
- Libraries: patchright, selectolax, pandas, camoufox (installed automatically)
- Browser binaries (must be installed separately)

To use `write_parquet`, pandas also needs a Parquet engine such as pyarrow (or fastparquet).
## Installation

pip:

```shell
pip install nuki
```

uv (recommended):

```shell
uv add nuki
```

Then install the browser binaries separately.

Patchright (Chromium):

```shell
# pip
python -m patchright install chromium

# uv (recommended)
uv run patchright install chromium
```

Camoufox (Firefox):

```shell
# pip
camoufox fetch

# uv (recommended)
uv run camoufox fetch
```
## Methods

### nuki.browser

- `patchright_page(user_data_dir)` … Context manager. Opens a Page with Patchright (Chrome channel, persistent context) and yields it to the `with` block. `user_data_dir` is a string such as `'C:\Users\you\...\User Data'` (you can check it at chrome://version/).
- `camoufox_page(locale=...)` … Likewise opens a Page, but with Camoufox (Firefox). Intended for sites with strict bot detection. Example: `with camoufox_page(locale='en-US,en') as page:`. The locale defaults to `'ja-JP,ja'`.
### nuki.utils

Logging, relative-path, CSV, Parquet, and HTML-saving helpers (see each function's docstring / the source).
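As a way to picture the relative-path helper, here is a minimal stdlib-only sketch of a `from_here`-style function. This is an assumption about its behavior inferred from the usage examples below (`fh = from_here(__file__)`, then `fh('csv/out.csv')`), not nuki's actual implementation:

```python
from pathlib import Path


def from_here(file: str):
    """Return a helper that resolves paths relative to the given script's folder.

    Rough stand-in for nuki.utils.from_here; the real function may behave
    differently (e.g. create directories or return str instead of Path).
    """
    base = Path(file).resolve().parent

    def fh(relative: str) -> Path:
        # Join the script's directory with a script-relative path.
        return base / relative

    return fh


# In a real script you would pass __file__; a literal path is used here
# purely for illustration.
fh = from_here('/path/to/scrape.py')
print(fh('log/scraping.log'))  # a path under /path/to/log/
```

This keeps output paths stable no matter which working directory the script is launched from.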
## Basic Usage

```python
from nuki import npage
from nuki.browser import patchright_page
from nuki.utils import add_log_file, append_csv, from_here, random_sleep

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

user_data_dir = r'C:\Users\your_name\AppData\Local\Google\Chrome\User Data'
with patchright_page(user_data_dir) as page:
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')
    pref_urls = p.ss('li.item > ul > li > a').abs_urls()

    classroom_urls = []
    for i, url in enumerate(pref_urls, 1):
        print(f'{i}/{len(pref_urls)} pref_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        classroom_urls.extend(p.ss('.school-area h4 a').abs_urls())

    for i, url in enumerate(classroom_urls, 1):
        print(f'{i}/{len(classroom_urls)} classroom_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        append_csv(fh('csv/out.csv'), {
            'URL': page.url,
            '教室名': p.s('h1 .text01').text_content(),
            '住所': p.s('.item .mapText').text_content(),
            '電話番号': p.s('.item .phoneNumber').text_content(),
            'HP': p.ss('th').re('ホームページ').first().next().s('a').url(),
        })
```
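`append_csv` above writes one row per scraped page. Its observable behavior can be sketched with the stdlib `csv` module; the semantics (write the header only when the file does not exist yet, then append) are an assumption based on how the example uses it, not nuki's actual code:

```python
import csv
from pathlib import Path


def append_csv(path, row: dict) -> None:
    """Append one dict as a CSV row, emitting the header only on first write.

    Sketch of the assumed semantics of nuki.utils.append_csv.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    new_file = not path.exists()
    with path.open('a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

Appending row by row (rather than collecting everything and writing once) means partial results survive if the scrape is interrupted mid-run.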
## Save HTML while scraping

```python
from nuki import npage
from nuki.browser import camoufox_page
from nuki.utils import add_log_file, append_csv, from_here, hash_name, random_sleep, save_html

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

with camoufox_page() as page:
    ctx = {}
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')
    ctx['アイテムURLs'] = p.ss('ul.items > li > a').abs_urls()

    for i, url in enumerate(ctx['アイテムURLs'], 1):
        print(f"{i}/{len(ctx['アイテムURLs'])} アイテムURLs")
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        if p.wait('#logo', timeout=10000).unwrap() is None:
            continue
        file_name = f'{hash_name(url)}.html'
        if not save_html(fh('html') / file_name, page.content()):
            continue
        append_csv(fh('outurlhtml.csv'), {
            'URL': url,
            'HTML': file_name,
        })
```
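`hash_name(url)` above derives a stable file name from a URL, so each page maps to the same file on every run. A plausible stand-in using `hashlib` (the real function's algorithm and output length are assumptions):

```python
import hashlib


def hash_name(url: str) -> str:
    """Derive a stable, filesystem-safe name from a URL.

    Stand-in for nuki.utils.hash_name; the actual algorithm may differ.
    """
    return hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]


file_name = f'{hash_name("https://www.foobarbaz1.jp/item/1")}.html'
```

Hashing sidesteps characters that are illegal in file names (`/`, `?`, `:`) and keeps names a fixed length regardless of URL length.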
## Scrape from local HTML files and write Parquet

```python
import pandas as pd

from nuki import nparser
from nuki.utils import add_log_file, from_here, parse_html, write_parquet

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

df = pd.read_csv(fh('outurlhtml.csv'))
results = []
for i, (url, path) in enumerate(zip(df['URL'], df['HTML']), 1):
    print(i)
    if not (parser := parse_html(fh('html') / path)):
        continue
    p = nparser(parser)
    results.append({
        'URL': url,
        '教室名': p.s('h1 .text02').text(),
        '住所': p.s('.item .mapText').text(),
        '所在地': p.ss('dt').re(r'所在地').first().next('dd').text(),
    })

write_parquet(fh('outhtml.parquet'), results)
```
## License