nuki
Overview - 概要
nuki (抜き) is a scraping utility library built on Patchright and selectolax.
Requirements - 必要条件
- Python 3.12 or higher
- Libraries: patchright, selectolax, pandas, camoufox (installed automatically)
- Browser binaries (must be installed separately)

To use write_parquet, pandas needs a Parquet engine: pyarrow (or fastparquet).
Installation - インストール
pip
pip install nuki
uv (recommended)
uv add nuki
Install the browser binaries separately.
Patchright(Chromium)
pip
python -m patchright install chromium
uv (recommended)
uv run patchright install chromium
Camoufox(Firefox)
pip
camoufox fetch
uv (recommended)
uv run camoufox fetch
Methods - メソッド
- patchright_page(user_data_dir) … Context manager. Opens a Page with Patchright (Chrome channel, persistent context) and yields it to the with block.
  user_data_dir is a string such as 'C:\Users\you\...\User Data' (you can find it at chrome://version/).
- camoufox_page(locale=...) … Likewise opens a Page, but with Camoufox (Firefox). Intended for sites with strict bot detection.
  Example: with camoufox_page(locale='en-US,en') as page:
  The locale defaults to 'ja-JP,ja'.
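nuki's internals aren't shown here, but p.abs_url presumably resolves an element's relative href against the current page URL. A minimal stand-in using only the standard library (the two-argument signature is an assumption for illustration; nuki's p.abs_url takes just the href and uses the current page as base):

```python
from urllib.parse import urljoin

def abs_url(base: str, href: str) -> str:
    """Resolve a possibly-relative href against a base URL.

    Hypothetical stand-in: nuki's p.abs_url likely does the
    equivalent with the current page URL as the base.
    """
    return urljoin(base, href)

print(abs_url('https://www.foobarbaz1.jp/pref/tokyo/', '../osaka/'))
# → https://www.foobarbaz1.jp/pref/osaka/
```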
Basic Usage - 基本的な使い方
from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

user_data_dir = r'C:\Users\YourName\AppData\Local\Google\Chrome\User Data'
with patchright_page(user_data_dir) as page:
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    pref_urls = [p.abs_url(e.url()) for e in p.ss('li.item > ul > li > a')]

    classroom_urls = []
    for i, url in enumerate(pref_urls, 1):
        print(f'{i}/{len(pref_urls)} pref_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        links = [p.abs_url(e.url()) for e in p.ss('.school-area h4 a')]
        classroom_urls.extend(links)

    for i, url in enumerate(classroom_urls, 1):
        print(f'{i}/{len(classroom_urls)} classroom_urls')
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        append_csv(fh('csv/out.csv'), {
            'URL': page.url,
            '教室名': p.s('h1 .text01').text_content(),
            '住所': p.s('.item .mapText').text_content(),
            '電話番号': p.s('.item .phoneNumber').text_content(),
            'HP': p.s_re('th', 'ホームページ').next().s('a').url(),
        })
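The random_sleep(1, 2) calls above insert a randomized pause between requests so the target server isn't hammered at a fixed rate. nuki's exact implementation isn't documented here; a plausible minimal equivalent, assuming it simply sleeps for a uniformly random duration (the return value is an addition for illustration):

```python
import random
import time

def random_sleep(low: float, high: float) -> float:
    """Sleep for a uniform-random number of seconds in [low, high].

    Hypothetical stand-in for nuki's random_sleep, which may differ
    in distribution and probably returns nothing.
    """
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```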
Save HTML while scraping - スクレイピングしながらHTMLを保存する
from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

with camoufox_page() as page:
    ctx = {}
    p = npage(page)
    p.goto('https://www.foobarbaz1.jp')

    ctx['アイテムURLs'] = [p.abs_url(e.url()) for e in p.ss('ul.items > li > a')]

    for i, url in enumerate(ctx['アイテムURLs'], 1):
        print(f"{i}/{len(ctx['アイテムURLs'])} アイテムURLs")
        if not p.goto(url):
            continue
        random_sleep(1, 2)
        if p.wait('#logo', timeout=10000).unwrap() is None:
            continue
        file_name = f'{hash_name(url)}.html'
        if not save_html(fh('html') / file_name, page.content()):
            continue
        append_csv(fh('outurlhtml.csv'), {
            'URL': url,
            'HTML': file_name,
        })
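hash_name(url) above derives a stable, filesystem-safe file name from a URL, so each page's HTML lands in a predictable place. The hashing scheme nuki actually uses isn't specified; a hypothetical stand-in built on SHA-256:

```python
import hashlib

def hash_name(url: str) -> str:
    """Derive a short, deterministic, filesystem-safe name from a URL.

    Hypothetical stand-in: nuki's hash_name may use a different
    algorithm or digest length.
    """
    return hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]
```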
Scrape from local HTML files - 保存済みHTMLからスクレイピングしてParquetに出力する
import pandas as pd
from nuki import *

fh = from_here(__file__)
add_log_file(fh('log/scraping.log'))

df = pd.read_csv(fh('outurlhtml.csv'))
results = []
for i, (url, path) in enumerate(zip(df['URL'], df['HTML']), 1):
    print(i)
    if not (parser := parse_html(fh('html') / path)):
        continue
    p = nparser(parser)
    results.append({
        'URL': url,
        '教室名': p.s('h1 .text02').text(),
        '住所': p.s('.item .mapText').text(),
        '所在地': p.s_re('dt', r'所在地').next('dd').text(),
    })
write_parquet(fh('outhtml.parquet'), results)
License - ライセンス
File details
Details for the file nuki-0.1.2.tar.gz.
File metadata
- Download URL: nuki-0.1.2.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.33.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e3c71af324f4e058ded0c7cddd4f63c2a1cd18c6649e2d546485127707e677b |
| MD5 | 475ab368c52e25e78580b7c768b34b49 |
| BLAKE2b-256 | f5594738252e54dcae364c54329f5ed7751683281704ecedbfad8406314fb8f9 |
File details
Details for the file nuki-0.1.2-py3-none-any.whl.
File metadata
- Download URL: nuki-0.1.2-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.33.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c11c3bc111be9159dc9c7c5976c6f27b6ab372813357fcbd04bae883c4e3eebc |
| MD5 | 239a9430700af5d6ebadc2b7c5395fa7 |
| BLAKE2b-256 | 7967b239098b4719eaf048a13f249dc88bbb58406ebd0d5a38f394a489b29eab |