Crow Kit: workflow and tooling utilities

Project description

CroW-Kit (Crowdsourced Wrapper Generation Framework)

CroW-Kit is a lightweight Python toolkit implementing the CroW (Crowdsourced Wrapper Generation Framework). It allows users to interactively design, store, and execute web data wrappers for both tabular and non-tabular websites.

This package provides independent wrapper generation, extraction, and maintenance functionality — ideal for researchers, developers, and data engineers working with web data.


Installation

Install directly from PyPI:

pip install crow-kit

Playwright Requirement (Important)

CroW-Kit uses Playwright to run wrappers in a headless browser environment.
After installing the package, you must install the Playwright browser binaries once:

python -m playwright install chromium

If Playwright browsers are not installed, wrapper execution may fail with errors such as:

BrowserType.launch: Executable doesn't exist

The Playwright browser binaries must also be compatible with the installed Python playwright package. If version-mismatch errors occur, upgrade the package:

pip install --upgrade playwright

Important: Dependencies & Requirements

CroW-Kit relies on Selenium and webdriver-manager to control a live browser for interactive wrapper creation.

  • Python dependencies: Installed automatically when you run:

    pip install crow-kit
    
  • Browser Requirement: You must have Google Chrome installed on your system.

  • Permissions: The package needs write permissions in its working directory to create a crow_kit_data/wrappers/ folder for storing the JSON wrapper files.

  • External Files: The interactive wrapper generation depends on several JavaScript and CSS files (st.action-panel.js, jquery-3.7.1.min.js, etc.). These are included with the package. Ensure your environment allows these files to be loaded.
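Because wrapper files are written relative to the working directory, it can help to confirm write access before starting a session. A minimal stdlib sketch (the crow_kit_data/wrappers path comes from the documentation above):

```python
import os
from pathlib import Path

# Sketch: check ahead of time that wrappers can be stored.
# The crow_kit_data/wrappers path is taken from the docs above.
wrapper_dir = Path("crow_kit_data") / "wrappers"
cwd_writable = os.access(Path.cwd(), os.W_OK)
print("working directory writable:", cwd_writable)
print("wrapper store already present:", wrapper_dir.is_dir())
```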


Usage Overview

  1. Generate a Wrapper: Use setTableWrapper or setGeneralWrapper. A browser window will open, letting you click on the data you want to scrape. Your selections are saved as a JSON wrapper file.

  2. Extract Data: Use getWrapperData to automatically fetch data using the saved wrapper. This works headlessly and handles pagination if defined.


GUI Interaction Instructions

During wrapper creation, CroW-Kit opens a browser window with a floating control panel. Data elements are mapped by selecting a field in the panel and then right-clicking the corresponding element on the page.

Visual Feedback

  • Buttons appear red when waiting for selection.
  • After a successful selection, the button turns green.
  • A status message inside the panel guides the next step.

Table Wrapper Mode

  1. Click Select Table (button is red).
  2. Move to the webpage and right-click the target table.
  3. The button turns green to confirm selection.
  4. Click Done to save the wrapper.

General (Non-Tabular) Wrapper Mode

  1. Click inside an Attribute or Value field in the panel.
  2. Move to the webpage element containing the desired data.
  3. Right-click the target element to assign it to the selected field.
  4. Use ✔ to preview sample extraction.
  5. Repeat for all attributes.
  6. Click Done to save.

Why Right-Click?

Right-click is used instead of left-click to prevent triggering the webpage’s default behavior (such as navigation links, dropdowns, or dynamic UI actions).


Core Functions

1. setTableWrapper(url, wrapper_name='no_name')

Creates a table-based wrapper for <table> HTML structures.

Parameters:

  • url (str): URL of the page containing the table
  • wrapper_name (str, optional): Prefix for the saved wrapper filename

Returns:

(success, wrapper_filename, error_code, error_type, error_message)

Example:

from crow_kit import setTableWrapper

success, wrapper_file, err_code, err_type, err_msg = setTableWrapper(
    "https://example.com/table_page",
    wrapper_name="sample_table"
)

if success:
    print("Wrapper created:", wrapper_file)
else:
    print("Error:", err_type, err_msg)

Interactive Steps:

  1. Chrome opens the page URL
  2. Action panel prompts you to select the table to scrape
  3. If there’s pagination, you select the “Next Page” button
  4. The browser closes and the JSON wrapper is saved

2. setGeneralWrapper(url, wrapper_name='no_name', repeat='no')

Creates a wrapper for general (non-tabular) content such as articles, product cards, or repeating search results.

Parameters:

  • url (str): Target webpage
  • wrapper_name (str): Name prefix for the wrapper file
  • repeat (str): 'yes' if content repeats across pages, 'no' otherwise

Returns:

(success, wrapper_filename, error_code, error_type, error_message)

Example:

from crow_kit import setGeneralWrapper

success, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

Interactive Steps:

  1. Chrome opens the page
  2. Click each data point (e.g., title, author) and assign a name
  3. Select “Next Page” if applicable
  4. Confirm; the browser closes and the JSON wrapper is saved

3. getWrapperData(wrapper_name, maximum_data_count=100, url='')

Runs a saved wrapper to extract structured data headlessly.

Parameters:

  • wrapper_name (str): JSON wrapper filename
  • maximum_data_count (int, optional): Maximum rows to extract
  • url (str, optional): Override original URL

Returns:

(success, extracted_data)

Example:

from crow_kit import getWrapperData

success, data = getWrapperData(wrapper_file, maximum_data_count=50)

if success:
    for row in data:
        print(row)

4. listWrappers()

Lists all locally saved wrappers.

Returns:

(success, wrapper_file_list)

Example:

from crow_kit import listWrappers

success, files = listWrappers()
if success:
    print("Available wrappers:", files)

Wrapper Storage

Wrappers are stored in:

crow_kit_data/wrappers/

Each JSON wrapper contains:

  • Wrapper type (table or general)
  • Target URL
  • XPath selectors for data fields
  • XPath for “Next Page” button (if any)
  • Repetition pattern (repeat)
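For orientation, here is an illustrative sketch of what a general wrapper file might contain. The key names and XPath expressions are assumptions for illustration, not the package's actual schema:

```json
{
  "type": "general",
  "url": "https://example.com/articles",
  "fields": {
    "Title": "//article/h2/a",
    "Date": "//article/time",
    "Author": "//article/span[@class='author']"
  },
  "next_page": "//a[@rel='next']",
  "repeat": "yes"
}
```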

Example Workflow

from crow_kit import setGeneralWrapper, getWrapperData, listWrappers

# Step 1: Create a general wrapper
success_create, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

if not success_create:
    print("Failed to create wrapper.")
    exit()

# Step 2: List wrappers
success_list, files = listWrappers()
if success_list:
    print("Available wrappers:", files)

# Step 3: Extract data
success_extract, extracted_data = getWrapperData(wrapper_file, maximum_data_count=100)

if success_extract:
    print(f"Extracted {len(extracted_data)} rows")
    for row in extracted_data:
        print(row)
else:
    print("Failed to extract data:", extracted_data)

Example Output

Tabular wrapper:

[
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "28", "Chicago"]
]

General wrapper:

[
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
    ["The Future of Data", "2025-10-19", "J. Doe"]
]
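Since getWrapperData returns rows as lists of strings (as in the samples above), results can be written to CSV with the standard library alone. A sketch using sample data in that shape:

```python
import csv

# Sample rows in the shape getWrapperData returns: a header row
# followed by data rows (taken from the "General wrapper" example above).
rows = [
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
    ["The Future of Data", "2025-10-19", "J. Doe"],
]

with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print("wrote", len(rows) - 1, "data rows to articles.csv")
```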

License

MIT License
