A lightweight Python toolkit for wrapper generation and data extraction (CroW framework).

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

CroW-Kit (Crowdsourced Wrapper Generation Framework)

CroW-Kit is a lightweight Python toolkit that supports CroW, the Crowdsourced Wrapper Generation Framework. It lets users interactively design, store, and execute web data wrappers for both tabular and non-tabular websites.

This package provides independent wrapper generation, extraction, and maintenance functionality, and is aimed at researchers, developers, and data engineers working with web data.

Installation

Install directly from PyPI:

pip install crow-kit

Important: Dependencies & Requirements

This package uses selenium and webdriver-manager to control a live browser.

Browser: You must have Google Chrome or Brave Browser installed on your system.
Permissions: The package needs write permissions in its working directory to create a wrappers_5ece4797eaf5e/ folder for storing the JSON wrapper files.
External Files: The interactive wrapper generation depends on several JavaScript and CSS files (st.action-panel.js, jquery-3.7.1.min.js, etc.). These are included with the package, but you must ensure your environment doesn't block them from being loaded.
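
The first two requirements can be checked before launching a wrapper session. The following is a minimal sketch, not part of crow-kit: the helper name preflight and the executable names in BROWSER_CANDIDATES are assumptions, and the names actually present on a system vary by OS and install method.

```python
import os
import shutil

# Guessed executable names (assumption); adjust for your platform.
BROWSER_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chrome",
    "brave-browser",
    "brave",
)


def preflight(workdir="."):
    """Report on the browser and write-permission requirements."""
    found = [name for name in BROWSER_CANDIDATES if shutil.which(name)]
    return {
        # The wrapper folder is created inside the working directory.
        "workdir_writable": os.access(workdir, os.W_OK),
        # An empty list only means none of the guessed names is on PATH.
        "browsers_on_path": found,
    }


report = preflight(".")
print(report)
```

An empty browsers_on_path list does not prove that no browser is installed; on Windows and macOS the browser is often not on PATH at all.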

Usage Overview

The workflow is a simple two-step process:

Generate a Wrapper: Run an interactive function (setTableWrapper or setGeneralWrapper). A browser window will open, allowing you to click on the data you want to scrape. Your selections are saved as a JSON file.
Extract Data: Use the saved wrapper file (getWrapperData) to automatically fetch the data from the site, including handling pagination.

Core Functions

1. setTableWrapper(url, wrapper_name='no_name')

Interactively creates a table-based wrapper using Selenium. This is best for data inside an HTML <table> element.

Parameters:

url (str): The URL of the web page containing the table.
wrapper_name (str, optional): Prefix for the saved wrapper filename.

Returns:

Tuple: (success, wrapper_filename, error_code, error_type, error_message)

Example:

from crow_kit import setTableWrapper

success, wrapper_file, err_code, err_type, err_msg = setTableWrapper(
    "https://example.com/table_page",
    wrapper_name="sample_table"
)

if success:
    print("Wrapper created:", wrapper_file)
else:
    print("Error:", err_type, err_msg)
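
Since both set* functions return the same five-element tuple, it can be convenient to give the fields names and turn a failure into an exception. The sketch below is one possible way to do that; WrapperResult and require_wrapper are hypothetical helpers, not part of crow-kit, and the stand-in tuples use made-up error values because the real error codes are not documented here.

```python
from typing import NamedTuple, Optional


class WrapperResult(NamedTuple):
    """Named view of the five-element tuple documented above."""
    success: bool
    wrapper_filename: Optional[str]
    error_code: Optional[object]
    error_type: Optional[str]
    error_message: Optional[str]


def require_wrapper(result):
    """Return the wrapper filename, or raise if creation failed."""
    res = WrapperResult(*result)
    if not res.success:
        raise RuntimeError(
            f"{res.error_type} ({res.error_code}): {res.error_message}"
        )
    return res.wrapper_filename


# Stand-in tuples; the error values are invented for illustration.
ok = (True, "sample_table.json", None, None, None)
failed = (False, None, 1, "ExampleError", "something went wrong")

print(require_wrapper(ok))
```

In real use, the tuple returned by setTableWrapper or setGeneralWrapper would be passed to require_wrapper in place of the stand-ins.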

What Happens Interactively:

A new Chrome window opens to the specified url.
An action panel will appear. You will be prompted to click on the table you want to scrape.
After selecting the table, you will be prompted to click on the "Next Page" button (if one exists).
Once you confirm, the browser will close, and a JSON wrapper file will be saved.

2. setGeneralWrapper(url, wrapper_name='no_name', repeat='no')

Creates a wrapper for general or non-tabular content. This is best for repeating items like articles, product cards, or search results.

Parameters:

url (str): Target webpage.
wrapper_name (str): Name for the wrapper file.
repeat (str): 'yes' if the content repeats across multiple pages (e.g., product listings with pagination). Use 'no' if you are only scraping data from a single page.

Returns:

Tuple: (success, wrapper_filename, error_code, error_type, error_message)

Example:

from crow_kit import setGeneralWrapper

success, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

What Happens Interactively:

A Chrome window opens.
Click on the first data point (e.g., an article title). A popup will ask you to give this data a name (e.g., "title").
Click on the next data point (e.g., the author). Give it a name (e.g., "author").
Continue this for all the data fields you want to extract from one of the repeating items.
When you are done adding fields, you will be prompted to click on the "Next Page" button (if one exists).
Confirm your selections. The browser closes, and the wrapper is saved.

3. getWrapperData(wrapper_name, maximum_data_count=100, url='')

Executes a previously created wrapper to extract structured data. This function runs headlessly (no browser window).

Parameters:

wrapper_name (str): Name of the saved wrapper JSON file (e.g., article_wrapper_...json).
maximum_data_count (int, optional): The maximum number of records to extract. This acts as a safeguard against infinite loops.
url (str, optional): Override the original URL saved in the wrapper. This is useful for running the same wrapper on a different but structurally identical page.

Returns:

Tuple: (success, extracted_data), where extracted_data is a list of lists containing the extracted values (including a header row).

Example:

from crow_kit import getWrapperData

# 'wrapper_file' is the filename returned from setGeneralWrapper
success, data = getWrapperData(wrapper_file, maximum_data_count=50)

if success:
    for row in data:
        print(row)
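
The url parameter makes it possible to reuse one wrapper across several structurally identical pages. The sketch below shows one way to merge the results, assuming every page yields the same header row. extract_from_urls is a hypothetical helper; it takes the extraction function as an argument, and fake_extract stands in for getWrapperData so the example runs without a browser.

```python
def extract_from_urls(extract, wrapper_name, urls, maximum_data_count=100):
    """Run one wrapper against several URLs and merge the rows.

    `extract` is any callable with the getWrapperData signature shown above.
    """
    merged = []
    for url in urls:
        success, rows = extract(
            wrapper_name, maximum_data_count=maximum_data_count, url=url
        )
        if not success or not rows:
            continue  # skip pages that failed or returned nothing
        if not merged:
            merged.append(rows[0])  # keep the header row only once
        merged.extend(rows[1:])
    return merged


# Stand-in for getWrapperData (assumption: same header on every page).
def fake_extract(wrapper_name, maximum_data_count=100, url=""):
    return True, [["Title"], [f"item from {url}"]]


rows = extract_from_urls(
    fake_extract,
    "article_wrapper.json",
    ["https://example.com/a", "https://example.com/b"],
)
print(rows)
```

With the real package, getWrapperData would be passed as the first argument instead of fake_extract.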

4. listWrappers()

Lists all locally saved wrapper files in the wrappers_5ece4797eaf5e/ directory.

Returns:

Tuple: (success, wrapper_file_list), where wrapper_file_list is a list of filenames.

Example:

from crow_kit import listWrappers

success, files = listWrappers()
if success:
    print("Available wrappers:", files)
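
Because wrapper_name is used as a filename prefix, the list returned by listWrappers can be narrowed to the wrappers created under one name. This is a small sketch with a hypothetical helper, wrappers_with_prefix; the filenames in the demo are invented, since the exact suffix format is not specified here.

```python
def wrappers_with_prefix(files, prefix):
    """Return the filenames that start with the given wrapper_name prefix."""
    return sorted(name for name in files if name.startswith(prefix))


# Hypothetical filenames; the real suffix format may differ.
files = [
    "article_wrapper_a1.json",
    "sample_table_b2.json",
    "article_wrapper_c3.json",
]
matches = wrappers_with_prefix(files, "article_wrapper")
print(matches)
```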

Wrapper Storage

All generated wrappers are stored in a local directory:

wrappers_5ece4797eaf5e/

Each wrapper file is a JSON that includes:

Wrapper type (table or general)
Target URL
XPath selectors for the data fields
XPath for the "next page" button (if any)
Repetition pattern (repeat)
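
The exact key names inside a wrapper file are not listed here, so the safest way to inspect one is to read it without assuming a schema. The sketch below reports each top-level key and the type of its value. summarize_wrapper is a hypothetical helper, and the keys in the demo file are made up purely to exercise it.

```python
import json
import tempfile
from pathlib import Path


def summarize_wrapper(path):
    """Map each top-level key of a wrapper file to the type name of its value."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


# Demo on a throwaway file; these keys are invented, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_wrapper.json"
    demo.write_text(
        json.dumps({"type": "general", "url": "https://example.com", "fields": []}),
        encoding="utf-8",
    )
    summary = summarize_wrapper(demo)

print(summary)
```

Pointing summarize_wrapper at a file inside wrappers_5ece4797eaf5e/ would show the keys the package actually writes.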

Example Workflow

Here is a complete example from start to finish.

from crow_kit import setGeneralWrapper, getWrapperData, listWrappers

# --- Step 1: Create a general wrapper ---
# A browser will open. Follow the interactive steps.
print("Creating wrapper...")
success_create, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

if not success_create:
    print("Failed to create wrapper.")
    raise SystemExit(1)  # stop with a non-zero exit status

print(f"Wrapper '{wrapper_file}' created.")

# --- Step 2: List available wrappers ---
success_list, files = listWrappers()
if success_list:
    print("Available wrappers:", files)

# --- Step 3: Extract data using the new wrapper ---
print("Extracting data...")
success_extract, extracted_data = getWrapperData(
    wrapper_file,
    maximum_data_count=100
)

if success_extract:
    print(f"Successfully extracted {len(extracted_data)} rows.")
    for row in extracted_data:
        print(row)
else:
    print("Failed to extract data:", extracted_data)
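
Once extraction succeeds, the list of lists can be written to disk directly, since the header row is already the first element. The following sketch uses only the standard library; save_rows is a hypothetical helper, and the sample rows stand in for real extracted_data.

```python
import csv
import os
import tempfile


def save_rows(rows, path):
    """Write a list of lists (header row first) to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    # Number of data rows written, header excluded.
    return len(rows) - 1 if rows else 0


# Stand-in for extracted_data.
sample = [
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "articles.csv")
    saved = save_rows(sample, out_path)
    with open(out_path, newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))

print(saved, round_trip == sample)
```

Note that len(extracted_data) in the workflow above counts the header row too, which is why save_rows subtracts one.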

Example Output

The data returned by getWrapperData is a list of lists. The first inner list is always the header row you defined during wrapper creation.

Tabular wrapper output:

[
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "28", "Chicago"]
]

General wrapper output:

[
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
    ["The Future of Data", "2025-10-19", "J. Doe"]
]
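
Because the first inner list is the header, each remaining row can be paired with it to get one dictionary per record, which is often easier to work with downstream. This is a minimal sketch; rows_to_dicts is a hypothetical helper, not a crow-kit function.

```python
def rows_to_dicts(rows):
    """Pair each data row with the header row to build one dict per record."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = rows_to_dicts([
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "28", "Chicago"],
])
print(records[0]["City"])  # New York
```

zip stops at the shorter sequence, so a row with fewer cells than the header simply yields a dict with fewer keys.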

License

This project is licensed under the MIT License.

Release history

0.3.4 (Feb 27, 2026)
0.3.3 (Feb 27, 2026)
0.3.2 (Jan 20, 2026)
0.3.1 (Jan 18, 2026)
0.3.0 (Dec 31, 2025)
0.1.0 (Oct 23, 2025, this version)
0.0.0 (Oct 23, 2025)

Download files

Source Distribution

crow_kit-0.1.0.tar.gz (11.0 kB)

Uploaded Oct 23, 2025 Source

Built Distribution

crow_kit-0.1.0-py3-none-any.whl (8.5 kB)

Uploaded Oct 23, 2025 Python 3

File details

Details for the file crow_kit-0.1.0.tar.gz.

File metadata

Download URL: crow_kit-0.1.0.tar.gz
Upload date: Oct 23, 2025
Size: 11.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for crow_kit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`150dc3e798ff50441a0db1bbeecfac3f03da659237a41c49fb9c0938965abcfe`
MD5	`cd08326a579c7044f22242db9ad4a012`
BLAKE2b-256	`b508c4a15954ec825caf44c1169a23b7e564fa21598332b87e17f490da1abb07`

File details

Details for the file crow_kit-0.1.0-py3-none-any.whl.

File metadata

Download URL: crow_kit-0.1.0-py3-none-any.whl
Upload date: Oct 23, 2025
Size: 8.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

Hashes for crow_kit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0aea496f81a576f91bcb4e37be05ccb49cf7f39271aa55400c7f4266ff87e701`
MD5	`5575ca0c83ff45afee6e351660e44c49`
BLAKE2b-256	`172acb4d8c2b30c29ab00ca3095007327bb6e65798093b9e5ff716a465d99b2e`
