Crow Kit: workflow and tooling utilities
CroW-Kit (Crowdsourced Wrapper Generation Framework)
CroW-Kit is a lightweight Python toolkit implementing the CroW (Crowdsourced Wrapper Generation Framework). It allows users to interactively design, store, and execute web data wrappers for both tabular and non-tabular websites.
This package provides independent wrapper generation, extraction, and maintenance functionality — ideal for researchers, developers, and data engineers working with web data.
Installation
Install directly from PyPI:
```shell
pip install crow-kit
```
Important: Dependencies & Requirements
CroW-Kit relies on Selenium and webdriver-manager to control a live browser for interactive wrapper creation.
- Python dependencies: installed automatically when you run `pip install crow-kit`.
- Browser requirement: you must have Google Chrome installed on your system.
- Permissions: the package needs write permission in its working directory to create a `crow_kit_data/wrappers/` folder for storing the JSON wrapper files.
- External files: interactive wrapper generation depends on several JavaScript and CSS files (`st.action-panel.js`, `jquery-3.7.1.min.js`, etc.). These are included with the package; ensure your environment allows them to be loaded.
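Before generating your first wrapper, you can check the permissions requirement up front. The sketch below (not part of CroW-Kit itself) creates the storage folder the package uses and confirms it is writable:

```python
import os

# Folder where CroW-Kit stores wrapper JSON files (per the requirements above)
wrapper_dir = os.path.join("crow_kit_data", "wrappers")

# Create the folder if it is missing, then confirm write access
os.makedirs(wrapper_dir, exist_ok=True)
print("Wrapper folder writable:", os.access(wrapper_dir, os.W_OK))
```

If this prints `False`, run your script from a directory where you have write permission.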
Usage Overview
- Generate a wrapper: use `setTableWrapper` or `setGeneralWrapper`. A browser window will open, letting you click on the data you want to scrape. Your selections are saved as a JSON wrapper file.
- Extract data: use `getWrapperData` to automatically fetch data using the saved wrapper. This runs headlessly and handles pagination if defined.
Core Functions
1. setTableWrapper(url, wrapper_name='no_name')
Creates a wrapper for HTML `<table>` structures.
Parameters:
- `url` (str): URL of the page containing the table
- `wrapper_name` (str, optional): prefix for the saved wrapper filename
Returns:
(success, wrapper_filename, error_code, error_type, error_message)
Example:
```python
from crow_kit import setTableWrapper

success, wrapper_file, err_code, err_type, err_msg = setTableWrapper(
    "https://example.com/table_page",
    wrapper_name="sample_table"
)

if success:
    print("Wrapper created:", wrapper_file)
else:
    print("Error:", err_type, err_msg)
```
Interactive Steps:
- Chrome opens the page URL
- Action panel prompts you to select the table to scrape
- If there’s pagination, you select the “Next Page” button
- Browser closes and JSON wrapper is saved
2. setGeneralWrapper(url, wrapper_name='no_name', repeat='no')
Creates a wrapper for general (non-tabular) content such as articles, product cards, or repeating search results.
Parameters:
- `url` (str): target webpage
- `wrapper_name` (str): name prefix for the wrapper file
- `repeat` (str): `'yes'` if content repeats across pages, `'no'` otherwise
Returns:
(success, wrapper_filename, error_code, error_type, error_message)
Example:
```python
from crow_kit import setGeneralWrapper

success, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)
```
Interactive Steps:
- Chrome opens the page
- Click each data point (e.g., title, author) and assign a name
- Select “Next Page” if applicable
- Confirm your selections; the browser closes and the JSON wrapper is saved
3. getWrapperData(wrapper_name, maximum_data_count=100, url='')
Runs a saved wrapper to extract structured data headlessly.
Parameters:
- `wrapper_name` (str): JSON wrapper filename
- `maximum_data_count` (int, optional): maximum number of rows to extract
- `url` (str, optional): override the original URL
Returns:
(success, extracted_data)
Example:
```python
from crow_kit import getWrapperData

success, data = getWrapperData(wrapper_file, maximum_data_count=50)
if success:
    for row in data:
        print(row)
```
4. listWrappers()
Lists all locally saved wrappers.
Returns:
(success, wrapper_file_list)
Example:
```python
from crow_kit import listWrappers

success, files = listWrappers()
if success:
    print("Available wrappers:", files)
```
Wrapper Storage
Wrappers are stored in:
crow_kit_data/wrappers/
Each JSON wrapper contains:
- Wrapper type (`table` or `general`)
- Target URL
- XPath selectors for data fields
- XPath for the "Next Page" button (if any)
- Repetition pattern (`repeat`)
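To make the list above concrete, a general wrapper file might look roughly like the following. Note that the key names and XPath values here are invented for illustration; the actual schema written by CroW-Kit may differ:

```json
{
  "type": "general",
  "url": "https://example.com/articles",
  "fields": {
    "Title": "//div[@class='card']/h2",
    "Author": "//div[@class='card']/span[@class='byline']"
  },
  "next_page": "//a[@rel='next']",
  "repeat": "yes"
}
```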
Example Workflow
```python
from crow_kit import setGeneralWrapper, getWrapperData, listWrappers

# Step 1: Create a general wrapper
success_create, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

if not success_create:
    print("Failed to create wrapper.")
    exit()

# Step 2: List wrappers
success_list, files = listWrappers()
if success_list:
    print("Available wrappers:", files)

# Step 3: Extract data
success_extract, extracted_data = getWrapperData(wrapper_file, maximum_data_count=100)
if success_extract:
    print(f"Extracted {len(extracted_data)} rows")
    for row in extracted_data:
        print(row)
else:
    print("Failed to extract data:", extracted_data)
```
Example Output
Tabular wrapper:

```python
[
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "28", "Chicago"]
]
```

General wrapper:

```python
[
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
    ["The Future of Data", "2025-10-19", "J. Doe"]
]
```
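Assuming the first row is a header, as in the samples above, the extracted list-of-lists is easy to post-process with the standard library. This sketch (not part of CroW-Kit) turns rows into dicts and saves everything to CSV:

```python
import csv

# Sample output in the shape shown above: header row followed by data rows
rows = [
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "28", "Chicago"],
]

# Turn each data row into a dict keyed by the header row
header, *data = rows
records = [dict(zip(header, row)) for row in data]
print(records[0])  # {'Name': 'Alice', 'Age': '30', 'City': 'New York'}

# Or persist everything to a CSV file
with open("extracted.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```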
License
MIT License