A lightweight Python toolkit for wrapper generation and data extraction (CroW framework).
CroW-Kit (Crowdsourced Wrapper Generation Framework)
CroW-Kit is a lightweight Python toolkit supporting the CroW (Crowdsourced Wrapper Generation Framework). It enables users to interactively design, store, and execute web data wrappers for both tabular and non-tabular websites.
This package provides independent wrapper generation, extraction, and maintenance functionality — ideal for researchers, developers, and data engineers working with web data.
Installation
Install directly from PyPI:
pip install crow-kit
Important: Dependencies & Requirements
This package uses selenium and webdriver-manager to control a live browser.
- Browser: You must have Google Chrome or Brave Browser installed on your system.
- Permissions: The package needs write permissions in its working directory to create a wrappers_5ece4797eaf5e/ folder for storing the JSON wrapper files.
- External Files: The interactive wrapper generation depends on several JavaScript and CSS files (st.action-panel.js, jquery-3.7.1.min.js, etc.). These are included with the package, but you must ensure your environment doesn't block them from being loaded.
Usage Overview
The workflow is a simple two-step process:
1. Generate a Wrapper: Run an interactive function (setTableWrapper or setGeneralWrapper). A browser window will open, allowing you to click on the data you want to scrape. Your selections are saved as a JSON file.
2. Extract Data: Use the saved wrapper file with getWrapperData to automatically fetch the data from the site, including handling pagination.
Core Functions
1. setTableWrapper(url, wrapper_name='no_name')
Interactively creates a table-based wrapper using Selenium. This is best for data inside a <table> HTML tag.
Parameters:
- url (str): The URL of the web page containing the table.
- wrapper_name (str, optional): Prefix for the saved wrapper filename.
Returns:
- Tuple: (success, wrapper_filename, error_code, error_type, error_message)
Example:
from crow_kit import setTableWrapper

success, wrapper_file, err_code, err_type, err_msg = setTableWrapper(
    "https://example.com/table_page",
    wrapper_name="sample_table"
)

if success:
    print("Wrapper created:", wrapper_file)
else:
    print("Error:", err_type, err_msg)
What Happens Interactively:
1. A new Chrome window opens to the specified url.
2. An action panel will appear. You will be prompted to click on the table you want to scrape.
3. After selecting the table, you will be prompted to click on the "Next Page" button (if one exists).
4. Once you confirm, the browser will close, and a JSON wrapper file will be saved.
2. setGeneralWrapper(url, wrapper_name='no_name', repeat='no')
Create a wrapper for general or non-tabular content. This is best for repeating items like articles, product cards, or search results.
Parameters:
- url (str): Target webpage.
- wrapper_name (str): Name for the wrapper file.
- repeat (str): 'yes' if the content repeats across multiple pages (e.g., product listings with pagination). Use 'no' if you are only scraping data from a single page.
Returns:
- Tuple: (success, wrapper_filename, error_code, error_type, error_message)
Example:
from crow_kit import setGeneralWrapper

success, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)
What Happens Interactively:
1. A Chrome window opens.
2. Click on the first data point (e.g., an article title). A popup will ask you to give this data a name (e.g., "title").
3. Click on the next data point (e.g., the author). Give it a name (e.g., "author").
4. Continue this for all the data fields you want to extract from one of the repeating items.
5. When you are done adding fields, you will be prompted to click on the "Next Page" button (if one exists).
6. Confirm your selections. The browser closes, and the wrapper is saved.
3. getWrapperData(wrapper_name, maximum_data_count=100, url='')
Execute a previously created wrapper to extract structured data. This function runs headlessly (no browser window).
Parameters:
- wrapper_name (str): Name of the saved wrapper JSON file (e.g., article_wrapper_...json).
- maximum_data_count (int, optional): The maximum number of records to extract. This acts as a safeguard against infinite loops.
- url (str, optional): Override the original URL saved in the wrapper. This is useful for running the same wrapper on a different but structurally identical page.
Returns:
- Tuple: (success, extracted_data), where extracted_data is a list of lists containing the extracted values (including a header row).
Example:
from crow_kit import getWrapperData

# 'wrapper_file' is the filename returned from setGeneralWrapper
success, data = getWrapperData(wrapper_file, maximum_data_count=50)

if success:
    for row in data:
        print(row)
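Because the returned rows already start with a header, they can be written straight to a CSV file with the standard library. A minimal sketch (the sample rows below are illustrative, not real output from the package):

```python
import csv

# Illustrative rows in the shape getWrapperData returns: header row first.
data = [
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
]

# The header-first layout maps directly onto csv.writer.writerows.
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(data)
```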
4. listWrappers()
Lists all locally saved wrapper files in the wrappers_5ece4797eaf5e/ directory.
Returns:
- Tuple: (success, wrapper_file_list), where wrapper_file_list is a list of filenames.
Example:
from crow_kit import listWrappers
success, files = listWrappers()

if success:
    print("Available wrappers:", files)
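The returned filenames are plain strings, so they can be filtered like any list, for example to find all wrappers created under a given wrapper_name prefix. A sketch (the filenames below are made up for illustration; the actual naming scheme may differ):

```python
# Hypothetical filenames in the shape listWrappers() might return.
files = [
    "article_wrapper_1a2b3c.json",
    "sample_table_4d5e6f.json",
    "article_wrapper_7g8h9i.json",
]

# Keep only the wrappers generated with wrapper_name="article_wrapper".
article_wrappers = [f for f in files if f.startswith("article_wrapper_")]
print(article_wrappers)
```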
Wrapper Storage
All generated wrappers are stored in a local directory:
wrappers_5ece4797eaf5e/
Each wrapper file is a JSON that includes:
- Wrapper type (table or general)
- Target URL
- XPath selectors for the data fields
- XPath for the "next page" button (if any)
- Repetition pattern (repeat)
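The exact schema is internal to CroW-Kit, but a wrapper file along these lines can be inspected with the standard json module. The key names below are hypothetical, chosen only to mirror the fields listed above:

```python
import json

# Hypothetical wrapper contents mirroring the documented fields;
# the real key names used by crow-kit may differ.
example_wrapper = {
    "wrapper_type": "general",
    "url": "https://example.com/articles",
    "fields": {
        "title": "//article/h2",
        "author": "//article/span[@class='author']",
    },
    "next_page": "//a[@rel='next']",
    "repeat": "yes",
}

# Round-trip through JSON, as a saved wrapper file would be.
serialized = json.dumps(example_wrapper, indent=2)
print(serialized)
```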
Example Workflow
Here is a complete example from start to finish.
from crow_kit import setGeneralWrapper, getWrapperData, listWrappers

# --- Step 1: Create a general wrapper ---
# A browser will open. Follow the interactive steps.
print("Creating wrapper...")
success_create, wrapper_file, _, _, _ = setGeneralWrapper(
    "https://example.com/articles",
    wrapper_name="article_wrapper",
    repeat='yes'
)

if not success_create:
    print("Failed to create wrapper.")
    exit()

print(f"Wrapper '{wrapper_file}' created.")

# --- Step 2: List available wrappers ---
success_list, files = listWrappers()
if success_list:
    print("Available wrappers:", files)

# --- Step 3: Extract data using the new wrapper ---
print("Extracting data...")
success_extract, extracted_data = getWrapperData(
    wrapper_file,
    maximum_data_count=100
)

if success_extract:
    print(f"Successfully extracted {len(extracted_data)} rows.")
    for row in extracted_data:
        print(row)
else:
    print("Failed to extract data:", extracted_data)
Example Output
The data returned by getWrapperData is a list of lists. The first inner list is always the header row you defined during wrapper creation.
Tabular wrapper output:
[
["Name", "Age", "City"],
["Alice", "30", "New York"],
["Bob", "28", "Chicago"]
]
General wrapper output:
[
["Title", "Date", "Author"],
["AI and Web Wrappers", "2025-10-20", "K. Naha"],
["The Future of Data", "2025-10-19", "J. Doe"]
]
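Because the first row is always the header, the output converts naturally into a list of dictionaries for field-name access (shown here with the general-wrapper sample above):

```python
# Sample output in getWrapperData's format: header row first.
header, *rows = [
    ["Title", "Date", "Author"],
    ["AI and Web Wrappers", "2025-10-20", "K. Naha"],
    ["The Future of Data", "2025-10-19", "J. Doe"],
]

# Zip each data row with the header to build keyed records.
records = [dict(zip(header, row)) for row in rows]
print(records[0]["Author"])  # K. Naha
```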
License
This project is licensed under the MIT License.
File details
Details for the file crow_kit-0.1.0.tar.gz.
File metadata
- Download URL: crow_kit-0.1.0.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 150dc3e798ff50441a0db1bbeecfac3f03da659237a41c49fb9c0938965abcfe |
| MD5 | cd08326a579c7044f22242db9ad4a012 |
| BLAKE2b-256 | b508c4a15954ec825caf44c1169a23b7e564fa21598332b87e17f490da1abb07 |
File details
Details for the file crow_kit-0.1.0-py3-none-any.whl.
File metadata
- Download URL: crow_kit-0.1.0-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0aea496f81a576f91bcb4e37be05ccb49cf7f39271aa55400c7f4266ff87e701 |
| MD5 | 5575ca0c83ff45afee6e351660e44c49 |
| BLAKE2b-256 | 172acb4d8c2b30c29ab00ca3095007327bb6e65798093b9e5ff716a465d99b2e |