PureCPP Crawl4AI integration
Project description
Crawl4AI Loader for PureCPP
This module provides a Crawl4AILoader, a data loader designed to integrate the crawl4ai web crawling library with the purecpp_extract data loading framework. It allows you to easily fetch web page content as Markdown and load it into RAGDocument objects, ready for use in Retrieval-Augmented Generation (RAG) pipelines.
✨ Features
- **Simple Web Content Extraction**: Leverages `crawl4ai` to fetch the content of a single URL.
- **Markdown Conversion**: Automatically converts the fetched HTML content into clean Markdown.
- **RAG-Ready Output**: Wraps the extracted content and metadata into `RAGDocument` objects, the standard format for the PureCPP ecosystem.
- **Asynchronous by Design**: Built with `asyncio` for efficient, non-blocking I/O operations.
- **Configurable**: Accepts a `BrowserConfig` object to customize the crawling behavior (e.g., setting user agents, handling cookies).
⚙️ Installation
Before using the loader, ensure you have Python 3.10+ and the necessary libraries installed.
- Clone the repository (if applicable) or ensure your project is set up.
- Create and activate a virtual environment:
```bash
uv venv
source .venv/bin/activate
uv sync
```
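Alternatively, since the project publishes a `purecpp_crawl4ai` distribution (see the files listed under Download files below), installing the released package with plain `pip` should also work. This is a sketch assuming the package is available on PyPI:

```bash
python -m venv .venv
source .venv/bin/activate
pip install purecpp_crawl4ai
```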
🚀 Usage
The loader is straightforward to use. Instantiate Crawl4AILoader with a target URL and a BrowserConfig, then call its load method.
For a complete, runnable script, please refer to the loader.py file in the project directory.
Basic Usage Pattern:
```python
import asyncio

from purecpp_crawl4ai.loader import Crawl4AILoader, BrowserConfig


async def run_loader():
    # 1. Configure the browser
    config = BrowserConfig()

    # 2. Instantiate the loader with a target URL
    loader = Crawl4AILoader("https://www.example.com", config)

    # 3. Load the content asynchronously
    documents = await loader.load()

    # 4. Use the resulting documents
    for doc in documents:
        print(f"Loaded content from: {doc.metadata['url']}")
        print(f"Snippet: {doc.page_content[:100]}...")


if __name__ == "__main__":
    asyncio.run(run_loader())
```
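As the features list notes, `BrowserConfig` can be used to customize crawling behavior. The sketch below assumes crawl4ai's `BrowserConfig` accepts `headless` and `user_agent` keyword arguments; check the `crawl4ai` version you have installed, as the exact parameters may differ:

```python
import asyncio

from purecpp_crawl4ai.loader import Crawl4AILoader, BrowserConfig


async def run_custom_loader():
    # Assumed crawl4ai options: run without a visible browser window
    # and present a custom user agent to the target site.
    config = BrowserConfig(
        headless=True,
        user_agent="Mozilla/5.0 (compatible; ExampleBot/0.1)",
    )
    loader = Crawl4AILoader("https://www.example.com", config)
    documents = await loader.load()
    print(f"Loaded {len(documents)} document(s)")


if __name__ == "__main__":
    asyncio.run(run_custom_loader())
```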
🧪 Testing
The project includes a test suite to ensure the loader functions correctly. The tests use unittest.mock to simulate the behavior of AsyncWebCrawler, allowing for fast and reliable testing without making actual network requests.
1. Ensure `unittest` is available (it's part of the Python standard library).
2. Navigate to the project's root directory in your terminal.
3. Run the tests using the following command:

```bash
python -m unittest test_loader.py
```
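For illustration, a test along these lines might patch `AsyncWebCrawler` where the loader imports it. This is only a sketch: the patch target, the shape of the mocked crawl result (a `markdown` attribute), and the loader's internals are assumptions here; the actual tests live in test_loader.py:

```python
import unittest
from unittest.mock import AsyncMock, MagicMock, patch

from purecpp_crawl4ai.loader import Crawl4AILoader, BrowserConfig


class TestCrawl4AILoader(unittest.IsolatedAsyncioTestCase):
    # Hypothetical patch target: assumes loader.py imports AsyncWebCrawler.
    @patch("purecpp_crawl4ai.loader.AsyncWebCrawler")
    async def test_load_returns_documents(self, mock_crawler_cls):
        # Simulate a crawl result carrying Markdown content, so no
        # real network request is made.
        mock_result = MagicMock()
        mock_result.markdown = "# Example\n\nSome page content."
        crawler = mock_crawler_cls.return_value.__aenter__.return_value
        crawler.arun = AsyncMock(return_value=mock_result)

        loader = Crawl4AILoader("https://www.example.com", BrowserConfig())
        documents = await loader.load()

        self.assertGreaterEqual(len(documents), 1)


if __name__ == "__main__":
    unittest.main()
```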
Download files
Two distributions are available:

- Source Distribution: purecpp_crawl4ai-0.1.0.tar.gz
- Built Distribution: purecpp_crawl4ai-0.1.0-py3-none-any.whl
File details

Details for the file purecpp_crawl4ai-0.1.0.tar.gz.

File metadata

- Download URL: purecpp_crawl4ai-0.1.0.tar.gz
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | af2eff1166e9591c458dd7fc8902660d23ff32de3bb88ac8384ee42099da8eaf |
| MD5 | f621199905f1e66fa30a2a7b227e3106 |
| BLAKE2b-256 | dc18aa555ad673b22559b2a39b3b2e37546f77f9a989e06ac52726ef6b51c2e3 |
File details

Details for the file purecpp_crawl4ai-0.1.0-py3-none-any.whl.

File metadata

- Download URL: purecpp_crawl4ai-0.1.0-py3-none-any.whl
- Size: 4.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 989d8aa48cd88d8ee3a6690dc950aec5cd4b15af5e8806b9b41ae8a30b995057 |
| MD5 | 9ca9f8dee85bdec95f21c3a8ae3dd6af |
| BLAKE2b-256 | 690d13bc47581257ff68fb475de65b55dda45a7232896d86600fd6793afd4832 |