A package for crawling and converting web content to Markdown
UpToDateAI
UpToDateAI is a Python package designed to fetch and provide the latest documentation about recently released programming frameworks to AI models. This package helps bridge the gap between AI model training cut-off dates and the latest developments in the programming world.
Installation
You can install UpToDateAI using pip:
pip install uptodateai
Usage
Pass the URL of the website you want to crawl to process_docs:
from uptodateai import process_docs
process_docs("https://docs.fastht.ml/")
This will crawl the specified website and save the content as Markdown files in a docs directory.
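Once a crawl has finished, the output can be inspected with the standard library alone. The sketch below assumes the files land in a `docs` directory relative to the current working directory and carry a `.md` extension; neither detail is confirmed here, and `list_markdown_files` is a hypothetical helper, not part of uptodateai.

```python
from pathlib import Path


def list_markdown_files(root="docs"):
    """Return every .md file under root, sorted, as paths relative to root.

    Hypothetical helper for inspecting crawl output; the "docs" default and
    the ".md" extension are assumptions about what process_docs writes.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been crawled yet (or the output went elsewhere).
        return []
    return sorted(p.relative_to(root_path) for p in root_path.rglob("*.md"))


if __name__ == "__main__":
    for path in list_markdown_files():
        print(path)
```

If the list comes back empty, the crawl may have written somewhere other than the assumed location.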
Features
- Web crawling using Scrapy
- Content extraction using newspaper3k
- HTML to Markdown conversion
- Automatic directory structure creation based on URL paths
- Customizable crawling settings
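To make the "directory structure based on URL paths" feature concrete, here is a minimal sketch of one way a page URL could be mapped to a Markdown file path. It is not taken from the uptodateai source: the function name `url_to_markdown_path`, the per-host subdirectory, and the `index.md` convention for directory-style URLs are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_markdown_path(url, root="docs"):
    """Map a page URL to a Markdown path under root (illustrative only).

    Assumed layout: <root>/<host>/<url path segments>.md, with directory-style
    URLs stored as index.md. The real package may use a different scheme.
    """
    parsed = urlparse(url)
    # Drop empty, "." and ".." segments so the result cannot escape root.
    parts = [seg for seg in parsed.path.split("/") if seg not in ("", ".", "..")]
    if not parts or parsed.path.endswith("/"):
        # Directory-style URL: store the page as an index file.
        parts.append("index.md")
    else:
        # Swap the final extension (for example .html) for .md.
        parts[-1] = PurePosixPath(parts[-1]).stem + ".md"
    return PurePosixPath(root, parsed.netloc, *parts)


if __name__ == "__main__":
    print(url_to_markdown_path("https://docs.fastht.ml/"))
```

Grouping output under the host name keeps crawls of different sites from overwriting each other, which is one reason to pick a layout like this.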
Development
To set up the development environment:
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
- Run tests:
python -m unittest discover tests
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
Built Distribution
Hashes for uptodateai-0.1.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | bc9cb768baf1b1fef4421fb4aba3034bf69a6977713c2e66be3183067486e76c |
MD5 | c6c3413b8f2fd44334fc76e061c52d6c |
BLAKE2b-256 | 90759419f1ef4f70cb3a474f363842871c0b3860124e4007dd87bf375a88708e |
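A downloaded wheel can be checked against the published SHA256 digest with Python's `hashlib`. This is a minimal sketch: the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path in the usage section assumes the wheel was saved to the current directory under its published name.

```python
import hashlib

# SHA256 digest published for uptodateai-0.1.0-py3-none-any.whl.
EXPECTED_SHA256 = "bc9cb768baf1b1fef4421fb4aba3034bf69a6977713c2e66be3183067486e76c"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded wheel; adjust as needed.
    wheel = "uptodateai-0.1.0-py3-none-any.whl"
    print("hash OK" if matches_published_hash(wheel) else "hash MISMATCH")
```

pip can perform the same check at install time when a requirements file pins hashes and `--require-hashes` is used.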