A web content extractor and archiver for Simplenote
Project description
Web Content Extractor and Archiver
This project is a web content extractor and archiver that fetches content from specified Simplenote URLs only, processes it, and saves it as Markdown files organized by date. Start the script with a Simplenote URL in your clipboard, and it will fetch the content from that URL and save it as a Markdown file in the output directory.
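The URL check and date-organized saving described above could look roughly like the sketch below. It is an illustration only, not the project's actual code: the helper names `is_simplenote_url` and `save_markdown`, the `simplenote.com` host rule, and the `YYYYMMDD` folder layout are all assumptions.

```python
from datetime import date
from pathlib import Path
from urllib.parse import urlparse


def is_simplenote_url(url: str) -> bool:
    """Return True if the URL's host is simplenote.com or a subdomain (assumed rule)."""
    host = urlparse(url.strip()).hostname or ""
    return host == "simplenote.com" or host.endswith(".simplenote.com")


def save_markdown(text: str, base_path: Path, name: str, day: date) -> Path:
    """Write text to <base_path>/<YYYYMMDD>/<name>.md and return the file path."""
    # Assumed layout: one subfolder per day under the output directory.
    target_dir = base_path / day.strftime("%Y%m%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.md"
    target.write_text(text, encoding="utf-8")
    return target
```

A caller would first reject any clipboard text for which `is_simplenote_url` returns False, then pass the processed Markdown to `save_markdown` together with `date.today()`.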
Project Structure
web-content-extractor
├── src
│ ├── simex.py # Main script to orchestrate fetching and processing
│ ├── fetcher.py # Functions for fetching content from URLs
│ ├── processor.py # Functions for processing fetched HTML content
│ ├── archiver.py # Manages saving processed content as Markdown files
│ ├── imp_clip.py # Functions for updating sources from clipboard
│ └── utils
│ └── __init__.py # Utility functions shared across modules
├── requirements.txt # Project dependencies
├── config.yaml # Configuration settings for sources and paths
├── output # Directory for saved Markdown files
└── README.md # Project documentation
Installation
- Clone the repository:
  git clone <repository-url>
  cd web-content-extractor
- Install the required dependencies:
  pip install -r requirements.txt
Usage
- Configure the config.yaml file to specify the URLs and output filenames.
- Run the main script:
  python src/simex.py
Customization
- Modify the SOURCES dictionary in config.yaml to add or change the URLs you want to fetch content from.
- Adjust the BASE_PATH in config.yaml to change where the Markdown files are saved.
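To make the two settings concrete, here is a minimal sketch of how a parsed configuration might be turned into output paths. The layout of config.yaml is not documented here, so the dictionary shape (output filename mapped to URL), the example URLs, and the helper `plan_outputs` are hypothetical; any date-based subfolder is left out for brevity.

```python
from pathlib import Path

# Hypothetical parsed form of config.yaml (for example, what a YAML loader
# might return). The real key layout may differ; this shape is an assumption.
config = {
    "BASE_PATH": "./output",
    "SOURCES": {
        "meeting-notes": "https://app.simplenote.com/p/EXAMPLE1",
        "reading-list": "https://app.simplenote.com/p/EXAMPLE2",
    },
}


def plan_outputs(cfg: dict) -> dict[str, Path]:
    """Map each source URL to the Markdown file it would be saved as."""
    base = Path(cfg["BASE_PATH"])
    return {url: base / f"{name}.md" for name, url in cfg["SOURCES"].items()}
```

Under these assumptions, adding an entry to `SOURCES` adds one more output file, and changing `BASE_PATH` moves every output file at once.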
Contributing
Feel free to submit issues or pull requests for improvements or additional features.
File details
Details for the file simexp-0.1.8.tar.gz.
File metadata
- Download URL: simexp-0.1.8.tar.gz
- Upload date:
- Size: 5.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d93265e578baf904da177a297b754cfc7c967d83a868951d8fc3246d88d6b491 |
| MD5 | 2c5fd665249ba27251644fc70744dee4 |
| BLAKE2b-256 | 957d0f17d25e49e61a1d14389338cdd816f1c5715fc4718333ae42b187f701b3 |
File details
Details for the file simexp-0.1.8-py3-none-any.whl.
File metadata
- Download URL: simexp-0.1.8-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa68ced9622252c210dba2e913fd2a149991cb8ba9e57ffed79f14c629e2d921 |
| MD5 | 035eaae24168e03e8d0a6841ed330bc6 |
| BLAKE2b-256 | 0a0a592fa808262d0c0134fd4483dd05710f9a63bd4b82a50c0822ea64122307 |
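The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of` and `matches_published` are illustrative and not part of the package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, assuming the wheel was downloaded to the current directory, `matches_published(Path("simexp-0.1.8-py3-none-any.whl"), "fa68ced9622252c210dba2e913fd2a149991cb8ba9e57ffed79f14c629e2d921")` should return True for an unmodified file.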