Scalable and Compliant Web Crawler
Project description
DPK Connector
DPK Connector is a scalable and compliant web crawler for acquiring data for LLM development. It is built on Scrapy. For more details, read the documentation.
Virtual Environment
The project uses pyproject.toml and a Makefile for operations.
To start developing, first create the virtual environment
make venv
and then either activate it
source venv/bin/activate
or set up your IDE to use the venv directory when developing in this project.
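A quick way to confirm that the interpreter in use actually comes from the virtual environment is sketched below. It assumes `make venv` creates a standard Python virtual environment in the venv directory, which the activation command above suggests but the Makefile itself is not shown here. The helper name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv():
    """Return True when this interpreter runs inside a virtual environment.

    In a standard venv, sys.prefix points at the environment directory,
    while sys.base_prefix points at the Python installation it was created from.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# Print which environment the current interpreter belongs to.
print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Running this with the interpreter your shell or IDE is configured to use should report a prefix ending in the venv directory if the setup worked.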
Library Artifact Build and Publish
To test, build and publish the library
make test build publish
To bump the version number, edit the Makefile to change VERSION and rerun the above. This will require committing both the Makefile and the automatically updated pyproject.toml file.
How to use
See the overview.
File details
Details for the file data_prep_connector-0.2.2.tar.gz.
File metadata
- Download URL: data_prep_connector-0.2.2.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 48b56e614721597fdc90c08850e37d63d39eb76551d40db28b32216dc5d67fbb
MD5 | 0be2ca4df8256f2650c0a84442ed1ba9
BLAKE2b-256 | 3bca96600f0a7dae543f04d11990d414e200b2cf72ad4277c230909eaf1a307a
File details
Details for the file data_prep_connector-0.2.2-py3-none-any.whl.
File metadata
- Download URL: data_prep_connector-0.2.2-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 903142ecac9894e6b404023a656153df38bd12027ed3a4ffaa3fa0875b6538a6
MD5 | 080084fb08f7195cbfe1c1065ecf7ef1
BLAKE2b-256 | d667592be7fd3791d60524a5b54f71caa7eac1e292b1327d5205431b68cdcf21
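
The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published_hash` are illustrative, and it assumes the files were saved locally under the names shown in the file details.

```python
import hashlib

# SHA256 digests copied from the file details above.
PUBLISHED_SHA256 = {
    "data_prep_connector-0.2.2.tar.gz":
        "48b56e614721597fdc90c08850e37d63d39eb76551d40db28b32216dc5d67fbb",
    "data_prep_connector-0.2.2-py3-none-any.whl":
        "903142ecac9894e6b404023a656153df38bd12027ed3a4ffaa3fa0875b6538a6",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Compare a local file's digest with the published one for `filename`."""
    return sha256_of(path) == PUBLISHED_SHA256[filename]
```

For example, `matches_published_hash("data_prep_connector-0.2.2.tar.gz", "data_prep_connector-0.2.2.tar.gz")` would return True only if the local copy is byte-for-byte identical to the published source distribution.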