Scalable and Compliant Web Crawler

DPK Connector

DPK Connector is a scalable and compliant web crawler developed for acquiring data for LLM development. It is built on Scrapy. For more details, read the documentation.
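To make "compliant" more concrete, here is a minimal sketch of the kind of robots.txt check a polite crawler performs before fetching a page. It uses only Python's standard library and is not DPK Connector's own API. It assumes that compliance includes honoring robots.txt. The robots.txt content, the `example-bot` user agent, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "example-bot", "https://example.com/docs/index.html"))
print(is_allowed(parser, "example-bot", "https://example.com/private/data.html"))
print(parser.crawl_delay("example-bot"))
```

A real crawler would fetch each site's robots.txt once, cache the parsed rules, and wait at least the crawl delay between requests to the same host.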

Virtual Environment

The project uses pyproject.toml and a Makefile for operations. For development, first create the virtual environment:

make venv

and then either activate it

source venv/bin/activate

or set up your IDE to use the venv directory when developing in this project.
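To confirm that the interpreter you are running (from a shell or from your IDE) is the one inside the virtual environment, a small check like the following can help. It is a sketch using only the standard library, and the function name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line reports False after activation, the shell or IDE is probably still using the system interpreter.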

Library Artifact Build and Publish

To test, build, and publish the library:

make test build publish

To bump the version number, edit the Makefile to change VERSION and rerun the command above. You will then need to commit both the Makefile and the automatically updated pyproject.toml file.

How to use

See the overview.
