
Fetches files of a desired type from Common Crawl data


CmonCrawl-Fetcher

Downloads desired files from Common Crawl data by filetype

Usage

usage: cmoncrawl-fetcher.py [-h] -l <limit> -f FILETYPES [FILETYPES ...] [-p NUM_PROCS] -o OUTPUT [-t TOLERANCE]

Python package that downloads files from common crawler's database. An example usage is `cmoncrawl-fetcher.py -l 5 -f jpg png -o out_dir` This'll make it download 5 jpgs and 5 pngs into out_dir

options:
  -h, --help            show this help message and exit
  -l <limit>, --limit <limit>
                        Number of images per filetype desired
  -f FILETYPES [FILETYPES ...], --filetypes FILETYPES [FILETYPES ...]
                        Desired filetypes to fetch
  -p NUM_PROCS, --num_procs NUM_PROCS
                        Number of processes to use, default is 1
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded files
  -t TOLERANCE, --tolerance TOLERANCE
                        Number of fails for a given hostname before we ignore this host
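
The tolerance option above can be pictured as a per-host failure counter. The following is only a minimal sketch of that idea, not the package's actual code: the class name `HostFailureTracker`, its methods, and the choice to skip a host once its failure count reaches the tolerance are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostFailureTracker:
    """Counts failed downloads per hostname (hypothetical helper, not part of the package)."""

    def __init__(self, tolerance: int) -> None:
        self.tolerance = tolerance
        self.failures = Counter()

    def record_failure(self, url: str) -> None:
        # Attribute the failure to the URL's hostname, if it has one.
        host = urlsplit(url).hostname
        if host is not None:
            self.failures[host] += 1

    def should_skip(self, url: str) -> bool:
        # Assumption: a host is ignored once its failures reach the tolerance.
        host = urlsplit(url).hostname
        return host is not None and self.failures[host] >= self.tolerance
```

A caller would check `should_skip(url)` before each download and call `record_failure(url)` whenever a request fails.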

The required arguments are the desired file types, given as extensions (-f), the output directory (-o), and the number of files to fetch per extension (-l).
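
For readers who want to see how the usage line maps onto argument parsing, here is a hedged reconstruction using `argparse`. It is inferred from the help text only; the `int` types and the `None` default for `--tolerance` are assumptions, and the real script may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Reconstructed from the help output; types and the tolerance default are assumptions.
    parser = argparse.ArgumentParser(prog="cmoncrawl-fetcher.py")
    parser.add_argument("-l", "--limit", metavar="<limit>", type=int, required=True,
                        help="Number of files per filetype desired")
    parser.add_argument("-f", "--filetypes", nargs="+", required=True,
                        help="Desired filetypes to fetch")
    parser.add_argument("-p", "--num_procs", type=int, default=1,
                        help="Number of processes to use, default is 1")
    parser.add_argument("-o", "--output", required=True,
                        help="Output directory to store downloaded files")
    parser.add_argument("-t", "--tolerance", type=int, default=None,
                        help="Number of fails for a given hostname before we ignore this host")
    return parser


# Parse the example command line from the description above.
args = build_parser().parse_args(["-l", "5", "-f", "jpg", "png", "-o", "out_dir"])
```

With this sketch, `args.filetypes` holds the list of requested extensions and `args.num_procs` falls back to 1 when `-p` is omitted.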

By default, files whose Content-Type signifies the requested filetype are prioritized over files matched only by extension. The content types corresponding to each file type are stored in filetype_config.json; 69 file types are currently supported.
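
One way to picture the content-type-first rule is sketched below. This is an illustration under stated assumptions, not the package's implementation: the `CONTENT_TYPES` mapping only imitates what filetype_config.json might contain (its real layout is not shown here), and the function `match_filetype` is a hypothetical name.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit

# Hypothetical mapping in the spirit of filetype_config.json; the real file may differ.
CONTENT_TYPES = {
    "jpg": ["image/jpeg"],
    "png": ["image/png"],
    "pdf": ["application/pdf"],
}


def match_filetype(url: str, content_type: Optional[str],
                   wanted: Sequence[str]) -> Optional[str]:
    """Return the wanted filetype this record matches, or None."""
    # First preference: the Content-Type header (parameters such as charset dropped).
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        for filetype in wanted:
            if mime in CONTENT_TYPES.get(filetype, []):
                return filetype
    # Fallback: the extension at the end of the URL path.
    path = urlsplit(url).path.lower()
    for filetype in wanted:
        if path.endswith("." + filetype):
            return filetype
    return None
```

In this sketch a URL ending in `.png` that is served as `image/jpeg` counts as a jpg, because the content type is checked before the extension.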

Contributing

  • Package installation uses Poetry, but this is subject to change.

  • We use git-flow workflows for our development. Depending on the kind of feature you are contributing, create a hotfix branch or a feature branch. Installation instructions are here.

  • We also require pre-commit hooks. You can follow the instructions here to install pre-commit.

  • You will need to install the hooks in the yaml file in the repository using the following command: pre-commit install.

License

We've released this project under GPLv3. Check the LICENSE file for more details.

