Quickly extract links from HTML

Project description

Fast Link Extractor

Project under active development

A Python 3.7+ package to extract links from a webpage. Asynchronous functions allow the code to run fast when extracting links from many sub-directories.

A use case for this tool is to extract download links for use with wget or fsspec.

Main base-level functions

  • .link_extractor(): extracts links from a given URL
  • .filter_with_regex(): filters the output with a regular expression
  • .prepend_with_baseurl(): prepends the original URL to each output link
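To illustrate what the two helper functions do, here is a minimal stand-alone sketch using only the standard library. These are hypothetical re-implementations for illustration; the package's actual signatures may differ, and the example links are made up:

```python
import re
from urllib.parse import urljoin

def filter_with_regex(links, regex):
    """Keep only the links that match the given regular expression."""
    return [link for link in links if re.search(regex, link)]

def prepend_with_baseurl(links, base_url):
    """Join each (possibly relative) link onto the base URL."""
    return [urljoin(base_url, link) for link in links]

# hypothetical extracted links (one data file, one page)
links = ["202001/example.nc", "index.html"]
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

nc_links = filter_with_regex(links, r'\.nc$')
full_links = prepend_with_baseurl(nc_links, base_url)
```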

Installation

PyPi

pip install fast-link-extractor

Example

Simply import the package and call link_extractor(). This will output a list of extracted links.

import fast_link_extractor as fle

# url to extract links from
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

# extract all links from sub directories ending with .nc
# this may take ~10 seconds, there are a lot of sub-directories
links = fle.link_extractor(base_url, 
                           search_subs=True,
                           regex='.nc$')
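For the wget use case mentioned above, the returned list can be written to a text file and handed to wget in one call. A minimal sketch, assuming links is the list returned by link_extractor() (a couple of hypothetical entries stand in for it here):

```python
# hypothetical stand-ins for the output of fle.link_extractor()
links = [
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202001/file1.nc",
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/202001/file2.nc",
]

# write one URL per line so wget can read the list
with open("urls.txt", "w") as f:
    f.write("\n".join(links))

# then, from the shell:  wget -i urls.txt
```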

ToDo

  • more tests: expand the test suite
  • documentation: set up project documentation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_link_extractor-0.1.0.tar.gz (5.2 kB)

Uploaded Source
