
Extract HTML from one or more URLs

Project description

A script that extracts the HTML elements matching a CSS selector from web pages.

$ pip install lurk

usage

in python

In Python, lurk returns a dictionary for each matched element:

from lurk import lurk

# Print the target of every link on the page.
for link in lurk('http://en.wikipedia.org/wiki/en', 'a'):
    # Not every <a> element carries an href attribute.
    if 'href' in link:
        print(link['href'])
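
Because each result behaves like a dictionary, post-processing is ordinary Python. The following is a minimal sketch, assuming the return value is an iterable of dict-like objects as in the loop above. `collect_hrefs` is a hypothetical helper, not part of lurk, and the sample data is made up for illustration.

```python
def collect_hrefs(elements):
    """Return the unique 'href' values from an iterable of dicts, in first-seen order."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get('href')
        if href is not None and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


# Made-up stand-in for what lurk(url, 'a') might yield (an assumption).
sample = [
    {'href': '/wiki/Main_Page'},
    {'name': 'top'},              # no href: skipped
    {'href': '/wiki/Main_Page'},  # duplicate: dropped
    {'href': '/wiki/English'},
]

print(collect_hrefs(sample))
```

With real data, the same helper could presumably be called as `collect_hrefs(lurk(url, 'a'))`.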

in bash

In bash, lurk prints JSON to standard output.

Familiarize yourself with CSS attribute selectors.
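
As a quick reminder of what the attribute selector operators mean, here is a rough sketch in plain Python. It only illustrates the matching rules defined by CSS and says nothing about how lurk evaluates selectors internally. `attr_matches` is a hypothetical name introduced for this example.

```python
def attr_matches(attrs, name, operator, value):
    """Emulate [name=value], [name^=value], [name$=value] and [name*=value]."""
    actual = attrs.get(name)
    if actual is None:
        return False  # the element does not have the attribute at all
    if operator == '=':
        return actual == value
    if value == '':
        return False  # per CSS, ^=, $= and *= never match an empty value
    if operator == '^=':
        return actual.startswith(value)  # prefix match
    if operator == '$=':
        return actual.endswith(value)    # suffix match
    if operator == '*=':
        return value in actual           # substring match
    raise ValueError('unsupported operator: %r' % operator)


link = {'href': 'Resizing-the-Data-Segment.html#index-_002asbrk'}
print(attr_matches(link, 'href', '*=', '#index-'))  # substring is present
print(attr_matches(link, 'href', '^=', '#index-'))  # but not at the start
```

The `*=` operator is the one to reach for when the interesting part of an attribute sits in the middle of its value.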

$ lurk \
http://www.gnu.org/software/libc/manual/html_node/Function-Index.html \
'a[href*="#index-"]' \
> links.json

This command saves a JSON array of links to all GNU C library functions into links.json:

[
  {
    "code": "*pthread_getspecific",
    "href": "Thread_002dspecific-Data.html#index-_002apthread_005fgetspecific"
  },

  {
    "code": "*sbrk",
    "href": "Resizing-the-Data-Segment.html#index-_002asbrk"
  },

  // ...
]
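
The href values in this output are relative to the index page. A minimal sketch for turning them into absolute URLs with the standard library follows, assuming Python 3 and entries shaped like the two shown above. `absolute_links` is a hypothetical helper, not part of lurk.

```python
import json
from urllib.parse import urljoin

# The page the links were extracted from (taken from the command above).
BASE_URL = 'http://www.gnu.org/software/libc/manual/html_node/Function-Index.html'


def absolute_links(entries, base_url=BASE_URL):
    """Map each entry's 'code' text to its href resolved against base_url."""
    return {entry['code']: urljoin(base_url, entry['href']) for entry in entries}


# Inline copy of two entries; in practice one would use
# json.load(open('links.json')) instead.
entries = json.loads('''
[
  {"code": "*pthread_getspecific",
   "href": "Thread_002dspecific-Data.html#index-_002apthread_005fgetspecific"},
  {"code": "*sbrk",
   "href": "Resizing-the-Data-Segment.html#index-_002asbrk"}
]
''')

for name, url in absolute_links(entries).items():
    print(name, url)
```

`urljoin` replaces the last path segment of the base URL, so the fragment identifiers survive unchanged.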

Download files


Files for lurk, version 0.1.2

Filename                         Size    File type  Python version
lurk-0.1.2-py2.py3-none-any.whl  4.1 kB  Wheel      2.7
lurk-0.1.2.tar.gz                2.4 kB  Source     None
