Extract HTML from one or more URLs

Project description

A script which extracts HTML from web pages that match a certain CSS pattern.

$ pip install lurk

usage

in python

In Python, lurk yields one dictionary per matched element:

from lurk import lurk

# Print the href of every link on the page.
for link in lurk('http://en.wikipedia.org/wiki/en', 'a'):
    if 'href' in link:
        print(link['href'])
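
The hrefs found on a page are often relative. The following is a minimal sketch, assuming lurk's results behave like the dictionaries in the loop above, that resolves them against the page URL using only the standard library. The helper name `absolute_links` and the sample data are hypothetical and not part of lurk.

```python
from urllib.parse import urljoin


def absolute_links(base_url, elements):
    """Return absolute URLs for every element dict that has an 'href' key.

    `elements` is assumed to be a list of dicts like those lurk yields.
    """
    return [urljoin(base_url, el['href']) for el in elements if 'href' in el]


# Hypothetical sample standing in for lurk('http://en.wikipedia.org/wiki/en', 'a')
sample = [
    {'href': '/wiki/English_language'},
    {'name': 'top'},  # an anchor without href is skipped
    {'href': 'http://example.org/'},
]

for url in absolute_links('http://en.wikipedia.org/wiki/en', sample):
    print(url)
```

Keeping the resolution step separate from the scraping call means it can be checked without network access.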

in bash

In bash, lurk prints JSON to standard output.

Familiarize yourself with CSS attribute selectors.
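
As a rough illustration of the substring operator `*=` used in the selector below: `a[href*="#index-"]` matches every `a` element whose `href` attribute contains the text `#index-`. The sketch applies the same rule to plain dictionaries. The function name `href_contains` and the sample data are hypothetical, and this is not how lurk itself evaluates selectors.

```python
def href_contains(elements, needle):
    """Keep dicts whose 'href' value contains `needle`, like a[href*="needle"].

    In CSS, an attribute selector with an empty value matches nothing,
    so an empty `needle` returns an empty list here as well.
    """
    if not needle:
        return []
    return [el for el in elements if needle in el.get('href', '')]


# Hypothetical sample data
links = [
    {'href': 'Thread_002dspecific-Data.html#index-_002apthread_005fgetspecific'},
    {'href': 'index.html'},  # contains "index" but not "#index-"
    {'name': 'top'},         # no href attribute at all
]

for link in href_contains(links, '#index-'):
    print(link['href'])
```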

$ lurk \
http://www.gnu.org/software/libc/manual/html_node/Function-Index.html \
'a[href*="#index-"]' \
> links.json

This command saves a JSON array of links to all GNU C Library functions into links.json:

[
  {
    "code": "*pthread_getspecific",
    "href": "Thread_002dspecific-Data.html#index-_002apthread_005fgetspecific"
  },

  {
    "code": "*sbrk",
    "href": "Resizing-the-Data-Segment.html#index-_002asbrk"
  },

  // ...
]
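
The saved file can then be post-processed in Python. The sketch below assumes the output has the shape shown above (an array of objects with "code" and "href" keys) and builds a lookup from function name to absolute URL. The name `build_lookup`, and the choice to strip the leading `*` from names, are assumptions made for illustration.

```python
import json
from urllib.parse import urljoin

INDEX_URL = 'http://www.gnu.org/software/libc/manual/html_node/Function-Index.html'


def build_lookup(json_text, base_url=INDEX_URL):
    """Map each function name to the absolute URL of its manual entry."""
    lookup = {}
    for entry in json.loads(json_text):
        # Skip entries that lack either key rather than guessing values.
        if 'code' not in entry or 'href' not in entry:
            continue
        # Assumption: a leading '*' is index markup, not part of the name.
        name = entry['code'].lstrip('*')
        lookup[name] = urljoin(base_url, entry['href'])
    return lookup


# Two entries copied from the output above; a real run would read
# the text from links.json instead.
sample = '''[
  {"code": "*pthread_getspecific",
   "href": "Thread_002dspecific-Data.html#index-_002apthread_005fgetspecific"},
  {"code": "*sbrk",
   "href": "Resizing-the-Data-Segment.html#index-_002asbrk"}
]'''

lookup = build_lookup(sample)
print(lookup['sbrk'])
```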

Download files

Source Distribution

lurk-0.1.1.tar.gz (2.5 kB view details)

Uploaded Source

Built Distribution

lurk-0.1.1-py2.py3-none-any.whl (4.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file lurk-0.1.1.tar.gz.

File metadata

  • Download URL: lurk-0.1.1.tar.gz
  • Size: 2.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for lurk-0.1.1.tar.gz
Algorithm Hash digest
SHA256 0d3585581b9b693536561fe6be2946a663fdead12c1e87163ab2a3679dcab270
MD5 62d5035f8bed3b2389de4aa49f296896
BLAKE2b-256 81a5834664896e7e16bd7238eb22dac7e3c76d54d35bfb2d223b3aa27ec66a47

File details

Details for the file lurk-0.1.1-py2.py3-none-any.whl.

File hashes

Hashes for lurk-0.1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 14e9125bb4df9722a28cb0153f262357607dcaa41fd9ae59671e863be6ca5906
MD5 8c499d89af76fe58b9a113c06e1a0d54
BLAKE2b-256 8ca4783ee0cda75d40fca104ea8bc3840eecd4e7ac7c5505a3d3ca1b8b0b5121
