
a module for polling urls and stats from homepages

Project description

pageone

Install

pip install pageone

Test

Requires nose

nosetests

Usage

pageone does two things: it extracts article urls from a site’s homepage, and it uses selenium and phantomjs to find the relative positions of those urls on the page.

To get stats about the positions of links, use link_stats:

from pageone import PageOne

p = PageOne(url='http://www.propublica.org/')

# get stats about links positions
for link in p.link_stats():
    print(link)

Each link is a dictionary that looks like this:

{
 'bucket': 4,
 'datetime': datetime.datetime(2014, 6, 7, 16, 6, 3, 533818),
 'font_size': 13,
 'has_img': 1,
 'headline': u'',
 'homepage': 'http://www.propublica.org/',
 'img_area': 3969,
 'img_height': 63,
 'img_src': u'http://www.propublica.org/images/ngen/gypsy_image_medium/mpmh_victory_drive_140x140_130514_1.jpg',
 'img_width': 63,
 'url': u'http://www.propublica.org/article/protect-service-members-defense-department-plans-broad-ban-high-cost-loans',
 'x': 61,
 'x_bucket': 1,
 'y': 730,
 'y_bucket': 4
}

Here the bucket variables represent where a link falls in a 200x200 pixel grid. x_bucket counts grid cells from left to right, y_bucket counts them from top to bottom, and bucket numbers cells from the top-left to the bottom-right. You can customize the size of this grid by passing bucket_pixels to link_stats, e.g.:

from pageone import PageOne

p = PageOne(url='http://www.propublica.org/')

# get stats about links positions
for link in p.link_stats(bucket_pixels=100):
    print(link)
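The x_bucket and y_bucket values in the sample output above are consistent with simple integer division of the pixel coordinates by the grid size. A minimal sketch of that mapping (grid_bucket is a hypothetical helper for illustration, not part of pageone's API):

```python
def grid_bucket(x, y, bucket_pixels=200):
    """Map a link's pixel position to 1-based grid buckets.

    x_bucket counts bucket_pixels-wide columns left-to-right;
    y_bucket counts bucket_pixels-tall rows top-to-bottom.
    """
    return x // bucket_pixels + 1, y // bucket_pixels + 1

# the sample link above sits at x=61, y=730
print(grid_bucket(61, 730))  # → (1, 4), matching the sample output
```

Shrinking bucket_pixels, as in the example above, simply makes the grid finer: the same link at y=730 lands in y_bucket 8 with 100-pixel buckets.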

To simply get all of the article urls on a homepage, use articles:

from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')

for article in p.articles():
  print(article)

If you also want article urls that point to other sites, use incl_external:

from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')

for article in p.articles(incl_external=True):
  print(article)

How do I know which urls are articles?

pageone uses siegfried for url parsing and validation. If you want to apply a custom regex for article url validation, you can pass in a pattern to either link_stats or articles, eg:

from pageone import PageOne
import re

pattern = re.compile(r'.*propublica.org/[a-z]+/[a-z0-9/-]+')

p = PageOne(url='http://www.propublica.org/')

for article in p.articles(pattern=pattern):
  print(article)
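As a quick sanity check, the pattern above can be exercised on sample urls without running pageone at all:

```python
import re

# the same custom validation pattern used above
pattern = re.compile(r'.*propublica.org/[a-z]+/[a-z0-9/-]+')

article = ('http://www.propublica.org/article/protect-service-members-'
           'defense-department-plans-broad-ban-high-cost-loans')

print(bool(pattern.match(article)))                     # True
print(bool(pattern.match('http://www.propublica.org/')))  # False: bare homepage
```

Note that the pattern matches any section-plus-slug path, so paths like /about/staff would also pass; tighten the regex if your site needs a stricter notion of "article."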

PhantomJS

pageone requires phantomjs to run link_stats. By default, pageone looks for the phantomjs binary at /usr/local/bin/phantomjs; to specify another path, pass phantom_path to link_stats:

from pageone import PageOne

p = PageOne(url='http://www.propublica.org/')
for link in p.link_stats(phantom_path="/usr/bin/phantomjs"):
    print link

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pageone-0.1.6.tar.gz (5.3 kB)


Built Distribution

pageone-0.1.6.macosx-10.9-intel.exe (71.5 kB)


File details

Details for the file pageone-0.1.6.tar.gz.

File metadata

  • Download URL: pageone-0.1.6.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pageone-0.1.6.tar.gz

  • SHA256: 7ad4d37b5189c8f3a9aefe028cec4d276a9497962eaad19b9d4da24fa7ec3ac8
  • MD5: 692d4e1cdf71b3f291a0766e2b3b189a
  • BLAKE2b-256: 8fdb338e1838a97bdf8ac4e946a22c6981f9dbf82d5d2476fa391428e3c21e67


File details

Details for the file pageone-0.1.6.macosx-10.9-intel.exe.

File metadata

File hashes

Hashes for pageone-0.1.6.macosx-10.9-intel.exe

  • SHA256: 2566caebfdee65b53bd23aa41c41f80d21a754fd99449f09d3917ac70df39a77
  • MD5: bcf28c7147fb701184687e4ecf4b3750
  • BLAKE2b-256: 41ff3206709aff471018242980a2d7e8eece81e49371d05ca09e35c6cd6b1454

