pageone
=======
A module for polling URLs and stats from homepages.
Install
pip install pageone
Test
Requires nose
nosetests
Usage
pageone does two things: it extracts article URLs from a site’s homepage, and it uses selenium and phantomjs to find the relative positions of those URLs on the page.
To get stats about the positions of links, use link_stats:
from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')
# get stats about link positions
for link in p.link_stats():
    print(link)
This will return a list of dictionaries that look like this:
{'bucket': 4,
 'datetime': datetime.datetime(2014, 6, 7, 16, 6, 3, 533818),
 'font_size': 13,
 'has_img': 1,
 'headline': u'',
 'homepage': 'http://www.propublica.org/',
 'img_area': 3969,
 'img_height': 63,
 'img_src': u'http://www.propublica.org/images/ngen/gypsy_image_medium/mpmh_victory_drive_140x140_130514_1.jpg',
 'img_width': 63,
 'url': u'http://www.propublica.org/article/protect-service-members-defense-department-plans-broad-ban-high-cost-loans',
 'x': 61,
 'x_bucket': 1,
 'y': 730,
 'y_bucket': 4}
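Because each entry is a plain dictionary, the results can be filtered and sorted with ordinary Python. The following is a minimal sketch, not part of pageone: `top_links` is a hypothetical helper, the sample records are made up and carry only a few of the keys shown above, and it assumes that a smaller `y` means the link sits closer to the top of the page.

```python
from operator import itemgetter


def top_links(stats, n=3, require_img=False):
    """Return the n entries nearest the top of the page (hypothetical helper).

    Assumes each entry has 'x', 'y' and 'has_img' keys, as in the sample
    output above, and that smaller 'y' means higher on the page.
    """
    rows = [s for s in stats if s['has_img'] or not require_img]
    # Sort top-to-bottom first, then left-to-right to break ties.
    return sorted(rows, key=itemgetter('y', 'x'))[:n]


# Illustrative records only; real ones would come from p.link_stats().
sample_stats = [
    {'url': 'http://example.com/a', 'x': 61, 'y': 730, 'has_img': 1},
    {'url': 'http://example.com/b', 'x': 400, 'y': 120, 'has_img': 0},
    {'url': 'http://example.com/c', 'x': 61, 'y': 120, 'has_img': 1},
]

for row in top_links(sample_stats, n=2):
    print(row['url'])
```

Passing `require_img=True` would restrict the ranking to links that have an image attached.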
To simply get all of the article URLs on a homepage, use articles:
from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')
for article in p.articles():
    print(article)
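Since the module is meant for polling, one common pattern is to call articles repeatedly and keep only the URLs that have not been seen before. The sketch below illustrates that idea under assumptions: `new_urls` is a hypothetical helper, not part of pageone, and plain lists stand in for the results of successive `p.articles()` calls.

```python
def new_urls(seen, current):
    """Return URLs from `current` that are not yet in `seen`.

    Order is preserved, duplicates within `current` are dropped,
    and `seen` (a set) is updated in place.
    """
    fresh = []
    for url in current:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = set()
# Stand-ins for two successive polls of the same homepage.
first_poll = ['http://example.com/a', 'http://example.com/b', 'http://example.com/a']
second_poll = ['http://example.com/b', 'http://example.com/c']

print(new_urls(seen, first_poll))
print(new_urls(seen, second_poll))
```

In real use, each poll would be something like `list(p.articles())`, with a delay between polls chosen to suit the site.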
If you also want article URLs on the homepage that point to other sites, pass incl_external=True:
from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')
for article in p.articles(incl_external=True):
    print(article)