autopager

Detect and classify pagination links on web pages

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent

Project description

Autopager is a Python package which detects and classifies pagination links.

License is MIT.

Installation

Install autopager with pip:

pip install autopager

Autopager depends on a few other packages, such as lxml and python-crfsuite. pip will try to install them automatically, but you may need to consult the installation docs for these packages if installation fails.

Autopager works in Python 3.6+.

Usage

The autopager.urls function returns a list of pagination URLs:

>>> import autopager
>>> import requests
>>> autopager.urls(requests.get('http://my-url.org'))
['http://my-url.org/page/1', 'http://my-url.org/page/3', 'http://my-url.org/page/4']

The autopager.select function returns all pagination <a> elements as a parsel.SelectorList (the same kind of object that the scrapy response.css / response.xpath methods return).

The autopager.extract function returns a list of (link_type, link) tuples, where link_type is one of “PAGE”, “PREV”, “NEXT” and link is a parsel.Selector instance.

These functions accept HTML page contents (as a Unicode string), a requests Response or a scrapy Response as the first argument.
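As a minimal sketch of how the (link_type, link) tuples might be consumed, the helper below groups links by type. The name group_links and the sample pairs are hypothetical and not part of autopager; plain URL strings stand in for parsel.Selector objects so that the snippet runs without autopager installed.

```python
from collections import defaultdict


def group_links(pairs):
    """Group (link_type, url) pairs into {link_type: [url, ...]}, keeping order."""
    grouped = defaultdict(list)
    for link_type, url in pairs:
        grouped[link_type].append(url)
    return dict(grouped)


# Hypothetical data shaped like autopager.extract output, with each selector
# already reduced to its href string (an assumption made for this sketch).
pairs = [
    ("PREV", "/page/1"),
    ("PAGE", "/page/1"),
    ("PAGE", "/page/3"),
    ("NEXT", "/page/3"),
]
grouped = group_links(pairs)
print(grouped.get("NEXT"))
```

With autopager installed, the pairs could presumably be built from autopager.extract(html) by reading the href attribute of each returned selector, for example with the parsel call link.xpath('@href').get() (available in recent parsel versions).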

By default, a prebuilt extraction model is used. If you want to use your own model, use the autopager.AutoPager class; it has the same methods but lets you provide a model path or the model itself:

>>> import autopager
>>> import requests
>>> html = requests.get('http://my-url.org').text  # page contents as a string
>>> pager = autopager.AutoPager('my_model.crf')
>>> pager.urls(html)

You also have to use the AutoPager class if you’ve cloned the repository from git; the prebuilt model is only available in PyPI releases.

Detection Quality

Web pages vary widely; autopager tries to work for all websites, but some errors are inevitable. As a very rough estimate, expect it to handle about 9 out of 10 paginators correctly on websites sampled from the 1M most popular international websites (according to Alexa Top).

Contributing

Source code: https://github.com/TeamHG-Memex/autopager
Issue tracker: https://github.com/TeamHG-Memex/autopager/issues

How It Works

Autopager uses machine learning to detect paginators. It classifies <a> HTML elements into 4 classes:

PREV - previous page link
PAGE - a link to a specific page
NEXT - next page link
OTHER - not a pagination link

To do that, it uses features such as link text, CSS class names, URL parts and the right/left contexts of each link. A CRF (conditional random field) model is used for learning.
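To make the idea of per-link features concrete, here is a rough sketch of a feature function. It is not autopager's actual feature set: the function name link_features, the feature names, and the reading of "context" as the text of neighbouring links are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def link_features(text, css_class, href, prev_text="", next_text=""):
    """Build an illustrative feature dict for a single <a> element."""
    parts = urlsplit(href)
    clean_text = text.strip()
    return {
        # Link text features
        "text": clean_text.lower(),
        "text_is_digit": clean_text.isdigit(),
        # CSS class names, split on whitespace
        "classes": css_class.split(),
        # URL parts: non-empty path segments and query parameter names
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "query_keys": sorted(parse_qs(parts.query)),
        # Left/right context (assumed here: text of the neighbouring links)
        "left_context": prev_text.strip().lower(),
        "right_context": next_text.strip().lower(),
    }


feats = link_features(
    "Next »", "pager next", "http://example.com/list?page=3&sort=asc", prev_text="2"
)
print(feats)
```

A sequence model such as a CRF would receive one such feature dict per link, in document order, and predict one label per link.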

A web page is represented as a sequence of <a> elements. Only <a> elements with non-empty href attributes are included in this sequence.
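The page-to-sequence step can be sketched with the standard library alone. This is only an illustration: autopager itself relies on lxml and parsel, and the class name LinkSequence and the sample markup below are hypothetical.

```python
from html.parser import HTMLParser


class LinkSequence(HTMLParser):
    """Collect <a> elements that have a non-empty href, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the <a> element currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = (attrs.get("href") or "").strip()
        if href:  # skip <a> elements with a missing or empty href
            self._current = {
                "href": href,
                "class": attrs.get("class") or "",
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


sample = (
    '<div class="pager"><a href="/page/1">Prev</a><a>2</a>'
    '<a href="">x</a><a href="/page/3" class="next">Next</a></div>'
)
parser = LinkSequence()
parser.feed(sample)
links = parser.links
print([link["href"] for link in links])
```

The two links without a usable href are dropped, so only the first and last links of the sample end up in the sequence.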

Training Data

Training data is stored in autopager/data. Raw HTML source code is in the autopager/data/html folder. Annotations are in the autopager/data/data.csv file; annotated elements are stored as CSS selectors.

Training data is annotated with 5 non-empty classes:

PREV - previous page link
PAGE - a link to a specific page
NEXT - next page link
LAST - ‘go to last page’ link which is not just a number
FIRST - ‘go to first page’ link which is not just the number ‘1’

Because LAST and FIRST are relatively rare, the pagination model converts them to PAGE. Keeping these classes in the annotations makes it possible to have the model predict them as well in the future, once there are more training examples.
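The label conversion described above can be sketched as a small mapping. The function name to_model_label is hypothetical, and treating an empty annotation as OTHER is an assumption based on the four classes listed under How It Works.

```python
def to_model_label(annotation):
    """Map an annotation class to the label the model is trained on."""
    label = annotation.strip().upper()
    if label in {"FIRST", "LAST"}:
        # Rare classes are folded into PAGE
        return "PAGE"
    # Assumption: links without an annotation are not pagination links
    return label or "OTHER"


labels = [to_model_label(a) for a in ["PREV", "FIRST", "PAGE", "LAST", "NEXT", ""]]
print(labels)
```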

To add a new page to the training data, save it to an HTML file and add a row to the data.csv file. The SelectorGadget extension (http://selectorgadget.com/) is helpful for getting CSS selectors.
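Adding a row can also be scripted. The sketch below writes to a temporary file rather than the real data.csv, and the column names in FIELDS are hypothetical: check the header row of the actual data.csv and adjust them before using anything like this.

```python
import csv
import os
import tempfile

# Hypothetical column layout; the real data.csv header may differ.
FIELDS = ["file", "encoding", "PREV", "PAGE", "NEXT", "LAST", "FIRST"]


def append_annotation(csv_path, row):
    """Append one annotation row, writing the header first if the file is new."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demo against a throwaway file, not the real training data
demo_path = os.path.join(tempfile.mkdtemp(), "data.csv")
append_annotation(
    demo_path,
    {
        "file": "example.html",
        "encoding": "utf-8",
        "PAGE": ".pager a.page",
        "NEXT": ".pager a.next",
    },
)
with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["NEXT"])
```

Classes with no matching elements on the page are left as empty cells.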

Don’t worry if your CSS selectors don’t return <a> elements directly (it is easy to accidentally select a parent or a child of an <a> element when using SelectorGadget). If a selection is not itself an <a> element, its parent <a> elements and child <a> elements are tried. This is usually what is wanted, because <a> tags are not nested on valid websites.
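The fallback described above can be sketched with xml.etree.ElementTree on a small well-formed fragment. This is not autopager's implementation: the function name resolve_to_links is hypothetical, and trying ancestors before descendants is an assumed order.

```python
import xml.etree.ElementTree as ET


def resolve_to_links(root, selected):
    """Return the <a> element(s) that a selected element most likely refers to."""
    if selected.tag == "a":
        return [selected]
    # ElementTree has no parent pointers, so build a child -> parent map
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = selected
    while node in parent_of:  # walk up, looking for an enclosing <a>
        node = parent_of[node]
        if node.tag == "a":
            return [node]
    # No enclosing <a>: fall back to <a> elements inside the selection
    return list(selected.iter("a"))


root = ET.fromstring(
    '<div><a href="/p/2"><span>2</span></a><li><a href="/p/3">3</a></li></div>'
)
from_child = resolve_to_links(root, root.find(".//span"))
from_parent = resolve_to_links(root, root.find("li"))
print(from_child[0].get("href"), from_parent[0].get("href"))
```

Selecting the inner <span> resolves to its enclosing link, and selecting the <li> wrapper resolves to the link it contains.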

When using SelectorGadget, take special care not to select anything other than pagination elements. Always check the element count displayed by SelectorGadget and compare it to the number of elements you wanted to select.

Some websites change their DOM after rendering. This rarely affects paginator elements, but it can happen. To avoid the problem, instead of downloading the HTML file with the “Save As..” browser menu option, use “Copy Outer HTML” in developer tools or render the HTML using a headless browser (e.g. Splash). If you do so, make sure to put UTF-8 as the encoding in data.csv, regardless of the page encoding defined in HTTP headers or <meta> tags.

Changes

0.3.1 (2020-09-09)

Fixing the distribution;
backports.csv is no longer required in setup.py

0.3 (2020-09-09)

Minimum Python requirement is now 3.6. Older versions may still work, but they’re no longer tested on CI.
Memory usage is limited, to avoid spikes on pathological pages.

0.2 (2016-04-26)

more training examples;
fixed Scrapy < 1.1 support;
fixed a bug in text-before and text-after features.

0.1 (2016-03-15)

Initial release

Release history

0.3.1: Sep 9, 2020 (this version)
0.3: Sep 9, 2020
0.2: Apr 25, 2016
0.1: Mar 10, 2016

Download files

Source Distribution

autopager-0.3.1.tar.gz (401.5 kB)

Uploaded Sep 9, 2020 Source

Built Distribution

autopager-0.3.1-py2.py3-none-any.whl (404.0 kB)

Uploaded Sep 9, 2020 Python 2, Python 3

File details

Details for the file autopager-0.3.1.tar.gz.

File metadata

Download URL: autopager-0.3.1.tar.gz
Upload date: Sep 9, 2020
Size: 401.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.8

File hashes

Hashes for autopager-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`3de41ba5cc88828b48695f0e7176ebf6ab09d04139d1a4339af58d8b2728fdaa`
MD5	`5bb5b4242e3ecf619e8c08f1683745c0`
BLAKE2b-256	`38407c7ccb492a103bd942f9ab4055de90468943f4b97f94b609a15184918ed8`

File details

Details for the file autopager-0.3.1-py2.py3-none-any.whl.

File metadata

Download URL: autopager-0.3.1-py2.py3-none-any.whl
Upload date: Sep 9, 2020
Size: 404.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.8

File hashes

Hashes for autopager-0.3.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`7f31c677d24dcf13e0f07b22831653d9988608a7821c22c8e8b49e876d527d51`
MD5	`34fb2b76b56714f8be95149966440e22`
BLAKE2b-256	`302bec83bbb5a88fddd81aaff110647160e7bd215bb92badcf4229fbbeb7b29a`

