Scrape links from a single web page
============
Link Grabber
============

.. image:: https://travis-ci.org/michigan-com/linkGrabber.svg?branch=master
    :target: https://travis-ci.org/michigan-com/linkGrabber

Link Grabber provides a quick and easy way to grab links from
a single web page. This Python package is a simple wrapper
around `BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/>`_ that focuses on extracting HTML's
hyperlink tag, ``<a>``.

Dependencies:

* Python 2.7, 3.3, 3.4
* BeautifulSoup
* Requests
* Six

How-To
------

.. code:: bash

    $ python setup.py install

OR

.. code:: bash

    $ pip install linkGrabber

Quickie
-------

.. code:: python

    import re
    import linkGrabber

    links = linkGrabber.Links("http://www.google.com")
    links.find()
    # limit the number of "a" tags to 5
    links.find(limit=5)
    # filter the "a" tag href attribute
    links.find(href=re.compile("plus.google.com"))

Documentation
-------------
http://linkgrabber.neurosnap.net/
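find
````

Parameters:

* filters (dict): Beautiful Soup's filters as a dictionary
* limit (int): Maximum number of links to return, taken in sequential order
* reverse (bool): Reverses the order in which the list of ``<a>`` tags is sorted
* sort (function): A function that returns the key to sort each link on,
  for example ``lambda key: key['text']``
* exclude (list): A list of dictionaries; links that match the given criteria are removed
* duplicates (bool): Set to ``False`` to remove any identical URLs
* pretty (bool): Set to ``True`` to make the output pretty
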
Find up to five links whose style contains "11px":

.. code:: python

    import re
    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(style=re.compile("11px"), limit=5)

Reverse the sort before limiting links:

.. code:: python

    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(limit=2, reverse=True)

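
The order of operations matters here: reversing before limiting returns the last links on the page, while limiting first would return the first links in reverse. The following is only a sketch of that idea on a plain list. ``reverse_then_limit`` is a hypothetical helper, not part of linkGrabber, and it assumes reversal is applied before the limit, as the description above suggests.

```python
def reverse_then_limit(items, limit=None, reverse=False):
    """Hypothetical helper: optionally reverse, then truncate to ``limit``."""
    # Copy the input so the caller's list is never modified.
    ordered = list(reversed(items)) if reverse else list(items)
    # Apply the limit only after the order has been settled.
    return ordered if limit is None else ordered[:limit]


# Stand-in data; a real page would supply the href values.
hrefs = ["/a", "/b", "/c", "/d"]
last_two = reverse_then_limit(hrefs, limit=2, reverse=True)
print(last_two)
```

With the sample list, the result holds the two final entries, last one first.
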
Sort by a link's attribute:

.. code:: python

    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(limit=3, sort=lambda key: key['text'])

Exclude text:

.. code:: python

    import re
    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(exclude=[{"text": re.compile("Read More")}])

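
To make the shape of ``exclude`` concrete, here is a minimal sketch of how a list of rule dictionaries could be applied to link dictionaries. ``is_excluded`` is a hypothetical function, and the matching rule (a link is dropped when every pattern in any one rule matches) is an assumption for illustration, not the library's confirmed behaviour.

```python
import re


def is_excluded(link, exclude):
    """Hypothetical check: True if any rule in ``exclude`` fully matches."""
    for rule in exclude:
        # An empty rule matches nothing; otherwise every key must match.
        if rule and all(
            pattern.search(str(link.get(key, "")))
            for key, pattern in rule.items()
        ):
            return True
    return False


# Stand-in link dictionaries with the 'href' and 'text' keys.
links = [
    {"href": "/story/1", "text": "Read More"},
    {"href": "/about", "text": "About us"},
]
rules = [{"text": re.compile("Read More")}]
kept = [link for link in links if not is_excluded(link, rules)]
print(kept)
```
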
Remove duplicate URLs and make the output pretty:

.. code:: python

    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(duplicates=False, pretty=True)

Link Dictionary
```````````````

All attrs from BeautifulSoup's Tag object are available in the dictionary,
as well as a few extras:

* text (the text between the ``<a>`` and ``</a>`` tags)
* seo (all text after the last "/" in the URL, parsed in an attempt to make it human readable)

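
As a rough illustration of what the ``seo`` value describes, the sketch below takes the text after the last "/" in a URL and turns it into readable words. ``seo_from_url`` is a hypothetical function written for this example; the exact rules linkGrabber applies may differ.

```python
import re


def seo_from_url(url):
    """Hypothetical approximation of the ``seo`` value described above."""
    # Take everything after the last "/" (ignoring a trailing slash).
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    # Drop a query string or fragment, then a file extension if present.
    slug = re.split(r"[?#]", slug, maxsplit=1)[0]
    slug = re.sub(r"\.[A-Za-z0-9]+$", "", slug)
    # Turn common separators into spaces.
    return re.sub(r"[-_+]+", " ", slug).strip()


# example.com is a placeholder URL used only for illustration.
print(seo_from_url("http://example.com/news/local-team-wins-title.html"))
```
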
=========
Changelog
=========
v0.3.1 (11/09/2017)
-------------------
* Bug fix: ``find`` would fail when no ``href`` property was provided (`#6 <https://github.com/michigan-com/linkGrabber/pull/6>`_, @MohamedHuzien)
v0.3.0 (7/09/2015)
------------------
* Added parser parameter to Links class
* Default parser set to lxml
v0.2.10 (7/09/2015)
-------------------
* Added six as a dependency
v0.2.9 (1/24/2014)
------------------
* Updated documentation
v0.2.8 (10/23/2014)
-------------------
* Added better documentation
v0.2.7 (06/25/2014)
-------------------
* Fixed exclude for non-iterable strings
v0.2.6 (06/25/2014)
-------------------
* Exclude parameter is now a list of dictionaries
* Added pretty property
* Added duplicates property which will remove any identical URLs
* Added more tests
* Added better docs
v0.2.5 (06/23/2014)
-------------------
* Added exclude parameter to Links.find() which removes
  links that match certain criteria
v0.2.4 (06/10/2014)
-------------------
* Updated documentation to render better on PyPI
* Removed scrape.py and moved its contents to ``__init__.py``
* Now using nose for unit testing
v0.2.3 (05/22/2014)
-------------------
* Updated the setup.py file and some verbiage
v0.2.2 (05/19/2014)
-------------------
* linkGrabber.Links.find() now responds with all Tag.attrs
  from BeautifulSoup4 as well as 'text' and 'seo' keys
v0.2.1 (05/18/2014)
-------------------
* Added more tests
v0.2.0 (05/17/2014)
-------------------
* Modified naming convention, reduced codebase, more readable structure
v0.1.9 (05/17/2014)
-------------------
* Python 3.4 compatibility
v0.1.8 (05/16/2014)
-------------------
* Changed parameter names to better reflect functionality
v0.1.7 (05/16/2014)
-------------------
* Update README
v0.1.6 (05/16/2014)
-------------------
* Update README with more examples
v0.1.5 (05/16/2014)
-------------------
* Updated find_links to accept link_reverse=(bool) and link_sort=(function)
v0.1.0 (05/16/2014)
-------------------
* Initial release.