Scrape links from a single web page

============
Link Grabber
============
Link Grabber provides a quick and easy way to grab links from
a single web page. This Python package is a simple wrapper
around `BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/>`_, focusing on grabbing HTML's
hyperlink tag, ``<a>``.

Dependencies:

* BeautifulSoup
* Requests

How-To
======

.. code:: bash

    $ python setup.py install

OR

.. code:: bash

    $ pip install linkGrabber

Quickie
=======

.. code:: python

    import re
    import linkGrabber

    links = linkGrabber.Links("http://www.google.com")
    links.find()
    # limit the number of "a" tags to 5
    links.find(limit=5)
    # filter the "a" tag href attribute
    links.find(href=re.compile("plus.google.com"))
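
One detail worth noting about a regular-expression ``href`` filter: an unescaped ``.`` matches any character, so a pattern such as ``plus.google.com`` can also match look-alike hosts. The sketch below uses only the standard ``re`` module, and the ``urls`` list is made up for illustration. Whether the stricter pattern is needed depends on the pages being scraped.

```python
import re

# Hypothetical hrefs, for illustration only.
urls = [
    "https://plus.google.com/about",
    "https://plusXgoogleYcom.example/",
]

loose = re.compile("plus.google.com")              # "." matches any character
strict = re.compile(re.escape("plus.google.com"))  # dots are matched literally

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print(len(loose_hits), len(strict_hits))
```

A compiled pattern built with ``re.escape`` can be passed as the ``href`` filter in the same way as the loose one.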

Documentation
=============

find
----

Parameters:

* filters (dict): Beautiful Soup's filters, passed as keyword arguments
  (for example ``href=...`` or ``style=...``)
* limit (int): Maximum number of links to return, taken in sequential order
* reverse (bool): Reverses the order of the list of ``<a>`` tags
* sort (function): A key function that receives each link dictionary and
  returns the value to sort on
* exclude (dict): Removes links that match the given criteria

Find all links that have a style containing "11px":

.. code:: python

    import re
    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(style=re.compile("11px"), limit=5)

Reverse the sort before limiting links:

.. code:: python

    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(limit=2, reverse=True)

Sort by a link's attribute:

.. code:: python

    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(limit=3, sort=lambda link: link['text'])
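
To make the interaction of ``sort``, ``reverse`` and ``limit`` concrete, here is a minimal sketch that works on plain dictionaries rather than a live page. The ``arrange`` helper and the sample ``links`` list are hypothetical, and the order of operations (sort, then reverse, then limit) is an assumption drawn from the examples in this section, not a statement about Link Grabber's internals.

```python
def arrange(links, limit=None, reverse=False, sort=None):
    """Hypothetical helper: order a list of link dicts, then truncate it."""
    result = list(links)                   # copy so the input is left untouched
    if sort is not None:
        result = sorted(result, key=sort)  # key function receives each link dict
    if reverse:
        result.reverse()
    if limit is not None:
        result = result[:limit]
    return result


# Made-up link dictionaries for illustration.
links = [
    {"href": "/news", "text": "News"},
    {"href": "/about", "text": "About"},
    {"href": "/maps", "text": "Maps"},
]

by_text = arrange(links, limit=2, sort=lambda link: link["text"])
last_two = arrange(links, limit=2, reverse=True)
print([link["text"] for link in by_text])
print([link["text"] for link in last_two])
```

Under these assumptions, limiting happens last, so ``limit=2`` with ``reverse=True`` yields the final two links of the page in reverse order.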

Exclude text:

.. code:: python

    import re
    from linkGrabber import Links

    links = Links("http://www.google.com")
    links.find(exclude={"text": re.compile("Read More")})
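
The effect of ``exclude`` can be pictured with a small stand-alone sketch. It assumes that a link is dropped as soon as any excluded key matches its pattern. The ``drop_excluded`` helper and the sample data are hypothetical and are not part of Link Grabber.

```python
import re


def drop_excluded(links, exclude):
    """Hypothetical helper: keep links that match none of the exclude patterns."""
    kept = []
    for link in links:
        matched = any(
            key in link and pattern.search(str(link[key]))
            for key, pattern in exclude.items()
        )
        if not matched:
            kept.append(link)
    return kept


# Made-up link dictionaries for illustration.
links = [
    {"href": "/post/1", "text": "Read More"},
    {"href": "/post/1", "text": "A first post"},
]
kept = drop_excluded(links, {"text": re.compile("Read More")})
print(kept)
```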

Link Dictionary
---------------

All attrs from BeautifulSoup's Tag object are available in the dictionary,
as well as a few extras:

* text (the text between the ``<a>`` and ``</a>`` tags)
* seo (the text after the last "/" in the URL, parsed in an attempt to make it human readable)
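
As a rough illustration of what a value like ``seo`` could look like, the sketch below takes the last path segment of a URL and turns separators into spaces. The ``readable_slug`` helper, its handling of file extensions and trailing slashes, and the sample URL are all assumptions made for this example. Link Grabber's own rules may differ.

```python
import re
from urllib.parse import urlparse


def readable_slug(href):
    """Hypothetical helper: turn the last URL path segment into plain words."""
    path = urlparse(href).path
    last = path.rstrip("/").rsplit("/", 1)[-1]  # text after the last "/"
    last = re.sub(r"\.\w+$", "", last)          # drop a file extension such as ".html"
    words = re.split(r"[-_+]+", last)           # treat separators as word breaks
    return " ".join(word for word in words if word)


slug = readable_slug("http://example.com/blog/my-first_post.html")
print(slug)
```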

=========
Changelog
=========

v0.2.5 (06/23/2014)
-------------------

* Added exclude parameter to Links.find(), which removes
  links that match certain criteria

v0.2.4 (06/10/2014)
-------------------

* Updated documentation to read better on PyPI
* Removed ``scrape.py`` and moved its code to ``__init__.py``
* Now using nose for unit testing

v0.2.3 (05/22/2014)
-------------------

* Updated the setup.py file and some verbiage

v0.2.2 (05/19/2014)
-------------------

* linkGrabber.Links.find() now responds with all Tag.attrs
  from BeautifulSoup4 as well as 'text' and 'seo' keys

v0.2.1 (05/18/2014)
-------------------

* Added more tests

v0.2.0 (05/17/2014)
-------------------

* Modified naming convention, reduced codebase, more readable structure

v0.1.9 (05/17/2014)
-------------------

* Python 3.4 compatibility

v0.1.8 (05/16/2014)
-------------------

* Changed parameter names to better reflect functionality

v0.1.7 (05/16/2014)
-------------------

* Updated README

v0.1.6 (05/16/2014)
-------------------

* Updated README with more examples

v0.1.5 (05/16/2014)
-------------------

* Updated find_links to accept link_reverse=(bool) and link_sort=(function)

v0.1.0 (05/16/2014)
-------------------

* Initial release.