An Extensible Image Crawler
Haul
Find thumbnails and original images from a URL or an HTML file.
Installation
on Ubuntu
$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul
on Mac OS X
$ pip install haul
If haul fails to install, the cause is probably lxml.
Usage
Find images from <img> src, <a> href, and even CSS background-image:
import haul
url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)
print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""
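To make the background-image case more concrete, here is a minimal sketch of how such URLs could be pulled out of markup with a regular expression. It is an illustration only and not haul's own finder: the names BACKGROUND_URL_RE and find_background_image_urls and the example.com URLs are hypothetical, and the pattern does not cover every CSS edge case.

```python
import re

# Hypothetical pattern: captures the URL inside url(...) when it appears in a
# `background` or `background-image` declaration. Not haul's implementation.
BACKGROUND_URL_RE = re.compile(
    r"background(?:-image)?\s*:[^;}]*?url\(\s*['\"]?([^'\")]+)['\"]?\s*\)",
    re.IGNORECASE,
)


def find_background_image_urls(html):
    """Return background image URLs found in inline styles or <style> blocks."""
    return [match.group(1).strip() for match in BACKGROUND_URL_RE.finditer(html)]


sample_html = """
<div style="background-image: url('http://example.com/a.png');"></div>
<style>.hero { background: #fff url(http://example.com/b.jpg) no-repeat; }</style>
"""

print(find_background_image_urls(sample_html))
```

A regular expression keeps the sketch short; a real finder would more likely walk the parsed document so that malformed CSS and escaped characters are handled properly.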
Find original (or bigger size) images with extend=True:
import haul
url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)
print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
# bigger size, extended from above urls
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""
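The output above suggests that extending amounts to rewriting the size suffix of a known URL pattern. The sketch below shows that idea for the two Tumblr patterns visible in the sample output (_500 to _1280 for post images, _16 to _128 for avatars). The function names and the regular expressions are assumptions inferred from that output, not haul's actual extender code.

```python
import re

# Assumed size mappings, inferred only from the sample output above.
_POST_SIZE_RE = re.compile(r"_500(\.\w+)$")
_AVATAR_SIZE_RE = re.compile(r"(avatar_[0-9a-f]+)_16(\.\w+)$")


def extend_tumblr_url(url):
    """Return a guessed larger-size URL, or None if the URL matches no pattern."""
    if _AVATAR_SIZE_RE.search(url):
        return _AVATAR_SIZE_RE.sub(r"\g<1>_128\2", url)
    if _POST_SIZE_RE.search(url):
        return _POST_SIZE_RE.sub(r"_1280\1", url)
    return None


def extend_image_urls(image_urls):
    """Return the original URLs followed by any extended URLs, without duplicates."""
    result = list(image_urls)
    for url in image_urls:
        bigger = extend_tumblr_url(url)
        if bigger and bigger not in result:
            result.append(bigger)
    return result


urls = [
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
]
print(extend_image_urls(urls))
```

Note that a rewritten URL is only a guess: the larger file may not exist on the server, so a caller may want to verify it before use.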
Advanced Usage
Custom finder / extender pipeline:
from haul import Haul
from haul.compat import str
def img_data_src_finder(pipeline_index,
                        soup,
                        finder_image_urls=[],
                        *args, **kwargs):
    """
    Find image URL in <img>'s data-src attribute
    """
    now_finder_image_urls = []
    for img in soup.find_all('img'):
        src = img.get('data-src', None)
        if src:
            src = str(src)
            now_finder_image_urls.append(src)
    output = {}
    output['finder_image_urls'] = finder_image_urls + now_finder_image_urls
    return output
MY_FINDER_PIPELINE = (
    'haul.finders.pipeline.html.img_src_finder',
    'haul.finders.pipeline.css.background_image_finder',
    img_data_src_finder,
)
GOOGLE_SITES_EXTENDER_PIPELINE = (
    'haul.extenders.pipeline.google.blogspot_s1600_extender',
    'haul.extenders.pipeline.google.ggpht_s1600_extender',
    'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)
url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
         finder_pipeline=MY_FINDER_PIPELINE,
         extender_pipeline=GOOGLE_SITES_EXTENDER_PIPELINE)
result = h.find_images(url, extend=True)
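The custom finder above takes the URLs collected so far and returns a dict with the updated list, which hints at how a pipeline can chain its steps. The following sketch shows one way such chaining could work. It is an assumption made for illustration, not haul's internals: run_pipeline, src_finder, data_src_finder, and the `tags` argument (a list of plain dicts standing in for the parsed `soup`) are all hypothetical, and it accepts only callables rather than dotted-path strings.

```python
def run_pipeline(pipeline, **initial):
    """Call each step in order and merge the dict it returns into shared state."""
    state = dict(initial)
    for index, func in enumerate(pipeline):
        output = func(pipeline_index=index, **state)
        state.update(output or {})
    return state


def src_finder(pipeline_index, tags, finder_image_urls=(), *args, **kwargs):
    """Hypothetical step: collect the `src` value of every tag that has one."""
    found = [tag['src'] for tag in tags if 'src' in tag]
    return {'finder_image_urls': list(finder_image_urls) + found}


def data_src_finder(pipeline_index, tags, finder_image_urls=(), *args, **kwargs):
    """Hypothetical step: collect the `data-src` value of every tag that has one."""
    found = [tag['data-src'] for tag in tags if 'data-src' in tag]
    return {'finder_image_urls': list(finder_image_urls) + found}


# `tags` is a stand-in for parsed <img> elements.
tags = [{'src': 'a.png'}, {'data-src': 'b.png'}]
state = run_pipeline((src_finder, data_src_finder), tags=tags)
print(state['finder_image_urls'])
```

Because every step receives the accumulated state as keyword arguments and accepts `*args, **kwargs`, a new step can be added to the tuple without changing the existing ones.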
Run Tests
$ cd tests
$ python test.py
History
1.3.2 (2013-11-05)
Bug fixed: #12
1.3.1 (2013-10-24)
Add is_found attribute for HaulResult
Add to_ordered_dict() method for HaulResult
1.3.0 (2013-10-16)
Use unicode for every string
Fix running test.py from another directory
Rename module models to core
Remove in_ignorecase()
1.2.0 (2013-10-15)
Improve error handling
1.1.0 (2013-10-04)
Custom finder / extender pipeline support
1.0.0 (2013-10-03)
Initial release
Download files
Source Distribution
File details
Details for the file haul-1.3.2.tar.gz
File metadata
- Download URL: haul-1.3.2.tar.gz
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6b5664ed9b389d6e8ba6537f64887cb0ae80178f43677328acf6f9036c6cf5dd
MD5 | a5f25a930976e4513d37d357d8846216
BLAKE2b-256 | 103508b709f7bf1d38ae347aa6a92746fa513a2fa4ab393dccaa339394cbb5f8