Library for scraping urls and downloading them as files
WebMixer
A library for scraping urls
The Basic Scraper
All webmixer.scrapers.pages and webmixer.scrapers.tags classes inherit from webmixer.base.BasicScraper, which means they all have the following attributes and functions:
Attributes
- directory (str): Directory to write files to
- color (str): Color for error messages (default: 'rgb(153, 97, 137)')
- locale (str): Language to use when writing error messages (default: 'en')
  - Note: must be listed in webmixer.messages.MESSAGES
- default_ext (str): Extension to default to for extracted files
Functions
create_tag(tag)
Args:
- tag (str): tag name to create (e.g. 'p')
Returns a BeautifulSoup tag.
Example:
image_tag = create_tag('img')
get_filename(link, default_ext=None)
Args:
- link (str): URL that has been scraped
- default_ext (optional str): if the link doesn't have an extension, use this extension
Returns a filename (str) to use for extracted files.
Example:
video_filename = get_filename('<url>', default_ext='.mp4')
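As a rough illustration of the behavior described above, a filename can be derived from a URL by hashing the link and reusing its extension, falling back to default_ext when the URL has none. This is only a sketch, not WebMixer's actual implementation: the helper name sketch_filename and the hash-based naming scheme are assumptions made for this example.

```python
import hashlib
import os
from urllib.parse import urlparse


def sketch_filename(link, default_ext=None):
    """Hypothetical stand-in for get_filename; not WebMixer's code."""
    # Only the URL path carries an extension; ignore query string and fragment
    path = urlparse(link).path
    _, ext = os.path.splitext(path)
    # Fall back to the caller's default when the link has no extension
    ext = ext or (default_ext or '')
    # Hash the full link so distinct URLs get distinct, filesystem-safe names
    digest = hashlib.sha256(link.encode('utf-8')).hexdigest()[:16]
    return digest + ext


# No extension in the link, so '.mp4' is used
print(sketch_filename('https://example.com/media/clip', default_ext='.mp4'))
# The link already ends in '.png', so the default is ignored
print(sketch_filename('https://example.com/img/logo.png?v=2', default_ext='.jpg'))
```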
mark_tag_to_skip(tag)
Marks a tag to be skipped during further scraping operations.
Args:
- tag (str): tag to mark
Example:
# Process img tag here...
mark_tag_to_skip(img)
write_url(link, url=None, default_ext=None, filename=None, directory=None)
Downloads a URL and writes it to a zip.
Args:
- link (str): URL to download
- url (optional str): URL used for handling relative URLs
- default_ext (optional str): if the link doesn't have an extension, use this extension
- filename (optional str): name for file to write to zip
- directory (optional str): directory to write file to in zip
Returns filepath within zip.
Example:
write_url('<link>', url='https://domain.com/', default_ext='.mp4', filename='video', directory='media') # 'media/video.mp4'
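The url argument exists so that relative links can be resolved before downloading. Below is a minimal sketch of that resolution step using the standard library's urljoin. Whether WebMixer uses this exact call internally is an assumption, and resolve_link is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_link(link, url=None):
    """Hypothetical helper: resolve a scraped link against the page it came from."""
    # Without a page URL there is nothing to resolve against
    return urljoin(url, link) if url else link


page_url = 'https://domain.com/articles/page.html'  # page the link was found on

# A relative link is resolved against the page URL
print(resolve_link('../media/video.mp4', url=page_url))  # https://domain.com/media/video.mp4
# An absolute link is returned unchanged
print(resolve_link('https://cdn.example.org/video.mp4', url=page_url))
```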
write_contents(filename, contents, directory=None)
Writes contents to the zip with a given filename.
Args:
- filename (str): filename for contents
- contents (bytes): contents to write to zip
- directory (optional str): directory to write to in zip
Returns filepath within zip.
Example:
write_contents('myfile.pdf', <pdf contents>, directory='docs') # docs/myfile.pdf
write_file(filepath, directory=None)
Writes a local file to the zip.
Args:
- filepath (str): path to local file
- directory (optional str): directory to write to in zip
Returns filepath within zip.
Example:
write_file('path/to/myfile.mp3', directory='music') # music/myfile.mp3
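To make the "returns filepath within zip" behavior of the write functions concrete, here is a small sketch built on the standard library's zipfile module. It is not WebMixer's implementation: sketch_write_contents is a hypothetical stand-in, and skipping entries that already exist is an assumption made for this example.

```python
import io
import zipfile


def sketch_write_contents(zf, filename, contents, directory=None):
    """Hypothetical stand-in: write bytes into an open ZipFile, return the path inside the zip."""
    # Zip entries always use forward slashes, regardless of platform
    path = f'{directory}/{filename}' if directory else filename
    # Assumption: skip the write if an entry with this path already exists
    if path not in zf.namelist():
        zf.writestr(path, contents)
    return path


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    print(sketch_write_contents(zf, 'myfile.pdf', b'%PDF-1.4', directory='docs'))  # docs/myfile.pdf
    print(sketch_write_contents(zf, 'index.html', b'<html></html>'))               # index.html
```

The returned path is what an HTML page inside the same zip would use to reference the file, for example as the 'src' of an embedded element.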
create_broken_link_message(link)
Generates a tag with a broken link message.
Args:
- link (str): link to copy/paste
Returns a div tag with a link to copy/paste into browser.
Example:
iframe.replaceWith(create_broken_link_message('<url>'))
# iframe -> <div>copy link...</div>
create_copy_link_message(link, partially_scrapable=False)
Generates a tag with a 'copy link into browser' message.
Args:
- link (str): link to copy/paste
- partially_scrapable (bool): link was mostly scraped, but doesn't include everything from original site
Returns a div tag with a link to copy/paste into browser.
Example:
iframe.replaceWith(create_copy_link_message('<url>'))
# iframe -> <div>copy link...</div>
Exceptions
webmixer.exceptions can be useful for handling errors from a variety of sources. If you are scraping a more specialized source, there may be some exceptions that are exclusive to that source. You can raise the following exceptions to manage that source correctly:
BrokenSourceException
Used when the link is completely broken (e.g. site no longer exists)
UnscrapableSourceException
Used when the link is working, but cannot be supported on Kolibri (e.g. Flash content)
For instance, the webmixer.scrapers.pages.gdrive.GoogleDriveScraper may throw a FileNotDownloadableError. To handle this correctly, it raises an UnscrapableSourceException:
try:
    ...
except FileNotDownloadableError as e:
    raise UnscrapableSourceException(e)
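The same translation pattern can be tried without installing anything. In the sketch below, all three exception classes are local stand-ins (not imported from webmixer), and download and scrape are hypothetical functions that exist only to show a source-specific error being re-raised as a generic one.

```python
class BrokenSourceException(Exception):
    """Stand-in for webmixer.exceptions.BrokenSourceException."""


class UnscrapableSourceException(Exception):
    """Stand-in for webmixer.exceptions.UnscrapableSourceException."""


class FileNotDownloadableError(Exception):
    """Stand-in for a source-specific error."""


def download(link):
    # Hypothetical downloader that fails in a source-specific way
    raise FileNotDownloadableError(f'{link} cannot be downloaded')


def scrape(link):
    try:
        return download(link)
    except FileNotDownloadableError as e:
        # Translate the source-specific error; "from e" keeps the original cause attached
        raise UnscrapableSourceException(str(e)) from e


try:
    scrape('<url>')
except UnscrapableSourceException as e:
    print(f'unscrapable: {e}')
```

Callers then only need to handle the two generic exceptions, whatever source the scraper wraps.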
Page Scrapers
There are several page scrapers available for scraping HTML pages. These download URLs to their respective file types.
Built-in Scrapers
Here is a list of the basic scraper classes, which are also listed under webmixer.scrapers.pages.base.COMMON_SCRAPERS:
- WebVideoScraper
- PDFScraper
- EPubScraper
- ImageScraper
- FlashScraper
- VideoScraper
- AudioScraper
Using Page Scrapers
When you create a scraper object, you may specify the following:
- url (str): URL that tag can be found at (used to handle relative URLs) required
- zipper (optional ricecooker.utils.html_writer): Zip to write to
- triaged (optional [str]): List of already parsed URLs
To scrape the page, you may use any of the following writing options:
to_zip: Writes a file to self.zipper, which is useful when scraping embedded sources from an HTML page.
Args:
- filename (optional str): name of file to write to
Returns path to file from within zip.
Here are the default extensions for each webmixer.scrapers.pages.base.Scraper:
Scraper | Extension |
---|---|
HTMLPageScraper | .html |
PDFScraper | .pdf |
EPubScraper | .epub |
AudioScraper | .mp3 |
VideoScraper | .mp4 |
WebVideoScraper | .mp4 |
ImageScraper | .png |
FlashScraper | error |
For example:
from webmixer.scrapers.pages.base import ImageScraper
image = <BeautifulSoup tag>
image['src'] = ImageScraper('<url>').to_zip() # Sets 'src' to zipped image filepath
to_tag: Writes file to zip and generates a tag based on what kind of scraper it is. This is useful when you are replacing iframes with native HTML elements.
Args:
- filename (optional str): name of file to write to
Returns tag.
Here are the return tag types for each webmixer.scrapers.pages.base.Scraper:
Scraper | Tag |
---|---|
HTMLPageScraper | None |
PDFScraper | <embed> |
EPubScraper | None |
AudioScraper | <audio> |
VideoScraper | <video> |
WebVideoScraper | <video> |
ImageScraper | <img> |
FlashScraper | error |
For example:
from webmixer.scrapers.pages.base import PDFScraper
iframe = <BeautifulSoup tag>
iframe.replaceWith(PDFScraper('<url>').to_tag()) # Replaces iframe with <embed> tag
to_file: Writes to a file. This is useful for downloading URLs as files to your local machine.
Args:
- filename (optional str): name of file to write to
- directory (optional str): directory to write to
- overwrite (bool): overwrite file if it exists
Returns a filepath to the downloaded file.
to_file uses the download_file method to write the file to a write_to_path.
Here are the return file types for each webmixer.scrapers.pages.base.Scraper:
Scraper | Extension |
---|---|
HTMLPageScraper | .zip - generated by ricecooker.utils.html_writer |
PDFScraper | .pdf |
EPubScraper | .epub |
AudioScraper | .mp3 |
VideoScraper | .mp4 |
WebVideoScraper | .mp4 |
ImageScraper | error - content kind not supported |
FlashScraper | error |
For example:
from webmixer.scrapers.pages.base import HTMLPageScraper
new_html_zip_path = HTMLPageScraper('<url>').to_file() # Returns newly scraped html .zip file
Custom Scrapers
Given how diverse the internet is, you may need to implement your own scraper to handle individual sources. You must implement a test classmethod in order to use your scraper.
If you would like to share a custom scraper, please feel free to open a pull request with a new file under webmixer.scrapers.pages.
Attributes
All scrapers have the following attributes:
- dl_directory (str): Directory to write to_file downloaded file to (default: 'downloads')
- directory (str): Directory to write files to
- color (str): Color for error messages (default: 'rgb(153, 97, 137)')
- locale (str): Language to use when writing error messages (default: 'en')
  - Note: must be listed in webmixer.messages.MESSAGES
- default_ext (str): Extension to default to for extracted files
- kind (le_utils.constants.content_kind): Content kind to write to
webmixer.scrapers.pages.base.HTMLPageScraper has these additional attributes:
- partially_scrapable (bool): Not all content can be viewed from within Kolibri (default: False)
- scrape_subpages (bool): Determines whether to scrape any subpages within this page (default: True)
- main_area_selector (optional tuple): Main element selector to replace everything in body tag
- omit_list (optional list): list of selectors to remove from page contents (e.g. [('a', {'class': 'link'})])
- loadjs (bool): Determines whether to load js when loading the page (default: True)
- scrapers ([webmixer.scrapers.pages.Scraper]): List of additional scrapers to use on this page
- extra_tags ([webmixer.scrapers.tags.Tag]): List of additional tags to scrape
For example, the following code will remove links, scrape Wikipedia pages, and set all images to 'myimg.png':
from webmixer.scrapers.tags import ImageTag
from webmixer.scrapers.pages.base import HTMLPageScraper
from webmixer.scrapers.pages.wikipedia import WikipediaScraper

class MyCustomTag(ImageTag):
    def process(self):
        self.tag['src'] = self.write_file('myimg.png')

class MyCustomScraper(HTMLPageScraper):
    omit_list = [('a',)]  # Remove links
    extra_tags = [MyCustomTag]  # Use MyCustomTag to set images to 'myimg.png'
    scrapers = [WikipediaScraper]  # Scrape any Wikipedia pages

    @classmethod  # Required test classmethod
    def test(cls, url):
        return 'my-domain.com' in url
Functions
@classmethod test(url): Required method to determine if this is the correct scraper for this URL.
Args:
- url (str): URL to test
Returns True if scraper is meant to scrape URL.
Example:
    @classmethod
    def test(cls, url):
        return 'somedomain' in url
preprocess(contents): Process contents before main scraping method.
Args:
- contents (BeautifulSoup): contents to preprocess
Example:
    # Delete the first image on the page before scraping all the images
    def preprocess(self, contents):
        contents.find('img').decompose()
postprocess(contents): Process contents after main scraping method.
Args:
- contents (BeautifulSoup): contents to postprocess
Example:
    # Append a link at the end of the <body> tag
    def postprocess(self, contents):
        link = self.create_tag('a')
        link.string = 'New Link'
        contents.body.append(link)
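The two hooks wrap the main scraping step. The sketch below shows that call order with a stand-in base class that works on a plain list instead of a BeautifulSoup document. The names SketchPageScraper, scrape_main and run are hypothetical and are not part of WebMixer.

```python
class SketchPageScraper:
    """Stand-in base class; only illustrates the hook order."""

    def preprocess(self, contents):
        pass  # Default hook does nothing

    def postprocess(self, contents):
        pass  # Default hook does nothing

    def scrape_main(self, contents):
        # Stand-in for the main scraping step: mark every item as scraped
        contents[:] = [f'scraped:{item}' for item in contents]

    def run(self, contents):
        # Hooks run immediately before and after the main step
        self.preprocess(contents)
        self.scrape_main(contents)
        self.postprocess(contents)
        return contents


class MyScraper(SketchPageScraper):
    def preprocess(self, contents):
        del contents[0]            # Drop the first item before scraping

    def postprocess(self, contents):
        contents.append('footer')  # Add an item after scraping


print(MyScraper().run(['banner', 'img1', 'img2']))
# ['scraped:img1', 'scraped:img2', 'footer']
```

Because 'banner' is removed in preprocess, it is never scraped, and because 'footer' is added in postprocess, it is left untouched by the main step.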
Tags
There are several tags that are available for use in scraping html pages. These will handle downloading any referenced files.
Using Tags
To create a tag, you may specify the following:
- tag (BeautifulSoup.tag): tag to parse required
- url (str): url that tag can be found at (used to handle relative URLs) required
- attribute (optional str): attribute to find link at (e.g. 'src' or 'data-src')
- scrape_subpages (optional bool): parse linked pages referenced by this tag (default: True)
- extra_scrapers (optional [webmixer.scrapers.base.BasicScrapers]): list of scrapers to try to scrape linked pages
- color (optional str): color for injected error messages (default: 'rgb(153, 97, 137)')
To scrape the tag, use the scrape method. This will process the tag so that it can be used from within an HTML zip. Here is a simple scraping example:
from webmixer.scrapers.tags import ImageTag
image_tag = <BeautifulSoup.img tag>
image_scraper = ImageTag(image_tag, '<url>')
image_scraper.scrape() # image_tag['src'] will point to downloaded image file in zip
Built-in Tags
Here is a list of the available tags, which are also listed under webmixer.scrapers.tags.COMMON_TAGS:
- ImageTag (img)
- AudioTag (audio)
- VideoTag (video)
- EmbedTag (embed)
- LinkTag (a) Scrapes linked pages referenced by 'href' attribute
- IframeTag (iframe) Scrapes embedded pages referenced by 'src' attribute
- StyleTag (style) Scrapes sheets referenced by 'href' attribute
- ScriptTag (script) Scrapes scripts referenced by 'src' attribute
Custom Tags
Depending on the source you are trying to scrape, you may need more specific methods for scraping a page. To create a custom tag, subclass webmixer.scrapers.tags.BasicScraperTag.
Attributes
All tags have the following attributes:
- selector (tuple): BeautifulSoup selector to find tag (e.g. ('a', {'class': 'link'}))
- default_ext (str): Extension to use if link doesn't have an extension
- directory (str): Directory to write tag files to
- attributes (dict): Any attributes to assign to a tag
- default_attribute (str): Attribute that references files (default: 'src')
- scrape_subpages (bool): Determines whether to scrape any linked pages (default: True)
- extra_scrapers ([webmixer.scrapers.base.BasicScrapers]): List of additional scrapers to use for scraping linked pages
- color (str): Color for error messages (default: 'rgb(153, 97, 137)')
- locale (str): Language to use when writing error messages (default: 'en')
  - Note: must be listed in webmixer.messages.MESSAGES
Example:
from webmixer.scrapers.tags import BasicScraperTag

class MyVideoTag(BasicScraperTag):
    selector = ('video', {'class': 'video-class'})  # Select video.video-class
    directory = 'media'  # Files will be written to media folder
    attributes = {  # Videos will have width 100%
        'width': '100%'
    }
Built-in functions
For more custom scraping logic, you may also override the following methods:
process(): Makes the tag usable from within an HTML zip by downloading any referenced files.
Example:
    class MyVideoTag(BasicScraperTag):
        def process(self):
            # Scrape all of the <source> tags
            for source in self.tag.find_all('source'):
                BasicScraperTag(source, self.zipper, self.url).scrape()
handle_error(): Determines how to handle cases where the link is broken.
Example:
    class MyVideoTag(BasicScraperTag):
        def handle_error(self):
            self.tag.decompose()  # Just remove the element if it doesn't work
handle_unscrapable(): Determines how to handle cases where the link is not scrapable.
Example:
    class MyVideoTag(BasicScraperTag):
        def handle_unscrapable(self):
            self.tag.replaceWith(self.create_copy_link_message(self.link))
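One plausible way these three methods fit together is that scrape() calls process() and routes each kind of failure to the matching handler. The sketch below illustrates that idea with local stand-in classes. The dispatch logic, the SketchTag class and its result attribute are all assumptions and may differ from how WebMixer wires the hooks internally.

```python
class BrokenSourceException(Exception):
    """Stand-in for webmixer.exceptions.BrokenSourceException."""


class UnscrapableSourceException(Exception):
    """Stand-in for webmixer.exceptions.UnscrapableSourceException."""


class SketchTag:
    """Hypothetical tag scraper that records what happened in self.result."""

    def __init__(self, link):
        self.link = link
        self.result = None

    def process(self):
        return f'downloaded {self.link}'

    def handle_error(self):
        self.result = f'broken link: {self.link}'

    def handle_unscrapable(self):
        self.result = f'copy link into browser: {self.link}'

    def scrape(self):
        try:
            self.result = self.process()
        except BrokenSourceException:
            self.handle_error()          # Link is completely broken
        except UnscrapableSourceException:
            self.handle_unscrapable()    # Link works but cannot be scraped
        return self.result


class FlashTag(SketchTag):
    def process(self):
        # Simulate content that exists but cannot be supported
        raise UnscrapableSourceException(self.link)


print(SketchTag('<url>').scrape())  # downloaded <url>
print(FlashTag('<url>').scrape())   # copy link into browser: <url>
```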
Helper Functions
webmixer.utils.guess_scraper
If you would like to determine which scraper to use based on a URL, you can use the webmixer.utils.guess_scraper method. It accepts the following arguments:
- url (str): URL to scrape
- scrapers ([webmixer.scrapers.base.BasicScrapers]): list of other scrapers to test URL against
- allow_default (optional bool): use generic default scraper in case nothing matches (default: False)
You can also pass in additional arguments to scrapers with kwargs.
A simple usage of guess_scraper might be:
from webmixer.utils import guess_scraper
scraper = guess_scraper('<url>', scrapers=[MyCustomScraper])
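Since every scraper exposes a test classmethod, selection can be pictured as trying each candidate in turn and taking the first match. The sketch below shows that idea with stand-in classes. It is not WebMixer's implementation: all class names are hypothetical, and it returns the matching class rather than a constructed scraper object to keep the example short.

```python
class WikipediaSketch:
    @classmethod
    def test(cls, url):
        return 'wikipedia.org' in url


class PDFSketch:
    @classmethod
    def test(cls, url):
        # Ignore any query string when checking the extension
        return url.split('?')[0].lower().endswith('.pdf')


class DefaultSketch:
    @classmethod
    def test(cls, url):
        return True  # Generic fallback accepts anything


BUILT_IN = [PDFSketch]  # Stand-in for the built-in scraper list


def sketch_guess_scraper(url, scrapers=None, allow_default=False):
    """Hypothetical dispatch: caller-supplied scrapers are tried before built-in ones."""
    for scraper_class in list(scrapers or []) + BUILT_IN:
        if scraper_class.test(url):
            return scraper_class
    # Only fall back to the generic scraper when explicitly allowed
    return DefaultSketch if allow_default else None


print(sketch_guess_scraper('https://en.wikipedia.org/wiki/Zip', scrapers=[WikipediaSketch]).__name__)
print(sketch_guess_scraper('https://example.com/', allow_default=True).__name__)
```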
History
0.0.0 (2019-07-30)
- First release on PyPI.