Screenshot as a service

Saas - Screenshot as a service

Installation

Requirements

FUSE

What is FUSE? From the FUSE Wikipedia page:

Filesystem in Userspace (FUSE) is a software interface for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code. This is achieved by running file system code in user space while the FUSE module provides only a "bridge" to the actual kernel interfaces.

FUSE is used to mount a synthetic filesystem from which the photos taken of the urls given to saas can be read back. The user-space filesystem is dynamically filled with files and directories by saas. FUSE is a good choice for this component because a filesystem can be easily integrated into almost any workflow; read more about this in the API section.

Elasticsearch

Elasticsearch is used as a storage backend for saas. Read more about the storage in the storage section.

ImageMagick

ImageMagick is used for optimizing image files saved to disk. This is an optional dependency since it is only used when the --optimize-storage flag is used.

Linux

1. Install Elasticsearch using docker

sudo docker pull docker.elastic.co/elasticsearch/elasticsearch:6.5.4

2. Install Firefox and Geckodriver

sudo apt-get install firefox

wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -xvzf geckodriver-v0.23.0-linux64.tar.gz
chmod +x geckodriver
sudo mv geckodriver /usr/bin/

3. Install ImageMagick (optional)

sudo apt-get install imagemagick

4. Install saas

# Make sure you have Python 3.7 installed!
python --version
# Python 3.7.2

pip install saas

saas --version
# saas 1.2.1

macOS

1. Install FUSE for macOS

Either from the official website (recommended) or using Homebrew

brew update
brew tap homebrew/cask
brew cask install osxfuse

2. Install Elasticsearch

brew install elasticsearch

3. Install Firefox

Either from the official website or using Homebrew

brew cask install firefox

4. Install Geckodriver

brew install geckodriver

5. Install Python 3.7

brew install python3
python3 --version
# Python 3.7.2

6. Install ImageMagick (optional)

brew install imagemagick

7. Install saas

# Make sure you have Python 3.7 installed!
python3 --version
# Python 3.7.2

python3 -m pip install saas

saas --version
# saas 1.2.1

Usage

Getting started

Start Elasticsearch

Every time you run saas you must make sure that there is an Elasticsearch instance running and available for saas to connect to.

If using docker
sudo docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.5.4
If binary exists in PATH
# foreground
elasticsearch

# or run it in the background
elasticsearch > elasticsearch.log 2>&1 &
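
Because saas needs a reachable Elasticsearch instance, a start-up script could check for one before launching saas. The following is a minimal Python sketch, assuming the default host localhost:9200 from the options list. The helper name elasticsearch_is_up is hypothetical and not part of saas.

```python
import urllib.request


def elasticsearch_is_up(host: str = "localhost:9200", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at the given host."""
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # covers connection refused, timeouts, DNS failures and HTTP error statuses
        return False


if __name__ == "__main__":
    print("elasticsearch reachable:", elasticsearch_is_up())
```

The check only confirms that something answers HTTP on that port; it does not verify the Elasticsearch version or that the saas indices exist.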

Taking a picture of a single URL

# create input file
$ touch input_urls

# make mountpoint for filesystem
$ mkdir mount

# start saas
# the --ignore-found-urls option will disable the crawler behaviour
$ saas input_urls mount --ignore-found-urls
mounting filesystem at: ./mount
starting 1 crawler threads
starting 1 photographer threads

# add url to input file
$ echo "https://news.ycombinator.com/" >> input_urls

# the photo will appear inside the mountpoint
$ tree mount/
mount/
└── news.ycombinator.com
    ├── 2019011721
    │   └── index.png
    └── latest
        └── index.png

3 directories, 2 files
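
The same session could be scripted. This is a small Python sketch, not part of saas, that appends a url to the input file and polls the mountpoint until the photo shows up. The function names submit_url and wait_for_photo, the timeout and the polling interval are assumptions made for illustration.

```python
import time
from pathlib import Path


def submit_url(url_file: Path, url: str) -> None:
    """Append a url to the saas input file, one url per line."""
    with url_file.open("a") as handle:
        handle.write(url + "\n")


def wait_for_photo(photo: Path, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll until the photo exists in the mounted filesystem or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if photo.is_file():
            return True
        time.sleep(interval)
    return photo.is_file()
```

With the paths from the session above, this would be called as submit_url(Path("input_urls"), "https://news.ycombinator.com/") followed by wait_for_photo(Path("mount/news.ycombinator.com/latest/index.png")).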

Using the crawler

The crawler is a useful tool for finding new urls to take pictures of. It can be configured to run wild and crawl any domain it comes across, or to stay at the domains that the urls in the input file belong to.

Stay at domains

With the --stay-at-domain flag the crawler discards any url that does not belong to the same domain as the page it was found on.

$ saas input_urls mount --stay-at-domain

$ echo "https://daringfireball.net/" >> input_urls

# after a minute or so

$ tree mount/daringfireball.net/latest/
mount/daringfireball.net/latest/
├── 2006
│   └── 06
│       └── apple_open_source.png
├── 2007
│   └── 01
│       └── enderle_leg_pulling.png
├── 2008
│   └── 04
│       └── big_fan.png.rendering.saas
├── 2017
│   └── 07
│       └── you_should_not_force_quit_apps.png
├── 2019
│   └── 01
│       └── on_getting_started_with_regular_expressions.png
├── index.png
└── linked
    └── 2019
        └── 01
            └── 07
                └── samsung-itunes.png

14 directories, 7 files
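
A consumer of the mounted filesystem may want to tell finished photos apart from files such as big_fan.png.rendering.saas in the listing above. The sketch below assumes, based only on the file name, that the .rendering.saas suffix marks a photo that is not finished yet. The function name list_photos is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

RENDERING_SUFFIX = ".rendering.saas"  # assumption: marks a photo that is not finished yet


def list_photos(root: Path) -> Tuple[List[Path], List[Path]]:
    """Walk a directory tree and split files into finished and in-progress photos."""
    finished, rendering = [], []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith(RENDERING_SUFFIX):
            rendering.append(path)
        elif path.suffix == ".png":
            finished.append(path)
    return finished, rendering
```

For example, list_photos(Path("mount/daringfireball.net/latest")) would return two lists of paths, the first with the .png files and the second with the files still carrying the rendering suffix.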

Resetting the data

Since the mounted filesystem is read-only, it is currently not possible to remove a photo simply by deleting it from the filesystem.

For now, the best way to clear the data directory and the index is with the --clear-data-dir and --clear-elasticsearch options. Note that --clear-elasticsearch clears all indices found in the Elasticsearch instance.

# cannot modify the mounted filesystem
$ touch mount/foo
touch: mount/foo: Read-only file system

# clear the index of urls and photo metadata
$ saas input_urls mount --clear-elasticsearch

# clear the photo files
$ saas input_urls mount --clear-data-dir

Read more about storage

Setting the viewport size

The camera viewport can be adjusted with the --viewport-width and --viewport-height options.

By default the camera tries to take a full-height screenshot: it works out how tall the page is and resizes the camera height accordingly. Full-height screenshots take considerably longer, especially on image-heavy sites.
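
When saas is launched from a script, the viewport options can be assembled programmatically. The sketch below builds an argument list from the documented --viewport-width, --viewport-height and --viewport-max-height options. It assumes the options accept their value as a separate argument, and the function name build_saas_command is hypothetical.

```python
from typing import List, Optional


def build_saas_command(
    url_file: str,
    mountpoint: str,
    width: Optional[int] = None,
    height: Optional[int] = None,
    max_height: Optional[int] = None,
) -> List[str]:
    """Build an argument list for the saas CLI with optional viewport settings."""
    command = ["saas", url_file, mountpoint]
    if width is not None:
        command += ["--viewport-width", str(width)]
    if height is not None:
        command += ["--viewport-height", str(height)]
    # the help text says the max height is ignored when a fixed height is set,
    # so it is only passed along for full-height screenshots (height unset or 0)
    if max_height is not None and not height:
        command += ["--viewport-max-height", str(max_height)]
    return command
```

The resulting list can be handed to subprocess.run, for instance with build_saas_command("input_urls", "mount", width=1280, height=800) for a fixed-size viewport.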

Full list of options

usage: saas [-h] [--version] [--debug] [--refresh-rate] [--crawler-threads]
            [--photographer-threads] [--data-dir] [--clear-data-dir]
            [--elasticsearch-host] [--setup-elasticsearch]
            [--clear-elasticsearch] [--stay-at-domain] [--ignore-found-urls]
            [--viewport-width] [--viewport-height] [--viewport-max-height]
            [--optimize-storage] [--stop-if-idle]
            url_file mountpoint

Screenshot as a service

positional arguments:
  url_file              Path to input url file
  mountpoint            Where to mount filesystem via FUSE

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --debug               Display debugging information
  --refresh-rate        Refresh captures of urls every 'day', 'hour' or
                        'minute' (default: hour)
  --crawler-threads     Number of crawler threads, usually not neccessary with
                        more than one (default: 1)
  --photographer-threads
                        Number of photographer threads, beaware that
                        increasing too much won't neccessarily speed up
                        performance and hog the system (default: 1)
  --data-dir            Path to data directory (default: ~/.saas-data-dir)
  --clear-data-dir      Use flag to clear data directory on start
  --elasticsearch-host
                        Elasticsearch host (default: localhost:9200)
  --setup-elasticsearch
                        Use flag to create indices in elasticsearch
  --clear-elasticsearch
                        Use flag to clear elasticsearch on start, WARNING:
                        this will clear all indices found in elasticsearch
                        instance
  --stay-at-domain      Use flag to ignore urls from a different domain than
                        the one it was found at
  --ignore-found-urls   Use flag to ignore urls found on crawled urls
  --viewport-width      Width of camera viewport in pixels (default: 1920)
  --viewport-height     Height of camera viewport in pixels, if set to 0
                        camera will try to take a full height high quality
                        screenshot, which is way slower than fixed size
                        (default: 0)
  --viewport-max-height
                        Max height of camera viewport in pixels, if
                        --viewport-height is set this will be ignored
  --optimize-storage    Image files should be optimized to take up less
                        storage (takes longer time to render)
  --stop-if-idle        If greater than 0 saas will stop if it is idle for
                        more than the provided number of minutes

Storage

Saas uses two types of storage: a regular directory for photo files, and Elasticsearch for photo metadata and urls.

Elasticsearch

The Elasticsearch instance is configured by saas with three indices:

  • crawled: holds the urls the crawler has visited, the HTTP response code and any locks (a lock means a photographer thread is currently taking a picture of that url)
  • uncrawled: contains urls scraped from pages the crawler has visited
  • photos: contains photo metadata such as file size, captured_at and filename
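
To get a quick impression of how much each index holds, Elasticsearch's _count endpoint can be queried over HTTP. The following is a sketch under the assumption that the indices are named exactly crawled, uncrawled and photos as listed above. The helper names count_url and count_documents are hypothetical and not part of saas.

```python
import json
import urllib.request

INDICES = ("crawled", "uncrawled", "photos")  # index names as listed above


def count_url(host: str, index: str) -> str:
    """Build the url of Elasticsearch's _count endpoint for one index."""
    return f"http://{host}/{index}/_count"


def count_documents(host: str, index: str) -> int:
    """Ask Elasticsearch how many documents an index holds."""
    with urllib.request.urlopen(count_url(host, index), timeout=5) as response:
        return json.load(response)["count"]
```

With a running instance, looping over INDICES and calling count_documents("localhost:9200", index) would report the number of documents per index.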

Data directory

When saas responds to a directory listing it only needs to query the Elasticsearch photos index. The actual file content is fetched from the data directory only when a read request is made. The data directory holds the raw photo data, with a unique id for each photo. The default path for this directory is ~/.saas-data-dir.

$ tree ~/.saas-data-dir/
├── 18
│   └── 18dfe716-cdb2-4916-8154-6088d9bc6ee3.png
├── 1c
│   └── 1c1d0ee8-28f6-4b7c-b70f-8e800c58a3a6.png
├── 29
│   └── 29dd23f3-1791-46e6-8a83-25f5736a0894.png
├── 50
│   └── 50f13985-2cce-4464-942d-d9bbea165785.png
├── 76
│   └── 769933ce-2cde-4f30-a215-c26227850c8b.png
├── 89
│   ├── 8975f15c-7112-499c-97d5-44dd501b9b09.png
│   └── 89ec9675-84f8-47fa-9589-8d39a8a34ea1.png
├── ab
│   └── ab5bbb0f-03cb-45ed-be1d-e257434a925c.png
├── ca
│   └── ca1551a2-8855-4d0d-869b-108b9b7122bf.png
└── d7
    └── d79598a2-619f-4192-bb39-5e31642be800.png
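
Judging from the listing above, each photo is stored in a subdirectory named after the first two characters of its id. A sketch of that mapping follows. It is inferred from the directory layout shown here rather than taken from the saas source, and the function name photo_path is hypothetical.

```python
from pathlib import Path


def photo_path(data_dir: Path, photo_id: str, extension: str = ".png") -> Path:
    """Return the sharded path for a photo id: <data_dir>/<first two chars>/<id>.png"""
    return data_dir / photo_id[:2] / (photo_id + extension)
```

Sharding by id prefix like this keeps any single directory from accumulating an unbounded number of files.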

Build

Install saas by cloning it from source

$ git clone https://github.com/nattvara/saas.git && cd saas

$ python3 -m venv ./venv

$ source ./venv/bin/activate

$ python setup.py develop

Firefox extensions

The camera module uses selenium to render pages. To improve performance, saas uses uBlock Origin to block ads. To reach more webpages, saas uses I don't care about cookies (IDCAC) to bypass popups and GDPR consent forms. Many websites put some of their content behind a paywall but leave it open to visitors coming from search engines and social media sites. Saas therefore ships a small custom Firefox extension that rewrites all HTTP requests made from Firefox to include the header Referer: https://google.com, which gives access to a lot more content on the web.

Updating uBlock Origin

Download the latest ublock.xpi from gorhill/uBlock releases and replace the version in the extensions/ directory.

Updating IDCAC

Download and install the latest version using firefox from https://www.i-dont-care-about-cookies.eu/. Locate the .xpi file inside Firefox's extensions directory, on macOS this is ~/Library/Application Support/Firefox/Profiles/[profile]/extensions/. Copy the .xpi file to the extensions/ directory.

Referer Header

Make zip archive of source files

zip -r -j -FS extensions/referer_header.xpi extensions/referer_header/*

Run the testsuite

$ python -m unittest discover -s tests

Run the typechecker

$ mypy saas

API

The main reason for using FUSE is that saas's api is the filesystem. Everything that can interact with the filesystem can interact with saas. Almost every programming language ships with easy access to the filesystem, hence integration in any environment is as easy as reading and writing to the filesystem.

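For example, exposing saas through an HTTP interface could be as easy as starting a very simple node service like the following (a production service should be more thorough than this).

const http = require('http')
const fs = require('fs')
const port = 3000

// append the value of the ?url=... query parameter to the saas input file
const requestHandler = (request, response) => {
    const target = new URL(request.url, `http://localhost:${port}`).searchParams.get('url')
    if (!target) {
        // without this check the string "undefined" would be written to the file
        response.statusCode = 400
        return response.end('missing url parameter\n')
    }
    fs.appendFile('urls', target + '\n', (err) => {
        response.statusCode = err ? 500 : 200
        response.end('')
    })
}

const server = http.createServer(requestHandler)
server.listen(port)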

This would allow for adding new urls to crawl by calling the service like the following

curl http://localhost:3000/?url=https%3A%2F%2Fwww.wsj.com%2F
curl http://localhost:3000/?url=https%3A%2F%2Fwww.nytimes.com%2F

Starting a simple python webserver could allow for traversing the saas filesystem

# inside mounted filesystem
python3 -m http.server 3001

# so the following url
# https://www.ft.com/content/180f3428-1923-11e9-b93e-f4351a53f1c3
# if photographed, could be found at
wget http://localhost:3001/www.ft.com/latest/content/180f3428-1923-11e9-b93e-f4351a53f1c3.png

Those are two out of a hundred ways to integrate/extend saas.
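
Since the filesystem is the API, a client mostly needs to know where the photo of a given url ends up. The sketch below maps a url to the path of its latest photo relative to the mountpoint. The mapping is inferred from the examples in this document (front pages become index.png, other pages keep their path and gain a .png suffix) and does not handle query strings or fragments. The function name latest_photo_path is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def latest_photo_path(url: str) -> PurePosixPath:
    """Map a url to the path of its latest photo, relative to the mountpoint."""
    parsed = urlparse(url)
    page = parsed.path.strip("/") or "index"  # the front page is stored as index.png
    return PurePosixPath(parsed.netloc) / "latest" / (page + ".png")
```

For the url in the example above this yields www.ft.com/latest/content/180f3428-1923-11e9-b93e-f4351a53f1c3.png, which can be joined onto the mountpoint or onto the base url of the web server.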

Performance and Scalability

Saas is designed to run across multiple machines. A virtually unlimited number of saas nodes can be added to a single cluster; the only two things they need are a common Elasticsearch instance or cluster to talk to and a common data directory. Elasticsearch is well known for its scalability, and the data directory could for instance be a shared network drive, Amazon EFS or any other way to share a drive between machines.

Since all nodes in a cluster share the same index and data directory they can all read the images the cluster as a whole produces. Nodes can also join and leave the cluster freely without incurring any long-term data loss.

The biggest hits to performance are taking photos of image-heavy sites and using a large viewport size. A fixed viewport size is a good option for optimizing performance, since there is virtually no upper limit to how tall a website can be. Screenshots of tabloid websites or sites with infinite scroll can easily reach 25-50 MB in size.

Check out the guide Maximize saas throughput for a thorough walkthrough of how to deploy a large cluster of saas nodes on AWS and optimize performance.

Examples

See examples/ for some good examples for testing saas.

Known issues

Under some circumstances, such as a fatal crash, the mounted filesystem might not unmount automatically. The filesystem also cannot be unmounted while another process is reading from it.

If you encounter this, run

umount path/to/mounted_directory

License

MIT © Ludwig Kristoffersson

See LICENSE file for more information
