
Society Library Sources

This repo contains the source document collectors for the Society Library.

Use the package directly

# Note: you will need to set your environment variables for this to work, see .env.template
export GOOGLE_API_KEY=<your api key>
python -m sl_sources search google_scholar "artificial intelligence" --num_results 5
export SEMANTIC_SCHOLAR_API_KEY=<your api key>
python -m sl_sources download semantic_scholar <paper id>
python -m sl_sources search youtube "machine learning tutorial" --num_results 3 --output results.json
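The CLI invocations above can also be driven from a script. A minimal sketch; only the module path and flag names are taken from the examples above, the helper itself is hypothetical:

```python
# Hypothetical helper for driving the sl_sources CLI from Python.
# Only the module path and flag names come from the examples above.
import subprocess
import sys
from typing import List, Optional

def build_search_argv(source: str, query: str, num_results: int,
                      output: Optional[str] = None) -> List[str]:
    """Build the argv for `python -m sl_sources search ...`."""
    argv = [sys.executable, "-m", "sl_sources", "search", source, query,
            "--num_results", str(num_results)]
    if output is not None:
        argv += ["--output", output]
    return argv

# To actually run a search (requires sl_sources installed and API keys exported):
#   subprocess.run(build_search_argv("youtube", "machine learning tutorial", 3,
#                                    output="results.json"), check=True)
```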

Library

Install the library from PyPI:

pip install sl_sources

Then import the search and download functions in Python:

# Note: you will need to set your environment variables for this to work, see .env.template
from sl_sources import search_google_scholar, download_from_google_scholar
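A sketch of typical library usage. The import line comes from above; the environment-variable check and the call signatures are assumptions, so treat this as a starting point rather than the package's documented API:

```python
# Sketch of library usage. The import comes from the README above; the
# environment-variable check and call signatures are assumptions.
import os

REQUIRED_VARS = ("GOOGLE_API_KEY",)  # see .env.template for the full list

def missing_env() -> list:
    """Names of required environment variables that are not yet set."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# Typical flow (requires the package installed and keys exported):
#   from sl_sources import search_google_scholar, download_from_google_scholar
#   assert not missing_env(), "see .env.template"
#   results = search_google_scholar("artificial intelligence", num_results=5)
#   for result in results:
#       download_from_google_scholar(result)
```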

You can build and publish an updated release with the following commands:

# remove the dist folder if it exists
rm -rf dist

# build the library
python setup.py sdist bdist_wheel

# upload to pypi
twine upload dist/*

Worker

The Media Worker wraps the search and download functions for all sources, and is especially good for scraping websites and downloading videos.

Setup

You will need a Google account with the "Google Cloud Functions" and "Google Cloud Build" APIs enabled. If you initialize gcloud with your Google account and log in, these APIs can be enabled for you automatically when the worker is deployed.

Download and install the gcloud CLI:

brew install --cask google-cloud-sdk # macOS

# Linux
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

Then initialize gcloud and authenticate:

gcloud init
gcloud auth login

Local development

You can run the worker locally using functions-framework

pip install functions-framework
functions-framework --target handle_request --debug

Make sure you have set CLOUD_FUNCTION_URL=http://127.0.0.1:8080 and CLOUD_FUNCTION_ENABLED=true in your .env file.

You can now call the function using curl

# search
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"source": "google", "query": "artificial intelligence in neuroscience", "type": "search", "num_results": 10}'

# download
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"search_result": {"url": "https://www.google.com", "title": "Google", "source": "google"}, "type": "download"}'
# note that the search_result object is the result of the search function
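The same two requests can be issued from Python with only the standard library. The payload fields mirror the curl examples above; the helper names are hypothetical:

```python
# Build and send worker payloads with the standard library only.
# Payload fields mirror the curl examples above; helper names are hypothetical.
import json
from urllib import request

def search_payload(source: str, query: str, num_results: int) -> bytes:
    """JSON body for a worker search request."""
    return json.dumps({"source": source, "query": query,
                       "type": "search", "num_results": num_results}).encode()

def download_payload(search_result: dict) -> bytes:
    """JSON body for a worker download request; search_result is one
    item returned by a previous search."""
    return json.dumps({"search_result": search_result, "type": "download"}).encode()

def post(url: str, body: bytes) -> bytes:
    """POST a JSON body and return the raw response."""
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read()

# Against a local worker (requires functions-framework to be running):
#   post("http://127.0.0.1:8080",
#        search_payload("google", "artificial intelligence in neuroscience", 10))
```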

Deploy the worker

bash deploy_worker.sh

The worker will be deployed using the environment variables in your .env file, so confirm they are set the way you want before deploying.

You will need to update your .env and set CLOUD_FUNCTION_ENABLED to "true" and CLOUD_FUNCTION_URL to your deployed worker URL, which will be shown at deployment time. It should look like this:

CLOUD_FUNCTION_ENABLED=true
CLOUD_FUNCTION_URL=https://us-<region>-<project>.cloudfunctions.net/media_worker

You can initialize and run many workers simultaneously. The one limitation is that Cloud Functions can run for a maximum of 9 minutes (540 seconds), so make sure your work is split into chunks that each finish within that limit.
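One way to stay under the 9-minute ceiling is to split a batch of jobs before dispatching it. A sketch; the chunk size is an assumption you would tune to your workload:

```python
# Split a batch of jobs into chunks small enough that each worker
# invocation finishes well under the 540-second ceiling. The chunk
# size here is illustrative, not a recommendation.
from typing import Iterator, List

def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

urls = [f"https://example.com/page/{i}" for i in range(25)]
batches = list(chunk(urls, 10))  # dispatch each batch as its own worker request
```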

