Society Library Sources
This repo contains the source document collectors for the Society Library.
Use the package directly
# Note: you will need to set your environment variables for this to work, see .env.template
export GOOGLE_API_KEY=<your api key>
python -m sl_sources search google_scholar "artificial intelligence" --num_results 5
export SEMANTIC_SCHOLAR_API_KEY=<your api key>
python -m sl_sources download semantic_scholar <paper id>
python -m sl_sources search youtube "machine learning tutorial" --num_results 3 --output results.json
Library
pip install sl_sources
from sl_sources import search_google_scholar, download_from_google_scholar
# Note: You will need to set your environment variables for this to work, see .env.template
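Putting the import above to work might look like the following; this is a minimal sketch, and the exact signatures of search_google_scholar and download_from_google_scholar are assumptions based on the CLI flags shown earlier (check the package source for the real ones):

```python
# Sketch of the intended call pattern; signatures are assumptions based on
# the CLI usage above, not the package's documented API.
def fetch_papers(query, num_results=5):
    # Deferred import so the helper can be defined without the package installed.
    from sl_sources import search_google_scholar, download_from_google_scholar

    results = search_google_scholar(query, num_results=num_results)
    for result in results:
        download_from_google_scholar(result)
    return results
```

Calling fetch_papers("artificial intelligence") would then search and download in one step; as the note above says, the relevant API keys must be set in your environment first.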
You can build and publish a new release of the library with the following commands:
# remove the dist folder if it exists
rm -rf dist
# build the library
python setup.py sdist bdist_wheel
# upload to pypi
twine upload dist/*
Worker
The Media Worker wraps the search and download functions for all sources, and is especially good for scraping websites and downloading videos.
Setup
You will need a Google account with "Google Cloud Functions" and "Google Cloud Build" enabled. Alternatively, if you initialize gcloud and log in with your Google account, these services can be enabled automatically when the worker is deployed.
Download and install the gcloud CLI:
# macOS
brew install --cask google-cloud-sdk
# Linux
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
Then initialize gcloud and authenticate:
gcloud init
gcloud auth login
Local development
You can run the worker locally using functions-framework
pip install functions-framework
functions-framework --target handle_request --debug
Make sure you have set CLOUD_FUNCTION_URL=http://127.0.0.1:8080 and CLOUD_FUNCTION_ENABLED=true in your .env file.
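If you are not using a dedicated loader, the values in .env can be read with the standard library alone; a minimal sketch (a library such as python-dotenv provides the same behavior):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=value lines from a .env file into os.environ."""
    values = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                # Skip blank lines and comments; split on the first "=" only.
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    os.environ.update(values)
    return values
```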
You can now call the function using curl
# search
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"source": "google", "query": "artificial intelligence in neuroscience", "type": "search", "num_results": 10}'
# download
curl -X POST http://127.0.0.1:8080 -H "Content-Type: application/json" -d '{"search_result": {"url": "https://www.google.com", "title": "Google", "source": "google"}, "type": "download"}'
# note that search_result is an item returned by a previous search request
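The same requests can be issued from Python; a minimal sketch that assembles the JSON bodies shown in the curl examples (the worker URL and field names mirror those examples, and the urlopen helper is shown for illustration):

```python
import json
from urllib.request import Request, urlopen

WORKER_URL = "http://127.0.0.1:8080"

# Field names mirror the curl search example above.
search_payload = {
    "source": "google",
    "query": "artificial intelligence in neuroscience",
    "type": "search",
    "num_results": 10,
}

def call_worker(payload, url=WORKER_URL):
    """POST a JSON payload to the worker and return the decoded response."""
    request = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read())
```

A download request is built the same way, wrapping one search_result item from the search response under the "search_result" key with "type" set to "download".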
Deploy the worker
bash deploy_worker.sh
The worker is deployed with the environment variables from your .env file, so make sure they are set correctly before deploying.
After deployment, update your .env: set CLOUD_FUNCTION_ENABLED to true and CLOUD_FUNCTION_URL to your deployed worker URL, which is printed at deployment time. It should look like this:
CLOUD_FUNCTION_ENABLED=true
CLOUD_FUNCTION_URL=https://us-<region>-<project>.cloudfunctions.net/media_worker
You can initialize and run many workers simultaneously. The one limitation is that Cloud Functions can run for a maximum of 9 minutes (540 seconds), so split your work into chunks that finish well within that limit.
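One way to respect the 540-second ceiling is to batch work items before dispatching them, one batch per worker invocation; a minimal sketch with a hypothetical chunk helper:

```python
def chunk(items, size):
    """Split a list of work items into fixed-size batches, one per worker call."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Ten items split into batches of four -> three worker invocations.
batches = chunk(list(range(10)), 4)
```

Pick a batch size small enough that the slowest batch still finishes comfortably inside the timeout.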