MAG Archiver
MAG Archiver is an Azure Function App that automatically archives Microsoft Academic Graph (MAG) releases so that they can be transferred to other cloud services.
Status
This is a proof of concept; the functionality for archiving and compressing each MAG release has not been implemented yet.
Setup
The following instructions explain how to set up MAG Archiver.
Dependencies
- Install Azure CLI
- Install Azure Functions Core Tools
- Create an Azure Storage Account
- Region: choose an Azure region that is close to the other cloud provider that you want to transfer the data to.
- Access tier: hot (containers need to be deletable without incurring cold storage deletion fees)
- Create container: mag-snapshots
- Under Blob Service > Lifecycle Management > Code view: add the lifecycle rules from lifecycle-rules.json
- These rules move blobs to the cold tier after 30 days and delete the blobs after 61 days.
- Create an Azure Function App
- Take note of your function app name; you will need it later.
- Under Settings > Configuration > Application settings, add the following Application settings (name: value):
- STORAGE_ACCOUNT_NAME: the name of your storage account.
- STORAGE_ACCOUNT_KEY: the key for your storage account.
- TARGET_CONTAINER: mag-snapshots
- Subscribe to Microsoft Academic Graph on Azure storage
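The lifecycle rules mentioned above (move blobs to a cooler tier after 30 days, delete them after 61 days) can be sketched as an Azure Blob lifecycle management policy. The following is only an illustration and is not the contents of lifecycle-rules.json: the rule name, the `mag-snapshots/` prefix filter, and the use of the `tierToCool` action for the "cold tier" are assumptions.

```python
import json


def build_lifecycle_policy(cool_after_days=30, delete_after_days=61):
    """Return a lifecycle management policy as a plain dict.

    The day thresholds come from the README; everything else
    (rule name, filters, tierToCool) is an assumption for illustration.
    """
    return {
        "rules": [
            {
                "enabled": True,
                "name": "mag-snapshots-lifecycle",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            # Move blobs to the cool tier once they are old enough.
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            # Delete blobs after the retention period.
                            "delete": {
                                "daysAfterModificationGreaterThan": delete_after_days
                            },
                        }
                    },
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # Assumed: only blobs in the shared container are affected.
                        "prefixMatch": ["mag-snapshots/"],
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Print JSON that could be pasted into the Code view.
    print(json.dumps(build_lifecycle_policy(), indent=2))
```

Keeping the delete threshold above the tiering threshold gives each blob roughly a month in the cooler tier before it is removed.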
Deploy to Azure
To deploy mag-archiver, follow the instructions below.
Set up Azure account
Make sure that the Azure account that your Function App is deployed to is set as the default.
To do this, list your accounts and copy the id of the account that should be the default account:
az account list
Set the account to the Azure account that your Function App is deployed to:
az account set -s <insert your account id here>
Check that the correct account is set; your account should appear in the output:
az account show
Deploy the Function App
Clone the project:
git clone git@github.com:The-Academic-Observatory/mag-archiver.git
Enter the function app folder:
cd mag-archiver
Deploy the function:
func azure functionapp publish <your function app name> --python
Architecture
The architecture of MAG Archiver is illustrated via the deployment and process view diagrams below.
Process View
The MAG subscription adds each new MAG release as a new Azure Blob storage container in the user's Azure Storage account.
An Azure Function App runs every 10 minutes and checks to see if any new MAG release containers have been added.
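The discovery step can be sketched as a pure function that compares the containers in the storage account against the releases already known. This is a minimal sketch, not the project's code: the `mag-YYYY-MM-DD` container naming pattern, the schedule constant, and the function name `find_new_releases` are assumptions made for illustration.

```python
import re

# Assumed naming scheme for MAG release containers: mag-YYYY-MM-DD.
RELEASE_PATTERN = re.compile(r"^mag-\d{4}-\d{2}-\d{2}$")

# A six-field NCRONTAB expression that a timer trigger could use to
# fire every 10 minutes (assumed; the project's schedule is not shown).
SCHEDULE = "0 */10 * * * *"


def find_new_releases(container_names, known_releases):
    """Return release containers that have not been recorded yet, sorted by name."""
    known = set(known_releases)
    return sorted(
        name
        for name in container_names
        if RELEASE_PATTERN.match(name) and name not in known
    )
```

Because the names embed an ISO-style date, sorting them alphabetically also orders the releases chronologically.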
Metadata about which MAG releases have been discovered and processed is stored in an Azure Table Storage table called MagReleases.
The MagReleases table is also used to enable the Apache Airflow MAG workflow to query which MAG releases have finished processing and where in the Azure Blob storage container they can be downloaded from. A shared access signature (SAS) with read-only privileges is used to provide the Apache Airflow MAG workflow with access to the table.
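To make the role of the table concrete, the sketch below models a row and the kind of query a downstream workflow might run. The actual schema of MagReleases is not described here, so the class name, every field name, and the state values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagReleaseRecord:
    """Hypothetical shape of one row in the MagReleases table."""

    release: str           # name of the release, e.g. the source container name
    state: str             # hypothetical states: "discovered", "copying", "done"
    target_container: str  # container holding the archived files
    target_folder: str     # folder inside that container


def finished_releases(records):
    """Return the records whose processing has finished, sorted by release name."""
    return sorted(
        (r for r in records if r.state == "done"),
        key=lambda r: r.release,
    )
```

A read-only consumer only needs the `target_container` and `target_folder` of each finished record to locate the files.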
When the Function App finds a new MAG release, it copies the files from the container into a shared container called mag-snapshots, under a folder with the same name as the container they were copied from. After copying the files, the original container is deleted.
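The copy-then-delete ordering can be illustrated with an in-memory model of the storage account. This sketch does not call any Azure SDK: `storage` is a plain dict standing in for the account, and the function names are hypothetical.

```python
def target_blob_name(source_container, blob_name):
    """Place a blob under a folder named after its source container."""
    return f"{source_container}/{blob_name}"


def archive_release(storage, source_container, target_container="mag-snapshots"):
    """Copy all blobs into the shared container, then delete the source.

    `storage` maps container name -> {blob name -> bytes}; it is an
    in-memory stand-in for a storage account, used only for illustration.
    """
    source = storage[source_container]
    target = storage.setdefault(target_container, {})

    # Copy every blob into the folder named after the source container.
    for blob_name, data in source.items():
        target[target_blob_name(source_container, blob_name)] = data

    # Only delete the source once every blob is present in the target.
    expected = {target_blob_name(source_container, name) for name in source}
    if not expected.issubset(target):
        raise RuntimeError("copy incomplete; keeping source container")
    del storage[source_container]
    return sorted(expected)
```

Checking that all blobs arrived before deleting the source means a failed copy leaves the original release container intact.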
The Function App copies the MAG files to a shared container so that the Apache Airflow MAG workflow only needs to hold a single SAS token for blob access, the one for the shared container. In the future, the copying of files by the Function App can be replaced by a service that compresses the files, as shown in the diagram above.
A total of two SAS tokens are shared: one for the MagReleases table and one for the mag-snapshots container.
Deployment View
The deployment view below shows what services are used and where they are deployed.
Source Distribution
File details
Details for the file mag-archiver-2020.12.0.tar.gz.
File metadata
- Download URL: mag-archiver-2020.12.0.tar.gz
- Upload date:
- Size: 85.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.54.0 CPython/3.8.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3e662c18dbcaf0f7443b77185f973bf0a545eae0a0fd0fd8982699e3813c3879
MD5 | a09ccb4f8e8129d567b75f6d8876f923
BLAKE2b-256 | d77cc74c0e7d4b62dc9bcbe30cd9cdd9b8815895ef00f3381560c7d0764bca21