Deploy GROBID on AWS EC2
AWS GROBID Deploy
Deploy GROBID on AWS EC2 using Python.
Note: The deployed EC2 GROBID service will be publicly available on the internet. It is best practice (and more "economically sustainable") to always tear down the instance when it is not in use. Spinning up new instances is fast and easy.
Prerequisites
Before using this tool, ensure you have:
- AWS Account with appropriate permissions (see AWS_PERMISSIONS.md)
- AWS Credentials configured via AWS profiles or environment variables
- Required IAM Permissions for EC2 operations
Quick Setup
# Configure AWS profile
aws configure --profile your-profile-name
# Test your credentials
aws sts get-caller-identity --profile your-profile-name
For detailed setup instructions, see the AWS Permissions Guide.
Usage (Python)
import json
import aws_grobid
import requests
# There are a few different pre-canned configurations available:
# Base GROBID service w/ CRF only models
# aws_grobid.GROBIDDeploymentConfigs.grobid_crf
# Base GROBID service w/ Deep Learning models
# aws_grobid.GROBIDDeploymentConfigs.grobid_full
# Software Mentions annotation service w/ Deep Learning models
# aws_grobid.GROBIDDeploymentConfigs.software_mentions
# NOTE: You also need to change the URL endpoint specified below for Software mentions
# Create a new GROBID instance and wait for it to be ready
# This generally takes about 6 minutes
# Instance is automatically torn down if the
# GROBID service is not available within 7 minutes
instance_details = aws_grobid.deploy_and_wait_for_ready(
    grobid_config=aws_grobid.GROBIDDeploymentConfigs.grobid_crf,
)
# You can also specify the instance type, region, tags, etc.
# instance_details = aws_grobid.deploy_and_wait_for_ready(
#     grobid_config=aws_grobid.GROBIDDeploymentConfigs.grobid_full,
#     instance_type='c5.4xlarge',
#     region='us-east-1',
#     tags={'awsApplication': 'arn:...'},
#     timeout=420,  # 7 minutes
# )
# Use the instance to process a PDF file
# The API URL is available from:
# instance_details.api_url
# ...
# Example request to GROBID Server for Annotation
with open("example.pdf", "rb") as open_pdf:
    response = requests.post(
        # NOTE: Use f"{instance_details.api_url}/service/annotateSoftwarePDF" for Software mentions
        f"{instance_details.api_url}/api/processFulltextDocument",
        files={"input": open_pdf},
        data={"disambiguate": 1},
        timeout=180,  # 3 minutes
    )
response.raise_for_status()
# NOTE: response.json() expects a JSON body, as returned by the Software mentions
# endpoint. GROBID's processFulltextDocument returns TEI XML; read response.text there.
response_data = response.json()
# Write response to JSON
with open("example-output.json", "w") as open_json:
    json.dump(response_data, open_json)
# Teardown the instance when done
aws_grobid.terminate_instance(
    region=instance_details.region,
    instance_id=instance_details.instance_id,
)
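If processing raises an exception, a teardown call placed at the end of a script never runs, and the publicly reachable instance keeps running. One way to guard against that is a context manager that always terminates on exit. The sketch below is not part of aws_grobid: `temporary_instance` is a hypothetical helper that takes the deploy and terminate steps as callables, so it can wrap the calls used in the example.

```python
from contextlib import contextmanager


@contextmanager
def temporary_instance(deploy, terminate):
    """Yield a deployed instance and always terminate it afterwards.

    `deploy` is a zero-argument callable returning instance details;
    `terminate` is called with those details on exit. Both are supplied
    by the caller (hypothetical helper, not part of aws_grobid).
    """
    instance = deploy()
    try:
        yield instance
    finally:
        # Runs on normal exit and also when the body raises.
        terminate(instance)


# Usage with the calls from the example (commented out so the sketch
# runs without AWS access):
#
# with temporary_instance(
#     deploy=lambda: aws_grobid.deploy_and_wait_for_ready(
#         grobid_config=aws_grobid.GROBIDDeploymentConfigs.grobid_crf,
#     ),
#     terminate=lambda d: aws_grobid.terminate_instance(
#         region=d.region, instance_id=d.instance_id
#     ),
# ) as instance_details:
#     ...  # send requests to instance_details.api_url
```

Note that this only covers failures after a successful deploy; it does not protect against the Python process itself being killed.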
When providing an instance type that has NVIDIA GPUs available (G* or P* families), we automatically pass the GPU flag to Docker so GROBID can use the GPU.
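To check ahead of time whether an instance type falls into the G* or P* families, a simple prefix test on the family name is enough. This is only a sketch of such a check; `looks_like_gpu_instance` is a hypothetical name, and how aws_grobid makes the decision internally is not shown here.

```python
def looks_like_gpu_instance(instance_type: str) -> bool:
    """Return True if the instance family starts with 'g' or 'p'.

    Approximation only: it mirrors the "G* or P* families" rule of thumb.
    Some G-family types (for example g4ad) carry AMD rather than NVIDIA
    GPUs, so a True result does not guarantee an NVIDIA device.
    """
    # "g4dn.xlarge" -> family "g4dn"
    family = instance_type.split(".", 1)[0].lower()
    return family.startswith(("g", "p"))


print(looks_like_gpu_instance("g4dn.xlarge"))  # GPU family
print(looks_like_gpu_instance("c5.4xlarge"))   # compute-optimized, no GPU
```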
Note: The first call to the GROBID service may take a minute or so to warm up. Subsequent calls are much faster.
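Because the first request can be slow, it may time out on the client side even though the service is healthy. A small retry wrapper is one way to absorb that. The helper below is a generic sketch, not part of aws_grobid; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(send, attempts=3, delay=5.0, sleep=time.sleep):
    """Call `send()` until it succeeds or `attempts` calls have failed.

    `send` is any zero-argument callable, e.g. a function that posts a PDF
    and calls raise_for_status(). The last error is re-raised if every
    attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception as error:  # e.g. a client timeout during warm-up
            last_error = error
            if attempt < attempts:
                sleep(delay)  # wait before the next try
    raise last_error
```

Catching `Exception` keeps the sketch short; in real use it is usually better to catch only timeout and connection errors so that genuine failures surface immediately.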
Environment variables defined in a .env file are picked up automatically. This is useful for setting AWS_PROFILE, or AWS_ACCESS_KEY_ID together with AWS_SECRET_ACCESS_KEY.
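When a deployment fails with a credentials error, it can help to confirm which of these variables the process can actually see. The following is a small diagnostic sketch; `visible_credential_settings` is a hypothetical helper, and it only reports presence, not which source the AWS SDK will ultimately prefer.

```python
import os


def visible_credential_settings(env=None):
    """Report which AWS credential variables are set in `env`.

    Defaults to the current process environment. Values are never
    returned, only whether each setting is present.
    """
    env = os.environ if env is None else env
    return {
        "profile": bool(env.get("AWS_PROFILE")),
        # Access keys are only usable as a pair.
        "access_keys": bool(
            env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
        ),
    }


print(visible_credential_settings())
```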
CLI
After installing the package, a CLI is available as aws-grobid.
- Deploy and wait until ready (prints instance details as JSON):
# Deploy with default credentials
aws-grobid deploy --config crf --instance-type m6a.4xlarge --region us-west-2 \
--tag awsApplication=example --timeout 420
# Deploy with specific AWS profile
aws-grobid deploy --config crf --instance-type m6a.4xlarge --region us-west-2 \
--tag awsApplication=example --timeout 420 --profile your-profile-name
- Terminate an instance:
# Terminate with default credentials
aws-grobid terminate --region us-west-2 --instance-id i-0123456789abcdef0
# Terminate with specific AWS profile
aws-grobid terminate --region us-west-2 --instance-id i-0123456789abcdef0 \
--profile your-profile-name
Note: 'lite' remains available as a deprecated alias for 'crf' for backward compatibility.
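Since `deploy` prints the instance details as JSON, a script can capture that output and build the matching `terminate` command from it. The sketch below assumes the JSON keys are `region` and `instance_id`, mirroring the attribute names of the Python API; the actual key names are not confirmed here, and `build_terminate_command` is a hypothetical helper.

```python
import json


def build_terminate_command(details_json, profile=None):
    """Build the aws-grobid terminate command from deploy's JSON output.

    Assumes the JSON object has "region" and "instance_id" keys
    (an assumption based on the Python attribute names).
    """
    details = json.loads(details_json)
    command = [
        "aws-grobid", "terminate",
        "--region", details["region"],
        "--instance-id", details["instance_id"],
    ]
    if profile:
        command += ["--profile", profile]
    return command


sample = '{"region": "us-west-2", "instance_id": "i-0123456789abcdef0"}'
print(build_terminate_command(sample, profile="your-profile-name"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the deployment is no longer needed.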
Optional: better typing in editors
If you want precise types for the boto3 clients/resources in your IDE or mypy, install the dev extras:
pip install -e ".[dev]"
This includes boto3-stubs[ec2] and enables rich autocompletion and type checking without affecting runtime.
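As an illustration of what the stubs give you, the function below annotates its argument with the `EC2Client` type from `mypy_boto3_ec2` (the package installed by boto3-stubs[ec2]). The import sits behind `TYPE_CHECKING`, so the code still runs where the stubs are not installed. `running_instance_ids` is a hypothetical helper for this example, not part of aws_grobid.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported by type checkers and IDEs; no runtime dependency.
    from mypy_boto3_ec2 import EC2Client


def running_instance_ids(ec2: "EC2Client") -> "list[str]":
    """Return the IDs of running instances visible to this client.

    Reads a single page of describe_instances results; pagination is
    left out to keep the sketch short.
    """
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


# Usage (requires boto3 and AWS credentials):
# import boto3
# print(running_instance_ids(boto3.client("ec2", region_name="us-west-2")))
```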