Simple Python client for GROBID REST services
GROBID Client Python
A simple, efficient Python client for GROBID REST services that provides concurrent processing capabilities for PDF documents, reference strings, and patents.
📋 Table of Contents
- Features
- Prerequisites
- Installation
- Quick Start
- Usage
- Configuration
- Services
- Testing
- Performance
- Development
- License
✨ Features
- Concurrent Processing: Efficiently process multiple documents in parallel
- Flexible Input: Process PDF files, text files with references, and XML patents
- Configurable: Customizable server settings, timeouts, and processing options
- Command Line & Library: Use as a standalone CLI tool or import into your Python projects
- Coordinate Extraction: Optional PDF coordinate extraction for precise element positioning
- Sentence Segmentation: Layout-aware sentence segmentation capabilities
- JSON Output: Convert TEI XML output to structured JSON format with CORD-19-like structure
- Markdown Output: Convert TEI XML output to clean Markdown format with structured sections
📋 Prerequisites
- Python: 3.8 - 3.13 (tested versions)
- GROBID Server: A running GROBID service instance
  - Local installation: GROBID Documentation
  - Docker: docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.8.2
  - Default server: http://localhost:8070
  - Online demo: https://lfoppiano-grobid.hf.space (usage limits apply), more details here.
[!IMPORTANT] GROBID supports Windows only through Docker containers. See the Docker documentation for details.
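Before launching a large batch, it can help to confirm that the server is reachable. The sketch below queries GROBID's `/api/isalive` health-check endpoint using only the standard library. The helper names `isalive_url` and `server_is_up` are illustrative and are not part of this package.

```python
import urllib.request


def isalive_url(server: str) -> str:
    """Build the health-check URL from a GROBID base URL (hypothetical helper)."""
    return server.rstrip("/") + "/api/isalive"


def server_is_up(server: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the health check answers with HTTP 200 (hypothetical helper)."""
    try:
        with urllib.request.urlopen(isalive_url(server), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections, and socket timeouts all derive from OSError.
        return False


print(isalive_url("http://localhost:8070/"))
```

Calling `server_is_up()` with no arguments checks the default local server; pass another base URL to check a remote instance.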
🚀 Installation
Choose one of the following installation methods:
PyPI (Recommended)
pip install grobid-client-python
Development Version
pip install git+https://github.com/kermitt2/grobid_client_python.git
Local Development
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
pip install -e .
⚡ Quick Start
Command Line
# Process PDFs in a directory
grobid_client --input ./pdfs --output ./output processFulltextDocument
# Process with custom server
grobid_client --server https://your-grobid-server.com --input ./pdfs processFulltextDocument
Python Library
from grobid_client.grobid_client import GrobidClient
# Create client instance
client = GrobidClient(config_path="./config.json")
# Process documents
client.process("processFulltextDocument", "/path/to/pdfs", n=10)
📖 Usage
Command Line Interface
The client provides a comprehensive CLI with the following syntax:
grobid_client [OPTIONS] SERVICE
Available Services
| Service | Description | Input Format |
|---|---|---|
| processFulltextDocument | Extract full document structure | PDF files |
| processHeaderDocument | Extract document metadata | PDF files |
| processReferences | Extract bibliographic references | PDF files |
| processCitationList | Parse citation strings | Text files (one citation per line) |
| processCitationPatentST36 | Process patent citations | XML ST36 format |
| processCitationPatentPDF | Process patent PDFs | PDF files |
Common Options
| Option | Description | Default |
|---|---|---|
| --input | Input directory path | Required |
| --output | Output directory path | Same as input |
| --server | GROBID server URL | http://localhost:8070 |
| --n | Concurrency level | 10 |
| --config | Config file path | Optional |
| --force | Overwrite existing files | False |
| --verbose | Enable verbose logging | False |
Processing Options
| Option | Description |
|---|---|
| --generateIDs | Generate random XML IDs |
| --consolidate_header | Consolidate header metadata |
| --consolidate_citations | Consolidate bibliographic references |
| --include_raw_citations | Include raw citation text |
| --include_raw_affiliations | Include raw affiliation text |
| --teiCoordinates | Add PDF coordinates to XML |
| --segmentSentences | Segment sentences with coordinates |
| --flavor | Processing flavor for fulltext extraction |
| --json | Convert TEI output to JSON format |
| --markdown | Convert TEI output to Markdown format |
Examples
# Basic fulltext processing
grobid_client --input ~/documents --output ~/results processFulltextDocument
# High concurrency with coordinates
grobid_client --input ~/pdfs --output ~/tei --n 20 --teiCoordinates processFulltextDocument
# Process with JSON output
grobid_client --input ~/pdfs --output ~/results --json processFulltextDocument
# Process with Markdown output
grobid_client --input ~/pdfs --output ~/results --markdown processFulltextDocument
# Process citations with custom server
grobid_client --server https://grobid.example.com --input ~/citations.txt processCitationList
# Force reprocessing with sentence segmentation and JSON output
grobid_client --input ~/docs --force --segmentSentences --json processFulltextDocument
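The same invocations can be driven from a Python script, for example inside a scheduled job. The following is a minimal sketch that assembles the argument list from the options documented above and hands it to `subprocess`. The helpers `build_command` and `run` are hypothetical names introduced for illustration, and `grobid_client` is assumed to be on the PATH.

```python
import os
import subprocess
from typing import List, Optional, Sequence


def build_command(service: str, input_dir: str, output_dir: Optional[str] = None,
                  n: int = 10, flags: Sequence[str] = ()) -> List[str]:
    """Assemble a grobid_client argument list (hypothetical helper)."""
    # expanduser is needed because no shell is involved to expand "~".
    command = ["grobid_client", "--input", os.path.expanduser(input_dir)]
    if output_dir is not None:
        command += ["--output", os.path.expanduser(output_dir)]
    command += ["--n", str(n), *flags, service]
    return command


def run(command: List[str]) -> int:
    """Run the command and return its exit code; needs grobid_client on PATH."""
    return subprocess.run(command, check=False).returncode


command = build_command("processFulltextDocument", "./pdfs", "./output",
                        n=20, flags=["--teiCoordinates"])
print(" ".join(command))
```

Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.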
Python Library
Basic Usage
from grobid_client.grobid_client import GrobidClient
# Initialize with default localhost server
client = GrobidClient()
# Initialize with custom server
client = GrobidClient(grobid_server="https://your-server.com")
# Initialize with config file
client = GrobidClient(config_path="./config.json")
# Process documents
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=20
)
Advanced Usage
# Process with specific options
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=10,
    generateIDs=True,
    consolidate_header=True,
    teiCoordinates=True,
    segmentSentences=True
)
# Process with JSON output
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    json_output=True
)
# Process with Markdown output
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    markdown_output=True
)
# Process citation lists
client.process(
    service="processCitationList",
    input_path="/path/to/citations.txt",
    output_path="/path/to/output"
)
Standalone Conversion Tools
The library includes standalone scripts to convert TEI XML files to other formats without using the main client or server.
TEI to JSON Converter
Converts TEI XML files to the structured JSON format (similar to --json option).
# Convert a single file
python -m grobid_client.format.TEI2LossyJSON_cli --input path/to/file.tei.xml --output path/to/output.json
# Convert with verbose logging
python -m grobid_client.format.TEI2LossyJSON_cli --input path/to/file.tei.xml --verbose
TEI to Markdown Converter
Converts TEI XML files to Markdown format (similar to --markdown option).
# Convert a single file
python -m grobid_client.format.TEI2Markdown_cli --input path/to/file.tei.xml --output path/to/output.md
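The converters work on one file per call, so converting a whole directory of TEI files needs a small loop. The sketch below builds one command per `*.tei.xml` file using only the `--input` and `--output` options shown above. The helper `markdown_commands` and the output naming scheme (swapping `.tei.xml` for `.md`) are assumptions made for illustration.

```python
import sys
import tempfile
from pathlib import Path
from typing import List

TEI_SUFFIX = ".tei.xml"


def markdown_commands(tei_dir: Path, out_dir: Path) -> List[List[str]]:
    """Build one converter command per TEI file in tei_dir (hypothetical helper)."""
    commands = []
    for tei_file in sorted(tei_dir.glob("*" + TEI_SUFFIX)):
        # Assumption: name the output after the input, swapping the suffix for .md.
        target = out_dir / (tei_file.name[: -len(TEI_SUFFIX)] + ".md")
        commands.append([
            sys.executable, "-m", "grobid_client.format.TEI2Markdown_cli",
            "--input", str(tei_file), "--output", str(target),
        ])
    return commands


# Demo on a placeholder file; the commands are only built, not executed.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "paper.tei.xml").write_text("<TEI/>", encoding="utf-8")
commands = markdown_commands(demo_dir, demo_dir)
print(commands[0][-1])
```

Each entry can then be passed to `subprocess.run(command, check=True)` once the package is installed.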
⚙️ Configuration
Configuration can be provided via a JSON file. When using the CLI, the --server argument overrides the server URL set in the config file.
Default Configuration
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": [
    "persName",
    "figure",
    "ref",
    "biblStruct",
    "formula",
    "s"
  ]
}
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| grobid_server | GROBID server URL | http://localhost:8070 |
| batch_size | Thread pool size. Tune carefully: a large batch size will result in the data being written less frequently | 1000 |
| sleep_time | Wait time when server is busy (seconds) | 5 |
| timeout | Client-side timeout (seconds) | 180 |
| coordinates | XML elements for coordinate extraction | See above |
| logging | Logging configuration (level, format, file output) | See Logging section |
[!TIP] Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration is provided.
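A config file can also be generated from a script, which is convenient when the same pipeline targets several servers. The sketch below merges a few overrides into the default configuration shown above and writes the result to disk. The helper `write_config` and the override values are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Defaults copied from the "Default Configuration" example.
DEFAULTS = {
    "grobid_server": "http://localhost:8070",
    "batch_size": 1000,
    "sleep_time": 5,
    "timeout": 60,
    "coordinates": ["persName", "figure", "ref", "biblStruct", "formula", "s"],
}


def write_config(path: Path, overrides: dict) -> dict:
    """Merge overrides into the defaults and save them as JSON (hypothetical helper)."""
    merged = {**DEFAULTS, **overrides}
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


config_path = Path(tempfile.mkdtemp()) / "config.json"
merged = write_config(config_path, {"timeout": 120, "batch_size": 100})
print(config_path)
```

The resulting path can be passed as `GrobidClient(config_path=str(config_path))` or through the `--config` option.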
Logging Configuration
The client provides configurable logging with different verbosity levels. By default, only essential statistics and warnings are shown.
Logging Behavior
- Without --verbose: Shows only essential information and warnings/errors
- With --verbose: Shows detailed processing information at INFO level
Always Visible Output
The following information is always displayed regardless of the --verbose flag:
Found 1000 file(s) to process
Processing completed: 950 out of 1000 files processed
Errors: 50 out of 1000 files processed
Processing completed in 120.5 seconds
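Because these summary lines are always printed, a wrapper script can extract the counts from captured output. The sketch below parses the four lines shown above with regular expressions. It assumes the wording stays exactly as in this example, and `parse_summary` is a hypothetical helper, not part of the client.

```python
import re
from typing import Dict

# Patterns mirror the example summary lines; a wording change would break them.
PATTERNS = {
    "found": re.compile(r"Found (\d+) file\(s\) to process"),
    "processed": re.compile(r"Processing completed: (\d+) out of \d+ files processed"),
    "errors": re.compile(r"Errors: (\d+) out of \d+ files processed"),
    "seconds": re.compile(r"Processing completed in ([\d.]+) seconds"),
}


def parse_summary(output: str) -> Dict[str, float]:
    """Extract the counts and runtime from captured CLI output (hypothetical helper)."""
    summary = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(output)
        if match:
            value = match.group(1)
            summary[key] = float(value) if key == "seconds" else int(value)
    return summary


captured = """Found 1000 file(s) to process
Processing completed: 950 out of 1000 files processed
Errors: 50 out of 1000 files processed
Processing completed in 120.5 seconds"""
summary = parse_summary(captured)
print(summary)
```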
Verbose Output (--verbose)
When the --verbose flag is used, additional detailed information is displayed:
- Server connection status
- Individual file processing details
- JSON conversion messages
- Detailed error messages
- Processing progress information
Examples
# Clean output - only essential statistics
grobid_client --input pdfs/ processFulltextDocument
# Output:
# Found 1000 file(s) to process
# Processing completed: 950 out of 1000 files processed
# Errors: 50 out of 1000 files processed
# Processing completed in 120.5 seconds
# Verbose output - detailed processing information
grobid_client --input pdfs/ --verbose processFulltextDocument
# Output includes all essential stats PLUS:
# GROBID server http://localhost:8070 is up and running
# JSON file example.json does not exist, generating JSON from existing TEI...
# Successfully created JSON file: example.json
# ... and other detailed processing information
Configuration File Logging
The config file can include logging settings:
{
  "grobid_server": "http://localhost:8070",
  "logging": {
    "level": "WARNING",
    "format": "%(asctime)s - %(levelname)s - %(message)s",
    "console": true,
    "file": null
  }
}
Note: The --verbose command line flag always takes precedence over configuration file logging settings.
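To make the precedence rule concrete, here is a minimal sketch of how such a logging block could be mapped onto Python's standard `logging` module, with a verbose switch that forces INFO regardless of the configured level. This is an illustration under those assumptions, not the client's actual implementation; `build_logger` and the logger names are hypothetical.

```python
import logging

# Mirrors the "logging" block of the config example (JSON true/null become True/None).
LOGGING_CONFIG = {
    "level": "WARNING",
    "format": "%(asctime)s - %(levelname)s - %(message)s",
    "console": True,
    "file": None,
}


def build_logger(cfg: dict, verbose: bool = False, name: str = "grobid_demo") -> logging.Logger:
    """Create a logger from the config block; verbose overrides the level."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False  # avoid duplicate lines through the root logger
    logger.setLevel(logging.INFO if verbose else getattr(logging, cfg["level"]))
    formatter = logging.Formatter(cfg["format"])
    if cfg.get("console"):
        console = logging.StreamHandler()
        console.setFormatter(formatter)
        logger.addHandler(console)
    if cfg.get("file"):
        file_handler = logging.FileHandler(cfg["file"])
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


quiet = build_logger(LOGGING_CONFIG)
loud = build_logger(LOGGING_CONFIG, verbose=True, name="grobid_demo_verbose")
quiet.info("hidden at WARNING level")
loud.info("shown because verbose forces INFO")
```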
🔬 Services
Fulltext Document Processing
Extracts complete document structure including headers, body text, figures, tables, and references.
grobid_client --input pdfs/ --output results/ processFulltextDocument
JSON Output Format
When using the --json flag, the client converts TEI XML output to a structured JSON format similar to CORD-19. This provides:
- Structured Bibliography: Title, authors, DOI, publication date, journal information
- Body Text: Paragraphs and sentences with metadata and reference annotations
- Figures and Tables: Structured JSON format for tables with headers, rows, and metadata
- Reference Information: In-text citations with offsets and targets
JSON Structure
{
  "level": "paragraph",
  "biblio": {
    "title": "Document Title",
    "authors": [
      "Author 1",
      "Author 2"
    ],
    "doi": "10.1000/example",
    "publication_date": "2023-01-01",
    "journal": "Journal Name",
    "abstract": [
      ...
    ]
  },
  "body_text": [
    {
      "id": "p_12345",
      "text": "Paragraph text with citations [1].",
      "head_section": "Introduction",
      "refs": [
        {
          "type": "bibr",
          "target": "b1",
          "text": "[1]",
          "offset_start": 30,
          "offset_end": 33
        }
      ]
    }
  ],
  "figures_and_tables": [
    {
      "id": "table_1",
      "type": "table",
      "label": "Table 1",
      "head": "Sample Data",
      "content": {
        "headers": [
          "Header 1",
          "Header 2"
        ],
        "rows": [
          [
            "Value 1",
            "Value 2"
          ]
        ],
        "metadata": {
          "row_count": 1,
          "column_count": 2,
          "has_headers": true
        }
      }
    }
  ]
}
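Downstream code typically walks `body_text` and uses the reference offsets to locate each citation marker in its paragraph. The sketch below does this on a small inline document. It assumes that `offset_end` is exclusive, so that a plain Python slice recovers the marker; the sample offsets are chosen to match that assumption, and `iter_citations` is a hypothetical helper.

```python
import json

# Trimmed sample following the structure above; offsets assume an exclusive end.
SAMPLE_JSON = """
{
  "body_text": [
    {
      "id": "p_12345",
      "text": "Paragraph text with citations [1].",
      "head_section": "Introduction",
      "refs": [
        {"type": "bibr", "target": "b1", "text": "[1]",
         "offset_start": 30, "offset_end": 33}
      ]
    }
  ]
}
"""


def iter_citations(document: dict):
    """Yield (section, target, marker) for every in-text reference."""
    for paragraph in document.get("body_text", []):
        for ref in paragraph.get("refs", []):
            marker = paragraph["text"][ref["offset_start"]:ref["offset_end"]]
            yield paragraph["head_section"], ref["target"], marker


document = json.loads(SAMPLE_JSON)
citations = list(iter_citations(document))
print(citations)
```

For a real output file, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened file.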
Usage Examples
# Generate both TEI and JSON outputs
grobid_client --input pdfs/ --output results/ --json processFulltextDocument
# JSON output with coordinates and sentence segmentation
grobid_client --input pdfs/ --output results/ --json --teiCoordinates --segmentSentences processFulltextDocument
# Python library usage
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    json_output=True
)
[!NOTE] When using --json, the --force flag only checks for existing TEI files. If a TEI file is rewritten (due to --force), the corresponding JSON file is automatically rewritten as well.
Markdown Output Format
When using the --markdown flag, the client converts TEI XML output to a clean, readable Markdown format. This
provides:
- Structured Sections: Title, Authors, Affiliations, Publication Date, Fulltext, Annex, and References
- Clean Formatting: Human-readable format suitable for documentation and sharing
- Preserved Content: All text content with proper section organization
- Reference Formatting: Bibliographic references in a readable format
Markdown Structure
The generated Markdown follows this structure:
# Document Title
## Authors
- Author Name 1
- Author Name 2
## Affiliations
- Affiliation 1
- Affiliation 2
## Publication Date
January 1, 2023
## Fulltext
### Introduction
Content of the introduction section...
### Methods
Content of the methods section...
## Annex
### Acknowledgements
Acknowledgement text...
### Competing Interests
Competing interests statement...
## References
**[1]** Paper Title. *Author Name*. *Journal Name* (2023).
**[2]** Another Paper. *Author et al.*. *Conference* (2022).
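Since every top-level part of the output sits under a `##` heading, the file is easy to split back into sections for further processing. The following sketch groups lines by their `##` heading and leaves `###` subsections inside their parent. It assumes the heading layout shown above, and `split_sections` is a hypothetical helper.

```python
import re
from typing import Dict

SAMPLE = """# Document Title
## Authors
- Author Name 1
- Author Name 2
## Fulltext
### Introduction
Content of the introduction section...
"""

# Matches exactly two leading '#' characters, so '#' and '###' lines are ignored.
SECTION_HEADING = re.compile(r"^## (.+)$")


def split_sections(markdown: str) -> Dict[str, str]:
    """Map each '## ' heading to the text that follows it."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        match = SECTION_HEADING.match(line)
        if match:
            current = match.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


sections = split_sections(SAMPLE)
print(list(sections))
```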
Usage Examples
# Generate both TEI and Markdown outputs
grobid_client --input pdfs/ --output results/ --markdown processFulltextDocument
# Markdown output with coordinates and sentence segmentation
grobid_client --input pdfs/ --output results/ --markdown --teiCoordinates --segmentSentences processFulltextDocument
# Python library usage
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    markdown_output=True
)
[!NOTE] When using --markdown, the --force flag only checks for existing TEI files. If a TEI file is rewritten (due to --force), the corresponding Markdown file is automatically rewritten as well.
Header Document Processing
Extracts only document metadata (title, authors, abstract, etc.).
grobid_client --input pdfs/ --output headers/ processHeaderDocument
Reference Processing
Extracts and structures bibliographic references from documents.
grobid_client --input pdfs/ --output refs/ processReferences
Citation List Processing
Parses raw citation strings from text files.
grobid_client --input citations.txt --output parsed/ processCitationList
[!TIP] For citation lists, input should be text files with one citation string per line.
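Citation strings copied from PDFs or spreadsheets often contain stray line breaks, which would split one citation across several input lines. The sketch below normalizes whitespace and writes one citation per line. The helper `write_citation_file` is hypothetical and the citation strings are made-up placeholders.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_citation_file(citations: Iterable[str], path: Path) -> int:
    """Write one whitespace-normalized citation per line; return the count."""
    # Collapse internal newlines and runs of spaces so each citation stays on one line.
    cleaned = [" ".join(citation.split()) for citation in citations]
    cleaned = [citation for citation in cleaned if citation]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


raw_citations = [
    "A. Author and B. Author.\n  An example title. Example Journal, 2020.",
    "   ",
    "C. Author. Another example title. Example Conference, 2021.",
]
citations_path = Path(tempfile.mkdtemp()) / "citations.txt"
count = write_citation_file(raw_citations, citations_path)
print(count, citations_path)
```

The resulting file can be passed as `--input` to processCitationList.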
🧪 Testing
The project includes comprehensive unit and integration tests using pytest.
Running Tests
# Install development dependencies
pip install -e .[dev]
# Run all tests
pytest
# Run with coverage
pytest --cov=grobid_client
# Run specific test file
pytest tests/test_client.py
# Run with verbose output
pytest -v
Test Structure
- tests/test_client.py - Unit tests for the base API client
- tests/test_grobid_client.py - Unit tests for the GROBID client
- tests/test_integration.py - Integration tests with real GROBID server
- tests/conftest.py - Test configuration and fixtures
Continuous Integration
Tests are automatically run via GitHub Actions on:
- Push to main branch
- Pull requests
- Multiple Python versions (3.8-3.13)
📊 Performance
Benchmark results for processing 136 PDFs (3,443 pages total, ~25 pages per PDF) on Intel Core i7-4790K CPU 4.00GHz:
| Concurrency | Runtime (s) | s/PDF | PDF/s |
|---|---|---|---|
| 1 | 209.0 | 1.54 | 0.65 |
| 2 | 112.0 | 0.82 | 1.21 |
| 3 | 80.4 | 0.59 | 1.69 |
| 5 | 62.9 | 0.46 | 2.16 |
| 8 | 55.7 | 0.41 | 2.44 |
| 10 | 55.3 | 0.40 | 2.45 |
Additional Benchmarks
- Header processing: 3.74s for 136 PDFs (36 PDF/s) with n=10
- Reference extraction: 26.9s for 136 PDFs (5.1 PDF/s) with n=10
- Citation parsing: 4.3s for 3,500 citations (814 citations/s) with n=10
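The derived columns follow directly from the runtimes, and the same arithmetic gives the speedup over a single worker. The sketch below recomputes them from the fulltext table above. The speedup and efficiency figures are derived here for illustration and are not part of the original benchmark.

```python
# Runtimes in seconds per concurrency level, copied from the fulltext table.
RUNTIMES = {1: 209.0, 2: 112.0, 3: 80.4, 5: 62.9, 8: 55.7, 10: 55.3}
N_PDFS = 136
BASELINE = RUNTIMES[1]


def metrics(concurrency: int, runtime: float) -> dict:
    """Derive per-PDF time, throughput, speedup, and parallel efficiency."""
    speedup = BASELINE / runtime
    return {
        "s_per_pdf": runtime / N_PDFS,
        "pdf_per_s": N_PDFS / runtime,
        "speedup": speedup,
        "efficiency": speedup / concurrency,
    }


table = {n: metrics(n, runtime) for n, runtime in RUNTIMES.items()}
for n, row in table.items():
    print(f"n={n:>2}  {row['s_per_pdf']:.2f} s/PDF  {row['pdf_per_s']:.2f} PDF/s  "
          f"speedup {row['speedup']:.2f}x  efficiency {row['efficiency']:.0%}")
```

On this dataset the runtime barely changes between n=8 and n=10, so raising concurrency further is unlikely to help on comparable hardware.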
🛠️ Development
Setting Up Development Environment
# Clone the repository
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode with test dependencies
pip install -e .[dev]
# Install pre-commit hooks (optional)
pre-commit install
Creating a New Release
The project uses bump-my-version for version management:
# Install bump-my-version
pip install bump-my-version
# Bump version (patch, minor, or major)
bump-my-version bump patch
# The release will be automatically published to PyPI
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Run the test suite (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
Distributed under the Apache 2.0 License. See LICENSE for more
information.
👥 Authors & Contact
Main Author: Patrice Lopez (patrice.lopez@science-miner.com)
Maintainer: Luca Foppiano (luca@sciencialab.com)