GrobidArticleExtractor is a Python package designed to extract and organize content from scientific papers in PDF format.
GrobidArticleExtractor
This Python tool extracts content from PDF files using GROBID and organizes it by sections. It provides a structured way to extract both metadata and content from academic papers and other structured documents.
Features
- Direct PDF processing using GROBID API
- Metadata extraction (title, authors, abstract, publication date)
- Hierarchical section organization with subsections
Prerequisites
- Install GROBID:
  docker pull lfoppiano/grobid:0.8.0
  docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0
  Setting JAVA_OPTS="-XX:+UseZGC" helps to resolve the following error on macOS:
  [thread 44 also had an error]
  A fatal error has been detected by the Java Runtime Environment:
  SIGSEGV (0xb) at pc=0x00007ffffef8ad07, pid=8, tid=47
  JRE version: OpenJDK Runtime Environment (17.0.2+8) (build 17.0.2+8-86)
  Java VM: OpenJDK 64-Bit Server VM (17.0.2+8-86, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
  Problematic frame:
  [thread 41 also had an error]
  [thread 45 also had an error]
  [thread 46 also had an error]
- Installation: install this package via:
pip install GrobidArticleExtractor
Or get the newest development version via:
pip install git+https://github.com/sensein/GrobidArticleExtractor.git
Note: If upgrading from a previous version, you may need to reinstall the package to ensure the CLI command is properly installed:
pip uninstall GrobidArticleExtractor
pip install GrobidArticleExtractor
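Before running the tool, it can help to confirm that the GROBID container from the prerequisites is reachable. The following is a minimal sketch using only the standard library. It assumes the server exposes GROBID's `/api/isalive` liveness endpoint at the default URL; the helper name `is_grobid_alive` is hypothetical and not part of this package.

```python
import urllib.error
import urllib.request


def is_grobid_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the GROBID service answers its liveness endpoint.

    Assumption: the server exposes GET /api/isalive, as the GROBID
    service does. This helper is illustrative, not part of the package.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False
```

Calling `is_grobid_alive()` with no arguments checks the default server; pass a different `base_url` for a custom deployment.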
Usage
Command Line Interface
The tool provides a user-friendly command-line interface for batch processing PDF files:
# Basic usage (processes PDFs from 'pdfs' directory)
grobidextractor
# Process PDFs from a specific directory
grobidextractor path/to/pdfs
# Specify custom output directory
grobidextractor path/to/pdfs -o path/to/output
# Use custom GROBID server and disable content preview
grobidextractor path/to/pdfs --grobid-url http://custom:8070 --no-preview
Available options:
$ grobidextractor --help
Usage: grobidextractor [OPTIONS] [INPUT_FOLDER]
Process PDF files from INPUT_FOLDER and extract their content using GROBID.
The extracted content is saved as JSON files in the output directory.
Each JSON file is named after its source PDF file.
Options:
  -o, --output-dir PATH     Directory to save extracted JSON files (default: output)
  -g, --grobid-url TEXT     GROBID service URL (default: http://localhost:8070)
  --preview / --no-preview  Show preview of extracted content (default: True)
  --help                    Show this message and exit.
Example:
grobidextractor path/to/pdfs -o path/to/output
Python API Usage
You can also use the tool programmatically in your Python code:
from GrobidArticleExtractor.app import GrobidArticleExtractor
# Initialize extractor (default GROBID URL: http://localhost:8070)
extractor = GrobidArticleExtractor()
# Process a PDF file
xml_content = extractor.process_pdf("path/to/your/paper.pdf")
if xml_content:
    # Extract and organize content
    result = extractor.extract_content(xml_content)
    # Access metadata
    print(result['metadata'])
    # Access sections
    for section in result['sections']:
        print(section['heading'])
        if 'content' in section:
            print(section['content'])
Custom GROBID server:
extractor = GrobidArticleExtractor(grobid_url="http://your-grobid-server:8070")
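The same two calls can be combined into a small batch loop that mirrors what the CLI describes: one JSON file per PDF, named after its source. This is only a sketch under stated assumptions, not the package's own CLI implementation. The function name `process_folder`, the `*.pdf` glob, and the `<stem>.json` naming are illustrative choices; the only package behavior relied on is `process_pdf()` returning a falsy value on failure and `extract_content()` returning a JSON-serializable dict.

```python
import json
from pathlib import Path


def process_folder(extractor, input_folder, output_dir="output"):
    """Extract every PDF in input_folder and write one JSON file per PDF.

    `extractor` is any object exposing process_pdf() and extract_content(),
    for example a GrobidArticleExtractor instance. Returns the list of
    JSON paths that were written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(input_folder).glob("*.pdf")):
        xml_content = extractor.process_pdf(str(pdf_path))
        if not xml_content:
            continue  # Skip files for which no XML came back
        result = extractor.extract_content(xml_content)
        # Assumed naming scheme: paper.pdf -> paper.json
        json_path = out / (pdf_path.stem + ".json")
        json_path.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(json_path)
    return written
```

A typical call would be `process_folder(GrobidArticleExtractor(), "pdfs", "output")`.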
Output Structure
The extracted content is organized as follows:
{
    'metadata': {
        'title': 'Paper Title',
        'authors': ['Author 1', 'Author 2'],
        'abstract': 'Paper abstract...',
        'publication_date': '2023'
    },
    'sections': [
        {
            'heading': 'Introduction',
            'content': ['Paragraph 1...', 'Paragraph 2...'],
            'subsections': [
                {
                    'heading': 'Background',
                    'content': ['Subsection content...']
                }
            ]
        },
        # More sections...
    ]
}
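Because sections can nest, a recursive walk is a convenient way to visit every heading. The sketch below assumes only the keys shown in the structure above (`heading`, `content`, `subsections`) and treats `content` and `subsections` as optional; the helper name `iter_sections` is hypothetical.

```python
def iter_sections(sections, depth=0):
    """Yield (depth, heading, paragraphs) for each section, depth-first."""
    for section in sections:
        yield depth, section.get("heading", ""), section.get("content", [])
        # Recurse into nested subsections, if any
        yield from iter_sections(section.get("subsections", []), depth + 1)


# Sample data in the documented shape
result = {
    "metadata": {"title": "Paper Title"},
    "sections": [
        {
            "heading": "Introduction",
            "content": ["Paragraph 1...", "Paragraph 2..."],
            "subsections": [
                {"heading": "Background", "content": ["Subsection content..."]}
            ],
        }
    ],
}

# Print an indented outline of the document
for depth, heading, paragraphs in iter_sections(result["sections"]):
    print("  " * depth + heading)
```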
Project Structure
The project is organized into two main files:
- app.py: Contains the core GrobidArticleExtractor class with all the PDF processing and content extraction functionality
- cli.py: Contains the command-line interface implementation using Click
Error Handling
The tool includes comprehensive error handling for common scenarios:
- PDF file not found
- GROBID service unavailable
- XML parsing errors
- Invalid content structure
All errors are logged with appropriate messages using Python's logging module.
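To see those log messages in your own script, configure Python's logging before calling the extractor. The sketch below also wraps the two API calls defensively. Whether the package raises an exception or returns an empty value in each scenario is not stated here, so the wrapper handles both; the names `extract_or_none` and `extraction_example` are hypothetical.

```python
import logging

# Show INFO and above from all loggers, including the package's own
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("extraction_example")  # hypothetical logger name


def extract_or_none(extractor, pdf_path):
    """Return the extracted dict, or None if any step fails.

    `extractor` is any object exposing process_pdf() and extract_content().
    """
    try:
        xml_content = extractor.process_pdf(pdf_path)
    except Exception:
        logger.exception("Processing failed for %s", pdf_path)
        return None
    if not xml_content:
        logger.warning("No XML returned for %s", pdf_path)
        return None
    try:
        return extractor.extract_content(xml_content)
    except Exception:
        logger.exception("Content extraction failed for %s", pdf_path)
        return None
```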
Contributing
Feel free to submit issues and enhancement requests!
License
MIT License