A Python OAI-PMH client for harvesting repositories with typed access and DSpace export
Kongin
Kongin is a Python client for OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). It provides typed access to metadata records and easy export to DSpace REST API format.
Features
- Typed Access: Access metadata with record.title, record.creators instead of nested dicts
- Any XML Format: Handles oai_dc, xoai, oaire, simple_xml, and any other schema
- DSpace Export: Convert records to DSpace REST API JSON format (versions 7, 8, 9, 10+)
- Automatic Pagination: Iterate over all records without manual token handling
- Web Interface: Streamlit app for visual harvesting and DSpace upload
- Simple API: Clean, Pythonic interface
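To illustrate what typed access means in practice, here is a minimal sketch of a record wrapper that exposes repeated metadata fields as properties. The class name `SketchRecord` and its internals are hypothetical and chosen only for illustration; Kongin's own Record class may be implemented differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a typed record; not Kongin's actual class."""

    identifier: str
    # Each key maps to a list because OAI-PMH metadata fields may repeat.
    fields: dict[str, list[str]] = field(default_factory=dict)

    @property
    def titles(self) -> list[str]:
        return self.fields.get("dc:title", [])

    @property
    def title(self) -> str | None:
        # First title, or None when the record carries no title.
        return self.titles[0] if self.titles else None

    @property
    def creators(self) -> list[str]:
        return self.fields.get("dc:creator", [])


record = SketchRecord(
    identifier="oai:repo.example.org:12345",
    fields={
        "dc:title": ["A Study of Harvesting"],
        "dc:creator": ["Doe, Jane", "Roe, Richard"],
    },
)
print(record.title)
print(record.creators)
```

Compared with indexing into a nested dict, the properties give one obvious place to handle missing or repeated values.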
Installation
# Core library
pip install kongin
# With web interface
pip install "kongin[web]"
Usage
Kongin can be used as a Python library or through a web interface.
Option 1: Python Library (Simple)
import kongin
# One-liner harvest
records = kongin.harvest('https://repositorio.example.org/oai')
for r in records:
    print(r.title, r.creators)
# Export to DSpace JSON
kongin.export_to_dspace(records, 'output.json')
Option 2: Python Library (Full Control)
from kongin import OAIClient
# Connect to repository
client = OAIClient('https://repositorio.example.org/oai')
# Iterate over records with typed access
for record in client.list_records(metadata_prefix='oai_dc'):
    print(f"Title: {record.title}")
    print(f"Authors: {record.creators}")
    print(f"Abstract: {record.description}")
Option 3: Command Line
# Harvest and save to JSON
kongin harvest https://repositorio.example.org/oai -o records.json
# Get repository info
kongin identify https://repositorio.example.org/oai
# List available sets
kongin sets https://repositorio.example.org/oai
Option 4: Web Interface
streamlit run app.py
The web interface allows you to:
- Enter OAI-PMH URL and harvest records
- View results in a table with metrics
- Export to JSON (DSpace format) or CSV
- Upload directly to DSpace collections
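As a rough idea of what a CSV export of harvested records could look like, the sketch below flattens records into rows with the standard library. The column layout, the `records_to_csv` helper, and the choice to join repeated values with "; " are assumptions made for this example, not a description of the web interface's actual output.

```python
import csv
import io


def records_to_csv(records, columns=("identifier", "title", "creators")):
    """Serialize dict-like records to CSV text; list values are joined with '; '."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    for record in records:
        row = []
        for column in columns:
            value = record.get(column, "")
            if isinstance(value, (list, tuple)):
                value = "; ".join(value)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


# Plain dicts stand in for harvested records here.
sample = [
    {"identifier": "oai:repo:1", "title": "First", "creators": ["Doe, Jane", "Roe, Richard"]},
    {"identifier": "oai:repo:2", "title": "Second", "creators": []},
]
csv_text = records_to_csv(sample)
print(csv_text)
```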
Library Examples
Basic Harvesting
from kongin import OAIClient
client = OAIClient('https://repositorio.example.org/oai')
# List available sets
sets = client.list_sets()
for s in sets:
    print(f"{s['set_spec']}: {s['set_name']}")
# List metadata formats
formats = client.list_metadata_formats()
for f in formats:
    print(f"{f['prefix']}: {f['namespace']}")
# Get repository info
info = client.identify()
print(f"Repository: {info['repository_name']}")
Harvesting Records
# Harvest all records from a set
for record in client.list_records(
    metadata_prefix='simple_xml',
    set_spec='articles'
):
    print(record.title)
    print(record.creators)
# With date filtering
for record in client.list_records(
    metadata_prefix='oai_dc',
    from_date='2024-01-01',
    until_date='2024-12-31'
):
    process(record)
# Get a single record
record = client.get_record(
    identifier='oai:repo.example.org:12345',
    metadata_prefix='oai_dc'
)
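For readers curious about what travels over the wire, the sketch below builds the kind of ListRecords URL that the OAI-PMH protocol defines for the filters used above. The query parameter names (`verb`, `metadataPrefix`, `set`, `from`, `until`) come from the protocol itself; the helper `build_list_records_url` is hypothetical, and how Kongin assembles its requests internally is not documented here.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix, set_spec=None,
                           from_date=None, until_date=None):
    """Return an OAI-PMH ListRecords URL for the given selective-harvest filters."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


dated_url = build_list_records_url(
    "https://repositorio.example.org/oai",
    "oai_dc",
    from_date="2024-01-01",
    until_date="2024-12-31",
)
set_url = build_list_records_url(
    "https://repositorio.example.org/oai",
    "simple_xml",
    set_spec="articles",
)
print(dated_url)
print(set_url)
```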
Accessing Metadata
# Typed properties for common fields
print(record.title) # First title
print(record.titles) # All titles
print(record.creators) # All authors
print(record.description) # First description/abstract
print(record.date) # First date
print(record.subjects) # All subjects
print(record.identifiers) # All identifiers (DOI, URI, etc.)
# Access any field directly
volume = record.metadata.get('oaire:citationVolume')
issue = record.metadata.get('oaire:citationIssue')
custom = record.metadata.get('custom:field')
# Get all values for a field
all_rights = record.metadata.get_all('dc:rights')
# Check if field exists
if 'dcterms:abstract' in record.metadata:
    print(record.metadata.get('dcterms:abstract'))
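Keys such as 'dc:rights' combine a namespace prefix with an element name. As a sketch of how such keys can be derived from an oai_dc payload, the example below parses a sample document with the standard library and groups repeated elements into lists. The `parse_fields` helper and the prefix table are assumptions for illustration; Kongin's own parser may work differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A Study of Harvesting</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:rights>CC BY 4.0</dc:rights>
</oai_dc:dc>"""

# Namespace URI -> short prefix used when building keys.
PREFIXES = {"http://purl.org/dc/elements/1.1/": "dc"}


def parse_fields(xml_text):
    """Map 'prefix:element' keys to the list of text values found in the payload."""
    fields = defaultdict(list)
    for element in ET.fromstring(xml_text):
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree spells qualified tags as "{namespace-uri}localname".
            uri, _, local = tag[1:].partition("}")
            key = f"{PREFIXES.get(uri, uri)}:{local}"
        else:
            key = tag
        text = (element.text or "").strip()
        if text:
            fields[key].append(text)
    return dict(fields)


fields = parse_fields(SAMPLE)
print(fields["dc:creator"])
print("dcterms:abstract" in fields)
```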
Export to DSpace REST API
DSpace introduced a new REST API in version 7, replacing the legacy REST API of earlier releases. Kongin supports DSpace 7, 8, 9, and 10, as well as future versions that maintain API compatibility.
from kongin import OAIClient, DSpaceExporter
client = OAIClient('https://repositorio.example.org/oai')
# Export single record
record = client.get_record('oai:repo:123', 'oai_dc')
dspace_item = record.to_dspace()
# Result is JSON compatible with POST /api/core/items
# Export multiple records
records = list(client.list_records(set_spec='theses'))
exporter = DSpaceExporter()
items = exporter.export_records(records)
# Save to JSON file
exporter.save_json(records, 'dspace_import.json')
# Custom field mapping
custom_mapping = {
    'local:category': 'local.category',
    'oaire:citationVolume': 'local.citation.volume',
}
exporter = DSpaceExporter(custom_mapping)
items = exporter.export_records(records)
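To make the effect of a field mapping concrete, the sketch below converts a dict of harvested fields into an item body shaped like the examples in the DSpace 7 REST contract (a `metadata` object whose keys are DSpace field names and whose values are lists of `{"value": ...}` entries). The default mapping and the `to_dspace_item` helper are assumptions for illustration; the JSON that DSpaceExporter actually produces may differ in detail.

```python
import json

# Illustrative default mapping from harvested keys to DSpace fields (an assumption).
DEFAULT_MAPPING = {
    "dc:title": "dc.title",
    "dc:creator": "dc.contributor.author",
    "dc:description": "dc.description.abstract",
    "dc:subject": "dc.subject",
}


def to_dspace_item(fields, custom_mapping=None):
    """Build a DSpace-style item dict; custom_mapping entries override the defaults."""
    mapping = {**DEFAULT_MAPPING, **(custom_mapping or {})}
    metadata = {}
    for source_key, values in fields.items():
        target = mapping.get(source_key)
        if target is None:
            continue  # Unmapped fields are skipped in this sketch.
        metadata.setdefault(target, []).extend({"value": v} for v in values)
    titles = fields.get("dc:title", [])
    return {
        "name": titles[0] if titles else "",
        "metadata": metadata,
        "inArchive": True,
        "discoverable": True,
        "withdrawn": False,
        "type": "item",
    }


item = to_dspace_item(
    {
        "dc:title": ["A Study of Harvesting"],
        "dc:creator": ["Doe, Jane"],
        "oaire:citationVolume": ["12"],
        "custom:field": ["ignored"],
    },
    custom_mapping={"oaire:citationVolume": "local.citation.volume"},
)
print(json.dumps(item, indent=2))
```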
Upload to DSpace
from kongin import OAIClient, DSpaceClient
# Harvest records
oai = OAIClient('https://source-repo.org/oai')
records = list(oai.list_records(metadata_prefix='oai_dc', set_spec='articles'))
# Upload to DSpace
dspace = DSpaceClient(
    base_url='https://dspace.example.org',
    email='admin@example.org',
    password='password'
)
# List collections
collections = dspace.list_collections()
for c in collections:
    print(f"{c['name']} ({c['uuid']})")
# Upload to a collection
collection_id = collections[0]['uuid']
for record in records:
    dspace.create_item(record, collection_id)
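When uploading many records, one bad record should not necessarily abort the whole run. The sketch below shows one way to collect failures and continue. `StubDSpaceClient` is a hypothetical stand-in that only mirrors the `create_item(record, collection_id)` call shape shown above so the example runs without a server; which exceptions the real DSpaceClient raises is not documented here, hence the deliberately broad `except`.

```python
class StubDSpaceClient:
    """Hypothetical stand-in mirroring create_item(record, collection_id)."""

    def __init__(self):
        self.created = []

    def create_item(self, record, collection_id):
        # Simulate a server-side rejection for records without a title.
        if not record.get("title"):
            raise ValueError("record has no title")
        self.created.append((collection_id, record["title"]))


def upload_all(client, records, collection_id):
    """Try every record; return a list of (index, error message) for failures."""
    failures = []
    for index, record in enumerate(records):
        try:
            client.create_item(record, collection_id)
        except Exception as exc:  # Broad on purpose: real error types are unknown.
            failures.append((index, str(exc)))
    return failures


stub = StubDSpaceClient()
sample_records = [{"title": "First"}, {"title": ""}, {"title": "Third"}]
failures = upload_all(stub, sample_records, "collection-uuid-placeholder")
print(f"created={len(stub.created)} failed={failures}")
```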
Manual Pagination
# Get first page
page = client.list_records_page(metadata_prefix='oai_dc')
print(f"Total records: {page.complete_list_size}")
for record in page:
    process(record)
# Get next pages
while page.has_more:
    page = client.resume(page.resumption_token)
    for record in page:
        process(record)
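Under the hood, OAI-PMH paginates with a `resumptionToken` element: each incomplete response carries a token to send with the next request, and the final page carries an empty token element. The sketch below walks two canned responses to show that loop. The `fetch`, `parse_page`, and `harvest_all` helpers are hypothetical and use in-memory XML instead of HTTP; this is not Kongin's implementation.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Canned responses keyed by the token that requests them (None = first request).
PAGES = {
    None: (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        "<record><header><identifier>oai:repo:1</identifier></header></record>"
        "<record><header><identifier>oai:repo:2</identifier></header></record>"
        '<resumptionToken completeListSize="3" cursor="0">page-2</resumptionToken>'
        "</ListRecords></OAI-PMH>"
    ),
    "page-2": (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        "<record><header><identifier>oai:repo:3</identifier></header></record>"
        '<resumptionToken completeListSize="3" cursor="2"/>'
        "</ListRecords></OAI-PMH>"
    ),
}


def fetch(token=None):
    """Stand-in for an HTTP request; returns the canned page for this token."""
    return PAGES[token]


def parse_page(xml_text):
    """Return (record identifiers, next token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = (token_el.text or "").strip() if token_el is not None else ""
    return identifiers, token or None


def harvest_all():
    """Follow resumption tokens until the repository signals the last page."""
    collected, token = [], None
    while True:
        identifiers, token = parse_page(fetch(token))
        collected.extend(identifiers)
        if token is None:
            return collected


all_identifiers = harvest_all()
print(all_identifiers)
```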
Configuration
client = OAIClient(
    url='https://repositorio.example.org/oai',
    timeout=30,             # Request timeout in seconds
    max_retries=3,          # Retry failed requests
    http_method='GET',      # or 'POST'
    requests_args={         # Passed to requests library
        'verify': False,    # Disables TLS certificate verification (insecure; testing only)
        'headers': {'User-Agent': 'MyHarvester/1.0'}
    }
)
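As background on what a `max_retries` setting typically implies, here is a generic retry loop with exponential backoff. It is a sketch only: the `with_retries` helper, the backoff schedule, and the choice to retry on `OSError` are assumptions, and Kongin's actual retry policy may differ.

```python
import time


def with_retries(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on OSError, wait and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return operation()
        except OSError:
            if attempt >= max_retries:
                raise  # Retries exhausted: surface the last error.
            sleep(base_delay * 2 ** attempt)  # Delay doubles after each failure.
            attempt += 1


calls = {"count": 0}


def flaky_request():
    """Fails twice with a network-style error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "<OAI-PMH/>"


# Record the delays instead of sleeping so the demo finishes instantly.
delays = []
result = with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test, since the waits can be captured rather than performed.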
API Reference
OAIClient
- identify() - Get repository information
- list_sets() - List available sets
- list_metadata_formats() - List supported formats
- get_record(identifier, metadata_prefix) - Get single record
- list_records(...) - Iterate all records (auto-pagination)
- list_records_page(...) - Get single page (manual pagination)
- list_identifiers(...) - Iterate record headers only
- harvest(**params) - Raw OAI-PMH request
Record
- .identifier - OAI identifier
- .datestamp - Last modified date
- .set_specs - Sets this record belongs to
- .deleted - True if record was deleted
- .metadata - Metadata container
- .title, .creators, .description, etc. - Typed properties
- .to_dict() - Export as dictionary
- .to_dspace() - Export for DSpace REST API
Metadata
- .get(key) - Get first value
- .get_all(key) - Get all values as list
- .to_dict() - Export as dictionary
DSpaceExporter
- .export_record(record) - Convert single record
- .export_records(records) - Convert multiple records
- .to_json(records) - Export as JSON string
- .save_json(records, filepath) - Save to file
DSpaceClient
- .list_collections() - Get available collections
- .list_communities() - Get available communities
- .create_item(record, collection_id) - Create item in collection
- .upload_records(records, collection_id) - Upload multiple records
DSpace Compatibility
This library is compatible with DSpace REST API versions:
| DSpace Version | API | Status |
|---|---|---|
| DSpace 7.x | REST API v7 | Supported |
| DSpace 8.x | REST API v7 | Supported |
| DSpace 9.x | REST API v7 | Supported |
| DSpace 10.x | REST API v7 | Supported |
Note: DSpace versions prior to 7 (the XMLUI/JSPUI releases) exposed a different, legacy REST API and are not supported.
License
MIT License - see LICENSE file.
Author
Acknowledgments
This project was developed with assistance from Claude (Anthropic). The code was reviewed, tested, and validated by the author. Kongin builds upon an earlier prototype to create a more robust and feature-complete OAI-PMH client.