Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature

Specifind

Specifind is a Python toolkit built to automatically extract species occurrence information from unstructured ecological literature. It identifies scientific species names, geographic entities, and the relations connecting them—unlocking occurrence data hidden in text.

The toolkit integrates OCR, layout analysis, Named Entity Recognition, Coreference Resolution, and Relation Extraction into a unified and traceable pipeline. It is powered by a newly developed, expertly annotated dataset of 1,000+ ecological abstracts spanning biogeography, botany, entomology, mycology, and zoology.


🌿 Key Features

  • 📄 Science-OCR for domain-optimized OCR of scientific papers
  • 🔍 NER for scientific species names & geographic entities
  • 🌍 Relation Extraction connecting species to locations
  • 🧠 FastCOREF for high-performance coreference resolution
  • 🧩 Built on spaCy for extensibility, speed, and NLP interoperability
  • 🧱 Full pipeline for text & PDF extraction
  • 🧭 Traceability that links extractions back to the original text

📦 Installation

pip install specifind

🚀 Quick Start

Basic Usage

from specifind import Specifind

s = Specifind()

# Analyze raw text
s.analyze("Upupa epops is an exotic bird. It is widely extended over Spain.")

# or analyze a PDF file
s.analyze_file("path/to/file.pdf")

# Output (for the raw-text example above):
# {
#     "species": [
#         "Upupa epops"
#     ],
#     "geography": [
#         "Spain"
#     ],
#     "occurrences": {
#         "Upupa epops": [
#             "Spain"
#         ]
#     },
#     "evidence": {
#         "Upupa epops": {
#             "Spain": [
#                 "It is widely extended over Spain."
#             ]
#         }
#     }
# }
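
The dictionary above can be flattened into one record per species, location, and evidence sentence, which is convenient when exporting to a table. This is a minimal sketch that assumes the result has exactly the shape shown in the Quick Start output; the `flatten_occurrences` helper is hypothetical and not part of Specifind.

```python
def flatten_occurrences(result):
    """Turn a Specifind-style result dict into (species, location, evidence) rows.

    Assumes the "occurrences" and "evidence" layout shown in the Quick Start output.
    """
    rows = []
    for species, locations in result.get("occurrences", {}).items():
        for location in locations:
            sentences = result.get("evidence", {}).get(species, {}).get(location, [])
            # Keep the occurrence even if no evidence sentence is recorded.
            for sentence in sentences or [None]:
                rows.append((species, location, sentence))
    return rows


# Sample result copied from the Quick Start output above.
result = {
    "species": ["Upupa epops"],
    "geography": ["Spain"],
    "occurrences": {"Upupa epops": ["Spain"]},
    "evidence": {"Upupa epops": {"Spain": ["It is widely extended over Spain."]}},
}

for row in flatten_occurrences(result):
    print(row)
```

Each row keeps the evidence sentence next to the occurrence it supports, so the traceability link survives the export.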

📘 API Reference

analyze_file(...)

Process and extract information from a PDF file.

Parameters

Name        Type  Default                        Description
path        str   -                              Path to the file to analyze.
first_page  int   0                              First page to process (inclusive).
last_page   int   number of pages in the PDF     Last page to process (exclusive).
coref       bool  True                           Enable coreference resolution.
dpi         int   192 if GPU available, else 96  Rendering DPI for PDF pages. Consider lowering the value if you run out of memory (OOM).
return_doc  bool  False                          If True, also return the spaCy Doc object, with annotations available in doc.ents and doc._.relations.

Returns

  • A dictionary with the parsed entities, relations, and evidence.
  • (optional) The spaCy Doc object, if return_doc=True.
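
Because first_page is inclusive and last_page is exclusive, a long PDF can be processed in consecutive page ranges, which may help when memory is tight. The sketch below is only an illustration: `page_batches` and `analyze_in_batches` are hypothetical helpers, the page count is supplied by the caller, and passing the documented parameters as keyword arguments is an assumption.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs; first is inclusive, last is exclusive."""
    for first in range(0, total_pages, batch_size):
        yield first, min(first + batch_size, total_pages)


def analyze_in_batches(analyzer, path, total_pages, batch_size=10, dpi=96):
    """Run analyzer.analyze_file over consecutive page ranges and merge occurrences.

    `analyzer` is expected to be a Specifind instance. The keyword names follow
    the parameter table above; passing them as keywords is an assumption.
    """
    merged = {}
    for first, last in page_batches(total_pages, batch_size):
        result = analyzer.analyze_file(path, first_page=first, last_page=last, dpi=dpi)
        for species, locations in result.get("occurrences", {}).items():
            merged.setdefault(species, set()).update(locations)
    # Sort locations so the merged output is deterministic.
    return {species: sorted(locations) for species, locations in merged.items()}


# Usage (requires specifind to be installed):
# from specifind import Specifind
# occurrences = analyze_in_batches(Specifind(), "path/to/file.pdf", total_pages=42)
```

One trade-off to keep in mind: mentions that refer back to a species named in an earlier batch would presumably not be resolved, so larger batches are likely to be preferable when memory allows.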

analyze(...)

Process and extract information directly from raw text.

Parameters

Name        Type  Default  Description
text        str   -        Raw text to analyze.
coref       bool  True     Enable coreference resolution.
return_doc  bool  False    If True, also return the spaCy Doc object with the annotations.

Returns

  • A dictionary with the parsed entities, relations, and evidence.
  • (optional) The spaCy Doc object, if return_doc=True.
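
In the Quick Start text, the location appears in a sentence that refers to the species only as "It", so the coref flag presumably decides whether that occurrence is found. One way to check its effect on your own text is to run the analysis twice and compare the pairs. The helpers below are hypothetical and assume only the documented `analyze(text, coref=...)` signature and the "occurrences" layout from the Quick Start output.

```python
def occurrence_pairs(result):
    """Return the set of (species, location) pairs in a result dictionary."""
    return {
        (species, location)
        for species, locations in result.get("occurrences", {}).items()
        for location in locations
    }


def pairs_gained_by_coref(analyze, text):
    """Pairs found with coref=True but not with coref=False.

    `analyze` is any callable with the signature documented above,
    for example the bound method Specifind().analyze.
    """
    with_coref = occurrence_pairs(analyze(text, coref=True))
    without_coref = occurrence_pairs(analyze(text, coref=False))
    return with_coref - without_coref


# Usage (requires specifind to be installed):
# from specifind import Specifind
# text = "Upupa epops is an exotic bird. It is widely extended over Spain."
# print(pairs_gained_by_coref(Specifind().analyze, text))
```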

🚀 Benchmarks

Named Entity Recognition (NER)

Species & Locations

🔍 Match Type       🎯 Precision  📈 Recall  🏆 F1
Exact               0.904         0.935      0.919
Partial/Intersect   0.938         0.969      0.958
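
The two match types can be illustrated on character spans. The source does not spell out how Specifind's evaluation defines them, so the sketch below follows a common convention as an assumption: an exact match requires identical span boundaries, while a partial match only requires the spans to overlap. Entity labels are ignored for brevity, and all function names are hypothetical.

```python
def exact_match(pred, gold):
    """True when both spans have identical (start, end) offsets."""
    return pred == gold


def partial_match(pred, gold):
    """True when the half-open spans [start, end) share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]


def precision_recall(predicted, gold, match):
    """Precision and recall of predicted spans against gold spans under `match`."""
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy spans over the Quick Start sentence:
# gold "Upupa epops" at (0, 11) and gold "Spain" at (58, 63).
gold = [(0, 11), (58, 63)]
predicted = [(0, 5), (58, 63)]  # the first prediction covers only "Upupa"

print(precision_recall(predicted, gold, exact_match))    # (0.5, 0.5)
print(precision_recall(predicted, gold, partial_match))  # (1.0, 1.0)
```

A truncated species name counts as a miss under exact matching but as a hit under partial matching, which is why the partial scores in a table like this are never lower than the exact ones.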

Relation Extraction (RE)

Occurrences

🎯 Precision  📈 Recall  🏆 F1
0.964         0.993      0.978
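
F1 is conventionally the harmonic mean of precision and recall. The short check below reproduces the Relation Extraction figure from the precision and recall in the table; it assumes a single aggregate computation, since the source does not state how the scores were averaged.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Relation Extraction row from the table above.
print(round(f1_score(0.964, 0.993), 3))  # 0.978
```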

🤝 Contributing

Contributions, issue reports, and feature suggestions are welcome. Feel free to open a Pull Request or discussion.


📄 License

Specifind is licensed under AGPL-3.0. See LICENSE for details.


📚 Citing Specifind

If you use Specifind in your research, please cite our pre-print:

BibTeX

@article{specifind2025,
  title   = {Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature},
  author  = {Golomb Durán, Tomas and Díaz, Anna and Barroso, María and Far, Antoni Josep and Roldán, Alejandro and Cancellario, Tommaso},
  year    = {2025},
  journal = {bioRxiv},
  url     = {https://github.com/ToGo347/specifind}
}
