specifind

Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature

Specifind is a Python toolkit built to automatically extract species occurrence information from unstructured ecological literature. It identifies scientific species names, geographic entities, and the relations connecting them—unlocking occurrence data hidden in text.

The toolkit integrates OCR, layout analysis, Named Entity Recognition, Coreference Resolution, and Relation Extraction into a unified and traceable pipeline. It is powered by a newly developed, expertly annotated dataset of 1,000+ ecological abstracts spanning biogeography, botany, entomology, mycology, and zoology.

🌿 Key Features

📄 Science-OCR for domain-optimized OCR of scientific papers
🔍 NER for scientific species names & geographic entities
🌍 Relation Extraction connecting species to locations
🧠 FastCOREF for high-performance coreference resolution
🧩 Built on spaCy for extensibility, speed, and NLP interoperability
🧱 Full pipeline for text & PDF extraction
🧭 Traceability that links extractions back to the original text

📦 Installation

pip install specifind
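To confirm that the install succeeded, the installed version can be read back with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper name, not part of Specifind.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("specifind")
print(f"specifind {found}" if found else "specifind is not installed")
```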

🚀 Quick Start

Basic Usage

from specifind import Specifind

s = Specifind()

# Analyze raw text
result = s.analyze("Upupa epops is an exotic bird. It is widely extended over Spain.")

# or analyze a PDF file
result = s.analyze_file("path/to/file.pdf")

# Output (for the text example):
# {
#     "species": [
#         "Upupa epops"
#     ],
#     "geography": [
#         "Spain"
#     ],
#     "occurrences": {
#         "Upupa epops": [
#             "Spain"
#         ]
#     },
#     "evidence": {
#         "Upupa epops": {
#             "Spain": [
#                 "It is widely extended over Spain."
#             ]
#         }
#     }
# }
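The nested output can be flattened into one row per species, location, and evidence sentence, which is often easier to export or review. The sketch below works on a dictionary copied from the documented output, so it runs without Specifind installed; `occurrence_rows` is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple

# Sample result copied from the documented output above.
result = {
    "species": ["Upupa epops"],
    "geography": ["Spain"],
    "occurrences": {"Upupa epops": ["Spain"]},
    "evidence": {"Upupa epops": {"Spain": ["It is widely extended over Spain."]}},
}


def occurrence_rows(result: dict) -> List[Tuple[str, str, str]]:
    """Flatten a result into (species, location, evidence sentence) rows."""
    rows = []
    for species, locations in result["occurrences"].items():
        for location in locations:
            sentences = result["evidence"].get(species, {}).get(location, [])
            # Keep the pair even when no evidence sentence is attached.
            for sentence in sentences or [""]:
                rows.append((species, location, sentence))
    return rows


for row in occurrence_rows(result):
    print(" | ".join(row))
```

Each row keeps the evidence sentence next to the occurrence it supports, so the traceability link survives the export.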

📘 API Reference

`class Specifind(...)`

Initializes the OCR engine.

Parameter	Type	Default	Description
use_gpu	bool	True	If `True`, uses the GPU if available. If `False`, forces CPU usage, which may be slower but is more stable on some systems and can avoid memory issues.
debug	bool	False	If `True`, opens the displaCy visualizer with the results.

`analyze_file(...)`

Process and extract information from a PDF file.

Parameters

Name	Type	Default	Description
`path`	str	—	Path to the file to analyze.
`first_page`	int	0	First page to process (inclusive).
`last_page`	int	Number of pages in the PDF	Last page to process (exclusive).
`coref`	bool	True	Enable coreference resolution.
`dpi`	int	`192` if GPU available else `96`	Rendering DPI for PDF pages. Consider lowering the value if running out of memory (OOM).
`return_doc`	bool	False	If `True`, returns the spaCy `Doc` object, with the annotations available in `doc.ents` and `doc._.relations`.
`store_ocr`	bool	True	If `True`, saves the OCR results to a `.txt` file.

Returns

A dictionary with the parsed entities, relations, and evidence.
Optionally, the spaCy `Doc` object (if `return_doc=True`).
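Because `first_page` is inclusive and `last_page` is exclusive, a long PDF can be processed in consecutive page windows, which may help when memory is tight. The sketch below assumes `analyze_file` accepts the documented parameters as keyword arguments and that the caller already knows the page count; `page_batches` and `analyze_in_batches` are hypothetical helpers, not part of Specifind.

```python
from typing import Iterator, List, Tuple


def page_batches(n_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) pairs: first inclusive, last exclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(0, n_pages, batch_size):
        yield first, min(first + batch_size, n_pages)


def analyze_in_batches(extractor, path: str, n_pages: int, batch_size: int = 5) -> List:
    """Call extractor.analyze_file once per page window and collect the results.

    `extractor` is expected to be a Specifind instance. `n_pages` must be
    supplied by the caller (assumption: obtained from any PDF library).
    """
    return [
        extractor.analyze_file(path=path, first_page=first, last_page=last)
        for first, last in page_batches(n_pages, batch_size)
    ]


print(list(page_batches(12, 5)))  # [(0, 5), (5, 10), (10, 12)]
```

One likely trade-off: coreference chains that cross a window boundary would presumably not be resolved, since each window is analyzed on its own.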

`analyze(...)`

Process and extract information directly from raw text.

Parameters

Name	Type	Default	Description
`text`	str	—	Raw text to analyze.
`coref`	bool	True	Enable coreference resolution.
`return_doc`	bool	False	If `True`, returns the spaCy `Doc` object with the annotations.

Returns

A dictionary with the parsed entities, relations, and evidence.
Optionally, the spaCy `Doc` object (if `return_doc=True`).
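When many abstracts are analyzed one by one, the per-text dictionaries can be combined into a single occurrence map. The sketch below relies only on the documented `"occurrences"` key; the input dictionaries are hand-written stand-ins rather than real Specifind output, and `merge_occurrences` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def merge_occurrences(results: Iterable[dict]) -> Dict[str, List[str]]:
    """Union the "occurrences" maps of several results, keeping first-seen order."""
    merged: Dict[str, List[str]] = {}
    for result in results:
        for species, locations in result.get("occurrences", {}).items():
            seen = merged.setdefault(species, [])
            for location in locations:
                if location not in seen:
                    seen.append(location)
    return merged


# Hand-written stand-ins for what analyze() might return on two abstracts.
results = [
    {"occurrences": {"Upupa epops": ["Spain"]}},
    {"occurrences": {"Upupa epops": ["Spain", "Portugal"], "Parus major": ["France"]}},
]
print(merge_occurrences(results))
```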

🚀 Benchmarks

Named Entity Recognition (NER)

Species & Locations

🔍 Match Type	🎯 Precision	📈 Recall	🏆 F1
Exact	0.904	0.935	0.919
Partial/Intersect	0.938	0.969	0.958
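The difference between the two match types can be illustrated with a small scoring sketch. It assumes that "Exact" requires identical span boundaries and label, and that "Partial/Intersect" accepts any character overlap with the same label; the evaluation code actually used for the table may differ. The labels `SPECIES` and `LOCATION` and the toy spans are illustrative only.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], pred: List[Span], partial: bool = False):
    """Return (precision, recall, F1) under exact or overlap matching."""
    match = overlaps if partial else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy spans for "Upupa epops is an exotic bird. It is widely extended over Spain."
gold = [(0, 11, "SPECIES"), (58, 63, "LOCATION")]
pred = [(0, 5, "SPECIES"), (58, 63, "LOCATION")]  # first span is truncated

print(score(gold, pred))                # exact: (0.5, 0.5, 0.5)
print(score(gold, pred, partial=True))  # partial: (1.0, 1.0, 1.0)
```

The truncated species span counts as a miss under exact matching but as a hit under overlap matching, which is why partial scores are never lower than exact ones.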

Relation Extraction (RE)

Occurrences

🎯 Precision	📈 Recall	🏆 F1
0.964	0.993	0.978

🤝 Contributing

Contributions, issue reports, and feature suggestions are welcome. Feel free to open a Pull Request or discussion.

📄 License

Specifind is licensed under AGPL-3.0. See LICENSE for details.

📚 Citing Specifind

If you use Specifind in your research, please cite our pre-print:

BibTeX

@article{specifind2025,
  title   = {Specifind: A Natural Language Processing Tool for Automating Species Occurrence (Re-)Discovery from Scientific Literature},
  author  = {Golomb Durán, Tomas and Díaz, Anna and Barroso, María and Far, Antoni Josep and Roldán, Alejandro and Cancellario, Tommaso},
  year    = {2025},
  journal = {bioRxiv},
  url     = {https://github.com/ToGo347/specifind}
}

Release history

0.2.0 (this version): Jan 14, 2026
0.1.0: Dec 5, 2025

Download files

Source Distribution

specifind-0.2.0.tar.gz (581.5 kB), uploaded Jan 14, 2026, Source

Built Distribution

specifind-0.2.0-py3-none-any.whl (743.5 kB), uploaded Jan 14, 2026, Python 3

File details

Details for the file specifind-0.2.0.tar.gz.

File metadata

Download URL: specifind-0.2.0.tar.gz
Upload date: Jan 14, 2026
Size: 581.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for specifind-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`af09598007ffa75c81481871d64a13202cbc75806baa83c09725584d70f7ffa6`
MD5	`8336f80fb44acf7d00d31e14021d34c6`
BLAKE2b-256	`cb3370eecadc1bed1279666f9cac064675bfa7d1ff39acf7e01c8b90b2b6c74a`

File details

Details for the file specifind-0.2.0-py3-none-any.whl.

File metadata

Download URL: specifind-0.2.0-py3-none-any.whl
Upload date: Jan 14, 2026
Size: 743.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for specifind-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e57092f3cd6e7cd285146603026855a70daca481f020e30e6a844eed055e4fb5`
MD5	`9ca54395ef0d2da6af072c6dd057ac3a`
BLAKE2b-256	`5532fd25c255b223045e4fbaf0a3285bc4cdc94b568fd0ae7d73c1c37d444acf`
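A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch: the local file path is an assumption, and `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest listed above for the wheel; the local path is hypothetical.
EXPECTED = "e57092f3cd6e7cd285146603026855a70daca481f020e30e6a844eed055e4fb5"
wheel = Path("specifind-0.2.0-py3-none-any.whl")
if wheel.exists():
    print("digest matches:", matches(wheel, EXPECTED))
```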
