Extracts relevant meta information for cataloging.

RaRa Meta Extractor

Supported Python versions: 3.10, 3.11, 3.12

rara-meta-extractor is a Python library for extracting relevant meta information for cataloging.

✨ Features

Extracts relevant meta information for cataloging (authors, titles, publication dates, publisher, ISBN, ISSN, etc.).
Detects and extracts summaries, conclusions, and abstracts.
Uses Llama for extracting metadata from plaintext and custom parsers for extracting metadata from EPUB and METS/ALTO markup.
Supports extracting a custom set of user-defined fields.¹

¹ Might not work well with fine-tuned Llama instances.

⚡ Quick Start

Get started with rara-meta-extractor in just a few steps:

Install the Package
Ensure you're using Python 3.10 or above, then run:
```
pip install rara-meta-extractor
```

Import and Use
Extracting user-defined fields:

from rara_meta_extractor.llama_extractor import LlamaExtractor
from pprint import pprint

# Raw text of an imprint page (Estonian)
text = """
    JUMALAL EI OLE AEGA

    Toimetanud Milvi Teesalu
    Kaane kujundanud Piret Tuur
    Autoriõigus: Marje Ernits ja OÜ Eesti Raamat, 2019
    ISBN 978-9949-683-96-3
    ISBN 978-9949-683-97-0 (epub)
"""

# Custom set of fields to extract
fields = [
    "editor", "designer", "isbn", "author",
    "copyright year", "title"
]

llama_extractor = LlamaExtractor(
    llama_host_url="http://local-llama:8080",
    fields=fields,
    temperature=0.3
)

extracted_info = llama_extractor.extract(text)
pprint(extracted_info)

Out:

{
  "editor": ["Milvi Teesalu"],
  "designer": ["Piret Tuur"],
  "isbn": ["978-9949-683-96-3", "978-9949-683-97-0"],
  "author": ["Marje Ernits ja OÜ Eesti Raamat"],
  "copyright year": ["2019"],
  "title": ["JUMALAL EI OLE AEGA"]
}

Extracting predefined metadata:

from rara_meta_extractor.meta_extractor import MetaExtractor
from pprint import pprint

text = """
   JUMALAL EI OLE AEGA

   Toimetanud Milvi Teesalu
   Kaane kujundanud Piret Tuur
   Autoriõigus: Marje Ernits ja OÜ Eesti Raamat, 2019
   ISBN 978-9949-683-96-3
   ISBN 978-9949-683-97-0 (epub)
"""

# Both agents need the URL of a running Llama server
meta_extractor = MetaExtractor(
   meta_extractor_config={
      "llama_host_url": "http://local-llama:8080"
   },
   text_classifier_config={
      "llama_host_url": "http://local-llama:8080"
   }
)

extracted_info = meta_extractor.extract_simple(text)
pprint(extracted_info)

Out:

{
   "extractor": "Llama-Extractor",
   "meta": {
      "authors": [
         {
         "name": "Marje Ernits",
         "role": "Autor"
         },
         {
         "name": "Milvi Teesalu",
         "role": "Toimetaja"
         },
         {
         "name": "Piret Tuur",
         "role": "Kujundaja"
         },
         {
         "name": "Eesti Raamat",
         "role": "Väljaandja"
         }
      ],
      "isbn": [
         "9789949683963",
         "9789949683970"
      ],
      "publication_place": "Tallinn",
      "titles": [
         {
         "title": "Jumalal ei ole aega",
         "title_type": "main_title"
         },
         {
         "title": "jutustused] /",
         "title_type": "additional_title_part"
         }
      ],
      "udc": [
         "821.511.113-32"
      ],
      "udk": [
         "821.511.113"
      ]
   }
}

⚙️ Installation Guide

Follow the steps below to install the rara-meta-extractor package, either via pip or locally.

Installation via `pip`

Set Up Your Python Environment
Create or activate a Python environment using Python 3.10 or above.
Install the Package
Run the following command:
```
pip install rara-meta-extractor
```

Local Installation

Follow these steps to install the rara-meta-extractor package locally:

Clone the Repository
Clone the repository and navigate into it:
```
git clone <repository-url>
cd <repository-directory>
```
Set Up Python Environment
Create or activate a Python environment using Python 3.10 or above, e.g.:
```
conda create -n py310 python==3.10
conda activate py310
```
Install Build Package
Install the build package to enable local builds:
```
pip install build
```
Build the Package
Run the following command inside the repository:
```
python -m build
```
Install the Package
Install the built package locally:
```
pip install .
```

🚀 Testing Guide

Follow these steps to test the rara-meta-extractor package.

How to Test

Clone the Repository
Clone the repository and navigate into it:
```
git clone <repository-url>
cd <repository-directory>
```
Set Up Python Environment
Create or activate a Python environment using Python 3.10 or above.
Install Build Package
Install the build package:
```
pip install build
```
Build the Package
Build the package inside the repository:
```
python -m build
```
Install with Testing Dependencies
Install the package along with its testing dependencies:
```
pip install .[testing]
```
Run Tests
Run the test suite from the repository root:
```
python -m pytest -v tests
```

📝 Documentation

🔍 `MetaExtractor` Class

Overview

The MetaExtractor class wraps the logic of the different meta extractors (EPUBMetaExtractor, MetsAltoMetaExtractor, and LlamaMetaExtractor) along with all text part classifiers (EPUBTextPartClassifier, MetsAltoTextPartClassifier, and RegexTextPartClassifier).

Importing

from rara_meta_extractor.meta_extractor import MetaExtractor

Class Parameters

Name	Type	Optional	Default	Description
meta_extractor_config	dict	True*	rara_meta_extractor.config.META_EXTRACTOR_CONFIG	Configuration for Llama's meta extractor agent.
text_classifier_config	dict	True*	rara_meta_extractor.config.TEXT_CLASSIFIER_CONFIG	Configuration for Llama's text classifier agent. NB! Text classifier is used only for filtering the input passed to the meta extractor agent. However, this behaviour is disabled by default.

* Although both parameters have default values, it is strongly recommended to make sure that the correct llama_host_url is used.

All possible configuration parameters are listed in the table below:

Configuration Parameters

The following table lists all possible configuration params for meta_extractor_config and text_classifier_config.

Name	Type	Required	Description
llama_host_url	str	True	Llama server URL, e.g. "http://localhost:8080"
instructions	str	False	Instructions for Llama.
fields	List[str]	False	List of fields to extract. Define this only if you wish to extract a custom set of fields instead of the predefined ones. NB! If fields is defined, the JSON schema is generated automatically.
json_schema	dict	False	JSON schema to use for generating grammars for Llama. NB! This is necessary only if fields is not defined or if you wish to apply more advanced restrictions to the fields. The schema is not necessary for extracting default/predefined fields. Read more about the expected structure here: https://github.com/ggml-org/llama.cpp/tree/master/grammars
temperature	float	False	Temperature in range [0, 2]. The lower the temperature, the more deterministic the Llama outputs. Default: 0.0.
n_predict	int	False	Number of tokens Llama is allowed to predict. Default: 500.
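
As a rough illustration of how these parameters fit together, the sketch below assembles a configuration dictionary and checks the documented temperature range. The helper `build_llama_config` is hypothetical and not part of rara-meta-extractor; only the key names, the [0, 2] range, and the defaults are taken from the table above.

```python
from typing import List, Optional


def build_llama_config(
    llama_host_url: str,
    fields: Optional[List[str]] = None,
    temperature: float = 0.0,
    n_predict: int = 500,
) -> dict:
    """Hypothetical helper: assemble a config dict using the documented keys."""
    if not llama_host_url:
        raise ValueError("llama_host_url is required")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in range [0, 2]")
    config = {
        "llama_host_url": llama_host_url,
        "temperature": temperature,
        "n_predict": n_predict,
    }
    # Only set `fields` when a custom set is wanted; per the table above,
    # the JSON schema is then generated automatically.
    if fields:
        config["fields"] = list(fields)
    return config


config = build_llama_config(
    "http://localhost:8080", fields=["title", "isbn"], temperature=0.3
)
print(config)
```

A dictionary built this way could then be passed as meta_extractor_config or text_classifier_config.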

Key Functions

Function: `extract`

The main function for extracting meta information.

Parameters

Name	Type	Required	Default	Description
texts	List[dict]	True	-	List of texts from which to extract meta information. For EPUB and METS/ALTO, expects the content of `texts` from digitizer output. Otherwise, each item must contain at least the keys `text` and `lang`.
epub_metadata	dict	False	{}	Expects the content of `doc_meta.epub_metadata` from digitizer output.
mets_alto_metadata	List[str]	False	[]	Expects the content of `doc_meta.mets_alto_metadata` from digitizer output.
verify_texts_with_llm	bool	False	False	If enabled, each text is passed to text classifier agent first and only texts classified as metadata blocks are passed to meta extractor(s).
n_trials	int	False	1	Indicates how many trials to run for predicting metadata with LlamaExtractor for the same text. NB! Setting it higher than 1 is useful only if temperature > 0.
merge_texts	bool	False	True	If enabled, texts are merged into a single text block before passing it to LlamaExtractor. Otherwise texts are passed one by one to LlamaExtractor and results are merged afterwards.
min_ratio	float	False	0.8	Relevant only if n_trials > 1. Minimum fraction of trials in which a meta value has to be predicted in order to be returned. E.g. if min_ratio = 0.7 and a value is predicted in 2 out of 3 trials, it will not be returned, as 2/3 ≈ 0.67 < 0.7.
add_missing_keys	bool	False	False	If enabled, all possible meta keys are added to the output, even if the content has not been extracted.
detect_text_parts	bool	False	True	If enabled, runs text part detection for detecting conclusions, abstracts etc.
max_length_per_text	int	False	1500	If verify_texts_with_llm is set to False, this param is used for simple length-based metadata detection: if a text is longer than the threshold set with this param, it will not be included in the Llama input.
n_first_pages	int	False	5	Number of first pages to consider as possible Llama input. NB! Not all of them are actually added to the input, as the pages are passed through prefiltering.
n_last_pages	int	False	0	Number of last pages to consider as possible Llama input. NB! Not all of them are actually added to the input, as the pages are passed through prefiltering.
n_strict_include	int	False	3	Number of pages (out of the n_first_pages + n_last_pages set) to pass to Llama without additional prefiltering.
simple	bool	False	False	If enabled, the outputs of titles and authors are simplified (some fields necessary mostly for constructing final MARC files are removed).
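
The interaction between n_trials and min_ratio can be sketched as a simple vote over repeated predictions. This is only an illustration of the rule described in the table; the function name `filter_by_min_ratio` and the shape of the per-trial outputs are assumptions, and the library's internal implementation may differ.

```python
from collections import Counter
from typing import Dict, List


def filter_by_min_ratio(
    trial_outputs: List[Dict[str, List[str]]], min_ratio: float = 0.8
) -> Dict[str, List[str]]:
    """Keep values predicted in at least `min_ratio` of the trials (illustrative only)."""
    n_trials = len(trial_outputs)
    counts: Dict[str, Counter] = {}
    for output in trial_outputs:
        for key, values in output.items():
            # Count each value at most once per trial.
            counts.setdefault(key, Counter()).update(set(values))
    kept: Dict[str, List[str]] = {}
    for key, counter in counts.items():
        values = [v for v, c in counter.items() if c / n_trials >= min_ratio]
        if values:
            kept[key] = sorted(values)
    return kept


trials = [
    {"title": ["Kummitus kurgus"], "publisher": ["Varrak"]},
    {"title": ["Kummitus kurgus"], "publisher": ["Varrak"]},
    {"title": ["Kummitus kurgus"], "publisher": ["Tramp Press"]},
]
print(filter_by_min_ratio(trials, min_ratio=0.7))
# {'title': ['Kummitus kurgus']}
```

Here "Varrak" appears in 2 of 3 trials (about 0.67), which is below the 0.7 threshold, so only the title survives.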

Result

Function extract returns a dictionary with two keys:

extractor - Indicates which extractor was used (possible values are: "Llama-Extractor", "EPUB-Extractor", and "METS/ALTO-Extractor").
meta - Extracted meta information formatted as a dict.
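
The sample outputs in this README show the `extractor` value both as a plain string and as a list, so code consuming the result may want to handle both forms. The helper `extractor_names` below is a hypothetical convenience function, not part of the library.

```python
from typing import Any, Dict, List


def extractor_names(result: Dict[str, Any]) -> List[str]:
    """Return the `extractor` value as a list, whether it is a string or a list."""
    value = result.get("extractor", [])
    if isinstance(value, str):
        return [value]
    return list(value)


# Shortened result in the shape shown in this README.
result = {"extractor": "Llama-Extractor", "meta": {"isbn": ["9789949683963"]}}
print(extractor_names(result))         # ['Llama-Extractor']
print(result["meta"].get("isbn", []))  # ['9789949683963']
```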

Function: `extract_from_digitizer_output`

This function allows passing raw digitizer output to the meta extractor.

Parameters

Name	Type	Required	Default	Description
digitizer_output	dict	True	-	Output of rara-digitizer.
verify_texts_with_llm	bool	False	False	If enabled, each text is passed to text classifier agent first and only texts classified as metadata blocks are passed to meta extractor(s).
n_trials	int	False	1	Indicates how many trials to run for predicting metadata with LlamaExtractor for the same text. NB! Setting it higher than 1 is useful only if temperature > 0.
merge_texts	bool	False	True	If enabled, texts are merged into a single text block before passing it to LlamaExtractor. Otherwise texts are passed one by one to LlamaExtractor and results are merged afterwards.
min_ratio	float	False	0.8	Relevant only if n_trials > 1. Minimum fraction of trials in which a meta value has to be predicted in order to be returned. E.g. if min_ratio = 0.7 and a value is predicted in 2 out of 3 trials, it will not be returned, as 2/3 ≈ 0.67 < 0.7.
add_missing_keys	bool	False	False	If enabled, all possible meta keys are added to the output, even if the content has not been extracted.
detect_text_parts	bool	False	True	If enabled, runs text part detection for detecting conclusions, abstracts etc.
max_length_per_text	int	False	1500	If verify_texts_with_llm is set to False, this param is used for simple length-based metadata detection: if a text is longer than the threshold set with this param, it will not be included in the Llama input.
n_first_pages	int	False	5	Number of first pages to consider as possible Llama input. NB! Not all of them are actually added to the input, as the pages are passed through prefiltering.
n_last_pages	int	False	0	Number of last pages to consider as possible Llama input. NB! Not all of them are actually added to the input, as the pages are passed through prefiltering.
n_strict_include	int	False	3	Number of pages (out of the n_first_pages + n_last_pages set) to pass to Llama without additional prefiltering.
simple	bool	False	False	If enabled, the outputs of titles and authors are simplified (some fields necessary mostly for constructing final MARC files are removed).
validate_llama_output	bool	False	True	If enabled, information detected with Llama-Extractor is validated against the original text. If the information cannot be found in the original text, it will be excluded from the output.
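
One possible reading of how n_first_pages, n_last_pages, n_strict_include, and max_length_per_text combine is sketched below. The function `select_llama_input` is hypothetical and assumes that pages are plain strings and that the length check is the only prefilter; the actual prefiltering in the library may work differently.

```python
from typing import List


def select_llama_input(
    pages: List[str],
    n_first_pages: int = 5,
    n_last_pages: int = 0,
    n_strict_include: int = 3,
    max_length_per_text: int = 1500,
) -> List[str]:
    """Illustrative prefilter; the library's real logic may differ."""
    first = pages[:n_first_pages]
    # Take the last pages from what is left, so no page is selected twice.
    remaining = pages[n_first_pages:]
    last = remaining[-n_last_pages:] if n_last_pages > 0 else []
    candidates = first + last
    selected = []
    for i, page in enumerate(candidates):
        # The first `n_strict_include` candidates skip the length check.
        if i < n_strict_include or len(page) <= max_length_per_text:
            selected.append(page)
    return selected


pages = ["title page", "imprint", "x" * 2000, "y" * 2000, "short foreword", "chapter 1"]
print([len(p) for p in select_llama_input(pages)])
# [10, 7, 2000, 14]
```

With the default values, the third page is kept despite its length because it falls within the first three candidates, while the fourth page is dropped for exceeding 1500 characters.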

Result

Function extract_from_digitizer_output returns a dictionary with two keys:

extractor - Indicates which extractors were used (possible values are a combination of the following: "Llama-Extractor", "EPUB-Extractor", and "METS/ALTO-Extractor").
meta - Extracted meta information formatted as a dict.
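
The parameter descriptions of `extract` indicate which parts of the digitizer output it consumes: `texts`, `doc_meta.epub_metadata`, and `doc_meta.mets_alto_metadata`. The sketch below pulls these parts out of a digitizer-style dictionary. The helper `split_digitizer_output` and the sample input are made up for illustration; whether extract_from_digitizer_output works this way internally is an assumption.

```python
from typing import Any, Dict


def split_digitizer_output(digitizer_output: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pick out the parts that `extract` expects."""
    doc_meta = digitizer_output.get("doc_meta") or {}
    return {
        "texts": digitizer_output.get("texts", []),
        "epub_metadata": doc_meta.get("epub_metadata") or {},
        "mets_alto_metadata": doc_meta.get("mets_alto_metadata") or [],
    }


# Made-up minimal input; real rara-digitizer output contains more keys.
sample = {
    "doc_meta": {"epub_metadata": {"title": "Kummitus kurgus"}},
    "texts": [{"text": "Kirjastus Varrak, Tallinn, 2024", "lang": "et"}],
}
kwargs = split_digitizer_output(sample)
print(kwargs["mets_alto_metadata"])  # []
```

The resulting dictionary uses the same names as the `extract` parameters, so it could be unpacked as keyword arguments.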

🔍 Usage Examples

Example 1: Simple meta extraction

from rara_meta_extractor.meta_extractor import MetaExtractor
from pprint import pprint

test_text = """
Original title:\nHilarious Stories of Animals\n   \n \nCopyright © 2021 Creative Arts Management OÜ\nAll rights reserved.\n \nEditor: KRISTO VILLEM\n \n \nISBN   978-9916-665-46-6\n \n \n\nLiza Moonlight\n\nGreetings, friends! This story will teach you the ever\nchanging flow of time. As time passes, so do the seasons.\nThere are many lovely  things to each season and each of\nthem holds many secrets and surprises.  \nEnjoy these tales and hopefully you will also discover\nsomething new!\n\nEverything that surrounds us has patterns. As the day\nalways follo ws the night and the sun always sets and then\nrises, the seasons also follow one another. The first season of\nour book's cycle is Spring. It a time of many new beginnings.\nBirds return  to their homeplaces and the sun start to give\nmore and more warmth. Chippy the Bird will be Your guide!\n\nIt is probably no surprise that the thrilly easter rabbit\nfamily comes out to enjoy the sun and play around on the\nwarm grass. They have been sitting snugly in their\nburrows for the whole winter and are so very happy to be\noutside and hop around and flop their ears.
"""

meta_extractor = MetaExtractor(
   meta_extractor_config={
      "llama_host_url": "http://local-llama:8080"
   },
   text_classifier_config={
      "llama_host_url": "http://local-llama:8080"
   }
)
texts = [{"text": test_text, "lang": "en"}]

extracted_info = meta_extractor.extract(texts=texts, simple=True)

pprint(extracted_info)

Output:

{
  "extractor": ["Llama-Extractor"],
  "meta": {
    "authors": [
      {
        "name": "Liza Moonlight",
        "role": "Autor"
      },
      {
        "name": "Kristo Villem",
        "role": "Toimetaja"
      }
    ],
    "distributer_name": "Creative Arts Management OÜ",
    "distribution_place": "Tallinn",
    "isbn": [
      "9789916665466",
      "9789916665473",
      "9789916665480",
      "9789916665497"
    ],
    "titles": [
      {
        "title": "Hilarious stories of animals",
        "title_type": "main_title"
      },
      {
        "title": "4 books in 1 /",
        "title_type": "additional_title_part"
      }
    ],
    "udc": [
      "821-9-32",
      "821.111",
      "474.2)-93-322.4"
    ],
    "udk": [
      "821-93"
    ]
  }
}

Example 2: Run multiple trials

from rara_meta_extractor.meta_extractor import MetaExtractor
from pprint import pprint

test_text = """
1KUMMITUS
KURGUSDoireann Ní Ghríofa
kummitus
kurgusDoireann Ní Ghríofa
Inglise keelest tõlkinud Krista Kaer
kummitus
kurgusDoireann Ní Ghríofa
Inglise keelest tõlkinud Krista Kaer
Raamatu väljaandmist on toetanud Iiri Kirjandusfond
ja Eesti Kultuurkapital
Originaali tiitel:
Doireann Ní Ghríofa
A Ghost in the Throat
Tramp Press
2020
Copyright © Doireann Ní Ghríofa, 2020
Kõik õigused kaitstud
Tõlge eesti keelde © Krista Kaer, 2024
Poeemi „Itk Art O’Leary surma puhul” gaeli keelest tõlkinud Indrek Õis
Toimetanud ja korrektuuri lugenud Eha Kõrge
Kujundanud Britt Urbla Keller
ISBN 978-9985-3-6045-3
Kirjastus Varrak
Tallinn, 2024
www.varrak.ee
www.facebook.com/kirjastusvarrak
Trükikoda OÜ Greif
"""

meta_extractor = MetaExtractor(
   meta_extractor_config={
      "llama_host_url": "http://local-llama:8080",
      "temperature": 0.1  # Raise the temperature slightly so that repeated trials can differ
   },
   text_classifier_config={
      "llama_host_url": "http://local-llama:8080"
   }
)
texts = [{"text": test_text, "lang": "et"}]

extracted_info = meta_extractor.extract(texts=texts, n_trials=7, min_ratio=0.7)

pprint(extracted_info)

Output:

{"extractor": ["Llama-Extractor"],
 "meta": {"authors": [{"is_primary": false,
                       "name": "Krista Kaer",
                       "name_order": 0,
                       "role": "Tõlkija",
                       "type": ""},
                      {"is_primary": false,
                       "name": "Indrek Õis",
                       "name_order": 0,
                       "role": "Tõlkija",
                       "type": ""},
                      {"is_primary": false,
                       "name": "Eha Kõrge",
                       "name_order": 0,
                       "role": "Toimetaja",
                       "type": ""},
                      {"is_primary": false,
                       "name": "Britt Urbla Keller",
                       "name_order": 0,
                       "role": "Kujundaja",
                       "type": ""},
                      {"is_primary": false,
                       "name": "Varrak",
                       "name_order": 0,
                       "role": "Väljaandja",
                       "type": ""}],
          "edition_info/number": "est",
          "host_entry": {"name": "",
                         "part_number": "",
                         "publication_date": "2024"},
          "isbn": ["9789985360453"],
          "issue_type": "Raamat",
          "manufacture_place": "([Lohkva (Tartumaa)]",
          "manufacturer": "Greif",
          "publication_date": "2024",
          "publication_place": "Tallinn",
          "series": {"issn": "", "name": "", "volume": ""},
          "table_of_contents": {"content": [], "language": ""},
          "text_parts": [],
          "titles": [{"author_from_title": "",
                      "part_number": "",
                      "part_title": "[romaan]",
                      "skip": 0,
                      "title": "Kummitus kurgus",
                      "title_language": "et",
                      "title_type": "väljaandes esitatud kujul põhipealkiri",
                      "title_type_int": 245,
                      "version": ""}],
          "udk": ["821"]}
}
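
Once a result like the ones above is available, the `authors` list can be post-processed with plain Python. As a small sketch, the hypothetical helper `names_by_role` groups contributor names by role; it relies only on the `name` and `role` keys visible in the sample outputs.

```python
from collections import defaultdict
from typing import Any, Dict, List


def names_by_role(meta: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group contributor names by their `role` value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for author in meta.get("authors", []):
        grouped[author.get("role", "")].append(author.get("name", ""))
    return dict(grouped)


# Shortened copy of the `meta` content from the multi-trial example.
meta = {
    "authors": [
        {"name": "Krista Kaer", "role": "Tõlkija"},
        {"name": "Indrek Õis", "role": "Tõlkija"},
        {"name": "Eha Kõrge", "role": "Toimetaja"},
    ]
}
print(names_by_role(meta))
# {'Tõlkija': ['Krista Kaer', 'Indrek Õis'], 'Toimetaja': ['Eha Kõrge']}
```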

Project details

Intended Audience
- Science/Research

Release history

This version

2.2.4

Jun 3, 2026

2.2.3

May 15, 2026

2.2.2

Mar 19, 2026

2.2.1

Mar 18, 2026

2.2.0

Jan 26, 2026

2.1.8

Jan 13, 2026

2.1.7

Dec 2, 2025

2.1.6

Nov 27, 2025

2.1.5

Nov 4, 2025

2.1.4

Nov 4, 2025

2.1.3

Nov 3, 2025

2.1.2

Oct 24, 2025

2.1.1

Sep 30, 2025

2.1.0

Sep 19, 2025

2.0.23

Aug 21, 2025

2.0.22

Aug 19, 2025

2.0.21

Aug 19, 2025

2.0.20

Aug 6, 2025

2.0.19

Aug 5, 2025

2.0.18

Jul 31, 2025

2.0.17

Jul 30, 2025

2.0.16

Jul 29, 2025

2.0.14

Jul 29, 2025

2.0.13

Jul 28, 2025

2.0.12

Jul 28, 2025

2.0.11

Jul 27, 2025

2.0.10

Jul 25, 2025

2.0.9

Jul 16, 2025

2.0.8

Jul 16, 2025

2.0.7

Jun 27, 2025

2.0.6

Jun 19, 2025

2.0.5

Jun 17, 2025

2.0.4

Jun 10, 2025

2.0.3

Jun 10, 2025

2.0.2

Jun 9, 2025

2.0.1

Jun 5, 2025

2.0.0

Jun 5, 2025

1.0.3

Jun 4, 2025

1.0.2

Jun 4, 2025

1.0.1

May 29, 2025

1.0.0

May 28, 2025

0.0.4

Mar 10, 2025

0.0.3

Mar 10, 2025

0.0.2

Mar 10, 2025

0.0.1

Feb 26, 2025

Download files

Source Distribution

rara_meta_extractor-2.2.4.tar.gz (88.2 kB)

Uploaded Jun 3, 2026 Source

Built Distribution

rara_meta_extractor-2.2.4-py3-none-any.whl (85.6 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file rara_meta_extractor-2.2.4.tar.gz.

File metadata

Download URL: rara_meta_extractor-2.2.4.tar.gz
Upload date: Jun 3, 2026
Size: 88.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for rara_meta_extractor-2.2.4.tar.gz
Algorithm	Hash digest
SHA256	`eea06350ee9b2f3e0a362e0d9c72b9072dab57ea709ab8181ba70523fe0d2fbb`
MD5	`4fc3a957176e86d490b3486d9ba4dbbb`
BLAKE2b-256	`3e063d6a6ab7779adde3dcb6c85b5ec752f41ce0fddc3bdda4f9e1baeae854d0`

File details

Details for the file rara_meta_extractor-2.2.4-py3-none-any.whl.

File metadata

Download URL: rara_meta_extractor-2.2.4-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 85.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for rara_meta_extractor-2.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7e57dd950b6f151d3f79f562926e23f521061238e27946bdb214bbf12ba1a53`
MD5	`22a6b5f2a0a73e0ec593328b9c9375bb`
BLAKE2b-256	`01af4485511024f7262bfd6a470663d0dbec23f36e5a9b07804c7f88c5ae7f97`
