Skip to main content

LLM extraction from documents

Project description

Coverage

Assumptions

This project operates under the following key assumptions regarding the input PDF files and project structure:

  • Textual PDFs: All input PDF documents are assumed to be text-searchable (i.e., not scanned images). The project relies on the ability to extract raw text content directly from the PDFs. If scanned PDFs are provided, text extraction may fail or produce garbled output.
  • English Language Content: The textual content within all input PDFs is assumed to be primarily in the English language. Text processing and analysis steps (e.g., keyword extraction, natural language processing) may yield inaccurate or irrelevant results for content in other languages.
  • Consistent Document Structure: Particularly for "call for proposals" PDFs, a very similar internal structure and layout are assumed. The project's parsing logic relies on this consistency to accurately locate and extract specific pieces of information. Deviations in structure may lead to incomplete or incorrect data extraction.
  • Presence of Call Proposals: For each EU project intended for processing, it's assumed that a corresponding PDF file exists within the designated input folder. This PDF must contain the string "call" in its filename or a prominent location within its text to correctly identify and process it as a call proposal document.
  • Handling of Numbered Call Files: In cases where multiple PDF files exist for the same call, identified by a common naming pattern like PROGRAMCODE-YYYY-TYPE-GRANT-CATEGORY-XX (e.g., AMIF-2025-TF2-AG-INTE-01, AMIF-2025-TF2-AG-INTE-02), the project will only process the file with the lowest numerical suffix (XX). This is due to the assumption that such sequentially numbered files for the same call contain identical core information.
  • Currency Denomination: All monetary values (e.g., prices, budgets, grants) mentioned within the PDF documents are assumed to be denominated in Euros (EUR).

Solution Overview

This project focuses on extracting key information from EU project-related PDF documents. During the data extraction process, it was identified that the provided PDFs, particularly those related to grant projects (e.g., AMIF), do not contain specific Technology Readiness Level (TRL) information. While TRL is a common concept in Horizon EU projects, the documents only offered generic definitions (TRL 1 to 9) without project-specific details. Attempts to extract TRL data, including leveraging LLM AI models (gemini), proved unsuccessful and led to hallucinations. For AMIF, it is not a practice to indicate TLR.

Consequently, the solution prioritizes the extraction of available and reliable data points:

  • Budget Information: This includes detailed proposal budget and grant amounts per project, which are consistently and clearly documented within the "call for proposal" PDFs.
  • Organization Details: Extraction of the number and type of organizations involved in grants was also targeted. However, due to ambiguity and lack of clear definitions within the document describing the task regarding what "number and type of organization" specifically entails in the grant context, this aspect could not be fully implemented or clarified through further inquiry.

Given that many of the provided PDFs were found to be templates or contained minimal additional data relevant to the extraction goals, the core focus of this solution was directed exclusively towards processing the "call for proposal" PDFs, as they proved to be the most valuable source of actionable information.

Design Choices and Approach

The core of this solution for information extraction relies on a multi-stage process leveraging local Large Language Models (LLMs) for specific data points. Our approach prioritizes accuracy and efficiency through a combination of heuristic text processing and targeted LLM inference.

  • Local LLM Models: We utilized phi4:14b primarily for extracting monetary information and gemma3:27b for processing consortium-related table data.

  • PDF to Text Conversion: The process begins with converting the PDF documents into raw text strings. This is handled by the PdfConverter class, which internally uses the docling package for robust text extraction from PDF files.

  • Text Segmentation - Paragraphs: Following text conversion, the raw text is segmented into paragraphs using the Document class. While various methods for paragraph definition were explored, including \n (single newline), \n\n (empty line), and models like SaT (wtpsplit) (https://github.com/segment-any-text/wtpsplit), an heuristic approach based on empty lines (\n\n) was adopted for its superior performance in accurately identifying distinct paragraph ( this perfomance depends of course on how it is extracted the text from the PDFs).

  • Text Segmentation - Sentences: After paragraph definition, sentences are extracted from each paragraph. For this granular segmentation, the SaT (wtpsplit) model (https://github.com/segment-any-text/wtpsplit) was employed due to its effectiveness in delineating individual sentences.

  • Information Filtering with Regular Expressions: Before LLM processing, the segmented text (primarily paragraphs, though sentence-level filtering is also an option) undergoes a crucial filtering step using Regular Expressions. These regex patterns were custom-designed based on common characteristics observed in "call for proposal" PDFs to pre-select relevant sections. This includes identifying:

    • Monetary Amounts: Strings containing currency indicators (e.g., "EUR") coupled with digits.
    • Consortium Details: Sections typically related to consortium formation, specifically looking for the table that indicate minimum number of entities.
  • LLM-based Data Extraction: Once filtered, the relevant paragraphs are fed to the pre-selected local LLMs.

    • Granular Processing: To maximize extraction accuracy, particularly for monetary information, paragraphs are inputted to the LLMs in batches rather than providing the entire document at once. This granular approach was observed to yield more precise results (at least for local LLM). For consortium entity extraction, the LLM receives the identified table as its input.
    • Prompt Engineering: User and system prompts for the LLMs are dynamically generated using a Jinja2 template.
    • Structured Output: The LLM's raw output is then parsed using a PydanticJsonParser. This ensures that the extracted data conforms to a predefined schema, enabling robust validation and easy integration into subsequent processes. However, there is not a well defined fallback method in case of ValidationError caused by the parser.
    • Iterative Accumulation: This batch processing, prompting, and parsing cycle is repeated for all filtered paragraphs, and the results are accumulated to form the complete extracted dataset for the document.

Data Extraction Flow and Temporary Storage

The entire extraction process described above is repeated for each individual PDF document. The extracted data from each PDF is then temporarily stored in two separate JSON files: one for monetary information and another for entity-related data.

Data Transformation

Following the extraction phase, the temporarily stored JSON files are loaded into pandas DataFrames for subsequent transformation and consolidation. This stage is crucial for refining the extracted raw data:

  • Monetary Data Transformation: Extracted monetary data undergoes various validation checks to ensure its quality and adherence to expected conditions. We then filter and retain only the information most relevant for analysis, such as the grant requested per project, available call budgets, or specific budget allocations per topic as mentioned in the call for proposal. A significant challenge identified was the duplication of monetary values, where identical amounts might or might not refer to the same underlying entity or concept (e.g., two mentions of the minimum EU grant request). To address this, a specific deduplication strategy is employed:

    1. Sentences containing the duplicate monetary amounts are converted into embeddings using Sentence-Transformers (S-BERT).
    2. Hierarchical clustering is then performed on these embeddings using cosine distance.
    3. If multiple sentences fall within the same cluster (indicating high semantic similarity for the same amount), the sentence with the longest text is selected as the representative for that cluster, simplifying the data while retaining context.
  • Entity Data Transformation: For the extracted entity data, due to time constraints, the transformation primarily involves basic validation checks followed by a simple stacking of the dtaa extracted that contained the entity information.

  • Pipeline Orchestration: All these transformation steps are orchestrated via a Pipeline class, which applies a series of pre-written functions sequentially to the pandas DataFrames, streamlining the data processing workflow.

Data Load

Finally, the processed pandas DataFrames for monetary and organization type information are stored as CSV files: etl_money_result.csv for the monetary data, and etl_entity_result.csv for the entity data.

Testing Strategy

For quality assurance, a suite of unit tests has been developed to validate individual components and functions of the codebase. While these tests provide foundational coverage, the current test coverage stands at approximately 50%, indicating areas for future expansion.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_etl_pipeline-0.1.0.tar.gz (46.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_etl_pipeline-0.1.0-py3-none-any.whl (63.0 kB view details)

Uploaded Python 3

File details

Details for the file llm_etl_pipeline-0.1.0.tar.gz.

File metadata

  • Download URL: llm_etl_pipeline-0.1.0.tar.gz
  • Upload date:
  • Size: 46.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.4 Windows/11

File hashes

Hashes for llm_etl_pipeline-0.1.0.tar.gz
Algorithm Hash digest
SHA256 66001d8027c0f4ee3347c9ddb40128a8910304d7ffb204974bcbdb41b9b93d75
MD5 8ad99d26735df3c0054ddfb46b2384d7
BLAKE2b-256 0da56e2dc6c07f4a87a4ec342bc58ca8bff98f63de081f6c6e1ee9c4396d1615

See more details on using hashes here.

File details

Details for the file llm_etl_pipeline-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: llm_etl_pipeline-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 63.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.4 Windows/11

File hashes

Hashes for llm_etl_pipeline-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9de4dd3fd94a5e1588f60b8cfe7bef07825a8779fb8e8d6a41ea7f6418d9fb49
MD5 86b4cc320ef5b96b6c149c127bc5901e
BLAKE2b-256 106913d43a09891a33f09773d849bd0f6ddd0645cfce96f0286cd4b3a3ee17fc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page