LLM extraction from documents
Project description
Assumptions
This project operates under the following key assumptions regarding the input PDF files and project structure:
- Textual PDFs: All input PDF documents are assumed to be text-searchable (i.e., not scanned images). The project relies on the ability to extract raw text content directly from the PDFs. If scanned PDFs are provided, text extraction may fail or produce garbled output.
- English Language Content: The textual content within all input PDFs is assumed to be primarily in the English language. Text processing and analysis steps (e.g., keyword extraction, natural language processing) may yield inaccurate or irrelevant results for content in other languages.
- Consistent Document Structure: Particularly for "call for proposals" PDFs, a very similar internal structure and layout are assumed. The project's parsing logic relies on this consistency to accurately locate and extract specific pieces of information. Deviations in structure may lead to incomplete or incorrect data extraction.
- Presence of Call Proposals: For each EU project intended for processing, it's assumed that a corresponding PDF file exists within the designated input folder. This PDF must contain the string "call" in its filename or a prominent location within its text to correctly identify and process it as a call proposal document.
- Handling of Numbered Call Files: In cases where multiple PDF files exist for the same call, identified by a common naming pattern like PROGRAMCODE-YYYY-TYPE-GRANT-CATEGORY-XX (e.g., AMIF-2025-TF2-AG-INTE-01, AMIF-2025-TF2-AG-INTE-02), the project will only process the file with the lowest numerical suffix (XX). This is due to the assumption that such sequentially numbered files for the same call contain identical core information.
- Currency Denomination: All monetary values (e.g., prices, budgets, grants) mentioned within the PDF documents are assumed to be denominated in Euros (EUR).
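The file-selection assumptions above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `select_call_files` and the regular expression are hypothetical, and the sketch checks only the filename for "call" (the text-based check mentioned above is left out).

```python
import re
from collections import defaultdict

# Hypothetical pattern for names such as AMIF-2025-TF2-AG-INTE-01:
# everything before the final two-digit block is treated as the call prefix.
NUMBERED_CALL = re.compile(
    r"(?P<prefix>[A-Z0-9]+(?:-[A-Z0-9]+)+)-(?P<suffix>\d{2})(?!\d)"
)


def select_call_files(filenames):
    """Keep call PDFs, and only the lowest-numbered file per numbered call."""
    numbered = defaultdict(list)  # call prefix -> [(suffix, filename), ...]
    unnumbered = []
    for name in filenames:
        lowered = name.lower()
        # Assumption: a call document is a PDF with "call" in its filename.
        if not lowered.endswith(".pdf") or "call" not in lowered:
            continue
        match = NUMBERED_CALL.search(name)
        if match:
            numbered[match["prefix"]].append((int(match["suffix"]), name))
        else:
            unnumbered.append(name)
    # min() on (suffix, filename) tuples picks the lowest numerical suffix.
    lowest = [min(group)[1] for group in numbered.values()]
    return sorted(unnumbered + lowest)


files = [
    "call-fiche_AMIF-2025-TF2-AG-INTE-02.pdf",
    "call-fiche_AMIF-2025-TF2-AG-INTE-01.pdf",
    "template_AMIF-2025-TF2-AG-INTE-01.pdf",
    "call-overview.pdf",
]
selected = select_call_files(files)
```

Here `selected` keeps the `-01` call file and the unnumbered call file, and drops both the `-02` duplicate and the template.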
Solution Overview
This project focuses on extracting key information from EU project-related PDF documents. During the data extraction process, it was identified that the provided PDFs, particularly those related to grant projects (e.g., AMIF), do not contain specific Technology Readiness Level (TRL) information. While TRL is a common concept in Horizon EU projects, the documents only offered generic definitions (TRL 1 to 9) without project-specific details. Attempts to extract TRL data, including with LLMs (Gemini), were unsuccessful and led to hallucinations. For AMIF, it is not standard practice to indicate a TRL.
Consequently, the solution prioritizes the extraction of available and reliable data points:
- Budget Information: This includes detailed proposal budget and grant amounts per project, which are consistently and clearly documented within the "call for proposal" PDFs.
- Organization Details: The number and type of organizations involved in grants were also targeted. However, the task description does not clearly define what "number and type of organization" means in the grant context, and further inquiry did not resolve the ambiguity, so this aspect could not be fully implemented.
Given that many of the provided PDFs were found to be templates or contained minimal additional data relevant to the extraction goals, the core focus of this solution was directed exclusively towards processing the "call for proposal" PDFs, as they proved to be the most valuable source of actionable information.
Design Choices and Approach
The core of this solution for information extraction relies on a multi-stage process leveraging local Large Language Models (LLMs) for specific data points. Our approach prioritizes accuracy and efficiency through a combination of heuristic text processing and targeted LLM inference.
- Local LLM Models: We used phi4:14b primarily for extracting monetary information and gemma3:27b for processing consortium-related table data.
- PDF to Text Conversion: The process begins with converting the PDF documents into raw text strings. This is handled by the PdfConverter class, which internally uses the docling package for robust text extraction from PDF files.
- Text Segmentation - Paragraphs: Following text conversion, the raw text is segmented into paragraphs using the Document class. Several ways of defining a paragraph were explored, including \n (single newline), \n\n (empty line), and models like SaT (wtpsplit) (https://github.com/segment-any-text/wtpsplit). A heuristic approach based on empty lines (\n\n) was adopted because it identified distinct paragraphs most accurately, although this performance depends on how the text is extracted from the PDFs.
- Text Segmentation - Sentences: After paragraph definition, sentences are extracted from each paragraph. For this granular segmentation, the SaT (wtpsplit) model (https://github.com/segment-any-text/wtpsplit) was employed due to its effectiveness in delineating individual sentences.
- Information Filtering with Regular Expressions: Before LLM processing, the segmented text (primarily paragraphs, though sentence-level filtering is also an option) undergoes a crucial filtering step using Regular Expressions. These regex patterns were custom-designed based on common characteristics observed in "call for proposal" PDFs to pre-select relevant sections. This includes identifying:
  - Monetary Amounts: Strings containing currency indicators (e.g., "EUR") coupled with digits.
  - Consortium Details: Sections typically related to consortium formation, specifically the table that indicates the minimum number of entities.
- LLM-based Data Extraction: Once filtered, the relevant paragraphs are fed to the pre-selected local LLMs.
  - Granular Processing: To maximize extraction accuracy, particularly for monetary information, paragraphs are passed to the LLMs in batches rather than providing the entire document at once. This granular approach was observed to yield more precise results (at least for local LLMs). For consortium entity extraction, the LLM receives the identified table as its input.
  - Prompt Engineering: User and system prompts for the LLMs are dynamically generated using a Jinja2 template.
  - Structured Output: The LLM's raw output is then parsed using a PydanticJsonParser. This ensures that the extracted data conforms to a predefined schema, enabling robust validation and easy integration into subsequent processes. However, there is currently no well-defined fallback for the case where the parser raises a ValidationError.
  - Iterative Accumulation: This batch processing, prompting, and parsing cycle is repeated for all filtered paragraphs, and the results are accumulated to form the complete extracted dataset for the document.
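The segmentation, filtering, and batching steps above can be sketched with the standard library alone. The regular expression, the function names, and the batch size are assumptions made for illustration; the project's own patterns and Document class are not shown in this description.

```python
import re

# Assumed pattern: "EUR" or the euro sign next to a number, in either order.
MONEY = re.compile(
    r"(?:\bEUR\b|€)\s?\d[\d.,\s]*|\d[\d.,\s]*\s?(?:\bEUR\b|€)"
)


def split_paragraphs(text):
    """Split raw text on empty lines and drop blank fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def money_paragraphs(text):
    """Keep only paragraphs that mention a monetary amount."""
    return [p for p in split_paragraphs(text) if MONEY.search(p)]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


text = (
    "Budget\n\n"
    "The available call budget is EUR 9 000 000.\n\n"
    "Proposals must be submitted electronically.\n\n"
    "Grants between 500 000 EUR and 1 000 000 EUR are expected.\n\n"
    "The maximum grant is €2,000,000."
)
relevant = money_paragraphs(text)
grouped = list(batches(relevant, 2))
```

Each element of `grouped` would then be rendered into a prompt and sent to the local model, so the model never sees the whole document at once.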
Data Extraction Flow and Temporary Storage
The entire extraction process described above is repeated for each individual PDF document. The extracted data from each PDF is then temporarily stored in two separate JSON files: one for monetary information and another for entity-related data.
Data Transformation
Following the extraction phase, the temporarily stored JSON files are loaded into pandas DataFrames for subsequent transformation and consolidation. This stage is crucial for refining the extracted raw data:
- Monetary Data Transformation: Extracted monetary data undergoes various validation checks to ensure its quality and adherence to expected conditions. We then filter and retain only the information most relevant for analysis, such as the grant requested per project, available call budgets, or specific budget allocations per topic as mentioned in the call for proposal. A significant challenge identified was the duplication of monetary values, where identical amounts might or might not refer to the same underlying entity or concept (e.g., two mentions of the minimum EU grant request). To address this, a specific deduplication strategy is employed:
  - Sentences containing the duplicate monetary amounts are converted into embeddings using Sentence-Transformers (S-BERT).
  - Hierarchical clustering is then performed on these embeddings using cosine distance.
  - If multiple sentences fall within the same cluster (indicating high semantic similarity for the same amount), the sentence with the longest text is selected as the representative for that cluster, simplifying the data while retaining context.
- Entity Data Transformation: For the extracted entity data, due to time constraints, the transformation primarily involves basic validation checks followed by a simple stacking of the extracted data that contains the entity information.
- Pipeline Orchestration: All these transformation steps are orchestrated via a Pipeline class, which applies a series of pre-written functions sequentially to the pandas DataFrames, streamlining the data processing workflow.
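The deduplication strategy for repeated amounts can be illustrated with a small, dependency-free sketch. Several things here are assumptions rather than the project's implementation: a toy bag-of-words vector stands in for the Sentence-Transformers embedding so the example runs without downloading a model, and the average linkage and the 0.5 distance threshold are chosen only for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a stand-in for an S-BERT embedding."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering, cut at a distance threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b always holds).
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] += clusters.pop(b)
    return clusters


def deduplicate(sentences, threshold=0.5):
    """Keep the longest sentence of each cluster as its representative."""
    groups = cluster([embed(s) for s in sentences], threshold)
    return sorted(max((sentences[i] for i in g), key=len) for g in groups)


sentences = [
    "The minimum EU grant requested per project is EUR 500 000.",
    "The minimum EU grant requested is EUR 500 000.",
    "Travel costs above EUR 500 000 require prior approval.",
]
kept = deduplicate(sentences)
```

The first two sentences describe the same concept and collapse into the longer one, while the third mentions the same amount in a different context and is kept separately.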
Data Load
Finally, the processed pandas DataFrames for monetary and organization type information are stored as CSV files: etl_money_result.csv for the monetary data, and etl_entity_result.csv for the entity data.
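The transform-then-load flow can be sketched as a chain of functions followed by a CSV write. This is a hedged illustration, not the project's Pipeline class: plain lists of dicts stand in for the pandas DataFrames so that the example runs without third-party packages, and the step functions and column names are hypothetical.

```python
import csv
import io


class Pipeline:
    """Applies a fixed sequence of functions, feeding each output to the next."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, data):
        for step in self.steps:
            data = step(data)
        return data


def drop_missing_amounts(rows):
    # Hypothetical validation step: discard rows without an amount.
    return [r for r in rows if r.get("amount") is not None]


def keep_positive_amounts(rows):
    # Hypothetical validation step: amounts must be strictly positive.
    return [r for r in rows if r["amount"] > 0]


def to_csv_text(rows, fieldnames):
    """Serialize rows as CSV text (an in-memory stand-in for writing a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"label": "call budget", "amount": 9000000},
    {"label": "unknown", "amount": None},
    {"label": "bad", "amount": -1},
]
clean = Pipeline([drop_missing_amounts, keep_positive_amounts]).run(rows)
csv_text = to_csv_text(clean, ["label", "amount"])
```

With pandas, each step would instead take and return a DataFrame, and the final write would be a call to `DataFrame.to_csv`.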
Testing Strategy
For quality assurance, a suite of unit tests has been developed to validate individual components and functions of the codebase. While these tests provide foundational coverage, the current test coverage stands at approximately 50%, indicating areas for future expansion.
File details
Details for the file llm_etl_pipeline-0.1.0.tar.gz.
File metadata
- Download URL: llm_etl_pipeline-0.1.0.tar.gz
- Upload date:
- Size: 46.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.13.4 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 66001d8027c0f4ee3347c9ddb40128a8910304d7ffb204974bcbdb41b9b93d75 |
| MD5 | 8ad99d26735df3c0054ddfb46b2384d7 |
| BLAKE2b-256 | 0da56e2dc6c07f4a87a4ec342bc58ca8bff98f63de081f6c6e1ee9c4396d1615 |
File details
Details for the file llm_etl_pipeline-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llm_etl_pipeline-0.1.0-py3-none-any.whl
- Upload date:
- Size: 63.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.13.4 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9de4dd3fd94a5e1588f60b8cfe7bef07825a8779fb8e8d6a41ea7f6418d9fb49 |
| MD5 | 86b4cc320ef5b96b6c149c127bc5901e |
| BLAKE2b-256 | 106913d43a09891a33f09773d849bd0f6ddd0645cfce96f0286cd4b3a3ee17fc |