
Lexia AI tools library by Lexfluent

Project description

Lexfluent GeDAIA Python library

Creation/Revision Author Date
Creation Jacques MASSA 2 December 2024
Modification Jacques MASSA 10 March 2025
Modification Jacques MASSA 6 February 2026
Modification Jacques MASSA 5 March 2026

Overview

The pyLexfluent library provides AI features for the legal and document-processing domains.

  • Classification: training and inference
  • Data extraction: ODP, CNI (French national identity card), IBAN, legal document, urban planning certificate, birth certificate extract, death certificate extract, marriage certificate extract
  • Data augmentation: finance

Credits

Built with Transformers
Built with gpt-oss-20b
Built with Llama

What's new

0.1.95:

Added protection against prompt injection and jailbreak in the AI agents (Interviewer)

Prerequisites:

You must have a Hugging Face token set as an environment variable:

```
export HF_TOKEN = XXXXXXXXXXXXXXXXXXXXXXXXXXX
echo $HF_TOKEN
```

The token must have access to: meta-llama/Prompt-Guard-86M
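
A script can fail early when the token is missing instead of failing later during a model download. The following is a minimal sketch using only the standard library; the helper name `get_hf_token` is hypothetical and not part of pyLexfluent.

```python
import os


def get_hf_token(env=os.environ):
    """Return the Hugging Face token, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of the library.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running the AI agents")
    return token
```

Calling `get_hf_token()` at the top of a script gives a clear error message if the `export` step above was skipped.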

Threat identification

A large share of our AI agents are tasked with analyzing, inspecting, and querying the content of documents coming from PDFs or other formats. Just as programs can carry viruses, these documents can themselves contain "threats" in the form of prompt injections or jailbreaks:

  • Prompt injections are inputs that exploit the concatenation of untrusted data (from third parties or users) into a model's context window in order to make it execute unintended instructions.

  • Jailbreaks are malicious instructions designed to override the safety and security features built into a model.

To counter these new threats, we have put several protections in place.
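
The general idea of screening untrusted document text before it reaches the context window can be sketched as follows. This is not the library's implementation: `naive_detector` is a hypothetical keyword-based stand-in for a real classifier such as meta-llama/Prompt-Guard-86M, and all function names here are assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for a real classifier such as
# meta-llama/Prompt-Guard-86M. A keyword list is NOT a real defense.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "ignore all previous instructions",
    "you are now",
)


def naive_detector(chunk: str) -> bool:
    """Return True when the chunk contains a known suspicious phrase."""
    lowered = chunk.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def split_chunks(text: str, size: int = 200) -> List[str]:
    """Split the text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def screen_document(
    text: str,
    detector: Callable[[str], bool] = naive_detector,
    size: int = 200,
) -> List[int]:
    """Return the indices of the chunks flagged by the detector."""
    return [i for i, chunk in enumerate(split_chunks(text, size)) if detector(chunk)]


def build_prompt(question: str, document: str) -> str:
    """Concatenate the document into the prompt only if no chunk is flagged."""
    flagged = screen_document(document)
    if flagged:
        raise ValueError(f"document rejected: suspicious chunks {flagged}")
    return f"Document:\n{document}\n\nQuestion: {question}"
```

A phrase split across two chunks would be missed by this sketch; overlapping windows are one way to reduce that risk.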

0.1.92:

IBAN DetectStructure

    Fix: default value of rois_filter = FALSE

0.1.91:

IBAN Analyzer

Fix: log list was not initialized

0.1.90:

Interviewer AI

Fix: default values of AI_TIMEOUT (180) and AI_RETRIES (1)

0.1.89:

#### Loan proposal recognition.
Each AI request has a timeout (180 s by default) in which to run, and a number of attempts (1 by default) before giving up.
Example:
```
result = await try_safe_execute_async(
    logger,
    odp_proposal_extractor.extract_data,
    file_path=file_path,
    base_url=base_url,
    default_model_instruct=model_instruct,
    openapi_compatibility=openapi_compatibility,
    ai_timeout=120,
    ai_max_retries=2,
)
```
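
The timeout-and-retry behavior described above can be illustrated with plain `asyncio`. This is a sketch, not the library's code: `call_with_timeout` and `slow_then_fast` are hypothetical names, and treating `ai_max_retries` as the total number of attempts is an assumption.

```python
import asyncio


async def call_with_timeout(func, *args, ai_timeout=180, ai_max_retries=1, **kwargs):
    """Await func(*args, **kwargs) with a per-attempt timeout.

    Assumption: ai_max_retries is the total number of attempts.
    """
    last_error = None
    for _ in range(max(1, ai_max_retries)):
        try:
            # wait_for cancels the call if it exceeds ai_timeout seconds
            return await asyncio.wait_for(func(*args, **kwargs), timeout=ai_timeout)
        except asyncio.TimeoutError as error:
            last_error = error
    raise last_error


async def slow_then_fast(state):
    """Demo coroutine: too slow on the first call, immediate afterwards."""
    state["calls"] += 1
    await asyncio.sleep(0.2 if state["calls"] == 1 else 0)
    return "ok"
```

With `ai_timeout=0.05` and `ai_max_retries=2`, the first call to `slow_then_fast` times out and the second one succeeds.
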
#### Logging.
All logs are now gathered in a single file: pylexfluent.log
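
One way to obtain a single log file with the standard `logging` module is to attach one file handler to a parent logger and let child loggers propagate to it. The sketch below shows that pattern only; the function name and the `"lxf"` parent logger name are assumptions, and the library's actual configuration may differ.

```python
import logging
import os


def configure_single_log_file(path="pylexfluent.log", level=logging.INFO, root_name="lxf"):
    """Attach one FileHandler to a parent logger; child loggers propagate to it.

    root_name is a hypothetical logger name used for illustration.
    """
    parent = logging.getLogger(root_name)
    parent.setLevel(level)
    target = os.path.abspath(path)
    # Avoid adding a second handler for the same file on repeated calls
    already = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in parent.handlers
    )
    if not already:
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        parent.addHandler(handler)
    return parent
```

Loggers obtained with `logging.getLogger("lxf.something")` then write to the same file without any handler of their own.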

Required installations

    pip install setuptools
    pip install wheel
    pip install scikit-learn
    pip install matplotlib
    pip install tqdm
    pip install pytesseract
    pip install "pillow>=10.1.0"
    pip install jax==0.4.38
    pip install jaxlib==0.4.38
    pip install mediapipe
    pip install opencv-python
    pip install pandas
    pip install tensorrt
    pip install tensorrt-lean
    pip install tensorrt-dispatch
    pip install tensorflow
    pip install tf-keras
    pip install tensorflow-hub
    pip install torch
    pip install torchvision
    pip install torchaudio
    pip install sentence-transformers
    pip install "spacy[cuda12x]"
    pip install ocrmypdf
    pip install pdf2image
    pip install pdfplumber
    pip install langchain-community
    pip install langchain-ollama
    pip install langchain-openai
    pip install pymongo
    pip install openpyxl
    pip install easyocr
    pip install "docling[all]"
    python -m spacy download fr_core_news_lg

There may be a version conflict between the cuDNN builds required by TensorFlow and Torch. In that case, remove the nvidia-cudnn-cu12 package installed by pip:

pip uninstall nvidia-cudnn-cu12
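
Before uninstalling, it can help to see which cuDNN packages pip actually installed. The sketch below uses the standard `importlib.metadata` module; the helper name `find_distributions` is hypothetical.

```python
from importlib import metadata


def find_distributions(prefix="nvidia-cudnn"):
    """Return {name: version} for installed distributions whose name starts with prefix."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found
```

Printing `find_distributions()` lists every installed package whose name begins with `nvidia-cudnn`, together with its version.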

System prerequisites

Updates and required installations

    apt-get update 
    apt-get upgrade
    apt install software-properties-common -y
    apt-get install poppler-utils -y
    add-apt-repository ppa:alex-p/tesseract-ocr5
    apt-get install libc6 -y
    apt-get install tesseract-ocr -y
    apt-get install tesseract-ocr-fra -y
    apt-get install tesseract-ocr-eng -y
    apt-get install tesseract-ocr-ita -y
    apt-get install tesseract-ocr-spa -y
    apt-get install tesseract-ocr-deu -y
    apt-get install tesseract-ocr-cos -y
    apt-get install tesseract-ocr-lat -y
    apt-get install automake libtool -y
    apt-get install libleptonica-dev -y
    apt-get install ffmpeg libsm6 libxext6  -y
    apt-get install ocrmypdf -y    
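
After running the commands above, a quick check that the main executables are on the PATH can save debugging time later. This is a small sketch with the standard library; the tool list is an assumption based on the packages installed above (`pdftoppm` ships with poppler-utils), and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executables expected from the apt packages above (assumed list)
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "ocrmypdf", "ffmpeg")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means every listed executable was found.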

JBIG2

Installing the JBIG2 encoder

Most Linux distributions do not include a JBIG2 encoder since JBIG2 encoding was patented for a long time. All known JBIG2 US patents have expired as of 2017, but it is possible that unknown patents exist.

JBIG2 encoding is recommended for OCRmyPDF and is used to losslessly create smaller PDFs. If JBIG2 encoding is not available, lower quality CCITT encoding will be used for monochrome images.

JBIG2 decoding is not patented and is performed automatically by most PDF viewers. It is widely supported and has been part of the PDF specification since 2001.

JBIG2 encoding is automatically provided by these OCRmyPDF packages:

  • Docker image (both Ubuntu and Alpine)
  • Snap package
  • ArchLinux AUR package
  • Alpine Linux package
  • Homebrew on macOS

For all other platforms, you would need to build the JBIG2 encoder from source:

git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
[sudo] make install

Dependencies include libtoolize and libleptonica, which on Ubuntu systems are packaged as libtool and libleptonica-dev. On Fedora (35) they are packaged as libtool and leptonica-devel. For this to work, please make sure to install autotools, automake, libtool, pkg-config and leptonica first if not already installed. Other dependencies might be required depending on your system.

[sudo] apt install autotools-dev automake libtool libleptonica-dev pkg-config

Model downloads

SPACY

python -m spacy download fr_core_news_lg

GPU issue

If you see this message: Successful NUMA node read from SysFS had negative value (-1)

for a in /sys/bus/pci/devices/*; do echo 0 |  tee -a $a/numa_node; done

Usage examples

Classification

Code

import logging
import sys

from lxf.services.measure_time import measure_time_async
from lxf.services.try_safe import try_safe_execute_asyncio



from lxf.ai.classification.classifier import get_classification
from lxf.domain.predictions import  Predictions

import lxf.settings as settings 
from lxf.settings import set_logging_level, get_logging_level
set_logging_level(logging.DEBUG)
###################################################################

logger = logging.getLogger('test classifier')
fh = logging.FileHandler('./logs/test_classifier.log')
fh.setLevel(get_logging_level())
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
logger.setLevel(get_logging_level())
logger.addHandler(fh)
#################################################################

@measure_time_async
async def do_test(file_name) -> Predictions :
    """Classify the document at file_name (called with max_pages=10)."""
    return await get_classification(file_name=file_name,max_pages=10)


if __name__ == "__main__":
    sys.stdout.reconfigure(line_buffering=True) 
    pdf_path = "data/ODP.pdf"
    iban_pdf="data/RIBB.pdf"
    result = try_safe_execute_asyncio(logger=logger,func=do_test,file_name=iban_pdf) #asyncio.run(do_test(iban_pdf))
    print(result)    
    result = try_safe_execute_asyncio(logger=logger,func=do_test,file_name=pdf_path) #asyncio.run(do_test(pdf_path))
    print(result)

Code

import logging
import asyncio
import os
import sys



import lxf.settings as settings
from lxf.settings import set_logging_level, get_logging_level
set_logging_level(logging.DEBUG)
settings.enable_tqdm=False

from lxf.domain.loan import Pret
from lxf.extractors.finance import odp_extractor
from lxf.extractors.finance import iban_extractor

from lxf.services.try_safe import  try_safe_execute_async



###################################################################

logger = logging.getLogger('test_finance')
fh = logging.FileHandler('./logs/test_finance.log')
fh.setLevel(get_logging_level())
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
logger.setLevel(get_logging_level())
logger.addHandler(fh)
#################################################################

async def do_test_odp(file_path:str)->Pret:
    result = await try_safe_execute_async(logger,odp_extractor.extract_data,file_path=file_path)
    return result
    
async def do_test_iban(file_path:str)->str :
    """Run the IBAN extractor on the file at file_path and return its result."""
    result = await try_safe_execute_async(logger,iban_extractor.extract_data,file_path=file_path)
    return result

if __name__ == "__main__":
    sys.stdout.reconfigure(line_buffering=True) 
    pdf_path = "data/ODP.pdf"
    # pret:Pret=  asyncio.run(do_test_odp(file_path=pdf_path))
    # if pret!=None:
    #     print(pret.emprunteurs)
    iban_pdf="data/rib pm.pdf"
    txt = asyncio.run(do_test_iban(file_path=iban_pdf))
    print(txt)
    
