A package to find keywords in .pdf, .docx, .odt, and .rtf files, with support for multiple languages and the ability to run on multiple CPU cores

Reason this release was yanked:

Bug in calculating contingency tables of keyword frequency when fusion_keyword_before_after = True

Project description

English version

The find_keyword_xtvu Python package facilitates the search for keywords across PDF, DOCX, ODT, and RTF files, enabling the extraction of sentences that contain these keywords. It also offers support for multiple languages and can run on multicore CPUs.

What's New in Version 5.5.4

Bug fix: Fixed an issue where some document names couldn't be read correctly.
New argument fusion_keyword_before_after: Merges overlapping sentence excerpts to avoid redundancy in the results.
Multilingual support: The package now integrates spaCy's NLP models, so you can search for keywords and extract sentences in languages such as English, French, German, and Spanish. The supported models are listed in the spaCy documentation.

Installation

You can install this package via pip:

pip install find-keyword-xtvu==<latest_version_on_PyPi>
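
Because several releases of this package have been yanked (including 5.5.4), it can be worth checking which version is actually installed. The following is a minimal sketch using only the standard library; the helper names check_release and installed_version are hypothetical, and the YANKED set is copied from the release history on this page, so it may be out of date.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Releases marked "yanked" in the release history on this page (may be outdated).
YANKED = {"5.7.2", "5.6.6", "5.6.5.1", "5.6.5", "5.6.3", "5.5.4", "5.5.3", "5.5"}


def check_release(installed: str, yanked=YANKED) -> str:
    """Return a short status message for an installed version string."""
    if installed in yanked:
        return f"{installed} is yanked; consider upgrading"
    return f"{installed} is not in the yanked list"


def installed_version(dist_name: str = "find-keyword-xtvu") -> Optional[str]:
    """Return the installed version of the distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print("not installed" if found is None else check_release(found))
```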

Directory Structure

The directory organization containing the .py code and documents can be structured as follows:

/Parent Folder
│
├── script_principal.py     # The main Python script
│
├── fichiers_entre          # Folder containing subfolders of input files
│   ├── files1              # Subfolder containing input .pdf, .docx, .odt, and .rtf files
│   ├── files2          
│   ├── files3          
│   ...
└── resultats               # Folder containing the results
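
The layout above can be created with a few lines of Python. This is only a convenience sketch: create_layout is a hypothetical helper that is not part of the package, and the subfolder names are the example names from the tree.

```python
from pathlib import Path


def create_layout(parent, subfolders=("files1", "files2", "files3")):
    """Create the input subfolders and the results folder under `parent`."""
    parent = Path(parent)
    input_dir = parent / "fichiers_entre"
    for name in subfolders:
        # parents=True also creates fichiers_entre (and parent) if missing.
        (input_dir / name).mkdir(parents=True, exist_ok=True)
    (parent / "resultats").mkdir(parents=True, exist_ok=True)
    return input_dir, parent / "resultats"


if __name__ == "__main__":
    create_layout("Parent Folder")
```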

Usage

Place the files in the input directory:
- Place the PDF, DOCX, ODT, or RTF files you want to analyze into subfolders of the fichiers_entre directory. You can keep them in a single subfolder (e.g., files1) or spread them across several subfolders (files2, files3, etc.), depending on your needs.
Define the keywords:
- Open script_principal.py and set the keywords argument to the list of keywords you want to search for in the files.
Run the script:
- Run script_principal.py, for example from an IDE such as Visual Studio Code.

The script_principal.py file uses the find_keyword_xtvu package and can be organized as follows:

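from find_keyword_xtvu import find_keyword_xtvu

# Guard the entry point; this matters if worker processes re-import this module.
if __name__ == "__main__":
    find_keyword_xtvu(
        prefixe_langue="fr",                    # spaCy language prefix
        threads_rest=1,                         # threads left free for other tasks
        nb_phrases_avant=10,                    # sentences kept before the keyword
        nb_phrases_apres=10,                    # sentences kept after the keyword
        keywords=[""],                          # replace with your keywords
        taille=20,                              # maximum file size, in MB
        timeout=200,                            # maximum seconds per page
        result_keyword_table_name="",           # empty: default name "res"
        freque_document_keyword_table_name="",  # empty: default name "freque_document_keyword"
        fusion_keyword_before_after=False,      # True: merge overlapping excerpts
        tesseract_cmd="/usr/local/bin/tesseract",
        input_path="/path/to/fichiers_entre",
        output_path="/path/to/resultats",
    )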
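
Before launching a long run, it can help to preview which files are likely to be picked up. The sketch below is not the package's own logic: candidate_files is a hypothetical helper, the extension list comes from the description above, and treating 1 MB as 1024 × 1024 bytes for the taille limit is an assumption.

```python
from pathlib import Path

# File types the package description says are supported.
SUPPORTED = {".pdf", ".docx", ".odt", ".rtf"}


def candidate_files(input_path, max_mb=20):
    """Map each subfolder name to its supported files no larger than max_mb."""
    limit = max_mb * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
    result = {}
    for sub in sorted(p for p in Path(input_path).iterdir() if p.is_dir()):
        result[sub.name] = sorted(
            f.name
            for f in sub.iterdir()
            if f.is_file()
            and f.suffix.lower() in SUPPORTED
            and f.stat().st_size <= limit
        )
    return result


if __name__ == "__main__":
    for folder, names in candidate_files("/path/to/fichiers_entre").items():
        print(folder, names)
```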

Arguments

prefixe_langue: Language prefix that selects the spaCy language model to use (default value: 'fr'). For the supported languages and their prefixes, see the spaCy documentation. To use the multilingual model, specify multi; providing an unsupported prefix has the same effect. In both cases, the multilingual model xx_ent_wiki_sm is used.
threads_rest: Number of threads to reserve for other tasks (default value: 1).
nb_phrases_avant: Number of sentences to include before the keyword (default value: 10).
nb_phrases_apres: Number of sentences to include after the keyword (default value: 10).
keywords: List of keywords to search for (default: [""]).
taille: Maximum file size to process in megabytes (default value: 20 MB).
timeout: Maximum time for processing a page in seconds (default value: 200).
result_keyword_table_name: Name of the table for keyword results. If this field is empty, the default name res is used.
freque_document_keyword_table_name: Name of the table holding the contingency tables of keyword frequency for each folder of files. If this field is empty, the default name freque_document_keyword is used.
fusion_keyword_before_after: Boolean that controls whether redundant sentences are removed when a keyword appears several times in close proximity (default value: False). When set to True, the sentences surrounding a keyword are extracted only once, even if they overlap with the sentences surrounding another occurrence of the same keyword, which gives a more concise result. When set to False, the function extracts all sentences surrounding each occurrence, which may produce repeated sentences if the keyword appears frequently in the text.
tesseract_cmd: Path to the Tesseract executable (default value: "/usr/local/bin/tesseract").
input_path: Path to the folder containing the files to be processed.
output_path: Path to the folder where the results will be saved.
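
To illustrate what fusion_keyword_before_after describes, here is a small sketch of merging overlapping sentence windows. It is an illustration of the idea only, not the package's implementation; sentence_windows and merge_windows are hypothetical names, and sentences are represented simply by their indices.

```python
def sentence_windows(hits, before, after, total):
    """Inclusive (start, end) sentence-index window around each keyword hit."""
    return [(max(0, h - before), min(total - 1, h + after)) for h in hits]


def merge_windows(windows):
    """Merge windows that share at least one sentence index."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of repeating sentences.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Keyword found in sentences 5, 8 and 40 of a 50-sentence document,
    # with 2 sentences of context on each side.
    windows = sentence_windows([5, 8, 40], before=2, after=2, total=50)
    print(windows)                 # [(3, 7), (6, 10), (38, 42)]
    print(merge_windows(windows))  # [(3, 10), (38, 42)]
```

Without merging, sentences 6 and 7 would appear in two separate excerpts; with merging, each sentence is reported once.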

Outputs

The find_keyword_xtvu function will generate the following three Excel workbooks (.xlsx):

A file containing the results of the keywords found in the documents, with a name that can be defined by the result_keyword_table_name argument in the find_keyword_xtvu function.
A file containing the contingency tables of keyword frequency in the documents, with a name that can be defined by the freque_document_keyword_table_name argument in the find_keyword_xtvu function. Each contingency table shows how many times each keyword was found in each document within a specific folder. These tables are saved in different sheets within a single Excel workbook, with each sheet representing a folder.
A file listing problematic files, named heavy_or_slow_df.xlsx.
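
As a rough picture of what a contingency table of keyword frequency contains, the sketch below counts keyword occurrences per document for one folder. It is not the package's code: contingency_table is a hypothetical helper, and case-insensitive substring counting is an assumption about how matches are counted.

```python
from collections import Counter


def contingency_table(doc_sentences, keywords):
    """Return {document name: {keyword: occurrence count}} for one folder.

    doc_sentences maps a document name to the list of sentences extracted
    from it. Matching is case-insensitive substring counting (an assumption).
    """
    table = {}
    for doc, sentences in doc_sentences.items():
        counts = Counter()
        for sentence in sentences:
            lowered = sentence.lower()
            for kw in keywords:
                counts[kw] += lowered.count(kw.lower())
        table[doc] = {kw: counts[kw] for kw in keywords}
    return table


if __name__ == "__main__":
    folder = {
        "a.pdf": ["Soil carbon and soil water.", "No match here."],
        "b.docx": ["Carbon storage."],
    }
    for doc, row in contingency_table(folder, ["soil", "carbon"]).items():
        print(doc, row)
```

In the real output, each such table would correspond to one sheet of the workbook, one sheet per folder.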

Contribution

As the author of this library, I would like to thank Madame Sylvie HUET, researcher at LISC, INRAE, Centre Clermont-Auvergne-Rhône-Alpes, France, for her valuable contributions.

Contributions are welcome! If you would like to improve this project or if you have any questions, feel free to contact me at vuxuantung09134@gmail.com (in French, English, or Vietnamese).

License

This project is licensed under the MIT License. See the LICENSE file for details.

Release history

5.7.3.1 (Mar 4, 2025)
5.7.3 (Sep 12, 2024)
5.7.2.2 (Sep 8, 2024)
5.7.2 (Sep 7, 2024), yanked
5.7.1.2 (Sep 7, 2024)
5.7.1.1 (Sep 5, 2024)
5.7.1 (Sep 4, 2024)
5.7 (Sep 2, 2024)
5.7rc2 (Sep 1, 2024), pre-release
5.6.9 (Aug 30, 2024)
5.6.9a1 (Aug 30, 2024), pre-release
5.6.8.1 (Aug 29, 2024)
5.6.8 (Aug 28, 2024)
5.6.8rc2 (Aug 28, 2024), pre-release
5.6.7.1 (Aug 27, 2024)
5.6.7 (Aug 26, 2024)
5.6.6 (Aug 26, 2024), yanked. Reason this release was yanked: Bug when calculating word frequency with `fusion_keyword_before_after = True`
5.6.5.1 (Aug 26, 2024), yanked. Reason this release was yanked: Bug during the installation of scipy
5.6.5 (Aug 26, 2024), yanked
5.6.4 (Aug 25, 2024)
5.6.3 (Aug 24, 2024), yanked. Reason this release was yanked: Bug when reading PDF document
5.6.2 (Aug 23, 2024)
5.6.1 (Aug 22, 2024)
5.6 (Aug 22, 2024)
5.5.9 (Aug 22, 2024)
5.5.8 (Aug 21, 2024)
5.5.7 (Aug 21, 2024)
5.5.6 (Aug 21, 2024)
5.5.5 (Aug 20, 2024)
5.5.4 (Aug 19, 2024), this version, yanked. Reason this release was yanked: Bug in calculating contingency tables of keyword frequency when fusion_keyword_before_after = True
5.5.3 (Aug 18, 2024), yanked. Reason this release was yanked: Some document names can't be read or written correctly
5.5.2 (Aug 14, 2024)
5.5.1 (Aug 13, 2024)
5.5 (Aug 13, 2024), yanked. Reason this release was yanked: Error on [E030] Sentence boundaries unset when processing multilingual documents.

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

find_keyword_xtvu-5.5.4-py3-none-any.whl (12.4 kB)

Uploaded Aug 19, 2024 Python 3

File details

Details for the file find_keyword_xtvu-5.5.4-py3-none-any.whl.

File metadata

Download URL: find_keyword_xtvu-5.5.4-py3-none-any.whl
Upload date: Aug 19, 2024
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for find_keyword_xtvu-5.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea9d490cdf94247087f756afaac887834c5574f960a2e3a1d2ecaeda2cc323aa`
MD5	`a435548f2207a62156d80cca48444934`
BLAKE2b-256	`90a0176480ffde049347aa279e392cb8630ec4ee0b60ce0303b98cf44fe2036d`
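
A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; sha256_of and matches are hypothetical helper names, and the file path in the usage lines is a placeholder for wherever you saved the wheel.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for find_keyword_xtvu-5.5.4-py3-none-any.whl.
EXPECTED_SHA256 = "ea9d490cdf94247087f756afaac887834c5574f960a2e3a1d2ecaeda2cc323aa"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    wheel = Path("find_keyword_xtvu-5.5.4-py3-none-any.whl")  # placeholder path
    if wheel.exists():
        print("hash OK" if matches(wheel) else "hash MISMATCH")
```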
