A package for NLP in Spanish

Spanish NLP

Introduction

Spanish NLP is a Python library designed to facilitate Natural Language Processing tasks in Spanish. This library provides a suite of tools for text preprocessing, classification, and data augmentation, making it easier for researchers and developers to work with Spanish text data. The library is designed to be low-code, allowing users to quickly implement NLP pipelines with minimal effort.

Installation

Spanish NLP can be installed via pip:

pip install spanish-nlp

To install from source, clone the repository and install the package using pip:

git clone https://github.com/jorgeortizfuentes/spanish_nlp.git
cd spanish_nlp
pip install .
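
After either installation route, it can help to confirm which version ended up in the environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of Spanish NLP.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "spanish-nlp") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "spanish-nlp is not installed")
```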

Usage

Preprocessing

See more information in the Jupyter Notebook example

To preprocess text, import SpanishPreprocess, instantiate it with the desired options, and call transform on your text:

from spanish_nlp import SpanishPreprocess
sp = SpanishPreprocess(
        lower=False,
        remove_url=True,
        remove_hashtags=False,
        split_hashtags=True,
        normalize_breaklines=True,
        remove_emoticons=False,
        remove_emojis=False,
        convert_emoticons=False,
        convert_emojis=False,
        normalize_inclusive_language=True,
        reduce_spam=True,
        remove_vowels_accents=True,
        remove_multiple_spaces=True,
        remove_punctuation=True,
        remove_unprintable=True,
        remove_numbers=True,
        remove_stopwords=False,
        stopwords_list=None,
        lemmatize=False,
        stem=False,
        remove_html_tags=True,
)

test_text = """𝓣𝓮𝔁𝓽𝓸 𝓭𝓮 𝓹𝓻𝓾𝓮𝓫𝓪

<b>Holaaaaaaaa a todxs </b>, este es un texto de prueba :) a continuación les mostraré un poema de Roberto Bolaño llamado "Los perros románticos" 🤭👀😅

https://www.poesi.as/rb9301.htm

¡Me gustan los pingüinos! Sí, los PINGÜINOS 🐧🐧🐧 🐧 #VivanLosPinguinos #SíSeñor #PinguinosDelMundoUníos #ÑanduesDelMundoTambién

Si colaboras con este repositorio te puedes ganar $100.000 (en dinero falso). O tal vez 20 pingüinos. Mi teléfono es +561212121212"""

print(sp.transform(test_text, debug=False))

Output:

hola a todos este es un texto de prueba:) a continuacion los mostrare un poema de roberto bolaño llamado los perros romanticos 🤭 👀 😅
me gustan los pinguinos si los pinguinos 🐧 🐧 🐧 🐧 vivan los pinguinos si señor pinguinos del mundo unios ñandues del mundo tambien
si colaboras con este repositorio te puedes ganar en dinero falso o tal vez pinguinos mi telefono es
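
To give a rough idea of what a few of these options do, the sketch below reimplements three steps (URL removal, accent removal on vowels while keeping ñ, and whitespace collapsing) with the standard library only. It is an illustration under simplifying assumptions, not the library's implementation; the names URL_PATTERN, strip_vowel_accents and simple_clean are hypothetical.

```python
import re
import unicodedata

# Simplified URL matcher (assumption): anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def strip_vowel_accents(text: str) -> str:
    """Remove diacritics from vowels (á -> a, ü -> u) while leaving ñ untouched."""
    out = []
    for ch in text:
        if ch in "ñÑ":
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        # Only rewrite characters that are a vowel plus combining marks.
        if base in "aeiouAEIOU" and len(decomposed) > 1:
            out.append(base)
        else:
            out.append(ch)
    return "".join(out)


def simple_clean(text: str) -> str:
    """Apply the three illustrative steps in sequence."""
    text = URL_PATTERN.sub(" ", text)   # drop URLs
    text = strip_vowel_accents(text)    # pingüinos -> pinguinos
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()


print(simple_clean("¡Me gustan los pingüinos!   Más en https://www.poesi.as/rb9301.htm"))
# prints: ¡Me gustan los pinguinos! Mas en
```

Note that this sketch collapses line breaks into single spaces, whereas the output above keeps one line per paragraph.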

Classification

See more information in the Jupyter Notebook example

Available classifiers

Hate Speech (hate_speech)
Incivility (incivility)
Toxic Speech (toxic_speech)
Sentiment Analysis (sentiment_analysis)
Emotion Analysis (emotion_analysis)
Irony Analysis (irony_analysis)
Sexist Analysis (sexist_analysis)
Racism Analysis (racism_analysis)

Classification Example

from spanish_nlp import SpanishClassifier

sc = SpanishClassifier(model_name="hate_speech", device='cpu')
# DISCLAIMER: The following message is merely an example of hate speech and does not represent the views of the author or contributors.
t1 =  "LAS MUJERES Y GAYS DEBERIAN SER EXTERMINADOS"
t2 = "El presidente convocó a una reunión a los representantes de los partidos políticos"
p1 = sc.predict(t1)
p2 = sc.predict(t2)

print("Text 1: ", t1)
print("Prediction 1: ", p1)
print("Text 2: ", t2)
print("Prediction 2: ", p2)

Output:

Text 1:  LAS MUJERES Y GAYS DEBERIAN SER EXTERMINADOS
Prediction 1:  {'hate_speech': 0.7544152736663818, 'not_hate_speech': 0.24558477103710175}
Text 2:  El presidente convocó a una reunión a los representantes de los partidos políticos
Prediction 2:  {'not_hate_speech': 0.9793208837509155, 'hate_speech': 0.02067909575998783}
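
As the output shows, each prediction is a dictionary mapping labels to scores. A small post-processing sketch follows; the helpers top_label and is_flagged and the 0.5 threshold are assumptions made for illustration and are not part of the library.

```python
from typing import Dict, Tuple


def top_label(prediction: Dict[str, float]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    label = max(prediction, key=prediction.get)
    return label, prediction[label]


def is_flagged(prediction: Dict[str, float], label: str = "hate_speech",
               threshold: float = 0.5) -> bool:
    """True when the score for `label` reaches the (assumed) threshold."""
    return prediction.get(label, 0.0) >= threshold


# Scores copied from the example output above.
p1 = {"hate_speech": 0.7544152736663818, "not_hate_speech": 0.24558477103710175}
p2 = {"not_hate_speech": 0.9793208837509155, "hate_speech": 0.02067909575998783}

print(top_label(p1))   # ('hate_speech', 0.7544152736663818)
print(is_flagged(p2))  # False
```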

Augmentation

See more information in the Jupyter Notebook example

Available Augmentation Models

Spelling augmentation
- Keyboard spelling method
- OCR spelling method
- Random spelling replace method
- Grapheme spelling
- Word spelling
- Remove punctuation
- Remove spaces
- Remove accents
- Lowercase
- Uppercase
- Randomcase
- All method
Masked augmentation
- Substitute method (method="sustitute")
- Insert method
Other models under development (such as Synonyms, WordEmbeddings, GenerativeOpenSource, GenerativeOpenAI, BackTranslation, AbstractiveSummarization)

Augmentation Models Examples

from spanish_nlp import augmentation

ocr = augmentation.Spelling(method="ocr",
                            stopwords="default",
                            aug_percent=0.3,
                            tokenizer="default")

grapheme_spelling = augmentation.Spelling(method="grapheme_spelling",
                                          stopwords="default",
                                          aug_percent=0.3,
                                          tokenizer="default")

masked_sustitute = augmentation.Masked(method="sustitute",
                                       model="dccuchile/bert-base-spanish-wwm-cased",
                                       tokenizer="default",
                                       stopwords="default",
                                       aug_percent=0.4,
                                       device="cpu",
                                       top_k=10)


text = "En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía ese sueño lo demás no importaba. Ni trabajar ni rezar ni estudiar en la madrugada junto a los perros románticos."

new_texts = [text]
new_texts.append(ocr.augment(text, num_samples=1, num_workers=1))
new_texts.append(grapheme_spelling.augment(text, num_samples=1, num_workers=1))
new_texts.append(masked_sustitute.augment(text, num_samples=1))

for t in new_texts:
    print(t)
    print("---")

Output:

En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía ese sueño lo demás no importaba. Ni trabajar ni rezar ni estudiar en la madrugada junto a los perros románticos.
---
['En a9uel tiempo yo tenía veint3 años y e8ta8a 1oco. Había Rerd1dQ un RaíB pePQ había ganado Vn su3ño. Y si tenía es3 BVeno lo 0emáB n0 iWRQPtaEa. N1 trabajar ni rezar ni 3s7ud1ar en la maOrVga0a junto a 1os p3rPo8 Pománt1Go5.']
---
['Em akel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero  abía janado um sueño. Y si temía ese sueño lo demás no importava. Ni trabajar ni rezar ni estudiar em la nadrugada junto a los perros románticos.']
---
['En aquel tiempo yo tenía veinte años y estaba loco. Había perdido un país pero había ganado un sueño. Y si tenía mi sueño lo demás no importaba. ni trabajar ni rezar ni estudiar en la madrugada junto a los clubes románticos.']
---
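
To make the idea behind the OCR spelling method more concrete, here is a toy sketch that swaps characters for visually similar ones in a fraction of the words. The confusion table, the function name ocr_like_augment and the seed parameter are all assumptions for illustration; the library's actual table and sampling strategy may differ.

```python
import random
from typing import Dict, Optional

# Illustrative confusion pairs only (assumption, not the library's table).
OCR_CONFUSIONS: Dict[str, str] = {"o": "0", "l": "1", "e": "3", "s": "5", "q": "9"}


def ocr_like_augment(text: str, aug_percent: float = 0.3,
                     seed: Optional[int] = None) -> str:
    """Replace confusable characters in roughly aug_percent of the words."""
    rng = random.Random(seed)
    words = text.split(" ")
    # Only words containing at least one confusable character can be altered.
    candidates = [i for i, w in enumerate(words)
                  if any(c in OCR_CONFUSIONS for c in w)]
    n_to_change = min(len(candidates), round(len(words) * aug_percent))
    for i in rng.sample(candidates, n_to_change):
        words[i] = "".join(OCR_CONFUSIONS.get(c, c) for c in words[i])
    return " ".join(words)


original = "Había perdido un país pero había ganado un sueño"
print(ocr_like_augment(original, aug_percent=0.3, seed=0))
```

Passing a fixed seed makes the sketch reproducible, which is convenient when comparing augmented datasets.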

Spell Checking

See more information in the Jupyter Notebook example

Available Spell Checking Methods

Dictionary-based (dictionary): Uses pyspellchecker for suggestions based on edit distance.
Contextual Language Model (contextual_lm): Uses transformer models for context-aware corrections (not yet implemented).

Spell Checking Example (Dictionary Method)

from spanish_nlp import SpanishSpellChecker

# Initialize with the dictionary method (default)
checker = SpanishSpellChecker(method="dictionary")

text_with_errors = "Ola komo stas? Esto es una prueva."

# Find potential errors
errors = checker.find_errors(text_with_errors)
print(f"Potential Errors: {errors}")

# Get suggestions for a word
suggestions = checker.suggest("prueva")
print(f"Suggestions for 'prueva': {suggestions}")

# Correct a single word
corrected_word = checker.correct_word("komo")
print(f"Correction for 'komo': {corrected_word}")

# Correct the entire text
corrected_text = checker.correct_text(text_with_errors)
print(f"Corrected Text: {corrected_text}")

# Initialize with custom distance
checker_strict = SpanishSpellChecker(method="dictionary", distance=1)
print(f"Strict suggestions for 'pruevs': {checker_strict.suggest('pruevs')}")

# Initialize with custom dictionary words
checker_custom = SpanishSpellChecker(method="dictionary", custom_dictionary=["levenshtein"])
print(f"Is 'levenshtein' correct? {checker_custom.is_correct('levenshtein')}")

Output:

Potential Errors: ['stas', 'komo', 'prueva']
Suggestions for 'prueva': ['prueba']
Correction for 'komo': como
Corrected Text: ola como estas? esto es una prueba.
Strict suggestions for 'pruevs': []
Is 'levenshtein' correct? True
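
The dictionary method relies on edit distance, so the empty result for 'pruevs' with distance=1 is expected: it is two edits away from 'prueba'. The toy sketch below illustrates the idea with a hand-written Levenshtein function and a tiny vocabulary. It is not how pyspellchecker or this library is implemented; levenshtein, suggest and TOY_VOCABULARY are hypothetical names.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word: str, vocabulary: Iterable[str], distance: int = 2) -> List[str]:
    """Return vocabulary words within `distance` edits, closest first."""
    scored = [(levenshtein(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= distance]


# Tiny stand-in for a real Spanish word list (assumption).
TOY_VOCABULARY = ["prueba", "como", "hola", "estas", "esto"]

print(suggest("prueva", TOY_VOCABULARY))              # ['prueba']
print(suggest("pruevs", TOY_VOCABULARY, distance=1))  # []
```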

License

Spanish NLP is licensed under the GNU General Public License v3.0.

Author

This project was developed by Jorge Ortiz-Fuentes, Linguist and Data Scientist from Chile.

Acknowledgements

We would like to express our gratitude to the Millennium Institute For Foundational Research and the Department of Computer Science at the University of Chile for supporting the development of Spanish NLP. Special thanks to Felipe Bravo-Marquéz, Ricardo Cordova and Hernán Sarmiento for their knowledge, support and invaluable contributions to the project.

Contributing

Contributions to Spanish NLP are welcome! Please see the Developer Guide (CONTRIBUTING.md) for details on the contribution workflow, versioning, and publishing process.

To contribute to the project, please follow these steps:

1. Create a new branch for your feature or bug fix.
2. Make your changes and commit them with clear messages.
3. Push your changes to your fork.
4. Submit a pull request to the main repository.

Citation

If you use Spanish NLP in your research, please cite it as follows:

@misc{spanish_nlp,
  author = {Jorge Ortiz-Fuentes},
  title = {Spanish NLP: A Python library for Natural Language Processing in Spanish},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jorgeortizfuentes/spanish_nlp}},
}

Contact

For any questions or inquiries, please contact Jorge Ortiz-Fuentes.

Disclaimer

The hate speech example provided in the classification section is for demonstration purposes only and does not reflect the views of the author or contributors.

Release history

- 0.4.0 (Apr 26, 2025), this version
- 0.3.1 (Jan 20, 2025)
- 0.3.0 (Jan 19, 2025)
- 0.2.11 (Apr 24, 2023)
- 0.2.10 (Apr 24, 2023)
- 0.2.9 (Apr 11, 2023)
- 0.2.8 (Apr 10, 2023)
- 0.2.7 (Mar 26, 2023)
- 0.2.6 (Mar 2, 2023)
- 0.2.5 (Mar 1, 2023)
- 0.2.4 (Mar 1, 2023)
- 0.2.3 (Feb 28, 2023)
- 0.2.2 (Feb 26, 2023)
- 0.2.1 (Feb 26, 2023)
- 0.2.0 (Feb 26, 2023)
- 0.1.12 (Feb 26, 2023)
- 0.1.11 (Feb 26, 2023)
- 0.1.10 (Feb 26, 2023)
- 0.1.9 (Feb 25, 2023)

Download files

Source Distribution

spanish_nlp-0.4.0.tar.gz (40.7 kB)

Uploaded Apr 26, 2025 Source

Built Distribution

spanish_nlp-0.4.0-py3-none-any.whl (46.0 kB)

Uploaded Apr 26, 2025 Python 3

File details

Details for the file spanish_nlp-0.4.0.tar.gz.

File metadata

Download URL: spanish_nlp-0.4.0.tar.gz
Upload date: Apr 26, 2025
Size: 40.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.22

File hashes

Hashes for spanish_nlp-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`2e9334734c8f0fc9926902475c960eb44f58b6e1a656e821c5985f06e6d2276f`
MD5	`5d2145541163a1bdd9bd1f9321aed9e1`
BLAKE2b-256	`847252b6857e0a515080ebdeeb7bbf583006277b3b537cc776d25b85a6a9a720`
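
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch with the standard library; the helper names and the assumed local file location are illustrative and not part of any official tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for spanish_nlp-0.4.0.tar.gz.
EXPECTED_SDIST_SHA256 = "2e9334734c8f0fc9926902475c960eb44f58b6e1a656e821c5985f06e6d2276f"

if __name__ == "__main__":
    archive = Path("spanish_nlp-0.4.0.tar.gz")  # assumed local download location
    if archive.exists():
        print(matches_published_hash(archive, EXPECTED_SDIST_SHA256))
    else:
        print("archive not found; download it first")
```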

File details

Details for the file spanish_nlp-0.4.0-py3-none-any.whl.

File metadata

Download URL: spanish_nlp-0.4.0-py3-none-any.whl
Upload date: Apr 26, 2025
Size: 46.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.22

File hashes

Hashes for spanish_nlp-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2fc6dbaf2b1077fc310438cddef634f7674446d4c8369b5659ee031bf2cf21bc`
MD5	`097a96deb284ef577dccc0e215cbdabf`
BLAKE2b-256	`f2a0b84d803bc5b99fef46de4e7d8168d6665777555a3980d79f8a877aa3e534`
