Classifies the title and abstract of Brazilian theses/dissertations into a CAPES broad evaluation area (grande área), from the text (Portuguese).

Maintainers

rene.veloso

Project description

texto2area

Python library that classifies the title and/or abstract of a Brazilian thesis/dissertation into one of the 9 CAPES broad evaluation areas (grandes áreas), from the text alone. It works end to end: it takes raw Portuguese text and performs normalization, lemmatization, n-gram injection, and classification.

from texto2area import classificar

area, margens, termos = classificar(
    "A atuação do enfermeiro no cuidado ao paciente idoso em saúde coletiva."
)
# area    -> 'CIÊNCIAS DA SAÚDE'
# margens -> [('CIÊNCIAS DA SAÚDE', 1.28), ('CIÊNCIAS BIOLÓGICAS', -1.0), ...]
# termos  -> ['saúde_coletivo', 'enfermeiro', 'paciente', 'cuidado']
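The margins returned with the predicted area can also be used to flag uncertain predictions. The following is a minimal sketch, assuming `margens` is a list of `(area, margin)` pairs as in the example output above; the helper name `margin_gap` and the 0.5 threshold are illustrative and not part of texto2area.

```python
def margin_gap(margins):
    """Return the gap between the best and second-best class margins."""
    ranked = sorted(margins, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return float("inf")
    return ranked[0][1] - ranked[1][1]


# Values copied from the example output above (first two classes only).
margins = [("CIÊNCIAS DA SAÚDE", 1.28), ("CIÊNCIAS BIOLÓGICAS", -1.0)]
gap = margin_gap(margins)

# Hypothetical threshold; tune it on your own validation data.
is_confident = gap >= 0.5
```

A small gap suggests the text sits between two areas, which may be worth a manual review.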

Installation

pip install git+https://github.com/reneveloso/texto2area
python -m spacy download pt_core_news_lg     # lemmatization model (~568 MB), required

Or, from a clone: pip install .
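Because the spaCy model is a separate, required download, it can help to check the environment before classifying. A minimal sketch using only the standard library; it relies on the fact that spaCy models installed with `spacy download` are regular importable packages, and the helper name `has_module` is illustrative.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


# Both the library and the lemmatization model must be importable.
missing = [m for m in ("texto2area", "pt_core_news_lg") if not has_module(m)]
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Environment looks ready.")
```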

Intended use and domain of validity

Language: Portuguese. For text in another language, translate it first (translation is not built in, since it would require a heavy model).
Domain: theses and dissertations (title/abstract), 2013–2024 (Catálogo Sucupira). Outside this scope (other genres, other taxonomies), performance is not guaranteed.
Output: broad area + per-class margins (decision_function) + the terms in the text that weighed most in the decision (interpretability of the linear model).
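To illustrate why a linear model is interpretable: each class margin is a sum of per-term contributions (feature value times class weight), so terms can be ranked by that product. The sketch below is the standard approach for linear models, not necessarily how texto2area ranks its terms; the function name `top_terms` and all numbers are invented for illustration.

```python
def top_terms(features, weights, k=3):
    """Rank terms by contribution (feature value x class weight) to a linear margin."""
    contributions = {
        term: value * weights.get(term, 0.0) for term, value in features.items()
    }
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Toy TF-IDF values and class weights, invented for illustration only.
features = {"enfermeiro": 0.5, "paciente": 0.4, "cuidado": 0.3, "modelo": 0.2}
weights_health = {"enfermeiro": 2.0, "paciente": 1.5, "cuidado": 1.0, "modelo": -0.5}

best = top_terms(features, weights_health)

# The class margin (ignoring the bias term) is the sum of all contributions.
margin = sum(v * weights_health.get(t, 0.0) for t, v in features.items())
```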

How it works (faithful to the training pipeline)

Normalization (NFC, cleanup, whitespace collapsing).
Lemmatization with spaCy pt_core_news_lg, keeping content POS tags (NOUN, PROPN, ADJ, VERB), lowercased lemmas, len>=2.
N-gram injection: adjacent bi/trigrams joined by _, keeping only those present in the model vocabulary (85% of the 359,402 features are n-grams).
TF-IDF (sublinear_tf, min_df=50) + LinearSVC (one-vs-rest, class_weight='balanced').
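The normalization and n-gram injection steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: "cleanup" is taken here to mean replacing control and format characters with spaces, unigrams are assumed to be kept alongside the injected n-grams, and the function names and toy vocabulary are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, replace control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(
        ch if unicodedata.category(ch)[0] != "C" else " " for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()


def inject_ngrams(lemmas, vocabulary):
    """Append adjacent bi/trigrams joined by '_' that exist in the vocabulary."""
    out = list(lemmas)
    for n in (2, 3):
        for i in range(len(lemmas) - n + 1):
            gram = "_".join(lemmas[i:i + n])
            if gram in vocabulary:
                out.append(gram)
    return out


# Lemmas as a lemmatizer might produce them; the vocabulary is a toy stand-in.
lemmas = ["enfermeiro", "cuidado", "paciente", "idoso", "saúde", "coletivo"]
vocabulary = {"saúde_coletivo", "paciente_idoso"}
tokens = inject_ngrams(lemmas, vocabulary)
```

Filtering n-grams against the model vocabulary keeps the token stream aligned with the features the classifier actually has weights for.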

Performance (evaluation on held-out sets)

Protocol	Accuracy	Macro-F1	Baseline (majority class)
In-distribution (80/20 split)	0.792	0.791	0.165
Out-of-time (train ≤2023, test 2024)	0.729	0.732	0.170

Per-area F1 ranges from Linguística/Letras/Artes (0.879) and Saúde (0.860) down to Multidisciplinar (0.549), a diffuse class with no vocabulary of its own. The distributed artifact is trained on 100% of the data (1,017,727 documents); the metrics come from the evaluation protocols above.
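For readers unfamiliar with the two reference numbers, the sketch below computes macro-F1 (the unweighted mean of per-class F1) and a majority baseline on toy labels. It assumes that the baseline column denotes the accuracy of always predicting the most frequent class; the labels are invented and the function names are illustrative.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]

score = macro_f1(y_true, y_pred)
baseline = majority_baseline_accuracy(y_true)
```

Because every class counts equally, a weak class such as Multidisciplinar pulls macro-F1 down regardless of how many documents it contains.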

Limitations

The administrative label is used as ground truth (some of the "errors" reflect genuine interdisciplinarity).
Multidisciplinar is poorly separable (F1 0.549).
Does not distinguish the ~49 fine-grained evaluation areas (only the 9 broad areas).
Portuguese only (no built-in translation).

Security

The model is loaded with joblib, which is pickle-based and can execute arbitrary code during deserialization. Use only the artifacts versioned in this repository or obtained from a trusted source.
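One way to act on this advice is to pin a SHA-256 digest for the artifact and check it before deserializing. A minimal sketch using only the standard library; the function names, the file path, and the pinned digest in the usage comment are hypothetical, and texto2area may load its model differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Raise ValueError unless the file matches the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Digest mismatch for {path}: {actual}")
    return Path(path)


# Usage (hypothetical path and digest; deserialize only after verification):
# model_path = verify_artifact("modelo.joblib", "<pinned sha256>")
# model = joblib.load(model_path)
```

The same helper can check a downloaded distribution against the SHA256 digests listed under File hashes.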

Reproduction

See reproduzir/REPRODUCAO.md. Training is deterministic (SEED=42); the corpus (Sucupira) is not redistributed.

How to cite

See CITATION.cff. Zenodo DOI to be added.

License

Release history

0.1.1 (Jun 30, 2026)

0.1.0 (Jun 30, 2026), this version

Download files

Source Distribution

texto2area-0.1.0.tar.gz (29.9 MB)

Uploaded Jun 30, 2026 Source

Built Distribution

texto2area-0.1.0-py3-none-any.whl (29.9 MB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file texto2area-0.1.0.tar.gz.

File metadata

Download URL: texto2area-0.1.0.tar.gz
Upload date: Jun 30, 2026
Size: 29.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for texto2area-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b3ce026e91e156f17c2931b5f677f873703199d31954997333e71a5ad94364a2`
MD5	`223275a0c73ea048c49365b42f1ed594`
BLAKE2b-256	`3d0f0eb9dec5a2a667b1df2738716c30deee69b3721dc330f327625ef661ce79`

Provenance

The following attestation bundles were made for texto2area-0.1.0.tar.gz:

Publisher: publish.yml on reneveloso/texto2area

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: texto2area-0.1.0.tar.gz
- Subject digest: b3ce026e91e156f17c2931b5f677f873703199d31954997333e71a5ad94364a2
- Sigstore transparency entry: 2026385630
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: reneveloso/texto2area@d5c2d5f6dcfed0217a4ab42c2fd0a1fdd3baddbb
- Branch / Tag: refs/heads/main
- Owner: https://github.com/reneveloso
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5c2d5f6dcfed0217a4ab42c2fd0a1fdd3baddbb
- Trigger Event: workflow_dispatch

File details

Details for the file texto2area-0.1.0-py3-none-any.whl.

File metadata

Download URL: texto2area-0.1.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 29.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for texto2area-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`79afc16ce7d4b3c71aa11bda8f0b06b441c72b1a50b20298e1bc07118ea9f647`
MD5	`c0cfc087c78f13b22fc902b0dc394de4`
BLAKE2b-256	`46b3f7104e8f5772441dd8f03e0b739f6784aafd965e37f7a09cc81fcde2083a`

Provenance

The following attestation bundles were made for texto2area-0.1.0-py3-none-any.whl:

Publisher: publish.yml on reneveloso/texto2area

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: texto2area-0.1.0-py3-none-any.whl
- Subject digest: 79afc16ce7d4b3c71aa11bda8f0b06b441c72b1a50b20298e1bc07118ea9f647
- Sigstore transparency entry: 2026385720
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: reneveloso/texto2area@d5c2d5f6dcfed0217a4ab42c2fd0a1fdd3baddbb
- Branch / Tag: refs/heads/main
- Owner: https://github.com/reneveloso
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5c2d5f6dcfed0217a4ab42c2fd0a1fdd3baddbb
- Trigger Event: workflow_dispatch

texto2area 0.1.0
