Preprocessing and Extraction of Linguistic Information for Computational Analysis
Project description
pelican_nlp stands for “Preprocessing and Extraction of Linguistic Information for Computational Analysis - Natural Language Processing”. This package enables the creation of standardized and reproducible language processing pipelines, extracting linguistic features from various tasks like discourse, fluency, and image descriptions.
Installation
Create conda environment
conda create --name pelican-nlp --channel defaults python=3.10
Activate environment
conda activate pelican-nlp
Install the package using pip:
pip install pelican-nlp
Usage
To run pelican_nlp, you need a configuration.yml file in your main project directory. This file defines the settings and parameters used for your project.
Sample configuration files are available here: https://github.com/ypauli/pelican_nlp/tree/main/examples
Adapt a sample configuration to your needs.
Save your personalized configuration.yml in the root of your project directory.
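As a rough illustration of what such a file might look like, here is a minimal sketch. The key names below are hypothetical placeholders, not the actual pelican_nlp schema; consult the sample configurations linked above for the real options.

```yaml
# Illustrative sketch only: these key names are hypothetical,
# not the real pelican_nlp configuration schema.
task: fluency            # which task pipeline to run
language: en             # language of the input transcripts
cleaning:
  remove_timestamps: true
  lowercase: true
```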
Running pelican_nlp
You can run pelican_nlp via the command line or a Python script.
From the command line:
Navigate to your project directory (must contain your participants/ folder and configuration.yml), then run:
conda activate pelican-nlp
pelican-run
To optimize performance, close other programs and limit GPU usage during language processing.
Data Format Requirements: LPDS
For reliable operation, your data must follow the Language Processing Data Structure (LPDS), inspired by brain imaging data structures like BIDS.
Main Concepts (Quick Guide)
Project Root: Contains a participants/ folder plus optional files like participants.tsv, dataset_description.json, and README.
Participants: Each participant has a folder named part-<ID> (e.g., part-01).
Sessions (Optional): For longitudinal studies, use ses-<ID> subfolders inside each participant folder.
Tasks/Contexts: Each session (or directly in the participant folder for non-longitudinal studies) includes subfolders for specific tasks (e.g., interview, fluency, image-description).
Data Files: Named with structured metadata, e.g.: part-01_ses-01_task-fluency_cat-semantic_acq-baseline_transcript.txt
Filename Structure
Filenames follow this format:
part-<id>[_ses-<id>]_task-<label>[_<key>-<value>...][_suffix].<extension>
Required Entities: part, task
Optional Entities Examples: ses, cat, acq, proc, metric, model, run, group, param
Suffix Examples: transcript, audio, embeddings, logits, annotations
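The filename grammar above can be sketched as a small parser. This is a hedged illustration based only on the pattern and examples shown here, not a utility shipped with pelican_nlp:

```python
import re

# Regex mirroring the LPDS filename pattern:
# part-<id>[_ses-<id>]_task-<label>[_<key>-<value>...][_suffix].<extension>
LPDS_PATTERN = re.compile(
    r"^part-(?P<part>[A-Za-z0-9]+)"
    r"(?:_ses-(?P<ses>[A-Za-z0-9]+))?"
    r"_task-(?P<task>[A-Za-z0-9]+)"
    r"(?P<extras>(?:_[a-z]+-[A-Za-z0-9]+)*)"   # optional key-value entities
    r"(?:_(?P<suffix>[A-Za-z0-9]+))?"          # optional suffix (no hyphen)
    r"\.(?P<ext>[A-Za-z0-9.]+)$"
)

def parse_lpds_filename(name):
    """Split an LPDS-style filename into entities, suffix, and extension."""
    m = LPDS_PATTERN.match(name)
    if m is None:
        raise ValueError(f"Not a valid LPDS filename: {name}")
    entities = {"part": m.group("part"), "task": m.group("task")}
    if m.group("ses"):
        entities["ses"] = m.group("ses")
    extras = m.group("extras")
    for pair in (extras.lstrip("_").split("_") if extras else []):
        key, _, value = pair.partition("-")
        entities[key] = value
    return entities, m.group("suffix"), m.group("ext")
```

For example, `parse_lpds_filename("part-01_ses-01_task-fluency_cat-semantic_acq-baseline_transcript.txt")` yields the entity dictionary, the suffix `transcript`, and the extension `txt`.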
Example Project Structure
```
my_project/
├── participants/
│   ├── part-01/
│   │   └── ses-01/
│   │       └── interview/
│   │           └── part-01_ses-01_task-interview_transcript.txt
│   └── part-02/
│       └── fluency/
│           └── part-02_task-fluency_audio.wav
├── configuration.yml
├── dataset_description.json
├── participants.tsv
└── README.md
```
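For illustration, the example layout above could be scaffolded with Python's pathlib. This is a sketch, not a pelican_nlp utility; the placeholder files are created empty:

```python
from pathlib import Path

def scaffold_project(root):
    """Create the minimal LPDS example layout with empty placeholder files."""
    root = Path(root)
    files = [
        "participants/part-01/ses-01/interview/"
        "part-01_ses-01_task-interview_transcript.txt",
        "participants/part-02/fluency/part-02_task-fluency_audio.wav",
        "configuration.yml",
        "dataset_description.json",
        "participants.tsv",
        "README.md",
    ]
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)  # build directory tree
        path.touch()                                    # empty placeholder file
    return root
```

Running `scaffold_project("my_project")` produces exactly the tree shown above, ready for your real transcripts and audio files.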
Features
- Feature 1: Cleaning text files
Handles whitespace, timestamps, punctuation, special characters, and letter casing.
- Feature 2: Linguistic Feature Extraction
Extracts semantic embeddings, logits, distance from optimality, perplexity, and semantic similarity.
- Feature 3: Acoustic Feature Extraction
Extracts prosogram and openSMILE features.
Examples
Example setups are available in the examples folder of the GitHub repository: https://github.com/ypauli/pelican_nlp/tree/main/examples
Contributing
Contributions are welcome! Please check out the contributing guide.
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. See the LICENSE file for details.
Citation
If you use this project, please cite:
Pauli Y, Marsman J-B, Rabe F, et al. Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis. arXiv preprint arXiv:2511.15512 [cs.CL] 2025. https://doi.org/10.48550/arXiv.2511.15512