Context tree estimation using the Smallest Maximizer Criterion (SMC)
Smallest Maximizer Criterion
We introduce a new criterion for consistently selecting the probabilistic context tree that generated a sample. The basic idea is to construct a totally ordered set of candidate trees. This set is composed of the "champion trees", those that maximize the likelihood of the sample for each number of degrees of freedom. The smallest maximizer criterion selects the infimum of the subset of champion trees whose gain in likelihood is negligible. In addition, we propose a new resampling-based algorithm to implement this criterion.
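The selection step can be illustrated with a minimal sketch. It assumes the champion trees have already been computed and are summarized as `(num_contexts, log_likelihood)` pairs, and it replaces the resampling-based procedure with a fixed tolerance `eps` on the per-symbol likelihood gain. The function name, the tolerance and the numbers below are hypothetical and are not part of this package's API.

```python
def select_smallest_maximizer(champions, sample_size, eps=0.005):
    """Return the smallest champion tree whose log-likelihood is within
    eps * sample_size of the largest champion tree's log-likelihood.

    champions: list of (num_contexts, log_likelihood) pairs, one per champion tree.
    eps: hypothetical tolerance standing in for the resampling-based test.
    """
    ordered = sorted(champions)      # champion trees are totally ordered by size
    _, best_loglik = ordered[-1]     # assumed: the largest champion has the highest likelihood
    for num_contexts, loglik in ordered:
        gain_per_symbol = (best_loglik - loglik) / sample_size
        if gain_per_symbol < eps:    # larger trees add only a negligible gain
            return num_contexts, loglik
    return ordered[-1]               # reached only if eps <= 0


# Made-up log-likelihoods for five champion trees fitted to 1000 symbols.
champions = [(1, -690.0), (2, -620.0), (3, -612.0), (5, -611.5), (8, -611.0)]
chosen = select_smallest_maximizer(champions, sample_size=1000)
print(chosen)
```

In this toy input the likelihood rises sharply up to the tree with 3 contexts and is almost flat afterwards, so that tree is returned. A fixed threshold is only a stand-in: the algorithm in this project decides what counts as negligible by resampling.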
This study was motivated by the linguistic challenge of retrieving rhythmic features from written texts. Applied to a data set consisting of texts extracted from daily newspapers, our algorithm identifies different context trees for European Portuguese and Brazilian Portuguese. This is compatible with the long-standing conjecture that European Portuguese and Brazilian Portuguese belong to different rhythmic classes. Moreover, these context trees have several interesting properties that are linguistically meaningful.
Requirements
Python 3.8
Installation
`pip install -r requirements.txt`
Examples
Estimation by pruning
Run `python3 examples/estimation_by_pruning.py`
Citing
Please cite the following publication when using this algorithm:
Galves, Antonio & Galves, Charlotte & Garcia, Jesus & Garcia, Nancy & Leonardi, Florencia. (2012). Context tree selection and linguistic rhythm retrieval from written texts. The Annals of Applied Statistics, 6(1), 186-209. doi:10.1214/11-AOAS511.
BibTeX version:
@article{article,
author = {Galves, Antonio and Galves, Charlotte and Garcia, Jesus and Garcia, Nancy and Leonardi, Florencia},
year = {2012},
pages = {186--209},
title = {Context tree selection and linguistic rhythm retrieval from written texts},
volume = {6},
number = {1},
journal = {The Annals of Applied Statistics},
doi = {10.1214/11-AOAS511}
}
Running tests
Run `pytest -s`
License
- The code in this repository is licensed under the GNU General Public License v3.0.
Acknowledgement
This implementation was produced as part of the activities of the FAPESP Research, Innovation and Dissemination Center for Neuromathematics (grant # 2020/04807-0, São Paulo Research Foundation).
Universidade de São Paulo
Instituto de Matemática e Estatística
Research, Innovation and Dissemination Center for Neuromathematics - NeuroMat
2020
Source Distribution
File details
Details for the file `g4l-smc-0.0.1.tar.gz`.
File metadata
- Download URL: g4l-smc-0.0.1.tar.gz
- Upload date:
- Size: 2.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.0.0.post20201207 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.8.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | a8960400b52ba157d74ca24ef8dd1876740e8e6029cb17e2fcbf4b3099df00a5
MD5 | fd66a145e31281adae89ef694aeb5ef3
BLAKE2b-256 | 90e6c62f36ca453b1c9625ffac8718f8e18b2a1c6edc9f9afecc937da517cc60