Classify American vs. British English
American-British-variety-classifier
A minimalistic spelling- and vocabulary-based American-vs-British variety classifier.
Description
This classifier is based on the VarCon database. The database was first read and used in its entirety, and was then progressively pruned to improve performance.
The classifier performs rudimentary preprocessing (deleting some stray characters to reduce the odds of discarding otherwise usable words) and then checks whether each lowercase, non-numerical word is present in the dictionary. The final step assigns a variety to the input text, using the following logic:
- for documents with no identified American or British lexemes, it returns UNK,
- if one variant has more than twice as many identified words as the other, it classifies the instance as the more frequent variant,
- else it classifies it as MIX.
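The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's actual code: the tiny AMERICAN and BRITISH word sets, the function name classify_variety, the whitespace-and-punctuation tokenisation, and the 'A' label for American are all hypothetical stand-ins (the real lexicon is derived from VarCon, and only 'B', UNK and MIX are confirmed by this page).

```python
import string

# Hypothetical toy lexicon; the real package builds its word lists from VarCon.
AMERICAN = {"color", "gray", "center", "flutist"}
BRITISH = {"colour", "grey", "centre", "flautist", "rumour", "clarinettist"}


def classify_variety(text, american=AMERICAN, british=BRITISH):
    """Return 'A', 'B', 'MIX' or 'UNK' using the counting rule described above."""
    # Rudimentary preprocessing: strip punctuation stuck to the ends of tokens.
    tokens = [w.strip(string.punctuation) for w in text.split()]
    # Keep only tokens that are purely alphabetic and already lowercase
    # (an assumed reading of "lowercase nonnumerical words").
    tokens = [w for w in tokens if w.isalpha() and w.islower()]

    a_count = sum(w in american for w in tokens)
    b_count = sum(w in british for w in tokens)

    if a_count == 0 and b_count == 0:
        return "UNK"  # no American or British lexemes found
    if a_count > 2 * b_count:
        return "A"    # assumed label for American
    if b_count > 2 * a_count:
        return "B"
    return "MIX"      # neither side has more than twice the other's count
```

Note that when one count is zero and the other is at least one, the "more than twice" test is already satisfied, so a single identified word is enough to decide the label in this sketch.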
Installation
pip install abclf
Use
import abclf
text = "The flautist heard a rumour about the gray haired clarinettist in a wollen pullover"
abclf.get_variant(text)
# 'B'
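For more than one document, a small helper can tally the returned labels. The sketch below is hypothetical (tally_variants is not part of abclf); it takes the classifier as an argument so that it only relies on a callable that maps a string to a label, which is how abclf.get_variant is used above.

```python
from collections import Counter


def tally_variants(texts, classify):
    """Count how often each label is returned for a collection of texts.

    `classify` is any callable mapping a string to a label,
    for example abclf.get_variant.
    """
    return Counter(classify(text) for text in texts)


# Intended use with the package installed (not executed here):
#   import abclf
#   counts = tally_variants(list_of_documents, abclf.get_variant)
#   counts.most_common()
```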
Authors
- Peter Rupnik
- Taja Kuzman
- Nikola Ljubešić
Copyright of the original VarCon database
Copyright 2000-2020 by Kevin Atkinson (kevina@gnu.org) and Benjamin
Titze (btitze@protonmail.ch).
Copyright 2000-2019 by Kevin Atkinson
Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Kevin Atkinson makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.
Copyright 2016 by Benjamin Titze
Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Benjamin Titze makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.
Since the original words lists come from the Ispell distribution:
Copyright 1993, Geoff Kuenning, Granada Hills, CA
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. All modifications to the source code must be clearly marked as
such. Binary redistributions based on modified source code
must be clearly marked as modified versions in the documentation
and/or other materials provided with the distribution.
(clause 4 removed with permission from Geoff Kuenning)
5. The name of Geoff Kuenning may not be used to endorse or promote
products derived from this software without specific prior
written permission.
THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.