Classify American vs. British English

Project description

American-British-variety-classifier

A minimalistic spelling- and vocabulary-based American-vs-British variety classifier.

Description

This classifier is based on the VarCon database. The database was first used in its entirety and then progressively pruned to improve performance.

The classifier performs rudimentary preprocessing (it deletes some stray characters so that words are less likely to be discarded) and then checks whether each lowercase, non-numeric word is present in the dictionary. The final step assigns a variety to the input text using the following logic:

  • if no American or British lexemes are identified in the document, it returns UNK,
  • if one variety has more than twice as many identified words as the other, it returns the more frequent variety,
  • otherwise it returns MIX.
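
The logic above can be sketched in a few lines. This is only an illustration, not the package's actual code: the tiny `LEXICON`, the regular expression used for cleaning, the function names and the `'A'` label for American are all assumptions made for the example.

```python
import re

# Hypothetical miniature lexicon; the real package derives its entries from VarCon.
LEXICON = {
    "color": "A", "colour": "B",
    "gray": "A", "grey": "B",
    "rumor": "A", "rumour": "B",
    "flutist": "A", "flautist": "B",
}


def count_variants(text, lexicon=LEXICON):
    """Count lexicon hits per variety among lowercase, non-numeric tokens."""
    counts = {"A": 0, "B": 0}
    # Assumed preprocessing: replace anything that is not a word character,
    # whitespace or an apostrophe with a space.
    cleaned = re.sub(r"[^\w\s']", " ", text)
    for token in cleaned.split():
        has_digit = any(ch.isdigit() for ch in token)
        if token.islower() and not has_digit and token in lexicon:
            counts[lexicon[token]] += 1
    return counts["A"], counts["B"]


def decide(n_american, n_british):
    """Apply the three rules: UNK, clear majority (more than 2x), or MIX."""
    if n_american == 0 and n_british == 0:
        return "UNK"
    if n_american > 2 * n_british:
        return "A"
    if n_british > 2 * n_american:
        return "B"
    return "MIX"


def classify(text):
    return decide(*count_variants(text))
```

Note that the threshold is strict: two British hits against one American hit is not "more than twice as many", so such a text would come out as MIX under this reading.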

Installation

pip install abclf

Use

import abclf

text = "The flautist heard a rumour about the gray haired clarinettist in a wollen pullover"
abclf.get_variant(text)
# 'B'
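
For a collection of documents, it can be useful to tally the returned labels. The helper below is a hypothetical convenience function, not part of abclf. It accepts any classifier callable, so `abclf.get_variant` can be passed in directly. It assumes only that the classifier returns one hashable label per text.

```python
from collections import Counter


def label_counts(texts, classify):
    """Apply `classify` to each text and tally the labels it returns."""
    return Counter(classify(text) for text in texts)


# Intended use (requires abclf to be installed):
#   import abclf
#   label_counts(documents, abclf.get_variant)
```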

Authors

  • Peter Rupnik
  • Taja Kuzman
  • Nikola Ljubešić

Copyright of the original VarCon database

Copyright 2000-2020 by Kevin Atkinson (kevina@gnu.org) and Benjamin
Titze (btitze@protonmail.ch).

Copyright 2000-2019 by Kevin Atkinson

Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Kevin Atkinson makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.

Copyright 2016 by Benjamin Titze

Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Benjamin Titze makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.

Since the original words lists come from the Ispell distribution:

Copyright 1993, Geoff Kuenning, Granada Hills, CA
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. All modifications to the source code must be clearly marked as
   such.  Binary redistributions based on modified source code
   must be clearly marked as modified versions in the documentation
   and/or other materials provided with the distribution.
(clause 4 removed with permission from Geoff Kuenning)
5. The name of Geoff Kuenning may not be used to endorse or promote
   products derived from this software without specific prior
   written permission.

THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED.  IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.

Download files

Source Distribution

abclf-1.0.1.tar.gz (56.3 kB)

Uploaded Source

Built Distribution

abclf-1.0.1-py3-none-any.whl (54.4 kB)

Uploaded Python 3

File details

Details for the file abclf-1.0.1.tar.gz.

File metadata

  • Download URL: abclf-1.0.1.tar.gz
  • Size: 56.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for abclf-1.0.1.tar.gz

Algorithm    Hash digest
SHA256       fd520ca76b3aaff8ca6bcc99c91fdb3d04e84ea4adf763ff6e58037aa3da9078
MD5          e624da34d921d0906da334d79f3ff5d1
BLAKE2b-256  849d320748f8c4a03df3e25201de7ab4ff74ae0f9a57b3a0428846fb975fc431
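
A downloaded archive can be checked against the SHA256 digest listed for abclf-1.0.1.tar.gz. The following is a minimal sketch using only the standard library. The function names and the chunk size are choices made for this example, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed for abclf-1.0.1.tar.gz.
EXPECTED_SHA256 = "fd520ca76b3aaff8ca6bcc99c91fdb3d04e84ea4adf763ff6e58037aa3da9078"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected one, ignoring case."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_published_hash("abclf-1.0.1.tar.gz")` on the downloaded file should return `True` if the file is intact.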

File details

Details for the file abclf-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: abclf-1.0.1-py3-none-any.whl
  • Size: 54.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for abclf-1.0.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       063b050f1b000e5e28a02b57bc7dd22a4f056c36832080e0f8a79697a4e3c643
MD5          aee5f033293ff95bc3753a148bc0ab83
BLAKE2b-256  5f5dd517ff20eb2b3b2b8951fef2fb9d045475baabc554dea62e81f47ad4f6f7
