Extracting Perfects (and related forms) from parallel corpora

PerfectExtractor

This command-line application allows for extraction of Perfects (and related forms, like the Recent Past construction in French and Spanish) from part-of-speech-tagged, lemmatized and sentence-aligned parallel corpora encoded in XML.

Recognizing Perfects

In English, a present perfect is easily recognizable as a present form of to have plus a past participle, as in (1):

(1) I have seen that movie twenty times.

However, one difficulty in finding Perfects in most languages is that there might be words between the auxiliary and the past participle, as in (2):

(2) Nobody has ever climbed that mountain.

Furthermore, many languages have passive forms, which generally require the past participle of to be to be inserted between the auxiliary and the main participle, as in (3):

(3) The bill has been paid by John.

In English, there is the additional issue of the present perfect continuous, which shares the first part of its form with the present perfect, as in (4):

(4) He has been waiting here for two hours.

In some languages (e.g. French, German, and Dutch), the Perfect can be formed with both Have and Be. The past participle governs which auxiliary verb is used, as (5) and (6) show.

(5) J'ai vu quelque chose [lit. I have seen some thing]
(6) Elle est arrivée [lit. She is arrived]

For French, the verbs that select Be form a closed list (often memorized with the mnemonic DR and MRS P. VANDERTRAMP), but in other languages they may form a more open class.

The last common issue with finding Perfects is that in e.g. Dutch and German, the past participle might appear before the auxiliary verb in subordinate clauses. (7) is an example:

(7) Dat is de stad waar hij gewoond heeft. [lit. That is the city where he lived has]

The extraction script provided here takes care of all these issues, and can have language-specific settings.

Implementation

The extraction script (apps/extractor/perfectextractor.py) is implemented using the lxml XML toolkit.

The script looks for auxiliary verbs (using an XPath expression). For each auxiliary, it tries to find a past participle to its right in the sentence (or to its left in Dutch and German), allowing for words between the two verbs. The lookup stops when it encounters another verb, punctuation, or a coordinating conjunction.
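The lookup described above can be sketched as follows. This is a minimal illustration, not the code in perfectextractor.py: it uses the standard library's xml.etree.ElementTree instead of lxml to stay dependency-free, and the XML layout, the attribute names (pos, lemma, tense, form), the tag values, and the function name find_perfects are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence markup; real corpora use their own tagsets and layout.
SENTENCE = """<s id="s1">
  <w pos="PRON" lemma="nobody">Nobody</w>
  <w pos="AUX" lemma="have" tense="pres">has</w>
  <w pos="ADV" lemma="ever">ever</w>
  <w pos="VERB" lemma="climb" form="ppart">climbed</w>
  <w pos="DET" lemma="that">that</w>
  <w pos="NOUN" lemma="mountain">mountain</w>
  <w pos="PUNCT" lemma=".">.</w>
</s>"""

# Assumed tags at which the lookup gives up: verbs, punctuation, coordinators.
STOP_POS = {"VERB", "AUX", "PUNCT", "CCONJ"}


def find_perfects(sentence, aux_lemmas=("have",), search_left=False):
    """Pair each present-tense auxiliary with the nearest past participle."""
    words = list(sentence.iter("w"))
    perfects = []
    # Locate the auxiliaries with a (limited) XPath expression.
    for aux in sentence.findall(".//w[@pos='AUX']"):
        if aux.get("lemma") not in aux_lemmas or aux.get("tense") != "pres":
            continue
        position = words.index(aux)
        # Walk away from the auxiliary: rightwards by default, leftwards on request.
        if search_left:
            candidates = reversed(words[:position])
        else:
            candidates = words[position + 1:]
        for candidate in candidates:
            # Check for the participle first, since it is itself a verb.
            if candidate.get("form") == "ppart":
                perfects.append((aux.text, candidate.text))
                break
            if candidate.get("pos") in STOP_POS:
                break
    return perfects


print(find_perfects(ET.fromstring(SENTENCE)))
```

Passing search_left=True with aux_lemmas=("hebben",) would cover a Dutch subordinate clause such as (7), where the participle precedes the auxiliary.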

The script also allows for extraction of present perfect continuous forms.
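To make the distinction between examples (1), (3) and (4) concrete, here is a small sketch of how the three forms could be told apart once the auxiliary has been found. The Penn Treebank-style tags (VBN for past participles, VBG for -ing forms), the triple layout, and the function name classify_perfect are assumptions for illustration; the actual script may organize this differently.

```python
def classify_perfect(tagged):
    """Classify a span that starts at a present form of 'have'.

    `tagged` is a list of (token, lemma, tag) triples; the first triple is
    the auxiliary itself.
    """
    # Keep only the participles that follow the auxiliary, in order.
    participles = [(lemma, tag) for _, lemma, tag in tagged[1:]
                   if tag in ("VBN", "VBG")]
    # A Perfect needs a past participle as its first participle.
    if not participles or participles[0][1] != "VBN":
        return None
    # 'been' followed by another participle signals a continuous or passive.
    if participles[0][0] == "be" and len(participles) > 1:
        if participles[1][1] == "VBG":
            return "present perfect continuous"
        return "present perfect (passive)"
    return "present perfect"


print(classify_perfect([("has", "have", "VBZ"), ("been", "be", "VBN"),
                        ("waiting", "wait", "VBG")]))
```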

For languages in which the Perfect can be formed with both Have and Be, the script relies on a list of verbs that take Be as their auxiliary. The function get_ergative_verbs in extractor/wiktionary.py extracts these verbs from Wiktionary for Dutch, using the Requests package. For German, the list is compiled from an external list.
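A sketch of how such a verb list can be used to validate a candidate: a form of Be only counts as a Perfect auxiliary when the participle's lemma is on the list (otherwise the construction is more likely a passive). The dictionaries below are tiny illustrative excerpts, and the names BE_VERBS, AUXILIARIES and is_valid_perfect are hypothetical, not taken from the project.

```python
# Illustrative excerpts only; the real lists are much longer.
BE_VERBS = {
    "fr": {"arriver", "venir", "partir"},
    "nl": {"komen", "gaan", "blijven"},
}

# Lemmas of the Have and Be auxiliaries per language.
AUXILIARIES = {
    "fr": {"have": "avoir", "be": "être"},
    "nl": {"have": "hebben", "be": "zijn"},
}


def is_valid_perfect(language, aux_lemma, participle_lemma):
    """Accept Be + participle only if the participle's verb selects Be."""
    aux = AUXILIARIES[language]
    if aux_lemma == aux["be"]:
        return participle_lemma in BE_VERBS[language]
    return aux_lemma == aux["have"]


print(is_valid_perfect("fr", "être", "arriver"))  # Elle est arrivée
print(is_valid_perfect("fr", "être", "voir"))     # e.g. a passive, not a Perfect
```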

Recognizing Recent Pasts

Most Romance languages share a grammaticalized construction to refer to events in the recent past, e.g. the passé récent in French and the pasado reciente in Spanish. In English, typically a present perfect alongside the adverb just is used to convey this meaning, commonly referred to as perfect of recent past (Comrie 1985) or hot news perfect (McCawley 1971).

The French passé récent is formed with a present tense of venir 'come' followed by the particle de and an infinitive, as in (8) below.

(8) Je viens de voir Marie. [lit. I come DE see Mary] 

The Spanish pasado reciente is (quite similarly) formed with a present tense of acabar 'finish' followed by the particle de and an infinitive, as in (9) below.

(9) Acabo de ver a María. [lit. I finish DE see Mary]

The extraction script (apps/extractor/recentpastextractor.py) provided here allows export of these constructions from parallel corpora.
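As a rough illustration of the pattern in (8) and (9), the sketch below matches a present-tense form of the language's verb, directly followed by the particle and an infinitive. It is not the code in recentpastextractor.py: the token dictionaries, the attribute names (tense, form), the rule table RECENT_PAST, and the strict-adjacency requirement are all assumptions.

```python
# Hypothetical per-language rules for the Recent Past construction.
RECENT_PAST = {
    "fr": {"verb": "venir", "particles": {"de", "d'"}},
    "es": {"verb": "acabar", "particles": {"de"}},
}


def find_recent_pasts(tokens, language):
    """Return 'verb particle infinitive' strings found in a tagged sentence.

    `tokens` is a list of dicts with 'word' and 'lemma' keys and optional
    'tense' and 'form' keys.
    """
    rule = RECENT_PAST[language]
    hits = []
    # Slide a window of three adjacent tokens over the sentence.
    for i in range(len(tokens) - 2):
        verb, particle, infinitive = tokens[i:i + 3]
        if (verb["lemma"] == rule["verb"]
                and verb.get("tense") == "pres"
                and particle["word"].lower() in rule["particles"]
                and infinitive.get("form") == "inf"):
            hits.append(" ".join(t["word"] for t in (verb, particle, infinitive)))
    return hits


french = [
    {"word": "Je", "lemma": "je"},
    {"word": "viens", "lemma": "venir", "tense": "pres"},
    {"word": "de", "lemma": "de"},
    {"word": "voir", "lemma": "voir", "form": "inf"},
    {"word": "Marie", "lemma": "Marie"},
]
print(find_recent_pasts(french, "fr"))
```

Requiring an infinitive after the particle keeps plain motion uses such as "Je viens de Paris" out of the results.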

Other extractors

This application also allows extraction from parallel corpora based on part-of-speech tags or regexes.
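The idea behind such extractors can be sketched as a simple filter over tagged tokens. The function name extract_by_pattern, its parameters, and the tag values are hypothetical and only meant to illustrate the two selection criteria; they do not mirror the project's actual options.

```python
import re


def extract_by_pattern(tokens, word_regex=None, pos_tags=None):
    """Select words by part-of-speech tag, by regular expression, or both.

    `tokens` is a list of (word, pos) pairs.
    """
    pattern = re.compile(word_regex, re.IGNORECASE) if word_regex else None
    hits = []
    for word, pos in tokens:
        # Skip tokens whose tag is not requested.
        if pos_tags is not None and pos not in pos_tags:
            continue
        # Skip tokens whose full form does not match the regex.
        if pattern is not None and not pattern.fullmatch(word):
            continue
        hits.append(word)
    return hits


tokens = [("He", "PRON"), ("has", "AUX"), ("just", "ADV"),
          ("arrived", "VERB"), (".", "PUNCT")]
print(extract_by_pattern(tokens, pos_tags={"ADV"}))
print(extract_by_pattern(tokens, word_regex=r".*ed"))
```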

Corpora

Dutch Parallel Corpus

The extraction was first tested with the Dutch Parallel Corpus. This corpus, which uses the TEI format, covers three languages: Dutch, French and English. The configuration for this corpus can be found in corpora/dpc/base.cfg and corpora/dpc/perfect.cfg. Example documents from this corpus are included in the tests/data/dpc directory. The data for this corpus is closed source: to retrieve the corpus, you'll have to contact the authors via the cited website. After you've obtained the data, you can run the extraction script with:

python extract.py <folder> en fr nl --corpus=dpc --extractor=perfect

OPUS Corpora

The extraction has also been implemented for the open parallel corpus OPUS, which most notably contains the Europarl Corpus and the OpenSubtitles Corpus. This corpus, which uses the XCES format for alignment, covers a wide variety of languages. The configuration for this corpus can be found in corpora/opus/base.cfg and corpora/opus/perfect.cfg; implementations exist for Dutch, English, French, German and Spanish. Example documents from this corpus are included in the tests/data/europarl directory. The data for this corpus is open source: you can download the corpus and the alignment files from the cited website. After you've obtained the data, you can run the extraction script with:

python extract.py <folder> en de es --corpus=opus --extractor=perfect

BNC Corpus

The extraction has also been implemented for the monolingual BNC Corpus. The data for this corpus is open source: you can download the corpus from the cited website. After you've obtained the data, you can run the extraction script with:

python extract.py <folder> en --corpus=bnc --extractor=perfect

Implementing your own corpus

If you want to implement the extraction for another corpus, you'll have to create:

  • An implementation of the corpus in the corpora directory (see corpora/opus for an example).
  • A configuration file in this directory (see corpora/opus/base.cfg for an example).
  • An entry in the main script (see extract.py).

Other options to the extraction script

You can view all options of the extraction script by typing:

python extract.py --help

Do note that at this point in time, not all options are available in all corpora. Feel free to send a pull request once you have implemented an option, or to request one by creating an issue.

Other scripts

These scripts can be found in perfectextractor/scripts.

pick_alignments

This script filters the alignment file based on (for example) a file prefix. This is helpful for large alignment files, such as those of the Europarl corpus. Example usage:

python pick_alignments.py 
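To illustrate what prefix-based filtering of an XCES alignment file amounts to, here is a minimal sketch. It is not the implementation of pick_alignments.py: the function name filter_link_groups is hypothetical, the sample document names are made up, and the sketch uses the standard library's xml.etree.ElementTree on an in-memory string rather than a large file on disk.

```python
import xml.etree.ElementTree as ET

# A made-up XCES alignment fragment with two aligned document pairs.
ALIGNMENT = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/ep-00-01-17.xml.gz" toDoc="nl/ep-00-01-17.xml.gz">
    <link xtargets="1;1" />
  </linkGrp>
  <linkGrp targType="s" fromDoc="en/ep-01-02-12.xml.gz" toDoc="nl/ep-01-02-12.xml.gz">
    <link xtargets="1 2;1" />
  </linkGrp>
</cesAlign>"""


def filter_link_groups(xml_text, prefix):
    """Keep only the linkGrp elements whose fromDoc starts with `prefix`."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the list, since we remove elements while looping.
    for link_group in list(root.findall("linkGrp")):
        if not link_group.get("fromDoc", "").startswith(prefix):
            root.remove(link_group)
    return root


filtered = filter_link_groups(ALIGNMENT, "en/ep-00")
print([group.get("fromDoc") for group in filtered.findall("linkGrp")])
```

For alignment files that do not fit comfortably in memory, an incremental parser (such as iterparse) would likely be a better fit than loading the whole tree.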

merge_results

This script merges results from various files. Example usage:

python merge_results.py 

splitter

This script splits a big corpus into subparts and then runs the extractors. Example usage:

python splitter.py 

Tests

The unit tests can be run using:

python -m unittest discover -b

A coverage report can be generated (after installing coverage.py) using:

coverage run --source . -m unittest discover -b
coverage html

Citing

If you happen to have used (parts of) this project for your research, please refer to this paper:

van der Klis, M., Le Bruyn, B., de Swart, H. (2017). Mapping the Perfect via Translation Mining. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers 2017, 497-502.
