
Makes file operations, data cleansing and word-list creation easier for students of literature

Project description

Purpose: This module helps Python beginners, especially instructors and students of foreign languages and literature, work easily with txt, xlsx, json, tsv and docx files, including data cleansing and word-list creation.

Function 1: Read data directly from files and save processing results back to files. Either task takes a single line of code.
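
The names of the PgsFile functions for these tasks are not listed in this description, so the following is only a minimal standard-library sketch of what "one line per task" can look like for txt and json files. The helper names read_txt, write_txt, read_json and write_json are hypothetical and are not PgsFile's API.

```python
import json
from pathlib import Path


def read_txt(path, encoding="utf-8"):
    """Return the whole text file as one string (hypothetical helper)."""
    return Path(path).read_text(encoding=encoding)


def write_txt(path, text, encoding="utf-8"):
    """Save a string to a text file (hypothetical helper)."""
    Path(path).write_text(text, encoding=encoding)


def read_json(path, encoding="utf-8"):
    """Load a JSON file into Python objects (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding=encoding))


def write_json(path, data, encoding="utf-8"):
    """Save Python objects as readable JSON, keeping non-ASCII text as-is."""
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding
    )
```

With helpers like these, reading is one line (text = read_txt("notes.txt")) and so is saving (write_json("result.json", data)); the file names here are placeholders.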

Function 2: The "FilePath" and "FileName" functions imported from the module collect the absolute paths of all files in any folder on your PC (including the files in all of its sub-folders) and return the file names. This also takes one line of code.
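
The signatures of "FilePath" and "FileName" are not given here, so the sketch below does not call them. It shows, under that assumption, how the same idea can be expressed with the standard library's os.walk; list_file_paths and file_name are hypothetical names used only for illustration.

```python
import os


def list_file_paths(folder):
    """Return sorted absolute paths of every file under folder,
    including files in all of its sub-folders."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)


def file_name(path, with_extension=False):
    """Return the file name of a path, without its extension by default."""
    base = os.path.basename(path)
    return base if with_extension else os.path.splitext(base)[0]
```

A call such as names = [file_name(p) for p in list_file_paths("my_corpus")] would then gather every file name in one line; "my_corpus" is a placeholder folder.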

Function 3: The "word_list" and "batch_word_list" functions in PgsFile build a word list for a single file or a batch of files, with words sorted by frequency in descending order.
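
How "word_list" and "batch_word_list" tokenize and count is not documented in this description. As a rough illustration of the idea only, a frequency-sorted word list can be built with collections.Counter. The function names below are hypothetical, and the simple regular expression is an assumption that suits English text; Chinese text would first need word segmentation.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with one apostrophe part.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Lower-case the text and return its words in order of appearance."""
    return WORD_PATTERN.findall(text.lower())


def make_word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def make_batch_word_list(texts):
    """Merge the counts of several texts into one frequency-sorted list."""
    total = Counter()
    for text in texts:
        total.update(tokenize(text))
    return total.most_common()
```

Words with equal frequency keep the order in which they were first seen, because Counter.most_common() sorts stably by count.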

Function 4: The library ships with Pgs-Corpora, which includes a monolingual corpus of native and translational Chinese and of native and non-native English, and a bi-directional parallel corpus of Chinese and English texts covering financial, legal, political and academic topics and sports news. In addition, 8774 English idioms, stopwords for 28 languages and a termbank of Chinese thought and culture are available.

Function 5: The library also supports common text-cleaning tasks, such as removing empty texts, empty lines and folders containing empty texts, converting between full-width and half-width characters, and unifying the format of Chinese and English punctuation.
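
The cleaning functions themselves are not named in this description. The sketch below illustrates two of the tasks with plain Python: dropping empty lines, and converting between full-width and half-width characters. It relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above ASCII U+0021 to U+007E, and the ideographic space U+3000 corresponds to the ASCII space. All function names are hypothetical.

```python
FULL_TO_HALF_OFFSET = 0xFEE0  # distance between full-width forms and ASCII


def remove_empty_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def to_half_width(text):
    """Convert full-width letters, digits and punctuation to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width ASCII variants
            out.append(chr(code - FULL_TO_HALF_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def to_full_width(text):
    """Convert printable ASCII characters to their full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:                  # ASCII space
            out.append("\u3000")
        elif 0x21 <= code <= 0x7E:        # printable ASCII
            out.append(chr(code + FULL_TO_HALF_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

Note that to_half_width also turns full-width punctuation such as the Chinese comma into its ASCII counterpart, which may not be wanted for Chinese running text; characters outside the range, such as the ideographic full stop U+3002, are left unchanged.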

Table 1: The directory and size of Pgs-Corpora

├── Idioms (1, 171.78 KB)
├── Monolingual (2197, 63.65 MB)
│   ├── Chinese (456, 15.27 MB)
│   │   ├── People's Daily 20130605 (396, 1.38 MB)
│   │   │   ├── Raw (132, 261.73 KB)
│   │   │   ├── Seg_only (132, 471.47 KB)
│   │   │   └── Tagged (132, 675.30 KB)
│   │   └── Translational Fictions (60, 13.89 MB)
│   └── English (1741, 48.38 MB)
│       ├── Native (65, 44.14 MB)
│       │   ├── A Short Collection of British Fiction (27, 33.90 MB)
│       │   └── Preschoolers- and Teenagers-oriented Texts in English (36, 10.24 MB)
│       ├── Non-native (1675, 3.63 MB)
│       │   └── Shanghai Daily (1675, 3.63 MB)
│       │       └── Business_2019 (1675, 3.63 MB)
│       │           ├── 2019-01-01 (1, 3.35 KB)
│       │           ├── 2019-01-02 (1, 3.65 KB)
│       │           ├── 2019-01-03 (7, 10.90 KB)
│       │           ├── 2019-01-04 (5, 9.63 KB)
│       │           ├── 2019-01-07 (4, 9.50 KB)
│       │           └── ... (and 245 more directories)
│       └── Translational (1, 622.57 KB)
├── Parallel (371, 24.67 MB)
│   ├── HK Financial and Legal EC Parallel Corpora (5, 19.17 MB)
│   ├── New Year Address_CE_2006-2021 (15, 147.49 KB)
│   ├── Sports News_CE_2010 (20, 66.42 KB)
│   ├── TED_EC_2017-2020 (330, 5.24 MB)
│   └── Xi's Speech_CE_2021 (1, 53.01 KB)
├── Stopwords (28, 88.09 KB)
└── Terminology (1, 2.20 MB)

...

Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University

E-mail: 895284504@qq.com


Download files


Source Distributions

No source distribution files available for this release.

Built Distribution


PgsFile-0.1.2-py3-none-any.whl (46.2 MB)

Uploaded Python 3

File details

Details for the file PgsFile-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: PgsFile-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 46.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for PgsFile-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 e7d08ab1e4c41b12838e9ed1b0c99631abc51fcbab4984474c17e7b12d8d8c2d
MD5 bce7847d0366f6b57c12fcad4788cb83
BLAKE2b-256 f7ad7c1bd030b4ac1dc575795c5ccd466d620a7138360ed1861fcf05938d0ed7
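
A downloaded wheel can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard hashlib module; the helper names are hypothetical, and the path passed in is assumed to be wherever the file was saved.

```python
import hashlib

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "e7d08ab1e4c41b12838e9ed1b0c99631abc51fcbab4984474c17e7b12d8d8c2d"


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large wheel need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_expected("PgsFile-0.1.2-py3-none-any.whl") should return True for an intact download placed in the current folder.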

