pyCSPro

Python library for parsing CSPro dictionaries and cases.

Install

pip:

pip install pycspro

git:

git clone https://github.com/amestsantim/pycspro
cd pycspro
python setup.py install

Usage

The library is simple to use and has only two classes: DictionaryParser and CaseParser. There is also a Medium article that explains in considerable detail how it works:

Working With CSPro Data Using Python (Pandas)

DictionaryParser

This class receives the raw text of a dictionary and parses it into a Python dictionary, which we can then manipulate to accomplish various tasks.

parse()

This method accepts the contents of a dictionary file (.dcf) and returns a dictionary object, which is basically a nested Python dictionary.

import json

from pycspro import DictionaryParser

# Read the raw contents of the CSPro dictionary file (.dcf)
with open('CensusDictionary.dcf', 'r') as f:
    raw_dictionary = f.read()

dictionary_parser = DictionaryParser(raw_dictionary)
parsed_dictionary = dictionary_parser.parse()
print(json.dumps(parsed_dictionary, indent=4))
{
    "Dictionary": {
        "Name": "CEN2000",
        "Label": "Popstan Census",
        "Note": "",
        "Version": "CSPro 7.2",
        "RecordTypeStart": 1,
        "RecordTypeLen": 1,
        "Positions": "Relative",
        "ZeroFill": true,
        "DecimalChar": false,
        "Languages": [],
        "Relation": [],
        "Level": {
            "Name": "QUEST",
            "Label": "Questionnaire",
            "Note": "",
            "IdItems": [
                {
                    "Name": "PROVINCE",
                    "Label": "Province",
                    "Note": "",
                    "Len": 2,
                    "ItemType": "Item",
                    "DataType": "Numeric",
                    "Occurrences": 1,
                    "Decimal": 0,
                    "DecimalChar": false,
                    "ZeroFill": true,
                    "OccurrenceLabel": [],
                    "Start": 2,
                    "ValueSets": [
                        {
                            "Name": "PROV_VS1",
                            "Label": "Province",
                            "Note": "",
                            "Value": [
                                "1;Artesia",
                                "2;Copal",
                                "3;Dari",
                                "4;Eris",
                                "5;Girda",
                                "6;Hali",
                                "7;Kerac",
                                "8;Lacuna",
                                "9;Laya",
                                "10;Lira",
                                "11;Matanga",
                                "12;Patan",
                                "13;Rift",
                                "14;Terra",
                                "15;Tumar"
                            ]
                        }
                    ]
                },
                ...
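
Since the result is a plain nested dictionary, it can be navigated with ordinary indexing. As a minimal sketch, the snippet below turns the "code;label" strings of a value set into a lookup table. The fragment is copied (and shortened) from the output above; the helper name value_set_to_mapping is made up for this example, and it assumes every entry has the simple single-code form shown here.

```python
# Fragment copied from the parse() output above (value set shortened).
province_item = {
    "Name": "PROVINCE",
    "ValueSets": [
        {
            "Name": "PROV_VS1",
            "Label": "Province",
            "Value": ["1;Artesia", "2;Copal", "3;Dari"],
        }
    ],
}


def value_set_to_mapping(value_set):
    """Turn 'code;label' strings into a {code: label} dict.

    Assumption: each entry is one integer code, a ';', then the label.
    """
    mapping = {}
    for entry in value_set["Value"]:
        code, label = entry.split(";", 1)
        mapping[int(code)] = label
    return mapping


provinces = value_set_to_mapping(province_item["ValueSets"][0])
print(provinces)  # {1: 'Artesia', 2: 'Copal', 3: 'Dari'}
```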

get_column_labels()

This method accepts a record name and returns a dictionary where keys are the item names and the values are the item labels for all the items within the given record.

{
    'H01_TYPE': 'Type of housing',
    'H02_WALL': 'Wall type',
    'H03_ROOF': 'Roof type',
    'H04_FLOOR': 'Floor type',
    'H05_ROOMS': 'Number of rooms',
    'H06_TENURE': 'Tenure'
}

This is useful for replacing the column names in a DataFrame with their labels.

housing = dfs['HOUSING']
# rename() returns a new DataFrame, so keep the result
housing = housing.rename(columns=dictionary_parser.get_column_labels('HOUSING'))
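
To make the effect concrete without pandas, here is a small sketch that applies the same kind of name-to-label mapping to a column-oriented table (a plain dict of lists). The sample values and the helper name relabel_columns are invented for illustration; columns without a label keep their original name.

```python
# Labels in the shape returned by get_column_labels() (subset of the example above).
column_labels = {
    "H01_TYPE": "Type of housing",
    "H05_ROOMS": "Number of rooms",
}

# Hypothetical column-oriented table: column name -> list of values.
housing_table = {
    "H01_TYPE": [1, 2, 1],
    "H05_ROOMS": [3, 5, 2],
    "H07_RENT": [0, 120, 80],
}


def relabel_columns(table, labels):
    """Return a copy of table whose keys are replaced by labels where one exists."""
    return {labels.get(name, name): values for name, values in table.items()}


relabeled = relabel_columns(housing_table, column_labels)
print(list(relabeled))  # ['Type of housing', 'Number of rooms', 'H07_RENT']
```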

get_value_labels()

This method accepts a record name and returns a dictionary where the keys are the item names and the values are themselves dictionaries. Each inner dictionary's key-value pairs are the item's possible values and their respective labels.

{
    'P02_REL': {1: 'Head', 2: 'Spouse', 3: 'Child', 4: 'Parent', 5: 'Other', 6: 'Nonrelative', 9: 'Not Reported'},
    'P03_SEX': {1: 'Male', 2: 'Female'}
}

This can be used to replace values with their more meaningful labels in a DataFrame.

person = dfs['PERSON']
# replace() returns a new DataFrame, so keep the result
person = person.replace(dictionary_parser.get_value_labels('PERSON'))
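
The nested mapping is applied per column: each outer key selects a column, and the inner dictionary says which values to swap for which labels. A rough pure-Python equivalent, using invented sample data and a hypothetical helper apply_value_labels, might look like this:

```python
# Value labels in the shape returned by get_value_labels() (shortened).
value_labels = {
    "P02_REL": {1: "Head", 2: "Spouse", 3: "Child"},
    "P03_SEX": {1: "Male", 2: "Female"},
}

# Hypothetical column-oriented table: column name -> list of values.
person_table = {
    "P02_REL": [1, 2, 3, 3],
    "P03_SEX": [1, 2, 2, 1],
    "P04_AGE": [41, 38, 12, 9],
}


def apply_value_labels(table, labels):
    """Replace coded values with labels, column by column.

    Columns without a mapping, and values without a label, are left as-is.
    """
    labeled = {}
    for name, values in table.items():
        mapping = labels.get(name, {})
        labeled[name] = [mapping.get(value, value) for value in values]
    return labeled


labeled_person = apply_value_labels(person_table, value_labels)
print(labeled_person["P03_SEX"])  # ['Male', 'Female', 'Female', 'Male']
```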

CaseParser

The CaseParser class is responsible for cutting up raw CSPro cases into tables by using a parsed dictionary (the output of DictionaryParser). It produces a nested dictionary where each record is itself a dictionary. The resulting format is well suited for conversion into a pandas DataFrame using the DataFrame.from_dict method.

During instantiation, you can also pass in a cutting_mask to the CaseParser class to specify only the columns (items) you are interested in. This can be useful when there are a large number of items in a record.

cutting_mask = {
    'QUEST': ['PROVINCE', 'DISTRICT'],
    'PERSON': ['P03_SEX', 'P04_AGE', 'P11_LITERACY', 'P15_OCC'],
    'HOUSING': ['H01_TYPE', 'H05_ROOMS', 'H07_RENT', 'H08_TOILET', 'H13_PERSONS']
}
case_parser = CaseParser(parsed_dictionary, cutting_mask)
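
To illustrate what "cutting" with a mask means, here is a minimal sketch of slicing a fixed-width line using the Name, Start and Len fields that appear in the parsed dictionary. It is not the library's implementation: the item layout, the sample line and the function cut_line are all hypothetical, and Start is assumed to be 1-based.

```python
# Hypothetical item layout, using field names seen in the parsed dictionary.
items = [
    {"Name": "PROVINCE", "Start": 2, "Len": 2},
    {"Name": "DISTRICT", "Start": 4, "Len": 2},
    {"Name": "VILLAGE", "Start": 6, "Len": 3},
]


def cut_line(line, items, wanted=None):
    """Slice one fixed-width line into {item name: raw text}.

    If wanted is given, only the items named in it are cut (the mask).
    """
    row = {}
    for item in items:
        if wanted is not None and item["Name"] not in wanted:
            continue
        begin = item["Start"] - 1  # assumed 1-based column position
        row[item["Name"]] = line[begin:begin + item["Len"]]
    return row


# Made-up line: record type "1", province "03", district "12", village "045".
line = "10312045"
print(cut_line(line, items, wanted=["PROVINCE", "DISTRICT"]))
# {'PROVINCE': '03', 'DISTRICT': '12'}
```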

parse()

The parse method receives a list of cases and returns a nested dictionary of records.

import pandas as pd
from pycspro import CaseParser

case_parser = CaseParser(parsed_dictionary)
parsed_cases = case_parser.parse(cases)  # where cases is a list of CSPro cases

# parsed_cases is a Python dictionary where the keys are the record names
# and the values are dictionaries mapping column names to lists of column values
dfs = {}
for table_name, table in parsed_cases.items():
    dfs[table_name] = pd.DataFrame.from_dict(table)

Live Demo

There is a Jupyter Notebook on Binder (great project!) that you can play with in a live environment, right in your browser, to see how easy it is to use this library. Please take it for a spin!

Syntax Checking

This library uses a finite state machine to check the syntax of dictionaries. However, in the current version, some simplifying assumptions were made. These are:

  • Dictionaries are assumed to have only a single Level
  • SubItems are not considered and will cause an error if present
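
For readers unfamiliar with the technique, the sketch below shows the general idea of a finite state machine that checks the order of section headers in an INI-style text. The transition table is illustrative only and is not the one pycspro uses; the function check_sections and the sample text are likewise made up for this example.

```python
# Illustrative transitions: current state -> section headers allowed next.
TRANSITIONS = {
    "START": {"Dictionary"},
    "Dictionary": {"Level"},
    "Level": {"IdItems"},
    "IdItems": {"Item"},
    "Item": {"Item", "ValueSet", "Record"},
    "ValueSet": {"ValueSet", "Item", "Record"},
    "Record": {"Item"},
}


def check_sections(text):
    """Walk the section headers and raise SyntaxError on an illegal transition."""
    state = "START"
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not (line.startswith("[") and line.endswith("]")):
            continue  # Key=Value lines are ignored in this sketch
        section = line[1:-1]
        if section not in TRANSITIONS[state]:
            raise SyntaxError(
                f"line {line_number}: unexpected [{section}] after {state}"
            )
        state = section
    return state


sample = "\n".join([
    "[Dictionary]", "Name=CEN2000",
    "[Level]", "Name=QUEST",
    "[IdItems]",
    "[Item]", "Name=PROVINCE",
    "[ValueSet]", "Name=PROV_VS1",
    "[Record]", "Name=HOUSING",
    "[Item]", "Name=H01_TYPE",
])
print(check_sections(sample))  # Item
```

With this table, a second [Level] section is rejected because no state lists "Level" as a valid successor except "Dictionary", which mirrors the single-Level assumption above.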

Performance

When reading cases directly from CSWeb's MySQL database, pass about 50,000 cases at a time to the CaseParser and convert the result into a pandas DataFrame. On the next iteration, pass in another 50,000 cases, convert the result to a DataFrame, and append it to the previous DataFrame. That way, you can grow your DataFrame to a large size without consuming a lot of memory.
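
One way to organize this batching is a small generator that yields fixed-size chunks from any iterable of cases. The helper name batches is hypothetical, and the commented usage at the end is only a sketch: it assumes a case_parser built as shown earlier, a record named 'PERSON', and uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
from itertools import islice


def batches(iterable, size=50_000):
    """Yield successive lists of at most `size` elements from iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Small demonstration with a tiny batch size.
print(list(batches(range(5), size=2)))  # [[0, 1], [2, 3], [4]]

# Sketch of the intended use (not run here; names are assumptions):
# frames = []
# for chunk in batches(cases):
#     parsed = case_parser.parse(chunk)
#     frames.append(pd.DataFrame.from_dict(parsed['PERSON']))
# person = pd.concat(frames, ignore_index=True)
```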

To Dos

Some edge cases, such as SPECIAL values, might not have been handled. If you run into such edge cases, please submit an issue (or, even better, a pull request) and hopefully we will have them ironed out soon enough!

License

MIT
