Skip to main content

A small package for data processing

Project description

Data processing Package

  • Requirements

    • pandas
    • pyreadstat
    • numpy
    • zipfile
    • fastapi[UploadFile]
  • Step 1: import classes

    # Convert data to pandas dataframe
    from dpkits.ap_data_converter import APDataConverter
    
    # Calculate LSM score
    from dpkits.calculate_lsm import LSMCalculation
    
    # Transpose data to stack and untack
    from dpkits.data_transpose import DataTranspose
    
    # Create the tables from converted dataframe 
    from dpkits.table_generator import DataTableGenerator
    
    # Format data tables 
    from dpkits.table_formater import TableFormatter
    
  • Step 2: Convert data files to dataframe

    • class APDataConverter(files=None, file_name='', is_qme=True)
      • input 1 of files or file_name
      • files: list[UploadFile] default = None
      • file_name: str default = ''
      • is_qme: bool default = True
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      # Define input/output files name
      str_file_name = 'APDataTest'
      str_tbl_file_name = f'{str_file_name}_Topline.xlsx'
      
      converter = APDataConverter(file_name='APDataTesting.xlsx')
      
      df_data, df_info = converter.convert_df_mc() 
      
      # Use 'converter.convert_df_md()' if you need md data
      
  • Step 3: Calculate LSM classificate (only for LSM projects)

    • class LSMCalculation.cal_lsm_6(df_data, df_info)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      df_data, df_info = LSMCalculation.cal_lsm_6(df_data, df_info)
      
      # df_data, df_info will contains the columns CC1_Score to CC6_Score & LSM_Score
      
  • Step 4: Data cleaning (if needed)

    # Use pandas's functions to clean/process data
    
    df_data['Gender_new'] = df_data['Gender']
    
    df_data.replace({
        'Q1_SP1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
        'Q1_SP2': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
    }, inplace=True)
    
    df_data.loc[(df_data['Gender_new'] == 2) & (df_data['Age'] == 5),  ['Gender_new']] = [np.nan]
    df_info.loc[df_info['var_name'] == 'Q1_SP1', ['val_lbl']] = [{'1': 'a', '2': 'b', '3': 'c', '4': 'd', '5': 'e'}]
    
    df_info = pd.concat([df_info, pd.DataFrame(
        columns=['var_name', 'var_lbl', 'var_type', 'val_lbl'],
        data=[
            ['Gender_new', 'Please indicate your gender', 'SA', {'1': 'aaa', '2': 'bb', '3': 'cc'}]
        ]
    )], ignore_index=True)
    
  • Step 5: Transpose data (if needed)

    • class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • dict_stack_structure: dict
      • Returns:
        • df_data_stack: pandas.Dataframe
        • df_info_stack: pandas.Dataframe
      dict_stack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_scr': ['Gender', 'Age', 'City', 'HHI'],
          'dict_sp': {
              1: {
                  'Ma_SP1': 'Ma_SP',
                  'Q1_SP1': 'Q1',
                  'Q2_SP1': 'Q2',
                  'Q3_SP1': 'Q3',
              },
              2: {
                  'Ma_SP2': 'Ma_SP',
                  'Q1_SP2': 'Q1',
                  'Q2_SP2': 'Q2',
                  'Q3_SP2': 'Q3',
              },
          },
          'lst_fc': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_stack, df_info_stack = DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      
    • class DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      • df_data_stack: pandas.Dataframe which transpose from stack
      • df_info_stack: pandas.Dataframe which transpose from stack
      • dict_unstack_structure: dict
      • Returns:
        • df_data_unstack: pandas.Dataframe
        • df_info_unstack: pandas.Dataframe
      dict_unstack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_col_part_head': ['Gender', 'Age', 'City', 'HHI'],
          'lst_col_part_body': ['Q1', 'Q2', 'Q3'],
          'lst_col_part_tail': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_unstack, df_info_unstack = DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      
  • Step 6: OE Running

    
    
  • Step 7: Export *.sav & *.xlsx

    • class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
      • df_data: pandas.Dataframe
        • dict_dfs: dict
        • is_md: bool default False
        • is_export_sav: bool default True
        • is_export_xlsx: bool default True
        • is_zip: bool default True
        • Returns: NONE
        dict_dfs = {
            1: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByCode',
                'sheet_name': 'ByCode',
                'is_recode_to_lbl': False,
            },
            2: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByLabel',
                'sheet_name': 'ByLabel',
                'is_recode_to_lbl': True,
            },
            3: {
                'data': df_data_stack,
                'info': df_info_stack,
                'tail_name': 'Stack',
                'sheet_name': 'Stack',
                'is_recode_to_lbl': False,
            },
            4: {
                'data': df_data_unstack,
                'info': df_info_unstack,
                'tail_name': 'Unstack',
                'sheet_name': 'Unstack',
                'is_recode_to_lbl': False,
            },
        }
        
        converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
        
  • Step 8: Export data tables

    • init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • xlsx_name: str
      • Returns: NONE
    • class DataTableGenerator.run_tables_by_js_files(lst_func_to_run)
      • lst_func_to_run: list
      • Returns: NONE
    • init TableFormatter(xlsx_name=str_tbl_file_name)
      • xlsx_name: str
      • Returns: NONE
    • class TableFormatter.format_sig_table()
      • Returns: NONE
    lst_side_qres = [
        {"qre_name": "CC1", "sort": "des"},
        {"qre_name": "$CC3", "sort": "asc"},
        {"qre_name": "$CC4", "sort": "des"},
        {"qre_name": "$CC6"},
        {"qre_name": "$CC10"},
        {"qre_name": "LSM"},
        {"qre_name": "Gender"},
        {"qre_name": "Age"},
        {"qre_name": "City"},
        {"qre_name": "HHI"},
        
        # MA Question with net/combine (can apply to SA questions)
        {"qre_name": "$Q15", "cats": {
            'net_code': {
                '900001|combine|Group 1 + 2': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
                '900002|net|Group 1': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                },
                '900003|net|Group 2': {
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
            },
            '8': 'Other (specify)',
            '9': 'No problem',
        }},
    
        # Scale question with full properties
        {
            "qre_name": "Perception",
            "cats": {
                '1': 'Totally disagree', '2': 'Disagree', '3': 'Neutral', '4': 'Agree', '5': 'Totally agree',
                'net_code': {
                    '900001|combine|B2B': {'1': 'Totally disagree', '2': 'Disagree'},
                    '900002|combine|Medium': {'3': 'Neutral'},
                    '900003|combine|T2B': {'4': 'Agree', '5': 'Totally agree'},
                }
            },
            "mean": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
        },
    ]
    
    lst_header_qres = [
        [
            {
                "qre_name": "Age",
                "qre_lbl": "Age",
                "cats": {
                    'TOTAL': 'TOTAL',
                    '2': '18 - 24', '3': '25 - 30', '4': '31 - 39', '5': '40 - 50', '6': 'Trên 50'
                }
            },
            {
                "qre_name": "@City2",
                "qre_lbl": "Location",
                "cats": {
                    'City.isin([1, 5, 10, 11, 12])': 'All South',
                    'City.isin([2, 4, 16, 17, 18])': 'All North',
                }
            },
        ],
    ]
    
    lst_func_to_run = [
        {
            'func_name': 'run_standard_table_sig',
            'tables_to_run': [
                'Tbl_1_Pct',  # this table use df_data & df_info to run
                'Tbl_1_Count',  # this table use df_data & df_info to run
            ],
            'tables_format': {
    
                "Tbl_1_Pct": {
                    "tbl_name": "Table 1 - Pct",
                    "tbl_filter": "City > 0",
                    "is_count": 0,
                    "is_pct_sign": 1,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",  # ind / rel
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
    
                "Tbl_1_Count": {
                    "tbl_name": "Table 1 - Count",
                    "tbl_filter": "City > 0",
                    "is_count": 1,
                    "is_pct_sign": 0,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
            },
    
        },
    ]
    
    dtg = DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
    dtg.run_tables_by_js_files(lst_func_to_run)
    
    dtf = TableFormatter(xlsx_name=str_tbl_file_name)
    dtf.format_sig_table()
    

This is a simple example package. You can use Github-flavored Markdown to write your content.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpkits-1.3.21.tar.gz (64.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dpkits-1.3.21-py3-none-any.whl (71.9 kB view details)

Uploaded Python 3

File details

Details for the file dpkits-1.3.21.tar.gz.

File metadata

  • Download URL: dpkits-1.3.21.tar.gz
  • Upload date:
  • Size: 64.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.3.21.tar.gz
Algorithm Hash digest
SHA256 e3f99c70f45257880c1a3421b990643889092604c42469823d5a2c7a20a72949
MD5 05b4aba5fa02e7678e90400341d64c16
BLAKE2b-256 7e3e654758a10f46f6180d36bee0f65bb8bb56fac66f2b0b5d45118b83aa382d

See more details on using hashes here.

File details

Details for the file dpkits-1.3.21-py3-none-any.whl.

File metadata

  • Download URL: dpkits-1.3.21-py3-none-any.whl
  • Upload date:
  • Size: 71.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.3.21-py3-none-any.whl
Algorithm Hash digest
SHA256 fc0cb8e8cb2b736e686101244941e86438c5dd1394db4a7856ef1e38c1107b7b
MD5 5cbf73e6dab498e86d79d19b648f7885
BLAKE2b-256 d411827625468e391f30e241aa093814a2e5859dce1b5818e0a1303434400d85

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page