Skip to main content

A small package for data processing

Project description

Data processing Package

  • Requirements

    • pandas
    • pyreadstat
    • numpy
    • zipfile
    • fastapi[UploadFile]
  • Step 1: import classes

    # Convert data to pandas dataframe
    from dpkits.ap_data_converter import APDataConverter
    
    # Calculate LSM score
    from dpkits.calculate_lsm import LSMCalculation
    
    # Transpose data to stack and untack
    from dpkits.data_transpose import DataTranspose
    
    # Create the tables from converted dataframe 
    from dpkits.table_generator import DataTableGenerator
    
    # Format data tables 
    from dpkits.table_formater import TableFormatter
    
  • Step 2: Convert data files to dataframe

    • class APDataConverter(files=None, file_name='', is_qme=True)
      • input 1 of files or file_name
      • files: list[UploadFile] default = None
      • file_name: str default = ''
      • is_qme: bool default = True
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      # Define input/output files name
      str_file_name = 'APDataTest'
      str_tbl_file_name = f'{str_file_name}_Topline.xlsx'
      
      converter = APDataConverter(file_name='APDataTesting.xlsx')
      
      df_data, df_info = converter.convert_df_mc() 
      
      # Use 'converter.convert_df_md()' if you need md data
      
  • Step 3: Calculate LSM classificate (only for LSM projects)

    • class LSMCalculation.cal_lsm_6(df_data, df_info)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      df_data, df_info = LSMCalculation.cal_lsm_6(df_data, df_info)
      
      # df_data, df_info will contains the columns CC1_Score to CC6_Score & LSM_Score
      
  • Step 4: Data cleaning (if needed)

    # Use pandas's functions to clean/process data
    
    df_data['Gender_new'] = df_data['Gender']
    
    df_data.replace({
        'Q1_SP1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
        'Q1_SP2': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
    }, inplace=True)
    
    df_data.loc[(df_data['Gender_new'] == 2) & (df_data['Age'] == 5),  ['Gender_new']] = [np.nan]
    df_info.loc[df_info['var_name'] == 'Q1_SP1', ['val_lbl']] = [{'1': 'a', '2': 'b', '3': 'c', '4': 'd', '5': 'e'}]
    
    df_info = pd.concat([df_info, pd.DataFrame(
        columns=['var_name', 'var_lbl', 'var_type', 'val_lbl'],
        data=[
            ['Gender_new', 'Please indicate your gender', 'SA', {'1': 'aaa', '2': 'bb', '3': 'cc'}]
        ]
    )], ignore_index=True)
    
  • Step 5: Transpose data (if needed)

    • class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • dict_stack_structure: dict
      • Returns:
        • df_data_stack: pandas.Dataframe
        • df_info_stack: pandas.Dataframe
      dict_stack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_scr': ['Gender', 'Age', 'City', 'HHI'],
          'dict_sp': {
              1: {
                  'Ma_SP1': 'Ma_SP',
                  'Q1_SP1': 'Q1',
                  'Q2_SP1': 'Q2',
                  'Q3_SP1': 'Q3',
              },
              2: {
                  'Ma_SP2': 'Ma_SP',
                  'Q1_SP2': 'Q1',
                  'Q2_SP2': 'Q2',
                  'Q3_SP2': 'Q3',
              },
          },
          'lst_fc': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_stack, df_info_stack = DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      
    • class DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      • df_data_stack: pandas.Dataframe which transpose from stack
      • df_info_stack: pandas.Dataframe which transpose from stack
      • dict_unstack_structure: dict
      • Returns:
        • df_data_unstack: pandas.Dataframe
        • df_info_unstack: pandas.Dataframe
      dict_unstack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_col_part_head': ['Gender', 'Age', 'City', 'HHI'],
          'lst_col_part_body': ['Q1', 'Q2', 'Q3'],
          'lst_col_part_tail': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_unstack, df_info_unstack = DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      
  • Step 6: OE Running

    
    
  • Step 7: Export *.sav & *.xlsx

    • class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
      • df_data: pandas.Dataframe
        • dict_dfs: dict
        • is_md: bool default False
        • is_export_sav: bool default True
        • is_export_xlsx: bool default True
        • is_zip: bool default True
        • Returns: NONE
        dict_dfs = {
            1: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByCode',
                'sheet_name': 'ByCode',
                'is_recode_to_lbl': False,
            },
            2: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByLabel',
                'sheet_name': 'ByLabel',
                'is_recode_to_lbl': True,
            },
            3: {
                'data': df_data_stack,
                'info': df_info_stack,
                'tail_name': 'Stack',
                'sheet_name': 'Stack',
                'is_recode_to_lbl': False,
            },
            4: {
                'data': df_data_unstack,
                'info': df_info_unstack,
                'tail_name': 'Unstack',
                'sheet_name': 'Unstack',
                'is_recode_to_lbl': False,
            },
        }
        
        converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
        
  • Step 8: Export data tables

    • init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • xlsx_name: str
      • Returns: NONE
    • class DataTableGenerator.run_tables_by_js_files(lst_func_to_run)
      • lst_func_to_run: list
      • Returns: NONE
    • init TableFormatter(xlsx_name=str_tbl_file_name)
      • xlsx_name: str
      • Returns: NONE
    • class TableFormatter.format_sig_table()
      • Returns: NONE
    lst_side_qres = [
        {"qre_name": "CC1", "sort": "des"},
        {"qre_name": "$CC3", "sort": "asc"},
        {"qre_name": "$CC4", "sort": "des"},
        {"qre_name": "$CC6"},
        {"qre_name": "$CC10"},
        {"qre_name": "LSM"},
        {"qre_name": "Gender"},
        {"qre_name": "Age"},
        {"qre_name": "City"},
        {"qre_name": "HHI"},
        
        # MA Question with net/combine (can apply to SA questions)
        {"qre_name": "$Q15", "cats": {
            'net_code': {
                '900001|combine|Group 1 + 2': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
                '900002|net|Group 1': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                },
                '900003|net|Group 2': {
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
            },
            '8': 'Other (specify)',
            '9': 'No problem',
        }},
    
        # Scale question with full properties
        {
            "qre_name": "Perception",
            "cats": {
                '1': 'Totally disagree', '2': 'Disagree', '3': 'Neutral', '4': 'Agree', '5': 'Totally agree',
                'net_code': {
                    '900001|combine|B2B': {'1': 'Totally disagree', '2': 'Disagree'},
                    '900002|combine|Medium': {'3': 'Neutral'},
                    '900003|combine|T2B': {'4': 'Agree', '5': 'Totally agree'},
                }
            },
            "mean": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
        },
    ]
    
    lst_header_qres = [
        [
            {
                "qre_name": "Age",
                "qre_lbl": "Age",
                "cats": {
                    'TOTAL': 'TOTAL',
                    '2': '18 - 24', '3': '25 - 30', '4': '31 - 39', '5': '40 - 50', '6': 'Trên 50'
                }
            },
            {
                "qre_name": "@City2",
                "qre_lbl": "Location",
                "cats": {
                    'City.isin([1, 5, 10, 11, 12])': 'All South',
                    'City.isin([2, 4, 16, 17, 18])': 'All North',
                }
            },
        ],
    ]
    
    lst_func_to_run = [
        {
            'func_name': 'run_standard_table_sig',
            'tables_to_run': [
                'Tbl_1_Pct',  # this table use df_data & df_info to run
                'Tbl_1_Count',  # this table use df_data & df_info to run
            ],
            'tables_format': {
    
                "Tbl_1_Pct": {
                    "tbl_name": "Table 1 - Pct",
                    "tbl_filter": "City > 0",
                    "is_count": 0,
                    "is_pct_sign": 1,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",  # ind / rel
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
    
                "Tbl_1_Count": {
                    "tbl_name": "Table 1 - Count",
                    "tbl_filter": "City > 0",
                    "is_count": 1,
                    "is_pct_sign": 0,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
            },
    
        },
    ]
    
    dtg = DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
    dtg.run_tables_by_js_files(lst_func_to_run)
    
    dtf = TableFormatter(xlsx_name=str_tbl_file_name)
    dtf.format_sig_table()
    

This is a simple example package. You can use Github-flavored Markdown to write your content.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpkits-1.6.0.tar.gz (71.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dpkits-1.6.0-py3-none-any.whl (78.7 kB view details)

Uploaded Python 3

File details

Details for the file dpkits-1.6.0.tar.gz.

File metadata

  • Download URL: dpkits-1.6.0.tar.gz
  • Upload date:
  • Size: 71.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.6.0.tar.gz
Algorithm Hash digest
SHA256 dd6f5b7d04c48b29f7a43cdd905633cde211434b1abb788642069ffbc015ab2a
MD5 de8aec60eb0f011220bf62b5d1a99781
BLAKE2b-256 f073c05595ccdb59bdf918ed0570a505a032f75b0b1baf9111f85295df0b719a

See more details on using hashes here.

File details

Details for the file dpkits-1.6.0-py3-none-any.whl.

File metadata

  • Download URL: dpkits-1.6.0-py3-none-any.whl
  • Upload date:
  • Size: 78.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fdd94055679fcd35138fe3c26bdbec689544555615a77051423848b93a568d59
MD5 25bad620776813f35dfea0e36a261c77
BLAKE2b-256 174a54d62581e3765fe9c1081f2b6185b36a7144e1a6522cf712d0b4e69a0c49

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page