Skip to main content

A small package for data processing

Project description

Data processing Package

  • Requirements

    • pandas
    • pyreadstat
    • numpy
    • zipfile
    • fastapi[UploadFile]
  • Step 1: import classes

    # Convert data to pandas dataframe
    from dpkits.ap_data_converter import APDataConverter
    
    # Calculate LSM score
    from dpkits.calculate_lsm import LSMCalculation
    
    # Transpose data to stack and untack
    from dpkits.data_transpose import DataTranspose
    
    # Create the tables from converted dataframe 
    from dpkits.table_generator import DataTableGenerator
    
    # Format data tables 
    from dpkits.table_formater import TableFormatter
    
  • Step 2: Convert data files to dataframe

    • class APDataConverter(files=None, file_name='', is_qme=True)
      • input 1 of files or file_name
      • files: list[UploadFile] default = None
      • file_name: str default = ''
      • is_qme: bool default = True
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      # Define input/output files name
      str_file_name = 'APDataTest'
      str_tbl_file_name = f'{str_file_name}_Topline.xlsx'
      
      converter = APDataConverter(file_name='APDataTesting.xlsx')
      
      df_data, df_info = converter.convert_df_mc() 
      
      # Use 'converter.convert_df_md()' if you need md data
      
  • Step 3: Calculate LSM classificate (only for LSM projects)

    • class LSMCalculation.cal_lsm_6(df_data, df_info)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      df_data, df_info = LSMCalculation.cal_lsm_6(df_data, df_info)
      
      # df_data, df_info will contains the columns CC1_Score to CC6_Score & LSM_Score
      
  • Step 4: Data cleaning (if needed)

    # Use pandas's functions to clean/process data
    
    df_data['Gender_new'] = df_data['Gender']
    
    df_data.replace({
        'Q1_SP1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
        'Q1_SP2': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
    }, inplace=True)
    
    df_data.loc[(df_data['Gender_new'] == 2) & (df_data['Age'] == 5),  ['Gender_new']] = [np.nan]
    df_info.loc[df_info['var_name'] == 'Q1_SP1', ['val_lbl']] = [{'1': 'a', '2': 'b', '3': 'c', '4': 'd', '5': 'e'}]
    
    df_info = pd.concat([df_info, pd.DataFrame(
        columns=['var_name', 'var_lbl', 'var_type', 'val_lbl'],
        data=[
            ['Gender_new', 'Please indicate your gender', 'SA', {'1': 'aaa', '2': 'bb', '3': 'cc'}]
        ]
    )], ignore_index=True)
    
  • Step 5: Transpose data (if needed)

    • class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • dict_stack_structure: dict
      • Returns:
        • df_data_stack: pandas.Dataframe
        • df_info_stack: pandas.Dataframe
      dict_stack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_scr': ['Gender', 'Age', 'City', 'HHI'],
          'dict_sp': {
              1: {
                  'Ma_SP1': 'Ma_SP',
                  'Q1_SP1': 'Q1',
                  'Q2_SP1': 'Q2',
                  'Q3_SP1': 'Q3',
              },
              2: {
                  'Ma_SP2': 'Ma_SP',
                  'Q1_SP2': 'Q1',
                  'Q2_SP2': 'Q2',
                  'Q3_SP2': 'Q3',
              },
          },
          'lst_fc': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_stack, df_info_stack = DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      
    • class DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      • df_data_stack: pandas.Dataframe which transpose from stack
      • df_info_stack: pandas.Dataframe which transpose from stack
      • dict_unstack_structure: dict
      • Returns:
        • df_data_unstack: pandas.Dataframe
        • df_info_unstack: pandas.Dataframe
      dict_unstack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_col_part_head': ['Gender', 'Age', 'City', 'HHI'],
          'lst_col_part_body': ['Q1', 'Q2', 'Q3'],
          'lst_col_part_tail': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_unstack, df_info_unstack = DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      
  • Step 6: OE Running

    
    
  • Step 7: Export *.sav & *.xlsx

    • class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
      • df_data: pandas.Dataframe
        • dict_dfs: dict
        • is_md: bool default False
        • is_export_sav: bool default True
        • is_export_xlsx: bool default True
        • is_zip: bool default True
        • Returns: NONE
        dict_dfs = {
            1: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByCode',
                'sheet_name': 'ByCode',
                'is_recode_to_lbl': False,
            },
            2: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByLabel',
                'sheet_name': 'ByLabel',
                'is_recode_to_lbl': True,
            },
            3: {
                'data': df_data_stack,
                'info': df_info_stack,
                'tail_name': 'Stack',
                'sheet_name': 'Stack',
                'is_recode_to_lbl': False,
            },
            4: {
                'data': df_data_unstack,
                'info': df_info_unstack,
                'tail_name': 'Unstack',
                'sheet_name': 'Unstack',
                'is_recode_to_lbl': False,
            },
        }
        
        converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
        
  • Step 8: Export data tables

    • init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • xlsx_name: str
      • Returns: NONE
    • class DataTableGenerator.run_tables_by_js_files(lst_func_to_run)
      • lst_func_to_run: list
      • Returns: NONE
    • init TableFormatter(xlsx_name=str_tbl_file_name)
      • xlsx_name: str
      • Returns: NONE
    • class TableFormatter.format_sig_table()
      • Returns: NONE
    lst_side_qres = [
        {"qre_name": "CC1", "sort": "des"},
        {"qre_name": "$CC3", "sort": "asc"},
        {"qre_name": "$CC4", "sort": "des"},
        {"qre_name": "$CC6"},
        {"qre_name": "$CC10"},
        {"qre_name": "LSM"},
        {"qre_name": "Gender"},
        {"qre_name": "Age"},
        {"qre_name": "City"},
        {"qre_name": "HHI"},
        
        # MA Question with net/combine (can apply to SA questions)
        {"qre_name": "$Q15", "cats": {
            'net_code': {
                '900001|combine|Group 1 + 2': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
                '900002|net|Group 1': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                },
                '900003|net|Group 2': {
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
            },
            '8': 'Other (specify)',
            '9': 'No problem',
        }},
    
        # Scale question with full properties
        {
            "qre_name": "Perception",
            "cats": {
                '1': 'Totally disagree', '2': 'Disagree', '3': 'Neutral', '4': 'Agree', '5': 'Totally agree',
                'net_code': {
                    '900001|combine|B2B': {'1': 'Totally disagree', '2': 'Disagree'},
                    '900002|combine|Medium': {'3': 'Neutral'},
                    '900003|combine|T2B': {'4': 'Agree', '5': 'Totally agree'},
                }
            },
            "mean": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
        },
    ]
    
    lst_header_qres = [
        [
            {
                "qre_name": "Age",
                "qre_lbl": "Age",
                "cats": {
                    'TOTAL': 'TOTAL',
                    '2': '18 - 24', '3': '25 - 30', '4': '31 - 39', '5': '40 - 50', '6': 'Trên 50'
                }
            },
            {
                "qre_name": "@City2",
                "qre_lbl": "Location",
                "cats": {
                    'City.isin([1, 5, 10, 11, 12])': 'All South',
                    'City.isin([2, 4, 16, 17, 18])': 'All North',
                }
            },
        ],
    ]
    
    lst_func_to_run = [
        {
            'func_name': 'run_standard_table_sig',
            'tables_to_run': [
                'Tbl_1_Pct',  # this table use df_data & df_info to run
                'Tbl_1_Count',  # this table use df_data & df_info to run
            ],
            'tables_format': {
    
                "Tbl_1_Pct": {
                    "tbl_name": "Table 1 - Pct",
                    "tbl_filter": "City > 0",
                    "is_count": 0,
                    "is_pct_sign": 1,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",  # ind / rel
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
    
                "Tbl_1_Count": {
                    "tbl_name": "Table 1 - Count",
                    "tbl_filter": "City > 0",
                    "is_count": 1,
                    "is_pct_sign": 0,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
            },
    
        },
    ]
    
    dtg = DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
    dtg.run_tables_by_js_files(lst_func_to_run)
    
    dtf = TableFormatter(xlsx_name=str_tbl_file_name)
    dtf.format_sig_table()
    

This is a simple example package. You can use Github-flavored Markdown to write your content.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpkits-1.6.2.tar.gz (71.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dpkits-1.6.2-py3-none-any.whl (78.8 kB view details)

Uploaded Python 3

File details

Details for the file dpkits-1.6.2.tar.gz.

File metadata

  • Download URL: dpkits-1.6.2.tar.gz
  • Upload date:
  • Size: 71.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.6.2.tar.gz
Algorithm Hash digest
SHA256 27d9022fc7fba0c1b25f99543eb83e52260aec29d67446f60141381f4e2b6dc4
MD5 9a21e0b024f09411349c54212eb3602c
BLAKE2b-256 cb709a3e244cc1bdeaa3e4bfb1bdc3d97eb72fccfdc4d78d14ef9ebd8eeae47c

See more details on using hashes here.

File details

Details for the file dpkits-1.6.2-py3-none-any.whl.

File metadata

  • Download URL: dpkits-1.6.2-py3-none-any.whl
  • Upload date:
  • Size: 78.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.6.2-py3-none-any.whl
Algorithm Hash digest
SHA256 b80d0dcad55d7526e4aac4ec6422bd1a96524b40df17270f01af79ae03a770bd
MD5 014a45529c33f60c94b113470f8e7c68
BLAKE2b-256 e2d78b3b37cf67b11bcb23ad313f1a3573e92adb9363f8c3b92a73d8a8507ef1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page