Skip to main content

A small package for data processing

Project description

Data processing Package

  • Requirements

    • pandas
    • pyreadstat
    • numpy
    • zipfile
    • fastapi[UploadFile]
  • Step 1: import classes

    # Convert data to pandas dataframe
    from dpkits.ap_data_converter import APDataConverter
    
    # Calculate LSM score
    from dpkits.calculate_lsm import LSMCalculation
    
    # Transpose data to stack and untack
    from dpkits.data_transpose import DataTranspose
    
    # Create the tables from converted dataframe 
    from dpkits.table_generator import DataTableGenerator
    
    # Format data tables 
    from dpkits.table_formater import TableFormatter
    
  • Step 2: Convert data files to dataframe

    • class APDataConverter(files=None, file_name='', is_qme=True)
      • input 1 of files or file_name
      • files: list[UploadFile] default = None
      • file_name: str default = ''
      • is_qme: bool default = True
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      # Define input/output files name
      str_file_name = 'APDataTest'
      str_tbl_file_name = f'{str_file_name}_Topline.xlsx'
      
      converter = APDataConverter(file_name='APDataTesting.xlsx')
      
      df_data, df_info = converter.convert_df_mc() 
      
      # Use 'converter.convert_df_md()' if you need md data
      
  • Step 3: Calculate LSM classificate (only for LSM projects)

    • class LSMCalculation.cal_lsm_6(df_data, df_info)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • Returns:
        • df_data: pandas.Dataframe
        • df_info: pandas.Dataframe
      df_data, df_info = LSMCalculation.cal_lsm_6(df_data, df_info)
      
      # df_data, df_info will contains the columns CC1_Score to CC6_Score & LSM_Score
      
  • Step 4: Data cleaning (if needed)

    # Use pandas's functions to clean/process data
    
    df_data['Gender_new'] = df_data['Gender']
    
    df_data.replace({
        'Q1_SP1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
        'Q1_SP2': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1},
    }, inplace=True)
    
    df_data.loc[(df_data['Gender_new'] == 2) & (df_data['Age'] == 5),  ['Gender_new']] = [np.nan]
    df_info.loc[df_info['var_name'] == 'Q1_SP1', ['val_lbl']] = [{'1': 'a', '2': 'b', '3': 'c', '4': 'd', '5': 'e'}]
    
    df_info = pd.concat([df_info, pd.DataFrame(
        columns=['var_name', 'var_lbl', 'var_type', 'val_lbl'],
        data=[
            ['Gender_new', 'Please indicate your gender', 'SA', {'1': 'aaa', '2': 'bb', '3': 'cc'}]
        ]
    )], ignore_index=True)
    
  • Step 5: Transpose data (if needed)

    • class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • dict_stack_structure: dict
      • Returns:
        • df_data_stack: pandas.Dataframe
        • df_info_stack: pandas.Dataframe
      dict_stack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_scr': ['Gender', 'Age', 'City', 'HHI'],
          'dict_sp': {
              1: {
                  'Ma_SP1': 'Ma_SP',
                  'Q1_SP1': 'Q1',
                  'Q2_SP1': 'Q2',
                  'Q3_SP1': 'Q3',
              },
              2: {
                  'Ma_SP2': 'Ma_SP',
                  'Q1_SP2': 'Q1',
                  'Q2_SP2': 'Q2',
                  'Q3_SP2': 'Q3',
              },
          },
          'lst_fc': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_stack, df_info_stack = DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
      
    • class DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      • df_data_stack: pandas.Dataframe which transpose from stack
      • df_info_stack: pandas.Dataframe which transpose from stack
      • dict_unstack_structure: dict
      • Returns:
        • df_data_unstack: pandas.Dataframe
        • df_info_unstack: pandas.Dataframe
      dict_unstack_structure = {
          'id_col': 'ResID',
          'sp_col': 'Ma_SP',
          'lst_col_part_head': ['Gender', 'Age', 'City', 'HHI'],
          'lst_col_part_body': ['Q1', 'Q2', 'Q3'],
          'lst_col_part_tail': ['Awareness1', 'Frequency', 'Awareness2', 'Perception']
      }
      
      df_data_unstack, df_info_unstack = DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
      
  • Step 6: OE Running

    
    
  • Step 7: Export *.sav & *.xlsx

    • class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
      • df_data: pandas.Dataframe
        • dict_dfs: dict
        • is_md: bool default False
        • is_export_sav: bool default True
        • is_export_xlsx: bool default True
        • is_zip: bool default True
        • Returns: NONE
        dict_dfs = {
            1: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByCode',
                'sheet_name': 'ByCode',
                'is_recode_to_lbl': False,
            },
            2: {
                'data': df_data,
                'info': df_info,
                'tail_name': 'ByLabel',
                'sheet_name': 'ByLabel',
                'is_recode_to_lbl': True,
            },
            3: {
                'data': df_data_stack,
                'info': df_info_stack,
                'tail_name': 'Stack',
                'sheet_name': 'Stack',
                'is_recode_to_lbl': False,
            },
            4: {
                'data': df_data_unstack,
                'info': df_info_unstack,
                'tail_name': 'Unstack',
                'sheet_name': 'Unstack',
                'is_recode_to_lbl': False,
            },
        }
        
        converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
        
  • Step 8: Export data tables

    • init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
      • df_data: pandas.Dataframe
      • df_info: pandas.Dataframe
      • xlsx_name: str
      • Returns: NONE
    • class DataTableGenerator.run_tables_by_js_files(lst_func_to_run)
      • lst_func_to_run: list
      • Returns: NONE
    • init TableFormatter(xlsx_name=str_tbl_file_name)
      • xlsx_name: str
      • Returns: NONE
    • class TableFormatter.format_sig_table()
      • Returns: NONE
    lst_side_qres = [
        {"qre_name": "CC1", "sort": "des"},
        {"qre_name": "$CC3", "sort": "asc"},
        {"qre_name": "$CC4", "sort": "des"},
        {"qre_name": "$CC6"},
        {"qre_name": "$CC10"},
        {"qre_name": "LSM"},
        {"qre_name": "Gender"},
        {"qre_name": "Age"},
        {"qre_name": "City"},
        {"qre_name": "HHI"},
        
        # MA Question with net/combine (can apply to SA questions)
        {"qre_name": "$Q15", "cats": {
            'net_code': {
                '900001|combine|Group 1 + 2': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
                '900002|net|Group 1': {
                    '1': 'Yellow/dull teeth',
                    '3': 'Dental plaque',
                    '5': 'Bad breath',
                    '7': 'Aphthousulcer',
                },
                '900003|net|Group 2': {
                    '2': 'Sensitive teeth',
                    '4': 'Caries',
                    '6': 'Gingivitis (bleeding, swollen gums)',
                },
            },
            '8': 'Other (specify)',
            '9': 'No problem',
        }},
    
        # Scale question with full properties
        {
            "qre_name": "Perception",
            "cats": {
                '1': 'Totally disagree', '2': 'Disagree', '3': 'Neutral', '4': 'Agree', '5': 'Totally agree',
                'net_code': {
                    '900001|combine|B2B': {'1': 'Totally disagree', '2': 'Disagree'},
                    '900002|combine|Medium': {'3': 'Neutral'},
                    '900003|combine|T2B': {'4': 'Agree', '5': 'Totally agree'},
                }
            },
            "mean": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
        },
    ]
    
    lst_header_qres = [
        [
            {
                "qre_name": "Age",
                "qre_lbl": "Age",
                "cats": {
                    'TOTAL': 'TOTAL',
                    '2': '18 - 24', '3': '25 - 30', '4': '31 - 39', '5': '40 - 50', '6': 'Trên 50'
                }
            },
            {
                "qre_name": "@City2",
                "qre_lbl": "Location",
                "cats": {
                    'City.isin([1, 5, 10, 11, 12])': 'All South',
                    'City.isin([2, 4, 16, 17, 18])': 'All North',
                }
            },
        ],
    ]
    
    lst_func_to_run = [
        {
            'func_name': 'run_standard_table_sig',
            'tables_to_run': [
                'Tbl_1_Pct',  # this table use df_data & df_info to run
                'Tbl_1_Count',  # this table use df_data & df_info to run
            ],
            'tables_format': {
    
                "Tbl_1_Pct": {
                    "tbl_name": "Table 1 - Pct",
                    "tbl_filter": "City > 0",
                    "is_count": 0,
                    "is_pct_sign": 1,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",  # ind / rel
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
    
                "Tbl_1_Count": {
                    "tbl_name": "Table 1 - Count",
                    "tbl_filter": "City > 0",
                    "is_count": 1,
                    "is_pct_sign": 0,
                    "is_hide_oe_zero_cats": 1,
                    "sig_test_info": {
                        "sig_type": "",
                        "sig_cols": [],
                        "lst_sig_lvl": []
                    },
                    "lst_side_qres": lst_side_qres,
                    "lst_header_qres": lst_header_qres
                },
            },
    
        },
    ]
    
    dtg = DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
    dtg.run_tables_by_js_files(lst_func_to_run)
    
    dtf = TableFormatter(xlsx_name=str_tbl_file_name)
    dtf.format_sig_table()
    

This is a simple example package. You can use Github-flavored Markdown to write your content.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpkits-1.3.16.tar.gz (62.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dpkits-1.3.16-py3-none-any.whl (69.4 kB view details)

Uploaded Python 3

File details

Details for the file dpkits-1.3.16.tar.gz.

File metadata

  • Download URL: dpkits-1.3.16.tar.gz
  • Upload date:
  • Size: 62.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.3.16.tar.gz
Algorithm Hash digest
SHA256 e5a02c866456cd730caebecba3249e1ac6cc75e46da55303a34226b515a77400
MD5 d018385718fffae8bee7ad9810463b85
BLAKE2b-256 f72a1d36bbf84f7bdcdfeb8ecfa3a3eb87fdc4301e533638a24619a9e9808630

See more details on using hashes here.

File details

Details for the file dpkits-1.3.16-py3-none-any.whl.

File metadata

  • Download URL: dpkits-1.3.16-py3-none-any.whl
  • Upload date:
  • Size: 69.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.0

File hashes

Hashes for dpkits-1.3.16-py3-none-any.whl
Algorithm Hash digest
SHA256 d8598cd96c3b53f4c2e2bca7e91c8b5157b3f86306860caf6d9124d576c82c1b
MD5 d6cc98747d4317ae3a57219a3e848a36
BLAKE2b-256 c1983fdbfaef8c38923284c221685c3ba0d2b0c0c65be2a6bd200309000b0d4e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page