A small package for data processing
Project description
Data processing Package
-
Requirements
- pandas
- pyreadstat
- numpy
- zipfile
- fastapi[UploadFile]
-
Step 1: import classes
# Convert data to pandas dataframe from dpkits.ap_data_converter import APDataConverter # Calculate LSM score from dpkits.calculate_lsm import LSMCalculation # Transpose data to stack and untack from dpkits.data_transpose import DataTranspose # Create the tables from converted dataframe from dpkits.table_generator import DataTableGenerator # Format data tables from dpkits.table_formater import TableFormatter -
Step 2: Convert data files to dataframe
- class APDataConverter(files=None, file_name='', is_qme=True)
- input 1 of files or file_name
- files: list[UploadFile] default = None
- file_name: str default = ''
- is_qme: bool default = True
- Returns:
- df_data: pandas.Dataframe
- df_info: pandas.Dataframe
# Define input/output files name str_file_name = 'APDataTest' str_tbl_file_name = f'{str_file_name}_Topline.xlsx' converter = APDataConverter(file_name='APDataTesting.xlsx') df_data, df_info = converter.convert_df_mc() # Use 'converter.convert_df_md()' if you need md data
- class APDataConverter(files=None, file_name='', is_qme=True)
-
Step 3: Calculate LSM classificate (only for LSM projects)
- class LSMCalculation.cal_lsm_6(df_data, df_info)
- df_data: pandas.Dataframe
- df_info: pandas.Dataframe
- Returns:
- df_data: pandas.Dataframe
- df_info: pandas.Dataframe
df_data, df_info = LSMCalculation.cal_lsm_6(df_data, df_info) # df_data, df_info will contains the columns CC1_Score to CC6_Score & LSM_Score
- class LSMCalculation.cal_lsm_6(df_data, df_info)
-
Step 4: Data cleaning (if needed)
# Use pandas's functions to clean/process data df_data['Gender_new'] = df_data['Gender'] df_data.replace({ 'Q1_SP1': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}, 'Q1_SP2': {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}, }, inplace=True) df_data.loc[(df_data['Gender_new'] == 2) & (df_data['Age'] == 5), ['Gender_new']] = [np.nan] df_info.loc[df_info['var_name'] == 'Q1_SP1', ['val_lbl']] = [{'1': 'a', '2': 'b', '3': 'c', '4': 'd', '5': 'e'}] df_info = pd.concat([df_info, pd.DataFrame( columns=['var_name', 'var_lbl', 'var_type', 'val_lbl'], data=[ ['Gender_new', 'Please indicate your gender', 'SA', {'1': 'aaa', '2': 'bb', '3': 'cc'}] ] )], ignore_index=True) -
Step 5: Transpose data (if needed)
- class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
- df_data: pandas.Dataframe
- df_info: pandas.Dataframe
- dict_stack_structure: dict
- Returns:
- df_data_stack: pandas.Dataframe
- df_info_stack: pandas.Dataframe
dict_stack_structure = { 'id_col': 'ResID', 'sp_col': 'Ma_SP', 'lst_scr': ['Gender', 'Age', 'City', 'HHI'], 'dict_sp': { 1: { 'Ma_SP1': 'Ma_SP', 'Q1_SP1': 'Q1', 'Q2_SP1': 'Q2', 'Q3_SP1': 'Q3', }, 2: { 'Ma_SP2': 'Ma_SP', 'Q1_SP2': 'Q1', 'Q2_SP2': 'Q2', 'Q3_SP2': 'Q3', }, }, 'lst_fc': ['Awareness1', 'Frequency', 'Awareness2', 'Perception'] } df_data_stack, df_info_stack = DataTranspose.to_stack(df_data, df_info, dict_stack_structure) - class DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
- df_data_stack: pandas.Dataframe which transpose from stack
- df_info_stack: pandas.Dataframe which transpose from stack
- dict_unstack_structure: dict
- Returns:
- df_data_unstack: pandas.Dataframe
- df_info_unstack: pandas.Dataframe
dict_unstack_structure = { 'id_col': 'ResID', 'sp_col': 'Ma_SP', 'lst_col_part_head': ['Gender', 'Age', 'City', 'HHI'], 'lst_col_part_body': ['Q1', 'Q2', 'Q3'], 'lst_col_part_tail': ['Awareness1', 'Frequency', 'Awareness2', 'Perception'] } df_data_unstack, df_info_unstack = DataTranspose.to_unstack(df_data_stack, df_info_stack, dict_unstack_structure)
- class DataTranspose.to_stack(df_data, df_info, dict_stack_structure)
-
Step 6: OE Running
-
Step 7: Export *.sav & *.xlsx
- class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
- df_data: pandas.Dataframe
- dict_dfs: dict
- is_md: bool default False
- is_export_sav: bool default True
- is_export_xlsx: bool default True
- is_zip: bool default True
- Returns: NONE
dict_dfs = { 1: { 'data': df_data, 'info': df_info, 'tail_name': 'ByCode', 'sheet_name': 'ByCode', 'is_recode_to_lbl': False, }, 2: { 'data': df_data, 'info': df_info, 'tail_name': 'ByLabel', 'sheet_name': 'ByLabel', 'is_recode_to_lbl': True, }, 3: { 'data': df_data_stack, 'info': df_info_stack, 'tail_name': 'Stack', 'sheet_name': 'Stack', 'is_recode_to_lbl': False, }, 4: { 'data': df_data_unstack, 'info': df_info_unstack, 'tail_name': 'Unstack', 'sheet_name': 'Unstack', 'is_recode_to_lbl': False, }, } converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
- df_data: pandas.Dataframe
- class converter.generate_multiple_data_files(dict_dfs=dict_dfs, is_md=False, is_export_sav=True, is_export_xlsx=True, is_zip=True)
-
Step 8: Export data tables
- init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
- df_data: pandas.Dataframe
- df_info: pandas.Dataframe
- xlsx_name: str
- Returns: NONE
- class DataTableGenerator.run_tables_by_js_files(lst_func_to_run)
- lst_func_to_run: list
- Returns: NONE
- init TableFormatter(xlsx_name=str_tbl_file_name)
- xlsx_name: str
- Returns: NONE
- class TableFormatter.format_sig_table()
- Returns: NONE
lst_side_qres = [ {"qre_name": "CC1", "sort": "des"}, {"qre_name": "$CC3", "sort": "asc"}, {"qre_name": "$CC4", "sort": "des"}, {"qre_name": "$CC6"}, {"qre_name": "$CC10"}, {"qre_name": "LSM"}, {"qre_name": "Gender"}, {"qre_name": "Age"}, {"qre_name": "City"}, {"qre_name": "HHI"}, # MA Question with net/combine (can apply to SA questions) {"qre_name": "$Q15", "cats": { 'net_code': { '900001|combine|Group 1 + 2': { '1': 'Yellow/dull teeth', '3': 'Dental plaque', '5': 'Bad breath', '7': 'Aphthousulcer', '2': 'Sensitive teeth', '4': 'Caries', '6': 'Gingivitis (bleeding, swollen gums)', }, '900002|net|Group 1': { '1': 'Yellow/dull teeth', '3': 'Dental plaque', '5': 'Bad breath', '7': 'Aphthousulcer', }, '900003|net|Group 2': { '2': 'Sensitive teeth', '4': 'Caries', '6': 'Gingivitis (bleeding, swollen gums)', }, }, '8': 'Other (specify)', '9': 'No problem', }}, # Scale question with full properties { "qre_name": "Perception", "cats": { '1': 'Totally disagree', '2': 'Disagree', '3': 'Neutral', '4': 'Agree', '5': 'Totally agree', 'net_code': { '900001|combine|B2B': {'1': 'Totally disagree', '2': 'Disagree'}, '900002|combine|Medium': {'3': 'Neutral'}, '900003|combine|T2B': {'4': 'Agree', '5': 'Totally agree'}, } }, "mean": {1: 1, 2: 2, 3: 3, 4: 4, 5: 5} }, ] lst_header_qres = [ [ { "qre_name": "Age", "qre_lbl": "Age", "cats": { 'TOTAL': 'TOTAL', '2': '18 - 24', '3': '25 - 30', '4': '31 - 39', '5': '40 - 50', '6': 'Trên 50' } }, { "qre_name": "@City2", "qre_lbl": "Location", "cats": { 'City.isin([1, 5, 10, 11, 12])': 'All South', 'City.isin([2, 4, 16, 17, 18])': 'All North', } }, ], ] lst_func_to_run = [ { 'func_name': 'run_standard_table_sig', 'tables_to_run': [ 'Tbl_1_Pct', # this table use df_data & df_info to run 'Tbl_1_Count', # this table use df_data & df_info to run ], 'tables_format': { "Tbl_1_Pct": { "tbl_name": "Table 1 - Pct", "tbl_filter": "City > 0", "is_count": 0, "is_pct_sign": 1, "is_hide_oe_zero_cats": 1, "sig_test_info": { "sig_type": "", # ind / rel "sig_cols": [], "lst_sig_lvl": [] }, "lst_side_qres": lst_side_qres, "lst_header_qres": lst_header_qres }, "Tbl_1_Count": { "tbl_name": "Table 1 - Count", "tbl_filter": "City > 0", "is_count": 1, "is_pct_sign": 0, "is_hide_oe_zero_cats": 1, "sig_test_info": { "sig_type": "", "sig_cols": [], "lst_sig_lvl": [] }, "lst_side_qres": lst_side_qres, "lst_header_qres": lst_header_qres }, }, }, ] dtg = DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name) dtg.run_tables_by_js_files(lst_func_to_run) dtf = TableFormatter(xlsx_name=str_tbl_file_name) dtf.format_sig_table() - init DataTableGenerator(df_data=df_data, df_info=df_info, xlsx_name=str_tbl_file_name)
This is a simple example package. You can use Github-flavored Markdown to write your content.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dpkits-1.3.18.tar.gz.
File metadata
- Download URL: dpkits-1.3.18.tar.gz
- Upload date:
- Size: 62.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
21f9df26cfb34babdf13dee524fe8f0990af6a03a41f0d6a191c36e546cd72a1
|
|
| MD5 |
1ccb78acd5b8ee192e2dc6a705f387fb
|
|
| BLAKE2b-256 |
640aeac41beb1ed42ab3d25796e364f90272120c8738e387fa6b0b78b86c2455
|
File details
Details for the file dpkits-1.3.18-py3-none-any.whl.
File metadata
- Download URL: dpkits-1.3.18-py3-none-any.whl
- Upload date:
- Size: 69.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c154c0f762b37cf01f1996cf49574ac5b6f41c1a3a324cfa92235979bdb4f22f
|
|
| MD5 |
686daa6139fac19ca2bafc420bbcb2ae
|
|
| BLAKE2b-256 |
8f9b66ee2fc180ca8fc206e19a669e391936932094d377864111a10fb872c239
|