A python library that will help developers trying to parse different excel file formats. From conversions of file formats to extracting required information from complex excel sheets, this library is the go-to I would like to propose to all of developer community.
Project description
WBClean_XUM
WBClean_XUM is a small Python utility class for cleaning Excel workbooks by transposing entire sheets, removing noisy/empty columns, optionally mapping “important” columns using an LLM (Groq/OpenAI-compatible API), converting .xls → .xlsx (Windows), and returning the cleaned output as a pandas DataFrame or an Excel file.
Installation
From PyPI
pip install WBClean_XUM
From GitHub
pip install WBClean_XUM
Requirements
- Python 3.9+
- Dependencies:
pandasopenpyxlrequestspywin32(Windows only; required only for.xlsconversion)
Note:
.xlsconversion uses COM automation (win32com) and requires Microsoft Excel installed on Windows.
Quick Start
- Clean a raw Excel file and get a DataFrame
import re
import WBClean_XUM
tool = WBClean_XUM.new()
df = tool.XUM_Clean(
filePath, # File path
pattern=pattern, # Pattern must be lowercase column name that you are looking in your sheet. if mupltiple present, seperate by |. STR format
returnDF=True, # By default returns Dataframe. if toggled to False, will save an .xlsx file in your folder
remove_none=True, # by default will remove all the None rows. Toggle to False if required.
sheetName=None, # optional if destinationSheetName specified
destinationSheet="WBClean_XUM", # Destination sheet name
getImpFeatures=False, # By default is set to False, because this requires a groq account and api key of yours
prompt_ReqFeildString="""
- col_id_1: What keywords to look for here. Have each seperated by / .
- col_id_2: Description / Note (give description only if present or else use Notes)
- col_id_3: Quantity (QTY / Qty / Order Qty)
- col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
- add more as per your requirement...
""", # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
prompt_ReqJSONOutputString="""
{{
"col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
"col_id_2": "int or null",
"col_id_3": "str or null",
"col_id_4": "float or null",
}}
""", # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
Key="Groq API Key", # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
groqModel="llama-3.3-70b-versatile", # change as per your requirement
contentType="application/json", # change as per your requirement
temperature=0, # change as per your requirement
maxTokens=512 # change as per your requirement
)
print(df.head())
- Use Utilities seperately
Transpose
XUM_TransposeSheet(
srcPath, # File path
destinationSheetName="WBClean_XUM", # Destination sheet name (Recommended)
srcSheetName=None # optional if destinationSheetName specified
)
#TODO: Returns -> 2-D Array
LLM Feature
XUM_LLMFormat(
prompt_ReqFeildString="""
- col_id_1: What keywords to look for here. Have each seperated by / .
- col_id_2: Description / Note (give description only if present or else use Notes)
- col_id_3: Quantity (QTY / Qty / Order Qty)
- col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
- add more as per your requirement...
""", # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
prompt_ReqJSONOutputString="""
{{
"col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
"col_id_2": "int or null",
"col_id_3": "str or null",
"col_id_4": "float or null",
}}
""", # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
Key="Groq API Key", # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
groqModel="llama-3.3-70b-versatile", # change as per your requirement
contentType="application/json", # change as per your requirement
temperature=0, # change as per your requirement
maxTokens=512 # change as per your requirement
prompt_SampleData, # enter "" if nothing to provide else pass-in any array
prompt_FullCustom=False, # This is a the point that changes everything else. If this is toggled to True, you don't have to specify [prompt_ReqFeildString,prompt_ReqJSONOutputString]
prompt_Full=None # if prompt_FullCustom==True, then provide your custom written prompt to pass the LLM
)
#TODO: Returns -> json in string format ("{json contents}")
Delete Rows
XUM_DeleteRows(
Table2dArray=[[],[],[],..], # 2-D array of the table/sheet
rowIndexList=[0,1,2,..]
)
#TODO: Returns -> table as 2-D Array
Delete Columns
XUM_DeleteColumns(
Table2dArray=[[],[],[],..], # 2-D array of the table/sheet
colIndexList=[0,1,2,..]
)
#TODO: Returns -> table as 2-D Array
Regex based pattern search
XUM_TextPresenceRegex(
x="Text: Anything",
pattern=re.compile('regexpattern', re.I)
)
#TODO: Returns -> (Boolean) True or False
XLS Conversion
XUM_XLSConversion(
xlsPath="xls file path"
)
#TODO: Returns -> xlsx file path
For suggesting changes or updates or corrections, please contact: nmkrishnan108@gmail.com
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file wbclean_xum-1.5.tar.gz.
File metadata
- Download URL: wbclean_xum-1.5.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
66d8e0bb694ce9630d25e88b46de6c0731139a4d915e08cddde6663e2fda2261
|
|
| MD5 |
af8d766dd37b9780a56c9d491eaea630
|
|
| BLAKE2b-256 |
330043030fb5e4ca32c8a3c8647009e4352ac43669b18fe859b3142f70281514
|
File details
Details for the file wbclean_xum-1.5-py3-none-any.whl.
File metadata
- Download URL: wbclean_xum-1.5-py3-none-any.whl
- Upload date:
- Size: 6.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
705b6e3b67469994ed5f01dedd9d79b35fee8236221edd47f8471032a3f026dc
|
|
| MD5 |
b20781a85674c30492ae3e10e620af95
|
|
| BLAKE2b-256 |
9abee77badc55ecc1894f858d78c12bc3735483ecab98ca1ff78e1d681b0a863
|