Skip to main content

A python library that will help developers trying to parse different excel file formats. From conversions of file formats to extracting required information from complex excel sheets, this library is the go-to I would like to propose to all of developer community.

Project description

WBClean_XUM

WBClean_XUM is a small Python utility class for cleaning Excel workbooks by transposing entire sheets, removing noisy/empty columns, optionally mapping “important” columns using an LLM (Groq/OpenAI-compatible API), converting .xls → .xlsx (Windows), and returning the cleaned output as a pandas DataFrame or an Excel file.


Installation

From PyPI

pip install WBClean_XUM

From GitHub

pip install WBClean_XUM

Requirements

  • Python 3.9+
  • Dependencies:
    • pandas
    • openpyxl
    • requests
    • pywin32 (Windows only; required only for .xls conversion)

Note: .xls conversion uses COM automation (win32com) and requires Microsoft Excel installed on Windows.


Quick Start

  1. Clean a raw Excel file and get a DataFrame
import re
import WBClean_XUM

tool = WBClean_XUM.new()


df = tool.XUM_Clean(
    filePath,                    # File path
    pattern=pattern,             # Pattern must be lowercase column name that you are looking in your sheet. if mupltiple present, seperate by |. STR format
    returnDF=True,               # By default returns Dataframe. if toggled to False, will save an .xlsx file in your folder 
    remove_none=True,            # by default will remove all the None rows. Toggle to False if required.
    sheetName=None,              # optional if destinationSheetName specified
    destinationSheet="WBClean_XUM", # Destination sheet name
    getImpFeatures=False,        # By default is set to False, because this requires a groq account and api key of yours
    prompt_ReqFeildString="""
    - col_id_1: What keywords to look for here. Have each seperated by / . 
    - col_id_2: Description / Note (give description only if present or else use Notes)
    - col_id_3: Quantity (QTY / Qty / Order Qty)
    - col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
    - add more as per your requirement...
    """,                         # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
    prompt_ReqJSONOutputString="""
    {{
      "col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
      "col_id_2": "int or null",
      "col_id_3": "str or null",
      "col_id_4": "float or null",
    }}
    """,                        # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    Key="Groq API Key",          # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
    groqModel="llama-3.3-70b-versatile", # change as per your requirement
    contentType="application/json", # change as per your requirement
    temperature=0, # change as per your requirement
    maxTokens=512 # change as per your requirement
)

print(df.head())
  1. Use Utilities seperately

Transpose

XUM_TransposeSheet(
    srcPath,                              # File path
    destinationSheetName="WBClean_XUM",   # Destination sheet name (Recommended)
    srcSheetName=None                     # optional if destinationSheetName specified
)

#TODO: Returns -> 2-D Array

LLM Feature

XUM_LLMFormat(
    prompt_ReqFeildString="""
    - col_id_1: What keywords to look for here. Have each seperated by / . 
    - col_id_2: Description / Note (give description only if present or else use Notes)
    - col_id_3: Quantity (QTY / Qty / Order Qty)
    - col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
    - add more as per your requirement...
    """,                         # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
    prompt_ReqJSONOutputString="""
    {{
      "col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
      "col_id_2": "int or null",
      "col_id_3": "str or null",
      "col_id_4": "float or null",
    }}
    """,                        # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    Key="Groq API Key",          # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
    groqModel="llama-3.3-70b-versatile", # change as per your requirement
    contentType="application/json", # change as per your requirement
    temperature=0, # change as per your requirement
    maxTokens=512 # change as per your requirement
    prompt_SampleData, # enter "" if nothing to provide else pass-in any array
    prompt_FullCustom=False, # This is a the point that changes everything else. If this is toggled to True, you don't have to specify [prompt_ReqFeildString,prompt_ReqJSONOutputString]
    prompt_Full=None # if prompt_FullCustom==True, then provide your custom written prompt to pass the LLM
)

#TODO: Returns -> json in string format ("{json contents}")

Delete Rows

XUM_DeleteRows(
    Table2dArray=[[],[],[],..], # 2-D array of the table/sheet
    rowIndexList=[0,1,2,..]
)

#TODO: Returns -> table as 2-D Array

Delete Columns

XUM_DeleteColumns(
    Table2dArray=[[],[],[],..],  # 2-D array of the table/sheet
    colIndexList=[0,1,2,..]
)

#TODO: Returns -> table as 2-D Array

Regex based pattern search

XUM_TextPresenceRegex(
    x="Text: Anything",
    pattern=re.compile('regexpattern', re.I)
)

#TODO: Returns -> (Boolean) True or False

XLS Conversion

XUM_XLSConversion(
    xlsPath="xls file path"
)

#TODO: Returns -> xlsx file path

For suggesting changes or updates or corrections, please contact: nmkrishnan108@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wbclean_xum-1.6.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wbclean_xum-1.6-py3-none-any.whl (6.2 kB view details)

Uploaded Python 3

File details

Details for the file wbclean_xum-1.6.tar.gz.

File metadata

  • Download URL: wbclean_xum-1.6.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for wbclean_xum-1.6.tar.gz
Algorithm Hash digest
SHA256 588c1e607e7be0a047f54705773ae3ac2eb80f96e769358e332b777af2ffd328
MD5 1a5f8cbefe477eacef69e7d1877b9333
BLAKE2b-256 4be08b57c57b3fb4ec28320d3bf1c53c0ff75b4006a7abadccd77aa57fc66406

See more details on using hashes here.

File details

Details for the file wbclean_xum-1.6-py3-none-any.whl.

File metadata

  • Download URL: wbclean_xum-1.6-py3-none-any.whl
  • Upload date:
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for wbclean_xum-1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 a716af3f04c89528e58725d8b05f367e740e23d82c15784ed7f9d56848515a5f
MD5 9f6c31b3672174f0d698ac8ebb1787d1
BLAKE2b-256 54cb5f1c22efbe5cb900d1747639b89582205d0ff5a2ac4e1d76f3985334a595

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page