Skip to main content

A python library that will help developers trying to parse different excel file formats. From conversions of file formats to extracting required information from complex excel sheets, this library is the go-to I would like to propose to all of developer community.

Project description

WBClean_XUM

WBClean_XUM is a small Python utility class for cleaning Excel workbooks by transposing entire sheets, removing noisy/empty columns, optionally mapping “important” columns using an LLM (Groq/OpenAI-compatible API), converting .xls → .xlsx (Windows), and returning the cleaned output as a pandas DataFrame or an Excel file.


Installation

From PyPI

pip install WBClean_XUM

From GitHub

pip install WBClean_XUM

Requirements

  • Python 3.9+
  • Dependencies:
    • pandas
    • openpyxl
    • requests
    • pywin32 (Windows only; required only for .xls conversion)

Note: .xls conversion uses COM automation (win32com) and requires Microsoft Excel installed on Windows.


Quick Start

  1. Clean a raw Excel file and get a DataFrame
import re
import WBClean_XUM

tool = WBClean_XUM.new()


df = tool.XUM_Clean(
    filePath,                    # File path
    pattern=pattern,             # Pattern must be lowercase column name that you are looking in your sheet. if mupltiple present, seperate by |. STR format
    returnDF=True,               # By default returns Dataframe. if toggled to False, will save an .xlsx file in your folder 
    remove_none=True,            # by default will remove all the None rows. Toggle to False if required.
    sheetName=None,              # optional if destinationSheetName specified
    destinationSheet="WBClean_XUM", # Destination sheet name
    getImpFeatures=False,        # By default is set to False, because this requires a groq account and api key of yours
    prompt_ReqFeildString="""
    - col_id_1: What keywords to look for here. Have each seperated by / . 
    - col_id_2: Description / Note (give description only if present or else use Notes)
    - col_id_3: Quantity (QTY / Qty / Order Qty)
    - col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
    - add more as per your requirement...
    """,                         # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
    prompt_ReqJSONOutputString="""
    {{
      "col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
      "col_id_2": "int or null",
      "col_id_3": "str or null",
      "col_id_4": "float or null",
    }}
    """,                        # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    Key="Groq API Key",          # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
    groqModel="llama-3.3-70b-versatile", # change as per your requirement
    contentType="application/json", # change as per your requirement
    temperature=0, # change as per your requirement
    maxTokens=512 # change as per your requirement
)

print(df.head())
  1. Use Utilities seperately

Transpose

XUM_TransposeSheet(
    srcPath,                              # File path
    destinationSheetName="WBClean_XUM",   # Destination sheet name (Recommended)
    srcSheetName=None                     # optional if destinationSheetName specified
)

#TODO: Returns -> 2-D Array

LLM Feature

XUM_LLMFormat(
    prompt_ReqFeildString="""
    - col_id_1: What keywords to look for here. Have each seperated by / . 
    - col_id_2: Description / Note (give description only if present or else use Notes)
    - col_id_3: Quantity (QTY / Qty / Order Qty)
    - col_id_4: OEM Part Number / Catalog Number / X ref / Alt Part if explicitly present (not item code), else null
    - add more as per your requirement...
    """,                         # If getImpFeatures == True, then this parameter is required. This is the rules for your LLM.
    prompt_ReqJSONOutputString="""
    {{
      "col_id_1": "enter a valid datatype (int,str,float) and null for fallback",
      "col_id_2": "int or null",
      "col_id_3": "str or null",
      "col_id_4": "float or null",
    }}
    """,                        # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    Key="Groq API Key",          # If getImpFeatures == True, then this parameter is required. This is the output format specifier for your LLM.
    APIUrl="https://api.groq.com/openai/v1/chat/completions", # change as per your requirement
    groqModel="llama-3.3-70b-versatile", # change as per your requirement
    contentType="application/json", # change as per your requirement
    temperature=0, # change as per your requirement
    maxTokens=512 # change as per your requirement
    prompt_SampleData, # enter "" if nothing to provide else pass-in any array
    prompt_FullCustom=False, # This is a the point that changes everything else. If this is toggled to True, you don't have to specify [prompt_ReqFeildString,prompt_ReqJSONOutputString]
    prompt_Full=None # if prompt_FullCustom==True, then provide your custom written prompt to pass the LLM
)

#TODO: Returns -> json in string format ("{json contents}")

Delete Rows

XUM_DeleteRows(
    Table2dArray=[[],[],[],..], # 2-D array of the table/sheet
    rowIndexList=[0,1,2,..]
)

#TODO: Returns -> table as 2-D Array

Delete Columns

XUM_DeleteColumns(
    Table2dArray=[[],[],[],..],  # 2-D array of the table/sheet
    colIndexList=[0,1,2,..]
)

#TODO: Returns -> table as 2-D Array

Regex based pattern search

XUM_TextPresenceRegex(
    x="Text: Anything",
    pattern=re.compile('regexpattern', re.I)
)

#TODO: Returns -> (Boolean) True or False

XLS Conversion

XUM_XLSConversion(
    xlsPath="xls file path"
)

#TODO: Returns -> xlsx file path

For suggesting changes or updates or corrections, please contact: nmkrishnan108@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wbclean_xum-1.4.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

wbclean_xum-1.4-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file wbclean_xum-1.4.tar.gz.

File metadata

  • Download URL: wbclean_xum-1.4.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for wbclean_xum-1.4.tar.gz
Algorithm Hash digest
SHA256 edf98dae19c3f79a834ad9c5df61e1e369b80a139b9eb4e040de5652e8b73a69
MD5 096b0dcce7ae0c942f83dd700128da89
BLAKE2b-256 a551fd0a411916fcccba3793ecddcaa49a279e84a82f392ff9d1ab3714ec7446

See more details on using hashes here.

File details

Details for the file wbclean_xum-1.4-py3-none-any.whl.

File metadata

  • Download URL: wbclean_xum-1.4-py3-none-any.whl
  • Upload date:
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for wbclean_xum-1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 62230e84a658f78c00cc1d8f2faba841066a195096614d5e43c93b49e7228fa8
MD5 20b19dab68519318bd10148036bb0289
BLAKE2b-256 4474f965fedaf3a180966f0bb0f38b193a432ec0b0fd20961e4ef8a3fecd670e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page