Skip to main content

Everything to Markdown.

Project description

wisup_e2m Logo

E2M Repo E2M Version Python Version

🚀 E2M: Everything to Markdown

Everything to Markdown

E2M is a Python library that can parse and convert various file types into Markdown format. By utilizing a parser-converter architecture, it supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.

✨The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning.

📂 Supported File Types

  • doc
  • docx
  • epub
  • html
  • htm
  • url
  • pdf
  • pptx
  • mp3
  • m4a

📦 Installation

Create Environment:

conda create -n e2m python=3.10
conda activate e2m

Install E2M using pip:

# Option 1: Install via pip
pip install wisup_e2m
# Option 2: Install via git
pip install git+https://github.com/wisupai/e2m.git
# Option 3: Manual installation
git clone https://github.com/wisupai/e2m.git
pip install poetry
poetry build
pip install dist/wisup_e2m-0.1.41-py3-none-any.whl

⚡️ Parser Quick Start

Here's simple examples demonstrating how to use E2M Parsers:

📄 Pdf Parser

from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker") # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)

📝 Doc Parser

from wisup_e2m import DocParser

doc_path = "./test.doc"
parser = DocParser(engine="xml") # doc engines: xml
doc_data = parser.parse(doc_path)
print(doc_data.text)

📜 Docx Parser

from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="xml") # docx engines: xml
docx_data = parser.parse(docx_path)
print(docx_data.text)

📚 Epub Parser

from wisup_e2m import EpubParser

epub_path = "./test.epub"
parser = EpubParser(engine="unstructured") # epub engines: unstructured
epub_data = parser.parse(epub_path)
print(epub_data.text)

🌐 Html Parser

from wisup_e2m import HtmlParser

html_path = "./test.html"
parser = HtmlParser(engine="unstructured") # html engines: unstructured
html_data = parser.parse(html_path)
print(html_data.text)

🔗 Url Parser

from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina") # url engines: jina
url_data = parser.parse(url)
print(url_data.text)

🖼️ Ppt Parser

from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured") # ppt engines: unstructured
ppt_data = parser.parse(ppt_path)
print(ppt_data.text)

🖼️ Pptx Parser

from wisup_e2m import PptxParser

pptx_path = "./test.pptx"
parser = PptxParser(engine="unstructured") # pptx engines: unstructured
pptx_data = parser.parse(pptx_path)
print(pptx_data.text)

🎤 Voice Parser

from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
  engine="openai_whisper_local", # voice engines: openai_whisper_api, openai_whisper_local
  model="large" # available models: https://github.com/openai/whisper#available-models-and-languages
  )

voice_data = parser.parse(voice_path)
print(voice_data.text)

🔄 Converter Quick Start

Here's simple examples demonstrating how to use E2M Converters:

📝 Text Converter

from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
  engine="litellm", # text engines: litellm
  model="deepseek/deepseek-chat",
  api_key="your api key",
  base_url="your base url"
  )
text_data = converter.convert(text)
print(text_data)

🖼️ Image Converter

from wisup_e2m import ImageConverter

images = ["./test1.png", "./test2.png"]
converter = ImageConverter(
  engine="litellm", # image engines: litellm
  model="gpt-4o",
  api_key="your api key",
  base_url="your base url"
  )
image_data = converter.convert(image_path)
print(image_data)

🆙 Next Level

🛠️ E2MParser

E2MParser is an integrated parser that supports multiple file types. It can be used to parse a wide range of file types into Markdown format.

from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())

🛠️ E2MConverter

E2MConverter is an integrated converter that supports text and image conversion. It can be used to convert text and images into Markdown format.

from wisup_e2m import E2MConverter

ec = E2MConverter.from_config("./config.yaml")

text = "Parsed text data from any parser"

ec.convert(text=text)

images = ["test.jpg", "test.png"]
ec.convert(images=images)

You can use a config.yaml file to specify the parsers and converters you want to use. Here is an example of a config.yaml file:

parsers:
    doc_parser:
        engine: "xml"
        langs: ["en", "zh"]
    docx_parser:
        engine: "xml"
        langs: ["en", "zh"]
    epub_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    html_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    url_parser:
        engine: "jina"
        langs: ["en", "zh"]
    pdf_parser:
        engine: "marker"
        langs: ["en", "zh"]
    pptx_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    voice_parser:
        # option 1: use openai whisper api
        # engine: "openai_whisper_api"
        # api_base: "https://api.openai.com/v1"
        # api_key: "your_api_key"
        # model: "whisper"

        # option 2: use local whisper model
        engine: "openai_whisper_local"
        model: "large" # available models: https://github.com/openai/whisper#available-models-and-languages

converter:
    text_converter:
        engine: "litellm"
        model: "deepseek/deepseek-chat"
        api_key: "your_api_key"
        # base_url: ""
    image_converter:
        engine: "litellm"
        model: "gpt-4o-mini"
        api_key: "your_api_key"
        # base_url: ""

❓ Q&A

  • Resource xxx not found. Please use the NLTK Downloader to obtain the resource:

      import nltk
      nltk.download('all') # you can directly download all resources
    
  • Resource wordnet not found.

    • Uninstall nltk completely: pip uninstall nltk
    • Reinstall nltk with the following command: pip install nltk
    • Download corpora/wordnet.zip manually and unzip it to the directory specified in the error message. Otherwise, you can download it using the following commands:
      • Windows: wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip and unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\
      • Unix: wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip and unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

📧 Contact

You can scan the QR code below to join our WeChat group:

wisup_e2m Logo

For any questions or inquiries, please open an issue on GitHub or contact us at team@wisup.ai.

🌟 Contributing

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wisup_e2m-0.1.52.tar.gz (39.8 kB view details)

Uploaded Source

Built Distribution

wisup_e2m-0.1.52-py3-none-any.whl (58.3 kB view details)

Uploaded Python 3

File details

Details for the file wisup_e2m-0.1.52.tar.gz.

File metadata

  • Download URL: wisup_e2m-0.1.52.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.52.tar.gz
Algorithm Hash digest
SHA256 d76f9ef8984f2b18ba992c12cb5201b82d261403b222c6dada086a27624cacd1
MD5 c61b5524310ad10017d6fd4e526465c7
BLAKE2b-256 456581a326161370c8cd30dd484fdfce50a5e16803d69adaf817dfcc1a424205

See more details on using hashes here.

File details

Details for the file wisup_e2m-0.1.52-py3-none-any.whl.

File metadata

  • Download URL: wisup_e2m-0.1.52-py3-none-any.whl
  • Upload date:
  • Size: 58.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.52-py3-none-any.whl
Algorithm Hash digest
SHA256 cfab03e81a489129ef10431046b12f6c4c03b655ffccaba8e1b16be8f1c00d7f
MD5 347da08143bbf8976cf5025928c9fe99
BLAKE2b-256 087fd58c3294fa9f4d6a83aa09b5f7817f764538fcbefed63bf17a67532c83be

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page