🚀 E2M: Everything to Markdown

E2M is a Python library that can parse and convert various file types into Markdown format. By utilizing a parser-converter architecture, it supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.

✨The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning.

Core Architecture of the Project:

  • Parser: Responsible for parsing various file types into text or image data.
  • Converter: Responsible for converting text or image data into Markdown format.

Generally, for any type of file, the parser is run first to extract internal data such as text and images. Then, the converter is used to transform this data into Markdown format.
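The two-stage flow described above can be sketched as a small helper. This is a minimal sketch, not E2M's own implementation: `file_to_markdown` is a hypothetical name, and it assumes only what the quick-start examples in this README show, namely a parser whose `parse(path)` returns an object with a `.text` attribute and a converter with a `convert(text)` method.

```python
def file_to_markdown(path, parser, converter):
    """Run the parser first, then hand its extracted text to the converter.

    `parser` and `converter` are duck-typed: any objects exposing
    `parse(path) -> object with .text` and `convert(text)` will work.
    """
    parsed = parser.parse(path)             # stage 1: extract raw text
    return converter.convert(parsed.text)   # stage 2: turn text into Markdown
```

With E2M installed, `parser` could be `PdfParser(engine="marker")` and `converter` a configured `TextConverter`.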

📂 All Converters and Parsers

Parser

| Parser Type | Engine | Supported File Type |
|---|---|---|
| PdfParser | surya_layout, marker, unstructured | pdf |
| DocParser | xml | doc |
| DocxParser | xml | docx |
| PptParser | unstructured | ppt |
| PptxParser | unstructured | pptx |
| UrlParser | unstructured, jina, firecrawl | url |
| VoiceParser | openai_whisper_api, openai_whisper_local, SpeechRecognition | mp3, m4a |

Converter

| Converter Type | Engine |
|---|---|
| ImageConverter | litellm |
| TextConverter | litellm |
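The parser table can also be expressed as a lookup from file type to parser name and engines. The sketch below is illustrative only: `PARSERS` and `lookup_parser` are hypothetical names that are not part of E2M, and the data is copied from the table as printed (the `SpeechRecognition` entry may not be the exact engine string the library expects).

```python
from pathlib import PurePath

# Parser names and engines copied from the table; not read from the library.
VOICE_ENGINES = ["openai_whisper_api", "openai_whisper_local", "SpeechRecognition"]
PARSERS = {
    "pdf": ("PdfParser", ["surya_layout", "marker", "unstructured"]),
    "doc": ("DocParser", ["xml"]),
    "docx": ("DocxParser", ["xml"]),
    "ppt": ("PptParser", ["unstructured"]),
    "pptx": ("PptxParser", ["unstructured"]),
    "url": ("UrlParser", ["unstructured", "jina", "firecrawl"]),
    "mp3": ("VoiceParser", VOICE_ENGINES),
    "m4a": ("VoiceParser", VOICE_ENGINES),
}


def lookup_parser(source):
    """Return (parser_name, engines) for a file path or an http(s) URL."""
    if source.startswith(("http://", "https://")):
        key = "url"
    else:
        # ".PDF" -> "pdf"; a path without an extension yields "".
        key = PurePath(source).suffix.lower().lstrip(".")
    if key not in PARSERS:
        raise ValueError(f"no parser listed for {source!r}")
    return PARSERS[key]
```

For example, `lookup_parser("./test.pdf")` returns the `PdfParser` entry, while an unlisted extension raises `ValueError`.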

📦 Installation

Create Environment:

conda create -n e2m python=3.10
conda activate e2m

Install E2M using one of the following options:

# Option 1: Install from PyPI
pip install wisup_e2m
# Option 2: Install directly from GitHub
pip install git+https://github.com/wisupai/e2m.git
# Option 3: Manual installation from source
git clone https://github.com/wisupai/e2m.git
cd e2m
pip install poetry
poetry build
pip install dist/wisup_e2m-*-py3-none-any.whl  # the wheel name carries the version that was built
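After installing, one way to confirm that the package is visible to the active environment is to query its metadata with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the name `wisup_e2m` used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="wisup_e2m"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string when E2M is installed, otherwise None.
print(installed_version())
```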

⚡️ Parser Quick Start

Here are simple examples demonstrating how to use E2M Parsers:

📄 Pdf Parser

from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker") # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)
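Because PdfParser supports three engines, a caller may want to try them in order until one succeeds. The sketch below shows that pattern under stated assumptions: `parse_with_fallback` is a hypothetical helper, and it assumes a failing engine raises an exception, which this README does not confirm.

```python
def parse_with_fallback(make_parser, path,
                        engines=("marker", "unstructured", "surya_layout")):
    """Try each engine in order; return (engine, parsed_data) for the first success.

    `make_parser` is a callable that builds a parser for a given engine name,
    for example: lambda engine: PdfParser(engine=engine)
    """
    errors = {}
    for engine in engines:
        try:
            return engine, make_parser(engine).parse(path)
        except Exception as exc:  # broad on purpose: engine failures vary
            errors[engine] = exc
    raise RuntimeError(f"all engines failed for {path}: {errors}")
```

With E2M installed, a call might look like `parse_with_fallback(lambda e: PdfParser(engine=e), "./test.pdf")`.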

📝 Doc Parser

from wisup_e2m import DocParser

doc_path = "./test.doc"
parser = DocParser(engine="xml") # doc engines: xml
doc_data = parser.parse(doc_path)
print(doc_data.text)

📜 Docx Parser

from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="xml") # docx engines: xml
docx_data = parser.parse(docx_path)
print(docx_data.text)

📚 Epub Parser

from wisup_e2m import EpubParser

epub_path = "./test.epub"
parser = EpubParser(engine="unstructured") # epub engines: unstructured
epub_data = parser.parse(epub_path)
print(epub_data.text)

🌐 Html Parser

from wisup_e2m import HtmlParser

html_path = "./test.html"
parser = HtmlParser(engine="unstructured") # html engines: unstructured
html_data = parser.parse(html_path)
print(html_data.text)

🔗 Url Parser

from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina") # url engines: unstructured, jina, firecrawl
url_data = parser.parse(url)
print(url_data.text)

🖼️ Ppt Parser

from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured") # ppt engines: unstructured
ppt_data = parser.parse(ppt_path)
print(ppt_data.text)

🖼️ Pptx Parser

from wisup_e2m import PptxParser

pptx_path = "./test.pptx"
parser = PptxParser(engine="unstructured") # pptx engines: unstructured
pptx_data = parser.parse(pptx_path)
print(pptx_data.text)

🎤 Voice Parser

from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
  engine="openai_whisper_local", # voice engines: openai_whisper_api, openai_whisper_local
  model="large" # available models: https://github.com/openai/whisper#available-models-and-languages
  )

voice_data = parser.parse(voice_path)
print(voice_data.text)

🔄 Converter Quick Start

Here are simple examples demonstrating how to use E2M Converters:

📝 Text Converter

from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
  engine="litellm", # text engines: litellm
  model="deepseek/deepseek-chat",
  api_key="your api key",
  base_url="your base url"
  )
text_data = converter.convert(text)
print(text_data)
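Rather than hardcoding the API key as in the example above, the converter arguments can be assembled from environment variables. This is a sketch of one possible approach: `litellm_settings_from_env` and the variable names `E2M_API_KEY` and `E2M_BASE_URL` are hypothetical and are not read by E2M itself.

```python
import os


def litellm_settings_from_env(model,
                              key_var="E2M_API_KEY",
                              base_url_var="E2M_BASE_URL"):
    """Build keyword arguments for a converter from environment variables."""
    api_key = os.environ.get(key_var)
    if not api_key:
        raise RuntimeError(f"environment variable {key_var} is not set")
    settings = {"engine": "litellm", "model": model, "api_key": api_key}
    base_url = os.environ.get(base_url_var)
    if base_url:  # base_url is optional; omit it when unset
        settings["base_url"] = base_url
    return settings
```

The result can then be unpacked into the constructor, for example `TextConverter(**litellm_settings_from_env("deepseek/deepseek-chat"))`.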

🖼️ Image Converter

from wisup_e2m import ImageConverter

images = ["./test1.png", "./test2.png"]
converter = ImageConverter(
  engine="litellm", # image engines: litellm
  model="gpt-4o",
  api_key="your api key",
  base_url="your base url"
  )
image_data = converter.convert(images)
print(image_data)
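When the images live in a folder, the list passed to the converter can be built with the standard library. This is a small sketch: `collect_images` is a hypothetical helper, and the suffix set is an assumption based on the `.png` and `.jpg` files used in this README, not a list of formats the converter is documented to accept.

```python
from pathlib import Path

# Assumed suffixes, taken from the file names used in this README.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_images(directory):
    """Return sorted paths of image files found directly inside `directory`."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The returned list can be passed straight to `converter.convert(...)` in place of the hand-written `images` list.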

🆙 Next Level

🛠️ E2MParser

E2MParser is an integrated parser that handles every supported file type through a single interface. It is initialized from a config.yaml file that selects the engine for each parser.

from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())

🛠️ E2MConverter

E2MConverter is an integrated converter that turns both text and images into Markdown through a single interface.

from wisup_e2m import E2MConverter

ec = E2MConverter.from_config("./config.yaml")

text = "Parsed text data from any parser"

ec.convert(text=text)

images = ["test.jpg", "test.png"]
ec.convert(images=images)
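The two integrated classes can be chained to turn a file into a Markdown file on disk. The helper below is a sketch under assumptions this README does not confirm: `convert_to_markdown_file` is a hypothetical name, the object returned by `E2MParser.parse` is assumed to expose `.text` like the individual parsers do, and `E2MConverter.convert` is assumed to return something that `str()` renders as Markdown.

```python
from pathlib import Path


def convert_to_markdown_file(ep, ec, src, out_dir):
    """Parse `src` with `ep`, convert the text with `ec`, write <stem>.md."""
    data = ep.parse(file_name=str(src))
    markdown = ec.convert(text=data.text)  # assumes the parsed data has .text
    out_path = Path(out_dir) / (Path(src).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(str(markdown), encoding="utf-8")
    return out_path
```

With the objects from the examples above, a call might look like `convert_to_markdown_file(ep, ec, "/path/to/file.pdf", "./out")`.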

You can use a config.yaml file to specify the parsers and converters you want to use. Here is an example of a config.yaml file:

parsers:
    doc_parser:
        engine: "xml"
        langs: ["en", "zh"]
    docx_parser:
        engine: "xml"
        langs: ["en", "zh"]
    epub_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    html_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    url_parser:
        engine: "jina"
        langs: ["en", "zh"]
    pdf_parser:
        engine: "marker"
        langs: ["en", "zh"]
    pptx_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    voice_parser:
        # option 1: use openai whisper api
        # engine: "openai_whisper_api"
        # api_base: "https://api.openai.com/v1"
        # api_key: "your_api_key"
        # model: "whisper-1"

        # option 2: use local whisper model
        engine: "openai_whisper_local"
        model: "large" # available models: https://github.com/openai/whisper#available-models-and-languages

converters:
    text_converter:
        engine: "litellm"
        model: "deepseek/deepseek-chat"
        api_key: "your_api_key"
        # base_url: ""
    image_converter:
        engine: "litellm"
        model: "gpt-4o-mini"
        api_key: "your_api_key"
        # base_url: ""
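A configuration like the one above can be sanity-checked before it is handed to `from_config`. This sketch works on an already loaded dictionary (for instance the result of `yaml.safe_load` from PyYAML). `KNOWN_ENGINES` and `check_parser_engines` are hypothetical names, and the engine lists are copied from this README rather than from the library, so they may be incomplete.

```python
# Engines per parser section, as listed in this README (may be incomplete).
KNOWN_ENGINES = {
    "doc_parser": {"xml"},
    "docx_parser": {"xml"},
    "epub_parser": {"unstructured"},
    "html_parser": {"unstructured"},
    "url_parser": {"unstructured", "jina", "firecrawl"},
    "pdf_parser": {"surya_layout", "marker", "unstructured"},
    "pptx_parser": {"unstructured"},
    "voice_parser": {"openai_whisper_api", "openai_whisper_local"},
}


def check_parser_engines(config):
    """Return a list of human-readable problems found in config['parsers']."""
    problems = []
    for name, settings in config.get("parsers", {}).items():
        allowed = KNOWN_ENGINES.get(name)
        if allowed is None:
            continue  # unknown section: nothing to compare against
        engine = (settings or {}).get("engine")
        if engine not in allowed:
            problems.append(
                f"{name}: engine {engine!r} not in {sorted(allowed)}"
            )
    return problems
```

An empty list means every listed parser uses an engine named in this README; it does not guarantee that the engine's dependencies are installed.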

❓ Q&A

  • Error: "Resource xxx not found. Please use the NLTK Downloader to obtain the resource". To fix it, download the missing NLTK data:

      import nltk
      nltk.download('all') # you can directly download all resources
    
  • Error: "Resource wordnet not found." Try the following steps:

    • Uninstall nltk completely: pip uninstall nltk
    • Reinstall nltk: pip install nltk
    • Download corpora/wordnet.zip manually and unzip it to the directory specified in the error message. Alternatively, download and unpack it with the following commands:
      • Windows:
          wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip
          unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\
      • Unix:
          wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip
          unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/
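As a lighter alternative to downloading every NLTK resource, a single resource can be fetched only when it is missing. The helper below is a sketch: `ensure_resource` is a hypothetical name, and the finder and downloader are passed in so the logic does not depend on NLTK being importable. With NLTK, `nltk.data.find` raises `LookupError` for a missing resource and `nltk.download` fetches a package by name.

```python
def ensure_resource(find, download, resource_path, package):
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True when a download was triggered, False when the resource
    was already present.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True
```

With NLTK installed, a call might look like `ensure_resource(nltk.data.find, nltk.download, "corpora/wordnet", "wordnet")` after `import nltk`.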

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

📧 Contact

You can scan the QR code below to join our WeChat group:

[Image: WeChat group QR code]

For any questions or inquiries, please open an issue on GitHub or contact us at team@wisup.ai.

🌟 Contributing
