Skip to main content

Everything to Markdown.

Project description

wisup_e2m Logo

License E2M Repo E2M Version Python Version PyPI 中文文档

🚀 E2M: Everything to Markdown

Everything to Markdown

E2M is a Python library that can parse and convert various file types into Markdown format. By utilizing a parser-converter architecture, it supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.

✨The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning.

Core Architecture of the Project:

  • Parser: Responsible for parsing various file types into text or image data.
  • Converter: Responsible for converting text or image data into Markdown format.

Generally, for any type of file, the parser is run first to extract internal data such as text and images. Then, the converter is used to transform this data into Markdown format.

wisup_e2m Logo

📹 Video Introduction

📂 All Converters and Parsers

Parser
Parser Type Engine Supported File Type
PdfParser surya_layout, marker, unstructured pdf
DocParser pandoc, xml doc
DocxParser pandoc, xml docx
PptParser unstructured ppt
PptxParser unstructured pptx
UrlParser unstructured, jina, firecrawl url
EpubParser unstructured epub
HtmlParser unstructured html, htm
VoiceParser openai_whisper_api, openai_whisper_local, SpeechRecognition mp3, m4a
Converter
Converter Type Engine Strategy
ImageConverter litellm, zhipuai (Not Well in Image Recognition, Not Recommended) default
TextConverter litellm, zhipuai default

📦 Installation

Create Environment:

conda create -n e2m python=3.10
conda activate e2m

Update pip:

pip install --upgrade pip

Install E2M using pip:

# Option 1: Install via git, most recommended
pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple
# Option 2: Install via pip
pip install wisup_e2m
# Option 3: Manual installation
git clone https://github.com/wisupai/e2m.git
cd e2m
pip install poetry
poetry build
pip install dist/wisup_e2m-0.1.61-py3-none-any.whl

⚡️ Parser Quick Start

Here's simple examples demonstrating how to use E2M Parsers:

📄 Pdf Parser

from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker") # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)

📝 Doc Parser

from wisup_e2m import DocParser

doc_path = "./test.doc"
parser = DocParser(engine="pandoc") # doc engines: pandoc, xml
doc_data = parser.parse(doc_path)
print(doc_data.text)

📜 Docx Parser

from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="pandoc") # docx engines: pandoc, xml
docx_data = parser.parse(docx_path)
print(docx_data.text)

📚 Epub Parser

from wisup_e2m import EpubParser

epub_path = "./test.epub"
parser = EpubParser(engine="unstructured") # epub engines: unstructured
epub_data = parser.parse(epub_path)
print(epub_data.text)

🌐 Html Parser

from wisup_e2m import HtmlParser

html_path = "./test.html"
parser = HtmlParser(engine="unstructured") # html engines: unstructured
html_data = parser.parse(html_path)
print(html_data.text)

🔗 Url Parser

from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina") # url engines: jina, firecrawl, unstructured
url_data = parser.parse(url)
print(url_data.text)

🖼️ Ppt Parser

from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured") # ppt engines: unstructured
ppt_data = parser.parse(ppt_path)
print(ppt_data.text)

🖼️ Pptx Parser

from wisup_e2m import PptxParser

pptx_path = "./test.pptx"
parser = PptxParser(engine="unstructured") # pptx engines: unstructured
pptx_data = parser.parse(pptx_path)
print(pptx_data.text)

🎤 Voice Parser

from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
  engine="openai_whisper_local", # voice engines: openai_whisper_api, openai_whisper_local
  model="large" # available models: https://github.com/openai/whisper#available-models-and-languages
  )

voice_data = parser.parse(voice_path)
print(voice_data.text)

🔄 Converter Quick Start

Here's simple examples demonstrating how to use E2M Converters:

📝 Text Converter

from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
  engine="litellm", # text engines: litellm
  model="deepseek/deepseek-chat",
  api_key="your api key",
  base_url="your base url"
  )
text_data = converter.convert(text)
print(text_data)

🖼️ Image Converter

from wisup_e2m import ImageConverter

images = ["./test1.png", "./test2.png"]
converter = ImageConverter(
  engine="litellm", # image engines: litellm
  model="gpt-4o",
  api_key="your api key",
  base_url="your base url"
  )
image_data = converter.convert(image_path)
print(image_data)

🆙 Next Level

🛠️ E2MParser

E2MParser is an integrated parser that supports multiple file types. It can be used to parse a wide range of file types into Markdown format.

from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())

🛠️ E2MConverter

E2MConverter is an integrated converter that supports text and image conversion. It can be used to convert text and images into Markdown format.

from wisup_e2m import E2MConverter

ec = E2MConverter.from_config("./config.yaml")

text = "Parsed text data from any parser"

ec.convert(text=text)

images = ["test.jpg", "test.png"]
ec.convert(images=images)

You can use a config.yaml file to specify the parsers and converters you want to use. Here is an example of a config.yaml file:

parsers:
    doc_parser:
        engine: "pandoc"
        langs: ["en", "zh"]
    docx_parser:
        engine: "pandoc"
        langs: ["en", "zh"]
    epub_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    html_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    url_parser:
        engine: "jina"
        langs: ["en", "zh"]
    pdf_parser:
        engine: "marker"
        langs: ["en", "zh"]
    pptx_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    voice_parser:
        # option 1: use openai whisper api
        # engine: "openai_whisper_api"
        # api_base: "https://api.openai.com/v1"
        # api_key: "your_api_key"
        # model: "whisper"

        # option 2: use local whisper model
        engine: "openai_whisper_local"
        model: "large" # available models: https://github.com/openai/whisper#available-models-and-languages

converters:
    text_converter:
        engine: "litellm"
        model: "deepseek/deepseek-chat"
        api_key: "your_api_key"
        # base_url: ""
    image_converter:
        engine: "litellm"
        model: "gpt-4o-mini"
        api_key: "your_api_key"
        # base_url: ""

❓ Q&A

FAQ Document

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

📧 Contact

You can scan the QR code below to join our WeChat group:

wisup_e2m Logo

For any questions or inquiries, please open an issue on GitHub or contact us at team@wisup.ai.

Contact for business cooperation: team@wisup.ai

💼 Join Us

wisup_e2m Logo

  • Wisup is an AI startup with a strong focus on data and algorithms. We specialize in providing high-quality data and algorithm services for enterprises. We embrace a remote working model and welcome talented individuals from around the world to join us.

  • Our philosophy: From information to data, from data to knowledge, from knowledge to value.

  • Our vision: To make the world a better place through data.

  • We are looking for: Like-minded Co-Founders

    • No restrictions on education, age, location, race, or gender
    • Keen interest in AI and familiarity with AI and related vertical industries
    • Passionate about AI and data, with a strong sense of purpose
    • Possess unique strengths, responsibility, and a team-oriented mindset
  • To apply, send your resume to: team@wisup.ai

  • You also need to answer three questions in your email:

    • What makes you irreplaceable?
    • What is the most challenging situation you have faced, and how did you resolve it?
    • How do you view the future development of AI?

🌟 Contributing

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wisup_e2m-0.1.61.tar.gz (49.9 kB view details)

Uploaded Source

Built Distribution

wisup_e2m-0.1.61-py3-none-any.whl (71.0 kB view details)

Uploaded Python 3

File details

Details for the file wisup_e2m-0.1.61.tar.gz.

File metadata

  • Download URL: wisup_e2m-0.1.61.tar.gz
  • Upload date:
  • Size: 49.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.61.tar.gz
Algorithm Hash digest
SHA256 3550aa7a82036b43199b96fbc728cda2ad6ef0a089e8e20becd2e107dd2e830f
MD5 08a950786bd82d74f53aeab533e7a7c7
BLAKE2b-256 e93df4d384be914d23c6fcd2045311c6f17b28875acc051cc4149de41050d6f4

See more details on using hashes here.

File details

Details for the file wisup_e2m-0.1.61-py3-none-any.whl.

File metadata

  • Download URL: wisup_e2m-0.1.61-py3-none-any.whl
  • Upload date:
  • Size: 71.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.61-py3-none-any.whl
Algorithm Hash digest
SHA256 4d9cbc4cf9b5d0d9e7b9fb6deeb4bd1677d7c5b54c01f4223847ae23914e207d
MD5 a7f4bb107d042fe8c6340eb71e71cb2b
BLAKE2b-256 d168b09e22208ff9e1dec5908c8a3d3e6f310b7aee75063ccb1fc245c382ac20

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page