Everything to Markdown.

🚀 E2M: Everything to Markdown

E2M is a Python library that can parse and convert various file types into Markdown format. By utilizing a parser-converter architecture, it supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.

✨The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning.

Core Architecture of the Project:

Parser: Responsible for parsing various file types into text or image data.
Converter: Responsible for converting text or image data into Markdown format.

Generally, for any type of file, the parser is run first to extract internal data such as text and images. Then, the converter is used to transform this data into Markdown format.
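The two-stage flow described above can be illustrated with a toy pipeline. This is a minimal sketch, not the E2M implementation: `ParsedData`, `ToyParser`, and `ToyConverter` are hypothetical names invented for illustration, and the "conversion" rule is deliberately trivial.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedData:
    """Hypothetical container for what a parser extracts."""
    text: str = ""
    images: list[str] = field(default_factory=list)


class ToyParser:
    """Stage 1: extract raw text without trying to format it."""

    def parse(self, raw: str) -> ParsedData:
        # Collapse runs of whitespace on each line and drop empty lines.
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return ParsedData(text="\n".join(line for line in lines if line))


class ToyConverter:
    """Stage 2: turn extracted text into Markdown."""

    def convert(self, data: ParsedData) -> str:
        # Toy rule: the first line becomes a heading, the rest become paragraphs.
        first, *rest = data.text.split("\n")
        return "\n\n".join([f"# {first}", *rest])


parsed = ToyParser().parse("  Report   title \n\n First   paragraph. ")
markdown = ToyConverter().convert(parsed)
print(markdown)
```

Keeping the stages separate means the parsed text can be used directly when Markdown formatting is not needed.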

📂 All Converters and Parsers

Parser
Parser Type	Engine	Supported File Type
PdfParser	surya_layout, marker, unstructured	pdf
DocParser	xml	doc
DocxParser	xml	docx
PptParser	unstructured	ppt
PptxParser	unstructured	pptx
UrlParser	unstructured, jina, firecrawl	url
EpubParser	unstructured	epub
HtmlParser	unstructured	html, htm
VoiceParser	openai_whisper_api, openai_whisper_local, SpeechRecognition	mp3, m4a
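As a quick reference, the parser table can be expressed as a lookup from file extension to parser class name and engines. This is only a sketch built from the table above: the `PARSERS` dictionary and the `parser_for` helper are hypothetical and are not part of the wisup_e2m package.

```python
from pathlib import Path

# Parser class names and engines, copied from the parser table.
PARSERS = {
    "pdf": ("PdfParser", ["surya_layout", "marker", "unstructured"]),
    "doc": ("DocParser", ["xml"]),
    "docx": ("DocxParser", ["xml"]),
    "ppt": ("PptParser", ["unstructured"]),
    "pptx": ("PptxParser", ["unstructured"]),
    "url": ("UrlParser", ["unstructured", "jina", "firecrawl"]),
    "epub": ("EpubParser", ["unstructured"]),
    "html": ("HtmlParser", ["unstructured"]),
    "htm": ("HtmlParser", ["unstructured"]),
    "mp3": ("VoiceParser", ["openai_whisper_api", "openai_whisper_local", "SpeechRecognition"]),
    "m4a": ("VoiceParser", ["openai_whisper_api", "openai_whisper_local", "SpeechRecognition"]),
}


def parser_for(source: str) -> tuple[str, list[str]]:
    """Return (parser class name, supported engines) for a path or URL."""
    if source.startswith(("http://", "https://")):
        return PARSERS["url"]
    ext = Path(source).suffix.lower().lstrip(".")
    if ext not in PARSERS:
        raise ValueError(f"No parser listed for extension {ext!r}")
    return PARSERS[ext]


print(parser_for("./test.pdf"))
print(parser_for("https://www.example.com"))
```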

Converter
Converter Type	Engine	Strategy
ImageConverter	litellm, zhipuai (weak at image recognition; not recommended)	default
TextConverter	litellm, zhipuai	default

📦 Installation

Create Environment:

conda create -n e2m python=3.10
conda activate e2m

Update pip:

pip install --upgrade pip

Install E2M using pip:

# Option 1: Install via pip
pip install wisup_e2m
# Option 2: Install via git
pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple
# Option 3: Manual installation
git clone https://github.com/wisupai/e2m.git
cd e2m
pip install poetry
poetry build
pip install dist/wisup_e2m-0.1.56-py3-none-any.whl
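After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `wisup_e2m` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("wisup_e2m")
print(found if found else "wisup_e2m is not installed in this environment")
```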

⚡️ Parser Quick Start

Here are simple examples demonstrating how to use E2M Parsers:

📄 Pdf Parser

from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker") # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)

📝 Doc Parser

from wisup_e2m import DocParser

doc_path = "./test.doc"
parser = DocParser(engine="xml") # doc engines: xml
doc_data = parser.parse(doc_path)
print(doc_data.text)

📜 Docx Parser

from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="xml") # docx engines: xml
docx_data = parser.parse(docx_path)
print(docx_data.text)

📚 Epub Parser

from wisup_e2m import EpubParser

epub_path = "./test.epub"
parser = EpubParser(engine="unstructured") # epub engines: unstructured
epub_data = parser.parse(epub_path)
print(epub_data.text)

🌐 Html Parser

from wisup_e2m import HtmlParser

html_path = "./test.html"
parser = HtmlParser(engine="unstructured") # html engines: unstructured
html_data = parser.parse(html_path)
print(html_data.text)

🔗 Url Parser

from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina") # url engines: jina, firecrawl, unstructured
url_data = parser.parse(url)
print(url_data.text)

🖼️ Ppt Parser

from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured") # ppt engines: unstructured
ppt_data = parser.parse(ppt_path)
print(ppt_data.text)

🖼️ Pptx Parser

from wisup_e2m import PptxParser

pptx_path = "./test.pptx"
parser = PptxParser(engine="unstructured") # pptx engines: unstructured
pptx_data = parser.parse(pptx_path)
print(pptx_data.text)

🎤 Voice Parser

from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
  engine="openai_whisper_local", # voice engines: openai_whisper_api, openai_whisper_local
  model="large" # available models: https://github.com/openai/whisper#available-models-and-languages
  )

voice_data = parser.parse(voice_path)
print(voice_data.text)

🔄 Converter Quick Start

Here are simple examples demonstrating how to use E2M Converters:

📝 Text Converter

from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
  engine="litellm", # text engines: litellm
  model="deepseek/deepseek-chat",
  api_key="your api key",
  base_url="your base url"
  )
text_data = converter.convert(text)
print(text_data)

🖼️ Image Converter

from wisup_e2m import ImageConverter

images = ["./test1.png", "./test2.png"]
converter = ImageConverter(
  engine="litellm", # image engines: litellm
  model="gpt-4o",
  api_key="your api key",
  base_url="your base url"
  )
image_data = converter.convert(images=images)
print(image_data)

🆙 Next Level

🛠️ E2MParser

E2MParser is an integrated parser that supports multiple file types. It can be used to parse a wide range of file types into Markdown format.

from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())

🛠️ E2MConverter

E2MConverter is an integrated converter that supports text and image conversion. It can be used to convert text and images into Markdown format.

from wisup_e2m import E2MConverter

ec = E2MConverter.from_config("./config.yaml")

text = "Parsed text data from any parser"

ec.convert(text=text)

images = ["test.jpg", "test.png"]
ec.convert(images=images)

You can use a config.yaml file to specify the parsers and converters you want to use. Here is an example of a config.yaml file:

parsers:
    doc_parser:
        engine: "xml"
        langs: ["en", "zh"]
    docx_parser:
        engine: "xml"
        langs: ["en", "zh"]
    epub_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    html_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    url_parser:
        engine: "jina"
        langs: ["en", "zh"]
    pdf_parser:
        engine: "marker"
        langs: ["en", "zh"]
    pptx_parser:
        engine: "unstructured"
        langs: ["en", "zh"]
    voice_parser:
        # option 1: use openai whisper api
        # engine: "openai_whisper_api"
        # api_base: "https://api.openai.com/v1"
        # api_key: "your_api_key"
        # model: "whisper"

        # option 2: use local whisper model
        engine: "openai_whisper_local"
        model: "large" # available models: https://github.com/openai/whisper#available-models-and-languages

converters:
    text_converter:
        engine: "litellm"
        model: "deepseek/deepseek-chat"
        api_key: "your_api_key"
        # base_url: ""
    image_converter:
        engine: "litellm"
        model: "gpt-4o-mini"
        api_key: "your_api_key"
        # base_url: ""
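Because engine names are plain strings, a typo in config.yaml is easy to miss. The sketch below checks a configuration, written here as a Python dictionary that mirrors the YAML structure, against the engines listed in the parser and converter tables. `KNOWN_ENGINES` and `check_engines` are hypothetical names for illustration, not part of wisup_e2m, and component names that do not appear in the example config are simply skipped.

```python
# Engines listed in the parser and converter tables of this README,
# keyed by the component names used in the example config.yaml.
KNOWN_ENGINES = {
    "doc_parser": {"xml"},
    "docx_parser": {"xml"},
    "epub_parser": {"unstructured"},
    "html_parser": {"unstructured"},
    "url_parser": {"unstructured", "jina", "firecrawl"},
    "pdf_parser": {"surya_layout", "marker", "unstructured"},
    "pptx_parser": {"unstructured"},
    "voice_parser": {"openai_whisper_api", "openai_whisper_local", "SpeechRecognition"},
    "text_converter": {"litellm", "zhipuai"},
    "image_converter": {"litellm", "zhipuai"},
}


def check_engines(config: dict) -> list[str]:
    """Return one message per component whose engine is not in the tables."""
    problems = []
    for section in ("parsers", "converters"):
        for name, options in (config.get(section) or {}).items():
            allowed = KNOWN_ENGINES.get(name)
            engine = (options or {}).get("engine")
            # Components missing from KNOWN_ENGINES are not checked.
            if allowed is not None and engine not in allowed:
                problems.append(f"{section}.{name}: unsupported engine {engine!r}")
    return problems


config = {
    "parsers": {"pdf_parser": {"engine": "marker", "langs": ["en", "zh"]}},
    "converters": {"text_converter": {"engine": "litelm"}},  # deliberate typo
}
print(check_engines(config))
```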

❓ Q&A

Why set up parsers and converters instead of directly converting to Markdown in one step?
- The core purpose of a parser is to extract data such as text and images without heavily processing it. In some projects, like knowledge bases, not all files need to be converted into Markdown. If the extracted text and image content already meets basic retrieval-augmented generation (RAG) needs, there is no need to incur additional costs on format conversion.
- Based on the extracted images and text, converters can further refine and format the data to make it more suitable for RAG and for model training or fine-tuning.

Why does the PdfParser produce poor Markdown text results?

The primary function of PdfParser is parsing, not directly converting to Markdown.
PdfParser supports three engines:
- marker: Inspired by the well-known marker project, it can convert directly to Markdown. However, because its results on complex text are suboptimal, it is offered only as a parser engine rather than as a final conversion step.
- unstructured: The parsed output is raw text with almost no formatting, recommended for use with well-structured PDFs.
- surya_layout: The output is not text but images with layout information marked, requiring conversion using the ImageConverter. If the ImageConverter uses multimodal models like gpt-4o, the conversion to Markdown yields the best results, comparable to some commercial conversion software.
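The differences between the three engines can be summarized as a small lookup. This is a sketch based only on the descriptions above: `PDF_ENGINE_OUTPUT` and `suggested_converter` are hypothetical names, and pairing the unstructured engine with TextConverter is an assumption drawn from the converter table, not something the library enforces.

```python
from typing import Optional

# What each PdfParser engine produces, per the descriptions above.
PDF_ENGINE_OUTPUT = {
    "marker": "markdown",
    "unstructured": "plain_text",
    "surya_layout": "layout_images",
}


def suggested_converter(engine: str) -> Optional[str]:
    """Suggest which converter to run after parsing with the given engine."""
    output = PDF_ENGINE_OUTPUT[engine]  # raises KeyError for an unknown engine
    if output == "layout_images":
        return "ImageConverter"  # layout-marked images need a multimodal model
    if output == "plain_text":
        return "TextConverter"  # assumption: raw text is formatted by an LLM
    return None  # marker output is already Markdown


for name in PDF_ENGINE_OUTPUT:
    print(name, "->", suggested_converter(name))
```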

Below is a code example that produces the best conversion results:

import os
from wisup_e2m import PdfParser, ImageConverter

work_dir = os.getcwd()  # Use the current directory as the working directory
image_dir = os.path.join(work_dir, "figure")

pdf = "./test.pdf"

# Load the parser
pdf_parser = PdfParser(engine="surya_layout")
# Load the converter
image_converter = ImageConverter(
    engine="litellm",
    api_key="<your API key>",  # Replace with your API key
    model="gpt-4o",
    base_url="<your base URL>",  # If using a model proxy, provide the base URL
    caching=True,
    cache_type="disk-cache",
)

# Parse the PDF into images
pdf_data = pdf_parser.parse(
    pdf,
    start_page=0,  # Starting page number
    end_page=20,  # Ending page number
    work_dir=work_dir,
    image_dir=image_dir,  # Directory to save extracted images
    relative_path=True,  # Whether the image paths are relative to work_dir
)

# Convert images to text using ImageConverter
md_text = image_converter.convert(
    images=pdf_data.images,
    attached_images_map=pdf_data.attached_images_map,
    work_dir=work_dir,  # Image addresses in Markdown will be relative to work_dir; the default is absolute paths
)

# Save test Markdown
with open("test.md", "w", encoding="utf-8") as f:
    f.write(md_text)

Unable to connect to 'https://huggingface.co'
- Method 1: Try accessing through a VPN or proxy.
- Method 2: Use a mirror in your code:
```
import os
os.environ['CURL_CA_BUNDLE'] = ''
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
```
- Method 3: Set environment variables in the terminal:
```
export CURL_CA_BUNDLE=''
export HF_ENDPOINT='https://hf-mirror.com'
```

Resource xxx not found. Please use the NLTK Downloader to obtain the resource:

  import nltk
  nltk.download('all') # you can directly download all resources

Resource wordnet not found.
- Uninstall nltk completely: pip uninstall nltk
- Reinstall nltk with the following command: pip install nltk
- Download corpora/wordnet.zip manually and unzip it to the directory specified in the error message. Otherwise, you can download it using the following commands:
  - Windows: run wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip, then unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\
  - Unix: run wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip, then unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/
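If wget or unzip is not available, the same download-and-extract step can be sketched with the Python standard library. The helper names `unpack_corpus` and `fetch_wordnet` are hypothetical; the URL is the one given above, and the target directory should be whichever one the NLTK error message names.

```python
import urllib.request
import zipfile
from pathlib import Path

WORDNET_URL = (
    "https://raw.githubusercontent.com/nltk/nltk_data/"
    "gh-pages/packages/corpora/wordnet.zip"
)


def unpack_corpus(zip_path: Path, corpora_dir: Path) -> None:
    """Extract a corpus archive into the corpora directory."""
    corpora_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(corpora_dir)


def fetch_wordnet(corpora_dir: Path) -> Path:
    """Download wordnet.zip into corpora_dir and unpack it there."""
    corpora_dir.mkdir(parents=True, exist_ok=True)
    zip_path = corpora_dir / "wordnet.zip"
    urllib.request.urlretrieve(WORDNET_URL, zip_path)
    unpack_corpus(zip_path, corpora_dir)
    return corpora_dir / "wordnet"
```

For example, calling `fetch_wordnet(Path.home() / "nltk_data" / "corpora")` targets the Unix location shown above; on Windows, pass the directory from the error message instead.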

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

📧 Contact

For any questions or inquiries, please open an issue on GitHub or contact us at team@wisup.ai.

Release history

Version	Release date
0.1.61	Aug 30, 2024
0.1.61b0 (pre-release)	Aug 30, 2024
0.1.57	Aug 27, 2024
0.1.56.1	Aug 27, 2024
0.1.56 (this version)	Aug 27, 2024
0.1.55	Aug 26, 2024
0.1.54	Aug 24, 2024
0.1.53	Aug 24, 2024
0.1.52	Aug 24, 2024
0.1.51	Aug 24, 2024
0.1.41	Aug 18, 2024
0.1.34	Aug 18, 2024
0.1.33	Aug 18, 2024
0.1.32	Aug 18, 2024
0.1.31	Aug 15, 2024
0.1.23	Aug 9, 2024
0.1.22	Aug 9, 2024
0.1.6	Aug 28, 2024
0.1.5	Aug 24, 2024
0.1.4	Aug 18, 2024
0.1.3	Aug 15, 2024
0.1.1	Aug 5, 2024
0.1.0	Aug 4, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wisup_e2m-0.1.56.tar.gz (47.5 kB view details)

Uploaded Aug 27, 2024 Source

Built Distribution

wisup_e2m-0.1.56-py3-none-any.whl (67.7 kB view details)

Uploaded Aug 27, 2024 Python 3

File details

Details for the file wisup_e2m-0.1.56.tar.gz.

File metadata

Download URL: wisup_e2m-0.1.56.tar.gz
Upload date: Aug 27, 2024
Size: 47.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.56.tar.gz
Algorithm	Hash digest
SHA256	`21fa5c802a55ab3215673e48a586439fc7d6d407e477ca4f2bb2345e98ff86fc`
MD5	`9e9252551bfe84d6bdd07d1602813947`
BLAKE2b-256	`3dc05ceb7bbcde074581b2ade0a7d5ed583af68db283ec34cf0ad06a9d237d82`

See more details on using hashes here.
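A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `matches_expected` are hypothetical helper names, and the expected digest is the one shown in the table for wisup_e2m-0.1.56.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for wisup_e2m-0.1.56.tar.gz.
EXPECTED_SHA256 = "21fa5c802a55ab3215673e48a586439fc7d6d407e477ca4f2bb2345e98ff86fc"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's digest equals the expected digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected(Path("wisup_e2m-0.1.56.tar.gz"))` should return True for an unmodified download of the source distribution.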

File details

Details for the file wisup_e2m-0.1.56-py3-none-any.whl.

File metadata

Download URL: wisup_e2m-0.1.56-py3-none-any.whl
Upload date: Aug 27, 2024
Size: 67.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.10.14 Darwin/23.6.0

File hashes

Hashes for wisup_e2m-0.1.56-py3-none-any.whl
Algorithm	Hash digest
SHA256	`01c2dc9624bcad87e90f371cd81b80f8ac9b11d5932b7017af10b75dab9d1663`
MD5	`cd1a908b111fd9b7499baad5540cdafc`
BLAKE2b-256	`aa41af320dee8091ddd762b300e3cb04834d6bbe0f3a872b4782da8fcd114eae`

See more details on using hashes here.
