17 Commits

Author SHA1 Message Date
Meirna
8a9d8f1593 feat: add checkbox support to Markdown converter (#1208)
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
2025-08-26 15:30:47 -07:00
Richard Ye
17365654c9 Handle PPTX shapes where position is None (#1161)
* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front
2025-08-26 15:28:17 -07:00
Yuzhong Zhang
59eb60f8cb fix docx parse error(\n in alt) (#1163) 2025-08-26 15:20:17 -07:00
Dmitry
459d462f29 docs: correct minor typos (#1173) 2025-08-26 15:15:23 -07:00
Noah Zhu
c3f6cb356c Adding support for data-src Attribute (#1226)
* supportfordata-src
2025-08-26 15:11:53 -07:00
Ebrahim Tayabali
0c4d3945a0 Update README.md (#1191)
Fix: Subtle spelling mistake fixed.
2025-08-26 15:07:27 -07:00
Utkarsh kumar
f8b60b5403 Update README.md (#1350)
ISSUE #1339
2025-08-26 15:02:56 -07:00
[W]DOS_
16ca285d30 Update README.md (#1335)
Fix typo in README.md
2025-08-26 14:55:58 -07:00
Stefan Rink
b81a387616 fix: correctly pass custom llm prompt parameter (#1319)
* fix: correctly pass custom llm prompt parameter
2025-08-26 14:51:10 -07:00
safen0s
ea1a3dfb60 Add HTML support to DocumentIntelligenceConverter (#1352) 2025-08-26 14:34:43 -07:00
dependabot[bot]
b6e5da8874 Bump actions/checkout from 4 to 5 (#1394)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
2025-08-26 14:27:38 -07:00
t3tra
fb1ad24833 Ensure safe ExifTool usage: require >= 12.24 (#1399)
* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
2025-08-26 14:25:13 -07:00
JonahDelman
1178c2e211 Fixed documentation typos in _base_converter.py (#1393) 2025-08-26 14:23:10 -07:00
afourney
9278119bb3 Resolved an issue with linked images in docx [mammoth] (#1405) 2025-08-26 14:20:29 -07:00
onefloid
da7bcea527 docs: rephrase sentence (#1278) 2025-06-03 21:09:25 -07:00
afourney
3bfb821c09 Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273)
* Have the MarkItdown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ

* Update the Dockerfile to enable plugins. No puglins are installed by default.
2025-06-03 09:35:33 -07:00
Tomasz Kalinowski
62b72284fe pin onnxruntime on Windows (#1274)
closes #1266
2025-05-28 13:13:51 -07:00
17 changed files with 172 additions and 20 deletions

View File

@@ -5,7 +5,7 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- name: Set up Python
uses: actions/setup-python@v5
with:

View File

@@ -5,7 +5,7 @@ jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: |

View File

@@ -15,7 +15,7 @@
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
At present, MarkItDown supports:
MarkItDown currently supports the conversion from:
- PDF
- PowerPoint
@@ -164,14 +164,14 @@ result = md.convert("test.pdf")
print(result.text_content)
```
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
```
@@ -199,7 +199,7 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
### How to Contribute
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.
<div align="center">

View File

@@ -3,8 +3,10 @@ FROM python:3.13-slim-bullseye
ENV DEBIAN_FRONTEND=noninteractive
ENV EXIFTOOL_PATH=/usr/bin/exiftool
ENV FFMPEG_PATH=/usr/bin/ffmpeg
ENV MARKITDOWN_ENABLE_PLUGINS=True
# Runtime dependency
# NOTE: Add any additional MarkItDown plugins here
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
exiftool

View File

@@ -54,7 +54,7 @@ Once mounted, all files under data will be accessible under `/workdir` in the co
It is recommended to use the Docker image when running the MCP server for Claude Desktop.
Follow [these instrutions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
Follow [these instructions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
Edit it to include the following JSON entry:
@@ -102,7 +102,7 @@ To debug the MCP server you can use the `mcpinspector` tool.
npx @modelcontextprotocol/inspector
```
You can then connect to the insepctor through the specified host and port (e.g., `http://localhost:5173/`).
You can then connect to the inspector through the specified host and port (e.g., `http://localhost:5173/`).
If using STDIO:
* select `STDIO` as the transport type,
@@ -127,8 +127,7 @@ Finally:
## Security Considerations
The server does not support authentication, and runs with the privileges if the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
## Trademarks

View File

@@ -1,5 +1,6 @@
import contextlib
import sys
import os
from collections.abc import AsyncIterator
from mcp.server.fastmcp import FastMCP
from starlette.applications import Starlette
@@ -19,7 +20,15 @@ mcp = FastMCP("markitdown")
@mcp.tool()
async def convert_to_markdown(uri: str) -> str:
"""Convert a resource described by an http:, https:, file: or data: URI to markdown"""
return MarkItDown().convert_uri(uri).markdown
return MarkItDown(enable_plugins=check_plugins_enabled()).convert_uri(uri).markdown
def check_plugins_enabled() -> bool:
return os.getenv("MARKITDOWN_ENABLE_PLUGINS", "false").strip().lower() in (
"true",
"1",
"yes",
)
def create_starlette_app(mcp_server: Server, *, debug: bool = False) -> Starlette:

View File

@@ -30,12 +30,13 @@ dependencies = [
"magika~=0.6.1",
"charset-normalizer",
"defusedxml",
"onnxruntime<=1.20.1; sys_platform == 'win32'",
]
[project.optional-dependencies]
all = [
"python-pptx",
"mammoth",
"mammoth~=1.10.0",
"pandas",
"openpyxl",
"xlrd",

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.1.2"
__version__ = "0.1.3"

View File

@@ -69,7 +69,7 @@ class DocumentConverter:
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
file_stream.seek(cur_pos) # Reset the position to the original position
Prameters:
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
@@ -90,7 +90,7 @@ class DocumentConverter:
"""
Convert a document to Markdown text.
Prameters:
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.

View File

@@ -115,6 +115,7 @@ class MarkItDown:
# TODO - remove these (see enable_builtins)
self._llm_client: Any = None
self._llm_model: Union[str | None] = None
self._llm_prompt: Union[str | None] = None
self._exiftool_path: Union[str | None] = None
self._style_map: Union[str | None] = None
@@ -139,6 +140,7 @@ class MarkItDown:
# TODO: Move these into converter constructors
self._llm_client = kwargs.get("llm_client")
self._llm_model = kwargs.get("llm_model")
self._llm_prompt = kwargs.get("llm_prompt")
self._exiftool_path = kwargs.get("exiftool_path")
self._style_map = kwargs.get("style_map")
@@ -559,6 +561,9 @@ class MarkItDown:
if "llm_model" not in _kwargs and self._llm_model is not None:
_kwargs["llm_model"] = self._llm_model
if "llm_prompt" not in _kwargs and self._llm_prompt is not None:
_kwargs["llm_prompt"] = self._llm_prompt
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map

View File

@@ -84,6 +84,9 @@ def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -> List[s
prefixes.append(
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
elif type_ == DocumentIntelligenceFileType.HTML:
prefixes.append("text/html")
prefixes.append("application/xhtml+xml")
elif type_ == DocumentIntelligenceFileType.PDF:
prefixes.append("application/pdf")
prefixes.append("application/x-pdf")
@@ -119,6 +122,8 @@ def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> List[str]
extensions.append(".bmp")
elif type_ == DocumentIntelligenceFileType.TIFF:
extensions.append(".tiff")
elif type_ == DocumentIntelligenceFileType.HTML:
extensions.append(".html")
return extensions

View File

@@ -1,4 +1,6 @@
import sys
import io
from warnings import warn
from typing import BinaryIO, Any
@@ -13,6 +15,14 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
_dependency_exc_info = None
try:
import mammoth
import mammoth.docx.files
def mammoth_files_open(self, uri):
warn("DOCX: processing of r:link resources (e.g., linked images) is disabled.")
return io.BytesIO(b"")
mammoth.docx.files.Files.open = mammoth_files_open
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()

View File

@@ -1,7 +1,11 @@
import json
import subprocess
import locale
from typing import BinaryIO, Any, Union
import subprocess
from typing import Any, BinaryIO, Union
def _parse_version(version: str) -> tuple:
return tuple(map(int, (version.split("."))))
def exiftool_metadata(
@@ -13,6 +17,24 @@ def exiftool_metadata(
if not exiftool_path:
return {}
# Verify exiftool version
try:
version_output = subprocess.run(
[exiftool_path, "-ver"],
capture_output=True,
text=True,
check=True,
).stdout.strip()
version = _parse_version(version_output)
min_version = (12, 24)
if version < min_version:
raise RuntimeError(
f"ExifTool version {version_output} is vulnerable to CVE-2021-22204. "
"Please upgrade to version 12.24 or later."
)
except (subprocess.CalledProcessError, ValueError) as e:
raise RuntimeError("Failed to verify ExifTool version.") from e
# Run exiftool
cur_pos = file_stream.tell()
try:

View File

@@ -92,9 +92,11 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
"""Same as usual converter, but removes data URIs"""
alt = el.attrs.get("alt", None) or ""
src = el.attrs.get("src", None) or ""
src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
title = el.attrs.get("title", None) or ""
title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
# Remove all line breaks from alt
alt = alt.replace("\n", " ")
if (
convert_as_inline
and el.parent.name not in self.options["keep_inline_images_in"]
@@ -107,5 +109,18 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
return "![%s](%s%s)" % (alt, src, title_part)
def convert_input(
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Convert checkboxes to Markdown [x]/[ ] syntax."""
if el.get("type") == "checkbox":
return "[x] " if el.has_attr("checked") else "[ ] "
return ""
def convert_soup(self, soup: Any) -> str:
return super().convert_soup(soup) # type: ignore

View File

@@ -168,11 +168,23 @@ class PptxConverter(DocumentConverter):
# Group Shapes
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
sorted_shapes = sorted(
shape.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for subshape in sorted_shapes:
get_shape_content(subshape, **kwargs)
sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
sorted_shapes = sorted(
slide.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for shape in sorted_shapes:
get_shape_content(shape, **kwargs)

View File

@@ -0,0 +1,26 @@
import io
from markitdown.converters._doc_intel_converter import (
DocumentIntelligenceConverter,
DocumentIntelligenceFileType,
)
from markitdown._stream_info import StreamInfo
def _make_converter(file_types):
conv = DocumentIntelligenceConverter.__new__(DocumentIntelligenceConverter)
conv._file_types = file_types
return conv
def test_docintel_accepts_html_extension():
conv = _make_converter([DocumentIntelligenceFileType.HTML])
stream_info = StreamInfo(mimetype=None, extension=".html")
assert conv.accepts(io.BytesIO(b""), stream_info)
def test_docintel_accepts_html_mimetype():
conv = _make_converter([DocumentIntelligenceFileType.HTML])
stream_info = StreamInfo(mimetype="text/html", extension=None)
assert conv.accepts(io.BytesIO(b""), stream_info)
stream_info = StreamInfo(mimetype="application/xhtml+xml", extension=None)
assert conv.accepts(io.BytesIO(b""), stream_info)

View File

@@ -4,6 +4,7 @@ import os
import re
import shutil
import pytest
from unittest.mock import MagicMock
from markitdown._uri_utils import parse_data_uri, file_uri_to_path
@@ -370,6 +371,50 @@ def test_markitdown_exiftool() -> None:
assert target in result.text_content
def test_markitdown_llm_parameters() -> None:
"""Test that LLM parameters are correctly passed to the client."""
mock_client = MagicMock()
mock_response = MagicMock()
mock_response.choices = [
MagicMock(
message=MagicMock(
content="Test caption with red circle and blue square 5bda1dd6"
)
)
]
mock_client.chat.completions.create.return_value = mock_response
test_prompt = "You are a professional test prompt."
markitdown = MarkItDown(
llm_client=mock_client, llm_model="gpt-4o", llm_prompt=test_prompt
)
# Test image file
markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
# Verify the prompt was passed to the OpenAI API
assert mock_client.chat.completions.create.called
call_args = mock_client.chat.completions.create.call_args
messages = call_args[1]["messages"]
assert len(messages) == 1
assert messages[0]["content"][0]["text"] == test_prompt
# Reset the mock for the next test
mock_client.chat.completions.create.reset_mock()
# TODO: may only use one test after the llm caption method duplicate has been removed:
# https://github.com/microsoft/markitdown/pull/1254
# Test PPTX file
markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
# Verify the prompt was passed to the OpenAI API for PPTX images too
assert mock_client.chat.completions.create.called
call_args = mock_client.chat.completions.create.call_args
messages = call_args[1]["messages"]
assert len(messages) == 1
assert messages[0]["content"][0]["text"] == test_prompt
@pytest.mark.skipif(
skip_llm,
reason="do not run llm tests without a key",
@@ -408,6 +453,7 @@ if __name__ == "__main__":
test_speech_transcription,
test_exceptions,
test_markitdown_exiftool,
test_markitdown_llm_parameters,
test_markitdown_llm,
]:
print(f"Running {test.__name__}...", end="")