20 Commits

Author SHA1 Message Date
dependabot[bot]
3d3da11ffe Bump actions/setup-python from 5 to 6
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-09-08 15:37:33 +00:00
Meirna
8a9d8f1593 feat: add checkbox support to Markdown converter (#1208)
This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
2025-08-26 15:30:47 -07:00
Richard Ye
17365654c9 Handle PPTX shapes where position is None (#1161)
* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front
2025-08-26 15:28:17 -07:00
Yuzhong Zhang
59eb60f8cb fix docx parse error(\n in alt) (#1163) 2025-08-26 15:20:17 -07:00
Dmitry
459d462f29 docs: correct minor typos (#1173) 2025-08-26 15:15:23 -07:00
Noah Zhu
c3f6cb356c Adding support for data-src Attribute (#1226)
* supportfordata-src
2025-08-26 15:11:53 -07:00
Ebrahim Tayabali
0c4d3945a0 Update README.md (#1191)
Fix: Subtle spelling mistake fixed.
2025-08-26 15:07:27 -07:00
Utkarsh kumar
f8b60b5403 Update README.md (#1350)
ISSUE #1339
2025-08-26 15:02:56 -07:00
[W]DOS_
16ca285d30 Update README.md (#1335)
Fix typo in README.md
2025-08-26 14:55:58 -07:00
Stefan Rink
b81a387616 fix: correctly pass custom llm prompt parameter (#1319)
* fix: correctly pass custom llm prompt parameter
2025-08-26 14:51:10 -07:00
safen0s
ea1a3dfb60 Add HTML support to DocumentIntelligenceConverter (#1352) 2025-08-26 14:34:43 -07:00
dependabot[bot]
b6e5da8874 Bump actions/checkout from 4 to 5 (#1394)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
2025-08-26 14:27:38 -07:00
t3tra
fb1ad24833 Ensure safe ExifTool usage: require >= 12.24 (#1399)
* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------
2025-08-26 14:25:13 -07:00
JonahDelman
1178c2e211 Fixed documentation typos in _base_converter.py (#1393) 2025-08-26 14:23:10 -07:00
afourney
9278119bb3 Resolved an issue with linked images in docx [mammoth] (#1405) 2025-08-26 14:20:29 -07:00
onefloid
da7bcea527 docs: rephrase sentence (#1278) 2025-06-03 21:09:25 -07:00
afourney
3bfb821c09 Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273)
* Have the MarkItdown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ

* Update the Dockerfile to enable plugins. No puglins are installed by default.
2025-06-03 09:35:33 -07:00
Tomasz Kalinowski
62b72284fe pin onnxruntime on Windows (#1274)
closes #1266
2025-05-28 13:13:51 -07:00
afourney
1dd3c83339 Promoting 0.1.2a1 to 0.1.2 (#1272) 2025-05-28 10:04:42 -07:00
afourney
9dc982a3b1 Small changes to favor streamable HTTP over deprecated SSE (#1264) 2025-05-23 11:39:41 -07:00
17 changed files with 196 additions and 35 deletions

View File

@@ -5,9 +5,9 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: "3.x"

View File

@@ -5,8 +5,8 @@ jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
- uses: actions/checkout@v5
- uses: actions/setup-python@v6
with:
python-version: |
3.10

View File

@@ -15,7 +15,7 @@
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
At present, MarkItDown supports:
MarkItDown currently supports the conversion from:
- PDF
- PowerPoint
@@ -164,14 +164,14 @@ result = md.convert("test.pdf")
print(result.text_content)
```
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
```
@@ -199,7 +199,7 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
### How to Contribute
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.
<div align="center">

View File

@@ -3,8 +3,10 @@ FROM python:3.13-slim-bullseye
ENV DEBIAN_FRONTEND=noninteractive
ENV EXIFTOOL_PATH=/usr/bin/exiftool
ENV FFMPEG_PATH=/usr/bin/ffmpeg
ENV MARKITDOWN_ENABLE_PLUGINS=True
# Runtime dependency
# NOTE: Add any additional MarkItDown plugins here
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
exiftool

View File

@@ -4,7 +4,7 @@
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown-mcp)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
The `markitdown-mcp` package provides a lightweight STDIO, SSE and Streamable HTTP MCP server for calling MarkItDown.
The `markitdown-mcp` package provides a lightweight STDIO, Streamable HTTP, and SSE MCP server for calling MarkItDown.
It exposes one tool: `convert_to_markdown(uri)`, where uri can be any `http:`, `https:`, `file:`, or `data:` URI.
@@ -25,10 +25,10 @@ To run the MCP server, using STDIO (default) use the following command:
markitdown-mcp
```
To run the MCP server, using SSE or Streamable HTTP use the following command:
To run the MCP server, using Streamable HTTP and SSE use the following command:
```bash
markitdown-mcp --sse --host 127.0.0.1 --port 3001
markitdown-mcp --http --host 127.0.0.1 --port 3001
```
## Running in Docker
@@ -54,7 +54,7 @@ Once mounted, all files under data will be accessible under `/workdir` in the co
It is recommended to use the Docker image when running the MCP server for Claude Desktop.
Follow [these instrutions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
Follow [these instructions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
Edit it to include the following JSON entry:
@@ -102,23 +102,23 @@ To debug the MCP server you can use the `mcpinspector` tool.
npx @modelcontextprotocol/inspector
```
You can then connect to the insepctor through the specified host and port (e.g., `http://localhost:5173/`).
You can then connect to the inspector through the specified host and port (e.g., `http://localhost:5173/`).
If using STDIO:
* select `STDIO` as the transport type,
* input `markitdown-mcp` as the command, and
* click `Connect`
If using SSE:
* select `SSE` as the transport type,
* input `http://127.0.0.1:3001/sse` as the URL, and
* click `Connect`
If using Streamable HTTP:
* select `Streamable HTTP` as the transport type,
* input `http://127.0.0.1:3001/mcp` as the URL, and
* click `Connect`
If using SSE:
* select `SSE` as the transport type,
* input `http://127.0.0.1:3001/sse` as the URL, and
* click `Connect`
Finally:
* click the `Tools` tab,
* click `List Tools`,
@@ -127,8 +127,7 @@ Finally:
## Security Considerations
The server does not support authentication, and runs with the privileges if the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
## Trademarks

View File

@@ -1,5 +1,6 @@
import contextlib
import sys
import os
from collections.abc import AsyncIterator
from mcp.server.fastmcp import FastMCP
from starlette.applications import Starlette
@@ -19,7 +20,15 @@ mcp = FastMCP("markitdown")
@mcp.tool()
async def convert_to_markdown(uri: str) -> str:
"""Convert a resource described by an http:, https:, file: or data: URI to markdown"""
return MarkItDown().convert_uri(uri).markdown
return MarkItDown(enable_plugins=check_plugins_enabled()).convert_uri(uri).markdown
def check_plugins_enabled() -> bool:
return os.getenv("MARKITDOWN_ENABLE_PLUGINS", "false").strip().lower() in (
"true",
"1",
"yes",
)
def create_starlette_app(mcp_server: Server, *, debug: bool = False) -> Starlette:
@@ -75,12 +84,17 @@ def main():
mcp_server = mcp._mcp_server
parser = argparse.ArgumentParser(description="Run MCP SSE-based MarkItDown server")
parser = argparse.ArgumentParser(description="Run a MarkItDown MCP server")
parser.add_argument(
"--http",
action="store_true",
help="Run the server with Streamable HTTP and SSE transport rather than STDIO (default: False)",
)
parser.add_argument(
"--sse",
action="store_true",
help="Run the server with SSE transport rather than STDIO (default: False)",
help="(Deprecated) An alias for --http (default: False)",
)
parser.add_argument(
"--host", default=None, help="Host to bind to (default: 127.0.0.1)"
@@ -90,11 +104,15 @@ def main():
)
args = parser.parse_args()
if not args.sse and (args.host or args.port):
parser.error("Host and port arguments are only valid when using SSE transport.")
use_http = args.http or args.sse
if not use_http and (args.host or args.port):
parser.error(
"Host and port arguments are only valid when using streamable HTTP or SSE transport (see: --http)."
)
sys.exit(1)
if args.sse:
if use_http:
starlette_app = create_starlette_app(mcp_server, debug=True)
uvicorn.run(
starlette_app,

View File

@@ -30,12 +30,13 @@ dependencies = [
"magika~=0.6.1",
"charset-normalizer",
"defusedxml",
"onnxruntime<=1.20.1; sys_platform == 'win32'",
]
[project.optional-dependencies]
all = [
"python-pptx",
"mammoth",
"mammoth~=1.10.0",
"pandas",
"openpyxl",
"xlrd",

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.1.2a1"
__version__ = "0.1.3"

View File

@@ -69,7 +69,7 @@ class DocumentConverter:
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
file_stream.seek(cur_pos) # Reset the position to the original position
Prameters:
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
@@ -90,7 +90,7 @@ class DocumentConverter:
"""
Convert a document to Markdown text.
Prameters:
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.

View File

@@ -115,6 +115,7 @@ class MarkItDown:
# TODO - remove these (see enable_builtins)
self._llm_client: Any = None
self._llm_model: Union[str | None] = None
self._llm_prompt: Union[str | None] = None
self._exiftool_path: Union[str | None] = None
self._style_map: Union[str | None] = None
@@ -139,6 +140,7 @@ class MarkItDown:
# TODO: Move these into converter constructors
self._llm_client = kwargs.get("llm_client")
self._llm_model = kwargs.get("llm_model")
self._llm_prompt = kwargs.get("llm_prompt")
self._exiftool_path = kwargs.get("exiftool_path")
self._style_map = kwargs.get("style_map")
@@ -559,6 +561,9 @@ class MarkItDown:
if "llm_model" not in _kwargs and self._llm_model is not None:
_kwargs["llm_model"] = self._llm_model
if "llm_prompt" not in _kwargs and self._llm_prompt is not None:
_kwargs["llm_prompt"] = self._llm_prompt
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map

View File

@@ -84,6 +84,9 @@ def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -> List[s
prefixes.append(
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
elif type_ == DocumentIntelligenceFileType.HTML:
prefixes.append("text/html")
prefixes.append("application/xhtml+xml")
elif type_ == DocumentIntelligenceFileType.PDF:
prefixes.append("application/pdf")
prefixes.append("application/x-pdf")
@@ -119,6 +122,8 @@ def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> List[str]
extensions.append(".bmp")
elif type_ == DocumentIntelligenceFileType.TIFF:
extensions.append(".tiff")
elif type_ == DocumentIntelligenceFileType.HTML:
extensions.append(".html")
return extensions

View File

@@ -1,4 +1,6 @@
import sys
import io
from warnings import warn
from typing import BinaryIO, Any
@@ -13,6 +15,14 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
_dependency_exc_info = None
try:
import mammoth
import mammoth.docx.files
def mammoth_files_open(self, uri):
warn("DOCX: processing of r:link resources (e.g., linked images) is disabled.")
return io.BytesIO(b"")
mammoth.docx.files.Files.open = mammoth_files_open
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()

View File

@@ -1,7 +1,11 @@
import json
import subprocess
import locale
from typing import BinaryIO, Any, Union
import subprocess
from typing import Any, BinaryIO, Union
def _parse_version(version: str) -> tuple:
return tuple(map(int, (version.split("."))))
def exiftool_metadata(
@@ -13,6 +17,24 @@ def exiftool_metadata(
if not exiftool_path:
return {}
# Verify exiftool version
try:
version_output = subprocess.run(
[exiftool_path, "-ver"],
capture_output=True,
text=True,
check=True,
).stdout.strip()
version = _parse_version(version_output)
min_version = (12, 24)
if version < min_version:
raise RuntimeError(
f"ExifTool version {version_output} is vulnerable to CVE-2021-22204. "
"Please upgrade to version 12.24 or later."
)
except (subprocess.CalledProcessError, ValueError) as e:
raise RuntimeError("Failed to verify ExifTool version.") from e
# Run exiftool
cur_pos = file_stream.tell()
try:

View File

@@ -92,9 +92,11 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
"""Same as usual converter, but removes data URIs"""
alt = el.attrs.get("alt", None) or ""
src = el.attrs.get("src", None) or ""
src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
title = el.attrs.get("title", None) or ""
title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
# Remove all line breaks from alt
alt = alt.replace("\n", " ")
if (
convert_as_inline
and el.parent.name not in self.options["keep_inline_images_in"]
@@ -107,5 +109,18 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
return "![%s](%s%s)" % (alt, src, title_part)
def convert_input(
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Convert checkboxes to Markdown [x]/[ ] syntax."""
if el.get("type") == "checkbox":
return "[x] " if el.has_attr("checked") else "[ ] "
return ""
def convert_soup(self, soup: Any) -> str:
return super().convert_soup(soup) # type: ignore

View File

@@ -168,11 +168,23 @@ class PptxConverter(DocumentConverter):
# Group Shapes
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
sorted_shapes = sorted(
shape.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for subshape in sorted_shapes:
get_shape_content(subshape, **kwargs)
sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
sorted_shapes = sorted(
slide.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for shape in sorted_shapes:
get_shape_content(shape, **kwargs)

View File

@@ -0,0 +1,26 @@
import io
from markitdown.converters._doc_intel_converter import (
DocumentIntelligenceConverter,
DocumentIntelligenceFileType,
)
from markitdown._stream_info import StreamInfo
def _make_converter(file_types):
conv = DocumentIntelligenceConverter.__new__(DocumentIntelligenceConverter)
conv._file_types = file_types
return conv
def test_docintel_accepts_html_extension():
conv = _make_converter([DocumentIntelligenceFileType.HTML])
stream_info = StreamInfo(mimetype=None, extension=".html")
assert conv.accepts(io.BytesIO(b""), stream_info)
def test_docintel_accepts_html_mimetype():
conv = _make_converter([DocumentIntelligenceFileType.HTML])
stream_info = StreamInfo(mimetype="text/html", extension=None)
assert conv.accepts(io.BytesIO(b""), stream_info)
stream_info = StreamInfo(mimetype="application/xhtml+xml", extension=None)
assert conv.accepts(io.BytesIO(b""), stream_info)

View File

@@ -4,6 +4,7 @@ import os
import re
import shutil
import pytest
from unittest.mock import MagicMock
from markitdown._uri_utils import parse_data_uri, file_uri_to_path
@@ -370,6 +371,50 @@ def test_markitdown_exiftool() -> None:
assert target in result.text_content
def test_markitdown_llm_parameters() -> None:
"""Test that LLM parameters are correctly passed to the client."""
mock_client = MagicMock()
mock_response = MagicMock()
mock_response.choices = [
MagicMock(
message=MagicMock(
content="Test caption with red circle and blue square 5bda1dd6"
)
)
]
mock_client.chat.completions.create.return_value = mock_response
test_prompt = "You are a professional test prompt."
markitdown = MarkItDown(
llm_client=mock_client, llm_model="gpt-4o", llm_prompt=test_prompt
)
# Test image file
markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
# Verify the prompt was passed to the OpenAI API
assert mock_client.chat.completions.create.called
call_args = mock_client.chat.completions.create.call_args
messages = call_args[1]["messages"]
assert len(messages) == 1
assert messages[0]["content"][0]["text"] == test_prompt
# Reset the mock for the next test
mock_client.chat.completions.create.reset_mock()
# TODO: may only use one test after the llm caption method duplicate has been removed:
# https://github.com/microsoft/markitdown/pull/1254
# Test PPTX file
markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
# Verify the prompt was passed to the OpenAI API for PPTX images too
assert mock_client.chat.completions.create.called
call_args = mock_client.chat.completions.create.call_args
messages = call_args[1]["messages"]
assert len(messages) == 1
assert messages[0]["content"][0]["text"] == test_prompt
@pytest.mark.skipif(
skip_llm,
reason="do not run llm tests without a key",
@@ -408,6 +453,7 @@ if __name__ == "__main__":
test_speech_transcription,
test_exceptions,
test_markitdown_exiftool,
test_markitdown_llm_parameters,
test_markitdown_llm,
]:
print(f"Running {test.__name__}...", end="")