feat: add checkbox support to Markdown converter (#1208)

This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).

Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
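For readers who want to see the checkbox mapping in isolation, here is a minimal sketch. It assumes only the standard library's `html.parser` and is not the markdownify-based `convert_input` hook this commit adds; the class `CheckboxToMarkdown` and the function `checkboxes_to_markdown` are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Hypothetical helper: keeps text and turns checkbox inputs into [x] / [ ]."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("type") == "checkbox":
            # A bare `checked` attribute arrives with value None, so test for the key.
            self.parts.append("[x] " if "checked" in attr_map else "[ ] ")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html):
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(checkboxes_to_markdown("<input type=checkbox checked>Done"))
print(checkboxes_to_markdown('<input type="checkbox"/>Todo'))
```

As in the commit, inputs that are not checkboxes produce no output of their own; only the surrounding text survives.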
17 changed files with 172 additions and 20 deletions
--- a/.github/workflows/pre-commit.yml
+++ b/.github/workflows/pre-commit.yml
@@ -5,7 +5,7 @@ jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -5,7 +5,7 @@ jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
      - uses: actions/setup-python@v5
        with:
          python-version: |
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@

 MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.

-At present, MarkItDown supports:
+MarkItDown currently supports the conversion from:

 - PDF
 - PowerPoint
@@ -164,14 +164,14 @@ result = md.convert("test.pdf")
 print(result.text_content)
 ```

-To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
+To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:

 ```python
 from markitdown import MarkItDown
 from openai import OpenAI

 client = OpenAI()
-md = MarkItDown(llm_client=client, llm_model="gpt-4o")
+md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
 result = md.convert("example.jpg")
 print(result.text_content)
 ```
@@ -199,7 +199,7 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio

 ### How to Contribute

-You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
+You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.

 <div align="center">

--- a/packages/markitdown-mcp/Dockerfile
+++ b/packages/markitdown-mcp/Dockerfile
@@ -3,8 +3,10 @@ FROM python:3.13-slim-bullseye
 ENV DEBIAN_FRONTEND=noninteractive
 ENV EXIFTOOL_PATH=/usr/bin/exiftool
 ENV FFMPEG_PATH=/usr/bin/ffmpeg
+ENV MARKITDOWN_ENABLE_PLUGINS=True

 # Runtime dependency
+# NOTE: Add any additional MarkItDown plugins here
 RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    exiftool
--- a/packages/markitdown-mcp/README.md
+++ b/packages/markitdown-mcp/README.md
@@ -54,7 +54,7 @@ Once mounted, all files under data will be accessible under `/workdir` in the co

 It is recommended to use the Docker image when running the MCP server for Claude Desktop.

-Follow [these instrutions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.
+Follow [these instructions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.

 Edit it to include the following JSON entry:

@@ -102,7 +102,7 @@ To debug the MCP server you can use the `mcpinspector` tool.
 npx @modelcontextprotocol/inspector
 ```

-You can then connect to the insepctor through the specified host and port (e.g., `http://localhost:5173/`).
+You can then connect to the inspector through the specified host and port (e.g., `http://localhost:5173/`).

 If using STDIO:
 * select `STDIO` as the transport type,
@@ -127,8 +127,7 @@ Finally:

 ## Security Considerations

-The server does not support authentication, and runs with the privileges if the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).
-
+The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, it is recommended to run the server bound to `localhost` (default).

 ## Trademarks

--- a/packages/markitdown-mcp/src/markitdown_mcp/main.py
+++ b/packages/markitdown-mcp/src/markitdown_mcp/main.py
@@ -1,5 +1,6 @@
 import contextlib
 import sys
+import os
 from collections.abc import AsyncIterator
 from mcp.server.fastmcp import FastMCP
 from starlette.applications import Starlette
@@ -19,7 +20,15 @@ mcp = FastMCP("markitdown")
 @mcp.tool()
 async def convert_to_markdown(uri: str) -> str:
    """Convert a resource described by an http:, https:, file: or data: URI to markdown"""
-    return MarkItDown().convert_uri(uri).markdown
+    return MarkItDown(enable_plugins=check_plugins_enabled()).convert_uri(uri).markdown
+
+
+def check_plugins_enabled() -> bool:
+    return os.getenv("MARKITDOWN_ENABLE_PLUGINS", "false").strip().lower() in (
+        "true",
+        "1",
+        "yes",
+    )


 def create_starlette_app(mcp_server: Server, *, debug: bool = False) -> Starlette:
--- a/packages/markitdown/pyproject.toml
+++ b/packages/markitdown/pyproject.toml
@@ -30,12 +30,13 @@ dependencies = [
  "magika~=0.6.1",
  "charset-normalizer",
  "defusedxml",
+  "onnxruntime<=1.20.1; sys_platform == 'win32'",
 ]

 [project.optional-dependencies]
 all = [
  "python-pptx",
-  "mammoth",
+  "mammoth~=1.10.0",
  "pandas",
  "openpyxl",
  "xlrd",
--- a/packages/markitdown/src/markitdown/about.py
+++ b/packages/markitdown/src/markitdown/about.py
@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.2"
+__version__ = "0.1.3"
--- a/packages/markitdown/src/markitdown/_base_converter.py
+++ b/packages/markitdown/src/markitdown/_base_converter.py
@@ -69,7 +69,7 @@ class DocumentConverter:
        data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
        file_stream.seek(cur_pos)    # Reset the position to the original position

-        Prameters:
+        Parameters:
        - file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
        - stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
        - kwargs: Additional keyword arguments for the converter.
@@ -90,7 +90,7 @@ class DocumentConverter:
        """
        Convert a document to Markdown text.

-        Prameters:
+        Parameters:
        - file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
        - stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
        - kwargs: Additional keyword arguments for the converter.
--- a/packages/markitdown/src/markitdown/_markitdown.py
+++ b/packages/markitdown/src/markitdown/_markitdown.py
@@ -115,6 +115,7 @@ class MarkItDown:
        # TODO - remove these (see enable_builtins)
        self._llm_client: Any = None
        self._llm_model: Union[str | None] = None
+        self._llm_prompt: Union[str | None] = None
        self._exiftool_path: Union[str | None] = None
        self._style_map: Union[str | None] = None

@@ -139,6 +140,7 @@ class MarkItDown:
            # TODO: Move these into converter constructors
            self._llm_client = kwargs.get("llm_client")
            self._llm_model = kwargs.get("llm_model")
+            self._llm_prompt = kwargs.get("llm_prompt")
            self._exiftool_path = kwargs.get("exiftool_path")
            self._style_map = kwargs.get("style_map")

@@ -559,6 +561,9 @@ class MarkItDown:
                if "llm_model" not in _kwargs and self._llm_model is not None:
                    _kwargs["llm_model"] = self._llm_model

+                if "llm_prompt" not in _kwargs and self._llm_prompt is not None:
+                    _kwargs["llm_prompt"] = self._llm_prompt
+
                if "style_map" not in _kwargs and self._style_map is not None:
                    _kwargs["style_map"] = self._style_map

--- a/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_doc_intel_converter.py
@@ -84,6 +84,9 @@ def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -> List[s
            prefixes.append(
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
            )
+        elif type_ == DocumentIntelligenceFileType.HTML:
+            prefixes.append("text/html")
+            prefixes.append("application/xhtml+xml")
        elif type_ == DocumentIntelligenceFileType.PDF:
            prefixes.append("application/pdf")
            prefixes.append("application/x-pdf")
@@ -119,6 +122,8 @@ def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> List[str]
            extensions.append(".bmp")
        elif type_ == DocumentIntelligenceFileType.TIFF:
            extensions.append(".tiff")
+        elif type_ == DocumentIntelligenceFileType.HTML:
+            extensions.append(".html")
    return extensions


--- a/packages/markitdown/src/markitdown/converters/_docx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_docx_converter.py
@@ -1,4 +1,6 @@
 import sys
+import io
+from warnings import warn

 from typing import BinaryIO, Any

@@ -13,6 +15,14 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
 _dependency_exc_info = None
 try:
    import mammoth
+    import mammoth.docx.files
+
+    def mammoth_files_open(self, uri):
+        warn("DOCX: processing of r:link resources (e.g., linked images) is disabled.")
+        return io.BytesIO(b"")
+
+    mammoth.docx.files.Files.open = mammoth_files_open
+
 except ImportError:
    # Preserve the error and stack trace for later
    _dependency_exc_info = sys.exc_info()
--- a/packages/markitdown/src/markitdown/converters/_exiftool.py
+++ b/packages/markitdown/src/markitdown/converters/_exiftool.py
@@ -1,7 +1,11 @@
 import json
-import subprocess
 import locale
-from typing import BinaryIO, Any, Union
+import subprocess
+from typing import Any, BinaryIO, Union
+
+
+def _parse_version(version: str) -> tuple:
+    return tuple(map(int, (version.split("."))))


 def exiftool_metadata(
@@ -13,6 +17,24 @@ def exiftool_metadata(
    if not exiftool_path:
        return {}

+    # Verify exiftool version
+    try:
+        version_output = subprocess.run(
+            [exiftool_path, "-ver"],
+            capture_output=True,
+            text=True,
+            check=True,
+        ).stdout.strip()
+        version = _parse_version(version_output)
+        min_version = (12, 24)
+        if version < min_version:
+            raise RuntimeError(
+                f"ExifTool version {version_output} is vulnerable to CVE-2021-22204. "
+                "Please upgrade to version 12.24 or later."
+            )
+    except (subprocess.CalledProcessError, ValueError) as e:
+        raise RuntimeError("Failed to verify ExifTool version.") from e
+
    # Run exiftool
    cur_pos = file_stream.tell()
    try:
--- a/packages/markitdown/src/markitdown/converters/_markdownify.py
+++ b/packages/markitdown/src/markitdown/converters/_markdownify.py
@@ -92,9 +92,11 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
        """Same as usual converter, but removes data URIs"""

        alt = el.attrs.get("alt", None) or ""
-        src = el.attrs.get("src", None) or ""
+        src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
        title = el.attrs.get("title", None) or ""
        title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
+        # Remove all line breaks from alt
+        alt = alt.replace("\n", " ")
        if (
            convert_as_inline
            and el.parent.name not in self.options["keep_inline_images_in"]
@@ -107,5 +109,18 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):

        return "![%s](%s%s)" % (alt, src, title_part)

+    def convert_input(
+        self,
+        el: Any,
+        text: str,
+        convert_as_inline: Optional[bool] = False,
+        **kwargs,
+    ) -> str:
+        """Convert checkboxes to Markdown [x]/[ ] syntax."""
+
+        if el.get("type") == "checkbox":
+            return "[x] " if el.has_attr("checked") else "[ ] "
+        return ""
+
    def convert_soup(self, soup: Any) -> str:
        return super().convert_soup(soup)  # type: ignore
--- a/packages/markitdown/src/markitdown/converters/_pptx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_pptx_converter.py
@@ -168,11 +168,23 @@ class PptxConverter(DocumentConverter):

                # Group Shapes
                if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
-                    sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
+                    sorted_shapes = sorted(
+                        shape.shapes,
+                        key=lambda x: (
+                            float("-inf") if not x.top else x.top,
+                            float("-inf") if not x.left else x.left,
+                        ),
+                    )
                    for subshape in sorted_shapes:
                        get_shape_content(subshape, **kwargs)

-            sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
+            sorted_shapes = sorted(
+                slide.shapes,
+                key=lambda x: (
+                    float("-inf") if not x.top else x.top,
+                    float("-inf") if not x.left else x.left,
+                ),
+            )
            for shape in sorted_shapes:
                get_shape_content(shape, **kwargs)

--- a/packages/markitdown/tests/test_docintel_html.py
+++ b/packages/markitdown/tests/test_docintel_html.py
@@ -0,0 +1,26 @@
+import io
+from markitdown.converters._doc_intel_converter import (
+    DocumentIntelligenceConverter,
+    DocumentIntelligenceFileType,
+)
+from markitdown._stream_info import StreamInfo
+
+
+def _make_converter(file_types):
+    conv = DocumentIntelligenceConverter.__new__(DocumentIntelligenceConverter)
+    conv._file_types = file_types
+    return conv
+
+
+def test_docintel_accepts_html_extension():
+    conv = _make_converter([DocumentIntelligenceFileType.HTML])
+    stream_info = StreamInfo(mimetype=None, extension=".html")
+    assert conv.accepts(io.BytesIO(b""), stream_info)
+
+
+def test_docintel_accepts_html_mimetype():
+    conv = _make_converter([DocumentIntelligenceFileType.HTML])
+    stream_info = StreamInfo(mimetype="text/html", extension=None)
+    assert conv.accepts(io.BytesIO(b""), stream_info)
+    stream_info = StreamInfo(mimetype="application/xhtml+xml", extension=None)
+    assert conv.accepts(io.BytesIO(b""), stream_info)
--- a/packages/markitdown/tests/test_module_misc.py
+++ b/packages/markitdown/tests/test_module_misc.py
@@ -4,6 +4,7 @@ import os
 import re
 import shutil
 import pytest
+from unittest.mock import MagicMock

 from markitdown._uri_utils import parse_data_uri, file_uri_to_path

@@ -370,6 +371,50 @@ def test_markitdown_exiftool() -> None:
        assert target in result.text_content


+def test_markitdown_llm_parameters() -> None:
+    """Test that LLM parameters are correctly passed to the client."""
+    mock_client = MagicMock()
+    mock_response = MagicMock()
+    mock_response.choices = [
+        MagicMock(
+            message=MagicMock(
+                content="Test caption with red circle and blue square 5bda1dd6"
+            )
+        )
+    ]
+    mock_client.chat.completions.create.return_value = mock_response
+
+    test_prompt = "You are a professional test prompt."
+    markitdown = MarkItDown(
+        llm_client=mock_client, llm_model="gpt-4o", llm_prompt=test_prompt
+    )
+
+    # Test image file
+    markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
+
+    # Verify the prompt was passed to the OpenAI API
+    assert mock_client.chat.completions.create.called
+    call_args = mock_client.chat.completions.create.call_args
+    messages = call_args[1]["messages"]
+    assert len(messages) == 1
+    assert messages[0]["content"][0]["text"] == test_prompt
+
+    # Reset the mock for the next test
+    mock_client.chat.completions.create.reset_mock()
+
+    # TODO: may only use one test after the llm caption method duplicate has been removed:
+    # https://github.com/microsoft/markitdown/pull/1254
+    # Test PPTX file
+    markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
+
+    # Verify the prompt was passed to the OpenAI API for PPTX images too
+    assert mock_client.chat.completions.create.called
+    call_args = mock_client.chat.completions.create.call_args
+    messages = call_args[1]["messages"]
+    assert len(messages) == 1
+    assert messages[0]["content"][0]["text"] == test_prompt
+
+
 @pytest.mark.skipif(
    skip_llm,
    reason="do not run llm tests without a key",
@@ -408,6 +453,7 @@ if __name__ == "__main__":
        test_speech_transcription,
        test_exceptions,
        test_markitdown_exiftool,
+        test_markitdown_llm_parameters,
        test_markitdown_llm,
    ]:
        print(f"Running {test.__name__}...", end="")
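The ExifTool check in `_exiftool.py` above compares parsed version tuples rather than raw strings. The sketch below illustrates why that matters; `parse_version` mirrors the diff's `_parse_version`, while `MIN_VERSION` and the sample version strings are assumptions chosen for the example.

```python
def parse_version(version):
    """Split a dotted version string into a tuple of ints, e.g. '12.24' -> (12, 24)."""
    return tuple(int(part) for part in version.strip().split("."))


MIN_VERSION = (12, 24)  # same threshold as the diff

for candidate in ["9.99", "12.23", "12.24", "13.00"]:
    status = "ok" if parse_version(candidate) >= MIN_VERSION else "too old"
    print(candidate, status)

# Comparing the raw strings would accept the old release, because "9" sorts after "1".
print("9.99" >= "12.24")
```

A version string with a non-numeric component makes `int()` raise `ValueError`, which is the case the diff converts into a `RuntimeError`.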
Author	SHA1	Message	Date
Meirna	8a9d8f1593	feat: add checkbox support to Markdown converter (#1208) This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]). Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>	2025-08-26 15:30:47 -07:00
Richard Ye	17365654c9	Handle PPTX shapes where position is None (#1161) * Handle shapes where position is None * Fixed recursion error, and place no-coord shapes at front	2025-08-26 15:28:17 -07:00
Yuzhong Zhang	59eb60f8cb	fix docx parse error(\n in alt) (#1163)	2025-08-26 15:20:17 -07:00
Dmitry	459d462f29	docs: correct minor typos (#1173)	2025-08-26 15:15:23 -07:00
Noah Zhu	c3f6cb356c	Adding support for data-src Attribute (#1226) * supportfordata-src	2025-08-26 15:11:53 -07:00
Ebrahim Tayabali	0c4d3945a0	Update README.md (#1191) Fix: Subtle spelling mistake fixed.	2025-08-26 15:07:27 -07:00
Utkarsh kumar	f8b60b5403	Update README.md (#1350) ISSUE #1339	2025-08-26 15:02:56 -07:00
[W]DOS_	16ca285d30	Update README.md (#1335) Fix typo in README.md	2025-08-26 14:55:58 -07:00
Stefan Rink	b81a387616	fix: correctly pass custom llm prompt parameter (#1319) * fix: correctly pass custom llm prompt parameter	2025-08-26 14:51:10 -07:00
safen0s	ea1a3dfb60	Add HTML support to DocumentIntelligenceConverter (#1352)	2025-08-26 14:34:43 -07:00
dependabot[bot]	b6e5da8874	Bump actions/checkout from 4 to 5 (#1394) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.	2025-08-26 14:27:38 -07:00
t3tra	fb1ad24833	Ensure safe ExifTool usage: require >= 12.24 (#1399) * feat: add version verification for ExifTool to ensure security compliance * fix: improve ExifTool version verification	2025-08-26 14:25:13 -07:00
JonahDelman	1178c2e211	Fixed documentation typos in _base_converter.py (#1393)	2025-08-26 14:23:10 -07:00
afourney	9278119bb3	Resolved an issue with linked images in docx [mammoth] (#1405)	2025-08-26 14:20:29 -07:00
onefloid	da7bcea527	docs: rephrase sentence (#1278)	2025-06-03 21:09:25 -07:00
afourney	3bfb821c09	Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from ENV (#1273) * Have the MarkItdown MCP server read MARKITDOWN_ENABLE_PLUGINS from os.environ * Update the Dockerfile to enable plugins. No puglins are installed by default.	2025-06-03 09:35:33 -07:00
Tomasz Kalinowski	62b72284fe	pin onnxruntime on Windows (#1274) closes #1266	2025-05-28 13:13:51 -07:00
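The "Handle PPTX shapes where position is None" commit listed above replaces an `attrgetter("top", "left")` sort key with one that tolerates missing coordinates. A minimal sketch of the difference follows; the `SimpleNamespace` objects are hypothetical stand-ins for python-pptx shapes, and `position_key` is an illustrative name for the lambda used in the diff.

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical stand-ins for shapes; only `top` and `left` matter here.
shapes = [
    SimpleNamespace(name="body", top=200, left=50),
    SimpleNamespace(name="no_position", top=None, left=None),
    SimpleNamespace(name="title", top=10, left=50),
]

try:
    sorted(shapes, key=attrgetter("top", "left"))
except TypeError as exc:
    # None cannot be ordered against an int, so the old key fails here.
    print("attrgetter key fails:", exc)


def position_key(shape):
    # Same rule as the diff: falsy coordinates (None, and also 0) map to -inf.
    return (
        float("-inf") if not shape.top else shape.top,
        float("-inf") if not shape.left else shape.left,
    )


ordered = [s.name for s in sorted(shapes, key=position_key)]
print(ordered)
```

Shapes without coordinates sort to the front, matching the commit note. Because the key tests truthiness, a coordinate of exactly 0 is treated the same way as `None`.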