If files use zip packaging, be smarter about inspecting their types.

Fix exiftool in well-known paths. (#1106)
feat(docker): improve dockerfile build (#220)
14 changed files with 157 additions and 143 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -1 +1,2 @@
-*
+*
+!packages/
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,22 +1,32 @@
 FROM python:3.13-slim-bullseye

-USER root
-
-ARG INSTALL_GIT=false
-RUN if [ "$INSTALL_GIT" = "true" ]; then \
-    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
-    fi
+ENV DEBIAN_FRONTEND=noninteractive
+ENV EXIFTOOL_PATH=/usr/bin/exiftool
+ENV FFMPEG_PATH=/usr/bin/ffmpeg

 # Runtime dependency
 RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
-    && rm -rf /var/lib/apt/lists/*
+    exiftool

-RUN pip install markitdown
+ARG INSTALL_GIT=false
+RUN if [ "$INSTALL_GIT" = "true" ]; then \
+    apt-get install -y --no-install-recommends \
+    git; \
+    fi
+
+# Cleanup
+RUN rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+COPY . /app
+RUN pip --no-cache-dir install \
+    /app/packages/markitdown[all] \
+    /app/packages/markitdown-sample-plugin

 # Default USERID and GROUPID
-ARG USERID=10000
-ARG GROUPID=10000
+ARG USERID=nobody
+ARG GROUPID=nogroup

 USER $USERID:$GROUPID

--- a/README.md
+++ b/README.md
@@ -14,10 +14,9 @@ MarkItDown is a lightweight Python utility for converting various files to Markd
 At present, MarkItDown supports:

 - PDF
-- PowerPoint
+- PowerPoint (reading in top-to-bottom, left-to-right order)
 - Word
 - Excel
-- OneNote
 - Images (EXIF metadata and OCR)
 - Audio (EXIF metadata and speech transcription)
 - HTML
@@ -83,7 +82,6 @@ At the moment, the following optional dependencies are available:
 * `[xls]` Installs dependencies for older Excel files
 * `[pdf]` Installs dependencies for PDF files
 * `[outlook]` Installs dependencies for Outlook messages
-* `[onenote]` Installs dependencies for OneNote .one files
 * `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
 * `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
 * `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
--- a/packages/markitdown/pyproject.toml
+++ b/packages/markitdown/pyproject.toml
@@ -45,8 +45,7 @@ all = [
  "SpeechRecognition",
  "youtube-transcript-api",
  "azure-ai-documentintelligence",
-  "azure-identity",
-  "one-extract",
+  "azure-identity"
 ]
 pptx = ["python-pptx"]
 docx = ["mammoth"]
@@ -54,7 +53,6 @@ xlsx = ["pandas", "openpyxl"]
 xls = ["pandas", "xlrd"]
 pdf = ["pdfminer.six"]
 outlook = ["olefile"]
-onenote = ["one-extract"]
 audio-transcription = ["pydub", "SpeechRecognition"]
 youtube-transcription = ["youtube-transcript-api"]
 az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
--- a/packages/markitdown/src/markitdown/__about__.py
+++ b/packages/markitdown/src/markitdown/__about__.py
@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.0a1"
+__version__ = "0.1.0a2"
--- a/packages/markitdown/src/markitdown/_markitdown.py
+++ b/packages/markitdown/src/markitdown/_markitdown.py
@@ -3,6 +3,7 @@ import mimetypes
 import os
 import re
 import sys
+import shutil
 import tempfile
 import warnings
 import traceback
@@ -30,7 +31,6 @@ from .converters import (
    BingSerpConverter,
    PdfConverter,
    DocxConverter,
-    OneNoteConverter,
    XlsxConverter,
    XlsConverter,
    PptxConverter,
@@ -139,9 +139,30 @@ class MarkItDown:
            self._llm_model = kwargs.get("llm_model")
            self._exiftool_path = kwargs.get("exiftool_path")
            self._style_map = kwargs.get("style_map")
+
            if self._exiftool_path is None:
                self._exiftool_path = os.getenv("EXIFTOOL_PATH")

+            # Still none? Check well-known paths
+            if self._exiftool_path is None:
+                candidate = shutil.which("exiftool")
+                if candidate:
+                    candidate = os.path.abspath(candidate)
+                    if any(
+                        d == os.path.dirname(candidate)
+                        for d in [
+                            "/usr/bin",
+                            "/usr/local/bin",
+                            "/opt",
+                            "/opt/bin",
+                            "/opt/local/bin",
+                            "/opt/homebrew/bin", "C:\\Windows\\System32",
+                            "C:\\Program Files",
+                            "C:\\Program Files (x86)",
+                        ]
+                    ):
+                        self._exiftool_path = candidate
+
            # Register converters for successful browsing operations
            # Later registrations are tried first / take higher priority than earlier registrations
            # To this end, the most specific converters should appear below the most generic converters
@@ -159,7 +180,6 @@ class MarkItDown:
            self.register_converter(YouTubeConverter())
            self.register_converter(BingSerpConverter())
            self.register_converter(DocxConverter())
-            self.register_converter(OneNoteConverter())
            self.register_converter(XlsxConverter())
            self.register_converter(XlsConverter())
            self.register_converter(PptxConverter())
@@ -329,6 +349,17 @@ class MarkItDown:
        elif base_guess.extension is not None:
            placeholder_filename = "placeholder" + base_guess.extension

+        # Check if we have a seekable stream. If not, load the entire stream into memory.
+        if not stream.seekable():
+            buffer = io.BytesIO()
+            while True:
+                chunk = stream.read(4096)
+                if not chunk:
+                    break
+                buffer.write(chunk)
+            buffer.seek(0)
+            stream = buffer
+
        # Add guesses based on stream content
        for guess in _guess_stream_info_from_stream(
            file_stream=stream, filename_hint=placeholder_filename
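The seekable-stream check added in this file can be read in isolation. The following is a minimal sketch of the same buffering idea, not MarkItDown's own code: `ensure_seekable` and `OneWayStream` are hypothetical names introduced only for illustration.

```python
import io
from typing import BinaryIO


class OneWayStream(io.RawIOBase):
    """Hypothetical non-seekable stream, standing in for a pipe or socket."""

    def __init__(self, data: bytes) -> None:
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        # Fill the caller's buffer with as many bytes as are available.
        chunk = self._inner.read(len(b))
        b[: len(chunk)] = chunk
        return len(chunk)


def ensure_seekable(stream: BinaryIO, chunk_size: int = 4096) -> BinaryIO:
    """Return the stream unchanged if it is seekable, else an in-memory copy."""
    if stream.seekable():
        return stream
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        buffer.write(chunk)
    buffer.seek(0)
    return buffer


source = OneWayStream(b"x" * 10000)
buffered = ensure_seekable(source)
print(buffered.seekable(), len(buffered.getvalue()))
```

The trade-off is memory: the whole payload is held in RAM so that later type guessing can read a header and rewind.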
--- a/packages/markitdown/src/markitdown/_stream_info.py
+++ b/packages/markitdown/src/markitdown/_stream_info.py
@@ -1,8 +1,9 @@
 import puremagic
 import mimetypes
+import zipfile
 import os
 from dataclasses import dataclass, asdict
-from typing import Optional, BinaryIO, List, TypeVar, Type
+from typing import Optional, BinaryIO, List, Union

 # Mimetype substitutions table
 MIMETYPE_SUBSTITUTIONS = {
@@ -74,6 +75,20 @@ def _guess_stream_info_from_stream(
                )
            )

+    # If it looks like a zip use _guess_stream_info_from_zip rather than puremagic
+    cur_pos = file_stream.tell()
+    try:
+        header = file_stream.read(4)
+        file_stream.seek(cur_pos)
+        if header == b"PK\x03\x04":
+            zip_guess = _guess_stream_info_from_zip(file_stream)
+            if zip_guess:
+                guesses.append(zip_guess)
+                return guesses
+    finally:
+        file_stream.seek(cur_pos)
+
+    # Fall back to using puremagic
    def _puremagic(
        file_stream, filename_hint
    ) -> List[puremagic.main.PureMagicWithConfidence]:
@@ -120,3 +135,74 @@ def _guess_stream_info_from_stream(
            guesses.append(StreamInfo(**kwargs))

    return guesses
+
+
+def _guess_stream_info_from_zip(file_stream: BinaryIO) -> Union[None, StreamInfo]:
+    """
+    Guess StreamInfo properties (mostly mimetype and extension) from a zip stream.
+
+    Args:
+    - file_stream: The stream to guess the StreamInfo from.
+
+    Returns the single best guess, or None if no guess could be made.
+    """
+
+    cur_pos = file_stream.tell()
+    try:
+        with zipfile.ZipFile(file_stream) as z:
+            table_of_contents = z.namelist()
+
+            # Open Packaging Conventions (OPC) file
+            if "[Content_Types].xml" in table_of_contents:
+                # Word file
+                if "word/document.xml" in table_of_contents:
+                    return StreamInfo(
+                        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                        extension=".docx",
+                    )
+
+                # Excel file
+                if "xl/workbook.xml" in table_of_contents:
+                    return StreamInfo(
+                        mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+                        extension=".xlsx",
+                    )
+
+                # PowerPoint file
+                if "ppt/presentation.xml" in table_of_contents:
+                    return StreamInfo(
+                        mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                        extension=".pptx",
+                    )
+
+                # Visio file
+                if "visio/document.xml" in table_of_contents:
+                    return StreamInfo(
+                        mimetype="application/vnd.ms-visio.drawing",
+                        extension=".vsdx",
+                    )
+
+                # XPS file
+                if "FixedDocSeq.fdseq" in table_of_contents:
+                    return StreamInfo(
+                        mimetype="application/vnd.ms-xpsdocument",
+                        extension=".xps",
+                    )
+
+            # EPUB, or similar
+            if "mimetype" in table_of_contents:
+                _mimetype = z.read("mimetype").decode("ascii").strip()
+                _extension = mimetypes.guess_extension(_mimetype)
+                return StreamInfo(mimetype=_mimetype, extension=_extension)
+
+            # JAR
+            if "META-INF/MANIFEST.MF" in table_of_contents:
+                return StreamInfo(mimetype="application/java-archive", extension=".jar")
+
+            # If we made it this far, we couldn't identify the file
+            return StreamInfo(mimetype="application/zip", extension=".zip")
+
+    except zipfile.BadZipFile:
+        return None
+    finally:
+        file_stream.seek(cur_pos)
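The zip inspection in this file can be tried without MarkItDown installed. The sketch below is a reduced illustration under stated assumptions: `sniff_zip`, `make_zip`, and `OPC_MARKERS` are hypothetical names, a plain `(mimetype, extension)` tuple stands in for `StreamInfo`, and only three of the marker checks are reproduced.

```python
import io
import zipfile
from typing import BinaryIO, Iterable, Optional, Tuple

# Marker member -> (mimetype, extension), mirroring three checks from the diff.
OPC_MARKERS = {
    "word/document.xml": (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ".docx",
    ),
    "xl/workbook.xml": (
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        ".xlsx",
    ),
    "ppt/presentation.xml": (
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        ".pptx",
    ),
}


def sniff_zip(file_stream: BinaryIO) -> Optional[Tuple[str, str]]:
    """Guess (mimetype, extension) from a zip's member names, or None."""
    cur_pos = file_stream.tell()
    try:
        # Cheap signature check before handing the stream to zipfile.
        if file_stream.read(4) != b"PK\x03\x04":
            return None
        file_stream.seek(cur_pos)
        with zipfile.ZipFile(file_stream) as z:
            names = set(z.namelist())
        if "[Content_Types].xml" in names:
            for marker, guess in OPC_MARKERS.items():
                if marker in names:
                    return guess
        return ("application/zip", ".zip")
    except zipfile.BadZipFile:
        return None
    finally:
        # Always restore the caller's position.
        file_stream.seek(cur_pos)


def make_zip(names: Iterable[str]) -> io.BytesIO:
    """Build an in-memory zip with empty members (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name in names:
            z.writestr(name, "")
    buf.seek(0)
    return buf


print(sniff_zip(make_zip(["[Content_Types].xml", "word/document.xml"])))
```

Checking member names rather than magic bytes is what lets `.docx`, `.xlsx`, and `.pptx` be told apart, since all three share the same zip signature.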
--- a/packages/markitdown/src/markitdown/converters/__init__.py
+++ b/packages/markitdown/src/markitdown/converters/__init__.py
@@ -11,7 +11,6 @@ from ._ipynb_converter import IpynbConverter
 from ._bing_serp_converter import BingSerpConverter
 from ._pdf_converter import PdfConverter
 from ._docx_converter import DocxConverter
-from ._onenote_converter import OneNoteConverter
 from ._xlsx_converter import XlsxConverter, XlsConverter
 from ._pptx_converter import PptxConverter
 from ._image_converter import ImageConverter
@@ -30,7 +29,6 @@ __all__ = [
    "BingSerpConverter",
    "PdfConverter",
    "DocxConverter",
-    "OneNoteConverter",
    "XlsxConverter",
    "XlsConverter",
    "PptxConverter",
--- a/packages/markitdown/src/markitdown/converters/_exiftool.py
+++ b/packages/markitdown/src/markitdown/converters/_exiftool.py
@@ -5,26 +5,16 @@ import sys
 import shutil
 import os
 import warnings
-from typing import BinaryIO, Optional, Any
+from typing import BinaryIO, Any, Union


 def exiftool_metadata(
-    file_stream: BinaryIO, *, exiftool_path: Optional[str] = None
+    file_stream: BinaryIO,
+    *,
+    exiftool_path: Union[str, None],
 ) -> Any:  # Need a better type for json data
-    # Check if we have a valid pointer to exiftool
+    # Nothing to do
    if not exiftool_path:
-        which_exiftool = shutil.which("exiftool")
-        if which_exiftool:
-            warnings.warn(
-                f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
-
-    md = MarkItDown(exiftool_path="{which_exiftool}")
-
-This warning will be removed in future releases.
-""",
-                DeprecationWarning,
-            )
-        # Nothing to do
        return {}

    # Run exiftool
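With implicit discovery removed from `exiftool_metadata`, path resolution lives in the `MarkItDown` constructor: explicit argument first, then `EXIFTOOL_PATH`, then a `PATH` lookup that is accepted only when the binary sits in a well-known directory. A minimal sketch of that precedence follows; `resolve_tool` and `TRUSTED_DIRS` are hypothetical names, and the directory list is shortened for illustration.

```python
import os
import shutil
from typing import Optional, Sequence

# Shortened, illustrative list; the change set itself checks more directories.
TRUSTED_DIRS = ("/usr/bin", "/usr/local/bin", "/opt/homebrew/bin")


def resolve_tool(
    explicit: Optional[str],
    env_value: Optional[str],
    found_on_path: Optional[str],
    trusted_dirs: Sequence[str] = TRUSTED_DIRS,
) -> Optional[str]:
    """Pick a tool path: explicit argument, then environment, then trusted PATH hit."""
    if explicit is not None:
        return explicit
    if env_value is not None:
        return env_value
    if found_on_path:
        candidate = os.path.abspath(found_on_path)
        # Accept a PATH hit only if its directory is on the trusted list.
        if os.path.dirname(candidate) in trusted_dirs:
            return candidate
    return None


# Typical call: the inputs come from the environment and from PATH.
path = resolve_tool(None, os.getenv("EXIFTOOL_PATH"), shutil.which("exiftool"))
print(path)
```

Passing the three inputs as parameters keeps the function easy to test. Requiring an exact directory match means a binary found elsewhere on `PATH`, such as in a user-writable directory, is not run implicitly.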
--- a/packages/markitdown/src/markitdown/converters/_onenote_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_onenote_converter.py
@@ -1,87 +0,0 @@
-import sys
-
-from typing import BinaryIO, Any
-
-from ._html_converter import HtmlConverter
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
-
-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-try:
-    import one_extract
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-
-ACCEPTED_MIME_TYPE_PREFIXES = []
-
-ACCEPTED_FILE_EXTENSIONS = [".one"]
-
-
-class OneNoteConverter(DocumentConverter):
-    """
-    Converts OneNote files to Markdown.
-    """
-
-    def __init__(self):
-        super().__init__()
-        self._html_converter = HtmlConverter()
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Check: the dependencies
-        if _dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".one",
-                    feature="onenote",
-                )
-            ) from _dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _dependency_exc_info[2]
-            )
-
-        # Perform the conversion
-        md_content = ""
-        notebook = one_extract.Notebook(file_stream)
-        for section in notebook.sections:
-            md_content += f"\n\n# {section.name}\n"
-            for page in section.pages:
-                md_content += f"\n\n## {page.name}\n"
-                md_content += (
-                    self._html_converter.convert_string(page.content).markdown.strip()
-                    + "\n\n"
-                )
-
-        return DocumentConverterResult(
-            title=None,
-            text_content=md_content.strip(),
-        )
--- a/packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
@@ -7,6 +7,7 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
 # Try loading optional (but in this case, required) dependencies
 # Save reporting of any exceptions for later
 _dependency_exc_info = None
+olefile = None
 try:
    import olefile
 except ImportError:
@@ -48,7 +49,7 @@ class OutlookMsgConverter(DocumentConverter):
        # Brute force, check if we have an OLE file
        cur_pos = file_stream.tell()
        try:
-            if not olefile.isOleFile(file_stream):
+            if olefile and not olefile.isOleFile(file_stream):
                return False
        finally:
            file_stream.seek(cur_pos)
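The `olefile = None` guard means the sniffing check is skipped, rather than raising `NameError`, when the optional dependency is missing. A small sketch of the same pattern follows, using the standard-library `zipfile` module as a stand-in for the optional package; `optional_import` and `accepts` are hypothetical names.

```python
import importlib
import io
from types import ModuleType
from typing import BinaryIO, Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def accepts(checker: Optional[ModuleType], stream: BinaryIO) -> bool:
    """Reject only when the checker is available and says the stream is wrong."""
    cur_pos = stream.tell()
    try:
        if checker and not checker.is_zipfile(stream):
            return False
    finally:
        stream.seek(cur_pos)
    return True


available = optional_import("zipfile")  # stand-in for an installed dependency
missing = optional_import("no_such_module_for_illustration")  # stand-in for a missing one

print(accepts(available, io.BytesIO(b"hello")))  # checker present: content is inspected
print(accepts(missing, io.BytesIO(b"hello")))    # checker absent: check is skipped
```

Skipping the check when the module is absent lets the later conversion step report the missing dependency.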
--- a/packages/markitdown/src/markitdown/converters/_pptx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_pptx_converter.py
@@ -6,6 +6,7 @@ import re
 import html

 from typing import BinaryIO, Any
+from operator import attrgetter

 from ._html_converter import HtmlConverter
 from ._llm_caption import llm_caption
@@ -160,10 +161,12 @@ class PptxConverter(DocumentConverter):

                # Group Shapes
                if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
-                    for subshape in shape.shapes:
+                    sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
+                    for subshape in sorted_shapes:
                        get_shape_content(subshape, **kwargs)

-            for shape in slide.shapes:
+            sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
+            for shape in sorted_shapes:
                get_shape_content(shape, **kwargs)

            md_content = md_content.strip()
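The new ordering relies on `attrgetter("top", "left")` returning a `(top, left)` tuple, so shapes are compared by vertical position first and horizontal position second. A minimal sketch with a hypothetical `Box` dataclass standing in for a slide shape:

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Box:
    """Hypothetical stand-in for a slide shape with a position."""

    name: str
    top: int
    left: int


shapes = [
    Box("footer", top=900, left=0),
    Box("right column", top=200, left=500),
    Box("title", top=0, left=100),
    Box("left column", top=200, left=50),
]

# Tuples compare element by element: top decides, left breaks ties.
reading_order = [s.name for s in sorted(shapes, key=attrgetter("top", "left"))]
print(reading_order)
# → ['title', 'left column', 'right column', 'footer']
```

One caveat of this approach in general: if either attribute could ever be `None`, the tuple comparison would raise `TypeError`, so a key function that substitutes a default would be needed in that case.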
--- a/packages/markitdown/tests/test_files/test.one
+++ b/packages/markitdown/tests/test_files/test.one
--- a/packages/markitdown/tests/test_markitdown.py
+++ b/packages/markitdown/tests/test_markitdown.py
@@ -7,8 +7,6 @@ import openai
 import pytest
 import requests

-import warnings
-
 from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
@@ -517,19 +515,6 @@ def test_exceptions() -> None:
    reason="do not run if exiftool is not installed",
 )
 def test_markitdown_exiftool() -> None:
-    # Test the automatic discovery of exiftool throws a warning
-    # and is disabled
-    try:
-        warnings.simplefilter("default")
-        with warnings.catch_warnings(record=True) as w:
-            markitdown = MarkItDown()
-            result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
-            assert len(w) == 1
-            assert w[0].category is DeprecationWarning
-            assert result.text_content.strip() == ""
-    finally:
-        warnings.resetwarnings()
-
    which_exiftool = shutil.which("exiftool")
    assert which_exiftool is not None
Author	SHA1	Message	Date
Adam Fourney	f17bc21c9d	If files use zip packaging, be smarter about inspecting their types.	2025-03-07 23:06:56 -08:00
afourney	99d8e562db	Fix exiftool in well-known paths. (#1106)	2025-03-07 21:47:20 -08:00
Sebastian Yaghoubi	515fa854bf	feat(docker): improve dockerfile build (#220) * refactor(docker): remove unnecessary root user The USER root directive isn't needed directly after FROM Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): use generic nobody nogroup default instead of uid gid Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): build app from source locally instead of installing package Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): use correct files in dockerignore Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * chore(docker): dont install recommended packages with git Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): run apt as non-interactive Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * Update Dockerfile to new package structure, and fix streaming bugs. --------- Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> Co-authored-by: afourney <adamfo@microsoft.com>	2025-03-07 20:07:40 -08:00
Richard Ye	0229ff6cb7	feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order (#1104) * Sort PPTX shapes to be read in top-to-bottom, left-to-right order Referenced from `39bef65b31/pptx2md/parser.py (L249)` * Update README.md * Fixed formatting. * Added missing import	2025-03-07 15:45:14 -08:00