Bump version and resolve a console encoding error. (#1149 )

Add support for preserving base64 encoded images (#1140 )
* optional reserve base64 string in markdown _CustomMarkdownify and pptx * add other converter para support * fix linter * Use *kwarg to pass keep_data_uri para. * Add module cli vector tests * Fixed formatting, and adjusted tests.
2025-03-21 09:27:25 -07:00 · 2025-03-20 18:50:23 -07:00 · 2025-03-20 12:25:56 -07:00 · 2025-03-19 22:09:14 -07:00 · 2025-03-19 20:46:35 -07:00 · 2025-03-17 07:48:15 -07:00
32 changed files with 1413 additions and 710 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -1 +1,2 @@
-*
+*
+!packages/
--- a/30
+++ b/30
@@ -1,22 +1,32 @@
 FROM python:3.13-slim-bullseye

-USER root
-
-ARG INSTALL_GIT=false
-RUN if [ "$INSTALL_GIT" = "true" ]; then \
-    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
-    fi
+ENV DEBIAN_FRONTEND=noninteractive
+ENV EXIFTOOL_PATH=/usr/bin/exiftool
+ENV FFMPEG_PATH=/usr/bin/ffmpeg

 # Runtime dependency
 RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
-    && rm -rf /var/lib/apt/lists/*
+    exiftool

-RUN pip install markitdown
+ARG INSTALL_GIT=false
+RUN if [ "$INSTALL_GIT" = "true" ]; then \
+    apt-get install -y --no-install-recommends \
+    git; \
+    fi
+
+# Cleanup
+RUN rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+COPY . /app
+RUN pip --no-cache-dir install \
+    /app/packages/markitdown[all] \
+    /app/packages/markitdown-sample-plugin

 # Default USERID and GROUPID
-ARG USERID=10000
-ARG GROUPID=10000
+ARG USERID=nobody
+ARG GROUPID=nogroup

 USER $USERID:$GROUPID

--- a/README.md
+++ b/README.md
@@ -5,8 +5,8 @@
 [![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)

 > [!IMPORTANT]
-> Breaking changes between 0.0.1 to 0.0.2:
-> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install markitdown[all]` to have backward-compatible behavior. 
+> Breaking changes between 0.0.1 to 0.1.0:
+> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]~=0.1.0a1'` to have backward-compatible behavior. 
 > * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.

 MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
@@ -23,6 +23,7 @@ At present, MarkItDown supports:
 - Text-based formats (CSV, JSON, XML)
 - ZIP files (iterates over contents)
 - Youtube URLs
+- EPubs
 - ... and more!

 ## Why Markdown?
@@ -36,7 +37,7 @@ are also highly token-efficient.

 ## Installation

-To install MarkItDown, use pip: `pip install markitdown[all]`. Alternatively, you can install it from the source:
+To install MarkItDown, use pip: `pip install 'markitdown[all]~=0.1.0a1'`. Alternatively, you can install it from the source:

 ```bash
 git clone git@github.com:microsoft/markitdown.git
--- a/packages/markitdown/pyproject.toml
+++ b/packages/markitdown/pyproject.toml
@@ -27,8 +27,7 @@ dependencies = [
  "beautifulsoup4",
  "requests",
  "markdownify",
-  "puremagic",
-  "pathvalidate",
+  "magika~=0.6.1",
  "charset-normalizer",
 ]

@@ -43,7 +42,7 @@ all = [
  "olefile",
  "pydub",
  "SpeechRecognition",
-  "youtube-transcript-api",
+  "youtube-transcript-api~=1.0.0",
  "azure-ai-documentintelligence",
  "azure-identity"
 ]
--- a/packages/markitdown/src/markitdown/about.py
+++ b/packages/markitdown/src/markitdown/about.py
@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.0a1"
+__version__ = "0.1.0a6"
--- a/packages/markitdown/src/markitdown/main.py
+++ b/packages/markitdown/src/markitdown/main.py
@@ -3,10 +3,12 @@
 # SPDX-License-Identifier: MIT
 import argparse
 import sys
+import codecs
+import locale
 from textwrap import dedent
 from importlib.metadata import entry_points
 from .__about__ import __version__
-from ._markitdown import MarkItDown, DocumentConverterResult
+from ._markitdown import MarkItDown, StreamInfo, DocumentConverterResult


 def main():
@@ -58,6 +60,24 @@ def main():
        help="Output file name. If not provided, output is written to stdout.",
    )

+    parser.add_argument(
+        "-x",
+        "--extension",
+        help="Provide a hint about the file extension (e.g., when reading from stdin).",
+    )
+
+    parser.add_argument(
+        "-m",
+        "--mime-type",
+        help="Provide a hint about the file's MIME type.",
+    )
+
+    parser.add_argument(
+        "-c",
+        "--charset",
+        help="Provide a hint about the file's charset (e.g, UTF-8).",
+    )
+
    parser.add_argument(
        "-d",
        "--use-docintel",
@@ -85,9 +105,57 @@ def main():
        help="List installed 3rd-party plugins. Plugins are loaded when using the -p or --use-plugin option.",
    )

+    parser.add_argument(
+        "--keep-data-uris",
+        action="store_true",
+        help="Keep data URIs (like base64-encoded images) in the output. By default, data URIs are truncated.",
+    )
+
    parser.add_argument("filename", nargs="?")
    args = parser.parse_args()

+    # Parse the extension hint
+    extension_hint = args.extension
+    if extension_hint is not None:
+        extension_hint = extension_hint.strip().lower()
+        if len(extension_hint) > 0:
+            if not extension_hint.startswith("."):
+                extension_hint = "." + extension_hint
+        else:
+            extension_hint = None
+
+    # Parse the mime type
+    mime_type_hint = args.mime_type
+    if mime_type_hint is not None:
+        mime_type_hint = mime_type_hint.strip()
+        if len(mime_type_hint) > 0:
+            if mime_type_hint.count("/") != 1:
+                _exit_with_error(f"Invalid MIME type: {mime_type_hint}")
+        else:
+            mime_type_hint = None
+
+    # Parse the charset
+    charset_hint = args.charset
+    if charset_hint is not None:
+        charset_hint = charset_hint.strip()
+        if len(charset_hint) > 0:
+            try:
+                charset_hint = codecs.lookup(charset_hint).name
+            except LookupError:
+                _exit_with_error(f"Invalid charset: {charset_hint}")
+        else:
+            charset_hint = None
+
+    stream_info = None
+    if (
+        extension_hint is not None
+        or mime_type_hint is not None
+        or charset_hint is not None
+    ):
+        stream_info = StreamInfo(
+            extension=extension_hint, mimetype=mime_type_hint, charset=charset_hint
+        )
+
    if args.list_plugins:
        # List installed plugins, then exit
        print("Installed MarkItDown 3rd-party Plugins:\n")
@@ -107,11 +175,12 @@ def main():

    if args.use_docintel:
        if args.endpoint is None:
-            raise ValueError(
+            _exit_with_error(
                "Document Intelligence Endpoint is required when using Document Intelligence."
            )
        elif args.filename is None:
-            raise ValueError("Filename is required when using Document Intelligence.")
+            _exit_with_error("Filename is required when using Document Intelligence.")
+
        markitdown = MarkItDown(
            enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
        )
@@ -119,9 +188,15 @@ def main():
        markitdown = MarkItDown(enable_plugins=args.use_plugins)

    if args.filename is None:
-        result = markitdown.convert_stream(sys.stdin.buffer)
+        result = markitdown.convert_stream(
+            sys.stdin.buffer,
+            stream_info=stream_info,
+            keep_data_uris=args.keep_data_uris,
+        )
    else:
-        result = markitdown.convert(args.filename)
+        result = markitdown.convert(
+            args.filename, stream_info=stream_info, keep_data_uris=args.keep_data_uris
+        )

    _handle_output(args, result)

@@ -130,9 +205,19 @@ def _handle_output(args, result: DocumentConverterResult):
    """Handle output to stdout or file"""
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
-            f.write(result.text_content)
+            f.write(result.markdown)
    else:
-        print(result.text_content)
+        # Handle stdout encoding errors more gracefully
+        print(
+            result.markdown.encode(sys.stdout.encoding, errors="replace").decode(
+                sys.stdout.encoding
+            )
+        )
+
+
+def _exit_with_error(message: str):
+    print(message)
+    sys.exit(1)


 if __name__ == "__main__":
--- a/packages/markitdown/src/markitdown/_exceptions.py
+++ b/packages/markitdown/src/markitdown/_exceptions.py
@@ -69,7 +69,7 @@ class FileConversionException(MarkItDownException):
                message = f"File conversion failed after {len(attempts)} attempts:\n"
                for attempt in attempts:
                    if attempt.exc_info is None:
-                        message += " -  {type(attempt.converter).__name__} provided no execution info."
+                        message += f" -  {type(attempt.converter).__name__} provided no execution info."
                    else:
                        message += f" - {type(attempt.converter).__name__} threw {attempt.exc_info[0].__name__} with message: {attempt.exc_info[1]}\n"

--- a/packages/markitdown/src/markitdown/_markitdown.py
+++ b/packages/markitdown/src/markitdown/_markitdown.py
@@ -3,6 +3,7 @@ import mimetypes
 import os
 import re
 import sys
+import shutil
 import tempfile
 import warnings
 import traceback
@@ -13,12 +14,12 @@ from typing import Any, List, Optional, Union, BinaryIO
 from pathlib import Path
 from urllib.parse import urlparse
 from warnings import warn
-
-# File-format detection
-import puremagic
 import requests
+import magika
+import charset_normalizer
+import codecs

-from ._stream_info import StreamInfo, _guess_stream_info_from_stream
+from ._stream_info import StreamInfo

 from .converters import (
    PlainTextConverter,
@@ -37,6 +38,7 @@ from .converters import (
    AudioConverter,
    OutlookMsgConverter,
    ZipConverter,
+    EpubConverter,
    DocumentIntelligenceConverter,
 )

@@ -109,6 +111,8 @@ class MarkItDown:
        else:
            self._requests_session = requests_session

+        self._magika = magika.Magika()
+
        # TODO - remove these (see enable_builtins)
        self._llm_client: Any = None
        self._llm_model: Union[str | None] = None
@@ -138,9 +142,31 @@ class MarkItDown:
            self._llm_model = kwargs.get("llm_model")
            self._exiftool_path = kwargs.get("exiftool_path")
            self._style_map = kwargs.get("style_map")
+
            if self._exiftool_path is None:
                self._exiftool_path = os.getenv("EXIFTOOL_PATH")

+            # Still none? Check well-known paths
+            if self._exiftool_path is None:
+                candidate = shutil.which("exiftool")
+                if candidate:
+                    candidate = os.path.abspath(candidate)
+                    if any(
+                        d == os.path.dirname(candidate)
+                        for d in [
+                            "/usr/bin",
+                            "/usr/local/bin",
+                            "/opt",
+                            "/opt/bin",
+                            "/opt/local/bin",
+                            "/opt/homebrew/bin",
+                            "C:\\Windows\\System32",
+                            "C:\\Program Files",
+                            "C:\\Program Files (x86)",
+                        ]
+                    ):
+                        self._exiftool_path = candidate
+
            # Register converters for successful browsing operations
            # Later registrations are tried first / take higher priority than earlier registrations
            # To this end, the most specific converters should appear below the most generic converters
@@ -166,6 +192,7 @@ class MarkItDown:
            self.register_converter(IpynbConverter())
            self.register_converter(PdfConverter())
            self.register_converter(OutlookMsgConverter())
+            self.register_converter(EpubConverter())

            # Register Document Intelligence converter at the top of the stack if endpoint is provided
            docintel_endpoint = kwargs.get("docintel_endpoint")
@@ -219,7 +246,14 @@ class MarkItDown:
                or source.startswith("https://")
                or source.startswith("file://")
            ):
-                return self.convert_url(source, **kwargs)
+                # Rename the url argument to mock_url
+                # (Deprecated -- use stream_info)
+                _kwargs = {k: v for k, v in kwargs.items()}
+                if "url" in _kwargs:
+                    _kwargs["mock_url"] = _kwargs["url"]
+                    del _kwargs["url"]
+
+                return self.convert_url(source, stream_info=stream_info, **_kwargs)
            else:
                return self.convert_local(source, stream_info=stream_info, **kwargs)
        # Path object
@@ -227,14 +261,14 @@ class MarkItDown:
            return self.convert_local(source, stream_info=stream_info, **kwargs)
        # Request response
        elif isinstance(source, requests.Response):
-            return self.convert_response(source, **kwargs)
+            return self.convert_response(source, stream_info=stream_info, **kwargs)
        # Binary stream
        elif (
            hasattr(source, "read")
            and callable(source.read)
            and not isinstance(source, io.TextIOBase)
        ):
-            return self.convert_stream(source, **kwargs)
+            return self.convert_stream(source, stream_info=stream_info, **kwargs)
        else:
            raise TypeError(
                f"Invalid source type: {type(source)}. Expected str, requests.Response, BinaryIO."
@@ -253,33 +287,28 @@ class MarkItDown:
            path = str(path)

        # Build a base StreamInfo object from which to start guesses
-        base_stream_info = StreamInfo(
+        base_guess = StreamInfo(
            local_path=path,
            extension=os.path.splitext(path)[1],
            filename=os.path.basename(path),
        )

-        # Extend the base_stream_info with any additional info from the arguments
+        # Extend the base_guess with any additional info from the arguments
        if stream_info is not None:
-            base_stream_info = base_stream_info.copy_and_update(stream_info)
+            base_guess = base_guess.copy_and_update(stream_info)

        if file_extension is not None:
            # Deprecated -- use stream_info
-            base_stream_info = base_stream_info.copy_and_update(
-                extension=file_extension
-            )
+            base_guess = base_guess.copy_and_update(extension=file_extension)

        if url is not None:
            # Deprecated -- use stream_info
-            base_stream_info = base_stream_info.copy_and_update(url=url)
+            base_guess = base_guess.copy_and_update(url=url)

        with open(path, "rb") as fh:
-            # Prepare a list of configurations to try, starting with the base_stream_info
-            guesses: List[StreamInfo] = [base_stream_info]
-            for guess in _guess_stream_info_from_stream(
-                file_stream=fh, filename_hint=path
-            ):
-                guesses.append(base_stream_info.copy_and_update(guess))
+            guesses = self._get_stream_info_guesses(
+                file_stream=fh, base_guess=base_guess
+            )
            return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)

    def convert_stream(
@@ -312,37 +341,44 @@ class MarkItDown:
                assert base_guess is not None  # for mypy
                base_guess = base_guess.copy_and_update(url=url)

-        # Append the base guess, if it's non-trivial
-        if base_guess is not None:
-            if base_guess.mimetype is not None or base_guess.extension is not None:
-                guesses.append(base_guess)
-        else:
-            # Create a base guess with no information
-            base_guess = StreamInfo()
-
-        # Create a placeholder filename to help with guessing
-        placeholder_filename = None
-        if base_guess.filename is not None:
-            placeholder_filename = base_guess.filename
-        elif base_guess.extension is not None:
-            placeholder_filename = "placeholder" + base_guess.extension
+        # Check if we have a seekable stream. If not, load the entire stream into memory.
+        if not stream.seekable():
+            buffer = io.BytesIO()
+            while True:
+                chunk = stream.read(4096)
+                if not chunk:
+                    break
+                buffer.write(chunk)
+            buffer.seek(0)
+            stream = buffer

        # Add guesses based on stream content
-        for guess in _guess_stream_info_from_stream(
-            file_stream=stream, filename_hint=placeholder_filename
-        ):
-            guesses.append(base_guess.copy_and_update(guess))
-
-        # Perform the conversion
+        guesses = self._get_stream_info_guesses(
+            file_stream=stream, base_guess=base_guess or StreamInfo()
+        )
        return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)

    def convert_url(
-        self, url: str, **kwargs: Any
+        self,
+        url: str,
+        *,
+        stream_info: Optional[StreamInfo] = None,
+        file_extension: Optional[str] = None,  # Deprecated -- use stream_info
+        mock_url: Optional[
+            str
+        ] = None,  # Mock the request as if it came from a different URL
+        **kwargs: Any,
    ) -> DocumentConverterResult:  # TODO: fix kwargs type
        # Send a HTTP request to the URL
        response = self._requests_session.get(url, stream=True)
        response.raise_for_status()
-        return self.convert_response(response, **kwargs)
+        return self.convert_response(
+            response,
+            stream_info=stream_info,
+            file_extension=file_extension,
+            url=mock_url,
+            **kwargs,
+        )

    def convert_response(
        self,
@@ -404,31 +440,16 @@ class MarkItDown:
            # Deprecated -- use stream_info
            base_guess = base_guess.copy_and_update(url=url)

-        # Add the guess if its non-trivial
-        guesses: List[StreamInfo] = []
-        if base_guess.mimetype is not None or base_guess.extension is not None:
-            guesses.append(base_guess)
-
        # Read into BytesIO
        buffer = io.BytesIO()
        for chunk in response.iter_content(chunk_size=512):
            buffer.write(chunk)
        buffer.seek(0)

-        # Create a placeholder filename to help with guessing
-        placeholder_filename = None
-        if base_guess.filename is not None:
-            placeholder_filename = base_guess.filename
-        elif base_guess.extension is not None:
-            placeholder_filename = "placeholder" + base_guess.extension
-
-        # Add guesses based on stream content
-        for guess in _guess_stream_info_from_stream(
-            file_stream=buffer, filename_hint=placeholder_filename
-        ):
-            guesses.append(base_guess.copy_and_update(guess))
-
        # Convert
+        guesses = self._get_stream_info_guesses(
+            file_stream=buffer, base_guess=base_guess
+        )
        return self._convert(file_stream=buffer, stream_info_guesses=guesses, **kwargs)

    def _convert(
@@ -455,7 +476,7 @@ class MarkItDown:
                    cur_pos == file_stream.tell()
                ), f"File stream position should NOT change between guess iterations"

-                _kwargs = copy.deepcopy(kwargs)
+                _kwargs = {k: v for k, v in kwargs.items()}

                # Copy any additional global options
                if "llm_client" not in _kwargs and self._llm_client is not None:
@@ -562,3 +583,115 @@ class MarkItDown:
        self._converters.insert(
            0, ConverterRegistration(converter=converter, priority=priority)
        )
+
+    def _get_stream_info_guesses(
+        self, file_stream: BinaryIO, base_guess: StreamInfo
+    ) -> List[StreamInfo]:
+        """
+        Given a base guess, attempt to guess or expand on the stream info using the stream content (via magika).
+        """
+        guesses: List[StreamInfo] = []
+
+        # Enhance the base guess with information based on the extension or mimetype
+        enhanced_guess = base_guess.copy_and_update()
+
+        # If there's an extension and no mimetype, try to guess the mimetype
+        if base_guess.mimetype is None and base_guess.extension is not None:
+            _m, _ = mimetypes.guess_type(
+                "placeholder" + base_guess.extension, strict=False
+            )
+            if _m is not None:
+                enhanced_guess = enhanced_guess.copy_and_update(mimetype=_m)
+
+        # If there's a mimetype and no extension, try to guess the extension
+        if base_guess.mimetype is not None and base_guess.extension is None:
+            _e = mimetypes.guess_all_extensions(base_guess.mimetype, strict=False)
+            if len(_e) > 0:
+                enhanced_guess = enhanced_guess.copy_and_update(extension=_e[0])
+
+        # Call magika to guess from the stream
+        cur_pos = file_stream.tell()
+        try:
+            result = self._magika.identify_stream(file_stream)
+            if result.status == "ok" and result.prediction.output.label != "unknown":
+                # If it's text, also guess the charset
+                charset = None
+                if result.prediction.output.is_text:
+                    # Read the first 4k to guess the charset
+                    file_stream.seek(cur_pos)
+                    stream_page = file_stream.read(4096)
+                    charset_result = charset_normalizer.from_bytes(stream_page).best()
+
+                    if charset_result is not None:
+                        charset = self._normalize_charset(charset_result.encoding)
+
+                # Normalize the first extension listed
+                guessed_extension = None
+                if len(result.prediction.output.extensions) > 0:
+                    guessed_extension = "." + result.prediction.output.extensions[0]
+
+                # Determine if the guess is compatible with the base guess
+                compatible = True
+                if (
+                    base_guess.mimetype is not None
+                    and base_guess.mimetype != result.prediction.output.mime_type
+                ):
+                    compatible = False
+
+                if (
+                    base_guess.extension is not None
+                    and base_guess.extension.lstrip(".")
+                    not in result.prediction.output.extensions
+                ):
+                    compatible = False
+
+                if (
+                    base_guess.charset is not None
+                    and self._normalize_charset(base_guess.charset) != charset
+                ):
+                    compatible = False
+
+                if compatible:
+                    # Add the compatible base guess
+                    guesses.append(
+                        StreamInfo(
+                            mimetype=base_guess.mimetype
+                            or result.prediction.output.mime_type,
+                            extension=base_guess.extension or guessed_extension,
+                            charset=base_guess.charset or charset,
+                            filename=base_guess.filename,
+                            local_path=base_guess.local_path,
+                            url=base_guess.url,
+                        )
+                    )
+                else:
+                    # The magika guess was incompatible with the base guess, so add both guesses
+                    guesses.append(enhanced_guess)
+                    guesses.append(
+                        StreamInfo(
+                            mimetype=result.prediction.output.mime_type,
+                            extension=guessed_extension,
+                            charset=charset,
+                            filename=base_guess.filename,
+                            local_path=base_guess.local_path,
+                            url=base_guess.url,
+                        )
+                    )
+            else:
+                # There were no other guesses, so just add the base guess
+                guesses.append(enhanced_guess)
+        finally:
+            file_stream.seek(cur_pos)
+
+        return guesses
+
+    def _normalize_charset(self, charset: str | None) -> str | None:
+        """
+        Normalize a charset string to a canonical form.
+        """
+        if charset is None:
+            return None
+        try:
+            return codecs.lookup(charset).name
+        except LookupError:
+            return charset
--- a/packages/markitdown/src/markitdown/_stream_info.py
+++ b/packages/markitdown/src/markitdown/_stream_info.py
@@ -1,14 +1,5 @@
-import puremagic
-import mimetypes
-import os
 from dataclasses import dataclass, asdict
-from typing import Optional, BinaryIO, List, TypeVar, Type
-
-# Mimetype substitutions table
-MIMETYPE_SUBSTITUTIONS = {
-    "application/excel": "application/vnd.ms-excel",
-    "application/mspowerpoint": "application/vnd.ms-powerpoint",
-}
+from typing import Optional


@dataclass(kw_only=True, frozen=True)
@@ -39,84 +30,3 @@ class StreamInfo:
            new_info.update(kwargs)

        return StreamInfo(**new_info)
-
-
-# Behavior subject to change.
-# Do not rely on this outside of this module.
-def _guess_stream_info_from_stream(
-    file_stream: BinaryIO,
-    *,
-    filename_hint: Optional[str] = None,
-) -> List[StreamInfo]:
-    """
-    Guess StreamInfo properties (mostly mimetype and extension) from a stream.
-
-    Args:
-    - stream: The stream to guess the StreamInfo from.
-    - filename_hint [Optional]: A filename hint to help with the guessing (may be a placeholder, and not actually be the file name)
-
-    Returns a list of StreamInfo objects in order of confidence.
-    """
-    guesses: List[StreamInfo] = []
-
-    # Add a guess purely based on the filename hint
-    if filename_hint:
-        try:
-            # Requires Python 3.13+
-            mimetype, _ = mimetypes.guess_file_type(filename_hint)  # type: ignore
-        except AttributeError:
-            mimetype, _ = mimetypes.guess_type(filename_hint)
-
-        if mimetype:
-            guesses.append(
-                StreamInfo(
-                    mimetype=mimetype, extension=os.path.splitext(filename_hint)[1]
-                )
-            )
-
-    def _puremagic(
-        file_stream, filename_hint
-    ) -> List[puremagic.main.PureMagicWithConfidence]:
-        """Wrap guesses to handle exceptions."""
-        try:
-            return puremagic.magic_stream(file_stream, filename=filename_hint)
-        except puremagic.main.PureError as e:
-            return []
-
-    cur_pos = file_stream.tell()
-    type_guesses = _puremagic(file_stream, filename_hint=filename_hint)
-    if len(type_guesses) == 0:
-        # Fix for: https://github.com/microsoft/markitdown/issues/222
-        # If there are no guesses, then try again after trimming leading ASCII whitespaces.
-        # ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
-        # (space, tab, newline, carriage return, vertical tab, form feed).
-
-        # Eat all the leading whitespace
-        file_stream.seek(cur_pos)
-        while True:
-            char = file_stream.read(1)
-            if not char:  # End of file
-                break
-            if not char.isspace():
-                file_stream.seek(file_stream.tell() - 1)
-                break
-
-        # Try again
-        type_guesses = _puremagic(file_stream, filename_hint=filename_hint)
-    file_stream.seek(cur_pos)
-
-    # Convert and return the guesses
-    for guess in type_guesses:
-        kwargs: dict[str, str] = {}
-        if guess.extension:
-            kwargs["extension"] = guess.extension
-        if guess.mime_type:
-            kwargs["mimetype"] = MIMETYPE_SUBSTITUTIONS.get(
-                guess.mime_type, guess.mime_type
-            )
-        if len(kwargs) > 0:
-            # We don't add the filename_hint, because sometimes it's just a placeholder,
-            # and, in any case, doesn't add new information.
-            guesses.append(StreamInfo(**kwargs))
-
-    return guesses
--- a/packages/markitdown/src/markitdown/converters/init.py
+++ b/packages/markitdown/src/markitdown/converters/init.py
@@ -18,6 +18,7 @@ from ._audio_converter import AudioConverter
 from ._outlook_msg_converter import OutlookMsgConverter
 from ._zip_converter import ZipConverter
 from ._doc_intel_converter import DocumentIntelligenceConverter
+from ._epub_converter import EpubConverter

 __all__ = [
    "PlainTextConverter",
@@ -37,4 +38,5 @@ __all__ = [
    "OutlookMsgConverter",
    "ZipConverter",
    "DocumentIntelligenceConverter",
+    "EpubConverter",
 ]
--- a/packages/markitdown/src/markitdown/converters/_bing_serp_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_bing_serp_converter.py
@@ -1,6 +1,7 @@
 import io
 import re
 import base64
+import binascii
 from urllib.parse import parse_qs, urlparse
 from typing import Any, BinaryIO, Optional
 from bs4 import BeautifulSoup
@@ -60,6 +61,8 @@ class BingSerpConverter(DocumentConverter):
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
+        assert stream_info.url is not None
+
        # Parse the query parameters
        parsed_params = parse_qs(urlparse(stream_info.url).query)
        query = parsed_params.get("q", [""])[0]
@@ -76,9 +79,12 @@ class BingSerpConverter(DocumentConverter):
            slug.extract()

        # Parse the algorithmic results
-        _markdownify = _CustomMarkdownify()
+        _markdownify = _CustomMarkdownify(**kwargs)
        results = list()
        for result in soup.find_all(class_="b_algo"):
+            if not hasattr(result, "find_all"):
+                continue
+
            # Rewrite redirect urls
            for a in result.find_all("a", href=True):
                parsed_href = urlparse(a["href"])
--- a/packages/markitdown/src/markitdown/converters/_docx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_docx_converter.py
@@ -73,5 +73,5 @@ class DocxConverter(HtmlConverter):

        style_map = kwargs.get("style_map", None)
        return self._html_converter.convert_string(
-            mammoth.convert_to_html(file_stream, style_map=style_map).value
+            mammoth.convert_to_html(file_stream, style_map=style_map).value, **kwargs
        )
--- a/packages/markitdown/src/markitdown/converters/_epub_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_epub_converter.py
@@ -0,0 +1,147 @@
+import os
+import zipfile
+import xml.dom.minidom as minidom
+
+from typing import BinaryIO, Any, Dict, List
+
+from ._html_converter import HtmlConverter
+from .._base_converter import DocumentConverter, DocumentConverterResult
+from .._stream_info import StreamInfo
+
+ACCEPTED_MIME_TYPE_PREFIXES = [
+    "application/epub",
+    "application/epub+zip",
+    "application/x-epub+zip",
+]
+
+ACCEPTED_FILE_EXTENSIONS = [".epub"]
+
+MIME_TYPE_MAPPING = {
+    ".html": "text/html",
+    ".xhtml": "application/xhtml+xml",
+}
+
+
+class EpubConverter(HtmlConverter):
+    """
+    Converts EPUB files to Markdown. Style information (e.g.m headings) and tables are preserved where possible.
+    """
+
+    def __init__(self):
+        super().__init__()
+        self._html_converter = HtmlConverter()
+
+    def accepts(
+        self,
+        file_stream: BinaryIO,
+        stream_info: StreamInfo,
+        **kwargs: Any,  # Options to pass to the converter
+    ) -> bool:
+        mimetype = (stream_info.mimetype or "").lower()
+        extension = (stream_info.extension or "").lower()
+
+        if extension in ACCEPTED_FILE_EXTENSIONS:
+            return True
+
+        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
+            if mimetype.startswith(prefix):
+                return True
+
+        return False
+
+    def convert(
+        self,
+        file_stream: BinaryIO,
+        stream_info: StreamInfo,
+        **kwargs: Any,  # Options to pass to the converter
+    ) -> DocumentConverterResult:
+        with zipfile.ZipFile(file_stream, "r") as z:
+            # Extracts metadata (title, authors, language, publisher, date, description, cover) from an EPUB file."""
+
+            # Locate content.opf
+            container_dom = minidom.parse(z.open("META-INF/container.xml"))
+            opf_path = container_dom.getElementsByTagName("rootfile")[0].getAttribute(
+                "full-path"
+            )
+
+            # Parse content.opf
+            opf_dom = minidom.parse(z.open(opf_path))
+            metadata: Dict[str, Any] = {
+                "title": self._get_text_from_node(opf_dom, "dc:title"),
+                "authors": self._get_all_texts_from_nodes(opf_dom, "dc:creator"),
+                "language": self._get_text_from_node(opf_dom, "dc:language"),
+                "publisher": self._get_text_from_node(opf_dom, "dc:publisher"),
+                "date": self._get_text_from_node(opf_dom, "dc:date"),
+                "description": self._get_text_from_node(opf_dom, "dc:description"),
+                "identifier": self._get_text_from_node(opf_dom, "dc:identifier"),
+            }
+
+            # Extract manifest items (ID → href mapping)
+            manifest = {
+                item.getAttribute("id"): item.getAttribute("href")
+                for item in opf_dom.getElementsByTagName("item")
+            }
+
+            # Extract spine order (ID refs)
+            spine_items = opf_dom.getElementsByTagName("itemref")
+            spine_order = [item.getAttribute("idref") for item in spine_items]
+
+            # Convert spine order to actual file paths
+            base_path = "/".join(
+                opf_path.split("/")[:-1]
+            )  # Get base directory of content.opf
+            spine = [
+                f"{base_path}/{manifest[item_id]}" if base_path else manifest[item_id]
+                for item_id in spine_order
+                if item_id in manifest
+            ]
+
+            # Extract and convert the content
+            markdown_content: List[str] = []
+            for file in spine:
+                if file in z.namelist():
+                    with z.open(file) as f:
+                        filename = os.path.basename(file)
+                        extension = os.path.splitext(filename)[1].lower()
+                        mimetype = MIME_TYPE_MAPPING.get(extension)
+                        converted_content = self._html_converter.convert(
+                            f,
+                            StreamInfo(
+                                mimetype=mimetype,
+                                extension=extension,
+                                filename=filename,
+                            ),
+                        )
+                        markdown_content.append(converted_content.markdown.strip())
+
+            # Format and add the metadata
+            metadata_markdown = []
+            for key, value in metadata.items():
+                if isinstance(value, list):
+                    value = ", ".join(value)
+                if value:
+                    metadata_markdown.append(f"**{key.capitalize()}:** {value}")
+
+            markdown_content.insert(0, "\n".join(metadata_markdown))
+
+            return DocumentConverterResult(
+                markdown="\n\n".join(markdown_content), title=metadata["title"]
+            )
+
+    def _get_text_from_node(self, dom: minidom.Document, tag_name: str) -> str | None:
+        """Convenience function to extract a single occurrence of a tag (e.g., title)."""
+        texts = self._get_all_texts_from_nodes(dom, tag_name)
+        if len(texts) > 0:
+            return texts[0]
+        else:
+            return None
+
+    def _get_all_texts_from_nodes(
+        self, dom: minidom.Document, tag_name: str
+    ) -> List[str]:
+        """Helper function to extract all occurrences of a tag (e.g., multiple authors)."""
+        texts: List[str] = []
+        for node in dom.getElementsByTagName(tag_name):
+            if node.firstChild and hasattr(node.firstChild, "nodeValue"):
+                texts.append(node.firstChild.nodeValue.strip())
+        return texts
--- a/packages/markitdown/src/markitdown/converters/_exiftool.py
+++ b/packages/markitdown/src/markitdown/converters/_exiftool.py
@@ -5,26 +5,16 @@ import sys
 import shutil
 import os
 import warnings
-from typing import BinaryIO, Optional, Any
+from typing import BinaryIO, Any, Union


 def exiftool_metadata(
-    file_stream: BinaryIO, *, exiftool_path: Optional[str] = None
+    file_stream: BinaryIO,
+    *,
+    exiftool_path: Union[str, None],
 ) -> Any:  # Need a better type for json data
-    # Check if we have a valid pointer to exiftool
+    # Nothing to do
    if not exiftool_path:
-        which_exiftool = shutil.which("exiftool")
-        if which_exiftool:
-            warnings.warn(
-                f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown consructor. E.g., 
-
-    md = MarkItDown(exiftool_path="{which_exiftool}")
-
-This warning will be removed in future releases.
-""",
-                DeprecationWarning,
-            )
-        # Nothing to do
        return {}

    # Run exiftool
--- a/packages/markitdown/src/markitdown/converters/_html_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_html_converter.py
@@ -56,9 +56,9 @@ class HtmlConverter(DocumentConverter):
        body_elm = soup.find("body")
        webpage_text = ""
        if body_elm:
-            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
+            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(body_elm)
        else:
-            webpage_text = _CustomMarkdownify().convert_soup(soup)
+            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)

        assert isinstance(webpage_text, str)

--- a/packages/markitdown/src/markitdown/converters/_markdownify.py
+++ b/packages/markitdown/src/markitdown/converters/_markdownify.py
@@ -17,6 +17,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):

    def __init__(self, **options: Any):
        options["heading_style"] = options.get("heading_style", markdownify.ATX)
+        options["keep_data_uris"] = options.get("keep_data_uris", False)
        # Explicitly cast options to the expected type if necessary
        super().__init__(**options)

@@ -101,7 +102,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
            return alt

        # Remove dataURIs
-        if src.startswith("data:"):
+        if src.startswith("data:") and not self.options["keep_data_uris"]:
            src = src.split(",")[0] + "..."

        return "![%s](%s%s)" % (alt, src, title_part)
--- a/packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
@@ -7,8 +7,9 @@ from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
 # Try loading optional (but in this case, required) dependencies
 # Save reporting of any exceptions for later
 _dependency_exc_info = None
+olefile = None
 try:
-    import olefile
+    import olefile  # type: ignore[no-redef]
 except ImportError:
    # Preserve the error and stack trace for later
    _dependency_exc_info = sys.exc_info()
@@ -48,19 +49,20 @@ class OutlookMsgConverter(DocumentConverter):
        # Brute force, check if we have an OLE file
        cur_pos = file_stream.tell()
        try:
-            if not olefile.isOleFile(file_stream):
+            if olefile and not olefile.isOleFile(file_stream):
                return False
        finally:
            file_stream.seek(cur_pos)

        # Brue force, check if it's an Outlook file
        try:
-            msg = olefile.OleFileIO(file_stream)
-            toc = "\n".join([str(stream) for stream in msg.listdir()])
-            return (
-                "__properties_version1.0" in toc
-                and "__recip_version1.0_#00000000" in toc
-            )
+            if olefile is not None:
+                msg = olefile.OleFileIO(file_stream)
+                toc = "\n".join([str(stream) for stream in msg.listdir()])
+                return (
+                    "__properties_version1.0" in toc
+                    and "__recip_version1.0_#00000000" in toc
+                )
        except Exception as e:
            pass
        finally:
@@ -88,7 +90,11 @@ class OutlookMsgConverter(DocumentConverter):
                _dependency_exc_info[2]
            )

+        assert (
+            olefile is not None
+        )  # If we made it this far, olefile should be available
        msg = olefile.OleFileIO(file_stream)
+
        # Extract email metadata
        md_content = "# Email Message\n\n"

@@ -120,6 +126,7 @@ class OutlookMsgConverter(DocumentConverter):

    def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]:
        """Helper to safely extract and decode stream data from the MSG file."""
+        assert olefile is not None
        assert isinstance(
            msg, olefile.OleFileIO
        )  # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)
--- a/packages/markitdown/src/markitdown/converters/_plain_text_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_plain_text_converter.py
@@ -17,12 +17,16 @@ except ImportError:
 ACCEPTED_MIME_TYPE_PREFIXES = [
    "text/",
    "application/json",
+    "application/markdown",
 ]

-# Mimetypes to ignore (commonly confused extensions)
-IGNORE_MIME_TYPE_PREFIXES = [
-    "text/vnd.in3d.spot",  # .spo wich is confused with xls, doc, etc.
-    "text/vnd.graphviz",  # .dot which is confused with xls, doc, etc.
+ACCEPTED_FILE_EXTENSIONS = [
+    ".txt",
+    ".text",
+    ".md",
+    ".markdown",
+    ".json",
+    ".jsonl",
 ]


@@ -38,9 +42,14 @@ class PlainTextConverter(DocumentConverter):
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()

-        for prefix in IGNORE_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return False
+        # If we have a charset, we can safely assume it's text
+        # With Magika in the earlier stages, this handles most cases
+        if stream_info.charset is not None:
+            return True
+
+        # Otherwise, check the mimetype and extension
+        if extension in ACCEPTED_FILE_EXTENSIONS:
+            return True

        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
            if mimetype.startswith(prefix):
--- a/packages/markitdown/src/markitdown/converters/_pptx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_pptx_converter.py
@@ -6,6 +6,7 @@ import re
 import html

 from typing import BinaryIO, Any
+from operator import attrgetter

 from ._html_converter import HtmlConverter
 from ._llm_caption import llm_caption
@@ -139,13 +140,20 @@ class PptxConverter(DocumentConverter):
                    alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
                    alt_text = re.sub(r"\s+", " ", alt_text).strip()

-                    # A placeholder name
-                    filename = re.sub(r"\W", "", shape.name) + ".jpg"
-                    md_content += "\n![" + alt_text + "](" + filename + ")\n"
+                    # If keep_data_uris is True, use base64 encoding for images
+                    if kwargs.get("keep_data_uris", False):
+                        blob = shape.image.blob
+                        content_type = shape.image.content_type or "image/png"
+                        b64_string = base64.b64encode(blob).decode("utf-8")
+                        md_content += f"\n![{alt_text}](data:{content_type};base64,{b64_string})\n"
+                    else:
+                        # A placeholder name
+                        filename = re.sub(r"\W", "", shape.name) + ".jpg"
+                        md_content += "\n![" + alt_text + "](" + filename + ")\n"

                # Tables
                if self._is_table(shape):
-                    md_content += self._convert_table_to_markdown(shape.table)
+                    md_content += self._convert_table_to_markdown(shape.table, **kwargs)

                # Charts
                if shape.has_chart:
@@ -160,10 +168,12 @@ class PptxConverter(DocumentConverter):

                # Group Shapes
                if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
-                    for subshape in shape.shapes:
+                    sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
+                    for subshape in sorted_shapes:
                        get_shape_content(subshape, **kwargs)

-            for shape in slide.shapes:
+            sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
+            for shape in sorted_shapes:
                get_shape_content(shape, **kwargs)

            md_content = md_content.strip()
@@ -190,7 +200,7 @@ class PptxConverter(DocumentConverter):
            return True
        return False

-    def _convert_table_to_markdown(self, table):
+    def _convert_table_to_markdown(self, table, **kwargs):
        # Write the table as HTML, then convert it to Markdown
        html_table = "<html><body><table>"
        first_row = True
@@ -205,27 +215,38 @@ class PptxConverter(DocumentConverter):
            first_row = False
        html_table += "</table></body></html>"

-        return self._html_converter.convert_string(html_table).markdown.strip() + "\n"
+        return (
+            self._html_converter.convert_string(html_table, **kwargs).markdown.strip()
+            + "\n"
+        )

    def _convert_chart_to_markdown(self, chart):
-        md = "\n\n### Chart"
-        if chart.has_title:
-            md += f": {chart.chart_title.text_frame.text}"
-        md += "\n\n"
-        data = []
-        category_names = [c.label for c in chart.plots[0].categories]
-        series_names = [s.name for s in chart.series]
-        data.append(["Category"] + series_names)
+        try:
+            md = "\n\n### Chart"
+            if chart.has_title:
+                md += f": {chart.chart_title.text_frame.text}"
+            md += "\n\n"
+            data = []
+            category_names = [c.label for c in chart.plots[0].categories]
+            series_names = [s.name for s in chart.series]
+            data.append(["Category"] + series_names)

-        for idx, category in enumerate(category_names):
-            row = [category]
-            for series in chart.series:
-                row.append(series.values[idx])
-            data.append(row)
+            for idx, category in enumerate(category_names):
+                row = [category]
+                for series in chart.series:
+                    row.append(series.values[idx])
+                data.append(row)

-        markdown_table = []
-        for row in data:
-            markdown_table.append("| " + " | ".join(map(str, row)) + " |")
-        header = markdown_table[0]
-        separator = "|" + "|".join(["---"] * len(data[0])) + "|"
-        return md + "\n".join([header, separator] + markdown_table[1:])
+            markdown_table = []
+            for row in data:
+                markdown_table.append("| " + " | ".join(map(str, row)) + " |")
+            header = markdown_table[0]
+            separator = "|" + "|".join(["---"] * len(data[0])) + "|"
+            return md + "\n".join([header, separator] + markdown_table[1:])
+        except ValueError as e:
+            # Handle the specific error for unsupported chart types
+            if "unsupported plot type" in str(e):
+                return "\n\n[unsupported chart]\n\n"
+        except Exception:
+            # Catch any other exceptions that might occur
+            return "\n\n[unsupported chart]\n\n"
--- a/packages/markitdown/src/markitdown/converters/_rss_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_rss_converter.py
@@ -8,7 +8,9 @@ from .._base_converter import DocumentConverter, DocumentConverterResult

 PRECISE_MIME_TYPE_PREFIXES = [
    "application/rss",
+    "application/rss+xml",
    "application/atom",
+    "application/atom+xml",
 ]

 PRECISE_FILE_EXTENSIONS = [".rss", ".atom"]
@@ -26,6 +28,10 @@ CANDIDATE_FILE_EXTENSIONS = [
 class RssConverter(DocumentConverter):
    """Convert RSS / Atom type to markdown"""

+    def __init__(self):
+        super().__init__()
+        self._kwargs = {}
+
    def accepts(
        self,
        file_stream: BinaryIO,
@@ -64,7 +70,7 @@ class RssConverter(DocumentConverter):
            file_stream.seek(cur_pos)
        return False

-    def _feed_type(self, doc: Any) -> str:
+    def _feed_type(self, doc: Any) -> str | None:
        if doc.getElementsByTagName("rss"):
            return "rss"
        elif doc.getElementsByTagName("feed"):
@@ -80,6 +86,7 @@ class RssConverter(DocumentConverter):
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
+        self._kwargs = kwargs
        doc = minidom.parse(file_stream)
        feed_type = self._feed_type(doc)

@@ -128,10 +135,10 @@ class RssConverter(DocumentConverter):
        Returns None if the feed type is not recognized or something goes wrong.
        """
        root = doc.getElementsByTagName("rss")[0]
-        channel = root.getElementsByTagName("channel")
-        if not channel:
-            return None
-        channel = channel[0]
+        channel_list = root.getElementsByTagName("channel")
+        if not channel_list:
+            raise ValueError("No channel found in RSS feed")
+        channel = channel_list[0]
        channel_title = self._get_data_by_tag_name(channel, "title")
        channel_description = self._get_data_by_tag_name(channel, "description")
        items = channel.getElementsByTagName("item")
@@ -139,8 +146,6 @@ class RssConverter(DocumentConverter):
            md_text = f"# {channel_title}\n"
        if channel_description:
            md_text += f"{channel_description}\n"
-        if not items:
-            items = []
        for item in items:
            title = self._get_data_by_tag_name(item, "title")
            description = self._get_data_by_tag_name(item, "description")
@@ -166,7 +171,7 @@ class RssConverter(DocumentConverter):
        try:
            # using bs4 because many RSS feeds have HTML-styled content
            soup = BeautifulSoup(content, "html.parser")
-            return _CustomMarkdownify().convert_soup(soup)
+            return _CustomMarkdownify(**self._kwargs).convert_soup(soup)
        except BaseException as _:
            return content

@@ -181,5 +186,6 @@ class RssConverter(DocumentConverter):
            return None
        fc = nodes[0].firstChild
        if fc:
-            return fc.data
+            if hasattr(fc, "data"):
+                return fc.data
        return None
--- a/packages/markitdown/src/markitdown/converters/_transcribe_audio.py
+++ b/packages/markitdown/src/markitdown/converters/_transcribe_audio.py
@@ -7,8 +7,14 @@ from .._exceptions import MissingDependencyException
 # Save reporting of any exceptions for later
 _dependency_exc_info = None
 try:
-    import speech_recognition as sr
-    import pydub
+    # Suppress some warnings on library import
+    import warnings
+
+    with warnings.catch_warnings():
+        warnings.filterwarnings("ignore", category=DeprecationWarning)
+        warnings.filterwarnings("ignore", category=SyntaxWarning)
+        import speech_recognition as sr
+        import pydub
 except ImportError:
    # Preserve the error and stack trace for later
    _dependency_exc_info = sys.exc_info()
--- a/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_wikipedia_converter.py
@@ -1,7 +1,7 @@
 import io
 import re
+import bs4
 from typing import Any, BinaryIO, Optional
-from bs4 import BeautifulSoup

 from .._base_converter import DocumentConverter, DocumentConverterResult
 from .._stream_info import StreamInfo
@@ -57,7 +57,7 @@ class WikipediaConverter(DocumentConverter):
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Remove javascript and style blocks
        for script in soup(["script", "style"]):
@@ -72,16 +72,15 @@ class WikipediaConverter(DocumentConverter):

        if body_elm:
            # What's the title
-            if title_elm and len(title_elm) > 0:
-                main_title = title_elm.string  # type: ignore
-                assert isinstance(main_title, str)
+            if title_elm and isinstance(title_elm, bs4.Tag):
+                main_title = title_elm.string

            # Convert the page
-            webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify().convert_soup(
-                body_elm
-            )
+            webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify(
+                **kwargs
+            ).convert_soup(body_elm)
        else:
-            webpage_text = _CustomMarkdownify().convert_soup(soup)
+            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)

        return DocumentConverterResult(
            markdown=webpage_text,
--- a/packages/markitdown/src/markitdown/converters/_xlsx_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_xlsx_converter.py
@@ -86,7 +86,9 @@ class XlsxConverter(DocumentConverter):
            md_content += f"## {s}\n"
            html_content = sheets[s].to_html(index=False)
            md_content += (
-                self._html_converter.convert_string(html_content).markdown.strip()
+                self._html_converter.convert_string(
+                    html_content, **kwargs
+                ).markdown.strip()
                + "\n\n"
            )

@@ -146,7 +148,9 @@ class XlsConverter(DocumentConverter):
            md_content += f"## {s}\n"
            html_content = sheets[s].to_html(index=False)
            md_content += (
-                self._html_converter.convert_string(html_content).markdown.strip()
+                self._html_converter.convert_string(
+                    html_content, **kwargs
+                ).markdown.strip()
                + "\n\n"
            )

--- a/packages/markitdown/src/markitdown/converters/_youtube_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_youtube_converter.py
@@ -3,17 +3,22 @@ import json
 import time
 import io
 import re
+import bs4
 from typing import Any, BinaryIO, Optional, Dict, List, Union
 from urllib.parse import parse_qs, urlparse, unquote
-from bs4 import BeautifulSoup

 from .._base_converter import DocumentConverter, DocumentConverterResult
 from .._stream_info import StreamInfo
-from ._markdownify import _CustomMarkdownify

 # Optional YouTube transcription support
 try:
-    from youtube_transcript_api import YouTubeTranscriptApi
+    # Suppress some warnings on library import
+    import warnings
+
+    with warnings.catch_warnings():
+        warnings.filterwarnings("ignore", category=SyntaxWarning)
+        # Patch submitted upstream to fix the SyntaxWarning
+        from youtube_transcript_api import YouTubeTranscriptApi

    IS_YOUTUBE_TRANSCRIPT_CAPABLE = True
 except ModuleNotFoundError:
@@ -72,21 +77,31 @@ class YouTubeConverter(DocumentConverter):
    ) -> DocumentConverterResult:
        # Parse the stream
        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)

        # Read the meta tags
-        metadata: Dict[str, str] = {"title": soup.title.string}
+        metadata: Dict[str, str] = {}
+
+        if soup.title and soup.title.string:
+            metadata["title"] = soup.title.string
+
        for meta in soup(["meta"]):
+            if not isinstance(meta, bs4.Tag):
+                continue
+
            for a in meta.attrs:
                if a in ["itemprop", "property", "name"]:
-                    content = meta.get("content", "")
-                    if content:  # Only add non-empty content
-                        metadata[meta[a]] = content
+                    key = str(meta.get(a, ""))
+                    content = str(meta.get("content", ""))
+                    if key and content:  # Only add non-empty content
+                        metadata[key] = content
                    break

        # Try reading the description
        try:
            for script in soup(["script"]):
+                if not isinstance(script, bs4.Tag):
+                    continue
                if not script.string:  # Skip empty scripts
                    continue
                content = script.string
@@ -132,6 +147,7 @@ class YouTubeConverter(DocumentConverter):
            webpage_text += f"\n### Description\n{description}\n"

        if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
+            ytt_api = YouTubeTranscriptApi()
            transcript_text = ""
            parsed_url = urlparse(stream_info.url)  # type: ignore
            params = parse_qs(parsed_url.query)  # type: ignore
@@ -143,7 +159,7 @@ class YouTubeConverter(DocumentConverter):
                    )
                    # Retry the transcript fetching operation
                    transcript = self._retry_operation(
-                        lambda: YouTubeTranscriptApi.get_transcript(
+                        lambda: ytt_api.fetch(
                            video_id, languages=youtube_transcript_languages
                        ),
                        retries=3,  # Retry 3 times
@@ -151,17 +167,14 @@ class YouTubeConverter(DocumentConverter):
                    )
                    if transcript:
                        transcript_text = " ".join(
-                            [part["text"] for part in transcript]
+                            [part.text for part in transcript]
                        )  # type: ignore
-                    # Alternative formatting:
-                    # formatter = TextFormatter()
-                    # formatter.format_transcript(transcript)
                except Exception as e:
                    print(f"Error fetching transcript: {e}")
            if transcript_text:
                webpage_text += f"\n### Transcript\n{transcript_text}\n"

-        title = title if title else soup.title.string
+        title = title if title else (soup.title.string if soup.title else "")
        assert isinstance(title, str)

        return DocumentConverterResult(
--- a/packages/markitdown/tests/_test_vectors.py
+++ b/packages/markitdown/tests/_test_vectors.py
@@ -0,0 +1,278 @@
+import dataclasses
+from typing import List
+
+
+@dataclasses.dataclass(frozen=True, kw_only=True)
+class FileTestVector(object):
+    filename: str
+    mimetype: str | None
+    charset: str | None
+    url: str | None
+    must_include: List[str]
+    must_not_include: List[str]
+
+
+GENERAL_TEST_VECTORS = [
+    FileTestVector(
+        filename="test.docx",
+        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+        charset=None,
+        url=None,
+        must_include=[
+            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
+            "49e168b7-d2ae-407f-a055-2167576f39a1",
+            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
+            "# Abstract",
+            "# Introduction",
+            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
+            "data:image/png;base64...",
+        ],
+        must_not_include=[
+            "data:image/png;base64,iVBORw0KGgoAAAANSU",
+        ],
+    ),
+    FileTestVector(
+        filename="test.xlsx",
+        mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+        charset=None,
+        url=None,
+        must_include=[
+            "## 09060124-b5e7-4717-9d07-3c046eb",
+            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
+            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test.xls",
+        mimetype="application/vnd.ms-excel",
+        charset=None,
+        url=None,
+        must_include=[
+            "## 09060124-b5e7-4717-9d07-3c046eb",
+            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
+            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test.pptx",
+        mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
+        charset=None,
+        url=None,
+        must_include=[
+            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
+            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
+            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
+            "1b92870d-e3b5-4e65-8153-919f4ff45592",
+            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
+            "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
+            "2003",  # chart value
+            "![This phrase of the caption is Human-written.](Picture4.jpg)",
+        ],
+        must_not_include=["data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE"],
+    ),
+    FileTestVector(
+        filename="test_outlook_msg.msg",
+        mimetype="application/vnd.ms-outlook",
+        charset=None,
+        url=None,
+        must_include=[
+            "# Email Message",
+            "**From:** test.sender@example.com",
+            "**To:** test.recipient@example.com",
+            "**Subject:** Test Email Message",
+            "## Content",
+            "This is the body of the test email message",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test.pdf",
+        mimetype="application/pdf",
+        charset=None,
+        url=None,
+        must_include=[
+            "While there is contemporaneous exploration of multi-agent approaches"
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test_blog.html",
+        mimetype="text/html",
+        charset="utf-8",
+        url="https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math",
+        must_include=[
+            "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
+            "an example where high cost can easily prevent a generic complex",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test_wikipedia.html",
+        mimetype="text/html",
+        charset="utf-8",
+        url="https://en.wikipedia.org/wiki/Microsoft",
+        must_include=[
+            "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
+            'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
+        ],
+        must_not_include=[
+            "You are encouraged to create an account and log in",
+            "154 languages",
+            "move to sidebar",
+        ],
+    ),
+    FileTestVector(
+        filename="test_serp.html",
+        mimetype="text/html",
+        charset="utf-8",
+        url="https://www.bing.com/search?q=microsoft+wikipedia",
+        must_include=[
+            "](https://en.wikipedia.org/wiki/Microsoft",
+            "Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
+            "1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
+        ],
+        must_not_include=[
+            "https://www.bing.com/ck/a?!&&p=",
+            "data:image/svg+xml,%3Csvg%20width%3D",
+        ],
+    ),
+    FileTestVector(
+        filename="test_mskanji.csv",
+        mimetype="text/csv",
+        charset="cp932",
+        url=None,
+        must_include=[
+            "名前,年齢,住所",
+            "佐藤太郎,30,東京",
+            "三木英子,25,大阪",
+            "髙橋淳,35,名古屋",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test.json",
+        mimetype="application/json",
+        charset="ascii",
+        url=None,
+        must_include=[
+            "5b64c88c-b3c3-4510-bcb8-da0b200602d8",
+            "9700dc99-6685-40b4-9a3a-5e406dcb37f3",
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test_rss.xml",
+        mimetype="text/xml",
+        charset="utf-8",
+        url=None,
+        must_include=[
+            "# The Official Microsoft Blog",
+            "## Ignite 2024: Why nearly 70% of the Fortune 500 now use Microsoft 365 Copilot",
+            "In the case of AI, it is absolutely true that the industry is moving incredibly fast",
+        ],
+        must_not_include=["<rss", "<feed"],
+    ),
+    FileTestVector(
+        filename="test_notebook.ipynb",
+        mimetype="application/json",
+        charset="ascii",
+        url=None,
+        must_include=[
+            "# Test Notebook",
+            "```python",
+            'print("markitdown")',
+            "```",
+            "## Code Cell Below",
+        ],
+        must_not_include=[
+            "nbformat",
+            "nbformat_minor",
+        ],
+    ),
+    FileTestVector(
+        filename="test_files.zip",
+        mimetype="application/zip",
+        charset=None,
+        url=None,
+        must_include=[
+            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
+            "49e168b7-d2ae-407f-a055-2167576f39a1",
+            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
+            "# Abstract",
+            "# Introduction",
+            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
+            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
+            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
+            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
+            "1b92870d-e3b5-4e65-8153-919f4ff45592",
+            "## 09060124-b5e7-4717-9d07-3c046eb",
+            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
+            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
+            "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
+            'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
+        ],
+        must_not_include=[],
+    ),
+    FileTestVector(
+        filename="test.epub",
+        mimetype="application/epub+zip",
+        charset=None,
+        url=None,
+        must_include=[
+            "**Authors:** Test Author",
+            "A test EPUB document for MarkItDown testing",
+            "# Chapter 1: Test Content",
+            "This is a **test** paragraph with some formatting",
+            "* A bullet point",
+            "* Another point",
+            "# Chapter 2: More Content",
+            "*different* style",
+            "> This is a blockquote for testing",
+        ],
+        must_not_include=[],
+    ),
+]
+
+
+DATA_URI_TEST_VECTORS = [
+    FileTestVector(
+        filename="test.docx",
+        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+        charset=None,
+        url=None,
+        must_include=[
+            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
+            "49e168b7-d2ae-407f-a055-2167576f39a1",
+            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
+            "# Abstract",
+            "# Introduction",
+            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
+            "data:image/png;base64,iVBORw0KGgoAAAANSU",
+        ],
+        must_not_include=[
+            "data:image/png;base64...",
+        ],
+    ),
+    FileTestVector(
+        filename="test.pptx",
+        mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
+        charset=None,
+        url=None,
+        must_include=[
+            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
+            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
+            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
+            "1b92870d-e3b5-4e65-8153-919f4ff45592",
+            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
+            "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
+            "2003",  # chart value
+            "![This phrase of the caption is Human-written.]",  # image caption
+            "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE",
+        ],
+        must_not_include=[
+            "![This phrase of the caption is Human-written.](Picture4.jpg)",
+        ],
+    ),
+]
--- a/packages/markitdown/tests/test_cli.py
+++ b/packages/markitdown/tests/test_cli.py
@@ -1,119 +0,0 @@
-#!/usr/bin/env python3 -m pytest
-import os
-import subprocess
-import pytest
-from markitdown import __version__
-
-try:
-    from .test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
-except ImportError:
-    from test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS  # type: ignore
-
-
-@pytest.fixture(scope="session")
-def shared_tmp_dir(tmp_path_factory):
-    return tmp_path_factory.mktemp("pytest_tmp")
-
-
-def test_version(shared_tmp_dir) -> None:
-    result = subprocess.run(
-        ["python", "-m", "markitdown", "--version"], capture_output=True, text=True
-    )
-
-    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
-    assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"
-
-
-def test_invalid_flag(shared_tmp_dir) -> None:
-    result = subprocess.run(
-        ["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
-    )
-
-    assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
-    assert (
-        "unrecognized arguments" in result.stderr
-    ), f"Expected 'unrecognized arguments' to appear in STDERR"
-    assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
-
-
-def test_output_to_stdout(shared_tmp_dir) -> None:
-    # DOC X
-    result = subprocess.run(
-        ["python", "-m", "markitdown", os.path.join(TEST_FILES_DIR, "test.docx")],
-        capture_output=True,
-        text=True,
-    )
-
-    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
-    for test_string in DOCX_TEST_STRINGS:
-        assert (
-            test_string in result.stdout
-        ), f"Expected string not found in output: {test_string}"
-
-
-def test_output_to_file(shared_tmp_dir) -> None:
-    # DOC X, flag -o at the end
-    docx_output_file_1 = os.path.join(shared_tmp_dir, "test_docx_1.md")
-    result = subprocess.run(
-        [
-            "python",
-            "-m",
-            "markitdown",
-            os.path.join(TEST_FILES_DIR, "test.docx"),
-            "-o",
-            docx_output_file_1,
-        ],
-        capture_output=True,
-        text=True,
-    )
-
-    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
-    assert os.path.exists(
-        docx_output_file_1
-    ), f"Output file not created: {docx_output_file_1}"
-
-    with open(docx_output_file_1, "r") as f:
-        output = f.read()
-        for test_string in DOCX_TEST_STRINGS:
-            assert (
-                test_string in output
-            ), f"Expected string not found in output: {test_string}"
-
-    # DOC X, flag -o at the beginning
-    docx_output_file_2 = os.path.join(shared_tmp_dir, "test_docx_2.md")
-    result = subprocess.run(
-        [
-            "python",
-            "-m",
-            "markitdown",
-            "-o",
-            docx_output_file_2,
-            os.path.join(TEST_FILES_DIR, "test.docx"),
-        ],
-        capture_output=True,
-        text=True,
-    )
-
-    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
-    assert os.path.exists(
-        docx_output_file_2
-    ), f"Output file not created: {docx_output_file_2}"
-
-    with open(docx_output_file_2, "r") as f:
-        output = f.read()
-        for test_string in DOCX_TEST_STRINGS:
-            assert (
-                test_string in output
-            ), f"Expected string not found in output: {test_string}"
-
-
-if __name__ == "__main__":
-    """Runs this file's tests from the command line."""
-    import tempfile
-
-    with tempfile.TemporaryDirectory() as tmp_dir:
-        test_version(tmp_dir)
-        test_invalid_flag(tmp_dir)
-        test_output_to_stdout(tmp_dir)
-        test_output_to_file(tmp_dir)
-    print("All tests passed!")
--- a/packages/markitdown/tests/test_cli_misc.py
+++ b/packages/markitdown/tests/test_cli_misc.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3 -m pytest
+import subprocess
+import pytest
+from markitdown import __version__
+
+# This file contains CLI tests that are not directly tested by the FileTestVectors.
+# This includes things like help messages, version numbers, and invalid flags.
+
+
+def test_version() -> None:
+    result = subprocess.run(
+        ["python", "-m", "markitdown", "--version"], capture_output=True, text=True
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"
+
+
+def test_invalid_flag() -> None:
+    result = subprocess.run(
+        ["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
+    )
+
+    assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
+    assert (
+        "unrecognized arguments" in result.stderr
+    ), f"Expected 'unrecognized arguments' to appear in STDERR"
+    assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
+
+
+if __name__ == "__main__":
+    """Runs this file's tests from the command line."""
+    test_version()
+    test_invalid_flag()
+    print("All tests passed!")
--- a/packages/markitdown/tests/test_cli_vectors.py
+++ b/packages/markitdown/tests/test_cli_vectors.py
@@ -0,0 +1,227 @@
+#!/usr/bin/env python3 -m pytest
+import os
+import time
+import pytest
+import subprocess
+import locale
+from typing import List
+
+if __name__ == "__main__":
+    from _test_vectors import (
+        GENERAL_TEST_VECTORS,
+        DATA_URI_TEST_VECTORS,
+        FileTestVector,
+    )
+else:
+    from ._test_vectors import (
+        GENERAL_TEST_VECTORS,
+        DATA_URI_TEST_VECTORS,
+        FileTestVector,
+    )
+
+from markitdown import (
+    MarkItDown,
+    UnsupportedFormatException,
+    FileConversionException,
+    StreamInfo,
+)
+
+skip_remote = (
+    True if os.environ.get("GITHUB_ACTIONS") else False
+)  # Don't run these tests in CI
+
+TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
+TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"
+
+
+# Prepare CLI test vectors (remove vectors that require mockig the url)
+CLI_TEST_VECTORS: List[FileTestVector] = []
+for test_vector in GENERAL_TEST_VECTORS:
+    if test_vector.url is not None:
+        continue
+    CLI_TEST_VECTORS.append(test_vector)
+
+
+@pytest.fixture(scope="session")
+def shared_tmp_dir(tmp_path_factory):
+    return tmp_path_factory.mktemp("pytest_tmp")
+
+
+@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
+def test_output_to_stdout(shared_tmp_dir, test_vector) -> None:
+    """Test that the CLI outputs to stdout correctly."""
+
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            os.path.join(TEST_FILES_DIR, test_vector.filename),
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    for test_string in test_vector.must_include:
+        assert test_string in result.stdout
+    for test_string in test_vector.must_not_include:
+        assert test_string not in result.stdout
+
+
+@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
+def test_output_to_file(shared_tmp_dir, test_vector) -> None:
+    """Test that the CLI outputs to a file correctly."""
+
+    output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            "-o",
+            output_file,
+            os.path.join(TEST_FILES_DIR, test_vector.filename),
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert os.path.exists(output_file), f"Output file not created: {output_file}"
+
+    with open(output_file, "r") as f:
+        output_data = f.read()
+        for test_string in test_vector.must_include:
+            assert test_string in output_data
+        for test_string in test_vector.must_not_include:
+            assert test_string not in output_data
+
+    os.remove(output_file)
+    assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"
+
+
+@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
+def test_input_from_stdin_without_hints(shared_tmp_dir, test_vector) -> None:
+    """Test that the CLI readds from stdin correctly."""
+
+    test_input = b""
+    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
+        test_input = stream.read()
+
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            os.path.join(TEST_FILES_DIR, test_vector.filename),
+        ],
+        input=test_input,
+        capture_output=True,
+        text=False,
+    )
+
+    stdout = result.stdout.decode(locale.getpreferredencoding())
+    assert (
+        result.returncode == 0
+    ), f"CLI exited with error: {result.stderr.decode('utf-8')}"
+    for test_string in test_vector.must_include:
+        assert test_string in stdout
+    for test_string in test_vector.must_not_include:
+        assert test_string not in stdout
+
+
+@pytest.mark.skipif(
+    skip_remote,
+    reason="do not run tests that query external urls",
+)
+@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
+def test_convert_url(shared_tmp_dir, test_vector):
+    """Test the conversion of a stream with no stream info."""
+    # Note: tmp_dir is not used here, but is needed to match the signature
+
+    markitdown = MarkItDown()
+
+    time.sleep(1)  # Ensure we don't hit rate limits
+    result = subprocess.run(
+        ["python", "-m", "markitdown", TEST_FILES_URL + "/" + test_vector.filename],
+        capture_output=True,
+        text=False,
+    )
+
+    stdout = result.stdout.decode(locale.getpreferredencoding())
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    for test_string in test_vector.must_include:
+        assert test_string in stdout
+    for test_string in test_vector.must_not_include:
+        assert test_string not in stdout
+
+
+@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
+def test_output_to_file_with_data_uris(shared_tmp_dir, test_vector) -> None:
+    """Test CLI functionality when keep_data_uris is enabled"""
+
+    output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            "--keep-data-uris",
+            "-o",
+            output_file,
+            os.path.join(TEST_FILES_DIR, test_vector.filename),
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert os.path.exists(output_file), f"Output file not created: {output_file}"
+
+    with open(output_file, "r") as f:
+        output_data = f.read()
+        for test_string in test_vector.must_include:
+            assert test_string in output_data
+        for test_string in test_vector.must_not_include:
+            assert test_string not in output_data
+
+    os.remove(output_file)
+    assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"
+
+
+if __name__ == "__main__":
+    import sys
+    import tempfile
+
+    """Runs this file's tests from the command line."""
+
+    with tempfile.TemporaryDirectory() as tmp_dir:
+        # General tests
+        for test_function in [
+            test_output_to_stdout,
+            test_output_to_file,
+            test_input_from_stdin_without_hints,
+            test_convert_url,
+        ]:
+            for test_vector in CLI_TEST_VECTORS:
+                print(
+                    f"Running {test_function.__name__} on {test_vector.filename}...",
+                    end="",
+                )
+                test_function(tmp_dir, test_vector)
+                print("OK")
+
+        # Data URI tests
+        for test_function in [
+            test_output_to_file_with_data_uris,
+        ]:
+            for test_vector in DATA_URI_TEST_VECTORS:
+                print(
+                    f"Running {test_function.__name__} on {test_vector.filename}...",
+                    end="",
+                )
+                test_function(tmp_dir, test_vector)
+                print("OK")
+
+    print("All tests passed!")
--- a/packages/markitdown/tests/test_files/test.docx
+++ b/packages/markitdown/tests/test_files/test.docx
--- a/packages/markitdown/tests/test_files/test.epub
+++ b/packages/markitdown/tests/test_files/test.epub
--- a/packages/markitdown/tests/test_module_misc.py
+++ b/packages/markitdown/tests/test_module_misc.py
@@ -3,11 +3,7 @@ import io
 import os
 import shutil
 import openai
-
 import pytest
-import requests
-
-import warnings

 from markitdown import (
    MarkItDown,
@@ -15,7 +11,10 @@ from markitdown import (
    FileConversionException,
    StreamInfo,
 )
-from markitdown._stream_info import _guess_stream_info_from_stream
+
+# This file contains module tests that are not directly tested by the FileTestVectors.
+# This includes things like helper functions and runtime conversion options
+# (e.g., LLM clients, exiftool path, transcription services, etc.)

 skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
@@ -62,36 +61,6 @@ YOUTUBE_TEST_STRINGS = [
    "the model we're going to be using today is GPT 3.5 turbo",  # From the transcript
 ]

-XLSX_TEST_STRINGS = [
-    "## 09060124-b5e7-4717-9d07-3c046eb",
-    "6ff4173b-42a5-4784-9b19-f49caff4d93d",
-    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
-]
-
-XLS_TEST_STRINGS = [
-    "## 09060124-b5e7-4717-9d07-3c046eb",
-    "6ff4173b-42a5-4784-9b19-f49caff4d93d",
-    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
-]
-
-DOCX_TEST_STRINGS = [
-    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
-    "49e168b7-d2ae-407f-a055-2167576f39a1",
-    "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
-    "# Abstract",
-    "# Introduction",
-    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-]
-
-MSG_TEST_STRINGS = [
-    "# Email Message",
-    "**From:** test.sender@example.com",
-    "**To:** test.recipient@example.com",
-    "**Subject:** Test Email Message",
-    "## Content",
-    "This is the body of the test email message",
-]
-
 DOCX_COMMENT_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -103,6 +72,16 @@ DOCX_COMMENT_TEST_STRINGS = [
    "Yet another comment in the doc. 55yiyi-asd09",
 ]

+BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
+BLOG_TEST_STRINGS = [
+    "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
+    "an example where high cost can easily prevent a generic complex",
+]
+
+LLM_TEST_STRINGS = [
+    "5bda1dd6",
+]
+
 PPTX_TEST_STRINGS = [
    "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
    "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
@@ -113,57 +92,6 @@ PPTX_TEST_STRINGS = [
    "2003",  # chart value
 ]

-BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
-BLOG_TEST_STRINGS = [
-    "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
-    "an example where high cost can easily prevent a generic complex",
-]
-
-
-RSS_TEST_STRINGS = [
-    "The Official Microsoft Blog",
-    "In the case of AI, it is absolutely true that the industry is moving incredibly fast",
-]
-
-
-WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
-WIKIPEDIA_TEST_STRINGS = [
-    "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
-    'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
-]
-WIKIPEDIA_TEST_EXCLUDES = [
-    "You are encouraged to create an account and log in",
-    "154 languages",
-    "move to sidebar",
-]
-
-SERP_TEST_URL = "https://www.bing.com/search?q=microsoft+wikipedia"
-SERP_TEST_STRINGS = [
-    "](https://en.wikipedia.org/wiki/Microsoft",
-    "Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
-    "1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
-]
-SERP_TEST_EXCLUDES = [
-    "https://www.bing.com/ck/a?!&&p=",
-    "data:image/svg+xml,%3Csvg%20width%3D",
-]
-
-CSV_CP932_TEST_STRINGS = [
-    "名前,年齢,住所",
-    "佐藤太郎,30,東京",
-    "三木英子,25,大阪",
-    "髙橋淳,35,名古屋",
-]
-
-LLM_TEST_STRINGS = [
-    "5bda1dd6",
-]
-
-JSON_TEST_STRINGS = [
-    "5b64c88c-b3c3-4510-bcb8-da0b200602d8",
-    "9700dc99-6685-40b4-9a3a-5e406dcb37f3",
-]
-

 # --- Helper Functions ---
 def validate_strings(result, expected_strings, exclude_strings=None):
@@ -248,33 +176,29 @@ def test_stream_info_operations() -> None:
    assert updated_stream_info.url == "url.1"


-def test_stream_info_guesses() -> None:
-    """Test StreamInfo guesses based on stream content."""
+def test_docx_comments() -> None:
+    markitdown = MarkItDown()

-    test_tuples = [
-        (
-            os.path.join(TEST_FILES_DIR, "test.xlsx"),
-            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-        ),
-        (
-            os.path.join(TEST_FILES_DIR, "test.docx"),
-            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-        ),
-        (
-            os.path.join(TEST_FILES_DIR, "test.pptx"),
-            "application/vnd.openxmlformats-officedocument.presentationml.presentation",
-        ),
-        (os.path.join(TEST_FILES_DIR, "test.xls"), "application/vnd.ms-excel"),
-    ]
+    # Test DOCX processing, with comments and setting style_map on init
+    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
+    result = markitdown_with_style_map.convert(
+        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
+    )
+    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

-    for file_path, expected_mimetype in test_tuples:
-        with open(file_path, "rb") as f:
-            guesses = _guess_stream_info_from_stream(
-                f, filename_hint=os.path.basename(file_path)
-            )
-            assert len(guesses) > 0
-            assert guesses[0].mimetype == expected_mimetype
-            assert guesses[0].extension == os.path.splitext(file_path)[1]
+
+def test_input_as_strings() -> None:
+    markitdown = MarkItDown()
+
+    # Test input from a stream
+    input_data = b"<html><body><h1>Test</h1></body></html>"
+    result = markitdown.convert_stream(io.BytesIO(input_data))
+    assert "# Test" in result.text_content
+
+    # Test input with leading blank characters
+    input_data = b"   \n\n\n<html><body><h1>Test</h1></body></html>"
+    result = markitdown.convert_stream(io.BytesIO(input_data))
+    assert "# Test" in result.text_content


@pytest.mark.skipif(
@@ -289,194 +213,12 @@ def test_markitdown_remote() -> None:
    for test_string in PDF_TEST_STRINGS:
        assert test_string in result.text_content

-    # By stream
-    response = requests.get(PDF_TEST_URL)
-    result = markitdown.convert_stream(
-        io.BytesIO(response.content), file_extension=".pdf", url=PDF_TEST_URL
-    )
-    for test_string in PDF_TEST_STRINGS:
-        assert test_string in result.text_content
-
    # Youtube
    result = markitdown.convert(YOUTUBE_TEST_URL)
    for test_string in YOUTUBE_TEST_STRINGS:
        assert test_string in result.text_content


-def test_markitdown_local() -> None:
-    markitdown = MarkItDown()
-
-    # Test PDF processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pdf"))
-    validate_strings(result, PDF_TEST_STRINGS)
-
-    # Test XLSX processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
-    validate_strings(result, XLSX_TEST_STRINGS)
-
-    # Test XLS processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
-    for test_string in XLS_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
-
-    # Test DOCX processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
-    validate_strings(result, DOCX_TEST_STRINGS)
-
-    # Test DOCX processing, with comments
-    result = markitdown.convert(
-        os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
-        style_map="comment-reference => ",
-    )
-    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
-
-    # Test DOCX processing, with comments and setting style_map on init
-    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
-    result = markitdown_with_style_map.convert(
-        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
-    )
-    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
-
-    # Test PPTX processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
-    validate_strings(result, PPTX_TEST_STRINGS)
-
-    # Test HTML processing
-    result = markitdown.convert(
-        os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
-    )
-    validate_strings(result, BLOG_TEST_STRINGS)
-
-    # Test Wikipedia processing
-    result = markitdown.convert(
-        os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
-    )
-    text_content = result.text_content.replace("\\", "")
-    validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
-
-    # Test Bing processing
-    result = markitdown.convert(
-        os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
-    )
-    text_content = result.text_content.replace("\\", "")
-    validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
-
-    # Test RSS processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
-    text_content = result.text_content.replace("\\", "")
-    for test_string in RSS_TEST_STRINGS:
-        assert test_string in text_content
-
-    # Test MSG (Outlook email) processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
-    validate_strings(result, MSG_TEST_STRINGS)
-
-    # Test non-UTF-8 encoding
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
-    validate_strings(result, CSV_CP932_TEST_STRINGS)
-
-    # Test JSON processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
-    validate_strings(result, JSON_TEST_STRINGS)
-
-    # # Test ZIP file processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
-    validate_strings(result, DOCX_TEST_STRINGS)
-    validate_strings(result, XLSX_TEST_STRINGS)
-    validate_strings(result, BLOG_TEST_STRINGS)
-
-    # Test input from a stream
-    input_data = b"<html><body><h1>Test</h1></body></html>"
-    result = markitdown.convert_stream(io.BytesIO(input_data))
-    assert "# Test" in result.text_content
-
-    # Test input with leading blank characters
-    input_data = b"   \n\n\n<html><body><h1>Test</h1></body></html>"
-    result = markitdown.convert_stream(io.BytesIO(input_data))
-    assert "# Test" in result.text_content
-
-
-def test_markitdown_streams() -> None:
-    markitdown = MarkItDown()
-
-    # Test PDF processing
-    with open(os.path.join(TEST_FILES_DIR, "test.pdf"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".pdf")
-        validate_strings(result, PDF_TEST_STRINGS)
-
-    # Test XLSX processing
-    with open(os.path.join(TEST_FILES_DIR, "test.xlsx"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".xlsx")
-        validate_strings(result, XLSX_TEST_STRINGS)
-
-    # Test XLS processing
-    with open(os.path.join(TEST_FILES_DIR, "test.xls"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".xls")
-        for test_string in XLS_TEST_STRINGS:
-            text_content = result.text_content.replace("\\", "")
-            assert test_string in text_content
-
-    # Test DOCX processing
-    with open(os.path.join(TEST_FILES_DIR, "test.docx"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".docx")
-        validate_strings(result, DOCX_TEST_STRINGS)
-
-    # Test DOCX processing, with comments
-    with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
-        result = markitdown.convert(
-            f,
-            file_extension=".docx",
-            style_map="comment-reference => ",
-        )
-        validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
-
-    # Test DOCX processing, with comments and setting style_map on init
-    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
-    with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
-        result = markitdown_with_style_map.convert(f, file_extension=".docx")
-        validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
-
-    # Test PPTX processing
-    with open(os.path.join(TEST_FILES_DIR, "test.pptx"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".pptx")
-        validate_strings(result, PPTX_TEST_STRINGS)
-
-    # Test HTML processing
-    with open(os.path.join(TEST_FILES_DIR, "test_blog.html"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".html", url=BLOG_TEST_URL)
-        validate_strings(result, BLOG_TEST_STRINGS)
-
-    # Test Wikipedia processing
-    with open(os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".html", url=WIKIPEDIA_TEST_URL)
-        text_content = result.text_content.replace("\\", "")
-        validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
-
-    # Test Bing processing
-    with open(os.path.join(TEST_FILES_DIR, "test_serp.html"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".html", url=SERP_TEST_URL)
-        text_content = result.text_content.replace("\\", "")
-        validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
-
-    # Test RSS processing
-    with open(os.path.join(TEST_FILES_DIR, "test_rss.xml"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".xml")
-        text_content = result.text_content.replace("\\", "")
-        for test_string in RSS_TEST_STRINGS:
-            assert test_string in text_content
-
-    # Test MSG (Outlook email) processing
-    with open(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".msg")
-        validate_strings(result, MSG_TEST_STRINGS)
-
-    # Test JSON processing
-    with open(os.path.join(TEST_FILES_DIR, "test.json"), "rb") as f:
-        result = markitdown.convert(f, file_extension=".json")
-        validate_strings(result, JSON_TEST_STRINGS)
-
-
@pytest.mark.skipif(
    skip_remote,
    reason="do not run remotely run speech transcription tests",
@@ -517,19 +259,6 @@ def test_exceptions() -> None:
    reason="do not run if exiftool is not installed",
 )
 def test_markitdown_exiftool() -> None:
-    # Test the automatic discovery of exiftool throws a warning
-    # and is disabled
-    try:
-        warnings.simplefilter("default")
-        with warnings.catch_warnings(record=True) as w:
-            markitdown = MarkItDown()
-            result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
-            assert len(w) == 1
-            assert w[0].category is DeprecationWarning
-            assert result.text_content.strip() == ""
-    finally:
-        warnings.resetwarnings()
-
    which_exiftool = shutil.which("exiftool")
    assert which_exiftool is not None

@@ -583,13 +312,17 @@ def test_markitdown_llm() -> None:

 if __name__ == "__main__":
    """Runs this file's tests from the command line."""
-    test_stream_info_operations()
-    test_stream_info_guesses()
-    test_markitdown_remote()
-    test_markitdown_local()
-    test_markitdown_streams()
-    test_speech_transcription()
-    test_exceptions()
-    test_markitdown_exiftool()
-    test_markitdown_llm()
+    for test in [
+        test_stream_info_operations,
+        test_docx_comments,
+        test_input_as_strings,
+        test_markitdown_remote,
+        test_speech_transcription,
+        test_exceptions,
+        test_markitdown_exiftool,
+        test_markitdown_llm,
+    ]:
+        print(f"Running {test.__name__}...", end="")
+        test()
+        print("OK")
    print("All tests passed!")
--- a/packages/markitdown/tests/test_module_vectors.py
+++ b/packages/markitdown/tests/test_module_vectors.py
@@ -0,0 +1,199 @@
+#!/usr/bin/env python3 -m pytest
+import os
+import time
+import pytest
+import codecs
+
+
+if __name__ == "__main__":
+    from _test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS
+else:
+    from ._test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS
+
+from markitdown import (
+    MarkItDown,
+    UnsupportedFormatException,
+    FileConversionException,
+    StreamInfo,
+)
+
+skip_remote = (
+    True if os.environ.get("GITHUB_ACTIONS") else False
+)  # Don't run these tests in CI
+
+TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
+TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"
+
+
+@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
+def test_guess_stream_info(test_vector):
+    """Test the ability to guess stream info."""
+    markitdown = MarkItDown()
+
+    local_path = os.path.join(TEST_FILES_DIR, test_vector.filename)
+    expected_extension = os.path.splitext(test_vector.filename)[1]
+
+    with open(local_path, "rb") as stream:
+        guesses = markitdown._get_stream_info_guesses(
+            stream,
+            base_guess=StreamInfo(
+                filename=os.path.basename(test_vector.filename),
+                local_path=local_path,
+                extension=expected_extension,
+            ),
+        )
+
+        # For some limited exceptions, we can't guarantee the exact
+        # mimetype or extension, so we'll special-case them here.
+        if test_vector.filename in [
+            "test_outlook_msg.msg",
+        ]:
+            return
+
+        assert guesses[0].mimetype == test_vector.mimetype
+        assert guesses[0].extension == expected_extension
+        assert guesses[0].charset == test_vector.charset
+
+
+@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
+def test_convert_local(test_vector):
+    """Test the conversion of a local file."""
+    markitdown = MarkItDown()
+
+    result = markitdown.convert(
+        os.path.join(TEST_FILES_DIR, test_vector.filename), url=test_vector.url
+    )
+    for string in test_vector.must_include:
+        assert string in result.markdown
+    for string in test_vector.must_not_include:
+        assert string not in result.markdown
+
+
+@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
+def test_convert_stream_with_hints(test_vector):
+    """Test the conversion of a stream with full stream info."""
+    markitdown = MarkItDown()
+
+    stream_info = StreamInfo(
+        extension=os.path.splitext(test_vector.filename)[1],
+        mimetype=test_vector.mimetype,
+        charset=test_vector.charset,
+    )
+
+    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
+        result = markitdown.convert(
+            stream, stream_info=stream_info, url=test_vector.url
+        )
+        for string in test_vector.must_include:
+            assert string in result.markdown
+        for string in test_vector.must_not_include:
+            assert string not in result.markdown
+
+
+@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
+def test_convert_stream_without_hints(test_vector):
+    """Test the conversion of a stream with no stream info."""
+    markitdown = MarkItDown()
+
+    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
+        result = markitdown.convert(stream, url=test_vector.url)
+        for string in test_vector.must_include:
+            assert string in result.markdown
+        for string in test_vector.must_not_include:
+            assert string not in result.markdown
+
+
+@pytest.mark.skipif(
+    skip_remote,
+    reason="do not run tests that query external urls",
+)
+@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
+def test_convert_url(test_vector):
+    """Test the conversion of a stream with no stream info."""
+    markitdown = MarkItDown()
+
+    time.sleep(1)  # Ensure we don't hit rate limits
+
+    result = markitdown.convert(
+        TEST_FILES_URL + "/" + test_vector.filename,
+        url=test_vector.url,  # Mock where this file would be found
+    )
+    for string in test_vector.must_include:
+        assert string in result.markdown
+    for string in test_vector.must_not_include:
+        assert string not in result.markdown
+
+
+@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
+def test_convert_with_data_uris(test_vector):
+    """Test API functionality when keep_data_uris is enabled"""
+    markitdown = MarkItDown()
+
+    # Test local file conversion
+    result = markitdown.convert(
+        os.path.join(TEST_FILES_DIR, test_vector.filename),
+        keep_data_uris=True,
+        url=test_vector.url,
+    )
+
+    for string in test_vector.must_include:
+        assert string in result.markdown
+    for string in test_vector.must_not_include:
+        assert string not in result.markdown
+
+
+@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
+def test_convert_stream_with_data_uris(test_vector):
+    """Test the conversion of a stream with no stream info."""
+    markitdown = MarkItDown()
+
+    stream_info = StreamInfo(
+        extension=os.path.splitext(test_vector.filename)[1],
+        mimetype=test_vector.mimetype,
+        charset=test_vector.charset,
+    )
+
+    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
+        result = markitdown.convert(
+            stream, stream_info=stream_info, keep_data_uris=True, url=test_vector.url
+        )
+
+        for string in test_vector.must_include:
+            assert string in result.markdown
+        for string in test_vector.must_not_include:
+            assert string not in result.markdown
+
+
+if __name__ == "__main__":
+    import sys
+
+    """Runs this file's tests from the command line."""
+
+    # General tests
+    for test_function in [
+        test_guess_stream_info,
+        test_convert_local,
+        test_convert_stream_with_hints,
+        test_convert_stream_without_hints,
+        test_convert_url,
+    ]:
+        for test_vector in GENERAL_TEST_VECTORS:
+            print(
+                f"Running {test_function.__name__} on {test_vector.filename}...", end=""
+            )
+            test_function(test_vector)
+            print("OK")
+
+    # Data URI tests
+    for test_function in [
+        test_convert_with_data_uris,
+        test_convert_stream_with_data_uris,
+    ]:
+        for test_vector in DATA_URI_TEST_VECTORS:
+            print(
+                f"Running {test_function.__name__} on {test_vector.filename}...", end=""
+            )
+            test_function(test_vector)
+            print("OK")
+
+    print("All tests passed!")
Author	SHA1	Message	Date
afourney	efc55b260d	Bump version and resolve a console encoding error. (#1149 )	2025-03-21 09:27:25 -07:00
Yuzhong Zhang	52432bd228	Add support for preserving base64 encoded images (#1140 ) * optional reserve base64 string in markdown _CustomMarkdownify and pptx * add other converter para support * fix linter * Use kwarg to pass keep_data_uri para. Add module cli vector tests * Fixed formatting, and adjusted tests.	2025-03-20 18:50:23 -07:00
afourney	c0a511ecff	Updated docx file to include an image. (#1146 )	2025-03-20 12:25:56 -07:00
afourney	cd6aa41361	Adjust warning filters and update dependencies (#1143 ) Adjusts warning filters to be more contextual Updates dependencies for magika and youtube-transcript-api Updates the version to 0.1.0a5 in __about__.py	2025-03-19 22:09:14 -07:00
afourney	716f74dcb9	Consider anything with a charset as plain text-convertible. (#1142 )	2025-03-19 20:46:35 -07:00
afourney	a93e0567e6	EPub Support. Adapted #123 to not use epublib. (#1131 ) * Adapted #123 to not use epublib. * Updated README.md	2025-03-17 07:48:15 -07:00
afourney	c5f70b904f	Have magika read from the stream. (#1136 )	2025-03-17 07:39:19 -07:00
afourney	53834fdd24	Investigate and silence warnings. (#1133 )	2025-03-15 23:41:35 -07:00
afourney	5c565b7d79	Fix remaining mypy errors. (#1132 )	2025-03-15 23:12:48 -07:00
afourney	a78857bd43	Added epub test file. (#1130 )	2025-03-15 18:34:51 -07:00
afourney	09df7fe8df	Small fixes for autogen integration. (#1124 )	2025-03-12 19:18:11 -07:00
Adam Fourney	6a9f09b153	Updated Magika dependency.	2025-03-12 16:15:33 -07:00
afourney	0b815fb916	Bumping version to 0.1.0a2 (#1123 )	2025-03-12 11:44:19 -07:00
Emanuele Meazzo	12620f1545	Handle not supported plot type in pptx (#1122 ) * Handle not supported plot type in pptx * Fixed formatting.	2025-03-12 11:26:23 -07:00
afourney	5f75e16d20	Refactored tests. (#1120 ) * Refactored tests. * Fixed CI errors, and included misc tests. * Omit mskanji from streaminfo test. * Omit mskanji from no hints test. * Log results of debugging in comments (linked to Magika issue) * Added docs as to when to use misc tests.	2025-03-12 11:08:06 -07:00
yushihang	75140a90e2	fix: correct f-string formatting in FileConversionException (#1121 )	2025-03-12 10:15:09 -07:00
afourney	af1be36e0c	Added CLI options for extension, mimetypes, and charset. (#1115 )	2025-03-11 13:16:33 -07:00
Adam Fourney	2a2ccc86aa	Added mimetypes to _rss_converter	2025-03-10 16:17:41 -07:00
Adam Fourney	2e51ba22e7	Enhance type guessing.	2025-03-10 16:05:41 -07:00
afourney	8f8e58c9bb	Minimize guesses when guesses are compatible. (#1114 ) * Minimize guesses when guesses are compatible.	2025-03-10 15:30:44 -07:00
afourney	8e73a325c6	Switch from puremagic to magika. (#1108 )	2025-03-10 12:49:52 -07:00
Mohit Agarwal	2405f201af	fix typo in well-known path list (#1109 )	2025-03-08 19:32:44 -08:00
afourney	99d8e562db	Fix exiftool in well-known paths. (#1106 )	2025-03-07 21:47:20 -08:00
Sebastian Yaghoubi	515fa854bf	feat(docker): improve dockerfile build (#220 ) * refactor(docker): remove unnecessary root user The USER root directive isn't needed directly after FROM Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): use generic nobody nogroup default instead of uid gid Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): build app from source locally instead of installing package Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): use correct files in dockerignore Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * chore(docker): dont install recommended packages with git Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * fix(docker): run apt as non-interactive Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> * Update Dockerfile to new package structure, and fix streaming bugs. --------- Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com> Co-authored-by: afourney <adamfo@microsoft.com>	2025-03-07 20:07:40 -08:00
Richard Ye	0229ff6cb7	feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order (#1104 ) * Sort PPTX shapes to be read in top-to-bottom, left-to-right order Referenced from `39bef65b31/pptx2md/parser.py (L249)` * Update README.md * Fixed formatting. * Added missing import	2025-03-07 15:45:14 -08:00
afourney	82d84e3edd	Fixed formatting. (#1098 )	2025-03-05 23:30:29 -08:00
scalabreseGD	36c4bc9ec3	Fixed deepcopy failure when passing llm_client (#1089 ) Co-authored-by: afourney <adamfo@microsoft.com>	2025-03-05 23:25:37 -08:00
Andrea Pietrobon	80baa5db18	fix(README): correct pip install command formatting (#1090 ) Added missing quotes around `markitdown[all]` in the installation command to ensure proper package resolution by pip.	2025-03-05 23:21:10 -08:00
Adam Fourney	00a65e8f8b	Fixed version in README.	2025-03-05 23:10:21 -08:00