Compare commits: v0.1.0...kennyzhang

18 Commits

| Author | SHA1 | Date |
|---|---|---|
| | 4e0a10ecf3 | |
| | 950b135da6 | |
| | b671345bb9 | |
| | d9a92f7f06 | |
| | db0c8acbaf | |
| | 08330c2ac3 | |
| | 4afc1fe886 | |
| | b0044720da | |
| | 07a28d4f00 | |
| | b8b3897952 | |
| | 395ce2d301 | |
| | 808401a331 | |
| | e75f3f6f5b | |
| | 8e950325d2 | |
| | 096fef3d5f | |
| | 52cbff061a | |
| | 0027e6d425 | |
| | 63a7bafadd | |
.dockerignore

@@ -1,2 +1 @@
 *
-!packages/
.gitattributes (3 changes, vendored)
@@ -1,2 +1 @@
-packages/markitdown/tests/test_files/** linguist-vendored
-packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored
+tests/test_files/** linguist-vendored
Dockerfile (30 changes)
@@ -1,32 +1,22 @@
 FROM python:3.13-slim-bullseye
 
-ENV DEBIAN_FRONTEND=noninteractive
-ENV EXIFTOOL_PATH=/usr/bin/exiftool
-ENV FFMPEG_PATH=/usr/bin/ffmpeg
+USER root
+
+ARG INSTALL_GIT=false
+RUN if [ "$INSTALL_GIT" = "true" ]; then \
+    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
+    fi
 
 # Runtime dependency
 RUN apt-get update && apt-get install -y --no-install-recommends \
     ffmpeg \
-    exiftool
+    && rm -rf /var/lib/apt/lists/*
 
-ARG INSTALL_GIT=false
-RUN if [ "$INSTALL_GIT" = "true" ]; then \
-    apt-get install -y --no-install-recommends \
-    git; \
-    fi
-
-# Cleanup
-RUN rm -rf /var/lib/apt/lists/*
-
-WORKDIR /app
-COPY . /app
-RUN pip --no-cache-dir install \
-    /app/packages/markitdown[all] \
-    /app/packages/markitdown-sample-plugin
+RUN pip install markitdown
 
 # Default USERID and GROUPID
-ARG USERID=nobody
-ARG GROUPID=nogroup
+ARG USERID=10000
+ARG GROUPID=10000
 
 USER $USERID:$GROUPID
 
README.md (110 changes)
@@ -5,15 +5,10 @@
 [](https://github.com/microsoft/autogen)
 
 > [!IMPORTANT]
-> Breaking changes between 0.0.1 to 0.1.0:
-> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
-> * convert_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
-> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
-
-MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
-
-At present, MarkItDown supports:
+> MarkItDown 0.0.2 alpha 1 (0.0.2a1) introduces a plugin-based architecture. As much as was possible, command-line and Python interfaces have remained the same as 0.0.1a3 to support backward compatibility. Please report any issues you encounter. Some interface changes may yet occur as we continue to refine MarkItDown to a first non-alpha release.
 
+MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
+It supports:
 - PDF
 - PowerPoint
 - Word

@@ -23,27 +18,14 @@ At present, MarkItDown supports:
 - HTML
 - Text-based formats (CSV, JSON, XML)
 - ZIP files (iterates over contents)
-- Youtube URLs
-- EPubs
 - ... and more!
 
-## Why Markdown?
-
-Markdown is extremely close to plain text, with minimal markup or formatting, but still
-provides a way to represent important document structure. Mainstream LLMs, such as
-OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
-responses unprompted. This suggests that they have been trained on vast amounts of
-Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
-are also highly token-efficient.
-
-## Installation
-
-To install MarkItDown, use pip: `pip install 'markitdown[all]'`. Alternatively, you can install it from the source:
+To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source:
 
 ```bash
 git clone git@github.com:microsoft/markitdown.git
 cd markitdown
-pip install -e packages/markitdown[all]
+pip install -e packages/markitdown
 ```
 
 ## Usage

@@ -66,28 +48,6 @@ You can also pipe content:
 cat path-to-file.pdf | markitdown
 ```
 
-### Optional Dependencies
-MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:
-
-```bash
-pip install markitdown[pdf, docx, pptx]
-```
-
-will install only the dependencies for PDF, DOCX, and PPTX files.
-
-At the moment, the following optional dependencies are available:
-
-* `[all]` Installs all optional dependencies
-* `[pptx]` Installs dependencies for PowerPoint files
-* `[docx]` Installs dependencies for Word files
-* `[xlsx]` Installs dependencies for Excel files
-* `[xls]` Installs dependencies for older Excel files
-* `[pdf]` Installs dependencies for PDF files
-* `[outlook]` Installs dependencies for Outlook messages
-* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
-* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
-* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
-
 ### Plugins
 
 MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:

@@ -114,6 +74,7 @@ markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
 
 More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
 
+
 ### Python API
 
 Basic usage in Python:

@@ -136,6 +97,25 @@ result = md.convert("test.pdf")
 print(result.text_content)
 ```
 
+MarkItDown also supports converting file objects directly:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+
+# Providing the file extension when converting via file objects is recommended for most consistent results
+# Binary Mode
+with open("test.docx", 'rb') as file:
+    result = md.convert(file, file_extension=".docx")
+    print(result.text_content)
+
+# Non-Binary Mode
+with open("sample.ipynb", 'rt', encoding="utf-8") as file:
+    result = md.convert(file, file_extension=".ipynb")
+    print(result.text_content)
+```
+
 To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
 
 ```python

@@ -154,10 +134,10 @@ print(result.text_content)
 docker build -t markitdown:latest .
 docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 ```
 
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
 Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
 the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
 

@@ -173,12 +153,13 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
 
 You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
 
 <div align="center">
 
 | | All | Especially Needs Help from Community |
-| ---------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
+|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
 | **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
 | **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
 
 </div>
+
 

@@ -186,24 +167,22 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welcome
 
 - Navigate to the MarkItDown package:
 
 ```sh
 cd packages/markitdown
 ```
 
 - Install `hatch` in your environment and run tests:
-
 ```sh
 pip install hatch  # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
 hatch shell
 hatch test
 ```
 
 (Alternative) Use the Devcontainer which has all the dependencies installed:
-
 ```sh
 # Reopen the project in Devcontainer and run:
 hatch test
 ```
 
 - Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
 

@@ -211,6 +190,7 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welcome
 
 You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.
 
+
 ## Trademarks
 
 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
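The v0.1.0 (`-`) side of the README above calls out that `convert_stream()` only accepts binary file-like objects. A minimal sketch of what that means in practice, assuming markitdown 0.1.0 and a placeholder `test.pdf`:

```python
import io

from markitdown import MarkItDown

md = MarkItDown()

# Binary file-like objects (files opened in "rb" mode, io.BytesIO) work:
with open("test.pdf", "rb") as fh:
    result = md.convert_stream(fh)
    print(result.markdown)

# Text file-like objects such as io.StringIO are no longer accepted on
# 0.1.0 and will fail, per the breaking-change note:
# md.convert_stream(io.StringIO("# already markdown"))
```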
packages/markitdown-sample-plugin/README.md

@@ -10,38 +10,23 @@ This project shows how to create a sample plugin for MarkItDown. The most important
 Next, implement your custom DocumentConverter:
 
 ```python
-from typing import BinaryIO, Any
-from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo
+from typing import Union
+from markitdown import DocumentConverter, DocumentConverterResult
 
 class RtfConverter(DocumentConverter):
+    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
+        # Bail if not an RTF file
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".rtf":
+            return None
 
-    def __init__(
-        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
-    ):
-        super().__init__(priority=priority)
+        # Implement the conversion logic here ...
 
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,
-    ) -> bool:
-        # Implement logic to check if the file stream is an RTF file
-        # ...
-        raise NotImplementedError()
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,
-    ) -> DocumentConverterResult:
-        # Implement logic to convert the file stream to Markdown
-        # ...
-        raise NotImplementedError()
+        # Return the result
+        return DocumentConverterResult(
+            title=title,
+            text_content=text_content,
+        )
 ```
 
 Next, make sure your package implements and exports the following:

@@ -86,10 +71,10 @@ Once the plugin package is installed, verify that it is available to MarkItDown
 markitdown --list-plugins
 ```
 
-To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert an RTF file:
+To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert a PDF:
 
 ```bash
-markitdown --use-plugins path-to-file.rtf
+markitdown --use-plugins path-to-file.pdf
 ```
 
 In Python, plugins can be enabled as follows:

@@ -98,7 +83,7 @@ In Python, plugins can be enabled as follows:
 from markitdown import MarkItDown
 
 md = MarkItDown(enable_plugins=True)
-result = md.convert("path-to-file.rtf")
+result = md.convert("path-to-file.pdf")
 print(result.text_content)
 ```
 
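For orientation, the hooks a plugin package exports are visible in `_plugin.py` later in this diff; a minimal sketch of that module shape, with the sample plugin's RtfConverter standing in for your own converter:

```python
from markitdown import MarkItDown
from markitdown_sample_plugin import RtfConverter

__plugin_interface_version__ = 1  # version of the plugin interface this plugin uses


def register_converters(markitdown: MarkItDown, **kwargs):
    # Called by MarkItDown during plugin loading, when plugins are enabled
    markitdown.register_converter(RtfConverter())
```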
packages/markitdown-sample-plugin/pyproject.toml

@@ -24,7 +24,7 @@ classifiers = [
   "Programming Language :: Python :: Implementation :: PyPy",
 ]
 dependencies = [
-  "markitdown>=0.1.0a1",
+  "markitdown",
   "striprtf",
 ]
 
packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__about__.py

@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.0a1"
+__version__ = "0.0.1a2"
packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py

@@ -1,26 +1,12 @@
-import locale
-from typing import BinaryIO, Any
+from typing import Union
 
 from striprtf.striprtf import rtf_to_text
 
-from markitdown import (
-    MarkItDown,
-    DocumentConverter,
-    DocumentConverterResult,
-    StreamInfo,
-)
+from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
 
 __plugin_interface_version__ = (
     1  # The version of the plugin interface that this plugin uses
 )
 
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/rtf",
-    "application/rtf",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".rtf"]
-
 
 def register_converters(markitdown: MarkItDown, **kwargs):
     """

@@ -36,36 +22,18 @@ class RtfConverter(DocumentConverter):
     Converts an RTF file to in the simplest possible way.
     """
 
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,
-    ) -> DocumentConverterResult:
-        # Read the file stream into a str using the provided charset encoding, or the system default
-        encoding = stream_info.charset or locale.getpreferredencoding()
-        stream_data = file_stream.read().decode(encoding)
+    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
+        # Bail if not a RTF
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".rtf":
+            return None
+
+        # Read the RTF file
+        with open(local_path, "r") as f:
+            rtf = f.read()
 
         # Return the result
         return DocumentConverterResult(
             title=None,
-            markdown=rtf_to_text(stream_data),
+            text_content=rtf_to_text(rtf),
         )
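Under the v0.1.0 interface on the `-` side of this file, the converter is driven with a binary stream plus a `StreamInfo`; a short sketch, assuming the sample plugin is installed and a local `test.rtf` exists:

```python
from markitdown import StreamInfo
from markitdown_sample_plugin import RtfConverter

converter = RtfConverter()
with open("test.rtf", "rb") as fh:
    info = StreamInfo(mimetype="text/rtf", extension=".rtf", filename="test.rtf")
    if converter.accepts(fh, info):        # quick check based on the metadata
        result = converter.convert(fh, info)
        print(result.markdown)
```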
packages/markitdown-sample-plugin/tests/test_sample_plugin.py

@@ -2,7 +2,7 @@
 import os
 import pytest
 
-from markitdown import MarkItDown, StreamInfo
+from markitdown import MarkItDown
 from markitdown_sample_plugin import RtfConverter
 
 TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")

@@ -15,22 +15,18 @@ RTF_TEST_STRINGS = {
 
 def test_converter() -> None:
     """Tests the RTF converter dirctly."""
-    with open(os.path.join(TEST_FILES_DIR, "test.rtf"), "rb") as file_stream:
-        converter = RtfConverter()
-        result = converter.convert(
-            file_stream=file_stream,
-            stream_info=StreamInfo(
-                mimetype="text/rtf", extension=".rtf", filename="test.rtf"
-            ),
-        )
+    converter = RtfConverter()
+    result = converter.convert(
+        os.path.join(TEST_FILES_DIR, "test.rtf"), file_extension=".rtf"
+    )
 
     for test_string in RTF_TEST_STRINGS:
         assert test_string in result.text_content
 
 
 def test_markitdown() -> None:
     """Tests that MarkItDown correctly loads the plugin."""
-    md = MarkItDown(enable_plugins=True)
+    md = MarkItDown()
     result = md.convert(os.path.join(TEST_FILES_DIR, "test.rtf"))
 
     for test_string in RTF_TEST_STRINGS:
packages/markitdown/README.md

@@ -10,7 +10,7 @@
 From PyPI:
 
 ```bash
-pip install markitdown[all]
+pip install markitdown
 ```
 
 From source:

@@ -18,7 +18,7 @@ From source:
 ```bash
 git clone git@github.com:microsoft/markitdown.git
 cd markitdown
-pip install -e packages/markitdown[all]
+pip install -e packages/markitdown
 ```
 
 ## Usage
packages/markitdown/pyproject.toml

@@ -26,35 +26,25 @@ classifiers = [
 dependencies = [
   "beautifulsoup4",
   "requests",
-  "markdownify",
-  "magika~=0.6.1",
-  "charset-normalizer",
-]
-
-[project.optional-dependencies]
-all = [
-  "python-pptx",
   "mammoth",
+  "markdownify",
+  "numpy",
+  "python-pptx",
   "pandas",
   "openpyxl",
   "xlrd",
   "pdfminer.six",
-  "olefile",
+  "puremagic",
   "pydub",
+  "olefile",
+  "youtube-transcript-api",
   "SpeechRecognition",
-  "youtube-transcript-api~=1.0.0",
+  "pathvalidate",
+  "charset-normalizer",
+  "openai",
   "azure-ai-documentintelligence",
   "azure-identity"
 ]
-pptx = ["python-pptx"]
-docx = ["mammoth"]
-xlsx = ["pandas", "openpyxl"]
-xls = ["pandas", "xlrd"]
-pdf = ["pdfminer.six"]
-outlook = ["olefile"]
-audio-transcription = ["pydub", "SpeechRecognition"]
-youtube-transcription = ["youtube-transcript-api"]
-az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
 
 [project.urls]
 Documentation = "https://github.com/microsoft/markitdown#readme"

@@ -67,24 +57,12 @@ path = "src/markitdown/__about__.py"
 [project.scripts]
 markitdown = "markitdown.__main__:main"
 
-[tool.hatch.envs.default]
-features = ["all"]
-
-[tool.hatch.envs.hatch-test]
-features = ["all"]
-extra-dependencies = [
-  "openai",
-]
-
 [tool.hatch.envs.types]
-features = ["all"]
 extra-dependencies = [
-  "openai",
   "mypy>=1.0.0",
 ]
 
 [tool.hatch.envs.types.scripts]
-check = "mypy --install-types --non-interactive --ignore-missing-imports {args:src/markitdown tests}"
+check = "mypy --install-types --non-interactive {args:src/markitdown tests}"
 
 [tool.coverage.run]
 source_pkgs = ["markitdown", "tests"]
packages/markitdown/src/markitdown/__about__.py

@@ -1,4 +1,4 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-__version__ = "0.1.0"
+__version__ = "0.0.2a1"
packages/markitdown/src/markitdown/__init__.py

@@ -3,20 +3,14 @@
 # SPDX-License-Identifier: MIT
 
 from .__about__ import __version__
-from ._markitdown import (
-    MarkItDown,
-    PRIORITY_SPECIFIC_FILE_FORMAT,
-    PRIORITY_GENERIC_FILE_FORMAT,
-)
-from ._base_converter import DocumentConverterResult, DocumentConverter
-from ._stream_info import StreamInfo
+from ._markitdown import MarkItDown
 from ._exceptions import (
     MarkItDownException,
-    MissingDependencyException,
-    FailedConversionAttempt,
+    ConverterPrerequisiteException,
     FileConversionException,
     UnsupportedFormatException,
 )
+from .converters import DocumentConverter, DocumentConverterResult
 
 __all__ = [
     "__version__",

@@ -24,11 +18,7 @@ __all__ = [
     "DocumentConverter",
     "DocumentConverterResult",
     "MarkItDownException",
-    "MissingDependencyException",
-    "FailedConversionAttempt",
+    "ConverterPrerequisiteException",
     "FileConversionException",
     "UnsupportedFormatException",
-    "StreamInfo",
-    "PRIORITY_SPECIFIC_FILE_FORMAT",
-    "PRIORITY_GENERIC_FILE_FORMAT",
 ]
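The net effect on the public import surface, restated as a sketch of imports that resolve on the v0.1.0 (`-`) side, per that version's `__all__`:

```python
from markitdown import (
    MarkItDown,
    DocumentConverter,
    DocumentConverterResult,
    StreamInfo,
    MarkItDownException,
    MissingDependencyException,
    FailedConversionAttempt,
    FileConversionException,
    UnsupportedFormatException,
    PRIORITY_SPECIFIC_FILE_FORMAT,
    PRIORITY_GENERIC_FILE_FORMAT,
)
```

On the head (`+`) side, `ConverterPrerequisiteException` replaces `MissingDependencyException`, and the `StreamInfo`, `FailedConversionAttempt`, and priority names are absent.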
packages/markitdown/src/markitdown/__main__.py

@@ -3,12 +3,10 @@
 # SPDX-License-Identifier: MIT
 import argparse
 import sys
-import codecs
-import locale
 from textwrap import dedent
 from importlib.metadata import entry_points
 from .__about__ import __version__
-from ._markitdown import MarkItDown, StreamInfo, DocumentConverterResult
+from ._markitdown import MarkItDown, DocumentConverterResult
 
 
 def main():

@@ -60,24 +58,6 @@ def main():
         help="Output file name. If not provided, output is written to stdout.",
     )
 
-    parser.add_argument(
-        "-x",
-        "--extension",
-        help="Provide a hint about the file extension (e.g., when reading from stdin).",
-    )
-
-    parser.add_argument(
-        "-m",
-        "--mime-type",
-        help="Provide a hint about the file's MIME type.",
-    )
-
-    parser.add_argument(
-        "-c",
-        "--charset",
-        help="Provide a hint about the file's charset (e.g, UTF-8).",
-    )
-
     parser.add_argument(
         "-d",
         "--use-docintel",

@@ -105,57 +85,9 @@ def main():
         help="List installed 3rd-party plugins. Plugins are loaded when using the -p or --use-plugin option.",
     )
 
-    parser.add_argument(
-        "--keep-data-uris",
-        action="store_true",
-        help="Keep data URIs (like base64-encoded images) in the output. By default, data URIs are truncated.",
-    )
-
     parser.add_argument("filename", nargs="?")
     args = parser.parse_args()
 
-    # Parse the extension hint
-    extension_hint = args.extension
-    if extension_hint is not None:
-        extension_hint = extension_hint.strip().lower()
-        if len(extension_hint) > 0:
-            if not extension_hint.startswith("."):
-                extension_hint = "." + extension_hint
-        else:
-            extension_hint = None
-
-    # Parse the mime type
-    mime_type_hint = args.mime_type
-    if mime_type_hint is not None:
-        mime_type_hint = mime_type_hint.strip()
-        if len(mime_type_hint) > 0:
-            if mime_type_hint.count("/") != 1:
-                _exit_with_error(f"Invalid MIME type: {mime_type_hint}")
-        else:
-            mime_type_hint = None
-
-    # Parse the charset
-    charset_hint = args.charset
-    if charset_hint is not None:
-        charset_hint = charset_hint.strip()
-        if len(charset_hint) > 0:
-            try:
-                charset_hint = codecs.lookup(charset_hint).name
-            except LookupError:
-                _exit_with_error(f"Invalid charset: {charset_hint}")
-        else:
-            charset_hint = None
-
-    stream_info = None
-    if (
-        extension_hint is not None
-        or mime_type_hint is not None
-        or charset_hint is not None
-    ):
-        stream_info = StreamInfo(
-            extension=extension_hint, mimetype=mime_type_hint, charset=charset_hint
-        )
-
     if args.list_plugins:
         # List installed plugins, then exit
         print("Installed MarkItDown 3rd-party Plugins:\n")

@@ -175,12 +107,11 @@ def main():
 
     if args.use_docintel:
         if args.endpoint is None:
-            _exit_with_error(
+            raise ValueError(
                 "Document Intelligence Endpoint is required when using Document Intelligence."
             )
         elif args.filename is None:
-            _exit_with_error("Filename is required when using Document Intelligence.")
+            raise ValueError("Filename is required when using Document Intelligence.")
 
         markitdown = MarkItDown(
             enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
         )

@@ -188,15 +119,9 @@ def main():
         markitdown = MarkItDown(enable_plugins=args.use_plugins)
 
     if args.filename is None:
-        result = markitdown.convert_stream(
-            sys.stdin.buffer,
-            stream_info=stream_info,
-            keep_data_uris=args.keep_data_uris,
-        )
+        result = markitdown.convert_stream(sys.stdin.buffer)
     else:
-        result = markitdown.convert(
-            args.filename, stream_info=stream_info, keep_data_uris=args.keep_data_uris
-        )
+        result = markitdown.convert(args.filename)
 
     _handle_output(args, result)
 

@@ -205,19 +130,9 @@ def _handle_output(args, result: DocumentConverterResult):
     """Handle output to stdout or file"""
     if args.output:
         with open(args.output, "w", encoding="utf-8") as f:
-            f.write(result.markdown)
+            f.write(result.text_content)
     else:
-        # Handle stdout encoding errors more gracefully
-        print(
-            result.markdown.encode(sys.stdout.encoding, errors="replace").decode(
-                sys.stdout.encoding
-            )
-        )
-
-
-def _exit_with_error(message: str):
-    print(message)
-    sys.exit(1)
+        print(result.text_content)
 
 
 if __name__ == "__main__":
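The removed `-x`/`-m`/`-c` flags feed a `StreamInfo` on the v0.1.0 side; roughly the same effect can be had through the Python API (a sketch, not part of the CLI itself):

```python
import sys

from markitdown import MarkItDown, StreamInfo

md = MarkItDown()
# Roughly equivalent to: markitdown -x .csv -m text/csv -c utf-8 < data.csv
result = md.convert_stream(
    sys.stdin.buffer,
    stream_info=StreamInfo(extension=".csv", mimetype="text/csv", charset="utf-8"),
)
print(result.markdown)
```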
packages/markitdown/src/markitdown/_base_converter.py (file removed on the head side)

@@ -1,108 +0,0 @@
-import os
-import tempfile
-from warnings import warn
-from typing import Any, Union, BinaryIO, Optional, List
-from ._stream_info import StreamInfo
-
-
-class DocumentConverterResult:
-    """The result of converting a document to Markdown."""
-
-    def __init__(
-        self,
-        markdown: str,
-        *,
-        title: Optional[str] = None,
-    ):
-        """
-        Initialize the DocumentConverterResult.
-
-        The only required parameter is the converted Markdown text.
-        The title, and any other metadata that may be added in the future, are optional.
-
-        Parameters:
-        - markdown: The converted Markdown text.
-        - title: Optional title of the document.
-        """
-        self.markdown = markdown
-        self.title = title
-
-    @property
-    def text_content(self) -> str:
-        """Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
-        return self.markdown
-
-    @text_content.setter
-    def text_content(self, markdown: str):
-        """Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
-        self.markdown = markdown
-
-    def __str__(self) -> str:
-        """Return the converted Markdown text."""
-        return self.markdown
-
-
-class DocumentConverter:
-    """Abstract superclass of all DocumentConverters."""
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        """
-        Return a quick determination on if the converter should attempt converting the document.
-        This is primarily based on `stream_info` (typically, `stream_info.mimetype`, `stream_info.extension`).
-        In cases where the data is retrieved via HTTP, the `stream_info.url` might also be referenced to
-        make a determination (e.g., special converters for Wikipedia, YouTube etc).
-        Finally, it is conceivable that the `stream_info.filename` might be used in cases
-        where the filename is well-known (e.g., `Dockerfile`, `Makefile`, etc)
-
-        NOTE: The method signature is designed to match that of the convert() method. This provides some
-        assurance that, if accepts() returns True, the convert() method will also be able to handle the document.
-
-        IMPORTANT: In rare cases, (e.g., OutlookMsgConverter) we need to read more from the stream to make a final
-        determination. Read operations inevitably advance the position in file_stream. In these cases, the position
-        MUST be reset before returning. This is because the convert() method may be called immediately
-        after accepts(), and will expect the file_stream to be at the original position.
-
-        E.g.,
-            cur_pos = file_stream.tell()  # Save the current position
-            data = file_stream.read(100)  # ... peek at the first 100 bytes, etc.
-            file_stream.seek(cur_pos)  # Reset the position to the original position
-
-        Parameters:
-        - file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
-        - stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, etc.)
-        - kwargs: Additional keyword arguments for the converter.
-
-        Returns:
-        - bool: True if the converter can handle the document, False otherwise.
-        """
-        raise NotImplementedError(
-            f"The subclass, {type(self).__name__}, must implement the accepts() method to determine if they can handle the document."
-        )
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        """
-        Convert a document to Markdown text.
-
-        Parameters:
-        - file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
-        - stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, etc.)
-        - kwargs: Additional keyword arguments for the converter.
-
-        Returns:
-        - DocumentConverterResult: The result of the conversion, which includes the title and markdown content.
-
-        Raises:
-        - FileConversionException: If the mimetype is recognized, but the conversion fails for some other reason.
-        - MissingDependencyException: If the converter requires a dependency that is not installed.
-        """
-        raise NotImplementedError("Subclasses must implement this method")
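The deleted base class documents a peek-and-reset contract for `accepts()`. A hypothetical converter following that contract, sketched against the v0.1.0 API:

```python
from typing import Any, BinaryIO

from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo


class PdfSniffingConverter(DocumentConverter):
    """Hypothetical converter illustrating the peek-and-reset contract."""

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        cur_pos = file_stream.tell()   # save the current position
        magic = file_stream.read(5)    # peek at the first bytes...
        file_stream.seek(cur_pos)      # ...and reset before returning, as required
        return magic == b"%PDF-"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        # convert() can rely on the stream being at its original position
        data = file_stream.read()
        return DocumentConverterResult(markdown=f"PDF with {len(data)} bytes", title=None)
```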
packages/markitdown/src/markitdown/_exceptions.py

@@ -1,14 +1,4 @@
-from typing import Optional, List, Any
-
-MISSING_DEPENDENCY_MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency [{feature}] or [all] when installing MarkItDown. For example:
-
-* pip install markitdown[{feature}]
-* pip install markitdown[all]
-* pip install markitdown[{feature}, ...]
-* etc."""
-
-
-class MarkItDownException(Exception):
+class MarkItDownException(BaseException):
     """
     Base exception class for MarkItDown.
     """

@@ -16,16 +6,24 @@ class MarkItDownException(Exception):
     pass
 
 
-class MissingDependencyException(MarkItDownException):
+class ConverterPrerequisiteException(MarkItDownException):
     """
-    Converters shipped with MarkItDown may depend on optional
-    dependencies. This exception is thrown when a converter's
-    convert() method is called, but the required dependency is not
-    installed. This is not necessarily a fatal error, as the converter
-    will simply be skipped (an error will bubble up only if no other
-    suitable converter is found).
+    Thrown when instantiating a DocumentConverter in cases where
+    a required library or dependency is not installed, an API key
+    is not set, or some other prerequisite is not met.
 
-    Error messages should clearly indicate which dependency is missing.
+    This is not necessarily a fatal error. If thrown during
+    MarkItDown's plugin loading phase, the converter will simply be
+    skipped, and a warning will be issued.
+    """
+
+    pass
+
+
+class FileConversionException(MarkItDownException):
+    """
+    Thrown when a suitable converter was found, but the conversion
+    process fails for any reason.
     """
 
     pass

@@ -37,40 +35,3 @@ class UnsupportedFormatException(MarkItDownException):
     """
 
     pass
-
-
-class FailedConversionAttempt(object):
-    """
-    Represents a single attempt to convert a file.
-    """
-
-    def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
-        self.converter = converter
-        self.exc_info = exc_info
-
-
-class FileConversionException(MarkItDownException):
-    """
-    Thrown when a suitable converter was found, but the conversion
-    process fails for any reason.
-    """
-
-    def __init__(
-        self,
-        message: Optional[str] = None,
-        attempts: Optional[List[FailedConversionAttempt]] = None,
-    ):
-        self.attempts = attempts
-
-        if message is None:
-            if attempts is None:
-                message = "File conversion failed."
-            else:
-                message = f"File conversion failed after {len(attempts)} attempts:\n"
-                for attempt in attempts:
-                    if attempt.exc_info is None:
-                        message += f" - {type(attempt.converter).__name__} provided no execution info."
-                    else:
-                        message += f" - {type(attempt.converter).__name__} threw {attempt.exc_info[0].__name__} with message: {attempt.exc_info[1]}\n"
-
-        super().__init__(message)
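On the v0.1.0 (`-`) side, `FileConversionException` carries the list of `FailedConversionAttempt` records; a sketch of inspecting them (the path is a placeholder):

```python
from markitdown import MarkItDown, FileConversionException

md = MarkItDown()
try:
    result = md.convert("path-to-file.pdf")
except FileConversionException as exc:
    # attempts may be None, hence the fallback to an empty list
    for attempt in exc.attempts or []:
        print(type(attempt.converter).__name__, attempt.exc_info)
```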
packages/markitdown/src/markitdown/_markitdown.py

@@ -2,26 +2,23 @@ import copy
 import mimetypes
 import os
 import re
-import sys
-import shutil
 import tempfile
 import warnings
 import traceback
-import io
-from dataclasses import dataclass
 from importlib.metadata import entry_points
-from typing import Any, List, Optional, Union, BinaryIO
+from typing import Any, List, Optional, Union
 from pathlib import Path
 from urllib.parse import urlparse
 from warnings import warn
-import requests
-import magika
-import charset_normalizer
-import codecs
+from io import BufferedIOBase, TextIOBase, BytesIO
 
-from ._stream_info import StreamInfo
+# File-format detection
+import puremagic
+import requests
 
 from .converters import (
+    DocumentConverter,
+    DocumentConverterResult,
     PlainTextConverter,
     HtmlConverter,
     RssConverter,
@@ -35,35 +32,27 @@ from .converters import (
     XlsConverter,
     PptxConverter,
     ImageConverter,
-    AudioConverter,
+    WavConverter,
+    Mp3Converter,
     OutlookMsgConverter,
     ZipConverter,
-    EpubConverter,
     DocumentIntelligenceConverter,
+    ConverterInput,
 )
 
-from ._base_converter import DocumentConverter, DocumentConverterResult
-
 from ._exceptions import (
     FileConversionException,
     UnsupportedFormatException,
-    FailedConversionAttempt,
+    ConverterPrerequisiteException,
 )
 
-# Lower priority values are tried first.
-PRIORITY_SPECIFIC_FILE_FORMAT = (
-    0.0  # e.g., .docx, .pdf, .xlsx, Or specific pages, e.g., wikipedia
-)
-PRIORITY_GENERIC_FILE_FORMAT = (
-    10.0  # Near catch-all converters for mimetypes like text/*, etc.
-)
-
-
-_plugins: Union[None, List[Any]] = None  # If None, plugins have not been loaded yet.
-
-
-def _load_plugins() -> Union[None, List[Any]]:
+# Override mimetype for csv to fix issue on windows
+mimetypes.add_type("text/csv", ".csv")
+
+_plugins: Union[None | List[Any]] = None
+
+
+def _load_plugins() -> Union[None | List[Any]]:
     """Lazy load plugins, exiting early if already loaded."""
     global _plugins
 
@@ -83,14 +72,6 @@ def _load_plugins() -> Union[None, List[Any]]:
     return _plugins
 
 
-@dataclass(kw_only=True, frozen=True)
-class ConverterRegistration:
-    """A registration of a converter with its priority and other metadata."""
-
-    converter: DocumentConverter
-    priority: float
-
-
 class MarkItDown:
     """(In preview) An extremely simple text-based document reader, suitable for LLM use.
     This reader will convert common file-types or webpages to Markdown."""
@@ -111,16 +92,14 @@ class MarkItDown:
         else:
             self._requests_session = requests_session
 
-        self._magika = magika.Magika()
-
         # TODO - remove these (see enable_builtins)
-        self._llm_client: Any = None
-        self._llm_model: Union[str | None] = None
-        self._exiftool_path: Union[str | None] = None
-        self._style_map: Union[str | None] = None
+        self._llm_client = None
+        self._llm_model = None
+        self._exiftool_path = None
+        self._style_map = None
 
         # Register the converters
-        self._converters: List[ConverterRegistration] = []
+        self._page_converters: List[DocumentConverter] = []
 
         if (
             enable_builtins is None or enable_builtins
@@ -142,43 +121,15 @@ class MarkItDown:
             self._llm_model = kwargs.get("llm_model")
             self._exiftool_path = kwargs.get("exiftool_path")
             self._style_map = kwargs.get("style_map")
 
             if self._exiftool_path is None:
                 self._exiftool_path = os.getenv("EXIFTOOL_PATH")
 
-            # Still none? Check well-known paths
-            if self._exiftool_path is None:
-                candidate = shutil.which("exiftool")
-                if candidate:
-                    candidate = os.path.abspath(candidate)
-                    if any(
-                        d == os.path.dirname(candidate)
-                        for d in [
-                            "/usr/bin",
-                            "/usr/local/bin",
-                            "/opt",
-                            "/opt/bin",
-                            "/opt/local/bin",
-                            "/opt/homebrew/bin",
-                            "C:\\Windows\\System32",
-                            "C:\\Program Files",
-                            "C:\\Program Files (x86)",
-                        ]
-                    ):
-                        self._exiftool_path = candidate
-
             # Register converters for successful browsing operations
             # Later registrations are tried first / take higher priority than earlier registrations
             # To this end, the most specific converters should appear below the most generic converters
-            self.register_converter(
-                PlainTextConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
-            )
-            self.register_converter(
-                ZipConverter(markitdown=self), priority=PRIORITY_GENERIC_FILE_FORMAT
-            )
-            self.register_converter(
-                HtmlConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
-            )
+            self.register_converter(PlainTextConverter())
+            self.register_converter(ZipConverter())
+            self.register_converter(HtmlConverter())
             self.register_converter(RssConverter())
             self.register_converter(WikipediaConverter())
             self.register_converter(YouTubeConverter())
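Registration on the v0.1.0 (`-`) side pairs each converter with an explicit priority. A sketch with a hypothetical converter (everything other than the imports is illustrative):

```python
from typing import Any, BinaryIO

from markitdown import (
    MarkItDown,
    DocumentConverter,
    DocumentConverterResult,
    StreamInfo,
    PRIORITY_SPECIFIC_FILE_FORMAT,
)


class LogConverter(DocumentConverter):
    """Hypothetical converter for .log files."""

    def accepts(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> bool:
        return (stream_info.extension or "").lower() == ".log"

    def convert(self, file_stream: BinaryIO, stream_info: StreamInfo, **kwargs: Any) -> DocumentConverterResult:
        return DocumentConverterResult(markdown=file_stream.read().decode("utf-8", errors="replace"))


md = MarkItDown()
# Lower priority values are tried first, and later registrations are tried
# before earlier ones at the same priority (per the comments in this hunk).
md.register_converter(LogConverter(), priority=PRIORITY_SPECIFIC_FILE_FORMAT)
```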
@@ -187,12 +138,12 @@ class MarkItDown:
             self.register_converter(XlsxConverter())
             self.register_converter(XlsConverter())
             self.register_converter(PptxConverter())
-            self.register_converter(AudioConverter())
+            self.register_converter(WavConverter())
+            self.register_converter(Mp3Converter())
             self.register_converter(ImageConverter())
             self.register_converter(IpynbConverter())
             self.register_converter(PdfConverter())
             self.register_converter(OutlookMsgConverter())
-            self.register_converter(EpubConverter())
 
         # Register Document Intelligence converter at the top of the stack if endpoint is provided
         docintel_endpoint = kwargs.get("docintel_endpoint")
@@ -213,9 +164,7 @@ class MarkItDown:
         """
         if not self._plugins_enabled:
             # Load plugins
-            plugins = _load_plugins()
-            assert plugins is not None
-            for plugin in plugins:
+            for plugin in _load_plugins():
                 try:
                     plugin.register_converters(self, **kwargs)
                 except Exception:
@@ -227,18 +176,14 @@ class MarkItDown:
 
     def convert(
         self,
-        source: Union[str, requests.Response, Path, BinaryIO],
-        *,
-        stream_info: Optional[StreamInfo] = None,
+        source: Union[str, requests.Response, Path, BufferedIOBase, TextIOBase],
         **kwargs: Any,
     ) -> DocumentConverterResult:  # TODO: deal with kwargs
         """
         Args:
-        - source: can be a path (str or Path), url, or a requests.response object
-        - stream_info: optional stream info to use for the conversion. If None, infer from source
-        - kwargs: additional arguments to pass to the converter
+        - source: can be a string representing a path either as string pathlib path object or url, a requests.response object, or a file object (TextIO or BinaryIO)
+        - extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
         """
 
         # Local path or url
         if isinstance(source, str):
             if (
@@ -246,237 +191,177 @@ class MarkItDown:
|
|||||||
or source.startswith("https://")
|
or source.startswith("https://")
|
||||||
or source.startswith("file://")
|
or source.startswith("file://")
|
||||||
):
|
):
|
||||||
# Rename the url argument to mock_url
|
return self.convert_url(source, **kwargs)
|
||||||
# (Deprecated -- use stream_info)
|
|
||||||
_kwargs = {k: v for k, v in kwargs.items()}
|
|
||||||
if "url" in _kwargs:
|
|
||||||
_kwargs["mock_url"] = _kwargs["url"]
|
|
||||||
del _kwargs["url"]
|
|
||||||
|
|
||||||
return self.convert_url(source, stream_info=stream_info, **_kwargs)
|
|
||||||
else:
|
else:
|
||||||
return self.convert_local(source, stream_info=stream_info, **kwargs)
|
return self.convert_local(source, **kwargs)
|
||||||
# Path object
|
|
||||||
elif isinstance(source, Path):
|
|
||||||
return self.convert_local(source, stream_info=stream_info, **kwargs)
|
|
||||||
# Request response
|
# Request response
|
||||||
elif isinstance(source, requests.Response):
|
elif isinstance(source, requests.Response):
|
||||||
return self.convert_response(source, stream_info=stream_info, **kwargs)
|
return self.convert_response(source, **kwargs)
|
||||||
# Binary stream
|
elif isinstance(source, Path):
|
||||||
elif (
|
return self.convert_local(source, **kwargs)
|
||||||
hasattr(source, "read")
|
# File object
|
||||||
and callable(source.read)
|
elif isinstance(source, BufferedIOBase) or isinstance(source, TextIOBase):
|
||||||
and not isinstance(source, io.TextIOBase)
|
return self.convert_file_object(source, **kwargs)
|
||||||
):
|
|
||||||
return self.convert_stream(source, stream_info=stream_info, **kwargs)
|
|
||||||
else:
|
|
||||||
raise TypeError(
|
|
||||||
f"Invalid source type: {type(source)}. Expected str, requests.Response, BinaryIO."
|
|
||||||
)
|
|
||||||
|
|
||||||
     def convert_local(
-        self,
-        path: Union[str, Path],
-        *,
-        stream_info: Optional[StreamInfo] = None,
-        file_extension: Optional[str] = None,  # Deprecated -- use stream_info
-        url: Optional[str] = None,  # Deprecated -- use stream_info
-        **kwargs: Any,
-    ) -> DocumentConverterResult:
+        self, path: Union[str, Path], **kwargs: Any
+    ) -> DocumentConverterResult:  # TODO: deal with kwargs
         if isinstance(path, Path):
             path = str(path)
+        # Prepare a list of extensions to try (in order of priority)
+        ext = kwargs.get("file_extension")
+        extensions = [ext] if ext is not None else []

-        # Build a base StreamInfo object from which to start guesses
-        base_guess = StreamInfo(
-            local_path=path,
-            extension=os.path.splitext(path)[1],
-            filename=os.path.basename(path),
-        )
-
-        # Extend the base_guess with any additional info from the arguments
-        if stream_info is not None:
-            base_guess = base_guess.copy_and_update(stream_info)
-
-        if file_extension is not None:
-            # Deprecated -- use stream_info
-            base_guess = base_guess.copy_and_update(extension=file_extension)
-
-        if url is not None:
-            # Deprecated -- use stream_info
-            base_guess = base_guess.copy_and_update(url=url)
-
-        with open(path, "rb") as fh:
-            guesses = self._get_stream_info_guesses(
-                file_stream=fh, base_guess=base_guess
-            )
-            return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
+        # Get extension alternatives from the path and puremagic
+        base, ext = os.path.splitext(path)
+        self._append_ext(extensions, ext)
+
+        for g in self._guess_ext_magic(source=path):
+            self._append_ext(extensions, g)
+
+        # Create the ConverterInput object
+        input = ConverterInput(input_type="filepath", filepath=path)
+
+        # Convert
+        return self._convert(input, extensions, **kwargs)
+
+    def convert_file_object(
+        self, file_object: Union[BufferedIOBase, TextIOBase], **kwargs: Any
+    ) -> DocumentConverterResult:  # TODO: deal with kwargs
+        # Prepare a list of extensions to try (in order of priority)
+        ext = kwargs.get("file_extension")
+        extensions = [ext] if ext is not None else []
+
+        # TODO: Currently, there are some ongoing issues with passing direct file objects to puremagic (incorrect guesses, unsupported file type errors, etc.)
+        # Only use puremagic as a last resort if no extensions were provided
+        if extensions == []:
+            for g in self._guess_ext_magic(source=file_object):
+                self._append_ext(extensions, g)
+
+        # Create the ConverterInput object
+        input = ConverterInput(input_type="object", file_object=file_object)
+
+        # Convert
+        return self._convert(input, extensions, **kwargs)
+
+    # TODO: what should stream's type be?
     def convert_stream(
-        self,
-        stream: BinaryIO,
-        *,
-        stream_info: Optional[StreamInfo] = None,
-        file_extension: Optional[str] = None,  # Deprecated -- use stream_info
-        url: Optional[str] = None,  # Deprecated -- use stream_info
-        **kwargs: Any,
-    ) -> DocumentConverterResult:
-        guesses: List[StreamInfo] = []
-
-        # Do we have anything on which to base a guess?
-        base_guess = None
-        if stream_info is not None or file_extension is not None or url is not None:
-            # Start with a non-Null base guess
-            if stream_info is None:
-                base_guess = StreamInfo()
-            else:
-                base_guess = stream_info
-
-            if file_extension is not None:
-                # Deprecated -- use stream_info
-                assert base_guess is not None  # for mypy
-                base_guess = base_guess.copy_and_update(extension=file_extension)
-
-            if url is not None:
-                # Deprecated -- use stream_info
-                assert base_guess is not None  # for mypy
-                base_guess = base_guess.copy_and_update(url=url)
-
-        # Check if we have a seekable stream. If not, load the entire stream into memory.
-        if not stream.seekable():
-            buffer = io.BytesIO()
-            while True:
-                chunk = stream.read(4096)
-                if not chunk:
-                    break
-                buffer.write(chunk)
-            buffer.seek(0)
-            stream = buffer
-
-        # Add guesses based on stream content
-        guesses = self._get_stream_info_guesses(
-            file_stream=stream, base_guess=base_guess or StreamInfo()
-        )
-        return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
+        self, stream: Any, **kwargs: Any
+    ) -> DocumentConverterResult:  # TODO: deal with kwargs
+        # Prepare a list of extensions to try (in order of priority)
+        ext = kwargs.get("file_extension")
+        extensions = [ext] if ext is not None else []
+
+        # Save the file locally to a temporary file. It will be deleted before this method exits
+        handle, temp_path = tempfile.mkstemp()
+        fh = os.fdopen(handle, "wb")
+        result = None
+        try:
+            # Write to the temporary file
+            content = stream.read()
+            if isinstance(content, str):
+                fh.write(content.encode("utf-8"))
+            else:
+                fh.write(content)
+            fh.close()
+
+            # Use puremagic to check for more extension options
+            for g in self._guess_ext_magic(source=temp_path):
+                self._append_ext(extensions, g)
+
+            # Create the ConverterInput object
+            input = ConverterInput(input_type="filepath", filepath=temp_path)
+
+            # Convert
+            result = self._convert(input, extensions, **kwargs)
+        # Clean up
+        finally:
+            try:
+                fh.close()
+            except Exception:
+                pass
+            os.unlink(temp_path)
+
+        return result
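The seekable() check above is the main behavioral difference from the right-hand branch, which always spools the stream to a temporary file: the 0.1.0 code buffers into memory only when the stream cannot seek. The same idiom in isolation (an illustrative helper, not part of either branch):

import io

def ensure_seekable(stream, chunk_size=4096):
    # Return the stream unchanged if it supports seeking; otherwise copy it
    # chunk-by-chunk into an in-memory BytesIO and rewind to the start.
    if stream.seekable():
        return stream
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)
    return buffer
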
     def convert_url(
-        self,
-        url: str,
-        *,
-        stream_info: Optional[StreamInfo] = None,
-        file_extension: Optional[str] = None,  # Deprecated -- use stream_info
-        mock_url: Optional[
-            str
-        ] = None,  # Mock the request as if it came from a different URL
-        **kwargs: Any,
+        self, url: str, **kwargs: Any
     ) -> DocumentConverterResult:  # TODO: fix kwargs type
         # Send a HTTP request to the URL
         response = self._requests_session.get(url, stream=True)
         response.raise_for_status()
-        return self.convert_response(
-            response,
-            stream_info=stream_info,
-            file_extension=file_extension,
-            url=mock_url,
-            **kwargs,
-        )
+        return self.convert_response(response, **kwargs)
     def convert_response(
-        self,
-        response: requests.Response,
-        *,
-        stream_info: Optional[StreamInfo] = None,
-        file_extension: Optional[str] = None,  # Deprecated -- use stream_info
-        url: Optional[str] = None,  # Deprecated -- use stream_info
-        **kwargs: Any,
-    ) -> DocumentConverterResult:
-        # If there is a content-type header, get the mimetype and charset (if present)
-        mimetype: Optional[str] = None
-        charset: Optional[str] = None
-
-        if "content-type" in response.headers:
-            parts = response.headers["content-type"].split(";")
-            mimetype = parts.pop(0).strip()
-            for part in parts:
-                if part.strip().startswith("charset="):
-                    _charset = part.split("=")[1].strip()
-                    if len(_charset) > 0:
-                        charset = _charset
-
-        # If there is a content-disposition header, get the filename and possibly the extension
-        filename: Optional[str] = None
-        extension: Optional[str] = None
-        if "content-disposition" in response.headers:
-            m = re.search(r"filename=([^;]+)", response.headers["content-disposition"])
-            if m:
-                filename = m.group(1).strip("\"'")
-                _, _extension = os.path.splitext(filename)
-                if len(_extension) > 0:
-                    extension = _extension
-
-        # If there is still no filename, try to read it from the url
-        if filename is None:
-            parsed_url = urlparse(response.url)
-            _, _extension = os.path.splitext(parsed_url.path)
-            if len(_extension) > 0:  # Looks like this might be a file!
-                filename = os.path.basename(parsed_url.path)
-                extension = _extension
-
-        # Create an initial guess from all this information
-        base_guess = StreamInfo(
-            mimetype=mimetype,
-            charset=charset,
-            filename=filename,
-            extension=extension,
-            url=response.url,
-        )
-
-        # Update with any additional info from the arguments
-        if stream_info is not None:
-            base_guess = base_guess.copy_and_update(stream_info)
-        if file_extension is not None:
-            # Deprecated -- use stream_info
-            base_guess = base_guess.copy_and_update(extension=file_extension)
-        if url is not None:
-            # Deprecated -- use stream_info
-            base_guess = base_guess.copy_and_update(url=url)
-
-        # Read into BytesIO
-        buffer = io.BytesIO()
-        for chunk in response.iter_content(chunk_size=512):
-            buffer.write(chunk)
-        buffer.seek(0)
-
-        # Convert
-        guesses = self._get_stream_info_guesses(
-            file_stream=buffer, base_guess=base_guess
-        )
-        return self._convert(file_stream=buffer, stream_info_guesses=guesses, **kwargs)
+        self, response: requests.Response, **kwargs: Any
+    ) -> DocumentConverterResult:  # TODO fix kwargs type
+        # Prepare a list of extensions to try (in order of priority)
+        ext = kwargs.get("file_extension")
+        extensions = [ext] if ext is not None else []
+
+        # Guess from the mimetype
+        content_type = response.headers.get("content-type", "").split(";")[0]
+        self._append_ext(extensions, mimetypes.guess_extension(content_type))
+
+        # Read the content disposition if there is one
+        content_disposition = response.headers.get("content-disposition", "")
+        m = re.search(r"filename=([^;]+)", content_disposition)
+        if m:
+            base, ext = os.path.splitext(m.group(1).strip("\"'"))
+            self._append_ext(extensions, ext)
+
+        # Read the extension from the path
+        base, ext = os.path.splitext(urlparse(response.url).path)
+        self._append_ext(extensions, ext)
+
+        # Save the file locally to a temporary file. It will be deleted before this method exits
+        handle, temp_path = tempfile.mkstemp()
+        fh = os.fdopen(handle, "wb")
+        result = None
+        try:
+            # Download the file
+            for chunk in response.iter_content(chunk_size=512):
+                fh.write(chunk)
+            fh.close()
+
+            # Use puremagic to check for more extension options
+            for g in self._guess_ext_magic(source=temp_path):
+                self._append_ext(extensions, g)
+
+            # Create the ConverterInput object
+            input = ConverterInput(input_type="filepath", filepath=temp_path)
+
+            # Convert
+            result = self._convert(input, extensions, url=response.url, **kwargs)
+        # Clean up
+        finally:
+            try:
+                fh.close()
+            except Exception:
+                pass
+            os.unlink(temp_path)
+
+        return result
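Both versions mine the same three sources for type hints: the Content-Type header, the Content-Disposition header, and the URL path. The 0.1.0 header parsing, run against hypothetical values for illustration:

import os
import re
from urllib.parse import urlparse

headers = {
    "content-type": "text/html; charset=utf-8",
    "content-disposition": 'attachment; filename="page.html"',
}
url = "https://example.com/downloads/page.html"

parts = headers["content-type"].split(";")
mimetype = parts.pop(0).strip()                      # 'text/html'
charset = None
for part in parts:
    if part.strip().startswith("charset="):
        charset = part.split("=")[1].strip()         # 'utf-8'

m = re.search(r"filename=([^;]+)", headers["content-disposition"])
filename = m.group(1).strip("\"'") if m else None    # 'page.html'
extension = os.path.splitext(urlparse(url).path)[1]  # '.html'
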
     def _convert(
-        self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
+        self, input: ConverterInput, extensions: List[Union[str, None]], **kwargs
     ) -> DocumentConverterResult:
-        res: Union[None, DocumentConverterResult] = None
-
-        # Keep track of which converters throw exceptions
-        failed_attempts: List[FailedConversionAttempt] = []
+        error_trace = ""

         # Create a copy of the page_converters list, sorted by priority.
         # We do this with each call to _convert because the priority of converters may change between calls.
         # The sort is guaranteed to be stable, so converters with the same priority will remain in the same order.
-        sorted_registrations = sorted(self._converters, key=lambda x: x.priority)
+        sorted_converters = sorted(self._page_converters, key=lambda x: x.priority)

-        # Remember the initial stream position so that we can return to it
-        cur_pos = file_stream.tell()
-
-        for stream_info in stream_info_guesses + [StreamInfo()]:
-            for converter_registration in sorted_registrations:
-                converter = converter_registration.converter
-                # Sanity check -- make sure the cur_pos is still the same
-                assert (
-                    cur_pos == file_stream.tell()
-                ), f"File stream position should NOT change between guess iterations"
-
-                _kwargs = {k: v for k, v in kwargs.items()}
+        for ext in extensions + [None]:  # Try last with no extension
+            for converter in sorted_converters:
+                _kwargs = copy.deepcopy(kwargs)
+
+                # Overwrite file_extension appropriately
+                if ext is None:
+                    if "file_extension" in _kwargs:
+                        del _kwargs["file_extension"]
+                else:
+                    _kwargs.update({"file_extension": ext})

                 # Copy any additional global options
                 if "llm_client" not in _kwargs and self._llm_client is not None:
@@ -492,40 +377,13 @@ class MarkItDown:
                     _kwargs["exiftool_path"] = self._exiftool_path

                 # Add the list of converters for nested processing
-                _kwargs["_parent_converters"] = self._converters
+                _kwargs["_parent_converters"] = self._page_converters

-                # Add legacy kwargs
-                if stream_info is not None:
-                    if stream_info.extension is not None:
-                        _kwargs["file_extension"] = stream_info.extension
-
-                    if stream_info.url is not None:
-                        _kwargs["url"] = stream_info.url
-
-                # Check if the converter will accept the file, and if so, try to convert it
-                _accepts = False
-                try:
-                    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
-                except NotImplementedError:
-                    pass
-
-                # accept() should not have changed the file stream position
-                assert (
-                    cur_pos == file_stream.tell()
-                ), f"{type(converter).__name__}.accept() should NOT change the file_stream position"
-
-                # Attempt the conversion
-                if _accepts:
-                    try:
-                        res = converter.convert(file_stream, stream_info, **_kwargs)
-                    except Exception:
-                        failed_attempts.append(
-                            FailedConversionAttempt(
-                                converter=converter, exc_info=sys.exc_info()
-                            )
-                        )
-                    finally:
-                        file_stream.seek(cur_pos)
+                # If we hit an error log it and keep trying
+                try:
+                    res = converter.convert(input, **_kwargs)
+                except Exception:
+                    error_trace = ("\n\n" + traceback.format_exc()).strip()

                 if res is not None:
                     # Normalize the content
@@ -533,17 +391,81 @@ class MarkItDown:
                         [line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
                     )
                     res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)

+                    # Todo
                     return res

         # If we got this far without success, report any exceptions
-        if len(failed_attempts) > 0:
-            raise FileConversionException(attempts=failed_attempts)
+        if len(error_trace) > 0:
+            raise FileConversionException(
+                f"Could not convert '{input.filepath}' to Markdown. File type was recognized as {extensions}. While converting the file, the following error was encountered:\n\n{error_trace}"
+            )

         # Nothing can handle it!
         raise UnsupportedFormatException(
-            f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
+            f"Could not convert '{input.filepath}' to Markdown. The formats {extensions} are not supported."
         )
+    def _append_ext(self, extensions, ext):
+        """Append a unique non-None, non-empty extension to a list of extensions."""
+        if ext is None:
+            return
+        ext = ext.strip()
+        if ext == "":
+            return
+        # if ext not in extensions:
+        extensions.append(ext)
+
+    def _guess_ext_magic(self, source):
+        """Use puremagic (a Python implementation of libmagic) to guess a file's extension based on the first few bytes."""
+        # Use puremagic to guess
+        try:
+            guesses = []
+
+            # Guess extensions for filepaths
+            if isinstance(source, str):
+                guesses = puremagic.magic_file(source)
+
+                # Fix for: https://github.com/microsoft/markitdown/issues/222
+                # If there are no guesses, then try again after trimming leading ASCII whitespaces.
+                # ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
+                # (space, tab, newline, carriage return, vertical tab, form feed).
+                if len(guesses) == 0:
+                    with open(source, "rb") as file:
+                        while True:
+                            char = file.read(1)
+                            if not char:  # End of file
+                                break
+                            if not char.isspace():
+                                file.seek(file.tell() - 1)
+                                break
+                        try:
+                            guesses = puremagic.magic_stream(file)
+                        except puremagic.main.PureError:
+                            pass
+
+            # Guess extensions for file objects. Note that puremagic's magic_stream function requires a BytesIO-like file source
+            # TODO: Figure out how to guess extensions for TextIO-like file sources (manually converting to BytesIO does not work)
+            elif isinstance(source, BufferedIOBase):
+                guesses = puremagic.magic_stream(source)
+
+            extensions = list()
+            for g in guesses:
+                ext = g.extension.strip()
+                if len(ext) > 0:
+                    if not ext.startswith("."):
+                        ext = "." + ext
+                    if ext not in extensions:
+                        extensions.append(ext)
+            return extensions
+        except FileNotFoundError:
+            pass
+        except IsADirectoryError:
+            pass
+        except PermissionError:
+            pass
+        return []
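The leading-whitespace retry above works around microsoft/markitdown#222, where puremagic fails to identify files that begin with ASCII whitespace. A standalone sketch of the same idea (assuming puremagic is installed; the payload is hypothetical):

import io
import puremagic

payload = b"\n\n  %PDF-1.7"  # hypothetical content with leading whitespace
try:
    guesses = puremagic.magic_stream(io.BytesIO(payload))
except puremagic.main.PureError:
    guesses = []
if not guesses:
    # Trim the ASCII whitespace bytes b' \t\n\r\x0b\f' and try again
    trimmed = io.BytesIO(payload.lstrip(b" \t\n\r\x0b\f"))
    try:
        guesses = puremagic.magic_stream(trimmed)
    except puremagic.main.PureError:
        pass
print([g.extension for g in guesses])
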
     def register_page_converter(self, converter: DocumentConverter) -> None:
         """DEPRECATED: Use register_converter instead."""
         warn(
@@ -552,146 +474,6 @@ class MarkItDown:
         )
         self.register_converter(converter)

-    def register_converter(
-        self,
-        converter: DocumentConverter,
-        *,
-        priority: float = PRIORITY_SPECIFIC_FILE_FORMAT,
-    ) -> None:
-        """
-        Register a DocumentConverter with a given priority.
-
-        Priorities work as follows: By default, most converters get priority
-        DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
-        is the PlainTextConverter, HtmlConverter, and ZipConverter, which get
-        priority PRIORITY_GENERIC_FILE_FORMAT (== 10), with lower values
-        being tried first (i.e., higher priority).
-
-        Just prior to conversion, the converters are sorted by priority, using
-        a stable sort. This means that converters with the same priority will
-        remain in the same order, with the most recently registered converters
-        appearing first.
-
-        We have tight control over the order of built-in converters, but
-        plugins can register converters in any order. The registration's priority
-        field reasserts some control over the order of converters.
-
-        Plugins can register converters with any priority, to appear before or
-        after the built-ins. For example, a plugin with priority 9 will run
-        before the PlainTextConverter, but after the other built-in converters.
-        """
-        self._converters.insert(
-            0, ConverterRegistration(converter=converter, priority=priority)
-        )
+    def register_converter(self, converter: DocumentConverter) -> None:
+        """Register a page text converter."""
+        self._page_converters.insert(0, converter)
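Under the 0.1.0 registration API removed on the right, a plugin would attach a converter like this (a sketch; MyPluginConverter is hypothetical, and DocumentConverter is assumed to be importable from the package root as in 0.1.0 plugins):

from markitdown import MarkItDown, DocumentConverter

class MyPluginConverter(DocumentConverter):
    ...  # implements accepts() and convert() per the 0.1.0 interface

md = MarkItDown()
# 9.0 runs after the format-specific built-ins (0.0) but before the
# generic tier (PlainTextConverter/HtmlConverter/ZipConverter at 10.0)
md.register_converter(MyPluginConverter(), priority=9.0)
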
-    def _get_stream_info_guesses(
-        self, file_stream: BinaryIO, base_guess: StreamInfo
-    ) -> List[StreamInfo]:
-        """
-        Given a base guess, attempt to guess or expand on the stream info using the stream content (via magika).
-        """
-        guesses: List[StreamInfo] = []
-
-        # Enhance the base guess with information based on the extension or mimetype
-        enhanced_guess = base_guess.copy_and_update()
-
-        # If there's an extension and no mimetype, try to guess the mimetype
-        if base_guess.mimetype is None and base_guess.extension is not None:
-            _m, _ = mimetypes.guess_type(
-                "placeholder" + base_guess.extension, strict=False
-            )
-            if _m is not None:
-                enhanced_guess = enhanced_guess.copy_and_update(mimetype=_m)
-
-        # If there's a mimetype and no extension, try to guess the extension
-        if base_guess.mimetype is not None and base_guess.extension is None:
-            _e = mimetypes.guess_all_extensions(base_guess.mimetype, strict=False)
-            if len(_e) > 0:
-                enhanced_guess = enhanced_guess.copy_and_update(extension=_e[0])
-
-        # Call magika to guess from the stream
-        cur_pos = file_stream.tell()
-        try:
-            result = self._magika.identify_stream(file_stream)
-            if result.status == "ok" and result.prediction.output.label != "unknown":
-                # If it's text, also guess the charset
-                charset = None
-                if result.prediction.output.is_text:
-                    # Read the first 4k to guess the charset
-                    file_stream.seek(cur_pos)
-                    stream_page = file_stream.read(4096)
-                    charset_result = charset_normalizer.from_bytes(stream_page).best()
-
-                    if charset_result is not None:
-                        charset = self._normalize_charset(charset_result.encoding)
-
-                # Normalize the first extension listed
-                guessed_extension = None
-                if len(result.prediction.output.extensions) > 0:
-                    guessed_extension = "." + result.prediction.output.extensions[0]
-
-                # Determine if the guess is compatible with the base guess
-                compatible = True
-                if (
-                    base_guess.mimetype is not None
-                    and base_guess.mimetype != result.prediction.output.mime_type
-                ):
-                    compatible = False
-
-                if (
-                    base_guess.extension is not None
-                    and base_guess.extension.lstrip(".")
-                    not in result.prediction.output.extensions
-                ):
-                    compatible = False
-
-                if (
-                    base_guess.charset is not None
-                    and self._normalize_charset(base_guess.charset) != charset
-                ):
-                    compatible = False
-
-                if compatible:
-                    # Add the compatible base guess
-                    guesses.append(
-                        StreamInfo(
-                            mimetype=base_guess.mimetype
-                            or result.prediction.output.mime_type,
-                            extension=base_guess.extension or guessed_extension,
-                            charset=base_guess.charset or charset,
-                            filename=base_guess.filename,
-                            local_path=base_guess.local_path,
-                            url=base_guess.url,
-                        )
-                    )
-                else:
-                    # The magika guess was incompatible with the base guess, so add both guesses
-                    guesses.append(enhanced_guess)
-                    guesses.append(
-                        StreamInfo(
-                            mimetype=result.prediction.output.mime_type,
-                            extension=guessed_extension,
-                            charset=charset,
-                            filename=base_guess.filename,
-                            local_path=base_guess.local_path,
-                            url=base_guess.url,
-                        )
-                    )
-            else:
-                # There were no other guesses, so just add the base guess
-                guesses.append(enhanced_guess)
-        finally:
-            file_stream.seek(cur_pos)
-
-        return guesses
-
-    def _normalize_charset(self, charset: str | None) -> str | None:
-        """
-        Normalize a charset string to a canonical form.
-        """
-        if charset is None:
-            return None
-        try:
-            return codecs.lookup(charset).name
-        except LookupError:
-            return charset
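The removed _normalize_charset leans on Python's codec registry, so spelling variants of the same encoding compare equal:

import codecs

for alias in ("utf8", "UTF-8", "latin-1", "not-a-charset"):
    try:
        print(alias, "->", codecs.lookup(alias).name)
    except LookupError:
        print(alias, "-> left unchanged")
# "utf8" and "UTF-8" both normalize to "utf-8", "latin-1" to "iso8859-1",
# and unknown names fall through unchanged, mirroring the method above.
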
@@ -1,32 +0,0 @@
-from dataclasses import dataclass, asdict
-from typing import Optional
-
-
-@dataclass(kw_only=True, frozen=True)
-class StreamInfo:
-    """The StreamInfo class is used to store information about a file stream.
-    All fields can be None, and will depend on how the stream was opened.
-    """
-
-    mimetype: Optional[str] = None
-    extension: Optional[str] = None
-    charset: Optional[str] = None
-    filename: Optional[
-        str
-    ] = None  # From local path, url, or Content-Disposition header
-    local_path: Optional[str] = None  # If read from disk
-    url: Optional[str] = None  # If read from url
-
-    def copy_and_update(self, *args, **kwargs):
-        """Copy the StreamInfo object and update it with the given StreamInfo
-        instance and/or other keyword arguments."""
-        new_info = asdict(self)
-
-        for si in args:
-            assert isinstance(si, StreamInfo)
-            new_info.update({k: v for k, v in asdict(si).items() if v is not None})
-
-        if len(kwargs) > 0:
-            new_info.update(kwargs)
-
-        return StreamInfo(**new_info)
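Since StreamInfo is a frozen dataclass, copy_and_update() is how callers derive new guesses; a short usage sketch of the class being deleted here:

base = StreamInfo(extension=".html", filename="page.html")

# Merging another StreamInfo copies only its non-None fields
merged = base.copy_and_update(StreamInfo(mimetype="text/html"))

# Keyword arguments override individual fields
final = merged.copy_and_update(charset="utf-8")
print(final.mimetype, final.extension, final.charset)  # text/html .html utf-8
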
@@ -2,6 +2,7 @@
 #
 # SPDX-License-Identifier: MIT

+from ._base import DocumentConverter, DocumentConverterResult
 from ._plain_text_converter import PlainTextConverter
 from ._html_converter import HtmlConverter
 from ._rss_converter import RssConverter
@@ -14,13 +15,16 @@ from ._docx_converter import DocxConverter
 from ._xlsx_converter import XlsxConverter, XlsConverter
 from ._pptx_converter import PptxConverter
 from ._image_converter import ImageConverter
-from ._audio_converter import AudioConverter
+from ._wav_converter import WavConverter
+from ._mp3_converter import Mp3Converter
 from ._outlook_msg_converter import OutlookMsgConverter
 from ._zip_converter import ZipConverter
 from ._doc_intel_converter import DocumentIntelligenceConverter
-from ._epub_converter import EpubConverter
+from ._converter_input import ConverterInput

 __all__ = [
+    "DocumentConverter",
+    "DocumentConverterResult",
     "PlainTextConverter",
     "HtmlConverter",
     "RssConverter",
@@ -34,9 +38,10 @@ __all__ = [
     "XlsConverter",
     "PptxConverter",
     "ImageConverter",
-    "AudioConverter",
+    "WavConverter",
+    "Mp3Converter",
     "OutlookMsgConverter",
     "ZipConverter",
     "DocumentIntelligenceConverter",
-    "EpubConverter",
+    "ConverterInput",
 ]
@@ -1,102 +0,0 @@
-import io
-from typing import Any, BinaryIO, Optional
-
-from ._exiftool import exiftool_metadata
-from ._transcribe_audio import transcribe_audio
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-from .._exceptions import MissingDependencyException
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "audio/x-wav",
-    "audio/mpeg",
-    "video/mp4",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".wav",
-    ".mp3",
-    ".m4a",
-    ".mp4",
-]
-
-
-class AudioConverter(DocumentConverter):
-    """
-    Converts audio files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
-    """
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        md_content = ""
-
-        # Add metadata
-        metadata = exiftool_metadata(
-            file_stream, exiftool_path=kwargs.get("exiftool_path")
-        )
-        if metadata:
-            for f in [
-                "Title",
-                "Artist",
-                "Author",
-                "Band",
-                "Album",
-                "Genre",
-                "Track",
-                "DateTimeOriginal",
-                "CreateDate",
-                # "Duration",  -- Wrong values when read from memory
-                "NumChannels",
-                "SampleRate",
-                "AvgBytesPerSec",
-                "BitsPerSample",
-            ]:
-                if f in metadata:
-                    md_content += f"{f}: {metadata[f]}\n"
-
-        # Figure out the audio format for transcription
-        if stream_info.extension == ".wav" or stream_info.mimetype == "audio/x-wav":
-            audio_format = "wav"
-        elif stream_info.extension == ".mp3" or stream_info.mimetype == "audio/mpeg":
-            audio_format = "mp3"
-        elif (
-            stream_info.extension in [".mp4", ".m4a"]
-            or stream_info.mimetype == "video/mp4"
-        ):
-            audio_format = "mp4"
-        else:
-            audio_format = None
-
-        # Transcribe
-        if audio_format:
-            try:
-                transcript = transcribe_audio(file_stream, audio_format=audio_format)
-                if transcript:
-                    md_content += "\n\n### Audio Transcript:\n" + transcript
-            except MissingDependencyException:
-                pass
-
-        # Return the result
-        return DocumentConverterResult(markdown=md_content.strip())
packages/markitdown/src/markitdown/converters/_base.py (new file, 63 lines)
@@ -0,0 +1,63 @@
+from typing import Any, Union
+
+
+class DocumentConverterResult:
+    """The result of converting a document to text."""
+
+    def __init__(self, title: Union[str, None] = None, text_content: str = ""):
+        self.title: Union[str, None] = title
+        self.text_content: str = text_content
+
+
+class DocumentConverter:
+    """Abstract superclass of all DocumentConverters."""
+
+    # Lower priority values are tried first.
+    PRIORITY_SPECIFIC_FILE_FORMAT = (
+        0.0  # e.g., .docx, .pdf, .xlsx, or specific pages, e.g., wikipedia
+    )
+    PRIORITY_GENERIC_FILE_FORMAT = (
+        10.0  # Near catch-all converters for mimetypes like text/*, etc.
+    )
+
+    def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
+        """
+        Initialize the DocumentConverter with a given priority.
+
+        Priorities work as follows: By default, most converters get priority
+        DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
+        is the PlainTextConverter, which gets priority PRIORITY_GENERIC_FILE_FORMAT (== 10),
+        with lower values being tried first (i.e., higher priority).
+
+        Just prior to conversion, the converters are sorted by priority, using
+        a stable sort. This means that converters with the same priority will
+        remain in the same order, with the most recently registered converters
+        appearing first.
+
+        We have tight control over the order of built-in converters, but
+        plugins can register converters in any order. A converter's priority
+        field reasserts some control over the order of converters.
+
+        Plugins can register converters with any priority, to appear before or
+        after the built-ins. For example, a plugin with priority 9 will run
+        before the PlainTextConverter, but after the other built-in converters.
+        """
+        self._priority = priority
+
+    def convert(
+        self, local_path: str, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        raise NotImplementedError("Subclasses must implement this method")
+
+    @property
+    def priority(self) -> float:
+        """Priority of the converter in markitdown's converter list. Lower priority values are tried first."""
+        return self._priority
+
+    @priority.setter
+    def priority(self, value: float):
+        self._priority = value
+
+    @priority.deleter
+    def priority(self):
+        raise AttributeError("Cannot delete the priority attribute")
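Under the branch's replacement interface, a converter subclass follows the pattern used throughout the right-hand columns below: take a ConverterInput, bail with None when the extension does not match, and return a DocumentConverterResult otherwise. A hypothetical example (ReverseTextConverter and the .rev extension are invented for illustration):

from typing import Any, Union

class ReverseTextConverter(DocumentConverter):
    """Hypothetical converter that handles .rev files by reversing their text."""

    def convert(
        self, input: ConverterInput, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        extension = kwargs.get("file_extension", "")
        if extension.lower() != ".rev":
            return None  # signals "not handled"; _convert() tries the next converter
        file_obj = input.read_file(mode="rt", encoding="utf-8")
        text = file_obj.read()
        file_obj.close()
        return DocumentConverterResult(title=None, text_content=text[::-1])
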
@@ -1,24 +1,14 @@
-import io
-import re
+# type: ignore
 import base64
-import binascii
+import re

+from typing import Union
 from urllib.parse import parse_qs, urlparse
-from typing import Any, BinaryIO, Optional
 from bs4 import BeautifulSoup

-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
+from ._base import DocumentConverter, DocumentConverterResult
 from ._markdownify import _CustomMarkdownify
+from ._converter_input import ConverterInput

-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/html",
-    "application/xhtml",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".html",
-    ".htm",
-]
-

 class BingSerpConverter(DocumentConverter):
@@ -27,49 +17,31 @@ class BingSerpConverter(DocumentConverter):
     NOTE: It is better to use the Bing API
     """

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        """
-        Make sure we're dealing with HTML content *from* Bing.
-        """
-
-        url = stream_info.url or ""
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if not re.search(r"^https://www\.bing\.com/search\?q=", url):
-            # Not a Bing SERP URL
-            return False
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        # Not HTML content
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        assert stream_info.url is not None
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a Bing SERP
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".html", ".htm"]:
+            return None
+        url = kwargs.get("url", "")
+        if not re.search(r"^https://www\.bing\.com/search\?q=", url):
+            return None

         # Parse the query parameters
-        parsed_params = parse_qs(urlparse(stream_info.url).query)
+        parsed_params = parse_qs(urlparse(url).query)
         query = parsed_params.get("q", [""])[0]

-        # Parse the stream
-        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        # Parse the file
+        soup = None
+        file_obj = input.read_file(mode="rt", encoding="utf-8")
+        soup = BeautifulSoup(file_obj.read(), "html.parser")
+        file_obj.close()

         # Clean up some formatting
         for tptt in soup.find_all(class_="tptt"):
@@ -79,12 +51,9 @@ class BingSerpConverter(DocumentConverter):
             slug.extract()

         # Parse the algorithmic results
-        _markdownify = _CustomMarkdownify(**kwargs)
+        _markdownify = _CustomMarkdownify()
         results = list()
         for result in soup.find_all(class_="b_algo"):
-            if not hasattr(result, "find_all"):
-                continue
-
             # Rewrite redirect urls
             for a in result.find_all("a", href=True):
                 parsed_href = urlparse(a["href"])
@@ -116,6 +85,6 @@ class BingSerpConverter(DocumentConverter):
         )

         return DocumentConverterResult(
-            markdown=webpage_text,
             title=None if soup.title is None else soup.title.string,
+            text_content=webpage_text,
         )
@@ -0,0 +1,30 @@
+from typing import Any, Union
+
+
+class ConverterInput:
+    """
+    Wrapper for inputs to converter functions.
+    """
+
+    def __init__(
+        self,
+        input_type: str = "filepath",
+        filepath: Union[str, None] = None,
+        file_object: Union[Any, None] = None,
+    ):
+        if input_type not in ["filepath", "object"]:
+            raise ValueError(f"Invalid converter input type: {input_type}")
+
+        self.input_type = input_type
+        self.filepath = filepath
+        self.file_object = file_object
+
+    def read_file(
+        self,
+        mode: str = "rb",
+        encoding: Union[str, None] = None,
+    ) -> Any:
+        if self.input_type == "object":
+            return self.file_object
+
+        return open(self.filepath, mode=mode, encoding=encoding)
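Both construction modes of ConverterInput in action (a sketch; sample.html is a placeholder path):

import io

# Path-backed input: read_file() opens the file with the requested mode
path_input = ConverterInput(input_type="filepath", filepath="sample.html")
file_obj = path_input.read_file(mode="rt", encoding="utf-8")

# Object-backed input: read_file() hands back the wrapped object as-is,
# so the mode and encoding arguments are ignored
obj_input = ConverterInput(input_type="object", file_object=io.BytesIO(b"<html></html>"))
same_obj = obj_input.read_file(mode="rb")
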
@@ -1,27 +1,17 @@
|
|||||||
import sys
|
from typing import Any, Union
|
||||||
import re
|
import re
|
||||||
|
|
||||||
from typing import BinaryIO, Any, List
|
# Azure imports
|
||||||
|
from azure.ai.documentintelligence import DocumentIntelligenceClient
|
||||||
|
from azure.ai.documentintelligence.models import (
|
||||||
|
AnalyzeDocumentRequest,
|
||||||
|
AnalyzeResult,
|
||||||
|
DocumentAnalysisFeature,
|
||||||
|
)
|
||||||
|
from azure.identity import DefaultAzureCredential
|
||||||
|
|
||||||
from ._html_converter import HtmlConverter
|
from ._base import DocumentConverter, DocumentConverterResult
|
||||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
from ._converter_input import ConverterInput
|
||||||
from .._stream_info import StreamInfo
|
|
||||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
|
||||||
|
|
||||||
# Try loading optional (but in this case, required) dependencies
|
|
||||||
# Save reporting of any exceptions for later
|
|
||||||
_dependency_exc_info = None
|
|
||||||
try:
|
|
||||||
from azure.ai.documentintelligence import DocumentIntelligenceClient
|
|
||||||
from azure.ai.documentintelligence.models import (
|
|
||||||
AnalyzeDocumentRequest,
|
|
||||||
AnalyzeResult,
|
|
||||||
DocumentAnalysisFeature,
|
|
||||||
)
|
|
||||||
from azure.identity import DefaultAzureCredential
|
|
||||||
except ImportError:
|
|
||||||
# Preserve the error and stack trace for later
|
|
||||||
_dependency_exc_info = sys.exc_info()
|
|
||||||
|
|
||||||
|
|
||||||
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
|
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
|
||||||
@@ -29,62 +19,17 @@ except ImportError:
|
|||||||
CONTENT_FORMAT = "markdown"
|
CONTENT_FORMAT = "markdown"
|
||||||
|
|
||||||
|
|
||||||
OFFICE_MIME_TYPE_PREFIXES = [
|
|
||||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
|
||||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
|
||||||
"application/vnd.openxmlformats-officedocument.presentationml",
|
|
||||||
"application/xhtml",
|
|
||||||
"text/html",
|
|
||||||
]
|
|
||||||
|
|
||||||
OTHER_MIME_TYPE_PREFIXES = [
|
|
||||||
"application/pdf",
|
|
||||||
"application/x-pdf",
|
|
||||||
"text/html",
|
|
||||||
"image/",
|
|
||||||
]
|
|
||||||
|
|
||||||
OFFICE_FILE_EXTENSIONS = [
|
|
||||||
".docx",
|
|
||||||
".xlsx",
|
|
||||||
".pptx",
|
|
||||||
".html",
|
|
||||||
".htm",
|
|
||||||
]
|
|
||||||
|
|
||||||
OTHER_FILE_EXTENSIONS = [
|
|
||||||
".pdf",
|
|
||||||
".jpeg",
|
|
||||||
".jpg",
|
|
||||||
".png",
|
|
||||||
".bmp",
|
|
||||||
".tiff",
|
|
||||||
".heif",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
class DocumentIntelligenceConverter(DocumentConverter):
|
class DocumentIntelligenceConverter(DocumentConverter):
|
||||||
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
|
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
self,
|
self,
|
||||||
*,
|
*,
|
||||||
|
priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT,
|
||||||
endpoint: str,
|
endpoint: str,
|
||||||
api_version: str = "2024-07-31-preview",
|
api_version: str = "2024-07-31-preview",
|
||||||
):
|
):
|
||||||
super().__init__()
|
super().__init__(priority=priority)
|
||||||
|
|
||||||
# Raise an error if the dependencies are not available.
|
|
||||||
# This is different than other converters since this one isn't even instantiated
|
|
||||||
# unless explicitly requested.
|
|
||||||
if _dependency_exc_info is not None:
|
|
||||||
raise MissingDependencyException(
|
|
||||||
"DocumentIntelligenceConverter requires the optional dependency [az-doc-intel] (or [all]) to be installed. E.g., `pip install markitdown[az-doc-intel]`"
|
|
||||||
) from _dependency_exc_info[
|
|
||||||
1
|
|
||||||
].with_traceback( # type: ignore[union-attr]
|
|
||||||
_dependency_exc_info[2]
|
|
||||||
)
|
|
||||||
|
|
||||||
self.endpoint = endpoint
|
self.endpoint = endpoint
|
||||||
self.api_version = api_version
|
self.api_version = api_version
|
||||||
@@ -94,61 +39,54 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
|||||||
credential=DefaultAzureCredential(),
|
credential=DefaultAzureCredential(),
|
||||||
)
|
)
|
||||||
|
|
||||||
def accepts(
|
|
||||||
self,
|
|
||||||
file_stream: BinaryIO,
|
|
||||||
stream_info: StreamInfo,
|
|
||||||
**kwargs: Any, # Options to pass to the converter
|
|
||||||
) -> bool:
|
|
||||||
mimetype = (stream_info.mimetype or "").lower()
|
|
||||||
extension = (stream_info.extension or "").lower()
|
|
||||||
|
|
||||||
if extension in OFFICE_FILE_EXTENSIONS + OTHER_FILE_EXTENSIONS:
|
|
||||||
return True
|
|
||||||
|
|
||||||
for prefix in OFFICE_MIME_TYPE_PREFIXES + OTHER_MIME_TYPE_PREFIXES:
|
|
||||||
if mimetype.startswith(prefix):
|
|
||||||
return True
|
|
||||||
|
|
||||||
return False
|
|
||||||
|
|
||||||
def _analysis_features(self, stream_info: StreamInfo) -> List[str]:
|
|
||||||
"""
|
|
||||||
Helper needed to determine which analysis features to use.
|
|
||||||
Certain document analysis features are not availiable for
|
|
||||||
office filetypes (.xlsx, .pptx, .html, .docx)
|
|
||||||
"""
|
|
||||||
mimetype = (stream_info.mimetype or "").lower()
|
|
||||||
extension = (stream_info.extension or "").lower()
|
|
||||||
|
|
||||||
if extension in OFFICE_FILE_EXTENSIONS:
|
|
||||||
return []
|
|
||||||
|
|
||||||
for prefix in OFFICE_MIME_TYPE_PREFIXES:
|
|
||||||
if mimetype.startswith(prefix):
|
|
||||||
return []
|
|
||||||
|
|
||||||
return [
|
|
||||||
DocumentAnalysisFeature.FORMULAS, # enable formula extraction
|
|
||||||
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
|
|
||||||
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
|
|
||||||
]
|
|
||||||
|
|
||||||
def convert(
|
def convert(
|
||||||
self,
|
self, input: ConverterInput, **kwargs: Any
|
||||||
file_stream: BinaryIO,
|
) -> Union[None, DocumentConverterResult]:
|
||||||
stream_info: StreamInfo,
|
# Bail if extension is not supported by Document Intelligence
|
||||||
**kwargs: Any, # Options to pass to the converter
|
extension = kwargs.get("file_extension", "")
|
||||||
) -> DocumentConverterResult:
|
docintel_extensions = [
|
||||||
|
".pdf",
|
||||||
|
".docx",
|
||||||
|
".xlsx",
|
||||||
|
".pptx",
|
||||||
|
".html",
|
||||||
|
".jpeg",
|
||||||
|
".jpg",
|
||||||
|
".png",
|
||||||
|
".bmp",
|
||||||
|
".tiff",
|
||||||
|
".heif",
|
||||||
|
]
|
||||||
|
if extension.lower() not in docintel_extensions:
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Get the bytestring from the converter input
|
||||||
|
file_obj = input.read_file(mode="rb")
|
||||||
|
file_bytes = file_obj.read()
|
||||||
|
file_obj.close()
|
||||||
|
|
||||||
|
# Certain document analysis features are not availiable for office filetypes (.xlsx, .pptx, .html, .docx)
|
||||||
|
if extension.lower() in [".xlsx", ".pptx", ".html", ".docx"]:
|
||||||
|
analysis_features = []
|
||||||
|
else:
|
||||||
|
analysis_features = [
|
||||||
|
DocumentAnalysisFeature.FORMULAS, # enable formula extraction
|
||||||
|
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
|
||||||
|
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
|
||||||
|
]
|
||||||
|
|
||||||
# Extract the text using Azure Document Intelligence
|
# Extract the text using Azure Document Intelligence
|
||||||
poller = self.doc_intel_client.begin_analyze_document(
|
poller = self.doc_intel_client.begin_analyze_document(
|
||||||
model_id="prebuilt-layout",
|
model_id="prebuilt-layout",
|
||||||
body=AnalyzeDocumentRequest(bytes_source=file_stream.read()),
|
body=AnalyzeDocumentRequest(bytes_source=file_bytes),
|
||||||
features=self._analysis_features(stream_info),
|
features=analysis_features,
|
||||||
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
|
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
|
||||||
)
|
)
|
||||||
result: AnalyzeResult = poller.result()
|
result: AnalyzeResult = poller.result()
|
||||||
|
|
||||||
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string
|
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string
|
||||||
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
|
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
|
||||||
return DocumentConverterResult(markdown=markdown_text)
|
return DocumentConverterResult(
|
||||||
|
title=None,
|
||||||
|
text_content=markdown_text,
|
||||||
|
)
|
||||||
|
|||||||
@@ -1,27 +1,14 @@
|
|||||||
import sys
|
from typing import Union
|
||||||
|
|
||||||
from typing import BinaryIO, Any
|
import mammoth
|
||||||
|
|
||||||
|
from ._base import (
|
||||||
|
DocumentConverterResult,
|
||||||
|
)
|
||||||
|
|
||||||
|
from ._base import DocumentConverter
|
||||||
from ._html_converter import HtmlConverter
|
from ._html_converter import HtmlConverter
|
||||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
from ._converter_input import ConverterInput
|
||||||
from .._stream_info import StreamInfo
|
|
||||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
|
||||||
|
|
||||||
# Try loading optional (but in this case, required) dependencies
|
|
||||||
# Save reporting of any exceptions for later
|
|
||||||
_dependency_exc_info = None
|
|
||||||
try:
|
|
||||||
import mammoth
|
|
||||||
except ImportError:
|
|
||||||
# Preserve the error and stack trace for later
|
|
||||||
_dependency_exc_info = sys.exc_info()
|
|
||||||
|
|
||||||
|
|
||||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
|
||||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
|
||||||
]
|
|
||||||
|
|
||||||
ACCEPTED_FILE_EXTENSIONS = [".docx"]
|
|
||||||
|
|
||||||
|
|
||||||
class DocxConverter(HtmlConverter):
|
class DocxConverter(HtmlConverter):
|
||||||
@@ -29,49 +16,25 @@ class DocxConverter(HtmlConverter):
|
|||||||
Converts DOCX files to Markdown. Style information (e.g.m headings) and tables are preserved where possible.
|
Converts DOCX files to Markdown. Style information (e.g.m headings) and tables are preserved where possible.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(
|
||||||
super().__init__()
|
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||||
self._html_converter = HtmlConverter()
|
):
|
||||||
|
super().__init__(priority=priority)
|
||||||
def accepts(
|
|
||||||
self,
|
|
||||||
file_stream: BinaryIO,
|
|
||||||
stream_info: StreamInfo,
|
|
||||||
**kwargs: Any, # Options to pass to the converter
|
|
||||||
) -> bool:
|
|
||||||
mimetype = (stream_info.mimetype or "").lower()
|
|
||||||
extension = (stream_info.extension or "").lower()
|
|
||||||
|
|
||||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
|
||||||
return True
|
|
||||||
|
|
||||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
|
||||||
if mimetype.startswith(prefix):
|
|
||||||
return True
|
|
||||||
|
|
||||||
return False
|
|
||||||
|
|
||||||
def convert(
|
def convert(
|
||||||
self,
|
self, input: ConverterInput, **kwargs
|
||||||
file_stream: BinaryIO,
|
) -> Union[None, DocumentConverterResult]:
|
||||||
stream_info: StreamInfo,
|
# Bail if not a DOCX
|
||||||
**kwargs: Any, # Options to pass to the converter
|
extension = kwargs.get("file_extension", "")
|
||||||
) -> DocumentConverterResult:
|
if extension.lower() != ".docx":
|
||||||
# Check: the dependencies
|
return None
|
||||||
if _dependency_exc_info is not None:
|
|
||||||
raise MissingDependencyException(
|
|
||||||
MISSING_DEPENDENCY_MESSAGE.format(
|
|
||||||
converter=type(self).__name__,
|
|
||||||
extension=".docx",
|
|
||||||
feature="docx",
|
|
||||||
)
|
|
||||||
) from _dependency_exc_info[
|
|
||||||
1
|
|
||||||
].with_traceback( # type: ignore[union-attr]
|
|
||||||
_dependency_exc_info[2]
|
|
||||||
)
|
|
||||||
|
|
||||||
|
result = None
|
||||||
style_map = kwargs.get("style_map", None)
|
style_map = kwargs.get("style_map", None)
|
||||||
return self._html_converter.convert_string(
|
file_obj = input.read_file(mode="rb")
|
||||||
mammoth.convert_to_html(file_stream, style_map=style_map).value, **kwargs
|
result = mammoth.convert_to_html(file_obj, style_map=style_map)
|
||||||
)
|
file_obj.close()
|
||||||
|
html_content = result.value
|
||||||
|
result = self._convert(html_content)
|
||||||
|
|
||||||
|
return result
|
||||||
|
|||||||
@@ -1,147 +0,0 @@
-import os
-import zipfile
-import xml.dom.minidom as minidom
-
-from typing import BinaryIO, Any, Dict, List
-
-from ._html_converter import HtmlConverter
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "application/epub",
-    "application/epub+zip",
-    "application/x-epub+zip",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".epub"]
-
-MIME_TYPE_MAPPING = {
-    ".html": "text/html",
-    ".xhtml": "application/xhtml+xml",
-}
-
-
-class EpubConverter(HtmlConverter):
-    """
-    Converts EPUB files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
-    """
-
-    def __init__(self):
-        super().__init__()
-        self._html_converter = HtmlConverter()
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        with zipfile.ZipFile(file_stream, "r") as z:
-            # Extract metadata (title, authors, language, publisher, date, description, cover) from the EPUB file
-
-            # Locate content.opf
-            container_dom = minidom.parse(z.open("META-INF/container.xml"))
-            opf_path = container_dom.getElementsByTagName("rootfile")[0].getAttribute(
-                "full-path"
-            )
-
-            # Parse content.opf
-            opf_dom = minidom.parse(z.open(opf_path))
-            metadata: Dict[str, Any] = {
-                "title": self._get_text_from_node(opf_dom, "dc:title"),
-                "authors": self._get_all_texts_from_nodes(opf_dom, "dc:creator"),
-                "language": self._get_text_from_node(opf_dom, "dc:language"),
-                "publisher": self._get_text_from_node(opf_dom, "dc:publisher"),
-                "date": self._get_text_from_node(opf_dom, "dc:date"),
-                "description": self._get_text_from_node(opf_dom, "dc:description"),
-                "identifier": self._get_text_from_node(opf_dom, "dc:identifier"),
-            }
-
-            # Extract manifest items (ID → href mapping)
-            manifest = {
-                item.getAttribute("id"): item.getAttribute("href")
-                for item in opf_dom.getElementsByTagName("item")
-            }
-
-            # Extract spine order (ID refs)
-            spine_items = opf_dom.getElementsByTagName("itemref")
-            spine_order = [item.getAttribute("idref") for item in spine_items]
-
-            # Convert spine order to actual file paths
-            base_path = "/".join(
-                opf_path.split("/")[:-1]
-            )  # Get base directory of content.opf
-            spine = [
-                f"{base_path}/{manifest[item_id]}" if base_path else manifest[item_id]
-                for item_id in spine_order
-                if item_id in manifest
-            ]
-
-            # Extract and convert the content
-            markdown_content: List[str] = []
-            for file in spine:
-                if file in z.namelist():
-                    with z.open(file) as f:
-                        filename = os.path.basename(file)
-                        extension = os.path.splitext(filename)[1].lower()
-                        mimetype = MIME_TYPE_MAPPING.get(extension)
-                        converted_content = self._html_converter.convert(
-                            f,
-                            StreamInfo(
-                                mimetype=mimetype,
-                                extension=extension,
-                                filename=filename,
-                            ),
-                        )
-                        markdown_content.append(converted_content.markdown.strip())
-
-            # Format and add the metadata
-            metadata_markdown = []
-            for key, value in metadata.items():
-                if isinstance(value, list):
-                    value = ", ".join(value)
-                if value:
-                    metadata_markdown.append(f"**{key.capitalize()}:** {value}")
-
-            markdown_content.insert(0, "\n".join(metadata_markdown))
-
-            return DocumentConverterResult(
-                markdown="\n\n".join(markdown_content), title=metadata["title"]
-            )
-
-    def _get_text_from_node(self, dom: minidom.Document, tag_name: str) -> str | None:
-        """Convenience function to extract a single occurrence of a tag (e.g., title)."""
-        texts = self._get_all_texts_from_nodes(dom, tag_name)
-        if len(texts) > 0:
-            return texts[0]
-        else:
-            return None
-
-    def _get_all_texts_from_nodes(
-        self, dom: minidom.Document, tag_name: str
-    ) -> List[str]:
-        """Helper function to extract all occurrences of a tag (e.g., multiple authors)."""
-        texts: List[str] = []
-        for node in dom.getElementsByTagName(tag_name):
-            if node.firstChild and hasattr(node.firstChild, "nodeValue"):
-                texts.append(node.firstChild.nodeValue.strip())
-        return texts
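The deleted `EpubConverter` walks the standard EPUB layout: `META-INF/container.xml` names the OPF package file, the OPF's manifest maps item IDs to hrefs, and the spine gives reading order. That lookup chain is easy to reproduce standalone; a minimal sketch, where `book.epub` is a hypothetical well-formed EPUB:

```python
import zipfile
import xml.dom.minidom as minidom

with zipfile.ZipFile("book.epub") as z:
    # container.xml points at the OPF package file
    container = minidom.parse(z.open("META-INF/container.xml"))
    opf_path = container.getElementsByTagName("rootfile")[0].getAttribute("full-path")

    # The OPF holds Dublin Core metadata, the manifest, and the spine
    opf = minidom.parse(z.open(opf_path))
    title_nodes = opf.getElementsByTagName("dc:title")
    print(title_nodes[0].firstChild.nodeValue if title_nodes else "(untitled)")
```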
@@ -1,34 +0,0 @@
-import json
-import subprocess
-import locale
-import sys
-import shutil
-import os
-import warnings
-from typing import BinaryIO, Any, Union
-
-
-def exiftool_metadata(
-    file_stream: BinaryIO,
-    *,
-    exiftool_path: Union[str, None],
-) -> Any:  # Need a better type for json data
-    # Nothing to do
-    if not exiftool_path:
-        return {}
-
-    # Run exiftool
-    cur_pos = file_stream.tell()
-    try:
-        output = subprocess.run(
-            [exiftool_path, "-json", "-"],
-            input=file_stream.read(),
-            capture_output=True,
-            text=False,
-        ).stdout
-
-        return json.loads(
-            output.decode(locale.getpreferredencoding(False)),
-        )[0]
-    finally:
-        file_stream.seek(cur_pos)
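The removed helper pipes the stream's bytes to `exiftool` over stdin (the `-` file argument) and rewinds the stream afterwards so later converters can re-read it. The same pattern works standalone; a sketch, assuming `exiftool` is installed and `photo.jpg` is a stand-in path:

```python
import json
import shutil
import subprocess

exiftool = shutil.which("exiftool")
with open("photo.jpg", "rb") as fh:
    if exiftool:
        # "-" tells exiftool to read the image from stdin
        proc = subprocess.run(
            [exiftool, "-json", "-"], input=fh.read(), capture_output=True
        )
        metadata = json.loads(proc.stdout.decode("utf-8", errors="replace"))[0]
        print(metadata.get("MIMEType"))
```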
@@ -1,52 +1,39 @@
-import io
-from typing import Any, BinaryIO, Optional
+from typing import Any, Union

 from bs4 import BeautifulSoup

-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
+from ._base import DocumentConverter, DocumentConverterResult
 from ._markdownify import _CustomMarkdownify
+from ._converter_input import ConverterInput

-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/html",
-    "application/xhtml",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".html",
-    ".htm",
-]
-

 class HtmlConverter(DocumentConverter):
     """Anything with content type text/html"""

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Parse the stream
-        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not html
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".html", ".htm"]:
+            return None
+
+        result = None
+        file_obj = input.read_file(mode="rt", encoding="utf-8")
+        result = self._convert(file_obj.read())
+        file_obj.close()
+
+        return result
+
+    def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
+        """Helper function that converts an HTML string."""
+
+        # Parse the string
+        soup = BeautifulSoup(html_content, "html.parser")

         # Remove javascript and style blocks
         for script in soup(["script", "style"]):
@@ -56,9 +43,9 @@ class HtmlConverter(DocumentConverter):
         body_elm = soup.find("body")
         webpage_text = ""
         if body_elm:
-            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(body_elm)
+            webpage_text = _CustomMarkdownify().convert_soup(body_elm)
         else:
-            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)
+            webpage_text = _CustomMarkdownify().convert_soup(soup)

         assert isinstance(webpage_text, str)

@@ -66,25 +53,6 @@ class HtmlConverter(DocumentConverter):
         webpage_text = webpage_text.strip()

         return DocumentConverterResult(
-            markdown=webpage_text,
             title=None if soup.title is None else soup.title.string,
-        )
-
-    def convert_string(
-        self, html_content: str, *, url: Optional[str] = None, **kwargs
-    ) -> DocumentConverterResult:
-        """
-        Non-standard convenience method to convert a string to markdown.
-        Given that many converters produce HTML as intermediate output, this
-        allows for easy conversion of HTML to markdown.
-        """
-        return self.convert(
-            file_stream=io.BytesIO(html_content.encode("utf-8")),
-            stream_info=StreamInfo(
-                mimetype="text/html",
-                extension=".html",
-                charset="utf-8",
-                url=url,
-            ),
-            **kwargs,
+            text_content=webpage_text,
         )
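The removed `convert_string()` helper is what other v0.1.0 converters in this compare (DOCX above, PPTX below) lean on to turn intermediate HTML into Markdown. A short sketch of the call, assuming markitdown 0.1.x is installed and exports the converter under the package layout shown here:

```python
from markitdown.converters import HtmlConverter  # import path per the v0.1.0 layout

html = "<html><head><title>Hi</title></head><body><h1>Hello</h1></body></html>"
result = HtmlConverter().convert_string(html)
print(result.title)     # "Hi"
print(result.markdown)  # "# Hello"
```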
@@ -1,53 +1,32 @@
-from typing import BinaryIO, Any, Union
-import base64
-import mimetypes
-from ._exiftool import exiftool_metadata
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "image/jpeg",
-    "image/png",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".jpg", ".jpeg", ".png"]
+from typing import Union
+from ._base import DocumentConverter, DocumentConverterResult
+from ._media_converter import MediaConverter
+from ._converter_input import ConverterInput


-class ImageConverter(DocumentConverter):
+class ImageConverter(MediaConverter):
     """
-    Converts images to markdown via extraction of metadata (if `exiftool` is installed), and description via a multimodal LLM (if an llm_client is configured).
+    Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured).
     """

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not an image
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".jpg", ".jpeg", ".png"]:
+            return None

         md_content = ""

-        # Add metadata
-        metadata = exiftool_metadata(
-            file_stream, exiftool_path=kwargs.get("exiftool_path")
-        )
+        # Add metadata if a local path is provided
+        if input.input_type == "filepath":
+            metadata = self._get_metadata(input.filepath, kwargs.get("exiftool_path"))

         if metadata:
             for f in [
@@ -65,59 +44,42 @@ class ImageConverter(DocumentConverter):
             if f in metadata:
                 md_content += f"{f}: {metadata[f]}\n"

-        # Try describing the image with GPT
+        # Try describing the image with GPTV
         llm_client = kwargs.get("llm_client")
         llm_model = kwargs.get("llm_model")
         if llm_client is not None and llm_model is not None:
-            llm_description = self._get_llm_description(
-                file_stream,
-                stream_info,
-                client=llm_client,
-                model=llm_model,
-                prompt=kwargs.get("llm_prompt"),
+            md_content += (
+                "\n# Description:\n"
+                + self._get_llm_description(
+                    input,
+                    extension,
+                    llm_client,
+                    llm_model,
+                    prompt=kwargs.get("llm_prompt"),
+                ).strip()
+                + "\n"
             )

-            if llm_description is not None:
-                md_content += "\n# Description:\n" + llm_description.strip() + "\n"
-
         return DocumentConverterResult(
-            markdown=md_content,
+            title=None,
+            text_content=md_content,
         )

     def _get_llm_description(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        *,
-        client,
-        model,
-        prompt=None,
-    ) -> Union[None, str]:
+        self, input: ConverterInput, extension, client, model, prompt=None
+    ):
         if prompt is None or prompt.strip() == "":
             prompt = "Write a detailed caption for this image."

-        # Get the content type
-        content_type = stream_info.mimetype
-        if not content_type:
-            content_type, _ = mimetypes.guess_type(
-                "_dummy" + (stream_info.extension or "")
-            )
-        if not content_type:
-            content_type = "application/octet-stream"
-
-        # Convert to base64
-        cur_pos = file_stream.tell()
-        try:
-            base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
-        except Exception as e:
-            return None
-        finally:
-            file_stream.seek(cur_pos)
-
-        # Prepare the data-uri
-        data_uri = f"data:{content_type};base64,{base64_image}"
-
-        # Prepare the OpenAI API request
+        data_uri = ""
+        content_type, encoding = mimetypes.guess_type("_dummy" + extension)
+        if content_type is None:
+            content_type = "image/jpeg"
+        image_file = input.read_file(mode="rb")
+        image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
+        image_file.close()
+        data_uri = f"data:{content_type};base64,{image_base64}"
+
         messages = [
             {
                 "role": "user",
@@ -133,6 +95,5 @@ class ImageConverter(DocumentConverter):
             }
         ]

-        # Call the OpenAI API
         response = client.chat.completions.create(model=model, messages=messages)
         return response.choices[0].message.content
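On both sides of this diff, image description is driven by an `llm_client`/`llm_model` pair threaded through kwargs, and both build the same chat-completions payload with a base64 data URI. A hedged configuration sketch with the OpenAI client (the model name and file path are placeholders, and `llm_prompt` is simply forwarded through kwargs):

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
md = MarkItDown(llm_client=client, llm_model="gpt-4o")  # model name illustrative

result = md.convert("photo.jpg", llm_prompt="Write a detailed caption for this image.")
print(result.text_content)
```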
@@ -1,62 +1,41 @@
-from typing import BinaryIO, Any
 import json
+from typing import Any, Union
+
+from ._base import (
+    DocumentConverter,
+    DocumentConverterResult,
+)

-from .._base_converter import DocumentConverter, DocumentConverterResult
 from .._exceptions import FileConversionException
-from .._stream_info import StreamInfo
-
-CANDIDATE_MIME_TYPE_PREFIXES = [
-    "application/json",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".ipynb"]
+from ._converter_input import ConverterInput


 class IpynbConverter(DocumentConverter):
     """Converts Jupyter Notebook (.ipynb) files to Markdown."""

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                # Read further to see if it's a notebook
-                cur_pos = file_stream.tell()
-                try:
-                    encoding = stream_info.charset or "utf-8"
-                    notebook_content = file_stream.read().decode(encoding)
-                    return (
-                        "nbformat" in notebook_content
-                        and "nbformat_minor" in notebook_content
-                    )
-                finally:
-                    file_stream.seek(cur_pos)
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not ipynb
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".ipynb":
+            return None

         # Parse and convert the notebook
         result = None
-        encoding = stream_info.charset or "utf-8"
-        notebook_content = file_stream.read().decode(encoding=encoding)
-        return self._convert(json.loads(notebook_content))
+        file_obj = input.read_file(mode="rt", encoding="utf-8")
+        notebook_content = json.load(file_obj)
+        file_obj.close()
+        result = self._convert(notebook_content)
+
+        return result

-    def _convert(self, notebook_content: dict) -> DocumentConverterResult:
+    def _convert(self, notebook_content: dict) -> Union[None, DocumentConverterResult]:
         """Helper function that converts notebook JSON content to Markdown."""
         try:
             md_output = []
@@ -88,8 +67,8 @@ class IpynbConverter(DocumentConverter):
             title = notebook_content.get("metadata", {}).get("title", title)

             return DocumentConverterResult(
-                markdown=md_text,
                 title=title,
+                text_content=md_text,
             )

         except Exception as e:
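The `_convert()` body is elided by this hunk, but the nbformat JSON it walks is simple: a `cells` list where each cell carries a `cell_type` and a `source`. A rough sketch of that walk under those assumptions (`example.ipynb` is a stand-in path; the real method's cell handling may differ):

```python
import json

with open("example.ipynb", "rt", encoding="utf-8") as fh:
    nb = json.load(fh)

md_output = []
fence = "`" * 3  # avoid a literal code fence inside this snippet
for cell in nb.get("cells", []):
    source = "".join(cell.get("source", []))
    if cell.get("cell_type") == "markdown":
        md_output.append(source)
    elif cell.get("cell_type") == "code":
        md_output.append(f"{fence}python\n{source}\n{fence}")

print("\n\n".join(md_output))
```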
@@ -1,50 +0,0 @@
-from typing import BinaryIO, Any, Union
-import base64
-import mimetypes
-from .._stream_info import StreamInfo
-
-
-def llm_caption(
-    file_stream: BinaryIO, stream_info: StreamInfo, *, client, model, prompt=None
-) -> Union[None, str]:
-    if prompt is None or prompt.strip() == "":
-        prompt = "Write a detailed caption for this image."
-
-    # Get the content type
-    content_type = stream_info.mimetype
-    if not content_type:
-        content_type, _ = mimetypes.guess_type("_dummy" + (stream_info.extension or ""))
-    if not content_type:
-        content_type = "application/octet-stream"
-
-    # Convert to base64
-    cur_pos = file_stream.tell()
-    try:
-        base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
-    except Exception as e:
-        return None
-    finally:
-        file_stream.seek(cur_pos)
-
-    # Prepare the data-uri
-    data_uri = f"data:{content_type};base64,{base64_image}"
-
-    # Prepare the OpenAI API request
-    messages = [
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": prompt},
-                {
-                    "type": "image_url",
-                    "image_url": {
-                        "url": data_uri,
-                    },
-                },
-            ],
-        }
-    ]
-
-    # Call the OpenAI API
-    response = client.chat.completions.create(model=model, messages=messages)
-    return response.choices[0].message.content
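`llm_caption` (like `exiftool_metadata` above) follows the same stream discipline: record the position, consume the stream, and restore the position in a `finally` block, since v0.1.0 converters share one stream and create no temporary files. The pattern in isolation:

```python
import io

def peek_bytes(stream: io.BufferedIOBase, n: int = 16) -> bytes:
    """Read up to n bytes, then restore the caller's stream position."""
    cur_pos = stream.tell()
    try:
        return stream.read(n)
    finally:
        stream.seek(cur_pos)

buf = io.BytesIO(b"%PDF-1.7 ...")
print(peek_bytes(buf, 4))  # b'%PDF'
print(buf.tell())          # 0 -- position unchanged
```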
@@ -1,7 +1,7 @@
 import re
 import markdownify

-from typing import Any, Optional
+from typing import Any
 from urllib.parse import quote, unquote, urlparse, urlunparse


@@ -17,18 +17,10 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):

     def __init__(self, **options: Any):
         options["heading_style"] = options.get("heading_style", markdownify.ATX)
-        options["keep_data_uris"] = options.get("keep_data_uris", False)
         # Explicitly cast options to the expected type if necessary
         super().__init__(**options)

-    def convert_hn(
-        self,
-        n: int,
-        el: Any,
-        text: str,
-        convert_as_inline: Optional[bool] = False,
-        **kwargs,
-    ) -> str:
+    def convert_hn(self, n: int, el: Any, text: str, convert_as_inline: bool) -> str:
         """Same as usual, but be sure to start with a new line"""
         if not convert_as_inline:
             if not re.search(r"^\n", text):
@@ -36,13 +28,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):

         return super().convert_hn(n, el, text, convert_as_inline)  # type: ignore

-    def convert_a(
-        self,
-        el: Any,
-        text: str,
-        convert_as_inline: Optional[bool] = False,
-        **kwargs,
-    ):
+    def convert_a(self, el: Any, text: str, convert_as_inline: bool):
         """Same as usual converter, but removes Javascript links and escapes URIs."""
         prefix, suffix, text = markdownify.chomp(text)  # type: ignore
         if not text:
@@ -82,13 +68,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
             else text
         )

-    def convert_img(
-        self,
-        el: Any,
-        text: str,
-        convert_as_inline: Optional[bool] = False,
-        **kwargs,
-    ) -> str:
+    def convert_img(self, el: Any, text: str, convert_as_inline: bool) -> str:
         """Same as usual converter, but removes data URIs"""

         alt = el.attrs.get("alt", None) or ""
@@ -102,7 +82,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
             return alt

         # Remove dataURIs
-        if src.startswith("data:") and not self.options["keep_data_uris"]:
+        if src.startswith("data:"):
             src = src.split(",")[0] + "..."

         return "![%s](%s%s)" % (alt, src, title_part)
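v0.1.0's `keep_data_uris` option (absent on the branch side of this hunk) controls whether inline `data:` image sources survive conversion or are truncated to a stub. The truncation rule by itself:

```python
src = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."
keep_data_uris = False  # v0.1.0 default, per the diff above

if src.startswith("data:") and not keep_data_uris:
    src = src.split(",")[0] + "..."

print(src)  # data:image/png;base64...
```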
@@ -0,0 +1,41 @@
+import subprocess
+import shutil
+import json
+from warnings import warn
+
+from ._base import DocumentConverter
+
+
+class MediaConverter(DocumentConverter):
+    """
+    Abstract class for multi-modal media (e.g., images and audio)
+    """
+
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)
+
+    def _get_metadata(self, local_path, exiftool_path=None):
+        if not exiftool_path:
+            which_exiftool = shutil.which("exiftool")
+            if which_exiftool:
+                warn(
+                    f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
+
+    md = MarkItDown(exiftool_path="{which_exiftool}")
+
+This warning will be removed in future releases.
+""",
+                    DeprecationWarning,
+                )
+
+            return None
+        else:
+            try:
+                result = subprocess.run(
+                    [exiftool_path, "-json", local_path], capture_output=True, text=True
+                ).stdout
+                return json.loads(result)[0]
+            except Exception:
+                return None
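Note the behavior this new file introduces: `exiftool` is no longer discovered implicitly, so callers must opt in. Assuming the constructor parameter works as the deprecation text describes, usage looks like:

```python
import shutil
from markitdown import MarkItDown

# Explicitly opt in to exiftool metadata extraction (path discovery is now the
# caller's responsibility; shutil.which may return None if it is not installed).
md = MarkItDown(exiftool_path=shutil.which("exiftool"))
result = md.convert("photo.jpg")  # "photo.jpg" is a stand-in path
print(result.text_content)
```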
@@ -0,0 +1,98 @@
+import tempfile
+import os
+from typing import Union
+from ._base import DocumentConverter, DocumentConverterResult
+from ._wav_converter import WavConverter
+from warnings import resetwarnings, catch_warnings
+from ._converter_input import ConverterInput
+
+# Optional Transcription support
+IS_AUDIO_TRANSCRIPTION_CAPABLE = False
+try:
+    # Using warnings' catch_warnings to catch
+    # pydub's warning of ffmpeg or avconv missing
+    with catch_warnings(record=True) as w:
+        import pydub
+
+        if w:
+            raise ModuleNotFoundError
+    import speech_recognition as sr
+
+    IS_AUDIO_TRANSCRIPTION_CAPABLE = True
+except ModuleNotFoundError:
+    pass
+finally:
+    resetwarnings()
+
+
+class Mp3Converter(WavConverter):
+    """
+    Converts MP3 files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` AND `pydub` are installed).
+    """
+
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)
+
+    def convert(
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not an MP3
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".mp3":
+            return None
+
+        # Bail if a local path was not provided
+        if input.input_type != "filepath":
+            return None
+        local_path = input.filepath
+
+        md_content = ""
+
+        # Add metadata
+        metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
+        if metadata:
+            for f in [
+                "Title",
+                "Artist",
+                "Author",
+                "Band",
+                "Album",
+                "Genre",
+                "Track",
+                "DateTimeOriginal",
+                "CreateDate",
+                "Duration",
+            ]:
+                if f in metadata:
+                    md_content += f"{f}: {metadata[f]}\n"
+
+        # Transcribe
+        if IS_AUDIO_TRANSCRIPTION_CAPABLE:
+            handle, temp_path = tempfile.mkstemp(suffix=".wav")
+            os.close(handle)
+            try:
+                sound = pydub.AudioSegment.from_mp3(local_path)
+                sound.export(temp_path, format="wav")
+
+                _args = dict()
+                _args.update(kwargs)
+                _args["file_extension"] = ".wav"
+
+                try:
+                    transcript = super()._transcribe_audio(temp_path).strip()
+                    md_content += "\n\n### Audio Transcript:\n" + (
+                        "[No speech detected]" if transcript == "" else transcript
+                    )
+                except Exception:
+                    md_content += "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
+
+            finally:
+                os.unlink(temp_path)
+
+        # Return the result
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )
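`_transcribe_audio` itself lives in `WavConverter`, which this compare does not show; a typical implementation with the `speech_recognition` package looks like the following sketch (using the library's default Google recognizer, which is rate-limited and may raise if no speech is found):

```python
import speech_recognition as sr

def transcribe_wav(path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)
    # recognize_google uses the library's built-in (rate-limited) API key
    return recognizer.recognize_google(audio).strip()

# print(transcribe_wav("speech.wav"))  # "speech.wav" is a stand-in path
```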
@@ -1,24 +1,7 @@
-import sys
-from typing import Any, Union, BinaryIO
-from .._stream_info import StreamInfo
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
-
-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-olefile = None
-try:
-    import olefile  # type: ignore[no-redef]
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "application/vnd.ms-outlook",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".msg"]
+import olefile
+from typing import Any, Union
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput


 class OutlookMsgConverter(DocumentConverter):
@@ -29,108 +12,61 @@ class OutlookMsgConverter(DocumentConverter):
     - Email body content
     """

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        # Check the extension and mimetype
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        # Brute force, check if we have an OLE file
-        cur_pos = file_stream.tell()
-        try:
-            if olefile and not olefile.isOleFile(file_stream):
-                return False
-        finally:
-            file_stream.seek(cur_pos)
-
-        # Brute force, check if it's an Outlook file
-        try:
-            if olefile is not None:
-                msg = olefile.OleFileIO(file_stream)
-                toc = "\n".join([str(stream) for stream in msg.listdir()])
-                return (
-                    "__properties_version1.0" in toc
-                    and "__recip_version1.0_#00000000" in toc
-                )
-        except Exception as e:
-            pass
-        finally:
-            file_stream.seek(cur_pos)
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Check the dependencies
-        if _dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".msg",
-                    feature="outlook",
-                )
-            ) from _dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _dependency_exc_info[2]
-            )
-
-        assert (
-            olefile is not None
-        )  # If we made it this far, olefile should be available
-        msg = olefile.OleFileIO(file_stream)
-
-        # Extract email metadata
-        md_content = "# Email Message\n\n"
-
-        # Get headers
-        headers = {
-            "From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
-            "To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
-            "Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
-        }
-
-        # Add headers to markdown
-        for key, value in headers.items():
-            if value:
-                md_content += f"**{key}:** {value}\n"
-
-        md_content += "\n## Content\n\n"
-
-        # Get email body
-        body = self._get_stream_data(msg, "__substg1.0_1000001F")
-        if body:
-            md_content += body
-
-        msg.close()
-
-        return DocumentConverterResult(
-            markdown=md_content.strip(),
-            title=headers.get("Subject"),
-        )
-
-    def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]:
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a MSG file
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".msg":
+            return None
+
+        try:
+            file_obj = input.read_file(mode="rb")
+            msg = olefile.OleFileIO(file_obj)
+
+            # Extract email metadata
+            md_content = "# Email Message\n\n"
+
+            # Get headers
+            headers = {
+                "From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
+                "To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
+                "Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
+            }
+
+            # Add headers to markdown
+            for key, value in headers.items():
+                if value:
+                    md_content += f"**{key}:** {value}\n"
+
+            md_content += "\n## Content\n\n"
+
+            # Get email body
+            body = self._get_stream_data(msg, "__substg1.0_1000001F")
+            if body:
+                md_content += body
+
+            msg.close()
+            file_obj.close()
+
+            return DocumentConverterResult(
+                title=headers.get("Subject"), text_content=md_content.strip()
+            )
+
+        except Exception as e:
+            raise FileConversionException(
+                f"Could not convert MSG file '{input.filepath}': {str(e)}"
+            )
+
+    def _get_stream_data(
+        self, msg: olefile.OleFileIO, stream_path: str
+    ) -> Union[str, None]:
         """Helper to safely extract and decode stream data from the MSG file."""
-        assert olefile is not None
-        assert isinstance(
-            msg, olefile.OleFileIO
-        )  # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)
-
         try:
             if msg.exists(stream_path):
                 data = msg.openstream(stream_path).read()
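The `__substg1.0_XXXXYYYY` stream names used by both sides are MAPI property streams: the first four hex digits are the property ID and the last four the type (`001F` is a UTF-16 string). To the best of my knowledge the IDs above map to the standard MAPI properties below; a small lookup sketch:

```python
# MAPI property streams referenced by the converter above.
MAPI_STREAMS = {
    "__substg1.0_0C1F001F": "PR_SENDER_EMAIL_ADDRESS",  # "From"
    "__substg1.0_0E04001F": "PR_DISPLAY_TO",            # "To"
    "__substg1.0_0037001F": "PR_SUBJECT",               # "Subject"
    "__substg1.0_1000001F": "PR_BODY",                  # plain-text body
}
for stream, prop in MAPI_STREAMS.items():
    print(f"{stream} -> {prop}")
```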
@@ -1,32 +1,9 @@
-import sys
-import io
-
-from typing import BinaryIO, Any
-
-from ._html_converter import HtmlConverter
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
-
-
-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-try:
-    import pdfminer
-    import pdfminer.high_level
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "application/pdf",
-    "application/x-pdf",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".pdf"]
+import pdfminer
+import pdfminer.high_level
+from typing import Union
+from io import StringIO
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput


 class PdfConverter(DocumentConverter):
@@ -34,45 +11,25 @@ class PdfConverter(DocumentConverter):
     Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
     """

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Check the dependencies
-        if _dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".pdf",
-                    feature="pdf",
-                )
-            ) from _dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _dependency_exc_info[2]
-            )
-
-        assert isinstance(file_stream, io.IOBase)  # for mypy
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a PDF
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".pdf":
+            return None
+
+        output = StringIO()
+        file_obj = input.read_file(mode="rb")
+        pdfminer.high_level.extract_text_to_fp(file_obj, output)
+        file_obj.close()
+
         return DocumentConverterResult(
-            markdown=pdfminer.high_level.extract_text(file_stream),
+            title=None,
+            text_content=output.getvalue(),
         )
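Both sides delegate to pdfminer.six: v0.1.0 calls `extract_text()`, which returns a string, while the branch calls `extract_text_to_fp()`, which writes into a file-like object. Both are real pdfminer.six APIs; a side-by-side sketch with a stand-in path:

```python
from io import StringIO
import pdfminer.high_level

with open("paper.pdf", "rb") as fh:
    text_a = pdfminer.high_level.extract_text(fh)  # v0.1.0 style

    fh.seek(0)
    buf = StringIO()
    pdfminer.high_level.extract_text_to_fp(fh, buf)  # branch style
    text_b = buf.getvalue()

print(text_a[:80])
print(text_b[:80])
```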
@@ -1,71 +1,43 @@
-import sys
+import mimetypes

-from typing import BinaryIO, Any
-from charset_normalizer import from_bytes
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
+from charset_normalizer import from_path, from_bytes
+from typing import Any, Union

-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-try:
-    import mammoth
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/",
-    "application/json",
-    "application/markdown",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".txt",
-    ".text",
-    ".md",
-    ".markdown",
-    ".json",
-    ".jsonl",
-]
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput


 class PlainTextConverter(DocumentConverter):
     """Anything with content type text/plain"""

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        # If we have a charset, we can safely assume it's text
-        # With Magika in the earlier stages, this handles most cases
-        if stream_info.charset is not None:
-            return True
-
-        # Otherwise, check the mimetype and extension
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        if stream_info.charset:
-            text_content = file_stream.read().decode(stream_info.charset)
-        else:
-            text_content = str(from_bytes(file_stream.read()).best())
-
-        return DocumentConverterResult(markdown=text_content)
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Read file object from input
+        file_obj = input.read_file(mode="rb")
+
+        # Guess the content type from any file extension that might be around
+        content_type, _ = mimetypes.guess_type(
+            "__placeholder" + kwargs.get("file_extension", "")
+        )
+
+        # Only accept text files
+        if content_type is None:
+            return None
+        elif all(
+            not content_type.lower().startswith(type_prefix)
+            for type_prefix in ["text/", "application/json"]
+        ):
+            return None
+
+        text_content = str(from_bytes(file_obj.read()).best())
+        file_obj.close()
+        return DocumentConverterResult(
+            title=None,
+            text_content=text_content,
+        )
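When no charset is known, both versions fall back to charset-normalizer's detection; `from_bytes(...).best()` returns the most plausible decoding. A quick sketch:

```python
from charset_normalizer import from_bytes

raw = "Déjà vu".encode("cp1252")
best = from_bytes(raw).best()
print(str(best))      # "Déjà vu"
print(best.encoding)  # detected encoding, e.g. "cp1252"
```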
@@ -1,86 +1,68 @@
-import sys
 import base64
-import os
-import io
+import pptx
 import re
 import html

-from typing import BinaryIO, Any
-from operator import attrgetter
+from typing import Union

+from ._base import DocumentConverterResult, DocumentConverter
 from ._html_converter import HtmlConverter
-from ._llm_caption import llm_caption
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
-
-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-try:
-    import pptx
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "application/vnd.openxmlformats-officedocument.presentationml",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".pptx"]
-
-
-class PptxConverter(DocumentConverter):
+from ._converter_input import ConverterInput
+
+
+class PptxConverter(HtmlConverter):
     """
     Converts PPTX files to Markdown. Supports heading, tables and images with alt text.
     """

-    def __init__(self):
-        super().__init__()
-        self._html_converter = HtmlConverter()
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)
+
+    def _get_llm_description(
+        self, llm_client, llm_model, image_blob, content_type, prompt=None
+    ):
+        if prompt is None or prompt.strip() == "":
+            prompt = "Write a detailed alt text for this image with less than 50 words."
+
+        image_base64 = base64.b64encode(image_blob).decode("utf-8")
+        data_uri = f"data:{content_type};base64,{image_base64}"
+
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "image_url",
+                        "image_url": {
+                            "url": data_uri,
+                        },
+                    },
+                    {"type": "text", "text": prompt},
+                ],
+            }
+        ]
+
+        response = llm_client.chat.completions.create(
+            model=llm_model, messages=messages
+        )
+        return response.choices[0].message.content

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Check the dependencies
-        if _dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".pptx",
-                    feature="pptx",
-                )
-            ) from _dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _dependency_exc_info[2]
-            )
-
-        # Perform the conversion
-        presentation = pptx.Presentation(file_stream)
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a PPTX
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".pptx":
+            return None
+
         md_content = ""

+        file_obj = input.read_file(mode="rb")
+        presentation = pptx.Presentation(file_obj)
+        file_obj.close()
+
         slide_num = 0
         for slide in presentation.slides:
             slide_num += 1
@@ -88,72 +70,64 @@ class PptxConverter(DocumentConverter):
             md_content += f"\n\n<!-- Slide number: {slide_num} -->\n"

             title = slide.shapes.title
-
-            def get_shape_content(shape, **kwargs):
-                nonlocal md_content
+            for shape in slide.shapes:
                 # Pictures
                 if self._is_picture(shape):
                     # https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069

-                    llm_description = ""
-                    alt_text = ""
+                    llm_description = None
+                    alt_text = None

-                    # Potentially generate a description using an LLM
                     llm_client = kwargs.get("llm_client")
                     llm_model = kwargs.get("llm_model")
                     if llm_client is not None and llm_model is not None:
-                        # Prepare a file_stream and stream_info for the image data
-                        image_filename = shape.image.filename
-                        image_extension = None
-                        if image_filename:
-                            image_extension = os.path.splitext(image_filename)[1]
-                        image_stream_info = StreamInfo(
-                            mimetype=shape.image.content_type,
-                            extension=image_extension,
-                            filename=image_filename,
-                        )
-
-                        image_stream = io.BytesIO(shape.image.blob)
-
-                        # Caption the image
                         try:
-                            llm_description = llm_caption(
-                                image_stream,
-                                image_stream_info,
-                                client=llm_client,
-                                model=llm_model,
-                                prompt=kwargs.get("llm_prompt"),
+                            llm_description = self._get_llm_description(
+                                llm_client,
+                                llm_model,
+                                shape.image.blob,
+                                shape.image.content_type,
                             )
                         except Exception:
-                            # Unable to generate a description
+                            # Unable to describe with LLM
                             pass

-                    # Also grab any description embedded in the deck
-                    try:
-                        alt_text = shape._element._nvXxPr.cNvPr.attrib.get("descr", "")
-                    except Exception:
-                        # Unable to get alt text
-                        pass
+                    if not llm_description:
+                        try:
+                            alt_text = shape._element._nvXxPr.cNvPr.attrib.get(
+                                "descr", ""
+                            )
+                        except Exception:
+                            # Unable to get alt text
+                            pass

-                    # Prepare the alt, escaping any special characters
-                    alt_text = "\n".join([llm_description, alt_text]) or shape.name
-                    alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
-                    alt_text = re.sub(r"\s+", " ", alt_text).strip()
-
-                    # If keep_data_uris is True, use base64 encoding of images
-                    if kwargs.get("keep_data_uris", False):
-                        blob = shape.image.blob
-                        content_type = shape.image.content_type or "image/png"
-                        b64_string = base64.b64encode(blob).decode("utf-8")
-                        md_content += f"\n![{alt_text}](data:{content_type};base64,{b64_string})\n"
-                    else:
-                        # A placeholder name
-                        filename = re.sub(r"\W", "", shape.name) + ".jpg"
-                        md_content += f"\n![{alt_text}]({filename})\n"
+                    # A placeholder name
+                    filename = re.sub(r"\W", "", shape.name) + ".jpg"
+                    md_content += (
+                        "\n!["
+                        + (alt_text if alt_text else shape.name)
+                        + "]("
+                        + filename
+                        + ")\n"
+                    )

                 # Tables
                 if self._is_table(shape):
-                    md_content += self._convert_table_to_markdown(shape.table, **kwargs)
+                    html_table = "<html><body><table>"
+                    first_row = True
+                    for row in shape.table.rows:
+                        html_table += "<tr>"
+                        for cell in row.cells:
+                            if first_row:
+                                html_table += "<th>" + html.escape(cell.text) + "</th>"
+                            else:
+                                html_table += "<td>" + html.escape(cell.text) + "</td>"
+                        html_table += "</tr>"
+                        first_row = False
+                    html_table += "</table></body></html>"
+                    md_content += (
+                        "\n" + self._convert(html_table).text_content.strip() + "\n"
+                    )

                 # Charts
                 if shape.has_chart:
@@ -166,16 +140,6 @@ class PptxConverter(DocumentConverter):
                 else:
                     md_content += shape.text + "\n"

-            # Group Shapes
-            if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
-                sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
-                for subshape in sorted_shapes:
-                    get_shape_content(subshape, **kwargs)
-
-            sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
-            for shape in sorted_shapes:
-                get_shape_content(shape, **kwargs)
-
             md_content = md_content.strip()

             if slide.has_notes_slide:
@@ -185,7 +149,10 @@ class PptxConverter(DocumentConverter):
                 md_content += notes_frame.text
             md_content = md_content.strip()

-        return DocumentConverterResult(markdown=md_content.strip())
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )

     def _is_picture(self, shape):
         if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE:
@@ -200,53 +167,25 @@ class PptxConverter(DocumentConverter):
                 return True
         return False

-    def _convert_table_to_markdown(self, table, **kwargs):
-        # Write the table as HTML, then convert it to Markdown
-        html_table = "<html><body><table>"
-        first_row = True
-        for row in table.rows:
-            html_table += "<tr>"
-            for cell in row.cells:
-                if first_row:
-                    html_table += "<th>" + html.escape(cell.text) + "</th>"
-                else:
-                    html_table += "<td>" + html.escape(cell.text) + "</td>"
-            html_table += "</tr>"
-            first_row = False
-        html_table += "</table></body></html>"
-
-        return (
-            self._html_converter.convert_string(html_table, **kwargs).markdown.strip()
-            + "\n"
-        )
-
     def _convert_chart_to_markdown(self, chart):
-        try:
-            md = "\n\n### Chart"
-            if chart.has_title:
-                md += f": {chart.chart_title.text_frame.text}"
-            md += "\n\n"
-            data = []
-            category_names = [c.label for c in chart.plots[0].categories]
-            series_names = [s.name for s in chart.series]
-            data.append(["Category"] + series_names)
+        md = "\n\n### Chart"
+        if chart.has_title:
+            md += f": {chart.chart_title.text_frame.text}"
+        md += "\n\n"
+        data = []
+        category_names = [c.label for c in chart.plots[0].categories]
+        series_names = [s.name for s in chart.series]
+        data.append(["Category"] + series_names)

-            for idx, category in enumerate(category_names):
-                row = [category]
-                for series in chart.series:
-                    row.append(series.values[idx])
-                data.append(row)
-
-            markdown_table = []
-            for row in data:
-                markdown_table.append("| " + " | ".join(map(str, row)) + " |")
-            header = markdown_table[0]
-            separator = "|" + "|".join(["---"] * len(data[0])) + "|"
-            return md + "\n".join([header, separator] + markdown_table[1:])
-        except ValueError as e:
-            # Handle the specific error for unsupported chart types
-            if "unsupported plot type" in str(e):
-                return "\n\n[unsupported chart]\n\n"
-        except Exception:
-            # Catch any other exceptions that might occur
-            return "\n\n[unsupported chart]\n\n"
+        for idx, category in enumerate(category_names):
+            row = [category]
+            for series in chart.series:
+                row.append(series.values[idx])
+            data.append(row)
+
+        markdown_table = []
+        for row in data:
+            markdown_table.append("| " + " | ".join(map(str, row)) + " |")
+        header = markdown_table[0]
+        separator = "|" + "|".join(["---"] * len(data[0])) + "|"
+        return md + "\n".join([header, separator] + markdown_table[1:])
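Both versions convert PPTX tables by round-tripping through HTML: build an HTML `<table>` from the cells, then feed it to the HTML-to-Markdown path. The same trick works standalone with markdownify, the library both versions build on; a sketch:

```python
import html
from markdownify import markdownify as md

rows = [["Name", "Role"], ["Ada", "Engineer"]]
html_table = "<table>"
for i, row in enumerate(rows):
    tag = "th" if i == 0 else "td"
    html_table += (
        "<tr>" + "".join(f"<{tag}>{html.escape(c)}</{tag}>" for c in row) + "</tr>"
    )
html_table += "</table>"

print(md(html_table))  # pipe-table Markdown, e.g. "| Name | Role | ..."
```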
@@ -1,177 +1,141 @@
 from xml.dom import minidom
-from typing import BinaryIO, Any, Union
+from typing import Union
 from bs4 import BeautifulSoup

 from ._markdownify import _CustomMarkdownify
-from .._stream_info import StreamInfo
-from .._base_converter import DocumentConverter, DocumentConverterResult
-
-PRECISE_MIME_TYPE_PREFIXES = [
-    "application/rss",
-    "application/rss+xml",
-    "application/atom",
-    "application/atom+xml",
-]
-
-PRECISE_FILE_EXTENSIONS = [".rss", ".atom"]
-
-CANDIDATE_MIME_TYPE_PREFIXES = [
-    "text/xml",
-    "application/xml",
-]
-
-CANDIDATE_FILE_EXTENSIONS = [
-    ".xml",
-]
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput


 class RssConverter(DocumentConverter):
     """Convert RSS / Atom type to markdown"""

-    def __init__(self):
-        super().__init__()
-        self._kwargs = {}
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        # Check for precise mimetypes and file extensions
-        if extension in PRECISE_FILE_EXTENSIONS:
-            return True
-
-        for prefix in PRECISE_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        # Check for candidate mimetypes and file extensions
-        if extension in CANDIDATE_FILE_EXTENSIONS:
-            return self._check_xml(file_stream)
-
-        for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return self._check_xml(file_stream)
-
-        return False
-
-    def _check_xml(self, file_stream: BinaryIO) -> bool:
-        cur_pos = file_stream.tell()
+    def convert(
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not RSS type
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".xml", ".rss", ".atom"]:
+            return None
+        # Read file object from input
+        file_obj = input.read_file(mode="rb")
+
         try:
-            doc = minidom.parse(file_stream)
-            return self._feed_type(doc) is not None
+            doc = minidom.parse(file_obj)
         except BaseException as _:
-            pass
+            return None
         finally:
-            file_stream.seek(cur_pos)
-        return False
+            file_obj.close()

-    def _feed_type(self, doc: Any) -> str | None:
+        result = None
         if doc.getElementsByTagName("rss"):
-            return "rss"
+            # An RSS feed must have a root element of <rss>
+            result = self._parse_rss_type(doc)
         elif doc.getElementsByTagName("feed"):
             root = doc.getElementsByTagName("feed")[0]
             if root.getElementsByTagName("entry"):
                 # An Atom feed must have a root element of <feed> and at least one <entry>
-                return "atom"
-        return None
-
-    def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
+                result = self._parse_atom_type(doc)
+        else:
+            return None
|
|
||||||
**kwargs: Any, # Options to pass to the converter
|
|
||||||
) -> DocumentConverterResult:
|
|
||||||
self._kwargs = kwargs
|
|
||||||
doc = minidom.parse(file_stream)
|
|
||||||
feed_type = self._feed_type(doc)
|
|
||||||
|
|
||||||
if feed_type == "rss":
|
|
||||||
return self._parse_rss_type(doc)
|
|
||||||
elif feed_type == "atom":
|
|
||||||
return self._parse_atom_type(doc)
|
|
||||||
else:
|
else:
|
||||||
raise ValueError("Unknown feed type")
|
# not rss or atom
|
||||||
|
return None
|
||||||
|
|
||||||
def _parse_atom_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
return result
|
||||||
|
|
||||||
|
def _parse_atom_type(
|
||||||
|
self, doc: minidom.Document
|
||||||
|
) -> Union[None, DocumentConverterResult]:
|
||||||
"""Parse the type of an Atom feed.
|
"""Parse the type of an Atom feed.
|
||||||
|
|
||||||
Returns None if the feed type is not recognized or something goes wrong.
|
Returns None if the feed type is not recognized or something goes wrong.
|
||||||
"""
|
"""
|
||||||
root = doc.getElementsByTagName("feed")[0]
|
try:
|
||||||
title = self._get_data_by_tag_name(root, "title")
|
root = doc.getElementsByTagName("feed")[0]
|
||||||
subtitle = self._get_data_by_tag_name(root, "subtitle")
|
title = self._get_data_by_tag_name(root, "title")
|
||||||
entries = root.getElementsByTagName("entry")
|
subtitle = self._get_data_by_tag_name(root, "subtitle")
|
||||||
md_text = f"# {title}\n"
|
entries = root.getElementsByTagName("entry")
|
||||||
if subtitle:
|
md_text = f"# {title}\n"
|
||||||
md_text += f"{subtitle}\n"
|
if subtitle:
|
||||||
for entry in entries:
|
md_text += f"{subtitle}\n"
|
||||||
entry_title = self._get_data_by_tag_name(entry, "title")
|
for entry in entries:
|
||||||
entry_summary = self._get_data_by_tag_name(entry, "summary")
|
entry_title = self._get_data_by_tag_name(entry, "title")
|
||||||
entry_updated = self._get_data_by_tag_name(entry, "updated")
|
entry_summary = self._get_data_by_tag_name(entry, "summary")
|
||||||
entry_content = self._get_data_by_tag_name(entry, "content")
|
entry_updated = self._get_data_by_tag_name(entry, "updated")
|
||||||
|
entry_content = self._get_data_by_tag_name(entry, "content")
|
||||||
|
|
||||||
if entry_title:
|
if entry_title:
|
||||||
md_text += f"\n## {entry_title}\n"
|
md_text += f"\n## {entry_title}\n"
|
||||||
if entry_updated:
|
if entry_updated:
|
||||||
md_text += f"Updated on: {entry_updated}\n"
|
md_text += f"Updated on: {entry_updated}\n"
|
||||||
if entry_summary:
|
if entry_summary:
|
||||||
md_text += self._parse_content(entry_summary)
|
md_text += self._parse_content(entry_summary)
|
||||||
if entry_content:
|
if entry_content:
|
||||||
md_text += self._parse_content(entry_content)
|
md_text += self._parse_content(entry_content)
|
||||||
|
|
||||||
return DocumentConverterResult(
|
return DocumentConverterResult(
|
||||||
markdown=md_text,
|
title=title,
|
||||||
title=title,
|
text_content=md_text,
|
||||||
)
|
)
|
||||||
|
except BaseException as _:
|
||||||
|
return None
|
||||||
|
|
||||||
def _parse_rss_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
def _parse_rss_type(
|
||||||
|
self, doc: minidom.Document
|
||||||
|
) -> Union[None, DocumentConverterResult]:
|
||||||
"""Parse the type of an RSS feed.
|
"""Parse the type of an RSS feed.
|
||||||
|
|
||||||
Returns None if the feed type is not recognized or something goes wrong.
|
Returns None if the feed type is not recognized or something goes wrong.
|
||||||
"""
|
"""
|
||||||
root = doc.getElementsByTagName("rss")[0]
|
try:
|
||||||
channel_list = root.getElementsByTagName("channel")
|
root = doc.getElementsByTagName("rss")[0]
|
||||||
if not channel_list:
|
channel = root.getElementsByTagName("channel")
|
||||||
raise ValueError("No channel found in RSS feed")
|
if not channel:
|
||||||
channel = channel_list[0]
|
return None
|
||||||
channel_title = self._get_data_by_tag_name(channel, "title")
|
channel = channel[0]
|
||||||
channel_description = self._get_data_by_tag_name(channel, "description")
|
channel_title = self._get_data_by_tag_name(channel, "title")
|
||||||
items = channel.getElementsByTagName("item")
|
channel_description = self._get_data_by_tag_name(channel, "description")
|
||||||
if channel_title:
|
items = channel.getElementsByTagName("item")
|
||||||
md_text = f"# {channel_title}\n"
|
if channel_title:
|
||||||
if channel_description:
|
md_text = f"# {channel_title}\n"
|
||||||
md_text += f"{channel_description}\n"
|
if channel_description:
|
||||||
for item in items:
|
md_text += f"{channel_description}\n"
|
||||||
title = self._get_data_by_tag_name(item, "title")
|
if not items:
|
||||||
description = self._get_data_by_tag_name(item, "description")
|
items = []
|
||||||
pubDate = self._get_data_by_tag_name(item, "pubDate")
|
for item in items:
|
||||||
content = self._get_data_by_tag_name(item, "content:encoded")
|
title = self._get_data_by_tag_name(item, "title")
|
||||||
|
description = self._get_data_by_tag_name(item, "description")
|
||||||
|
pubDate = self._get_data_by_tag_name(item, "pubDate")
|
||||||
|
content = self._get_data_by_tag_name(item, "content:encoded")
|
||||||
|
|
||||||
if title:
|
if title:
|
||||||
md_text += f"\n## {title}\n"
|
md_text += f"\n## {title}\n"
|
||||||
if pubDate:
|
if pubDate:
|
||||||
md_text += f"Published on: {pubDate}\n"
|
md_text += f"Published on: {pubDate}\n"
|
||||||
if description:
|
if description:
|
||||||
md_text += self._parse_content(description)
|
md_text += self._parse_content(description)
|
||||||
if content:
|
if content:
|
||||||
md_text += self._parse_content(content)
|
md_text += self._parse_content(content)
|
||||||
|
|
||||||
return DocumentConverterResult(
|
return DocumentConverterResult(
|
||||||
markdown=md_text,
|
title=channel_title,
|
||||||
title=channel_title,
|
text_content=md_text,
|
||||||
)
|
)
|
||||||
|
except BaseException as _:
|
||||||
|
print(traceback.format_exc())
|
||||||
|
return None
|
||||||
|
|
||||||
def _parse_content(self, content: str) -> str:
|
def _parse_content(self, content: str) -> str:
|
||||||
"""Parse the content of an RSS feed item"""
|
"""Parse the content of an RSS feed item"""
|
||||||
try:
|
try:
|
||||||
# using bs4 because many RSS feeds have HTML-styled content
|
# using bs4 because many RSS feeds have HTML-styled content
|
||||||
soup = BeautifulSoup(content, "html.parser")
|
soup = BeautifulSoup(content, "html.parser")
|
||||||
return _CustomMarkdownify(**self._kwargs).convert_soup(soup)
|
return _CustomMarkdownify().convert_soup(soup)
|
||||||
except BaseException as _:
|
except BaseException as _:
|
||||||
return content
|
return content
|
||||||
|
|
||||||
@@ -186,6 +150,5 @@ class RssConverter(DocumentConverter):
|
|||||||
return None
|
return None
|
||||||
fc = nodes[0].firstChild
|
fc = nodes[0].firstChild
|
||||||
if fc:
|
if fc:
|
||||||
if hasattr(fc, "data"):
|
return fc.data
|
||||||
return fc.data
|
|
||||||
return None
|
return None
|
||||||
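The feed-type check both versions rely on (an `<rss>` root means RSS; a `<feed>` root with at least one `<entry>` means Atom) is easy to exercise in isolation. The sample XML strings below are illustrative, not from the repository:

```python
from xml.dom import minidom

samples = [
    "<rss version='2.0'><channel><title>t</title></channel></rss>",
    "<feed xmlns='http://www.w3.org/2005/Atom'><entry/></feed>",
    "<note>plain xml</note>",
]

for xml in samples:
    doc = minidom.parseString(xml)
    if doc.getElementsByTagName("rss"):
        kind = "rss"
    elif doc.getElementsByTagName("feed"):
        root = doc.getElementsByTagName("feed")[0]
        kind = "atom" if root.getElementsByTagName("entry") else None
    else:
        kind = None
    print(kind)  # rss, atom, None
```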
@@ -1,49 +0,0 @@
-import io
-import sys
-from typing import BinaryIO
-from .._exceptions import MissingDependencyException
-
-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_dependency_exc_info = None
-try:
-    # Suppress some warnings on library import
-    import warnings
-
-    with warnings.catch_warnings():
-        warnings.filterwarnings("ignore", category=DeprecationWarning)
-        warnings.filterwarnings("ignore", category=SyntaxWarning)
-        import speech_recognition as sr
-        import pydub
-except ImportError:
-    # Preserve the error and stack trace for later
-    _dependency_exc_info = sys.exc_info()
-
-
-def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav") -> str:
-    # Check for installed dependencies
-    if _dependency_exc_info is not None:
-        raise MissingDependencyException(
-            "Speech transcription requires installing MarkItdown with the [audio-transcription] optional dependencies. E.g., `pip install markitdown[audio-transcription]` or `pip install markitdown[all]`"
-        ) from _dependency_exc_info[
-            1
-        ].with_traceback(  # type: ignore[union-attr]
-            _dependency_exc_info[2]
-        )
-
-    if audio_format in ["wav", "aiff", "flac"]:
-        audio_source = file_stream
-    elif audio_format in ["mp3", "mp4"]:
-        audio_segment = pydub.AudioSegment.from_file(file_stream, format=audio_format)
-
-        audio_source = io.BytesIO()
-        audio_segment.export(audio_source, format="wav")
-        audio_source.seek(0)
-    else:
-        raise ValueError(f"Unsupported audio format: {audio_format}")
-
-    recognizer = sr.Recognizer()
-    with sr.AudioFile(audio_source) as source:
-        audio = recognizer.record(source)
-        transcript = recognizer.recognize_google(audio).strip()
-        return "[No speech detected]" if transcript == "" else transcript
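A hedged usage sketch of the helper deleted above (the v0.1.0 side), assuming the [audio-transcription] extras are installed; the import path and file name are placeholders:

```python
from markitdown.converters._transcribe_audio import transcribe_audio  # assumed module path

# MP3/MP4 inputs are converted to WAV in memory via pydub before recognition.
with open("speech.mp3", "rb") as f:  # placeholder file
    print(transcribe_audio(f, audio_format="mp3"))
```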
@@ -0,0 +1,80 @@
+from typing import Union
+from ._base import DocumentConverter, DocumentConverterResult
+from ._media_converter import MediaConverter
+from ._converter_input import ConverterInput
+
+# Optional Transcription support
+IS_AUDIO_TRANSCRIPTION_CAPABLE = False
+try:
+    import speech_recognition as sr
+
+    IS_AUDIO_TRANSCRIPTION_CAPABLE = True
+except ModuleNotFoundError:
+    pass
+
+
+class WavConverter(MediaConverter):
+    """
+    Converts WAV files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
+    """
+
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)
+
+    def convert(
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a WAV
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".wav":
+            return None
+
+        # Bail if a local path was not provided
+        if input.input_type != "filepath":
+            return None
+        local_path = input.filepath
+
+        md_content = ""
+
+        # Add metadata
+        metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
+        if metadata:
+            for f in [
+                "Title",
+                "Artist",
+                "Author",
+                "Band",
+                "Album",
+                "Genre",
+                "Track",
+                "DateTimeOriginal",
+                "CreateDate",
+                "Duration",
+            ]:
+                if f in metadata:
+                    md_content += f"{f}: {metadata[f]}\n"
+
+        # Transcribe
+        if IS_AUDIO_TRANSCRIPTION_CAPABLE:
+            try:
+                transcript = self._transcribe_audio(local_path)
+                md_content += "\n\n### Audio Transcript:\n" + (
+                    "[No speech detected]" if transcript == "" else transcript
+                )
+            except Exception:
+                md_content += (
+                    "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
+                )
+
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )
+
+    def _transcribe_audio(self, local_path) -> str:
+        recognizer = sr.Recognizer()
+        with sr.AudioFile(local_path) as source:
+            audio = recognizer.record(source)
+            return recognizer.recognize_google(audio).strip()
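A hedged sketch of calling the converter above through the head branch's interface. `ConverterInput` with the "filepath" input type is taken from elsewhere in this diff; the class imports and the WAV file name are assumptions:

```python
# Assumes WavConverter and ConverterInput are importable from the head branch's package.
conv = WavConverter()
result = conv.convert(
    ConverterInput(input_type="filepath", filepath="test.wav"),  # placeholder file
    file_extension=".wav",
)
if result is not None:
    print(result.text_content)
```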
@@ -1,63 +1,37 @@
-import io
 import re
-import bs4
-from typing import Any, BinaryIO, Optional
-
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
+from typing import Any, Union
+from bs4 import BeautifulSoup

+from ._base import DocumentConverter, DocumentConverterResult
 from ._markdownify import _CustomMarkdownify
+from ._converter_input import ConverterInput

-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/html",
-    "application/xhtml",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".html",
-    ".htm",
-]


 class WikipediaConverter(DocumentConverter):
     """Handle Wikipedia pages separately, focusing only on the main document content."""

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        """
-        Make sure we're dealing with HTML content *from* Wikipedia.
-        """
-
-        url = stream_info.url or ""
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
-            # Not a Wikipedia URL
-            return False
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        # Not HTML content
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Parse the stream
-        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not Wikipedia
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".html", ".htm"]:
+            return None
+        url = kwargs.get("url", "")
+        if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
+            return None
+
+        # Parse the file
+        soup = None
+        file_obj = input.read_file(mode="rt", encoding="utf-8")
+        soup = BeautifulSoup(file_obj.read(), "html.parser")
+        file_obj.close()

         # Remove javascript and style blocks
         for script in soup(["script", "style"]):
@@ -72,17 +46,18 @@ class WikipediaConverter(DocumentConverter):

         if body_elm:
             # What's the title
-            if title_elm and isinstance(title_elm, bs4.Tag):
-                main_title = title_elm.string
+            if title_elm and len(title_elm) > 0:
+                main_title = title_elm.string  # type: ignore
+                assert isinstance(main_title, str)

             # Convert the page
-            webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify(
-                **kwargs
-            ).convert_soup(body_elm)
+            webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify().convert_soup(
+                body_elm
+            )
         else:
-            webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)
+            webpage_text = _CustomMarkdownify().convert_soup(soup)

         return DocumentConverterResult(
-            markdown=webpage_text,
             title=main_title,
+            text_content=webpage_text,
         )
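Both sides gate on the same Wikipedia URL pattern; the constant name below is mine, but the regular expression is copied from the diff:

```python
import re

WIKI_RE = r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/"

for url in ["https://en.wikipedia.org/wiki/Microsoft", "https://example.com/"]:
    print(url, "->", bool(re.search(WIKI_RE, url)))  # True, False
```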
@@ -1,157 +1,70 @@
-import sys
-from typing import BinaryIO, Any
+from typing import Union

-from ._html_converter import HtmlConverter
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
-from .._stream_info import StreamInfo
+import pandas as pd

-# Try loading optional (but in this case, required) dependencies
-# Save reporting of any exceptions for later
-_xlsx_dependency_exc_info = None
-try:
-    import pandas as pd
-    import openpyxl
-except ImportError:
-    _xlsx_dependency_exc_info = sys.exc_info()
-
-_xls_dependency_exc_info = None
-try:
-    import pandas as pd
-    import xlrd
-except ImportError:
-    _xls_dependency_exc_info = sys.exc_info()
-
-ACCEPTED_XLSX_MIME_TYPE_PREFIXES = [
-    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
-]
-ACCEPTED_XLSX_FILE_EXTENSIONS = [".xlsx"]
-
-ACCEPTED_XLS_MIME_TYPE_PREFIXES = [
-    "application/vnd.ms-excel",
-    "application/excel",
-]
-ACCEPTED_XLS_FILE_EXTENSIONS = [".xls"]
+from ._base import DocumentConverter, DocumentConverterResult
+from ._html_converter import HtmlConverter
+from ._converter_input import ConverterInput


-class XlsxConverter(DocumentConverter):
+class XlsxConverter(HtmlConverter):
     """
     Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
     """

-    def __init__(self):
-        super().__init__()
-        self._html_converter = HtmlConverter()
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_XLSX_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_XLSX_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Check the dependencies
-        if _xlsx_dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".xlsx",
-                    feature="xlsx",
-                )
-            ) from _xlsx_dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _xlsx_dependency_exc_info[2]
-            )
-
-        sheets = pd.read_excel(file_stream, sheet_name=None, engine="openpyxl")
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a XLSX
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".xlsx":
+            return None
+
+        file_obj = input.read_file(mode="rb")
+        sheets = pd.read_excel(file_obj, sheet_name=None, engine="openpyxl")
+        file_obj.close()
+
         md_content = ""
         for s in sheets:
             md_content += f"## {s}\n"
             html_content = sheets[s].to_html(index=False)
-            md_content += (
-                self._html_converter.convert_string(
-                    html_content, **kwargs
-                ).markdown.strip()
-                + "\n\n"
-            )
+            md_content += self._convert(html_content).text_content.strip() + "\n\n"

-        return DocumentConverterResult(markdown=md_content.strip())
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )


-class XlsConverter(DocumentConverter):
+class XlsConverter(HtmlConverter):
     """
     Converts XLS files to Markdown, with each sheet presented as a separate Markdown table.
     """

-    def __init__(self):
-        super().__init__()
-        self._html_converter = HtmlConverter()
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_XLS_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_XLS_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
-
     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Load the dependencies
-        if _xls_dependency_exc_info is not None:
-            raise MissingDependencyException(
-                MISSING_DEPENDENCY_MESSAGE.format(
-                    converter=type(self).__name__,
-                    extension=".xls",
-                    feature="xls",
-                )
-            ) from _xls_dependency_exc_info[
-                1
-            ].with_traceback(  # type: ignore[union-attr]
-                _xls_dependency_exc_info[2]
-            )
-
-        sheets = pd.read_excel(file_stream, sheet_name=None, engine="xlrd")
+        self, input: ConverterInput, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a XLS
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".xls":
+            return None
+
+        file_obj = input.read_file(mode="rb")
+        sheets = pd.read_excel(file_obj, sheet_name=None, engine="xlrd")
+        file_obj.close()
+
         md_content = ""
         for s in sheets:
             md_content += f"## {s}\n"
             html_content = sheets[s].to_html(index=False)
-            md_content += (
-                self._html_converter.convert_string(
-                    html_content, **kwargs
-                ).markdown.strip()
-                + "\n\n"
-            )
+            md_content += self._convert(html_content).text_content.strip() + "\n\n"

-        return DocumentConverterResult(markdown=md_content.strip())
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )
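The pipeline both versions share reads every sheet with pandas, renders each DataFrame to HTML, then hands that HTML to the HTML-to-Markdown converter. A sketch of the first two stages without the converter machinery (the workbook name is a placeholder):

```python
import pandas as pd

# sheet_name=None returns a dict of {sheet name: DataFrame} for the whole workbook.
sheets = pd.read_excel("book.xlsx", sheet_name=None, engine="openpyxl")  # placeholder file
for name, df in sheets.items():
    print(f"## {name}")
    # Each sheet becomes an HTML table; MarkItDown then converts this HTML to Markdown.
    print(df.to_html(index=False))
```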
@@ -1,120 +1,72 @@
-import sys
-import json
-import time
 import re
-import bs4
-from typing import Any, BinaryIO, Optional, Dict, List, Union
-from urllib.parse import parse_qs, urlparse, unquote
+import json

-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
+from typing import Any, Union, Dict, List
+from urllib.parse import parse_qs, urlparse
+from bs4 import BeautifulSoup
+
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput

 # Optional YouTube transcription support
 try:
-    # Suppress some warnings on library import
-    import warnings
-
-    with warnings.catch_warnings():
-        warnings.filterwarnings("ignore", category=SyntaxWarning)
-        # Patch submitted upstream to fix the SyntaxWarning
-        from youtube_transcript_api import YouTubeTranscriptApi
+    from youtube_transcript_api import YouTubeTranscriptApi

     IS_YOUTUBE_TRANSCRIPT_CAPABLE = True
 except ModuleNotFoundError:
-    IS_YOUTUBE_TRANSCRIPT_CAPABLE = False
-
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "text/html",
-    "application/xhtml",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [
-    ".html",
-    ".htm",
-]
+    pass


 class YouTubeConverter(DocumentConverter):
     """Handle YouTube specially, focusing on the video title, description, and transcript."""

-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        """
-        Make sure we're dealing with HTML content *from* YouTube.
-        """
-        url = stream_info.url or ""
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        url = unquote(url)
-        url = url.replace(r"\?", "?").replace(r"\=", "=")
-
-        if not url.startswith("https://www.youtube.com/watch?"):
-            # Not a YouTube URL
-            return False
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        # Not HTML content
-        return False
+    def __init__(
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
+    ):
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        # Parse the stream
-        encoding = "utf-8" if stream_info.charset is None else stream_info.charset
-        soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not YouTube
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".html", ".htm"]:
+            return None
+        url = kwargs.get("url", "")
+        if not url.startswith("https://www.youtube.com/watch?"):
+            return None
+
+        # Parse the file
+        soup = None
+        file_obj = input.read_file(mode="rt", encoding="utf-8")
+        soup = BeautifulSoup(file_obj.read(), "html.parser")
+        file_obj.close()

         # Read the meta tags
-        metadata: Dict[str, str] = {}
-
-        if soup.title and soup.title.string:
-            metadata["title"] = soup.title.string
+        assert soup.title is not None and soup.title.string is not None
+        metadata: Dict[str, str] = {"title": soup.title.string}

         for meta in soup(["meta"]):
-            if not isinstance(meta, bs4.Tag):
-                continue
-
             for a in meta.attrs:
                 if a in ["itemprop", "property", "name"]:
-                    key = str(meta.get(a, ""))
-                    content = str(meta.get("content", ""))
-                    if key and content:  # Only add non-empty content
-                        metadata[key] = content
+                    metadata[meta[a]] = meta.get("content", "")
                     break

-        # Try reading the description
+        # We can also try to read the full description. This is more prone to breaking, since it reaches into the page implementation
         try:
             for script in soup(["script"]):
-                if not isinstance(script, bs4.Tag):
-                    continue
-                if not script.string:  # Skip empty scripts
-                    continue
-                content = script.string
+                content = script.text
                 if "ytInitialData" in content:
-                    match = re.search(r"var ytInitialData = ({.*?});", content)
-                    if match:
-                        data = json.loads(match.group(1))
-                        attrdesc = self._findKey(data, "attributedDescriptionBodyText")
-                        if attrdesc and isinstance(attrdesc, dict):
-                            metadata["description"] = str(attrdesc.get("content", ""))
+                    lines = re.split(r"\r?\n", content)
+                    obj_start = lines[0].find("{")
+                    obj_end = lines[0].rfind("}")
+                    if obj_start >= 0 and obj_end >= 0:
+                        data = json.loads(lines[0][obj_start : obj_end + 1])
+                        attrdesc = self._findKey(data, "attributedDescriptionBodyText")  # type: ignore
+                        if attrdesc:
+                            metadata["description"] = str(attrdesc["content"])
                     break
-        except Exception as e:
-            print(f"Error extracting description: {e}")
+        except Exception:
             pass

         # Start preparing the page
@@ -147,39 +99,33 @@ class YouTubeConverter(DocumentConverter):
             webpage_text += f"\n### Description\n{description}\n"

         if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
-            ytt_api = YouTubeTranscriptApi()
             transcript_text = ""
-            parsed_url = urlparse(stream_info.url)  # type: ignore
+            parsed_url = urlparse(url)  # type: ignore
             params = parse_qs(parsed_url.query)  # type: ignore
-            if "v" in params and params["v"][0]:
+            if "v" in params:
+                assert isinstance(params["v"][0], str)
                 video_id = str(params["v"][0])
                 try:
                     youtube_transcript_languages = kwargs.get(
                         "youtube_transcript_languages", ("en",)
                     )
-                    # Retry the transcript fetching operation
-                    transcript = self._retry_operation(
-                        lambda: ytt_api.fetch(
-                            video_id, languages=youtube_transcript_languages
-                        ),
-                        retries=3,  # Retry 3 times
-                        delay=2,  # 2 seconds delay between retries
-                    )
-                    if transcript:
-                        transcript_text = " ".join(
-                            [part.text for part in transcript]
-                        )  # type: ignore
-                except Exception as e:
-                    print(f"Error fetching transcript: {e}")
+                    # Must be a single transcript.
+                    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages)  # type: ignore
+                    transcript_text = " ".join([part["text"] for part in transcript])  # type: ignore
+                    # Alternative formatting:
+                    # formatter = TextFormatter()
+                    # formatter.format_transcript(transcript)
+                except Exception:
+                    pass

             if transcript_text:
                 webpage_text += f"\n### Transcript\n{transcript_text}\n"

-        title = title if title else (soup.title.string if soup.title else "")
+        title = title if title else soup.title.string
         assert isinstance(title, str)

         return DocumentConverterResult(
-            markdown=webpage_text,
             title=title,
+            text_content=webpage_text,
         )
@@ -188,37 +134,23 @@ class YouTubeConverter(DocumentConverter):
         keys: List[str],
         default: Union[str, None] = None,
     ) -> Union[str, None]:
-        """Get first non-empty value from metadata matching given keys."""
         for k in keys:
             if k in metadata:
                 return metadata[k]
         return default

     def _findKey(self, json: Any, key: str) -> Union[str, None]:  # TODO: Fix json type
-        """Recursively search for a key in nested dictionary/list structures."""
         if isinstance(json, list):
             for elm in json:
                 ret = self._findKey(elm, key)
                 if ret is not None:
                     return ret
         elif isinstance(json, dict):
-            for k, v in json.items():
+            for k in json:
                 if k == key:
                     return json[k]
-                if result := self._findKey(v, key):
-                    return result
+                else:
+                    ret = self._findKey(json[k], key)
+                    if ret is not None:
+                        return ret
         return None
-
-    def _retry_operation(self, operation, retries=3, delay=2):
-        """Retries the operation if it fails."""
-        attempt = 0
-        while attempt < retries:
-            try:
-                return operation()  # Attempt the operation
-            except Exception as e:
-                print(f"Attempt {attempt + 1} failed: {e}")
-                if attempt < retries - 1:
-                    time.sleep(delay)  # Wait before retrying
-                attempt += 1
-        # If all attempts fail, raise the last exception
-        raise Exception(f"Operation failed after {retries} attempts.")
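Both versions pull the video id out of the watch URL the same way, via the standard library:

```python
from urllib.parse import parse_qs, urlparse

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # placeholder URL
params = parse_qs(urlparse(url).query)
if "v" in params and params["v"][0]:
    print(params["v"][0])  # -> dQw4w9WgXcQ
```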
@@ -1,23 +1,10 @@
-import sys
-import zipfile
-import io
 import os
+import zipfile
+import shutil
+from typing import Any, Union

-from typing import BinaryIO, Any, TYPE_CHECKING
-
-from .._base_converter import DocumentConverter, DocumentConverterResult
-from .._stream_info import StreamInfo
-from .._exceptions import UnsupportedFormatException, FileConversionException
-
-# Break otherwise circular import for type hinting
-if TYPE_CHECKING:
-    from .._markitdown import MarkItDown
-
-ACCEPTED_MIME_TYPE_PREFIXES = [
-    "application/zip",
-]
-
-ACCEPTED_FILE_EXTENSIONS = [".zip"]
+from ._base import DocumentConverter, DocumentConverterResult
+from ._converter_input import ConverterInput


 class ZipConverter(DocumentConverter):
@@ -60,58 +47,104 @@ class ZipConverter(DocumentConverter):
     """

     def __init__(
-        self,
-        *,
-        markitdown: "MarkItDown",
+        self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
     ):
-        super().__init__()
-        self._markitdown = markitdown
-
-    def accepts(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> bool:
-        mimetype = (stream_info.mimetype or "").lower()
-        extension = (stream_info.extension or "").lower()
-
-        if extension in ACCEPTED_FILE_EXTENSIONS:
-            return True
-
-        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
-            if mimetype.startswith(prefix):
-                return True
-
-        return False
+        super().__init__(priority=priority)

     def convert(
-        self,
-        file_stream: BinaryIO,
-        stream_info: StreamInfo,
-        **kwargs: Any,  # Options to pass to the converter
-    ) -> DocumentConverterResult:
-        file_path = stream_info.url or stream_info.local_path or stream_info.filename
-        md_content = f"Content from the zip file `{file_path}`:\n\n"
-
-        with zipfile.ZipFile(file_stream, "r") as zipObj:
-            for name in zipObj.namelist():
-                try:
-                    z_file_stream = io.BytesIO(zipObj.read(name))
-                    z_file_stream_info = StreamInfo(
-                        extension=os.path.splitext(name)[1],
-                        filename=os.path.basename(name),
-                    )
-                    result = self._markitdown.convert_stream(
-                        stream=z_file_stream,
-                        stream_info=z_file_stream_info,
-                    )
-                    if result is not None:
-                        md_content += f"## File: {name}\n\n"
-                        md_content += result.markdown + "\n\n"
-                except UnsupportedFormatException:
-                    pass
-                except FileConversionException:
-                    pass
-
-        return DocumentConverterResult(markdown=md_content.strip())
+        self, input: ConverterInput, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a ZIP
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".zip":
+            return None
+
+        # Bail if a local path is not provided
+        if input.input_type != "filepath":
+            return None
+        local_path = input.filepath
+
+        # Get parent converters list if available
+        parent_converters = kwargs.get("_parent_converters", [])
+        if not parent_converters:
+            return DocumentConverterResult(
+                title=None,
+                text_content=f"[ERROR] No converters available to process zip contents from: {local_path}",
+            )
+
+        extracted_zip_folder_name = (
+            f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
+        )
+        extraction_dir = os.path.normpath(
+            os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
+        )
+        md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
+
+        try:
+            # Extract the zip file safely
+            with zipfile.ZipFile(local_path, "r") as zipObj:
+                # Safeguard against path traversal
+                for member in zipObj.namelist():
+                    member_path = os.path.normpath(os.path.join(extraction_dir, member))
+                    if (
+                        not os.path.commonprefix([extraction_dir, member_path])
+                        == extraction_dir
+                    ):
+                        raise ValueError(
+                            f"Path traversal detected in zip file: {member}"
+                        )
+
+                # Extract all files safely
+                zipObj.extractall(path=extraction_dir)
+
+            # Process each extracted file
+            for root, dirs, files in os.walk(extraction_dir):
+                for name in files:
+                    file_path = os.path.join(root, name)
+                    relative_path = os.path.relpath(file_path, extraction_dir)
+
+                    # Get file extension
+                    _, file_extension = os.path.splitext(name)
+
+                    # Update kwargs for the file
+                    file_kwargs = kwargs.copy()
+                    file_kwargs["file_extension"] = file_extension
+                    file_kwargs["_parent_converters"] = parent_converters
+
+                    # Try converting the file using available converters
+                    for converter in parent_converters:
+                        # Skip the zip converter to avoid infinite recursion
+                        if isinstance(converter, ZipConverter):
+                            continue
+
+                        # Create a ConverterInput for the parent converter and attempt conversion
+                        input = ConverterInput(
+                            input_type="filepath", filepath=file_path
+                        )
+                        result = converter.convert(input, **file_kwargs)
+                        if result is not None:
+                            md_content += f"\n## File: {relative_path}\n\n"
+                            md_content += result.text_content + "\n\n"
+                            break
+
+            # Clean up extracted files if specified
+            if kwargs.get("cleanup_extracted", True):
+                shutil.rmtree(extraction_dir)
+
+            return DocumentConverterResult(title=None, text_content=md_content.strip())
+
+        except zipfile.BadZipFile:
+            return DocumentConverterResult(
+                title=None,
+                text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
+            )
+        except ValueError as ve:
+            return DocumentConverterResult(
+                title=None,
+                text_content=f"[ERROR] Security error in zip file {local_path}: {str(ve)}",
+            )
+        except Exception as e:
+            return DocumentConverterResult(
+                title=None,
+                text_content=f"[ERROR] Failed to process zip file {local_path}: {str(e)}",
+            )
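The head branch's zip-slip safeguard in isolation: normalize each member path and require it to stay under the extraction directory. The paths below are placeholders:

```python
import os

extraction_dir = os.path.normpath("/tmp/extracted")  # placeholder directory
for member in ["docs/readme.md", "../../etc/passwd"]:
    # Join and normalize, then confirm the result is still rooted in extraction_dir.
    member_path = os.path.normpath(os.path.join(extraction_dir, member))
    safe = os.path.commonprefix([extraction_dir, member_path]) == extraction_dir
    print(member, "->", "ok" if safe else "path traversal detected")
```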
@@ -1,278 +0,0 @@
-import dataclasses
-from typing import List
-
-
-@dataclasses.dataclass(frozen=True, kw_only=True)
-class FileTestVector(object):
-    filename: str
-    mimetype: str | None
-    charset: str | None
-    url: str | None
-    must_include: List[str]
-    must_not_include: List[str]
-
-
-GENERAL_TEST_VECTORS = [
-    FileTestVector(
-        filename="test.docx",
-        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-        charset=None,
-        url=None,
-        must_include=[
-            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
-            "49e168b7-d2ae-407f-a055-2167576f39a1",
-            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
-            "# Abstract",
-            "# Introduction",
-            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-            "data:image/png;base64...",
-        ],
-        must_not_include=[
-            "data:image/png;base64,iVBORw0KGgoAAAANSU",
-        ],
-    ),
-    FileTestVector(
-        filename="test.xlsx",
-        mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-        charset=None,
-        url=None,
-        must_include=[
-            "## 09060124-b5e7-4717-9d07-3c046eb",
-            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
-            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test.xls",
-        mimetype="application/vnd.ms-excel",
-        charset=None,
-        url=None,
-        must_include=[
-            "## 09060124-b5e7-4717-9d07-3c046eb",
-            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
-            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test.pptx",
-        mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
-        charset=None,
-        url=None,
-        must_include=[
-            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
-            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
-            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
-            "1b92870d-e3b5-4e65-8153-919f4ff45592",
-            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-            "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
-            "2003",  # chart value
-            "",
-        ],
-        must_not_include=["data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE"],
-    ),
-    FileTestVector(
-        filename="test_outlook_msg.msg",
-        mimetype="application/vnd.ms-outlook",
-        charset=None,
-        url=None,
-        must_include=[
-            "# Email Message",
-            "**From:** test.sender@example.com",
-            "**To:** test.recipient@example.com",
-            "**Subject:** Test Email Message",
-            "## Content",
-            "This is the body of the test email message",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test.pdf",
-        mimetype="application/pdf",
-        charset=None,
-        url=None,
-        must_include=[
-            "While there is contemporaneous exploration of multi-agent approaches"
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test_blog.html",
-        mimetype="text/html",
-        charset="utf-8",
-        url="https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math",
-        must_include=[
-            "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
-            "an example where high cost can easily prevent a generic complex",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test_wikipedia.html",
-        mimetype="text/html",
-        charset="utf-8",
-        url="https://en.wikipedia.org/wiki/Microsoft",
-        must_include=[
-            "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
-            'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
-        ],
-        must_not_include=[
-            "You are encouraged to create an account and log in",
-            "154 languages",
-            "move to sidebar",
-        ],
-    ),
-    FileTestVector(
-        filename="test_serp.html",
-        mimetype="text/html",
-        charset="utf-8",
-        url="https://www.bing.com/search?q=microsoft+wikipedia",
-        must_include=[
-            "](https://en.wikipedia.org/wiki/Microsoft",
-            "Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
-            "1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
-        ],
-        must_not_include=[
-            "https://www.bing.com/ck/a?!&&p=",
-            "data:image/svg+xml,%3Csvg%20width%3D",
-        ],
-    ),
-    FileTestVector(
-        filename="test_mskanji.csv",
-        mimetype="text/csv",
-        charset="cp932",
-        url=None,
-        must_include=[
-            "名前,年齢,住所",
-            "佐藤太郎,30,東京",
-            "三木英子,25,大阪",
-            "髙橋淳,35,名古屋",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test.json",
-        mimetype="application/json",
-        charset="ascii",
-        url=None,
-        must_include=[
-            "5b64c88c-b3c3-4510-bcb8-da0b200602d8",
-            "9700dc99-6685-40b4-9a3a-5e406dcb37f3",
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test_rss.xml",
-        mimetype="text/xml",
-        charset="utf-8",
-        url=None,
-        must_include=[
-            "# The Official Microsoft Blog",
-            "## Ignite 2024: Why nearly 70% of the Fortune 500 now use Microsoft 365 Copilot",
-            "In the case of AI, it is absolutely true that the industry is moving incredibly fast",
-        ],
-        must_not_include=["<rss", "<feed"],
-    ),
-    FileTestVector(
-        filename="test_notebook.ipynb",
-        mimetype="application/json",
-        charset="ascii",
-        url=None,
-        must_include=[
-            "# Test Notebook",
-            "```python",
-            'print("markitdown")',
-            "```",
-            "## Code Cell Below",
-        ],
-        must_not_include=[
-            "nbformat",
-            "nbformat_minor",
-        ],
-    ),
-    FileTestVector(
-        filename="test_files.zip",
-        mimetype="application/zip",
-        charset=None,
-        url=None,
-        must_include=[
-            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
-            "49e168b7-d2ae-407f-a055-2167576f39a1",
-            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
-            "# Abstract",
-            "# Introduction",
-            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
-            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
-            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
-            "1b92870d-e3b5-4e65-8153-919f4ff45592",
-            "## 09060124-b5e7-4717-9d07-3c046eb",
-            "6ff4173b-42a5-4784-9b19-f49caff4d93d",
-            "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
-            "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
-            'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
-        ],
-        must_not_include=[],
-    ),
-    FileTestVector(
-        filename="test.epub",
-        mimetype="application/epub+zip",
-        charset=None,
-        url=None,
-        must_include=[
-            "**Authors:** Test Author",
-            "A test EPUB document for MarkItDown testing",
-            "# Chapter 1: Test Content",
-            "This is a **test** paragraph with some formatting",
-            "* A bullet point",
-            "* Another point",
-            "# Chapter 2: More Content",
-            "*different* style",
-            "> This is a blockquote for testing",
-        ],
-        must_not_include=[],
-    ),
-]
-
-
-DATA_URI_TEST_VECTORS = [
-    FileTestVector(
-        filename="test.docx",
-        mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-        charset=None,
-        url=None,
-        must_include=[
-            "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
-            "49e168b7-d2ae-407f-a055-2167576f39a1",
-            "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
-            "# Abstract",
-            "# Introduction",
-            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-            "data:image/png;base64,iVBORw0KGgoAAAANSU",
-        ],
-        must_not_include=[
-            "data:image/png;base64...",
-        ],
-    ),
-    FileTestVector(
-        filename="test.pptx",
-        mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
-        charset=None,
-        url=None,
-        must_include=[
-            "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
-            "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
-            "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
-            "1b92870d-e3b5-4e65-8153-919f4ff45592",
-            "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
-            "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
-            "2003",  # chart value
-            "![This phrase of the caption is Human-written.]",  # image caption
-            "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQE",
-        ],
-        must_not_include=[
-            "",
-        ],
-    ),
-]
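A sketch of how a vector like those deleted above is typically consumed (hypothetical harness code, not from the repository; `markdown` stands in for a real conversion result):

```python
def satisfies(vector, markdown: str) -> bool:
    # True when every required substring is present and every forbidden one is absent.
    return all(s in markdown for s in vector.must_include) and all(
        s not in markdown for s in vector.must_not_include
    )
```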
packages/markitdown/tests/test_cli.py (new file, +119 lines)
@@ -0,0 +1,119 @@
+#!/usr/bin/env python3 -m pytest
+import os
+import subprocess
+import pytest
+from markitdown import __version__
+
+try:
+    from .test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
+except ImportError:
+    from test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
+
+
+@pytest.fixture(scope="session")
+def shared_tmp_dir(tmp_path_factory):
+    return tmp_path_factory.mktemp("pytest_tmp")
+
+
+def test_version(shared_tmp_dir) -> None:
+    result = subprocess.run(
+        ["python", "-m", "markitdown", "--version"], capture_output=True, text=True
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"
+
+
+def test_invalid_flag(shared_tmp_dir) -> None:
+    result = subprocess.run(
+        ["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
+    )
+
+    assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
+    assert (
+        "unrecognized arguments" in result.stderr
+    ), f"Expected 'unrecognized arguments' to appear in STDERR"
+    assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
+
+
+def test_output_to_stdout(shared_tmp_dir) -> None:
+    # DOC X
+    result = subprocess.run(
+        ["python", "-m", "markitdown", os.path.join(TEST_FILES_DIR, "test.docx")],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    for test_string in DOCX_TEST_STRINGS:
+        assert (
+            test_string in result.stdout
+        ), f"Expected string not found in output: {test_string}"
+
+
+def test_output_to_file(shared_tmp_dir) -> None:
+    # DOC X, flag -o at the end
+    docx_output_file_1 = os.path.join(shared_tmp_dir, "test_docx_1.md")
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            os.path.join(TEST_FILES_DIR, "test.docx"),
+            "-o",
+            docx_output_file_1,
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert os.path.exists(
+        docx_output_file_1
+    ), f"Output file not created: {docx_output_file_1}"
+
+    with open(docx_output_file_1, "r") as f:
+        output = f.read()
+        for test_string in DOCX_TEST_STRINGS:
+            assert (
+                test_string in output
+            ), f"Expected string not found in output: {test_string}"
+
+    # DOC X, flag -o at the beginning
+    docx_output_file_2 = os.path.join(shared_tmp_dir, "test_docx_2.md")
+    result = subprocess.run(
+        [
+            "python",
+            "-m",
+            "markitdown",
+            "-o",
+            docx_output_file_2,
+            os.path.join(TEST_FILES_DIR, "test.docx"),
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
+    assert os.path.exists(
+        docx_output_file_2
+    ), f"Output file not created: {docx_output_file_2}"
+
+    with open(docx_output_file_2, "r") as f:
+        output = f.read()
+        for test_string in DOCX_TEST_STRINGS:
+            assert (
+                test_string in output
+            ), f"Expected string not found in output: {test_string}"
+
+
+if __name__ == "__main__":
+    """Runs this file's tests from the command line."""
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmp_dir:
+        test_version(tmp_dir)
+        test_invalid_flag(tmp_dir)
+        test_output_to_stdout(tmp_dir)
+        test_output_to_file(tmp_dir)
+        print("All tests passed!")
@@ -1,35 +0,0 @@
#!/usr/bin/env python3 -m pytest
import subprocess
import pytest
from markitdown import __version__


# This file contains CLI tests that are not directly tested by the FileTestVectors.
# This includes things like help messages, version numbers, and invalid flags.


def test_version() -> None:
    result = subprocess.run(
        ["python", "-m", "markitdown", "--version"], capture_output=True, text=True
    )

    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"


def test_invalid_flag() -> None:
    result = subprocess.run(
        ["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
    )

    assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
    assert (
        "unrecognized arguments" in result.stderr
    ), f"Expected 'unrecognized arguments' to appear in STDERR"
    assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"


if __name__ == "__main__":
    """Runs this file's tests from the command line."""
    test_version()
    test_invalid_flag()
    print("All tests passed!")
@@ -1,227 +0,0 @@
#!/usr/bin/env python3 -m pytest
import os
import time
import pytest
import subprocess
import locale
from typing import List

if __name__ == "__main__":
    from _test_vectors import (
        GENERAL_TEST_VECTORS,
        DATA_URI_TEST_VECTORS,
        FileTestVector,
    )
else:
    from ._test_vectors import (
        GENERAL_TEST_VECTORS,
        DATA_URI_TEST_VECTORS,
        FileTestVector,
    )

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
)  # Don't run these tests in CI

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"


# Prepare CLI test vectors (remove vectors that require mocking the url)
CLI_TEST_VECTORS: List[FileTestVector] = []
for test_vector in GENERAL_TEST_VECTORS:
    if test_vector.url is not None:
        continue
    CLI_TEST_VECTORS.append(test_vector)


@pytest.fixture(scope="session")
def shared_tmp_dir(tmp_path_factory):
    return tmp_path_factory.mktemp("pytest_tmp")


@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
def test_output_to_stdout(shared_tmp_dir, test_vector) -> None:
    """Test that the CLI outputs to stdout correctly."""

    result = subprocess.run(
        [
            "python",
            "-m",
            "markitdown",
            os.path.join(TEST_FILES_DIR, test_vector.filename),
        ],
        capture_output=True,
        text=True,
    )

    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    for test_string in test_vector.must_include:
        assert test_string in result.stdout
    for test_string in test_vector.must_not_include:
        assert test_string not in result.stdout


@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
def test_output_to_file(shared_tmp_dir, test_vector) -> None:
    """Test that the CLI outputs to a file correctly."""

    output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
    result = subprocess.run(
        [
            "python",
            "-m",
            "markitdown",
            "-o",
            output_file,
            os.path.join(TEST_FILES_DIR, test_vector.filename),
        ],
        capture_output=True,
        text=True,
    )

    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    assert os.path.exists(output_file), f"Output file not created: {output_file}"

    with open(output_file, "r") as f:
        output_data = f.read()
        for test_string in test_vector.must_include:
            assert test_string in output_data
        for test_string in test_vector.must_not_include:
            assert test_string not in output_data

    os.remove(output_file)
    assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"


@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
def test_input_from_stdin_without_hints(shared_tmp_dir, test_vector) -> None:
    """Test that the CLI reads from stdin correctly."""

    test_input = b""
    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        test_input = stream.read()

    result = subprocess.run(
        [
            "python",
            "-m",
            "markitdown",
            os.path.join(TEST_FILES_DIR, test_vector.filename),
        ],
        input=test_input,
        capture_output=True,
        text=False,
    )

    stdout = result.stdout.decode(locale.getpreferredencoding())
    assert (
        result.returncode == 0
    ), f"CLI exited with error: {result.stderr.decode('utf-8')}"
    for test_string in test_vector.must_include:
        assert test_string in stdout
    for test_string in test_vector.must_not_include:
        assert test_string not in stdout


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
)
@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
def test_convert_url(shared_tmp_dir, test_vector):
    """Test CLI conversion of a URL."""
    # Note: tmp_dir is not used here, but is needed to match the signature

    markitdown = MarkItDown()

    time.sleep(1)  # Ensure we don't hit rate limits
    result = subprocess.run(
        ["python", "-m", "markitdown", TEST_FILES_URL + "/" + test_vector.filename],
        capture_output=True,
        text=False,
    )

    stdout = result.stdout.decode(locale.getpreferredencoding())
    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    for test_string in test_vector.must_include:
        assert test_string in stdout
    for test_string in test_vector.must_not_include:
        assert test_string not in stdout


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_output_to_file_with_data_uris(shared_tmp_dir, test_vector) -> None:
    """Test CLI functionality when keep_data_uris is enabled"""

    output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
    result = subprocess.run(
        [
            "python",
            "-m",
            "markitdown",
            "--keep-data-uris",
            "-o",
            output_file,
            os.path.join(TEST_FILES_DIR, test_vector.filename),
        ],
        capture_output=True,
        text=True,
    )

    assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
    assert os.path.exists(output_file), f"Output file not created: {output_file}"

    with open(output_file, "r") as f:
        output_data = f.read()
        for test_string in test_vector.must_include:
            assert test_string in output_data
        for test_string in test_vector.must_not_include:
            assert test_string not in output_data

    os.remove(output_file)
    assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"


if __name__ == "__main__":
    import sys
    import tempfile

    """Runs this file's tests from the command line."""

    with tempfile.TemporaryDirectory() as tmp_dir:
        # General tests
        for test_function in [
            test_output_to_stdout,
            test_output_to_file,
            test_input_from_stdin_without_hints,
            test_convert_url,
        ]:
            for test_vector in CLI_TEST_VECTORS:
                print(
                    f"Running {test_function.__name__} on {test_vector.filename}...",
                    end="",
                )
                test_function(tmp_dir, test_vector)
                print("OK")

        # Data URI tests
        for test_function in [
            test_output_to_file_with_data_uris,
        ]:
            for test_vector in DATA_URI_TEST_VECTORS:
                print(
                    f"Running {test_function.__name__} on {test_vector.filename}...",
                    end="",
                )
                test_function(tmp_dir, test_vector)
                print("OK")

    print("All tests passed!")
Binary file not shown.
BIN packages/markitdown/tests/test_files/test.docx (Executable file → Normal file)
Binary file not shown.
@@ -1,89 +1,89 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0f61db80",
   "metadata": {},
   "source": [
    "# Test Notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3f2a5bbd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "markitdown\n"
     ]
    }
   ],
   "source": [
-   "print(\"markitdown\")"
+   "print('markitdown')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b9c0468",
   "metadata": {},
   "source": [
    "## Code Cell Below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "37d8088a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "42\n"
     ]
    }
   ],
   "source": [
    "# comment in code\n",
    "print(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e3177bd",
   "metadata": {},
   "source": [
    "End\n",
    "\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  },
  "title": "Test Notebook Title"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
416 packages/markitdown/tests/test_markitdown.py Normal file
@@ -0,0 +1,416 @@
#!/usr/bin/env python3 -m pytest
import io
import os
import shutil

import pytest
import requests

from warnings import catch_warnings, resetwarnings

from markitdown import MarkItDown

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
)  # Don't run these tests in CI


# Don't run the llm tests without a key and the client library
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
try:
    import openai
except ModuleNotFoundError:
    skip_llm = True

# Skip exiftool tests if not installed
skip_exiftool = shutil.which("exiftool") is None

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")

JPG_TEST_EXIFTOOL = {
    "Author": "AutoGen Authors",
    "Title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "Description": "AutoGen enables diverse LLM-based applications",
    "ImageSize": "1615x1967",
    "DateTimeOriginal": "2024:03:14 22:10:00",
}

PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf"
PDF_TEST_STRINGS = [
    "While there is contemporaneous exploration of multi-agent approaches"
]

YOUTUBE_TEST_URL = "https://www.youtube.com/watch?v=V2qZ_lgxTzg"
YOUTUBE_TEST_STRINGS = [
    "## AutoGen FULL Tutorial with Python (Step-By-Step)",
    "This is an intermediate tutorial for installing and using AutoGen locally",
    "PT15M4S",
    "the model we're going to be using today is GPT 3.5 turbo",  # From the transcript
]

XLSX_TEST_STRINGS = [
    "## 09060124-b5e7-4717-9d07-3c046eb",
    "6ff4173b-42a5-4784-9b19-f49caff4d93d",
    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]

XLS_TEST_STRINGS = [
    "## 09060124-b5e7-4717-9d07-3c046eb",
    "6ff4173b-42a5-4784-9b19-f49caff4d93d",
    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]

DOCX_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
    "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
    "# Abstract",
    "# Introduction",
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
]

MSG_TEST_STRINGS = [
    "# Email Message",
    "**From:** test.sender@example.com",
    "**To:** test.recipient@example.com",
    "**Subject:** Test Email Message",
    "## Content",
    "This is the body of the test email message",
]

DOCX_COMMENT_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
    "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
    "# Abstract",
    "# Introduction",
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "This is a test comment. 12df-321a",
    "Yet another comment in the doc. 55yiyi-asd09",
]

PPTX_TEST_STRINGS = [
    "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
    "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
    "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
    "1b92870d-e3b5-4e65-8153-919f4ff45592",
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
    "2003",  # chart value
]

BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
BLOG_TEST_STRINGS = [
    "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
    "an example where high cost can easily prevent a generic complex",
]


RSS_TEST_STRINGS = [
    "The Official Microsoft Blog",
    "In the case of AI, it is absolutely true that the industry is moving incredibly fast",
]


WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
WIKIPEDIA_TEST_STRINGS = [
    "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
    'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
]
WIKIPEDIA_TEST_EXCLUDES = [
    "You are encouraged to create an account and log in",
    "154 languages",
    "move to sidebar",
]

SERP_TEST_URL = "https://www.bing.com/search?q=microsoft+wikipedia"
SERP_TEST_STRINGS = [
    "](https://en.wikipedia.org/wiki/Microsoft",
    "Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
    "1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
]
SERP_TEST_EXCLUDES = [
    "https://www.bing.com/ck/a?!&&p=",
    "data:image/svg+xml,%3Csvg%20width%3D",
]

CSV_CP932_TEST_STRINGS = [
    "名前,年齢,住所",
    "佐藤太郎,30,東京",
    "三木英子,25,大阪",
    "髙橋淳,35,名古屋",
]

LLM_TEST_STRINGS = [
    "5bda1dd6",
]

JSON_TEST_STRINGS = [
    "5b64c88c-b3c3-4510-bcb8-da0b200602d8",
    "9700dc99-6685-40b4-9a3a-5e406dcb37f3",
]


# --- Helper Functions ---
def validate_strings(result, expected_strings, exclude_strings=None):
    """Validate presence or absence of specific strings."""
    text_content = result.text_content.replace("\\", "")
    for string in expected_strings:
        assert string in text_content
    if exclude_strings:
        for string in exclude_strings:
            assert string not in text_content


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
)
def test_markitdown_remote() -> None:
    markitdown = MarkItDown()

    # By URL
    result = markitdown.convert(PDF_TEST_URL)
    for test_string in PDF_TEST_STRINGS:
        assert test_string in result.text_content

    # By stream
    response = requests.get(PDF_TEST_URL)
    result = markitdown.convert_stream(
        io.BytesIO(response.content), file_extension=".pdf", url=PDF_TEST_URL
    )
    for test_string in PDF_TEST_STRINGS:
        assert test_string in result.text_content

    # Youtube
    # TODO: This test randomly fails for some reason. Haven't been able to repro it yet. Disabling until I can debug the issue
    # result = markitdown.convert(YOUTUBE_TEST_URL)
    # for test_string in YOUTUBE_TEST_STRINGS:
    #     assert test_string in result.text_content


def test_markitdown_local_paths() -> None:
    markitdown = MarkItDown()

    # Test XLSX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
    validate_strings(result, XLSX_TEST_STRINGS)

    # Test XLS processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
    for test_string in XLS_TEST_STRINGS:
        text_content = result.text_content.replace("\\", "")
        assert test_string in text_content

    # Test DOCX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
    validate_strings(result, DOCX_TEST_STRINGS)

    # Test DOCX processing, with comments
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
        style_map="comment-reference => ",
    )
    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test DOCX processing, with comments and setting style_map on init
    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
    result = markitdown_with_style_map.convert(
        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
    )
    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test PPTX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
    validate_strings(result, PPTX_TEST_STRINGS)

    # Test HTML processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
    )
    validate_strings(result, BLOG_TEST_STRINGS)

    # Test ZIP file processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
    validate_strings(result, XLSX_TEST_STRINGS)

    # Test Wikipedia processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
    )
    text_content = result.text_content.replace("\\", "")
    validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)

    # Test Bing processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
    )
    text_content = result.text_content.replace("\\", "")
    validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)

    # Test RSS processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
    text_content = result.text_content.replace("\\", "")
    for test_string in RSS_TEST_STRINGS:
        assert test_string in text_content

    ## Test non-UTF-8 encoding
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
    validate_strings(result, CSV_CP932_TEST_STRINGS)

    # Test MSG (Outlook email) processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
    validate_strings(result, MSG_TEST_STRINGS)

    # Test JSON processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
    validate_strings(result, JSON_TEST_STRINGS)

    # Test input with leading blank characters
    input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
    result = markitdown.convert_stream(io.BytesIO(input_data))
    assert "# Test" in result.text_content


def test_markitdown_local_objects() -> None:
    markitdown = MarkItDown()

    # Test XLSX processing
    with open(os.path.join(TEST_FILES_DIR, "test.xlsx"), "rb") as f:
        result = markitdown.convert(f, file_extension=".xlsx")
        validate_strings(result, XLSX_TEST_STRINGS)

    # Test XLS processing
    with open(os.path.join(TEST_FILES_DIR, "test.xls"), "rb") as f:
        result = markitdown.convert(f, file_extension=".xls")
        for test_string in XLS_TEST_STRINGS:
            text_content = result.text_content.replace("\\", "")
            assert test_string in text_content

    # Test DOCX processing
    with open(os.path.join(TEST_FILES_DIR, "test.docx"), "rb") as f:
        result = markitdown.convert(f, file_extension=".docx")
        validate_strings(result, DOCX_TEST_STRINGS)

    # Test DOCX processing, with comments
    with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
        result = markitdown.convert(
            f,
            file_extension=".docx",
            style_map="comment-reference => ",
        )
        validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test DOCX processing, with comments and setting style_map on init
    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
    with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
        result = markitdown_with_style_map.convert(f, file_extension=".docx")
        validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test PPTX processing
    with open(os.path.join(TEST_FILES_DIR, "test.pptx"), "rb") as f:
        result = markitdown.convert(f, file_extension=".pptx")
        validate_strings(result, PPTX_TEST_STRINGS)

    # Test HTML processing
    with open(
        os.path.join(TEST_FILES_DIR, "test_blog.html"), "rt", encoding="utf-8"
    ) as f:
        result = markitdown.convert(f, file_extension=".html", url=BLOG_TEST_URL)
        validate_strings(result, BLOG_TEST_STRINGS)

    # Test Wikipedia processing
    with open(
        os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), "rt", encoding="utf-8"
    ) as f:
        result = markitdown.convert(f, file_extension=".html", url=WIKIPEDIA_TEST_URL)
        text_content = result.text_content.replace("\\", "")
        validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)

    # Test Bing processing
    with open(
        os.path.join(TEST_FILES_DIR, "test_serp.html"), "rt", encoding="utf-8"
    ) as f:
        result = markitdown.convert(f, file_extension=".html", url=SERP_TEST_URL)
        text_content = result.text_content.replace("\\", "")
        validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)

    # Test RSS processing
    with open(os.path.join(TEST_FILES_DIR, "test_rss.xml"), "rb") as f:
        result = markitdown.convert(f, file_extension=".xml")
        text_content = result.text_content.replace("\\", "")
        for test_string in RSS_TEST_STRINGS:
            assert test_string in text_content

    # Test MSG (Outlook email) processing
    with open(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"), "rb") as f:
        result = markitdown.convert(f, file_extension=".msg")
        validate_strings(result, MSG_TEST_STRINGS)

    # Test JSON processing
    with open(os.path.join(TEST_FILES_DIR, "test.json"), "rb") as f:
        result = markitdown.convert(f, file_extension=".json")
        validate_strings(result, JSON_TEST_STRINGS)


@pytest.mark.skipif(
    skip_exiftool,
    reason="do not run if exiftool is not installed",
)
def test_markitdown_exiftool() -> None:
    # Test the automatic discovery of exiftool throws a warning
    # and is disabled
    try:
        with catch_warnings(record=True) as w:
            markitdown = MarkItDown()
            result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
            assert len(w) == 1
            assert w[0].category is DeprecationWarning
            assert result.text_content.strip() == ""
    finally:
        resetwarnings()

    # Test explicitly setting the location of exiftool
    which_exiftool = shutil.which("exiftool")
    markitdown = MarkItDown(exiftool_path=which_exiftool)
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
    for key in JPG_TEST_EXIFTOOL:
        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
        assert target in result.text_content

    # Test setting the exiftool path through an environment variable
    os.environ["EXIFTOOL_PATH"] = which_exiftool
    markitdown = MarkItDown()
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
    for key in JPG_TEST_EXIFTOOL:
        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
        assert target in result.text_content


@pytest.mark.skipif(
    skip_llm,
    reason="do not run llm tests without a key",
)
def test_markitdown_llm() -> None:
    client = openai.OpenAI()
    markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")

    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))

    for test_string in LLM_TEST_STRINGS:
        assert test_string in result.text_content

    # This is not super precise. It would also accept "red square", "blue circle",
    # "the square is not blue", etc. But it's sufficient for this test.
    for test_string in ["red", "circle", "blue", "square"]:
        assert test_string in result.text_content.lower()


if __name__ == "__main__":
    """Runs this file's tests from the command line."""
    test_markitdown_remote()
    test_markitdown_local_paths()
    test_markitdown_local_objects()
    test_markitdown_exiftool()
    # test_markitdown_llm()
    print("All tests passed!")
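The validate_strings helper above does the heavy lifting in this module: it strips backslash escapes from the converted text and then checks substring presence and absence. As a tiny self-contained illustration, the helper only needs an object exposing text_content, so it can be exercised without running a real conversion (the SimpleNamespace stand-in below is ours, not part of the suite):

# Illustrative only: exercises the validate_strings logic from the new
# test_markitdown.py against a fake conversion result.
from types import SimpleNamespace


def validate_strings(result, expected_strings, exclude_strings=None):
    """Validate presence or absence of specific strings."""
    text_content = result.text_content.replace("\\", "")
    for string in expected_strings:
        assert string in text_content
    if exclude_strings:
        for string in exclude_strings:
            assert string not in text_content


# SimpleNamespace mimics a result object with a text_content attribute.
fake_result = SimpleNamespace(text_content="# Abstract\n\nSome body text.")
validate_strings(fake_result, ["# Abstract"], exclude_strings=["# Introduction"])
print("validate_strings accepted the expected strings")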
@@ -1,328 +0,0 @@
#!/usr/bin/env python3 -m pytest
import io
import os
import shutil
import openai
import pytest

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)

# This file contains module tests that are not directly tested by the FileTestVectors.
# This includes things like helper functions and runtime conversion options
# (e.g., LLM clients, exiftool path, transcription services, etc.)

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
)  # Don't run these tests in CI


# Don't run the llm tests without a key and the client library
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
try:
    import openai
except ModuleNotFoundError:
    skip_llm = True

# Skip exiftool tests if not installed
skip_exiftool = shutil.which("exiftool") is None

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")

JPG_TEST_EXIFTOOL = {
    "Author": "AutoGen Authors",
    "Title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "Description": "AutoGen enables diverse LLM-based applications",
    "ImageSize": "1615x1967",
    "DateTimeOriginal": "2024:03:14 22:10:00",
}

MP3_TEST_EXIFTOOL = {
    "Title": "f67a499e-a7d0-4ca3-a49b-358bd934ae3e",
    "Artist": "Artist Name Test String",
    "Album": "Album Name Test String",
    "SampleRate": "48000",
}

PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf"
PDF_TEST_STRINGS = [
    "While there is contemporaneous exploration of multi-agent approaches"
]

YOUTUBE_TEST_URL = "https://www.youtube.com/watch?v=V2qZ_lgxTzg"
YOUTUBE_TEST_STRINGS = [
    "## AutoGen FULL Tutorial with Python (Step-By-Step)",
    "This is an intermediate tutorial for installing and using AutoGen locally",
    "PT15M4S",
    "the model we're going to be using today is GPT 3.5 turbo",  # From the transcript
]

DOCX_COMMENT_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
    "## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
    "# Abstract",
    "# Introduction",
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "This is a test comment. 12df-321a",
    "Yet another comment in the doc. 55yiyi-asd09",
]

BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
BLOG_TEST_STRINGS = [
    "Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
    "an example where high cost can easily prevent a generic complex",
]

LLM_TEST_STRINGS = [
    "5bda1dd6",
]

PPTX_TEST_STRINGS = [
    "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
    "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
    "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
    "1b92870d-e3b5-4e65-8153-919f4ff45592",
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
    "a3f6004b-6f4f-4ea8-bee3-3741f4dc385f",  # chart title
    "2003",  # chart value
]


# --- Helper Functions ---
def validate_strings(result, expected_strings, exclude_strings=None):
    """Validate presence or absence of specific strings."""
    text_content = result.text_content.replace("\\", "")
    for string in expected_strings:
        assert string in text_content
    if exclude_strings:
        for string in exclude_strings:
            assert string not in text_content


def test_stream_info_operations() -> None:
    """Test operations performed on StreamInfo objects."""

    stream_info_original = StreamInfo(
        mimetype="mimetype.1",
        extension="extension.1",
        charset="charset.1",
        filename="filename.1",
        local_path="local_path.1",
        url="url.1",
    )

    # Check updating all attributes by keyword
    keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
    for keyword in keywords:
        updated_stream_info = stream_info_original.copy_and_update(
            **{keyword: f"{keyword}.2"}
        )

        # Make sure the targeted attribute is updated
        assert getattr(updated_stream_info, keyword) == f"{keyword}.2"

        # Make sure the other attributes are unchanged
        for k in keywords:
            if k != keyword:
                assert getattr(stream_info_original, k) == getattr(
                    updated_stream_info, k
                )

    # Check updating all attributes by passing a new StreamInfo object
    keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
    for keyword in keywords:
        updated_stream_info = stream_info_original.copy_and_update(
            StreamInfo(**{keyword: f"{keyword}.2"})
        )

        # Make sure the targeted attribute is updated
        assert getattr(updated_stream_info, keyword) == f"{keyword}.2"

        # Make sure the other attributes are unchanged
        for k in keywords:
            if k != keyword:
                assert getattr(stream_info_original, k) == getattr(
                    updated_stream_info, k
                )

    # Check mixing and matching
    updated_stream_info = stream_info_original.copy_and_update(
        StreamInfo(extension="extension.2", filename="filename.2"),
        mimetype="mimetype.3",
        charset="charset.3",
    )
    assert updated_stream_info.extension == "extension.2"
    assert updated_stream_info.filename == "filename.2"
    assert updated_stream_info.mimetype == "mimetype.3"
    assert updated_stream_info.charset == "charset.3"
    assert updated_stream_info.local_path == "local_path.1"
    assert updated_stream_info.url == "url.1"

    # Check multiple StreamInfo objects
    updated_stream_info = stream_info_original.copy_and_update(
        StreamInfo(extension="extension.4", filename="filename.5"),
        StreamInfo(mimetype="mimetype.6", charset="charset.7"),
    )
    assert updated_stream_info.extension == "extension.4"
    assert updated_stream_info.filename == "filename.5"
    assert updated_stream_info.mimetype == "mimetype.6"
    assert updated_stream_info.charset == "charset.7"
    assert updated_stream_info.local_path == "local_path.1"
    assert updated_stream_info.url == "url.1"


def test_docx_comments() -> None:
    markitdown = MarkItDown()

    # Test DOCX processing, with comments and setting style_map on init
    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
    result = markitdown_with_style_map.convert(
        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
    )
    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)


def test_input_as_strings() -> None:
    markitdown = MarkItDown()

    # Test input from a stream
    input_data = b"<html><body><h1>Test</h1></body></html>"
    result = markitdown.convert_stream(io.BytesIO(input_data))
    assert "# Test" in result.text_content

    # Test input with leading blank characters
    input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
    result = markitdown.convert_stream(io.BytesIO(input_data))
    assert "# Test" in result.text_content


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
)
def test_markitdown_remote() -> None:
    markitdown = MarkItDown()

    # By URL
    result = markitdown.convert(PDF_TEST_URL)
    for test_string in PDF_TEST_STRINGS:
        assert test_string in result.text_content

    # Youtube
    result = markitdown.convert(YOUTUBE_TEST_URL)
    for test_string in YOUTUBE_TEST_STRINGS:
        assert test_string in result.text_content


@pytest.mark.skipif(
    skip_remote,
    reason="do not run remote speech transcription tests",
)
def test_speech_transcription() -> None:
    markitdown = MarkItDown()

    # Test WAV files, MP3 and M4A files
    for file_name in ["test.wav", "test.mp3", "test.m4a"]:
        result = markitdown.convert(os.path.join(TEST_FILES_DIR, file_name))
        result_lower = result.text_content.lower()
        assert (
            ("1" in result_lower or "one" in result_lower)
            and ("2" in result_lower or "two" in result_lower)
            and ("3" in result_lower or "three" in result_lower)
            and ("4" in result_lower or "four" in result_lower)
            and ("5" in result_lower or "five" in result_lower)
        )


def test_exceptions() -> None:
    # Check that an exception is raised when trying to convert an unsupported format
    markitdown = MarkItDown()
    with pytest.raises(UnsupportedFormatException):
        markitdown.convert(os.path.join(TEST_FILES_DIR, "random.bin"))

    # Check that an exception is raised when trying to convert a file that is corrupted
    with pytest.raises(FileConversionException) as exc_info:
        markitdown.convert(
            os.path.join(TEST_FILES_DIR, "random.bin"), file_extension=".pptx"
        )
    assert len(exc_info.value.attempts) == 1
    assert type(exc_info.value.attempts[0].converter).__name__ == "PptxConverter"


@pytest.mark.skipif(
    skip_exiftool,
    reason="do not run if exiftool is not installed",
)
def test_markitdown_exiftool() -> None:
    which_exiftool = shutil.which("exiftool")
    assert which_exiftool is not None

    # Test explicitly setting the location of exiftool
    markitdown = MarkItDown(exiftool_path=which_exiftool)
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
    for key in JPG_TEST_EXIFTOOL:
        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
        assert target in result.text_content

    # Test setting the exiftool path through an environment variable
    os.environ["EXIFTOOL_PATH"] = which_exiftool
    markitdown = MarkItDown()
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
    for key in JPG_TEST_EXIFTOOL:
        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
        assert target in result.text_content

    # Test some other media types
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.mp3"))
    for key in MP3_TEST_EXIFTOOL:
        target = f"{key}: {MP3_TEST_EXIFTOOL[key]}"
        assert target in result.text_content


@pytest.mark.skipif(
    skip_llm,
    reason="do not run llm tests without a key",
)
def test_markitdown_llm() -> None:
    client = openai.OpenAI()
    markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")

    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
    for test_string in LLM_TEST_STRINGS:
        assert test_string in result.text_content

    # This is not super precise. It would also accept "red square", "blue circle",
    # "the square is not blue", etc. But it's sufficient for this test.
    for test_string in ["red", "circle", "blue", "square"]:
        assert test_string in result.text_content.lower()

    # Images embedded in PPTX files
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
    # LLM Captions are included
    for test_string in LLM_TEST_STRINGS:
        assert test_string in result.text_content
    # Standard alt text is included
    validate_strings(result, PPTX_TEST_STRINGS)


if __name__ == "__main__":
    """Runs this file's tests from the command line."""
    for test in [
        test_stream_info_operations,
        test_docx_comments,
        test_input_as_strings,
        test_markitdown_remote,
        test_speech_transcription,
        test_exceptions,
        test_markitdown_exiftool,
        test_markitdown_llm,
    ]:
        print(f"Running {test.__name__}...", end="")
        test()
        print("OK")
    print("All tests passed!")
@@ -1,199 +0,0 @@
#!/usr/bin/env python3 -m pytest
import os
import time
import pytest
import codecs


if __name__ == "__main__":
    from _test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS
else:
    from ._test_vectors import GENERAL_TEST_VECTORS, DATA_URI_TEST_VECTORS

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
)  # Don't run these tests in CI

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_guess_stream_info(test_vector):
    """Test the ability to guess stream info."""
    markitdown = MarkItDown()

    local_path = os.path.join(TEST_FILES_DIR, test_vector.filename)
    expected_extension = os.path.splitext(test_vector.filename)[1]

    with open(local_path, "rb") as stream:
        guesses = markitdown._get_stream_info_guesses(
            stream,
            base_guess=StreamInfo(
                filename=os.path.basename(test_vector.filename),
                local_path=local_path,
                extension=expected_extension,
            ),
        )

    # For some limited exceptions, we can't guarantee the exact
    # mimetype or extension, so we'll special-case them here.
    if test_vector.filename in [
        "test_outlook_msg.msg",
    ]:
        return

    assert guesses[0].mimetype == test_vector.mimetype
    assert guesses[0].extension == expected_extension
    assert guesses[0].charset == test_vector.charset


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_local(test_vector):
    """Test the conversion of a local file."""
    markitdown = MarkItDown()

    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, test_vector.filename), url=test_vector.url
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_stream_with_hints(test_vector):
    """Test the conversion of a stream with full stream info."""
    markitdown = MarkItDown()

    stream_info = StreamInfo(
        extension=os.path.splitext(test_vector.filename)[1],
        mimetype=test_vector.mimetype,
        charset=test_vector.charset,
    )

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(
            stream, stream_info=stream_info, url=test_vector.url
        )
        for string in test_vector.must_include:
            assert string in result.markdown
        for string in test_vector.must_not_include:
            assert string not in result.markdown


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_stream_without_hints(test_vector):
    """Test the conversion of a stream with no stream info."""
    markitdown = MarkItDown()

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(stream, url=test_vector.url)
        for string in test_vector.must_include:
            assert string in result.markdown
        for string in test_vector.must_not_include:
            assert string not in result.markdown


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
)
@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_url(test_vector):
    """Test the conversion of a URL."""
    markitdown = MarkItDown()

    time.sleep(1)  # Ensure we don't hit rate limits

    result = markitdown.convert(
        TEST_FILES_URL + "/" + test_vector.filename,
        url=test_vector.url,  # Mock where this file would be found
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_convert_with_data_uris(test_vector):
    """Test API functionality when keep_data_uris is enabled"""
    markitdown = MarkItDown()

    # Test local file conversion
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, test_vector.filename),
        keep_data_uris=True,
        url=test_vector.url,
    )

    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", DATA_URI_TEST_VECTORS)
def test_convert_stream_with_data_uris(test_vector):
    """Test stream conversion when keep_data_uris is enabled."""
    markitdown = MarkItDown()

    stream_info = StreamInfo(
        extension=os.path.splitext(test_vector.filename)[1],
        mimetype=test_vector.mimetype,
        charset=test_vector.charset,
    )

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(
            stream, stream_info=stream_info, keep_data_uris=True, url=test_vector.url
        )

        for string in test_vector.must_include:
            assert string in result.markdown
        for string in test_vector.must_not_include:
            assert string not in result.markdown


if __name__ == "__main__":
    import sys

    """Runs this file's tests from the command line."""

    # General tests
    for test_function in [
        test_guess_stream_info,
        test_convert_local,
        test_convert_stream_with_hints,
        test_convert_stream_without_hints,
        test_convert_url,
    ]:
        for test_vector in GENERAL_TEST_VECTORS:
            print(
                f"Running {test_function.__name__} on {test_vector.filename}...", end=""
            )
            test_function(test_vector)
            print("OK")

    # Data URI tests
    for test_function in [
        test_convert_with_data_uris,
        test_convert_stream_with_data_uris,
    ]:
        for test_vector in DATA_URI_TEST_VECTORS:
            print(
                f"Running {test_function.__name__} on {test_vector.filename}...", end=""
            )
            test_function(test_vector)
            print("OK")

    print("All tests passed!")