small formatting change

Set exiftool path explicitly. (#267)
Removed the holiday away message from README.md (#266)
2025-01-14 18:04:14 -05:00 · 2025-01-06 12:43:47 -08:00 · 2025-01-06 09:06:21 -08:00 · 2025-01-03 16:40:43 -08:00 · 2025-01-03 16:03:11 -08:00 · 2025-01-03 14:34:33 -08:00
17 changed files with 796 additions and 151 deletions
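
The headline change ("Set exiftool path explicitly") replaces implicit `shutil.which("exiftool")` discovery with an explicit `exiftool_path` constructor argument that falls back to the `EXIFTOOL_PATH` environment variable. The lookup order can be sketched as follows. This is a minimal illustration, not the library's code: `resolve_exiftool_path` is a hypothetical helper name, and only the argument name and the environment variable come from the diff below.

```python
import os
from typing import Optional


def resolve_exiftool_path(explicit_path: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper mirroring the lookup order introduced by this change."""
    # 1. An explicitly passed path always wins.
    if explicit_path is not None:
        return explicit_path
    # 2. Otherwise fall back to the EXIFTOOL_PATH environment variable.
    #    There is no implicit PATH search any more, so this may return None,
    #    in which case no EXIF metadata is extracted.
    return os.environ.get("EXIFTOOL_PATH")
```

Callers who relied on the old behaviour can pass the discovered path themselves, as the new deprecation warning suggests: `MarkItDown(exiftool_path=shutil.which("exiftool"))`.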
--- a/.devcontainer/devcontainer.json
+++ b/.devcontainer/devcontainer.json
@@ -0,0 +1,32 @@
+// For format details, see https://aka.ms/devcontainer.json. For config options, see the
+// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
+{
+	"name": "Existing Dockerfile",
+	"build": {
+		// Sets the run context to one level up instead of the .devcontainer folder.
+		"context": "..",
+		// Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
+		"dockerfile": "../Dockerfile",
+		"args": {
+			"INSTALL_GIT": "true"
+		}
+	},
+
+	// Features to add to the dev container. More info: https://containers.dev/features.
+	// "features": {},
+	"features": {
+		"ghcr.io/devcontainers-extra/features/hatch:2": {}
+	},
+
+	// Use 'forwardPorts' to make a list of ports inside the container available locally.
+	// "forwardPorts": [],
+
+	// Uncomment the next line to run commands after the container is created.
+	// "postCreateCommand": "cat /etc/os-release",
+
+	// Configure tool-specific properties.
+	// "customizations": {},
+
+	// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
+	"remoteUser": "root"
+}
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -0,0 +1,6 @@
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
--- a/.github/workflows/pre-commit.yml
+++ b/.github/workflows/pre-commit.yml
@@ -5,9 +5,9 @@ jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
      - name: Set up Python
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v5
        with:
          python-version: "3.x"

--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -5,8 +5,8 @@ jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v3
-      - uses: actions/setup-python@v4
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
        with:
          python-version: |
            3.10
@@ -14,7 +14,7 @@ jobs:
            3.12
      - name: Set up pip cache
        if: runner.os == 'Linux'
-        uses: actions/cache@v3
+        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,5 @@
+.vscode
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -160,3 +162,5 @@ cython_debug/
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+src/.DS_Store
+.DS_Store
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,16 +1,22 @@
-FROM python:3.13-alpine
+FROM python:3.13-slim-bullseye

 USER root

+ARG INSTALL_GIT=false
+RUN if [ "$INSTALL_GIT" = "true" ]; then \
+    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
+    fi
+
 # Runtime dependency
-RUN apk add --no-cache ffmpeg
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ffmpeg \
+    && rm -rf /var/lib/apt/lists/*

 RUN pip install markitdown

 # Default USERID and GROUPID
 ARG USERID=10000
 ARG GROUPID=10000
-
 USER $USERID:$GROUPID

 ENTRYPOINT [ "markitdown" ]
--- a/README.md
+++ b/README.md
@@ -1,66 +1,57 @@
 # MarkItDown

 [![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
+![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
+[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)

-The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)

-It presently supports:
+MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.).
+It supports:
+- PDF
+- PowerPoint
+- Word
+- Excel
+- Images (EXIF metadata and OCR)
+- Audio (EXIF metadata and speech transcription)
+- HTML
+- Text-based formats (CSV, JSON, XML)
+- ZIP files (iterates over contents)

- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
+To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .`

-# Installation
+## Usage

-You can install `markitdown` using pip:
-
-```python
-pip install markitdown
-```
-
-or from the source
-
-```sh
-pip install -e .
-```
-
-# Usage
-The API is simple:
-
-```python
-from markitdown import MarkItDown
-
-markitdown = MarkItDown()
-result = markitdown.convert("test.xlsx")
-print(result.text_content)
-```
-
-To use this as a command-line utility, install it and then run it like this:
-
-```bash
-markitdown path-to-file.pdf
-```
-
-This will output Markdown to standard output. You can save it like this:
+### Command-Line

 ```bash
 markitdown path-to-file.pdf > document.md
 ```

-You can pipe content to standard input by omitting the argument:
+Or use `-o` to specify the output file:
+
+```bash
+markitdown path-to-file.pdf -o document.md
+```
+
+You can also pipe content:

 ```bash
 cat path-to-file.pdf | markitdown
 ```

-You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client.
+### Python API

+Basic usage in Python:
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("test.xlsx")
+print(result.text_content)
+```
+
+To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:

 ```python
 from markitdown import MarkItDown
@@ -72,13 +63,49 @@ result = md.convert("example.jpg")
 print(result.text_content)
 ```

-You can also use the project as Docker Image:
+### Docker

 ```sh
 docker build -t markitdown:latest .
 docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 ```
+<details>
+    
+<summary>Batch Processing Multiple Files</summary>

+This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
+
+
+```python convert.py
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+client = OpenAI(api_key="your-api-key-here")
+md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
+supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
+files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
+for file in files_to_convert:
+    print(f"\nConverting {file}...")
+    try:
+        md_file = os.path.splitext(file)[0] + '.md'
+        result = md.convert(file)
+        with open(md_file, 'w', encoding='utf-8') as f:
+            f.write(result.text_content)
+        
+        print(f"Successfully converted {file} to {md_file}")
+    except Exception as e:
+        print(f"Error converting {file}: {str(e)}")
+
+print("\nAll conversions completed!")
+```
+1. Save the script above as `convert.py` in the same directory as your files.
+2. Install the required packages (for example, `openai`).
+3. Run the script: `python convert.py`
+
+Note that original files will remain unchanged and new markdown files are created with the same base name.
+
+</details>
+   
 ## Contributing

 This project welcomes contributions and suggestions.  Most contributions require you to agree to a
@@ -93,28 +120,41 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
 For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
 contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

-### Running Tests
+### How to Contribute

-To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install).
+You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions, and you are welcome to contribute in any way you like.

-```sh
-pip install hatch
-hatch shell
-hatch test
-```

-### Running Pre-commit Checks
+<div align="center">

-Please run the pre-commit checks before submitting a PR.
+|                       | All                                      | Especially Needs Help from Community                                                                 |
+|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
+| **Issues**            | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
+| **PRs**               | [All PRs](https://github.com/microsoft/markitdown/pulls)     | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22)               |

-```sh
-pre-commit run --all-files
-```
+</div>
+
+### Running Tests and Checks
+
+- Install `hatch` in your environment and run tests:
+    ```sh
+    pip install hatch  # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
+    hatch shell
+    hatch test
+    ```
+
+  (Alternative) Use the Devcontainer which has all the dependencies installed:
+    ```sh
+    # Reopen the project in Devcontainer and run:
+    hatch test
+    ```
+
+- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`

 ## Trademarks

-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
-trademarks or logos is subject to and must follow 
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
+trademarks or logos is subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
 [project]
 name = "markitdown"
 dynamic = ["version"]
-description = ''
+description = 'Utility tool for converting various files to Markdown'
 readme = "README.md"
 requires-python = ">=3.10"
 license = "MIT"
@@ -32,9 +32,11 @@ dependencies = [
  "python-pptx",
  "pandas",
  "openpyxl",
+  "xlrd",
  "pdfminer.six",
  "puremagic",
  "pydub",
+  "olefile",
  "youtube-transcript-api",
  "SpeechRecognition",
  "pathvalidate",
--- a/src/markitdown/main.py
+++ b/src/markitdown/main.py
@@ -1,45 +1,80 @@
 # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
 #
 # SPDX-License-Identifier: MIT
-import sys
 import argparse
-from ._markitdown import MarkItDown
+import sys
+from textwrap import dedent
+from .__about__ import __version__
+from ._markitdown import MarkItDown, DocumentConverterResult


 def main():
    parser = argparse.ArgumentParser(
        description="Convert various file formats to markdown.",
+        prog="markitdown",
        formatter_class=argparse.RawDescriptionHelpFormatter,
-        usage="""
-SYNTAX: 
+        usage=dedent(
+            """
+            SYNTAX:
+
+                markitdown <OPTIONAL: FILENAME>
+                If FILENAME is empty, markitdown reads from stdin.
+
+            EXAMPLE:
+
+                markitdown example.pdf
+
+                OR
+
+                cat example.pdf | markitdown
+
+                OR
+
+                markitdown < example.pdf
+                
+                OR to save to a file use
    
-    markitdown <OPTIONAL: FILENAME>
-    If FILENAME is empty, markitdown reads from stdin.
+                markitdown example.pdf -o example.md
+                
+                OR
+                
+                markitdown example.pdf > example.md
+            """
+        ).strip(),
+    )

-EXAMPLE:
-    
-    markitdown example.pdf
-    
-    OR
-
-    cat example.pdf | markitdown
-
-    OR 
-
-    markitdown < example.pdf
-""".strip(),
+    parser.add_argument(
+        "-v",
+        "--version",
+        action="version",
+        version=f"%(prog)s {__version__}",
+        help="show the version number and exit",
    )

    parser.add_argument("filename", nargs="?")
+    parser.add_argument(
+        "-o",
+        "--output",
+        help="Output file name. If not provided, output is written to stdout.",
+    )
    args = parser.parse_args()

    if args.filename is None:
        markitdown = MarkItDown()
        result = markitdown.convert_stream(sys.stdin.buffer)
-        print(result.text_content)
+        _handle_output(args, result)
    else:
        markitdown = MarkItDown()
        result = markitdown.convert(args.filename)
+        _handle_output(args, result)
+
+
+def _handle_output(args, result: DocumentConverterResult):
+    """Handle output to stdout or file"""
+    if args.output:
+        with open(args.output, "w", encoding="utf-8") as f:
+            f.write(result.text_content)
+    else:
        print(result.text_content)
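
The new `-o`/`--output` handling boils down to one branch: write to the named file as UTF-8, or print to stdout when no file is given. The standalone sketch below reproduces that flow without importing markitdown. `build_parser`, `write_output`, and the `stream` parameter are hypothetical names used for illustration; only the `filename` positional and the `-o`/`--output` option come from the diff.

```python
import argparse
import sys
from typing import Optional, TextIO


def build_parser() -> argparse.ArgumentParser:
    # Same argument shape as the diff: optional positional plus -o/--output.
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", nargs="?")
    parser.add_argument("-o", "--output")
    return parser


def write_output(
    text: str, output_path: Optional[str], stream: Optional[TextIO] = None
) -> None:
    """Write text to output_path if given, otherwise to the stream (stdout by default)."""
    if output_path:
        # Explicit encoding keeps the result independent of the platform default.
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
    else:
        print(text, file=stream if stream is not None else sys.stdout)
```

Because both the stdin and the file branch of `main()` end in the same call, the two `MarkItDown()` constructions above could also be hoisted out of the `if`/`else`, though the diff leaves them as they are.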


--- a/src/markitdown/_markitdown.py
+++ b/src/markitdown/_markitdown.py
@@ -13,12 +13,15 @@ import sys
 import tempfile
 import traceback
 import zipfile
+from xml.dom import minidom
 from typing import Any, Dict, List, Optional, Union
+from pathlib import Path
 from urllib.parse import parse_qs, quote, unquote, urlparse, urlunparse
 from warnings import warn, resetwarnings, catch_warnings

 import mammoth
 import markdownify
+import olefile
 import pandas as pd
 import pdfminer
 import pdfminer.high_level
@@ -31,6 +34,7 @@ from bs4 import BeautifulSoup
 from charset_normalizer import from_path

 # Optional Transcription support
+IS_AUDIO_TRANSCRIPTION_CAPABLE = False
 try:
    # Using warnings' catch_warnings to catch
    # pydub's warning of ffmpeg or avconv missing
@@ -169,7 +173,10 @@ class PlainTextConverter(DocumentConverter):
        # Only accept text files
        if content_type is None:
            return None
-        elif "text/" not in content_type.lower():
+        elif all(
+            not content_type.lower().startswith(type_prefix)
+            for type_prefix in ["text/", "application/json"]
+        ):
            return None

        text_content = str(from_path(local_path).best())
@@ -222,6 +229,143 @@ class HtmlConverter(DocumentConverter):
        )


+class RSSConverter(DocumentConverter):
+    """Convert RSS / Atom type to markdown"""
+
+    def convert(
+        self, local_path: str, **kwargs
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not RSS type
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() not in [".xml", ".rss", ".atom"]:
+            return None
+        try:
+            doc = minidom.parse(local_path)
+        except Exception:
+            return None
+        result = None
+        if doc.getElementsByTagName("rss"):
+            # An RSS feed must have a root element of <rss>
+            result = self._parse_rss_type(doc)
+        elif doc.getElementsByTagName("feed"):
+            root = doc.getElementsByTagName("feed")[0]
+            if root.getElementsByTagName("entry"):
+                # An Atom feed must have a root element of <feed> and at least one <entry>
+                result = self._parse_atom_type(doc)
+            else:
+                return None
+        else:
+            # not rss or atom
+            return None
+
+        return result
+
+    def _parse_atom_type(
+        self, doc: minidom.Document
+    ) -> Union[None, DocumentConverterResult]:
+        """Parse the type of an Atom feed.
+
+        Returns None if the feed type is not recognized or something goes wrong.
+        """
+        try:
+            root = doc.getElementsByTagName("feed")[0]
+            title = self._get_data_by_tag_name(root, "title")
+            subtitle = self._get_data_by_tag_name(root, "subtitle")
+            entries = root.getElementsByTagName("entry")
+            md_text = f"# {title}\n"
+            if subtitle:
+                md_text += f"{subtitle}\n"
+            for entry in entries:
+                entry_title = self._get_data_by_tag_name(entry, "title")
+                entry_summary = self._get_data_by_tag_name(entry, "summary")
+                entry_updated = self._get_data_by_tag_name(entry, "updated")
+                entry_content = self._get_data_by_tag_name(entry, "content")
+
+                if entry_title:
+                    md_text += f"\n## {entry_title}\n"
+                if entry_updated:
+                    md_text += f"Updated on: {entry_updated}\n"
+                if entry_summary:
+                    md_text += self._parse_content(entry_summary)
+                if entry_content:
+                    md_text += self._parse_content(entry_content)
+
+            return DocumentConverterResult(
+                title=title,
+                text_content=md_text,
+            )
+        except Exception:
+            return None
+
+    def _parse_rss_type(
+        self, doc: minidom.Document
+    ) -> Union[None, DocumentConverterResult]:
+        """Parse the type of an RSS feed.
+
+        Returns None if the feed type is not recognized or something goes wrong.
+        """
+        try:
+            root = doc.getElementsByTagName("rss")[0]
+            channel = root.getElementsByTagName("channel")
+            if not channel:
+                return None
+            channel = channel[0]
+            channel_title = self._get_data_by_tag_name(channel, "title")
+            channel_description = self._get_data_by_tag_name(channel, "description")
+            items = channel.getElementsByTagName("item")
+            # Always initialise md_text so a feed without a <title> still converts
+            md_text = f"# {channel_title}\n" if channel_title else ""
+            if channel_description:
+                md_text += f"{channel_description}\n"
+            if not items:
+                items = []
+            for item in items:
+                title = self._get_data_by_tag_name(item, "title")
+                description = self._get_data_by_tag_name(item, "description")
+                pubDate = self._get_data_by_tag_name(item, "pubDate")
+                content = self._get_data_by_tag_name(item, "content:encoded")
+
+                if title:
+                    md_text += f"\n## {title}\n"
+                if pubDate:
+                    md_text += f"Published on: {pubDate}\n"
+                if description:
+                    md_text += self._parse_content(description)
+                if content:
+                    md_text += self._parse_content(content)
+
+            return DocumentConverterResult(
+                title=channel_title,
+                text_content=md_text,
+            )
+        except Exception:
+            print(traceback.format_exc())
+            return None
+
+    def _parse_content(self, content: str) -> str:
+        """Parse the content of an RSS feed item"""
+        try:
+            # using bs4 because many RSS feeds have HTML-styled content
+            soup = BeautifulSoup(content, "html.parser")
+            return _CustomMarkdownify().convert_soup(soup)
+        except Exception:
+            return content
+
+    def _get_data_by_tag_name(
+        self, element: minidom.Element, tag_name: str
+    ) -> Union[str, None]:
+        """Get data from first child element with the given tag name.
+        Returns None when no such element is found.
+        """
+        nodes = element.getElementsByTagName(tag_name)
+        if not nodes:
+            return None
+        fc = nodes[0].firstChild
+        if fc:
+            return fc.data
+        return None
+
+
 class WikipediaConverter(DocumentConverter):
    """Handle Wikipedia pages separately, focusing only on the main document content."""

@@ -403,6 +547,67 @@ class YouTubeConverter(DocumentConverter):
        return None


+class IpynbConverter(DocumentConverter):
+    """Converts Jupyter Notebook (.ipynb) files to Markdown."""
+
+    def convert(
+        self, local_path: str, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not ipynb
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".ipynb":
+            return None
+
+        # Parse and convert the notebook
+        result = None
+        with open(local_path, "rt", encoding="utf-8") as fh:
+            notebook_content = json.load(fh)
+            result = self._convert(notebook_content)
+
+        return result
+
+    def _convert(self, notebook_content: dict) -> Union[None, DocumentConverterResult]:
+        """Helper function that converts notebook JSON content to Markdown."""
+        try:
+            md_output = []
+            title = None
+
+            for cell in notebook_content.get("cells", []):
+                cell_type = cell.get("cell_type", "")
+                source_lines = cell.get("source", [])
+
+                if cell_type == "markdown":
+                    md_output.append("".join(source_lines))
+
+                    # Extract the first # heading as title if not already found
+                    if title is None:
+                        for line in source_lines:
+                            if line.startswith("# "):
+                                title = line[2:].strip()
+                                break
+
+                elif cell_type == "code":
+                    # Code cells are wrapped in Markdown code blocks
+                    md_output.append(f"```python\n{''.join(source_lines)}\n```")
+                elif cell_type == "raw":
+                    md_output.append(f"```\n{''.join(source_lines)}\n```")
+
+            md_text = "\n\n".join(md_output)
+
+            # Check for title in notebook metadata
+            title = notebook_content.get("metadata", {}).get("title", title)
+
+            return DocumentConverterResult(
+                title=title,
+                text_content=md_text,
+            )
+
+        except Exception as e:
+            raise FileConversionException(
+                f"Error converting .ipynb file: {str(e)}"
+            ) from e
+
+
 class BingSerpConverter(DocumentConverter):
    """
    Handle Bing results pages (only the organic search results).
@@ -524,7 +729,31 @@ class XlsxConverter(HtmlConverter):
        if extension.lower() != ".xlsx":
            return None

-        sheets = pd.read_excel(local_path, sheet_name=None)
+        sheets = pd.read_excel(local_path, sheet_name=None, engine="openpyxl")
+        md_content = ""
+        for s in sheets:
+            md_content += f"## {s}\n"
+            html_content = sheets[s].to_html(index=False)
+            md_content += self._convert(html_content).text_content.strip() + "\n\n"
+
+        return DocumentConverterResult(
+            title=None,
+            text_content=md_content.strip(),
+        )
+
+
+class XlsConverter(HtmlConverter):
+    """
+    Converts XLS files to Markdown, with each sheet presented as a separate Markdown table.
+    """
+
+    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
+        # Bail if not a XLS
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".xls":
+            return None
+
+        sheets = pd.read_excel(local_path, sheet_name=None, engine="xlrd")
        md_content = ""
        for s in sheets:
            md_content += f"## {s}\n"
@@ -663,14 +892,25 @@ class MediaConverter(DocumentConverter):
    Abstract class for multi-modal media (e.g., images and audio)
    """

-    def _get_metadata(self, local_path):
-        exiftool = shutil.which("exiftool")
-        if not exiftool:
+    def _get_metadata(self, local_path, exiftool_path=None):
+        if not exiftool_path:
+            which_exiftool = shutil.which("exiftool")
+            if which_exiftool:
+                warn(
+                    f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
+
+    md = MarkItDown(exiftool_path="{which_exiftool}")
+
+This warning will be removed in future releases.
+""",
+                    DeprecationWarning,
+                )
+
            return None
        else:
            try:
                result = subprocess.run(
-                    [exiftool, "-json", local_path], capture_output=True, text=True
+                    [exiftool_path, "-json", local_path], capture_output=True, text=True
                ).stdout
                return json.loads(result)[0]
            except Exception:
@@ -683,7 +923,7 @@ class WavConverter(MediaConverter):
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
-        # Bail if not a XLSX
+        # Bail if not a WAV
        extension = kwargs.get("file_extension", "")
        if extension.lower() != ".wav":
            return None
@@ -691,7 +931,7 @@ class WavConverter(MediaConverter):
        md_content = ""

        # Add metadata
-        metadata = self._get_metadata(local_path)
+        metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
        if metadata:
            for f in [
                "Title",
@@ -746,7 +986,7 @@ class Mp3Converter(WavConverter):
        md_content = ""

        # Add metadata
-        metadata = self._get_metadata(local_path)
+        metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
        if metadata:
            for f in [
                "Title",
@@ -799,7 +1039,7 @@ class ImageConverter(MediaConverter):
    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
-        # Bail if not a XLSX
+        # Bail if not an image
        extension = kwargs.get("file_extension", "")
        if extension.lower() not in [".jpg", ".jpeg", ".png"]:
            return None
@@ -807,7 +1047,7 @@ class ImageConverter(MediaConverter):
        md_content = ""

        # Add metadata
-        metadata = self._get_metadata(local_path)
+        metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
        if metadata:
            for f in [
                "ImageSize",
@@ -876,6 +1116,79 @@ class ImageConverter(MediaConverter):
        return response.choices[0].message.content


+class OutlookMsgConverter(DocumentConverter):
+    """Converts Outlook .msg files to markdown by extracting email metadata and content.
+
+    Uses the olefile package to parse the .msg file structure and extract:
+    - Email headers (From, To, Subject)
+    - Email body content
+    """
+
+    def convert(
+        self, local_path: str, **kwargs: Any
+    ) -> Union[None, DocumentConverterResult]:
+        # Bail if not a MSG file
+        extension = kwargs.get("file_extension", "")
+        if extension.lower() != ".msg":
+            return None
+
+        try:
+            msg = olefile.OleFileIO(local_path)
+            # Extract email metadata
+            md_content = "# Email Message\n\n"
+
+            # Get headers
+            headers = {
+                "From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
+                "To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
+                "Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
+            }
+
+            # Add headers to markdown
+            for key, value in headers.items():
+                if value:
+                    md_content += f"**{key}:** {value}\n"
+
+            md_content += "\n## Content\n\n"
+
+            # Get email body
+            body = self._get_stream_data(msg, "__substg1.0_1000001F")
+            if body:
+                md_content += body
+
+            msg.close()
+
+            return DocumentConverterResult(
+                title=headers.get("Subject"), text_content=md_content.strip()
+            )
+
+        except Exception as e:
+            raise FileConversionException(
+                f"Could not convert MSG file '{local_path}': {str(e)}"
+            ) from e
+
+    def _get_stream_data(
+        self, msg: olefile.OleFileIO, stream_path: str
+    ) -> Union[str, None]:
+        """Helper to safely extract and decode stream data from the MSG file."""
+        try:
+            if msg.exists(stream_path):
+                data = msg.openstream(stream_path).read()
+                # Try UTF-16 first (common for .msg files)
+                try:
+                    return data.decode("utf-16-le").strip()
+                except UnicodeDecodeError:
+                    # Fall back to UTF-8
+                    try:
+                        return data.decode("utf-8").strip()
+                    except UnicodeDecodeError:
+                        # Last resort - ignore errors
+                        return data.decode("utf-8", errors="ignore").strip()
+        except Exception:
+            pass
+        return None
+
+
 class ZipConverter(DocumentConverter):
    """Converts ZIP files to markdown by extracting and converting all contained files.

@@ -934,27 +1247,33 @@ class ZipConverter(DocumentConverter):
        extracted_zip_folder_name = (
            f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
        )
-        new_folder = os.path.normpath(
+        extraction_dir = os.path.normpath(
            os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
        )
        md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"

-        # Safety check for path traversal
-        if not new_folder.startswith(os.path.dirname(local_path)):
-            return DocumentConverterResult(
-                title=None, text_content=f"[ERROR] Invalid zip file path: {local_path}"
-            )
-
        try:
-            # Extract the zip file
+            # Extract the zip file safely
            with zipfile.ZipFile(local_path, "r") as zipObj:
-                zipObj.extractall(path=new_folder)
+                # Safeguard against path traversal
+                for member in zipObj.namelist():
+                    member_path = os.path.normpath(os.path.join(extraction_dir, member))
+                    if (
+                        os.path.commonpath([extraction_dir, member_path])
+                        != extraction_dir
+                    ):
+                        raise ValueError(
+                            f"Path traversal detected in zip file: {member}"
+                        )
+
+                # Extract all files safely
+                zipObj.extractall(path=extraction_dir)

            # Process each extracted file
-            for root, dirs, files in os.walk(new_folder):
+            for root, dirs, files in os.walk(extraction_dir):
                for name in files:
                    file_path = os.path.join(root, name)
-                    relative_path = os.path.relpath(file_path, new_folder)
+                    relative_path = os.path.relpath(file_path, extraction_dir)

                    # Get file extension
                    _, file_extension = os.path.splitext(name)
@@ -978,7 +1297,7 @@ class ZipConverter(DocumentConverter):

            # Clean up extracted files if specified
            if kwargs.get("cleanup_extracted", True):
-                shutil.rmtree(new_folder)
+                shutil.rmtree(extraction_dir)

            return DocumentConverterResult(title=None, text_content=md_content.strip())

@@ -987,6 +1306,11 @@ class ZipConverter(DocumentConverter):
                title=None,
                text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
            )
+        except ValueError as ve:
+            return DocumentConverterResult(
+                title=None,
+                text_content=f"[ERROR] Security error in zip file {local_path}: {str(ve)}",
+            )
        except Exception as e:
            return DocumentConverterResult(
                title=None,
@@ -1012,6 +1336,7 @@ class MarkItDown:
        llm_client: Optional[Any] = None,
        llm_model: Optional[str] = None,
        style_map: Optional[str] = None,
+        exiftool_path: Optional[str] = None,
        # Deprecated
        mlm_client: Optional[Any] = None,
        mlm_model: Optional[str] = None,
@@ -1021,6 +1346,9 @@ class MarkItDown:
        else:
            self._requests_session = requests_session

+        if exiftool_path is None:
+            exiftool_path = os.environ.get("EXIFTOOL_PATH")
+
        # Handle deprecation notices
        #############################
        if mlm_client is not None:
@@ -1053,6 +1381,7 @@ class MarkItDown:
        self._llm_client = llm_client
        self._llm_model = llm_model
        self._style_map = style_map
+        self._exiftool_path = exiftool_path

        self._page_converters: List[DocumentConverter] = []

@@ -1061,24 +1390,28 @@ class MarkItDown:
        # To this end, the most specific converters should appear below the most generic converters
        self.register_page_converter(PlainTextConverter())
        self.register_page_converter(HtmlConverter())
+        self.register_page_converter(RSSConverter())
        self.register_page_converter(WikipediaConverter())
        self.register_page_converter(YouTubeConverter())
        self.register_page_converter(BingSerpConverter())
        self.register_page_converter(DocxConverter())
        self.register_page_converter(XlsxConverter())
+        self.register_page_converter(XlsConverter())
        self.register_page_converter(PptxConverter())
        self.register_page_converter(WavConverter())
        self.register_page_converter(Mp3Converter())
        self.register_page_converter(ImageConverter())
+        self.register_page_converter(IpynbConverter())
        self.register_page_converter(PdfConverter())
        self.register_page_converter(ZipConverter())
+        self.register_page_converter(OutlookMsgConverter())

    def convert(
-        self, source: Union[str, requests.Response], **kwargs: Any
+        self, source: Union[str, requests.Response, Path], **kwargs: Any
    ) -> DocumentConverterResult:  # TODO: deal with kwargs
        """
        Args:
-            - source: can be a string representing a path or url, or a requests.response object
+            - source: can be a path (given as a string or a pathlib.Path object), a URL string, or a requests.Response object
            - extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
        """

@@ -1095,10 +1428,14 @@ class MarkItDown:
        # Request response
        elif isinstance(source, requests.Response):
            return self.convert_response(source, **kwargs)
+        elif isinstance(source, Path):
+            return self.convert_local(source, **kwargs)

    def convert_local(
-        self, path: str, **kwargs: Any
+        self, path: Union[str, Path], **kwargs: Any
    ) -> DocumentConverterResult:  # TODO: deal with kwargs
+        if isinstance(path, Path):
+            path = str(path)
        # Prepare a list of extensions to try (in order of priority)
        ext = kwargs.get("file_extension")
        extensions = [ext] if ext is not None else []
@@ -1228,12 +1565,15 @@ class MarkItDown:
                if "llm_model" not in _kwargs and self._llm_model is not None:
                    _kwargs["llm_model"] = self._llm_model

-                # Add the list of converters for nested processing
-                _kwargs["_parent_converters"] = self._page_converters
-
                if "style_map" not in _kwargs and self._style_map is not None:
                    _kwargs["style_map"] = self._style_map

+                if "exiftool_path" not in _kwargs and self._exiftool_path is not None:
+                    _kwargs["exiftool_path"] = self._exiftool_path
+
+                # Add the list of converters for nested processing
+                _kwargs["_parent_converters"] = self._page_converters
+
                # If we hit an error log it and keep trying
                try:
                    res = converter.convert(local_path, **_kwargs)
@@ -1276,6 +1616,25 @@ class MarkItDown:
        # Use puremagic to guess
        try:
            guesses = puremagic.magic_file(path)
+
+            # Fix for: https://github.com/microsoft/markitdown/issues/222
+            # If there are no guesses, then try again after trimming leading ASCII whitespace.
+            # ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
+            # (space, tab, newline, carriage return, vertical tab, form feed).
+            if len(guesses) == 0:
+                with open(path, "rb") as file:
+                    while True:
+                        char = file.read(1)
+                        if not char:  # End of file
+                            break
+                        if not char.isspace():
+                            file.seek(file.tell() - 1)
+                            break
+                    try:
+                        guesses = puremagic.magic_stream(file)
+                    except puremagic.main.PureError:
+                        pass
+
            extensions = list()
            for g in guesses:
                ext = g.extension.strip()
--- a/src/markitdown/py.typed
+++ b/src/markitdown/py.typed
--- a/tests/test_files/test.json
+++ b/tests/test_files/test.json
@@ -0,0 +1,10 @@
+{
+    "key1": "string_value",
+    "key2": 1234,
+    "key3": [
+        "list_value1",
+        "list_value2"
+    ],
+    "5b64c88c-b3c3-4510-bcb8-da0b200602d8": "uuid_key",
+    "uuid_value": "9700dc99-6685-40b4-9a3a-5e406dcb37f3"
+}
--- a/tests/test_files/test.xls
+++ b/tests/test_files/test.xls
--- a/tests/test_files/test_notebook.ipynb
+++ b/tests/test_files/test_notebook.ipynb
@@ -0,0 +1,89 @@
+{
+    "cells": [
+        {
+            "cell_type": "markdown",
+            "id": "0f61db80",
+            "metadata": {},
+            "source": [
+                "# Test Notebook"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 11,
+            "id": "3f2a5bbd",
+            "metadata": {},
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "markitdown\n"
+                    ]
+                }
+            ],
+            "source": [
+                "print('markitdown')"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "9b9c0468",
+            "metadata": {},
+            "source": [
+                "## Code Cell Below"
+            ]
+        },
+        {
+            "cell_type": "code",
+            "execution_count": 10,
+            "id": "37d8088a",
+            "metadata": {},
+            "outputs": [
+                {
+                    "name": "stdout",
+                    "output_type": "stream",
+                    "text": [
+                        "42\n"
+                    ]
+                }
+            ],
+            "source": [
+                "# comment in code\n",
+                "print(42)"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "id": "2e3177bd",
+            "metadata": {},
+            "source": [
+                "End\n",
+                "\n",
+                "---"
+            ]
+        }
+    ],
+    "metadata": {
+        "kernelspec": {
+            "display_name": "Python 3",
+            "language": "python",
+            "name": "python3"
+        },
+        "language_info": {
+            "codemirror_mode": {
+                "name": "ipython",
+                "version": 3
+            },
+            "file_extension": ".py",
+            "mimetype": "text/x-python",
+            "name": "python",
+            "nbconvert_exporter": "python",
+            "pygments_lexer": "ipython3",
+            "version": "3.12.8"
+        },
+        "title": "Test Notebook Title"
+    },
+    "nbformat": 4,
+    "nbformat_minor": 5
+}
--- a/tests/test_files/test_outlook_msg.msg
+++ b/tests/test_files/test_outlook_msg.msg
--- a/tests/test_files/test_rss.xml
+++ b/tests/test_files/test_rss.xml
--- a/tests/test_markitdown.py
+++ b/tests/test_markitdown.py
@@ -54,6 +54,12 @@ XLSX_TEST_STRINGS = [
    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
 ]

+XLS_TEST_STRINGS = [
+    "## 09060124-b5e7-4717-9d07-3c046eb",
+    "6ff4173b-42a5-4784-9b19-f49caff4d93d",
+    "affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
+]
+
 DOCX_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -63,6 +69,15 @@ DOCX_TEST_STRINGS = [
    "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
 ]

+MSG_TEST_STRINGS = [
+    "# Email Message",
+    "**From:** test.sender@example.com",
+    "**To:** test.recipient@example.com",
+    "**Subject:** Test Email Message",
+    "## Content",
+    "This is the body of the test email message",
+]
+
 DOCX_COMMENT_TEST_STRINGS = [
    "314b0a30-5b04-470b-b9f7-eed2c2bec74a",
    "49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -90,6 +105,13 @@ BLOG_TEST_STRINGS = [
    "an example where high cost can easily prevent a generic complex",
 ]

+
+RSS_TEST_STRINGS = [
+    "The Official Microsoft Blog",
+    "In the case of AI, it is absolutely true that the industry is moving incredibly fast",
+]
+
+
 WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
 WIKIPEDIA_TEST_STRINGS = [
    "Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
@@ -123,6 +145,22 @@ LLM_TEST_STRINGS = [
    "5bda1dd6",
 ]

+JSON_TEST_STRINGS = [
+    "5b64c88c-b3c3-4510-bcb8-da0b200602d8",
+    "9700dc99-6685-40b4-9a3a-5e406dcb37f3",
+]
+
+
+# --- Helper Functions ---
+def validate_strings(result, expected_strings, exclude_strings=None):
+    """Validate presence or absence of specific strings."""
+    text_content = result.text_content.replace("\\", "")
+    for string in expected_strings:
+        assert string in text_content
+    if exclude_strings:
+        for string in exclude_strings:
+            assert string not in text_content
+

@pytest.mark.skipif(
    skip_remote,
@@ -156,79 +194,82 @@ def test_markitdown_local() -> None:

    # Test XLSX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
-    for test_string in XLSX_TEST_STRINGS:
+    validate_strings(result, XLSX_TEST_STRINGS)
+
+    # Test XLS processing
+    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
+    for test_string in XLS_TEST_STRINGS:
        text_content = result.text_content.replace("\\", "")
        assert test_string in text_content

    # Test DOCX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
-    for test_string in DOCX_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, DOCX_TEST_STRINGS)

    # Test DOCX processing, with comments
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
        style_map="comment-reference => ",
    )
-    for test_string in DOCX_COMMENT_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test DOCX processing, with comments and setting style_map on init
    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
    result = markitdown_with_style_map.convert(
        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
    )
-    for test_string in DOCX_COMMENT_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, DOCX_COMMENT_TEST_STRINGS)

    # Test PPTX processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
-    for test_string in PPTX_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, PPTX_TEST_STRINGS)

    # Test HTML processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
    )
-    for test_string in BLOG_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, BLOG_TEST_STRINGS)

    # Test ZIP file processing
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
-    for test_string in DOCX_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
+    validate_strings(result, XLSX_TEST_STRINGS)

    # Test Wikipedia processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
    )
    text_content = result.text_content.replace("\\", "")
-    for test_string in WIKIPEDIA_TEST_EXCLUDES:
-        assert test_string not in text_content
-    for test_string in WIKIPEDIA_TEST_STRINGS:
-        assert test_string in text_content
+    validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)

    # Test Bing processing
    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
    )
    text_content = result.text_content.replace("\\", "")
-    for test_string in SERP_TEST_EXCLUDES:
-        assert test_string not in text_content
-    for test_string in SERP_TEST_STRINGS:
+    validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
+
+    # Test RSS processing
+    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
+    text_content = result.text_content.replace("\\", "")
+    for test_string in RSS_TEST_STRINGS:
        assert test_string in text_content

    ## Test non-UTF-8 encoding
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
-    text_content = result.text_content.replace("\\", "")
-    for test_string in CSV_CP932_TEST_STRINGS:
-        assert test_string in text_content
+    validate_strings(result, CSV_CP932_TEST_STRINGS)
+
+    # Test MSG (Outlook email) processing
+    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
+    validate_strings(result, MSG_TEST_STRINGS)
+
+    # Test JSON processing
+    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
+    validate_strings(result, JSON_TEST_STRINGS)
+
+    # Test input with leading blank characters
+    input_data = b"   \n\n\n<html><body><h1>Test</h1></body></html>"
+    result = markitdown.convert_stream(io.BytesIO(input_data))
+    assert "# Test" in result.text_content


@pytest.mark.skipif(
@@ -236,9 +277,29 @@ def test_markitdown_local() -> None:
    reason="do not run if exiftool is not installed",
 )
 def test_markitdown_exiftool() -> None:
-    markitdown = MarkItDown()
+    # Test the automatic discovery of exiftool throws a warning
+    # and is disabled
+    try:
+        with catch_warnings(record=True) as w:
+            markitdown = MarkItDown()
+            result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
+            assert len(w) == 1
+            assert w[0].category is DeprecationWarning
+            assert result.text_content.strip() == ""
+    finally:
+        resetwarnings()

-    # Test JPG metadata processing
+    # Test explicitly setting the location of exiftool
+    which_exiftool = shutil.which("exiftool")
+    markitdown = MarkItDown(exiftool_path=which_exiftool)
+    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
+    for key in JPG_TEST_EXIFTOOL:
+        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
+        assert target in result.text_content
+
+    # Test setting the exiftool path through an environment variable
+    os.environ["EXIFTOOL_PATH"] = which_exiftool
+    markitdown = MarkItDown()
    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
    for key in JPG_TEST_EXIFTOOL:
        target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
@@ -300,8 +361,8 @@ def test_markitdown_llm() -> None:

 if __name__ == "__main__":
    """Runs this file's tests from the command line."""
-    test_markitdown_remote()
-    test_markitdown_local()
+    # test_markitdown_remote()
+    # test_markitdown_local()
    test_markitdown_exiftool()
-    test_markitdown_deprecation()
-    test_markitdown_llm()
+    # test_markitdown_deprecation()
+    # test_markitdown_llm()
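
The leading-whitespace retry added to the file-type guessing code in this diff can be tried in isolation. The following is a minimal sketch, not the project's implementation: `skip_leading_ascii_whitespace` is a hypothetical helper name, and the puremagic call is left out so that the example runs with the standard library only.

```python
import io


def skip_leading_ascii_whitespace(stream: io.BufferedIOBase) -> int:
    """Advance `stream` to its first non-whitespace byte.

    Returns the number of bytes skipped. Hypothetical helper for illustration.
    """
    skipped = 0
    while True:
        char = stream.read(1)
        if not char:  # End of stream
            break
        if not char.isspace():  # bytes.isspace() covers b" \t\n\r\x0b\f"
            stream.seek(-1, io.SEEK_CUR)  # Step back so the byte is read again
            break
        skipped += 1
    return skipped


# Same input as the "leading blank characters" test in this diff
stream = io.BytesIO(b"   \n\n\n<html><body><h1>Test</h1></body></html>")
skipped = skip_leading_ascii_whitespace(stream)
remaining = stream.read()
```

After the call, the stream is positioned at the first meaningful byte, so a content sniffer that reads from the current position sees `<html>` rather than blank bytes.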
Author	SHA1	Message	Date
Josh Bradley	33a0cd8efe	small formatting change	2025-01-14 18:04:14 -05:00
afourney	f58a864951	Set exiftool path explicitly. (#267 )	2025-01-06 12:43:47 -08:00
afourney	265aea2edf	Removed the holiday away message from README.md (#266 )	2025-01-06 09:06:21 -08:00
afourney	05b78e7ce1	Recognize json as plain text (if no other handlers are present). (#261 ) * Recognize json as plain text (if no other handlers are present).	2025-01-03 16:40:43 -08:00
afourney	436407288f	If puremagic has no guesses, try again after ltrim. (#260 )	2025-01-03 16:03:11 -08:00
afourney	731b39e7f5	Added a test for leading spaces. (#258 )	2025-01-03 14:34:33 -08:00
yeungadrian	08ed32869e	Feature/ Add xls support (#169 ) * add xlrd * add xls converter with tests	2025-01-03 13:58:17 -08:00
Murat Can Kurtuluş	d248621ba4	feat: outlook ".msg" file converter (#196 ) * feat: outlook .msg converter * add test, adjust docstring	2025-01-03 13:34:39 -08:00
AbSadiki	4678c8a2a4	fix(transcription): IS_AUDIO_TRANSCRIPTION_CAPABLE should be iniztialized (#194 )	2025-01-03 13:29:26 -08:00
Ikko Eltociear Ashimine	125e206047	docs: update README.md (#182 ) faciliate -> facilitate	2024-12-21 01:51:30 -08:00
numekudi	f94d09990e	feat: enable Git support in devcontainer (#136 ) Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 18:09:17 -08:00
lumin	cfd2319c14	feat: add version option to markitdown CLI (#172 ) Add a `--version` option to the markitdown command-line interface that displays the current version number.	2024-12-20 16:24:45 -08:00
dependabot[bot]	73161982ff	Bump actions/setup-python from 2 to 5 (#179 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 5. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/v2...v5) --- updated-dependencies: - dependency-name: actions/setup-python dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: afourney <adamfo@microsoft.com>	2024-12-20 16:20:22 -08:00
dependabot[bot]	9b69467772	Bump actions/cache from 3 to 4 (#178 ) Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](https://github.com/actions/cache/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: gagb <gagb@users.noreply.github.com> Co-authored-by: afourney <adamfo@microsoft.com>	2024-12-20 16:17:43 -08:00
gagb	857a2d160d	Update README.md (#180 )	2024-12-20 14:49:20 -08:00
Soulter	1123392306	fix: support -o param to avoid encoding issues (#116 ) * perf: cli supports -o param * doc: update README --------- Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 14:43:00 -08:00
dependabot[bot]	377a7eaa7d	Bump actions/checkout from 2 to 4 (#177 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v2...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-12-20 14:36:48 -08:00
lumin	c1a0d3deaf	chore: configure Dependabot for GitHub Actions updates (#112 ) Sets up Dependabot to automatically check for updates to GitHub Actions on a weekly basis, ensuring that the project remains up-to-date with the latest dependencies and security fixes. Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 14:28:55 -08:00
SigireddyBalasai	5276616ba1	Added support to use Pathlib (#93 ) * Add support for Path objects in MarkItDown conversion methods * Remove unnecessary blank line in test_markitdown_exiftool function * Remove unnecessary blank line in test_markitdown_exiftool function * remove pathlib path in test file --------- Co-authored-by: afourney <adamfo@microsoft.com> Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 14:12:48 -08:00
gagb	7e6c36c5d4	docs: add contribution guidelines to README (#176 )	2024-12-20 14:08:58 -08:00
lumin	52d73080c7	refactor(tests): add helper function for tests (#87 ) * refactor(tests): simplify string validation in tests Introduce a helper function `validate_strings` to streamline the validation of expected and excluded strings in test cases. Replace repetitive string assertions in the `test_markitdown_local` function with calls to this new helper, improving code readability and maintainability. * run pre-commit --------- Co-authored-by: lumin <71011125+l-melon@users.noreply.github.com> Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-20 11:42:32 -08:00
afourney	e6421000e3	Merge pull request #160 from sugatoray/support_type_hinting Add support for type-hinting (PEP-561)	2024-12-20 10:54:43 -08:00
Sugato Ray	08a25345e3	[feat]: add support for type-hinting for PEP-561	2024-12-20 02:37:10 +00:00
Sugato Ray	8921fe7304	ignore .vscode folder - avoid local developer vscode editor settings	2024-12-20 02:18:14 +00:00
Sugato Ray	613825d5b3	[feat]: add support for type-hinting for PEP-561	2024-12-20 02:12:24 +00:00
gagb	18e3f1d428	Merge pull request #91 from PetrAPConsulting/patch-1 Update README.md	2024-12-19 14:02:47 -08:00
gagb	c295dee5e4	Merge branch 'main' into patch-1	2024-12-19 13:22:51 -08:00
gagb	dd87dd5e36	Merge pull request #156 from microsoft/afourney-patch-1 Added holiday notice.	2024-12-19 11:18:24 -08:00
afourney	535147b2e8	Added holiday notice. Added holiday notice.	2024-12-19 11:11:54 -08:00
gagb	5c776bda70	Update README.md	2024-12-19 10:30:53 -08:00
gagb	423a01844a	Merge branch 'main' into patch-1	2024-12-19 10:30:10 -08:00
gagb	7147bef7d5	Merge pull request #130 from sugatoray/update_commandline_help Update CLI helpdoc formatting to allow indentation in code	2024-12-19 10:20:23 -08:00
Sugato Ray	a5f39d6922	Merge branch 'main' into update_commandline_help	2024-12-19 07:58:48 -05:00
gagb	925c4499f7	Merge pull request #121 from l-lumin/add-project-description	2024-12-19 00:53:54 -08:00
Petr@AP Consulting	b28f380a47	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-19 09:23:15 +01:00
lumin	c86287b7e3	feat: add project description in pyproject.toml	2024-12-19 13:02:47 +09:00
Sugato Ray	6f3c762526	Merge branch 'main' into update_commandline_help	2024-12-18 17:50:07 -05:00
gagb	cb66b35f11	Merge pull request #132 from microsoft/gagb-patch-1 Add downloads badge	2024-12-18 14:30:09 -08:00
gagb	a2743a5314	Add downloads badge	2024-12-18 14:26:36 -08:00
Sugato Ray	277480066a	Merge branch 'update_commandline_help' of https://github.com/sugatoray/markitdown into update_commandline_help	2024-12-18 21:53:54 +00:00
gagb	6e1b9a7402	Run precommit	2024-12-18 13:46:10 -08:00
Sugato Ray	1384e80725	update .gitignore to exclude .vscode folder	2024-12-18 21:46:06 +00:00
Sugato Ray	356e895306	update formatting with pre-commit	2024-12-18 21:45:23 +00:00
Petr@AP Consulting	f6e75c46d4	Update README.md I changed command for running script from Mac version (python3) to Windows version (python)	2024-12-18 21:17:47 +01:00
afourney	8bc1bee18b	Merge pull request #129 from finchy/main Safeguard against path traversal for ZipConverter	2024-12-18 12:11:00 -08:00
Petr@AP Consulting	f4471d96e2	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:08:10 +01:00
Petr@AP Consulting	088007338d	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:07:55 +01:00
Petr@AP Consulting	bb929629f3	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:05:36 +01:00
Petr@AP Consulting	233ba679b8	Update README.md Co-authored-by: gagb <gagb@users.noreply.github.com>	2024-12-18 21:05:04 +01:00
gagb	46b7f043d3	Merge branch 'main' into patch-1	2024-12-18 11:57:57 -08:00
gagb	5fc70864f2	Run pre-commit	2024-12-18 11:46:39 -08:00
Sugato Ray	39410d01df	Update CLI helpdoc formatting to allow indentation in code Use `textwrap.dedent()` to allow indented cli-helpdoc in `__main__.py` file. The indentation increases readability, while `textwrap.dedent` helps maintain the same functionality without breaking code.	2024-12-18 14:22:58 -05:00
Joel Esler	6e4caac70d	Safeguard against path traversal for ZipConverter fix: prevent path traversal vulnerabilities in ZipConverter Added a secure check for path traversal vulnerabilities in the ZipConverter class. Now validates extracted file paths using `os.path.commonprefix` to ensure all files remain within the intended extraction directory. Raises a `ValueError` if a path traversal attempt is detected. - Normalized file paths using `os.path.normpath`. - Added specific exception handling for `zipfile.BadZipFile` and traversal errors. - Ensured cleanup of extracted files after processing when `cleanup_extracted` is enabled.	2024-12-18 13:12:55 -05:00
Petr@AP Consulting	224f1df0fc	Update README.md I collapsed section about batch processing as was suggested	2024-12-18 09:28:18 +01:00
gagb	1deaba1c6c	Merge pull request #98 from waterimp/feature/fix-code-comments fix incorrect comments for "bail if not ..." for WAV and image cases.	2024-12-17 17:57:25 -08:00
gagb	09cb048cbe	Merge branch 'main' into feature/fix-code-comments	2024-12-17 17:34:53 -08:00
gagb	b029ae1cd4	Merge pull request #108 from microsoft/gagb-readme Simplify README	2024-12-17 17:30:49 -08:00
gagb	524aa0da75	Update README.md	2024-12-17 17:25:40 -08:00
gagb	de1b54d79f	Update README.md	2024-12-17 17:25:13 -08:00
gagb	1e7806a7ac	Simplify	2024-12-17 17:21:39 -08:00
gagb	1163aa2b4e	Merge pull request #106 from microsoft/gagb-patch-1 Update README.md	2024-12-17 16:57:32 -08:00
gagb	3bcf2bdae7	Update README.md	2024-12-17 16:54:17 -08:00
gagb	41a10b9a35	Merge pull request #64 from l-lumin/add-devcontainer-config feat(devcontainer): Add DevContainer Configuration for Easier Contribution Setup	2024-12-17 16:52:50 -08:00
gagb	f1e399eee4	Merge branch 'main' into add-devcontainer-config	2024-12-17 16:50:32 -08:00
gagb	8b02c0bf9f	Merge pull request #80 from diya155/main Update README.md	2024-12-17 16:49:58 -08:00
gagb	1dda535330	Merge branch 'main' into main	2024-12-17 16:46:23 -08:00
gagb	362214323e	Merge branch 'main' into feature/fix-code-comments	2024-12-17 16:38:47 -08:00
lumin	457b6234e6	Merge branch 'main' into add-devcontainer-config	2024-12-18 09:14:31 +09:00
afourney	790031409b	Merge pull request #71 from AumGupta/main feat: Add IpynbConverter	2024-12-17 15:41:51 -08:00
afourney	9e546a8588	Merge branch 'main' into main	2024-12-17 15:37:28 -08:00
afourney	ddf695cf81	Merge pull request #97 from Soulter/main feat: Add RSSConverter	2024-12-17 15:34:22 -08:00
Adam Fourney	8d5f16ecd2	Fixed formatting.	2024-12-17 15:27:06 -08:00
afourney	a571021199	Merge branch 'main' into main	2024-12-17 15:12:59 -08:00
afourney	9add517510	Merge branch 'main' into feature/fix-code-comments	2024-12-17 14:56:16 -08:00
Lee Bush	05a49ca129	fix incorrect comments for "bail if not ..." for WAV and image cases.	2024-12-17 08:10:53 -07:00
Soulter	752fbd333c	feat: add tests of rss convertor	2024-12-17 22:45:27 +08:00
Soulter	7dc2695b96	feat: support convert atom to markdown	2024-12-17 21:44:50 +08:00
Soulter	53fad6eb31	feat: add rss converter	2024-12-17 21:22:27 +08:00
Petr@AP Consulting	f398f3d443	Update README.md I added description and script for batch of files processing	2024-12-17 10:26:09 +01:00
lumin	e0a30295ff	docs: update README with Devcontainer instructions Add instructions for using Dev to run tests. Remove the install script it is no longer needed. Update trademark section for clarity.	2024-12-17 17:04:31 +09:00
lumin	07fe457a90	feat: add devcontainer configuration and installation script Add a devcontainer configuration to streamline the development environment setup. Introduce an `install.sh` script to install the project in editable mode. Update the Dockerfile to use the `python:3.13-slim-bullseye` base image and install dependencies using `apt-get` for better compatibility.	2024-12-17 17:04:31 +09:00
Om Gupta	60c4a62917	Merge branch 'microsoft:main' into main	2024-12-17 10:33:40 +05:30
Om Gupta	3eb8cf385b	Merge branch 'main' of https://github.com/AumGupta/markitdown	2024-12-17 10:24:30 +05:30
Om Gupta	8c91c11ea8	pre-commit run	2024-12-17 10:24:25 +05:30
diya155	14bd8d319a	Update README.md	2024-12-17 09:16:40 +05:30
gagb	dbc727615d	Merge branch 'main' into main	2024-12-16 15:48:49 -08:00
afourney	afaff11ef0	Merge branch 'main' into main	2024-12-16 14:40:58 -08:00
Om Gupta	a3208f2bd0	feat: Add IpynbConverter - Implemented IpynbConverter class for converting Jupyter Notebook (.ipynb) files into Markdown format. - Supports markdown cells, code cells and raw cells. - First markdown heading is used as the title if no title is found in notebook metadata. - Created a test notebook (`test_notebook.ipynb`) to verify the functionality of the converter.	2024-12-17 01:00:41 +05:30
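
The path traversal safeguard described in commit 6e4caac70d compares paths with `os.path.commonprefix`, which works character by character, so a sibling directory whose name merely starts with the extraction directory's name would also pass. The sketch below shows one possible stricter variant based on `os.path.commonpath`, which compares whole path components. It is an illustration under stated assumptions, not the code merged in this change, and `is_within_directory` is a hypothetical helper name.

```python
import os


def is_within_directory(base_dir: str, member: str) -> bool:
    """Return True if `member`, joined onto `base_dir`, stays inside `base_dir`.

    Hypothetical helper for illustration; not part of markitdown.
    """
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, member))
    try:
        # commonpath compares whole path components, unlike commonprefix
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised, for example, for paths on different drives on Windows
        return False


base = os.path.join("tmp", "extracted_test_zip")
checks = {
    "docs/readme.txt": is_within_directory(base, "docs/readme.txt"),
    "../outside.txt": is_within_directory(base, "../outside.txt"),
    "../extracted_test_zip_evil/x": is_within_directory(
        base, "../extracted_test_zip_evil/x"
    ),
}
```

The third entry is the case that separates the two functions: the member resolves to a sibling directory that shares a string prefix with the extraction directory, so a `commonprefix` comparison would accept it while the component-wise comparison rejects it.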