18 Commits

Author | SHA1 | Message | Date
Kenny Zhang | 4e0a10ecf3 | ran unit tests locally | 2025-02-27 16:44:50 -05:00
Kenny Zhang | 950b135da6 | formatting | 2025-02-27 15:08:10 -05:00
Kenny Zhang | b671345bb9 | updated readme | 2025-02-27 15:07:46 -05:00
Kenny Zhang | d9a92f7f06 | added file obj unit tests for rss and json | 2025-02-27 15:05:29 -05:00
Kenny Zhang | db0c8acbaf | added file obj support to rss and plain text converters | 2025-02-27 14:55:49 -05:00
Kenny Zhang | 08330c2ac3 | added core unit tests for file obj support | 2025-02-27 11:27:05 -05:00
Kenny Zhang | 4afc1fe886 | added non-binary example to README | 2025-02-21 13:31:37 -05:00
Kenny Zhang | b0044720da | updated docs | 2025-02-20 16:56:47 -05:00
Kenny Zhang | 07a28d4f00 | black formatting | 2025-02-20 16:49:37 -05:00
Kenny Zhang | b8b3897952 | modify ext guesser | 2025-02-20 16:47:37 -05:00
Kenny Zhang | 395ce2d301 | close file object after using | 2025-02-20 13:54:51 -05:00
Kenny Zhang | 808401a331 | added conversion path for file object in central class | 2025-02-19 17:02:51 -05:00
Kenny Zhang | e75f3f6f5b | local path inputs to MarkitDown class adhere to new converterinput structure | 2025-02-19 15:16:45 -05:00
Kenny Zhang | 8e950325d2 | refactored remaining converters | 2025-02-19 14:01:43 -05:00
Kenny Zhang | 096fef3d5f | refactored more converters to support input class | 2025-02-19 13:34:28 -05:00
Kenny Zhang | 52cbff061a | begin refactoring converter classes | 2025-02-19 11:48:00 -05:00
Kenny Zhang | 0027e6d425 | added wrapper class for converter file input | 2025-02-18 12:44:18 -05:00
Kenny Zhang | 63a7bafadd | removed redundant priority setting | 2025-02-18 12:18:49 -05:00
50 changed files with 1504 additions and 2616 deletions

3
.gitattributes vendored
View File

@@ -1,2 +1 @@
packages/markitdown/tests/test_files/** linguist-vendored tests/test_files/** linguist-vendored
packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored

View File

@@ -5,14 +5,10 @@
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen) [![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
> [!IMPORTANT] > [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.0.2: > MarkItDown 0.0.2 alpha 1 (0.0.2a1) introduces a plugin-based architecture. As much as was possible, command-line and Python interfaces have remained the same as 0.0.1a3 to support backward compatibility. Please report any issues you encounter. Some interface changes may yet occur as we continue to refine MarkItDown to a first non-alpha release.
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install markitdown[all]` to have backward-compatible behavior.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
At present, MarkItDown supports:
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
- PDF - PDF
- PowerPoint - PowerPoint
- Word - Word
@@ -22,26 +18,14 @@ At present, MarkItDown supports:
- HTML - HTML
- Text-based formats (CSV, JSON, XML) - Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents) - ZIP files (iterates over contents)
- Youtube URLs
- ... and more! - ... and more!
## Why Markdown? To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source:
Markdown is extremely close to plain text, with minimal markup or formatting, but still
provides a way to represent important document structure. Mainstream LLMs, such as
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
responses unprompted. This suggests that they have been trained on vast amounts of
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.
## Installation
To install MarkItDown, use pip: `pip install markitdown[all]`. Alternatively, you can install it from the source:
```bash ```bash
git clone git@github.com:microsoft/markitdown.git git clone git@github.com:microsoft/markitdown.git
cd markitdown cd markitdown
pip install -e packages/markitdown[all] pip install -e packages/markitdown
``` ```
## Usage ## Usage
@@ -64,28 +48,6 @@ You can also pipe content:
cat path-to-file.pdf | markitdown cat path-to-file.pdf | markitdown
``` ```
### Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:
```bash
pip install markitdown[pdf, docx, pptx]
```
will install only the dependencies for PDF, DOCX, and PPTX files.
At the moment, the following optional dependencies are available:
* `[all]` Installs all optional dependencies
* `[pptx]` Installs dependencies for PowerPoint files
* `[docx]` Installs dependencies for Word files
* `[xlsx]` Installs dependencies for Excel files
* `[xls]` Installs dependencies for older Excel files
* `[pdf]` Installs dependencies for PDF files
* `[outlook]` Installs dependencies for Outlook messages
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
### Plugins ### Plugins
MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins: MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
@@ -112,6 +74,7 @@ markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoin
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0) More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
### Python API ### Python API
Basic usage in Python: Basic usage in Python:
@@ -134,6 +97,25 @@ result = md.convert("test.pdf")
print(result.text_content) print(result.text_content)
``` ```
MarkItDown also supports converting file objects directly:
```python
from markitdown import MarkItDown
md = MarkItDown()
# Providing the file extension when converting via file objects is recommended for most consistent results
# Binary Mode
with open("test.docx", 'rb') as file:
result = md.convert(file, file_extension=".docx")
print(result.text_content)
# Non-Binary Mode
with open("sample.ipynb", 'rt', encoding="utf-8") as file:
result = md.convert(file, file_extension=".ipynb")
print(result.text_content)
```
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`: To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
```python ```python
@@ -171,10 +153,11 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like. You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.
<div align="center"> <div align="center">
| | All | Especially Needs Help from Community | | | All | Especially Needs Help from Community |
| ---------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | |-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) | | **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) | | **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
@@ -189,7 +172,6 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
``` ```
- Install `hatch` in your environment and run tests: - Install `hatch` in your environment and run tests:
```sh ```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/ pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell hatch shell
@@ -197,7 +179,6 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
``` ```
(Alternative) Use the Devcontainer which has all the dependencies installed: (Alternative) Use the Devcontainer which has all the dependencies installed:
```sh ```sh
# Reopen the project in Devcontainer and run: # Reopen the project in Devcontainer and run:
hatch test hatch test
@@ -209,6 +190,7 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details. You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.
## Trademarks ## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
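The README hunk above adds examples that pass both binary-mode and text-mode file objects to `md.convert(...)`. A minimal sketch of how such input might be normalized into one binary stream follows. The helper name `to_binary_stream` and its `encoding` parameter are assumptions made for illustration and are not part of MarkItDown's API.

```python
import io
from typing import IO, Union


def to_binary_stream(
    file_obj: Union[IO[bytes], IO[str]], encoding: str = "utf-8"
) -> io.BytesIO:
    """Hypothetical helper: copy a binary- or text-mode file object into a BytesIO."""
    data = file_obj.read()
    if isinstance(data, str):
        # Text-mode handles return str, so encode back to bytes.
        data = data.encode(encoding)
    return io.BytesIO(data)


# Both kinds of handle end up as the same in-memory byte stream.
from_binary = to_binary_stream(io.BytesIO(b"# Title\n"))
from_text = to_binary_stream(io.StringIO("# Title\n"))
```

Reading the whole input into memory keeps the sketch short; a real implementation may prefer to stream large files.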

View File

@@ -10,38 +10,23 @@ This project shows how to create a sample plugin for MarkItDown. The most import
Next, implement your custom DocumentConverter: Next, implement your custom DocumentConverter:
```python ```python
from typing import BinaryIO, Any from typing import Union
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo from markitdown import DocumentConverter, DocumentConverterResult
class RtfConverter(DocumentConverter): class RtfConverter(DocumentConverter):
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not an RTF file
extension = kwargs.get("file_extension", "")
if extension.lower() != ".rtf":
return None
def __init__( # Implement the conversion logic here ...
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
):
super().__init__(priority=priority)
def accepts( # Return the result
self, return DocumentConverterResult(
file_stream: BinaryIO, title=title,
stream_info: StreamInfo, text_content=text_content,
**kwargs: Any, )
) -> bool:
# Implement logic to check if the file stream is an RTF file
# ...
raise NotImplementedError()
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
# Implement logic to convert the file stream to Markdown
# ...
raise NotImplementedError()
``` ```
Next, make sure your package implements and exports the following: Next, make sure your package implements and exports the following:
@@ -86,10 +71,10 @@ Once the plugin package is installed, verify that it is available to MarkItDown
markitdown --list-plugins markitdown --list-plugins
``` ```
To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert an RTF file: To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert a PDF:
```bash ```bash
markitdown --use-plugins path-to-file.rtf markitdown --use-plugins path-to-file.pdf
``` ```
In Python, plugins can be enabled as follows: In Python, plugins can be enabled as follows:
@@ -98,7 +83,7 @@ In Python, plugins can be enabled as follows:
from markitdown import MarkItDown from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True) md = MarkItDown(enable_plugins=True)
result = md.convert("path-to-file.rtf") result = md.convert("path-to-file.pdf")
print(result.text_content) print(result.text_content)
``` ```

View File

@@ -24,7 +24,7 @@ classifiers = [
"Programming Language :: Python :: Implementation :: PyPy", "Programming Language :: Python :: Implementation :: PyPy",
] ]
dependencies = [ dependencies = [
"markitdown>=0.1.0a1", "markitdown",
"striprtf", "striprtf",
] ]

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com> # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
# #
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
__version__ = "0.1.0a1" __version__ = "0.0.1a2"

View File

@@ -1,26 +1,12 @@
import locale from typing import Union
from typing import BinaryIO, Any
from striprtf.striprtf import rtf_to_text from striprtf.striprtf import rtf_to_text
from markitdown import ( from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
MarkItDown,
DocumentConverter,
DocumentConverterResult,
StreamInfo,
)
__plugin_interface_version__ = ( __plugin_interface_version__ = (
1 # The version of the plugin interface that this plugin uses 1 # The version of the plugin interface that this plugin uses
) )
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/rtf",
"application/rtf",
]
ACCEPTED_FILE_EXTENSIONS = [".rtf"]
def register_converters(markitdown: MarkItDown, **kwargs): def register_converters(markitdown: MarkItDown, **kwargs):
""" """
@@ -36,36 +22,18 @@ class RtfConverter(DocumentConverter):
Converts an RTF file to in the simplest possible way. Converts an RTF file to in the simplest possible way.
""" """
def accepts( def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
self, # Bail if not a RTF
file_stream: BinaryIO, extension = kwargs.get("file_extension", "")
stream_info: StreamInfo, if extension.lower() != ".rtf":
**kwargs: Any, return None
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS: # Read the RTF file
return True with open(local_path, "r") as f:
rtf = f.read()
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
# Read the file stream into a str using the provided charset encoding, or the system default
encoding = stream_info.charset or locale.getpreferredencoding()
stream_data = file_stream.read().decode(encoding)
# Return the result # Return the result
return DocumentConverterResult( return DocumentConverterResult(
title=None, title=None,
markdown=rtf_to_text(stream_data), text_content=rtf_to_text(rtf),
) )
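The stream-based side of this hunk decides whether to handle a file from its extension and MIME type. The sketch below isolates that check so it can be run without MarkItDown installed. `FakeStreamInfo` is a hypothetical stand-in that carries only the two fields used here, and the free function `accepts` mirrors the method body shown in the diff.

```python
from dataclasses import dataclass
from typing import Optional

ACCEPTED_MIME_TYPE_PREFIXES = ["text/rtf", "application/rtf"]
ACCEPTED_FILE_EXTENSIONS = [".rtf"]


@dataclass(frozen=True)
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo, with only the fields used here."""

    mimetype: Optional[str] = None
    extension: Optional[str] = None


def accepts(stream_info: FakeStreamInfo) -> bool:
    # Normalize missing values to empty strings and compare case-insensitively.
    mimetype = (stream_info.mimetype or "").lower()
    extension = (stream_info.extension or "").lower()
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True
    # Prefix matching tolerates parameters such as "; charset=utf-8".
    return any(mimetype.startswith(p) for p in ACCEPTED_MIME_TYPE_PREFIXES)
```

For example, `accepts(FakeStreamInfo(extension=".RTF"))` is true, while a `.txt` file typed as `text/plain` is rejected.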

View File

@@ -2,7 +2,7 @@
import os import os
import pytest import pytest
from markitdown import MarkItDown, StreamInfo from markitdown import MarkItDown
from markitdown_sample_plugin import RtfConverter from markitdown_sample_plugin import RtfConverter
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files") TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
@@ -15,13 +15,9 @@ RTF_TEST_STRINGS = {
def test_converter() -> None: def test_converter() -> None:
"""Tests the RTF converter directly.""" """Tests the RTF converter directly."""
with open(os.path.join(TEST_FILES_DIR, "test.rtf"), "rb") as file_stream:
converter = RtfConverter() converter = RtfConverter()
result = converter.convert( result = converter.convert(
file_stream=file_stream, os.path.join(TEST_FILES_DIR, "test.rtf"), file_extension=".rtf"
stream_info=StreamInfo(
mimetype="text/rtf", extension=".rtf", filename="test.rtf"
),
) )
for test_string in RTF_TEST_STRINGS: for test_string in RTF_TEST_STRINGS:
@@ -30,7 +26,7 @@ def test_converter() -> None:
def test_markitdown() -> None: def test_markitdown() -> None:
"""Tests that MarkItDown correctly loads the plugin.""" """Tests that MarkItDown correctly loads the plugin."""
md = MarkItDown(enable_plugins=True) md = MarkItDown()
result = md.convert(os.path.join(TEST_FILES_DIR, "test.rtf")) result = md.convert(os.path.join(TEST_FILES_DIR, "test.rtf"))
for test_string in RTF_TEST_STRINGS: for test_string in RTF_TEST_STRINGS:

View File

@@ -10,7 +10,7 @@
From PyPI: From PyPI:
```bash ```bash
pip install markitdown[all] pip install markitdown
``` ```
From source: From source:
@@ -18,7 +18,7 @@ From source:
```bash ```bash
git clone git@github.com:microsoft/markitdown.git git clone git@github.com:microsoft/markitdown.git
cd markitdown cd markitdown
pip install -e packages/markitdown[all] pip install -e packages/markitdown
``` ```
## Usage ## Usage

View File

@@ -26,36 +26,25 @@ classifiers = [
dependencies = [ dependencies = [
"beautifulsoup4", "beautifulsoup4",
"requests", "requests",
"markdownify",
"puremagic",
"pathvalidate",
"charset-normalizer",
]
[project.optional-dependencies]
all = [
"python-pptx",
"mammoth", "mammoth",
"markdownify",
"numpy",
"python-pptx",
"pandas", "pandas",
"openpyxl", "openpyxl",
"xlrd", "xlrd",
"pdfminer.six", "pdfminer.six",
"olefile", "puremagic",
"pydub", "pydub",
"SpeechRecognition", "olefile",
"youtube-transcript-api", "youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
"charset-normalizer",
"openai",
"azure-ai-documentintelligence", "azure-ai-documentintelligence",
"azure-identity" "azure-identity"
] ]
pptx = ["python-pptx"]
docx = ["mammoth"]
xlsx = ["pandas", "openpyxl"]
xls = ["pandas", "xlrd"]
pdf = ["pdfminer.six"]
outlook = ["olefile"]
audio-transcription = ["pydub", "SpeechRecognition"]
youtube-transcription = ["youtube-transcript-api"]
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
[project.urls] [project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme" Documentation = "https://github.com/microsoft/markitdown#readme"
@@ -68,24 +57,12 @@ path = "src/markitdown/__about__.py"
[project.scripts] [project.scripts]
markitdown = "markitdown.__main__:main" markitdown = "markitdown.__main__:main"
[tool.hatch.envs.default]
features = ["all"]
[tool.hatch.envs.hatch-test]
features = ["all"]
extra-dependencies = [
"openai",
]
[tool.hatch.envs.types] [tool.hatch.envs.types]
features = ["all"]
extra-dependencies = [ extra-dependencies = [
"openai",
"mypy>=1.0.0", "mypy>=1.0.0",
] ]
[tool.hatch.envs.types.scripts] [tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive --ignore-missing-imports {args:src/markitdown tests}" check = "mypy --install-types --non-interactive {args:src/markitdown tests}"
[tool.coverage.run] [tool.coverage.run]
source_pkgs = ["markitdown", "tests"] source_pkgs = ["markitdown", "tests"]

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com> # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
# #
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
__version__ = "0.1.0a1" __version__ = "0.0.2a1"

View File

@@ -3,20 +3,14 @@
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
from .__about__ import __version__ from .__about__ import __version__
from ._markitdown import ( from ._markitdown import MarkItDown
MarkItDown,
PRIORITY_SPECIFIC_FILE_FORMAT,
PRIORITY_GENERIC_FILE_FORMAT,
)
from ._base_converter import DocumentConverterResult, DocumentConverter
from ._stream_info import StreamInfo
from ._exceptions import ( from ._exceptions import (
MarkItDownException, MarkItDownException,
MissingDependencyException, ConverterPrerequisiteException,
FailedConversionAttempt,
FileConversionException, FileConversionException,
UnsupportedFormatException, UnsupportedFormatException,
) )
from .converters import DocumentConverter, DocumentConverterResult
__all__ = [ __all__ = [
"__version__", "__version__",
@@ -24,11 +18,7 @@ __all__ = [
"DocumentConverter", "DocumentConverter",
"DocumentConverterResult", "DocumentConverterResult",
"MarkItDownException", "MarkItDownException",
"MissingDependencyException", "ConverterPrerequisiteException",
"FailedConversionAttempt",
"FileConversionException", "FileConversionException",
"UnsupportedFormatException", "UnsupportedFormatException",
"StreamInfo",
"PRIORITY_SPECIFIC_FILE_FORMAT",
"PRIORITY_GENERIC_FILE_FORMAT",
] ]

View File

@@ -1,108 +0,0 @@
import os
import tempfile
from warnings import warn
from typing import Any, Union, BinaryIO, Optional, List
from ._stream_info import StreamInfo
class DocumentConverterResult:
"""The result of converting a document to Markdown."""
def __init__(
self,
markdown: str,
*,
title: Optional[str] = None,
):
"""
Initialize the DocumentConverterResult.
The only required parameter is the converted Markdown text.
The title, and any other metadata that may be added in the future, are optional.
Parameters:
- markdown: The converted Markdown text.
- title: Optional title of the document.
"""
self.markdown = markdown
self.title = title
@property
def text_content(self) -> str:
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
return self.markdown
@text_content.setter
def text_content(self, markdown: str):
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
self.markdown = markdown
def __str__(self) -> str:
"""Return the converted Markdown text."""
return self.markdown
class DocumentConverter:
"""Abstract superclass of all DocumentConverters."""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Return a quick determination on if the converter should attempt converting the document.
This is primarily based on `stream_info` (typically, `stream_info.mimetype`, `stream_info.extension`).
In cases where the data is retrieved via HTTP, the `stream_info.url` might also be referenced to
make a determination (e.g., special converters for Wikipedia, YouTube etc).
Finally, it is conceivable that the `stream_info.filename` might be used in cases
where the filename is well-known (e.g., `Dockerfile`, `Makefile`, etc)
NOTE: The method signature is designed to match that of the convert() method. This provides some
assurance that, if accepts() returns True, the convert() method will also be able to handle the document.
IMPORTANT: In rare cases, (e.g., OutlookMsgConverter) we need to read more from the stream to make a final
determination. Read operations inevitably advance the position in file_stream. In these cases, the position
MUST be reset before returning. This is because the convert() method may be called immediately
after accepts(), and will expect the file_stream to be at the original position.
E.g.,
cur_pos = file_stream.tell() # Save the current position
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
file_stream.seek(cur_pos) # Reset the position to the original position
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
Returns:
- bool: True if the converter can handle the document, False otherwise.
"""
raise NotImplementedError(
f"The subclass, {type(self).__name__}, must implement the accepts() method to determine if they can handle the document."
)
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
"""
Convert a document to Markdown text.
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
Returns:
- DocumentConverterResult: The result of the conversion, which includes the title and markdown content.
Raises:
- FileConversionException: If the mimetype is recognized, but the conversion fails for some other reason.
- MissingDependencyException: If the converter requires a dependency that is not installed.
"""
raise NotImplementedError("Subclasses must implement this method")
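The `accepts()` docstring in this removed file requires that any peek at the stream leave the position unchanged. A minimal sketch of that save, read, restore pattern is shown below using an in-memory stream. The helper names `peek` and `looks_like_rtf` are hypothetical and chosen only for illustration.

```python
import io
from typing import BinaryIO


def peek(file_stream: BinaryIO, size: int = 100) -> bytes:
    """Read up to `size` bytes, then restore the original stream position."""
    cur_pos = file_stream.tell()  # Save the current position
    try:
        return file_stream.read(size)
    finally:
        # Restore the position even if read() raises.
        file_stream.seek(cur_pos)


def looks_like_rtf(file_stream: BinaryIO) -> bool:
    """Hypothetical accepts()-style check: RTF documents start with an RTF header."""
    return peek(file_stream, 5) == b"{\\rtf"


stream = io.BytesIO(b"{\\rtf1\\ansi Hello}")
accepted = looks_like_rtf(stream)
position_after = stream.tell()
```

Wrapping the restore in `finally` is a design choice of this sketch; it keeps the contract intact when the read fails partway.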

View File

@@ -1,14 +1,4 @@
from typing import Optional, List, Any class MarkItDownException(BaseException):
MISSING_DEPENDENCY_MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency [{feature}] or [all] when installing MarkItDown. For example:
* pip install markitdown[{feature}]
* pip install markitdown[all]
* pip install markitdown[{feature}, ...]
* etc."""
class MarkItDownException(Exception):
""" """
Base exception class for MarkItDown. Base exception class for MarkItDown.
""" """
@@ -16,16 +6,24 @@ class MarkItDownException(Exception):
pass pass
class MissingDependencyException(MarkItDownException): class ConverterPrerequisiteException(MarkItDownException):
""" """
Converters shipped with MarkItDown may depend on optional Thrown when instantiating a DocumentConverter in cases where
dependencies. This exception is thrown when a converter's a required library or dependency is not installed, an API key
convert() method is called, but the required dependency is not is not set, or some other prerequisite is not met.
installed. This is not necessarily a fatal error, as the converter
will simply be skipped (an error will bubble up only if no other
suitable converter is found).
Error messages should clearly indicate which dependency is missing. This is not necessarily a fatal error. If thrown during
MarkItDown's plugin loading phase, the converter will simply be
skipped, and a warning will be issued.
"""
pass
class FileConversionException(MarkItDownException):
"""
Thrown when a suitable converter was found, but the conversion
process fails for any reason.
""" """
pass pass
@@ -37,40 +35,3 @@ class UnsupportedFormatException(MarkItDownException):
""" """
pass pass
class FailedConversionAttempt(object):
"""
Represents a single attempt to convert a file.
"""
def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
self.converter = converter
self.exc_info = exc_info
class FileConversionException(MarkItDownException):
"""
Thrown when a suitable converter was found, but the conversion
process fails for any reason.
"""
def __init__(
self,
message: Optional[str] = None,
attempts: Optional[List[FailedConversionAttempt]] = None,
):
self.attempts = attempts
if message is None:
if attempts is None:
message = "File conversion failed."
else:
message = f"File conversion failed after {len(attempts)} attempts:\n"
for attempt in attempts:
if attempt.exc_info is None:
message += f" - {type(attempt.converter).__name__} provided no execution info.\n"
else:
message += f" - {type(attempt.converter).__name__} threw {attempt.exc_info[0].__name__} with message: {attempt.exc_info[1]}\n"
super().__init__(message)
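The removed `FileConversionException.__init__` builds one summary message from a list of failed attempts. The sketch below reproduces that aggregation in isolation so the resulting text can be inspected. `Attempt` and `summarize` are hypothetical stand-ins, as are the two empty converter classes; every line is formatted with an f-string and the lines are joined with newlines.

```python
import sys
from typing import Any, List, Optional


class Attempt:
    """Hypothetical stand-in for FailedConversionAttempt."""

    def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
        self.converter = converter
        self.exc_info = exc_info


def summarize(attempts: List[Attempt]) -> str:
    lines = [f"File conversion failed after {len(attempts)} attempts:"]
    for attempt in attempts:
        name = type(attempt.converter).__name__
        if attempt.exc_info is None:
            lines.append(f" - {name} provided no execution info.")
        else:
            exc_type, exc_value = attempt.exc_info[0], attempt.exc_info[1]
            lines.append(
                f" - {name} threw {exc_type.__name__} with message: {exc_value}"
            )
    return "\n".join(lines)


class PdfConverter:  # placeholder converter, for illustration only
    pass


class HtmlConverter:  # placeholder converter, for illustration only
    pass


try:
    raise ValueError("bad header")
except ValueError:
    failed = Attempt(PdfConverter(), sys.exc_info())

message = summarize([failed, Attempt(HtmlConverter())])
```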

View File

@@ -2,25 +2,23 @@ import copy
import mimetypes import mimetypes
import os import os
import re import re
import sys
import tempfile import tempfile
import warnings import warnings
import traceback import traceback
import io
from dataclasses import dataclass
from importlib.metadata import entry_points from importlib.metadata import entry_points
from typing import Any, List, Optional, Union, BinaryIO from typing import Any, List, Optional, Union
from pathlib import Path from pathlib import Path
from urllib.parse import urlparse from urllib.parse import urlparse
from warnings import warn from warnings import warn
from io import BufferedIOBase, TextIOBase, BytesIO
# File-format detection # File-format detection
import puremagic import puremagic
import requests import requests
from ._stream_info import StreamInfo, _guess_stream_info_from_stream
from .converters import ( from .converters import (
DocumentConverter,
DocumentConverterResult,
PlainTextConverter, PlainTextConverter,
HtmlConverter, HtmlConverter,
RssConverter, RssConverter,
@@ -34,34 +32,27 @@ from .converters import (
XlsConverter, XlsConverter,
PptxConverter, PptxConverter,
ImageConverter, ImageConverter,
AudioConverter, WavConverter,
Mp3Converter,
OutlookMsgConverter, OutlookMsgConverter,
ZipConverter, ZipConverter,
DocumentIntelligenceConverter, DocumentIntelligenceConverter,
ConverterInput,
) )
from ._base_converter import DocumentConverter, DocumentConverterResult
from ._exceptions import ( from ._exceptions import (
FileConversionException, FileConversionException,
UnsupportedFormatException, UnsupportedFormatException,
FailedConversionAttempt, ConverterPrerequisiteException,
) )
# Override mimetype for csv to fix issue on windows
mimetypes.add_type("text/csv", ".csv")
# Lower priority values are tried first. _plugins: Union[None | List[Any]] = None
PRIORITY_SPECIFIC_FILE_FORMAT = (
0.0 # e.g., .docx, .pdf, .xlsx, Or specific pages, e.g., wikipedia
)
PRIORITY_GENERIC_FILE_FORMAT = (
10.0 # Near catch-all converters for mimetypes like text/*, etc.
)
_plugins: Union[None, List[Any]] = None # If None, plugins have not been loaded yet. def _load_plugins() -> Union[None | List[Any]]:
def _load_plugins() -> Union[None, List[Any]]:
"""Lazy load plugins, exiting early if already loaded.""" """Lazy load plugins, exiting early if already loaded."""
global _plugins global _plugins
@@ -81,14 +72,6 @@ def _load_plugins() -> Union[None, List[Any]]:
return _plugins return _plugins
@dataclass(kw_only=True, frozen=True)
class ConverterRegistration:
"""A registration of a converter with its priority and other metadata."""
converter: DocumentConverter
priority: float
class MarkItDown: class MarkItDown:
"""(In preview) An extremely simple text-based document reader, suitable for LLM use. """(In preview) An extremely simple text-based document reader, suitable for LLM use.
This reader will convert common file-types or webpages to Markdown.""" This reader will convert common file-types or webpages to Markdown."""
@@ -110,13 +93,13 @@ class MarkItDown:
self._requests_session = requests_session self._requests_session = requests_session
# TODO - remove these (see enable_builtins) # TODO - remove these (see enable_builtins)
self._llm_client: Any = None self._llm_client = None
self._llm_model: Union[str | None] = None self._llm_model = None
self._exiftool_path: Union[str | None] = None self._exiftool_path = None
self._style_map: Union[str | None] = None self._style_map = None
# Register the converters # Register the converters
self._converters: List[ConverterRegistration] = [] self._page_converters: List[DocumentConverter] = []
if ( if (
enable_builtins is None or enable_builtins enable_builtins is None or enable_builtins
@@ -144,15 +127,9 @@ class MarkItDown:
# Register converters for successful browsing operations # Register converters for successful browsing operations
# Later registrations are tried first / take higher priority than earlier registrations # Later registrations are tried first / take higher priority than earlier registrations
# To this end, the most specific converters should appear below the most generic converters # To this end, the most specific converters should appear below the most generic converters
self.register_converter( self.register_converter(PlainTextConverter())
PlainTextConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT self.register_converter(ZipConverter())
) self.register_converter(HtmlConverter())
self.register_converter(
ZipConverter(markitdown=self), priority=PRIORITY_GENERIC_FILE_FORMAT
)
self.register_converter(
HtmlConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
)
self.register_converter(RssConverter()) self.register_converter(RssConverter())
self.register_converter(WikipediaConverter()) self.register_converter(WikipediaConverter())
self.register_converter(YouTubeConverter()) self.register_converter(YouTubeConverter())
@@ -161,7 +138,8 @@ class MarkItDown:
self.register_converter(XlsxConverter()) self.register_converter(XlsxConverter())
self.register_converter(XlsConverter()) self.register_converter(XlsConverter())
self.register_converter(PptxConverter()) self.register_converter(PptxConverter())
self.register_converter(AudioConverter()) self.register_converter(WavConverter())
self.register_converter(Mp3Converter())
self.register_converter(ImageConverter()) self.register_converter(ImageConverter())
self.register_converter(IpynbConverter()) self.register_converter(IpynbConverter())
self.register_converter(PdfConverter()) self.register_converter(PdfConverter())
@@ -186,9 +164,7 @@ class MarkItDown:
""" """
if not self._plugins_enabled: if not self._plugins_enabled:
# Load plugins # Load plugins
plugins = _load_plugins() for plugin in _load_plugins():
assert plugins is not None
for plugin in plugins:
try: try:
plugin.register_converters(self, **kwargs) plugin.register_converters(self, **kwargs)
except Exception: except Exception:
@@ -200,18 +176,14 @@ class MarkItDown:
def convert( def convert(
self, self,
source: Union[str, requests.Response, Path, BinaryIO], source: Union[str, requests.Response, Path, BufferedIOBase, TextIOBase],
*,
stream_info: Optional[StreamInfo] = None,
**kwargs: Any, **kwargs: Any,
) -> DocumentConverterResult: # TODO: deal with kwargs ) -> DocumentConverterResult: # TODO: deal with kwargs
""" """
Args: Args:
- source: can be a path (str or Path), url, or a requests.response object - source: can be a path (str or pathlib.Path), a url, a requests.Response object, or a file object (TextIO or BinaryIO)
- stream_info: optional stream info to use for the conversion. If None, infer from source - file_extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
- kwargs: additional arguments to pass to the converter
""" """
# Local path or url # Local path or url
if isinstance(source, str): if isinstance(source, str):
if ( if (
@@ -221,120 +193,96 @@ class MarkItDown:
): ):
return self.convert_url(source, **kwargs) return self.convert_url(source, **kwargs)
else: else:
return self.convert_local(source, stream_info=stream_info, **kwargs) return self.convert_local(source, **kwargs)
# Path object
elif isinstance(source, Path):
return self.convert_local(source, stream_info=stream_info, **kwargs)
# Request response # Request response
elif isinstance(source, requests.Response): elif isinstance(source, requests.Response):
return self.convert_response(source, **kwargs) return self.convert_response(source, **kwargs)
# Binary stream elif isinstance(source, Path):
elif ( return self.convert_local(source, **kwargs)
hasattr(source, "read") # File object
and callable(source.read) elif isinstance(source, BufferedIOBase) or isinstance(source, TextIOBase):
and not isinstance(source, io.TextIOBase) return self.convert_file_object(source, **kwargs)
):
return self.convert_stream(source, **kwargs)
else:
raise TypeError(
f"Invalid source type: {type(source)}. Expected str, requests.Response, BinaryIO."
)
def convert_local( def convert_local(
self, self, path: Union[str, Path], **kwargs: Any
path: Union[str, Path], ) -> DocumentConverterResult: # TODO: deal with kwargs
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None, # Deprecated -- use stream_info
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
if isinstance(path, Path): if isinstance(path, Path):
path = str(path) path = str(path)
# Prepare a list of extensions to try (in order of priority)
ext = kwargs.get("file_extension")
extensions = [ext] if ext is not None else []
# Build a base StreamInfo object from which to start guesses # Get extension alternatives from the path and puremagic
base_stream_info = StreamInfo( base, ext = os.path.splitext(path)
local_path=path, self._append_ext(extensions, ext)
extension=os.path.splitext(path)[1],
filename=os.path.basename(path),
)
# Extend the base_stream_info with any additional info from the arguments for g in self._guess_ext_magic(source=path):
if stream_info is not None: self._append_ext(extensions, g)
base_stream_info = base_stream_info.copy_and_update(stream_info)
if file_extension is not None: # Create the ConverterInput object
# Deprecated -- use stream_info input = ConverterInput(input_type="filepath", filepath=path)
base_stream_info = base_stream_info.copy_and_update(
extension=file_extension
)
if url is not None: # Convert
# Deprecated -- use stream_info return self._convert(input, extensions, **kwargs)
base_stream_info = base_stream_info.copy_and_update(url=url)
with open(path, "rb") as fh: def convert_file_object(
# Prepare a list of configurations to try, starting with the base_stream_info self, file_object: Union[BufferedIOBase, TextIOBase], **kwargs: Any
guesses: List[StreamInfo] = [base_stream_info] ) -> DocumentConverterResult: # TODO: deal with kwargs
for guess in _guess_stream_info_from_stream( # Prepare a list of extensions to try (in order of priority)
file_stream=fh, filename_hint=path ext = kwargs.get("file_extension")
): extensions = [ext] if ext is not None else []
guesses.append(base_stream_info.copy_and_update(guess))
return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
# TODO: Currently, there are some ongoing issues with passing direct file objects to puremagic (incorrect guesses, unsupported file type errors, etc.)
# Only use puremagic as a last resort if no extensions were provided
if extensions == []:
for g in self._guess_ext_magic(source=file_object):
self._append_ext(extensions, g)
# Create the ConverterInput object
input = ConverterInput(input_type="object", file_object=file_object)
# Convert
return self._convert(input, extensions, **kwargs)
# TODO what should stream's type be?
def convert_stream( def convert_stream(
self, self, stream: Any, **kwargs: Any
stream: BinaryIO, ) -> DocumentConverterResult: # TODO: deal with kwargs
*, # Prepare a list of extensions to try (in order of priority)
stream_info: Optional[StreamInfo] = None, ext = kwargs.get("file_extension")
file_extension: Optional[str] = None, # Deprecated -- use stream_info extensions = [ext] if ext is not None else []
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
guesses: List[StreamInfo] = []
# Do we have anything on which to base a guess? # Save the file locally to a temporary file. It will be deleted before this method exits
base_guess = None handle, temp_path = tempfile.mkstemp()
if stream_info is not None or file_extension is not None or url is not None: fh = os.fdopen(handle, "wb")
# Start with a non-Null base guess result = None
if stream_info is None: try:
base_guess = StreamInfo() # Write to the temporary file
content = stream.read()
if isinstance(content, str):
fh.write(content.encode("utf-8"))
else: else:
base_guess = stream_info fh.write(content)
fh.close()
if file_extension is not None: # Use puremagic to check for more extension options
# Deprecated -- use stream_info for g in self._guess_ext_magic(source=temp_path):
assert base_guess is not None # for mypy self._append_ext(extensions, g)
base_guess = base_guess.copy_and_update(extension=file_extension)
if url is not None: # Create the ConverterInput object
# Deprecated -- use stream_info input = ConverterInput(input_type="filepath", filepath=temp_path)
assert base_guess is not None # for mypy
base_guess = base_guess.copy_and_update(url=url)
# Append the base guess, if it's non-trivial # Convert
if base_guess is not None: result = self._convert(input, extensions, **kwargs)
if base_guess.mimetype is not None or base_guess.extension is not None: # Clean up
guesses.append(base_guess) finally:
else: try:
# Create a base guess with no information fh.close()
base_guess = StreamInfo() except Exception:
pass
os.unlink(temp_path)
# Create a placeholder filename to help with guessing return result
placeholder_filename = None
if base_guess.filename is not None:
placeholder_filename = base_guess.filename
elif base_guess.extension is not None:
placeholder_filename = "placeholder" + base_guess.extension
# Add guesses based on stream content
for guess in _guess_stream_info_from_stream(
file_stream=stream, filename_hint=placeholder_filename
):
guesses.append(base_guess.copy_and_update(guess))
# Perform the conversion
return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
def convert_url( def convert_url(
self, url: str, **kwargs: Any self, url: str, **kwargs: Any
@@ -345,118 +293,76 @@ class MarkItDown:
return self.convert_response(response, **kwargs) return self.convert_response(response, **kwargs)
def convert_response( def convert_response(
self, self, response: requests.Response, **kwargs: Any
response: requests.Response, ) -> DocumentConverterResult: # TODO fix kwargs type
*, # Prepare a list of extensions to try (in order of priority)
stream_info: Optional[StreamInfo] = None, ext = kwargs.get("file_extension")
file_extension: Optional[str] = None, # Deprecated -- use stream_info extensions = [ext] if ext is not None else []
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
# If there is a content-type header, get the mimetype and charset (if present)
mimetype: Optional[str] = None
charset: Optional[str] = None
if "content-type" in response.headers: # Guess from the mimetype
parts = response.headers["content-type"].split(";") content_type = response.headers.get("content-type", "").split(";")[0]
mimetype = parts.pop(0).strip() self._append_ext(extensions, mimetypes.guess_extension(content_type))
for part in parts:
if part.strip().startswith("charset="):
_charset = part.split("=")[1].strip()
if len(_charset) > 0:
charset = _charset
# If there is a content-disposition header, get the filename and possibly the extension # Read the content disposition if there is one
filename: Optional[str] = None content_disposition = response.headers.get("content-disposition", "")
extension: Optional[str] = None m = re.search(r"filename=([^;]+)", content_disposition)
if "content-disposition" in response.headers:
m = re.search(r"filename=([^;]+)", response.headers["content-disposition"])
if m: if m:
filename = m.group(1).strip("\"'") base, ext = os.path.splitext(m.group(1).strip("\"'"))
_, _extension = os.path.splitext(filename) self._append_ext(extensions, ext)
if len(_extension) > 0:
extension = _extension
# If there is still no filename, try to read it from the url # Read the extension from the path
if filename is None: base, ext = os.path.splitext(urlparse(response.url).path)
parsed_url = urlparse(response.url) self._append_ext(extensions, ext)
_, _extension = os.path.splitext(parsed_url.path)
if len(_extension) > 0: # Looks like this might be a file!
filename = os.path.basename(parsed_url.path)
extension = _extension
# Create an initial guess from all this information # Save the file locally to a temporary file. It will be deleted before this method exits
base_guess = StreamInfo( handle, temp_path = tempfile.mkstemp()
mimetype=mimetype, fh = os.fdopen(handle, "wb")
charset=charset, result = None
filename=filename, try:
extension=extension, # Download the file
url=response.url,
)
# Update with any additional info from the arguments
if stream_info is not None:
base_guess = base_guess.copy_and_update(stream_info)
if file_extension is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(extension=file_extension)
if url is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(url=url)
# Add the guess if it's non-trivial
guesses: List[StreamInfo] = []
if base_guess.mimetype is not None or base_guess.extension is not None:
guesses.append(base_guess)
# Read into BytesIO
buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=512): for chunk in response.iter_content(chunk_size=512):
buffer.write(chunk) fh.write(chunk)
buffer.seek(0) fh.close()
# Create a placeholder filename to help with guessing # Use puremagic to check for more extension options
placeholder_filename = None for g in self._guess_ext_magic(source=temp_path):
if base_guess.filename is not None: self._append_ext(extensions, g)
placeholder_filename = base_guess.filename
elif base_guess.extension is not None:
placeholder_filename = "placeholder" + base_guess.extension
# Add guesses based on stream content # Create the ConverterInput object
for guess in _guess_stream_info_from_stream( input = ConverterInput(input_type="filepath", filepath=temp_path)
file_stream=buffer, filename_hint=placeholder_filename
):
guesses.append(base_guess.copy_and_update(guess))
# Convert # Convert
return self._convert(file_stream=buffer, stream_info_guesses=guesses, **kwargs) result = self._convert(input, extensions, url=response.url, **kwargs)
# Clean up
finally:
try:
fh.close()
except Exception:
pass
os.unlink(temp_path)
return result
def _convert( def _convert(
self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs self, input: ConverterInput, extensions: List[Union[str, None]], **kwargs
) -> DocumentConverterResult: ) -> DocumentConverterResult:
res: Union[None, DocumentConverterResult] = None error_trace = ""
# Keep track of which converters throw exceptions
failed_attempts: List[FailedConversionAttempt] = []
# Create a copy of the page_converters list, sorted by priority. # Create a copy of the page_converters list, sorted by priority.
# We do this with each call to _convert because the priority of converters may change between calls. # We do this with each call to _convert because the priority of converters may change between calls.
# The sort is guaranteed to be stable, so converters with the same priority will remain in the same order. # The sort is guaranteed to be stable, so converters with the same priority will remain in the same order.
sorted_registrations = sorted(self._converters, key=lambda x: x.priority) sorted_converters = sorted(self._page_converters, key=lambda x: x.priority)
# Remember the initial stream position so that we can return to it
cur_pos = file_stream.tell()
for stream_info in stream_info_guesses + [StreamInfo()]:
for converter_registration in sorted_registrations:
converter = converter_registration.converter
# Sanity check -- make sure the cur_pos is still the same
assert (
cur_pos == file_stream.tell()
), f"File stream position should NOT change between guess iterations"
for ext in extensions + [None]: # Try last with no extension
for converter in sorted_converters:
_kwargs = copy.deepcopy(kwargs) _kwargs = copy.deepcopy(kwargs)
# Overwrite file_extension appropriately
if ext is None:
if "file_extension" in _kwargs:
del _kwargs["file_extension"]
else:
_kwargs.update({"file_extension": ext})
# Copy any additional global options # Copy any additional global options
if "llm_client" not in _kwargs and self._llm_client is not None: if "llm_client" not in _kwargs and self._llm_client is not None:
_kwargs["llm_client"] = self._llm_client _kwargs["llm_client"] = self._llm_client
@@ -471,40 +377,13 @@ class MarkItDown:
_kwargs["exiftool_path"] = self._exiftool_path _kwargs["exiftool_path"] = self._exiftool_path
# Add the list of converters for nested processing # Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._converters _kwargs["_parent_converters"] = self._page_converters
# Add legacy kwargs # If we hit an error, log it and keep trying
if stream_info is not None:
if stream_info.extension is not None:
_kwargs["file_extension"] = stream_info.extension
if stream_info.url is not None:
_kwargs["url"] = stream_info.url
# Check if the converter will accept the file, and if so, try to convert it
_accepts = False
try: try:
_accepts = converter.accepts(file_stream, stream_info, **_kwargs) res = converter.convert(input, **_kwargs)
except NotImplementedError:
pass
# accept() should not have changed the file stream position
assert (
cur_pos == file_stream.tell()
), f"{type(converter).__name__}.accept() should NOT change the file_stream position"
# Attempt the conversion
if _accepts:
try:
res = converter.convert(file_stream, stream_info, **_kwargs)
except Exception: except Exception:
failed_attempts.append( error_trace = ("\n\n" + traceback.format_exc()).strip()
FailedConversionAttempt(
converter=converter, exc_info=sys.exc_info()
)
)
finally:
file_stream.seek(cur_pos)
if res is not None: if res is not None:
# Normalize the content # Normalize the content
@@ -512,17 +391,81 @@ class MarkItDown:
[line.rstrip() for line in re.split(r"\r?\n", res.text_content)] [line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
) )
res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content) res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)
# Todo
return res return res
# If we got this far without success, report any exceptions # If we got this far without success, report any exceptions
if len(failed_attempts) > 0: if len(error_trace) > 0:
raise FileConversionException(attempts=failed_attempts) raise FileConversionException(
f"Could not convert '{input.filepath}' to Markdown. File type was recognized as {extensions}. While converting the file, the following error was encountered:\n\n{error_trace}"
)
# Nothing can handle it! # Nothing can handle it!
raise UnsupportedFormatException( raise UnsupportedFormatException(
f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported." f"Could not convert '{input.filepath}' to Markdown. The formats {extensions} are not supported."
) )
def _append_ext(self, extensions, ext):
"""Append a unique non-None, non-empty extension to a list of extensions."""
if ext is None:
return
ext = ext.strip()
if ext == "":
return
# if ext not in extensions:
extensions.append(ext)
def _guess_ext_magic(self, source):
"""Use puremagic (a Python implementation of libmagic) to guess a file's extension based on the first few bytes."""
# Use puremagic to guess
try:
guesses = []
# Guess extensions for filepaths
if isinstance(source, str):
guesses = puremagic.magic_file(source)
# Fix for: https://github.com/microsoft/markitdown/issues/222
# If there are no guesses, then try again after trimming leading ASCII whitespaces.
# ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
# (space, tab, newline, carriage return, vertical tab, form feed).
if len(guesses) == 0:
with open(source, "rb") as file:
while True:
char = file.read(1)
if not char: # End of file
break
if not char.isspace():
file.seek(file.tell() - 1)
break
try:
guesses = puremagic.magic_stream(file)
except puremagic.main.PureError:
pass
# Guess extensions for file objects. Note that puremagic's magic_stream function requires a BytesIO-like file source
# TODO: Figure out how to guess extensions for TextIO-like file sources (manually converting to BytesIO does not work)
elif isinstance(source, BufferedIOBase):
guesses = puremagic.magic_stream(source)
extensions = list()
for g in guesses:
ext = g.extension.strip()
if len(ext) > 0:
if not ext.startswith("."):
ext = "." + ext
if ext not in extensions:
extensions.append(ext)
return extensions
except FileNotFoundError:
pass
except IsADirectoryError:
pass
except PermissionError:
pass
return []
def register_page_converter(self, converter: DocumentConverter) -> None: def register_page_converter(self, converter: DocumentConverter) -> None:
"""DEPRECATED: Use register_converter instead.""" """DEPRECATED: Use register_converter instead."""
warn( warn(
@@ -531,34 +474,6 @@ class MarkItDown:
) )
self.register_converter(converter) self.register_converter(converter)
def register_converter( def register_converter(self, converter: DocumentConverter) -> None:
self, """Register a page text converter."""
converter: DocumentConverter, self._page_converters.insert(0, converter)
*,
priority: float = PRIORITY_SPECIFIC_FILE_FORMAT,
) -> None:
"""
Register a DocumentConverter with a given priority.
Priorities work as follows: By default, most converters get priority
DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exceptions
are the PlainTextConverter, HtmlConverter, and ZipConverter, which get
priority PRIORITY_GENERIC_FILE_FORMAT (== 10), with lower values
being tried first (i.e., higher priority).
Just prior to conversion, the converters are sorted by priority, using
a stable sort. This means that converters with the same priority will
remain in the same order, with the most recently registered converters
appearing first.
We have tight control over the order of built-in converters, but
plugins can register converters in any order. The registration's priority
field reasserts some control over the order of converters.
Plugins can register converters with any priority, to appear before or
after the built-ins. For example, a plugin with priority 9 will run
before the PlainTextConverter, but after the built-in converters.
"""
self._converters.insert(
0, ConverterRegistration(converter=converter, priority=priority)
)
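
The ordering rule described in the `register_converter` docstring can be sketched in isolation. The snippet below is a minimal illustration rather than the library's code: `Registration`, `register`, and the converter names are hypothetical stand-ins, and only the insert-at-front plus stable-sort behaviour is taken from the diff.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical constants mirroring the two priority levels in the diff.
PRIORITY_SPECIFIC = 0.0   # e.g. .docx, .pdf
PRIORITY_GENERIC = 10.0   # near catch-all converters


@dataclass(frozen=True)
class Registration:
    """Illustrative stand-in for a converter registration."""
    name: str
    priority: float


registry: List[Registration] = []


def register(name: str, priority: float = PRIORITY_SPECIFIC) -> None:
    # New registrations go to the front of the list, so later
    # registrations come first among equal priorities.
    registry.insert(0, Registration(name, priority))


register("plain_text", PRIORITY_GENERIC)
register("html", PRIORITY_GENERIC)
register("docx")
register("pdf")
register("plugin", 9.0)  # a plugin that should run just before the generic ones

# sorted() is stable: entries with equal priority keep their list order.
order = [r.name for r in sorted(registry, key=lambda r: r.priority)]
print(order)
```

Under these assumptions the plugin with priority 9 lands after the specific converters and before the generic ones, while `pdf` (registered after `docx`) is tried before it.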


@@ -1,122 +0,0 @@
import puremagic
import mimetypes
import os
from dataclasses import dataclass, asdict
from typing import Optional, BinaryIO, List, TypeVar, Type
# Mimetype substitutions table
MIMETYPE_SUBSTITUTIONS = {
"application/excel": "application/vnd.ms-excel",
"application/mspowerpoint": "application/vnd.ms-powerpoint",
}
@dataclass(kw_only=True, frozen=True)
class StreamInfo:
"""The StreamInfo class is used to store information about a file stream.
All fields can be None, and will depend on how the stream was opened.
"""
mimetype: Optional[str] = None
extension: Optional[str] = None
charset: Optional[str] = None
filename: Optional[
str
] = None # From local path, url, or Content-Disposition header
local_path: Optional[str] = None # If read from disk
url: Optional[str] = None # If read from url
def copy_and_update(self, *args, **kwargs):
"""Copy the StreamInfo object and update it with the given StreamInfo
instance and/or other keyword arguments."""
new_info = asdict(self)
for si in args:
assert isinstance(si, StreamInfo)
new_info.update({k: v for k, v in asdict(si).items() if v is not None})
if len(kwargs) > 0:
new_info.update(kwargs)
return StreamInfo(**new_info)
# Behavior subject to change.
# Do not rely on this outside of this module.
def _guess_stream_info_from_stream(
file_stream: BinaryIO,
*,
filename_hint: Optional[str] = None,
) -> List[StreamInfo]:
"""
Guess StreamInfo properties (mostly mimetype and extension) from a stream.
Args:
- stream: The stream to guess the StreamInfo from.
- filename_hint [Optional]: A filename hint to help with the guessing (may be a placeholder, and not actually be the file name)
Returns a list of StreamInfo objects in order of confidence.
"""
guesses: List[StreamInfo] = []
# Add a guess purely based on the filename hint
if filename_hint:
try:
# Requires Python 3.13+
mimetype, _ = mimetypes.guess_file_type(filename_hint) # type: ignore
except AttributeError:
mimetype, _ = mimetypes.guess_type(filename_hint)
if mimetype:
guesses.append(
StreamInfo(
mimetype=mimetype, extension=os.path.splitext(filename_hint)[1]
)
)
def _puremagic(
file_stream, filename_hint
) -> List[puremagic.main.PureMagicWithConfidence]:
"""Wrap guesses to handle exceptions."""
try:
return puremagic.magic_stream(file_stream, filename=filename_hint)
except puremagic.main.PureError as e:
return []
cur_pos = file_stream.tell()
type_guesses = _puremagic(file_stream, filename_hint=filename_hint)
if len(type_guesses) == 0:
# Fix for: https://github.com/microsoft/markitdown/issues/222
# If there are no guesses, then try again after trimming leading ASCII whitespaces.
# ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
# (space, tab, newline, carriage return, vertical tab, form feed).
# Eat all the leading whitespace
file_stream.seek(cur_pos)
while True:
char = file_stream.read(1)
if not char: # End of file
break
if not char.isspace():
file_stream.seek(file_stream.tell() - 1)
break
# Try again
type_guesses = _puremagic(file_stream, filename_hint=filename_hint)
file_stream.seek(cur_pos)
# Convert and return the guesses
for guess in type_guesses:
kwargs: dict[str, str] = {}
if guess.extension:
kwargs["extension"] = guess.extension
if guess.mime_type:
kwargs["mimetype"] = MIMETYPE_SUBSTITUTIONS.get(
guess.mime_type, guess.mime_type
)
if len(kwargs) > 0:
# We don't add the filename_hint, because sometimes it's just a placeholder,
# and, in any case, doesn't add new information.
guesses.append(StreamInfo(**kwargs))
return guesses
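
The merge semantics of `StreamInfo.copy_and_update` (positional instances contribute only their non-None fields, keyword arguments overwrite unconditionally) can be tried out with a reduced sketch. `Info` and `merge` below are hypothetical names with fewer fields than the real class; they are meant only to show the pattern, not to reproduce the module.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Info:
    """Reduced, illustrative stand-in for StreamInfo."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    url: Optional[str] = None


def merge(base: Info, *others: Info, **overrides: Optional[str]) -> Info:
    merged = asdict(base)
    for other in others:
        # Only fields that are actually set on `other` overwrite the base.
        merged.update({k: v for k, v in asdict(other).items() if v is not None})
    # Keyword overrides are applied last and as given, even if None.
    merged.update(overrides)
    return Info(**merged)


base = Info(extension=".txt", url="https://example.com/a.txt")
guess = Info(mimetype="text/plain")
combined = merge(base, guess, extension=".md")
print(combined)
```

Because the dataclass is frozen, `base` is left untouched and each guess becomes a new object, which is what lets the conversion code build several candidate guesses from one base.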


@@ -2,6 +2,7 @@
# #
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
from ._base import DocumentConverter, DocumentConverterResult
from ._plain_text_converter import PlainTextConverter from ._plain_text_converter import PlainTextConverter
from ._html_converter import HtmlConverter from ._html_converter import HtmlConverter
from ._rss_converter import RssConverter from ._rss_converter import RssConverter
@@ -14,12 +15,16 @@ from ._docx_converter import DocxConverter
from ._xlsx_converter import XlsxConverter, XlsConverter from ._xlsx_converter import XlsxConverter, XlsConverter
from ._pptx_converter import PptxConverter from ._pptx_converter import PptxConverter
from ._image_converter import ImageConverter from ._image_converter import ImageConverter
from ._audio_converter import AudioConverter from ._wav_converter import WavConverter
from ._mp3_converter import Mp3Converter
from ._outlook_msg_converter import OutlookMsgConverter from ._outlook_msg_converter import OutlookMsgConverter
from ._zip_converter import ZipConverter from ._zip_converter import ZipConverter
from ._doc_intel_converter import DocumentIntelligenceConverter from ._doc_intel_converter import DocumentIntelligenceConverter
from ._converter_input import ConverterInput
__all__ = [ __all__ = [
"DocumentConverter",
"DocumentConverterResult",
"PlainTextConverter", "PlainTextConverter",
"HtmlConverter", "HtmlConverter",
"RssConverter", "RssConverter",
@@ -33,8 +38,10 @@ __all__ = [
"XlsConverter", "XlsConverter",
"PptxConverter", "PptxConverter",
"ImageConverter", "ImageConverter",
"AudioConverter", "WavConverter",
"Mp3Converter",
"OutlookMsgConverter", "OutlookMsgConverter",
"ZipConverter", "ZipConverter",
"DocumentIntelligenceConverter", "DocumentIntelligenceConverter",
"ConverterInput",
] ]


@@ -1,102 +0,0 @@
import io
from typing import Any, BinaryIO, Optional
from ._exiftool import exiftool_metadata
from ._transcribe_audio import transcribe_audio
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException
ACCEPTED_MIME_TYPE_PREFIXES = [
"audio/x-wav",
"audio/mpeg",
"video/mp4",
]
ACCEPTED_FILE_EXTENSIONS = [
".wav",
".mp3",
".m4a",
".mp4",
]
class AudioConverter(DocumentConverter):
"""
Converts audio files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
md_content = ""
# Add metadata
metadata = exiftool_metadata(
file_stream, exiftool_path=kwargs.get("exiftool_path")
)
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
# "Duration", -- Wrong values when read from memory
"NumChannels",
"SampleRate",
"AvgBytesPerSec",
"BitsPerSample",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Figure out the audio format for transcription
if stream_info.extension == ".wav" or stream_info.mimetype == "audio/x-wav":
audio_format = "wav"
elif stream_info.extension == ".mp3" or stream_info.mimetype == "audio/mpeg":
audio_format = "mp3"
elif (
stream_info.extension in [".mp4", ".m4a"]
or stream_info.mimetype == "video/mp4"
):
audio_format = "mp4"
else:
audio_format = None
# Transcribe
if audio_format:
try:
transcript = transcribe_audio(file_stream, audio_format=audio_format)
if transcript:
md_content += "\n\n### Audio Transcript:\n" + transcript
except MissingDependencyException:
pass
# Return the result
return DocumentConverterResult(markdown=md_content.strip())


@@ -0,0 +1,63 @@
from typing import Any, Union
class DocumentConverterResult:
"""The result of converting a document to text."""
def __init__(self, title: Union[str, None] = None, text_content: str = ""):
self.title: Union[str, None] = title
self.text_content: str = text_content
class DocumentConverter:
"""Abstract superclass of all DocumentConverters."""
# Lower priority values are tried first.
PRIORITY_SPECIFIC_FILE_FORMAT = (
0.0 # e.g., .docx, .pdf, .xlsx, Or specific pages, e.g., wikipedia
)
PRIORITY_GENERIC_FILE_FORMAT = (
10.0 # Near catch-all converters for mimetypes like text/*, etc.
)
def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
"""
Initialize the DocumentConverter with a given priority.
Priorities work as follows: By default, most converters get priority
DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
is the PlainTextConverter, which gets priority PRIORITY_GENERIC_FILE_FORMAT (== 10),
with lower values being tried first (i.e., higher priority).
Just prior to conversion, the converters are sorted by priority, using
a stable sort. This means that converters with the same priority will
remain in the same order, with the most recently registered converters
appearing first.
We have tight control over the order of built-in converters, but
plugins can register converters in any order. A converter's priority
field reasserts some control over the order of converters.
Plugins can register converters with any priority, to appear before or
after the built-ins. For example, a plugin with priority 9 will run
before the PlainTextConverter, but after the built-in converters.
"""
self._priority = priority
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
raise NotImplementedError("Subclasses must implement this method")
@property
def priority(self) -> float:
"""Priority of the converter in markitdown's converter list. Lower priority values are tried first."""
return self._priority
@priority.setter
def priority(self, value: float):
self._priority = value
@priority.deleter
def priority(self):
raise AttributeError("Cannot delete the priority attribute")
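
The ordering rule described in the `DocumentConverter.__init__` docstring can be sketched in isolation. This is a minimal illustration, not the code MarkItDown itself runs: `FakeConverter`, `order_converters`, and the converter names are hypothetical, and the list is assumed to hold the most recently registered converter first.

```python
from dataclasses import dataclass


@dataclass
class FakeConverter:
    """Hypothetical stand-in for a DocumentConverter; only name and priority matter here."""

    name: str
    priority: float


def order_converters(registered):
    """Return converters in the order they would be tried.

    `registered` is assumed to list the most recently registered converter first.
    sorted() is stable, so converters with equal priority keep that relative order.
    """
    return sorted(registered, key=lambda c: c.priority)


registered = [
    FakeConverter("plugin_9", 9.0),      # plugin registered last, priority 9
    FakeConverter("docx", 0.0),          # specific file format
    FakeConverter("pdf", 0.0),           # specific file format, same priority as docx
    FakeConverter("plain_text", 10.0),   # near catch-all converter
]

order = [c.name for c in order_converters(registered)]
print(order)
```

Because `sorted()` is stable, the two priority-0 converters keep their registration order, and the plugin at priority 9 lands after them but before the catch-all at priority 10.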

@@ -1,23 +1,14 @@
import io # type: ignore
import re
import base64 import base64
import re
from typing import Union
from urllib.parse import parse_qs, urlparse from urllib.parse import parse_qs, urlparse
from typing import Any, BinaryIO, Optional
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult from ._base import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from ._markdownify import _CustomMarkdownify from ._markdownify import _CustomMarkdownify
from ._converter_input import ConverterInput
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class BingSerpConverter(DocumentConverter): class BingSerpConverter(DocumentConverter):
@@ -26,47 +17,31 @@ class BingSerpConverter(DocumentConverter):
NOTE: It is better to use the Bing API NOTE: It is better to use the Bing API
""" """
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Make sure we're dealing with HTML content *from* Bing.
"""
url = stream_info.url or ""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if not re.search(r"^https://www\.bing\.com/search\?q=", url):
# Not a Bing SERP URL
return False
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Not HTML content
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a Bing SERP
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() not in [".html", ".htm"]:
return None
url = kwargs.get("url", "")
if not re.search(r"^https://www\.bing\.com/search\?q=", url):
return None
# Parse the query parameters # Parse the query parameters
parsed_params = parse_qs(urlparse(stream_info.url).query) parsed_params = parse_qs(urlparse(url).query)
query = parsed_params.get("q", [""])[0] query = parsed_params.get("q", [""])[0]
# Parse the stream # Parse the file
encoding = "utf-8" if stream_info.charset is None else stream_info.charset soup = None
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding) file_obj = input.read_file(mode="rt", encoding="utf-8")
soup = BeautifulSoup(file_obj.read(), "html.parser")
file_obj.close()
# Clean up some formatting # Clean up some formatting
for tptt in soup.find_all(class_="tptt"): for tptt in soup.find_all(class_="tptt"):
@@ -110,6 +85,6 @@ class BingSerpConverter(DocumentConverter):
) )
return DocumentConverterResult( return DocumentConverterResult(
markdown=webpage_text,
title=None if soup.title is None else soup.title.string, title=None if soup.title is None else soup.title.string,
text_content=webpage_text,
) )

@@ -0,0 +1,30 @@
from typing import Any, Union
class ConverterInput:
"""
Wrapper for inputs to converter functions.
"""
def __init__(
self,
input_type: str = "filepath",
filepath: Union[str, None] = None,
file_object: Union[Any, None] = None,
):
if input_type not in ["filepath", "object"]:
raise ValueError(f"Invalid converter input type: {input_type}")
self.input_type = input_type
self.filepath = filepath
self.file_object = file_object
def read_file(
self,
mode: str = "rb",
encoding: Union[str, None] = None,
) -> Any:
if self.input_type == "object":
return self.file_object
return open(self.filepath, mode=mode, encoding=encoding)
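
A short sketch of how the two input types behave, with the class body copied from the hunk above so the snippet runs on its own. The temporary file and the variable names are illustrative assumptions. Note that for `input_type="object"` the `mode` and `encoding` arguments are ignored and the caller's object is returned as-is, so a binary stream still yields bytes even when text mode is requested.

```python
import io
import os
import tempfile
from typing import Any, Union


class ConverterInput:
    """Wrapper for inputs to converter functions (copied from the diff above)."""

    def __init__(
        self,
        input_type: str = "filepath",
        filepath: Union[str, None] = None,
        file_object: Union[Any, None] = None,
    ):
        if input_type not in ["filepath", "object"]:
            raise ValueError(f"Invalid converter input type: {input_type}")
        self.input_type = input_type
        self.filepath = filepath
        self.file_object = file_object

    def read_file(self, mode: str = "rb", encoding: Union[str, None] = None) -> Any:
        if self.input_type == "object":
            return self.file_object
        return open(self.filepath, mode=mode, encoding=encoding)


# Case 1: a local path. read_file() opens the file with the requested mode.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("hello")
    path = tmp.name
try:
    path_input = ConverterInput(input_type="filepath", filepath=path)
    handle = path_input.read_file(mode="rt", encoding="utf-8")
    text_from_path = handle.read()
    handle.close()
finally:
    os.unlink(path)

# Case 2: a caller-supplied stream. mode and encoding are ignored here.
stream = io.BytesIO(b"hello")
object_input = ConverterInput(input_type="object", file_object=stream)
returned = object_input.read_file(mode="rt", encoding="utf-8")
data_from_object = returned.read()
```

Since the converters in this change call `close()` on whatever `read_file` returns, a caller-supplied stream appears to be closed by the first converter that reads it.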

@@ -1,27 +1,17 @@
import sys from typing import Any, Union
import re import re
from typing import BinaryIO, Any, List # Azure imports
from azure.ai.documentintelligence import DocumentIntelligenceClient
from ._html_converter import HtmlConverter from azure.ai.documentintelligence.models import (
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
AnalyzeDocumentRequest, AnalyzeDocumentRequest,
AnalyzeResult, AnalyzeResult,
DocumentAnalysisFeature, DocumentAnalysisFeature,
) )
from azure.identity import DefaultAzureCredential from azure.identity import DefaultAzureCredential
except ImportError:
# Preserve the error and stack trace for later from ._base import DocumentConverter, DocumentConverterResult
_dependency_exc_info = sys.exc_info() from ._converter_input import ConverterInput
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum. # TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
@@ -29,62 +19,17 @@ except ImportError:
CONTENT_FORMAT = "markdown" CONTENT_FORMAT = "markdown"
OFFICE_MIME_TYPE_PREFIXES = [
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"application/vnd.openxmlformats-officedocument.presentationml",
"application/xhtml",
"text/html",
]
OTHER_MIME_TYPE_PREFIXES = [
"application/pdf",
"application/x-pdf",
"text/html",
"image/",
]
OFFICE_FILE_EXTENSIONS = [
".docx",
".xlsx",
".pptx",
".html",
".htm",
]
OTHER_FILE_EXTENSIONS = [
".pdf",
".jpeg",
".jpg",
".png",
".bmp",
".tiff",
".heif",
]
class DocumentIntelligenceConverter(DocumentConverter): class DocumentIntelligenceConverter(DocumentConverter):
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents.""" """Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
def __init__( def __init__(
self, self,
*, *,
priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT,
endpoint: str, endpoint: str,
api_version: str = "2024-07-31-preview", api_version: str = "2024-07-31-preview",
): ):
super().__init__() super().__init__(priority=priority)
# Raise an error if the dependencies are not available.
# This is different than other converters since this one isn't even instantiated
# unless explicitly requested.
if _dependency_exc_info is not None:
raise MissingDependencyException(
"DocumentIntelligenceConverter requires the optional dependency [az-doc-intel] (or [all]) to be installed. E.g., `pip install markitdown[az-doc-intel]`"
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
self.endpoint = endpoint self.endpoint = endpoint
self.api_version = api_version self.api_version = api_version
@@ -94,61 +39,54 @@ class DocumentIntelligenceConverter(DocumentConverter):
credential=DefaultAzureCredential(), credential=DefaultAzureCredential(),
) )
def accepts( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if extension is not supported by Document Intelligence
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> bool: docintel_extensions = [
mimetype = (stream_info.mimetype or "").lower() ".pdf",
extension = (stream_info.extension or "").lower() ".docx",
".xlsx",
".pptx",
".html",
".jpeg",
".jpg",
".png",
".bmp",
".tiff",
".heif",
]
if extension.lower() not in docintel_extensions:
return None
if extension in OFFICE_FILE_EXTENSIONS + OTHER_FILE_EXTENSIONS: # Get the bytestring from the converter input
return True file_obj = input.read_file(mode="rb")
file_bytes = file_obj.read()
file_obj.close()
for prefix in OFFICE_MIME_TYPE_PREFIXES + OTHER_MIME_TYPE_PREFIXES: # Certain document analysis features are not available for office filetypes (.xlsx, .pptx, .html, .docx)
if mimetype.startswith(prefix): if extension.lower() in [".xlsx", ".pptx", ".html", ".docx"]:
return True analysis_features = []
else:
return False analysis_features = [
def _analysis_features(self, stream_info: StreamInfo) -> List[str]:
"""
Helper needed to determine which analysis features to use.
Certain document analysis features are not available for
office filetypes (.xlsx, .pptx, .html, .docx)
"""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in OFFICE_FILE_EXTENSIONS:
return []
for prefix in OFFICE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return []
return [
DocumentAnalysisFeature.FORMULAS, # enable formula extraction DocumentAnalysisFeature.FORMULAS, # enable formula extraction
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
] ]
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Extract the text using Azure Document Intelligence # Extract the text using Azure Document Intelligence
poller = self.doc_intel_client.begin_analyze_document( poller = self.doc_intel_client.begin_analyze_document(
model_id="prebuilt-layout", model_id="prebuilt-layout",
body=AnalyzeDocumentRequest(bytes_source=file_stream.read()), body=AnalyzeDocumentRequest(bytes_source=file_bytes),
features=self._analysis_features(stream_info), features=analysis_features,
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
) )
result: AnalyzeResult = poller.result() result: AnalyzeResult = poller.result()
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string # remove comments from the markdown content generated by Doc Intelligence and append to markdown string
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL) markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
return DocumentConverterResult(markdown=markdown_text) return DocumentConverterResult(
title=None,
text_content=markdown_text,
)

@@ -1,27 +1,14 @@
import sys from typing import Union
from typing import BinaryIO, Any import mammoth
from ._base import (
DocumentConverterResult,
)
from ._base import DocumentConverter
from ._html_converter import HtmlConverter from ._html_converter import HtmlConverter
from .._base_converter import DocumentConverter, DocumentConverterResult from ._converter_input import ConverterInput
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import mammoth
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
ACCEPTED_FILE_EXTENSIONS = [".docx"]
class DocxConverter(HtmlConverter): class DocxConverter(HtmlConverter):
@@ -29,49 +16,25 @@ class DocxConverter(HtmlConverter):
Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible. Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
""" """
def __init__(self): def __init__(
super().__init__() self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
self._html_converter = HtmlConverter() ):
super().__init__(priority=priority)
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a DOCX
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".docx":
# Check: the dependencies return None
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".docx",
feature="docx",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
result = None
style_map = kwargs.get("style_map", None) style_map = kwargs.get("style_map", None)
return self._html_converter.convert_string( file_obj = input.read_file(mode="rb")
mammoth.convert_to_html(file_stream, style_map=style_map).value result = mammoth.convert_to_html(file_obj, style_map=style_map)
) file_obj.close()
html_content = result.value
result = self._convert(html_content)
return result

@@ -1,44 +0,0 @@
import json
import subprocess
import locale
import sys
import shutil
import os
import warnings
from typing import BinaryIO, Optional, Any
def exiftool_metadata(
file_stream: BinaryIO, *, exiftool_path: Optional[str] = None
) -> Any: # Need a better type for json data
# Check if we have a valid pointer to exiftool
if not exiftool_path:
which_exiftool = shutil.which("exiftool")
if which_exiftool:
warnings.warn(
f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
md = MarkItDown(exiftool_path="{which_exiftool}")
This warning will be removed in future releases.
""",
DeprecationWarning,
)
# Nothing to do
return {}
# Run exiftool
cur_pos = file_stream.tell()
try:
output = subprocess.run(
[exiftool_path, "-json", "-"],
input=file_stream.read(),
capture_output=True,
text=False,
).stdout
return json.loads(
output.decode(locale.getpreferredencoding(False)),
)[0]
finally:
file_stream.seek(cur_pos)

@@ -1,52 +1,39 @@
import io from typing import Any, Union
from typing import Any, BinaryIO, Optional
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult from ._base import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from ._markdownify import _CustomMarkdownify from ._markdownify import _CustomMarkdownify
from ._converter_input import ConverterInput
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class HtmlConverter(DocumentConverter): class HtmlConverter(DocumentConverter):
"""Anything with content type text/html""" """Anything with content type text/html"""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not html
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() not in [".html", ".htm"]:
# Parse the stream return None
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding) result = None
file_obj = input.read_file(mode="rt", encoding="utf-8")
result = self._convert(file_obj.read())
file_obj.close()
return result
def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
"""Helper function that converts an HTML string."""
# Parse the string
soup = BeautifulSoup(html_content, "html.parser")
# Remove javascript and style blocks # Remove javascript and style blocks
for script in soup(["script", "style"]): for script in soup(["script", "style"]):
@@ -66,25 +53,6 @@ class HtmlConverter(DocumentConverter):
webpage_text = webpage_text.strip() webpage_text = webpage_text.strip()
return DocumentConverterResult( return DocumentConverterResult(
markdown=webpage_text,
title=None if soup.title is None else soup.title.string, title=None if soup.title is None else soup.title.string,
) text_content=webpage_text,
def convert_string(
self, html_content: str, *, url: Optional[str] = None, **kwargs
) -> DocumentConverterResult:
"""
Non-standard convenience method to convert a string to markdown.
Given that many converters produce HTML as intermediate output, this
allows for easy conversion of HTML to markdown.
"""
return self.convert(
file_stream=io.BytesIO(html_content.encode("utf-8")),
stream_info=StreamInfo(
mimetype="text/html",
extension=".html",
charset="utf-8",
url=url,
),
**kwargs,
) )

@@ -1,53 +1,32 @@
from typing import BinaryIO, Any, Union from typing import Union
import base64 from ._base import DocumentConverter, DocumentConverterResult
import mimetypes from ._media_converter import MediaConverter
from ._exiftool import exiftool_metadata from ._converter_input import ConverterInput
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
ACCEPTED_MIME_TYPE_PREFIXES = [
"image/jpeg",
"image/png",
]
ACCEPTED_FILE_EXTENSIONS = [".jpg", ".jpeg", ".png"]
class ImageConverter(DocumentConverter): class ImageConverter(MediaConverter):
""" """
Converts images to markdown via extraction of metadata (if `exiftool` is installed), and description via a multimodal LLM (if an llm_client is configured). Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured).
""" """
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not an image
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() not in [".jpg", ".jpeg", ".png"]:
return None
md_content = "" md_content = ""
# Add metadata # Add metadata if a local path is provided
metadata = exiftool_metadata( if input.input_type == "filepath":
file_stream, exiftool_path=kwargs.get("exiftool_path") metadata = self._get_metadata(input.filepath, kwargs.get("exiftool_path"))
)
if metadata: if metadata:
for f in [ for f in [
@@ -65,59 +44,42 @@ class ImageConverter(DocumentConverter):
if f in metadata: if f in metadata:
md_content += f"{f}: {metadata[f]}\n" md_content += f"{f}: {metadata[f]}\n"
# Try describing the image with GPT # Try describing the image with GPTV
llm_client = kwargs.get("llm_client") llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model") llm_model = kwargs.get("llm_model")
if llm_client is not None and llm_model is not None: if llm_client is not None and llm_model is not None:
llm_description = self._get_llm_description( md_content += (
file_stream, "\n# Description:\n"
stream_info, + self._get_llm_description(
client=llm_client, input,
model=llm_model, extension,
llm_client,
llm_model,
prompt=kwargs.get("llm_prompt"), prompt=kwargs.get("llm_prompt"),
).strip()
+ "\n"
) )
if llm_description is not None:
md_content += "\n# Description:\n" + llm_description.strip() + "\n"
return DocumentConverterResult( return DocumentConverterResult(
markdown=md_content, title=None,
text_content=md_content,
) )
def _get_llm_description( def _get_llm_description(
self, self, input: ConverterInput, extension, client, model, prompt=None
file_stream: BinaryIO, ):
stream_info: StreamInfo,
*,
client,
model,
prompt=None,
) -> Union[None, str]:
if prompt is None or prompt.strip() == "": if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image." prompt = "Write a detailed caption for this image."
# Get the content type data_uri = ""
content_type = stream_info.mimetype content_type, encoding = mimetypes.guess_type("_dummy" + extension)
if not content_type: if content_type is None:
content_type, _ = mimetypes.guess_type( content_type = "image/jpeg"
"_dummy" + (stream_info.extension or "") image_file = input.read_file(mode="rb")
) image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
if not content_type: image_file.close()
content_type = "application/octet-stream" data_uri = f"data:{content_type};base64,{image_base64}"
# Convert to base64
cur_pos = file_stream.tell()
try:
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
except Exception as e:
return None
finally:
file_stream.seek(cur_pos)
# Prepare the data-uri
data_uri = f"data:{content_type};base64,{base64_image}"
# Prepare the OpenAI API request
messages = [ messages = [
{ {
"role": "user", "role": "user",
@@ -133,6 +95,5 @@ class ImageConverter(DocumentConverter):
} }
] ]
# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages) response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content return response.choices[0].message.content

@@ -1,62 +1,41 @@
from typing import BinaryIO, Any
import json import json
from typing import Any, Union
from ._base import (
DocumentConverter,
DocumentConverterResult,
)
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._exceptions import FileConversionException from .._exceptions import FileConversionException
from .._stream_info import StreamInfo from ._converter_input import ConverterInput
CANDIDATE_MIME_TYPE_PREFIXES = [
"application/json",
]
ACCEPTED_FILE_EXTENSIONS = [".ipynb"]
class IpynbConverter(DocumentConverter): class IpynbConverter(DocumentConverter):
"""Converts Jupyter Notebook (.ipynb) files to Markdown.""" """Converts Jupyter Notebook (.ipynb) files to Markdown."""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
# Read further to see if it's a notebook
cur_pos = file_stream.tell()
try:
encoding = stream_info.charset or "utf-8"
notebook_content = file_stream.read().decode(encoding)
return (
"nbformat" in notebook_content
and "nbformat_minor" in notebook_content
)
finally:
file_stream.seek(cur_pos)
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not ipynb
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".ipynb":
return None
# Parse and convert the notebook # Parse and convert the notebook
result = None result = None
file_obj = input.read_file(mode="rt", encoding="utf-8")
notebook_content = json.load(file_obj)
file_obj.close()
result = self._convert(notebook_content)
encoding = stream_info.charset or "utf-8" return result
notebook_content = file_stream.read().decode(encoding=encoding)
return self._convert(json.loads(notebook_content))
def _convert(self, notebook_content: dict) -> DocumentConverterResult: def _convert(self, notebook_content: dict) -> Union[None, DocumentConverterResult]:
"""Helper function that converts notebook JSON content to Markdown.""" """Helper function that converts notebook JSON content to Markdown."""
try: try:
md_output = [] md_output = []
@@ -88,8 +67,8 @@ class IpynbConverter(DocumentConverter):
title = notebook_content.get("metadata", {}).get("title", title) title = notebook_content.get("metadata", {}).get("title", title)
return DocumentConverterResult( return DocumentConverterResult(
markdown=md_text,
title=title, title=title,
text_content=md_text,
) )
except Exception as e: except Exception as e:

@@ -1,50 +0,0 @@
from typing import BinaryIO, Any, Union
import base64
import mimetypes
from .._stream_info import StreamInfo
def llm_caption(
file_stream: BinaryIO, stream_info: StreamInfo, *, client, model, prompt=None
) -> Union[None, str]:
if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image."
# Get the content type
content_type = stream_info.mimetype
if not content_type:
content_type, _ = mimetypes.guess_type("_dummy" + (stream_info.extension or ""))
if not content_type:
content_type = "application/octet-stream"
# Convert to base64
cur_pos = file_stream.tell()
try:
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
except Exception as e:
return None
finally:
file_stream.seek(cur_pos)
# Prepare the data-uri
data_uri = f"data:{content_type};base64,{base64_image}"
# Prepare the OpenAI API request
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": data_uri,
},
},
],
}
]
# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content

@@ -1,7 +1,7 @@
import re import re
import markdownify import markdownify
from typing import Any, Optional from typing import Any
from urllib.parse import quote, unquote, urlparse, urlunparse from urllib.parse import quote, unquote, urlparse, urlunparse
@@ -20,14 +20,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
# Explicitly cast options to the expected type if necessary # Explicitly cast options to the expected type if necessary
super().__init__(**options) super().__init__(**options)
def convert_hn( def convert_hn(self, n: int, el: Any, text: str, convert_as_inline: bool) -> str:
self,
n: int,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Same as usual, but be sure to start with a new line""" """Same as usual, but be sure to start with a new line"""
if not convert_as_inline: if not convert_as_inline:
if not re.search(r"^\n", text): if not re.search(r"^\n", text):
@@ -35,13 +28,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
return super().convert_hn(n, el, text, convert_as_inline) # type: ignore return super().convert_hn(n, el, text, convert_as_inline) # type: ignore
def convert_a( def convert_a(self, el: Any, text: str, convert_as_inline: bool):
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
):
"""Same as usual converter, but removes Javascript links and escapes URIs.""" """Same as usual converter, but removes Javascript links and escapes URIs."""
prefix, suffix, text = markdownify.chomp(text) # type: ignore prefix, suffix, text = markdownify.chomp(text) # type: ignore
if not text: if not text:
@@ -81,13 +68,7 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
else text else text
) )
def convert_img( def convert_img(self, el: Any, text: str, convert_as_inline: bool) -> str:
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Same as usual converter, but removes data URIs""" """Same as usual converter, but removes data URIs"""
alt = el.attrs.get("alt", None) or "" alt = el.attrs.get("alt", None) or ""

@@ -0,0 +1,41 @@
import subprocess
import shutil
import json
from warnings import warn
from ._base import DocumentConverter
class MediaConverter(DocumentConverter):
"""
Abstract class for multi-modal media (e.g., images and audio)
"""
def __init__(
self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
):
super().__init__(priority=priority)
def _get_metadata(self, local_path, exiftool_path=None):
if not exiftool_path:
which_exiftool = shutil.which("exiftool")
if which_exiftool:
warn(
f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
md = MarkItDown(exiftool_path="{which_exiftool}")
This warning will be removed in future releases.
""",
DeprecationWarning,
)
return None
else:
if True:
result = subprocess.run(
[exiftool_path, "-json", local_path], capture_output=True, text=True
).stdout
return json.loads(result)[0]
# except Exception:
# return None

@@ -0,0 +1,98 @@
import tempfile
import os
from typing import Union
from ._base import DocumentConverter, DocumentConverterResult
from ._wav_converter import WavConverter
from warnings import resetwarnings, catch_warnings
from ._converter_input import ConverterInput
# Optional Transcription support
IS_AUDIO_TRANSCRIPTION_CAPABLE = False
try:
# Using warnings' catch_warnings to catch
# pydub's warning of ffmpeg or avconv missing
with catch_warnings(record=True) as w:
import pydub
if w:
raise ModuleNotFoundError
import speech_recognition as sr
IS_AUDIO_TRANSCRIPTION_CAPABLE = True
except ModuleNotFoundError:
pass
finally:
resetwarnings()
class Mp3Converter(WavConverter):
"""
Converts MP3 files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` AND `pydub` are installed).
"""
def __init__(
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
):
super().__init__(priority=priority)
def convert(
self, input: ConverterInput, **kwargs
) -> Union[None, DocumentConverterResult]:
# Bail if not an MP3
extension = kwargs.get("file_extension", "")
if extension.lower() != ".mp3":
return None
# Bail if a local path was not provided
if input.input_type != "filepath":
return None
local_path = input.filepath
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
"Duration",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Transcribe
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
handle, temp_path = tempfile.mkstemp(suffix=".wav")
os.close(handle)
try:
sound = pydub.AudioSegment.from_mp3(local_path)
sound.export(temp_path, format="wav")
_args = dict()
_args.update(kwargs)
_args["file_extension"] = ".wav"
try:
transcript = super()._transcribe_audio(temp_path).strip()
md_content += "\n\n### Audio Transcript:\n" + (
"[No speech detected]" if transcript == "" else transcript
)
except Exception:
md_content += "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
finally:
os.unlink(temp_path)
# Return the result
return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)
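The temp-file handling in the transcription branch can be isolated into a small context manager. This is a sketch only; `temporary_path` is a hypothetical helper, not something the diff defines:

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_path(suffix: str = "") -> Iterator[str]:
    """Yield the path of a fresh temporary file and delete it on exit."""
    handle, path = tempfile.mkstemp(suffix=suffix)
    # mkstemp returns an open OS-level handle; close it so other code can
    # reopen the path by name (this matters on Windows).
    os.close(handle)
    try:
        yield path
    finally:
        # Runs whether or not the body raised.
        if os.path.exists(path):
            os.unlink(path)
```

With it, the MP3 branch could export to the yielded path and rely on the `finally` clause for cleanup even if the export or the recognizer raises.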


@@ -1,23 +1,7 @@
import sys import olefile
from typing import Any, Union, BinaryIO from typing import Any, Union
from .._stream_info import StreamInfo from ._base import DocumentConverter, DocumentConverterResult
from .._base_converter import DocumentConverter, DocumentConverterResult from ._converter_input import ConverterInput
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import olefile
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/vnd.ms-outlook",
]
ACCEPTED_FILE_EXTENSIONS = [".msg"]
class OutlookMsgConverter(DocumentConverter): class OutlookMsgConverter(DocumentConverter):
@@ -28,67 +12,23 @@ class OutlookMsgConverter(DocumentConverter):
- Email body content - Email body content
""" """
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
# Check the extension and mimetype
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Brute force, check if we have an OLE file
cur_pos = file_stream.tell()
try:
if not olefile.isOleFile(file_stream):
return False
finally:
file_stream.seek(cur_pos)
# Brute force, check if it's an Outlook file
try:
msg = olefile.OleFileIO(file_stream)
toc = "\n".join([str(stream) for stream in msg.listdir()])
return (
"__properties_version1.0" in toc
and "__recip_version1.0_#00000000" in toc
)
except Exception as e:
pass
finally:
file_stream.seek(cur_pos)
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a MSG file
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".msg":
# Check: the dependencies return None
if _dependency_exc_info is not None:
raise MissingDependencyException( try:
MISSING_DEPENDENCY_MESSAGE.format( file_obj = input.read_file(mode="rb")
converter=type(self).__name__, msg = olefile.OleFileIO(file_obj)
extension=".msg",
feature="outlook",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
msg = olefile.OleFileIO(file_stream)
# Extract email metadata # Extract email metadata
md_content = "# Email Message\n\n" md_content = "# Email Message\n\n"
@@ -112,18 +52,21 @@ class OutlookMsgConverter(DocumentConverter):
md_content += body md_content += body
msg.close() msg.close()
file_obj.close()
return DocumentConverterResult( return DocumentConverterResult(
markdown=md_content.strip(), title=headers.get("Subject"), text_content=md_content.strip()
title=headers.get("Subject"),
) )
def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]: except Exception as e:
"""Helper to safely extract and decode stream data from the MSG file.""" raise FileConversionException(
assert isinstance( f"Could not convert MSG file '{input.filepath}': {str(e)}"
msg, olefile.OleFileIO )
) # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)
def _get_stream_data(
self, msg: olefile.OleFileIO, stream_path: str
) -> Union[str, None]:
"""Helper to safely extract and decode stream data from the MSG file."""
try: try:
if msg.exists(stream_path): if msg.exists(stream_path):
data = msg.openstream(stream_path).read() data = msg.openstream(stream_path).read()


@@ -1,32 +1,9 @@
import sys import pdfminer
import io import pdfminer.high_level
from typing import Union
from typing import BinaryIO, Any from io import StringIO
from ._base import DocumentConverter, DocumentConverterResult
from ._converter_input import ConverterInput
from ._html_converter import HtmlConverter
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import pdfminer
import pdfminer.high_level
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/pdf",
"application/x-pdf",
]
ACCEPTED_FILE_EXTENSIONS = [".pdf"]
class PdfConverter(DocumentConverter): class PdfConverter(DocumentConverter):
@@ -34,45 +11,25 @@ class PdfConverter(DocumentConverter):
Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text. Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
""" """
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a PDF
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".pdf":
# Check the dependencies return None
if _dependency_exc_info is not None:
raise MissingDependencyException( output = StringIO()
MISSING_DEPENDENCY_MESSAGE.format( file_obj = input.read_file(mode="rb")
converter=type(self).__name__, pdfminer.high_level.extract_text_to_fp(file_obj, output)
extension=".pdf", file_obj.close()
feature="pdf",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
assert isinstance(file_stream, io.IOBase) # for mypy
return DocumentConverterResult( return DocumentConverterResult(
markdown=pdfminer.high_level.extract_text(file_stream), title=None,
text_content=output.getvalue(),
) )
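`file_obj.close()` in the new `convert` only runs if `extract_text_to_fp` returns normally. A hedged sketch of the same read-then-close step using `contextlib.closing`, with a plain callable standing in for `input.read_file(mode="rb")` (whose exact behaviour is not shown in this section):

```python
import io
from contextlib import closing
from typing import BinaryIO, Callable


def read_and_close(open_stream: Callable[[], BinaryIO]) -> bytes:
    """Open a stream, read it fully, and close it even if reading fails."""
    # closing() calls file_obj.close() on exit from the with-block,
    # including when the body raises.
    with closing(open_stream()) as file_obj:
        return file_obj.read()
```

The same `with closing(...)` wrapper would apply to the other converters in this change that call `file_obj.close()` by hand.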


@@ -1,62 +1,43 @@
import sys import mimetypes
from typing import BinaryIO, Any from charset_normalizer import from_path, from_bytes
from charset_normalizer import from_bytes from typing import Any, Union
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
# Try loading optional (but in this case, required) dependencies from ._base import DocumentConverter, DocumentConverterResult
# Save reporting of any exceptions for later from ._converter_input import ConverterInput
_dependency_exc_info = None
try:
import mammoth
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/",
"application/json",
]
# Mimetypes to ignore (commonly confused extensions)
IGNORE_MIME_TYPE_PREFIXES = [
"text/vnd.in3d.spot", # .spo which is confused with xls, doc, etc.
"text/vnd.graphviz", # .dot which is confused with xls, doc, etc.
]
class PlainTextConverter(DocumentConverter): class PlainTextConverter(DocumentConverter):
"""Anything with content type text/plain""" """Anything with content type text/plain"""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
for prefix in IGNORE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return False
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Read file object from input
**kwargs: Any, # Options to pass to the converter file_obj = input.read_file(mode="rb")
) -> DocumentConverterResult:
if stream_info.charset:
text_content = file_stream.read().decode(stream_info.charset)
else:
text_content = str(from_bytes(file_stream.read()).best())
return DocumentConverterResult(markdown=text_content) # Guess the content type from any file extension that might be around
content_type, _ = mimetypes.guess_type(
"__placeholder" + kwargs.get("file_extension", "")
)
# Only accept text files
if content_type is None:
return None
elif all(
not content_type.lower().startswith(type_prefix)
for type_prefix in ["text/", "application/json"]
):
return None
text_content = str(from_bytes(file_obj.read()).best())
file_obj.close()
return DocumentConverterResult(
title=None,
text_content=text_content,
)
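The extension check in the new `convert` relies on `mimetypes.guess_type` applied to a placeholder filename. A standalone sketch of that idea; `looks_like_text` is a hypothetical name, and the accepted prefixes are copied from the converter:

```python
import mimetypes

# Prefixes taken from the converter above.
TEXT_PREFIXES = ("text/", "application/json")


def looks_like_text(file_extension: str) -> bool:
    """Guess from a file extension alone whether the content is text-like."""
    # guess_type only inspects the name, so any stem works as a placeholder.
    content_type, _ = mimetypes.guess_type("__placeholder" + file_extension)
    if content_type is None:
        return False
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return content_type.lower().startswith(TEXT_PREFIXES)
```

Because the guess depends only on the extension, input with no extension is rejected even when its bytes are plain text.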


@@ -1,85 +1,68 @@
import sys
import base64 import base64
import os import pptx
import io
import re import re
import html import html
from typing import BinaryIO, Any from typing import Union
from ._base import DocumentConverterResult, DocumentConverter
from ._html_converter import HtmlConverter from ._html_converter import HtmlConverter
from ._llm_caption import llm_caption from ._converter_input import ConverterInput
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import pptx
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [ class PptxConverter(HtmlConverter):
"application/vnd.openxmlformats-officedocument.presentationml",
]
ACCEPTED_FILE_EXTENSIONS = [".pptx"]
class PptxConverter(DocumentConverter):
""" """
Converts PPTX files to Markdown. Supports heading, tables and images with alt text. Converts PPTX files to Markdown. Supports heading, tables and images with alt text.
""" """
def __init__(self): def __init__(
super().__init__() self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
self._html_converter = HtmlConverter() ):
super().__init__(priority=priority)
def accepts( def _get_llm_description(
self, self, llm_client, llm_model, image_blob, content_type, prompt=None
file_stream: BinaryIO, ):
stream_info: StreamInfo, if prompt is None or prompt.strip() == "":
**kwargs: Any, # Options to pass to the converter prompt = "Write a detailed alt text for this image with less than 50 words."
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS: image_base64 = base64.b64encode(image_blob).decode("utf-8")
return True data_uri = f"data:{content_type};base64,{image_base64}"
for prefix in ACCEPTED_MIME_TYPE_PREFIXES: messages = [
if mimetype.startswith(prefix): {
return True "role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": data_uri,
},
},
{"type": "text", "text": prompt},
],
}
]
return False response = llm_client.chat.completions.create(
model=llm_model, messages=messages
)
return response.choices[0].message.content
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a PPTX
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".pptx":
# Check the dependencies return None
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pptx",
feature="pptx",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
# Perform the conversion
presentation = pptx.Presentation(file_stream)
md_content = "" md_content = ""
file_obj = input.read_file(mode="rb")
presentation = pptx.Presentation(file_obj)
file_obj.close()
slide_num = 0 slide_num = 0
for slide in presentation.slides: for slide in presentation.slides:
slide_num += 1 slide_num += 1
@@ -87,65 +70,64 @@ class PptxConverter(DocumentConverter):
md_content += f"\n\n<!-- Slide number: {slide_num} -->\n" md_content += f"\n\n<!-- Slide number: {slide_num} -->\n"
title = slide.shapes.title title = slide.shapes.title
for shape in slide.shapes:
def get_shape_content(shape, **kwargs):
nonlocal md_content
# Pictures # Pictures
if self._is_picture(shape): if self._is_picture(shape):
# https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069 # https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069
llm_description = "" llm_description = None
alt_text = "" alt_text = None
# Potentially generate a description using an LLM
llm_client = kwargs.get("llm_client") llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model") llm_model = kwargs.get("llm_model")
if llm_client is not None and llm_model is not None: if llm_client is not None and llm_model is not None:
# Prepare a file_stream and stream_info for the image data
image_filename = shape.image.filename
image_extension = None
if image_filename:
image_extension = os.path.splitext(image_filename)[1]
image_stream_info = StreamInfo(
mimetype=shape.image.content_type,
extension=image_extension,
filename=image_filename,
)
image_stream = io.BytesIO(shape.image.blob)
# Caption the image
try: try:
llm_description = llm_caption( llm_description = self._get_llm_description(
image_stream, llm_client,
image_stream_info, llm_model,
client=llm_client, shape.image.blob,
model=llm_model, shape.image.content_type,
prompt=kwargs.get("llm_prompt"),
) )
except Exception: except Exception:
# Unable to generate a description # Unable to describe with LLM
pass pass
# Also grab any description embedded in the deck if not llm_description:
try: try:
alt_text = shape._element._nvXxPr.cNvPr.attrib.get("descr", "") alt_text = shape._element._nvXxPr.cNvPr.attrib.get(
"descr", ""
)
except Exception: except Exception:
# Unable to get alt text # Unable to get alt text
pass pass
# Prepare the alt, escaping any special characters
alt_text = "\n".join([llm_description, alt_text]) or shape.name
alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
alt_text = re.sub(r"\s+", " ", alt_text).strip()
# A placeholder name # A placeholder name
filename = re.sub(r"\W", "", shape.name) + ".jpg" filename = re.sub(r"\W", "", shape.name) + ".jpg"
md_content += "\n![" + alt_text + "](" + filename + ")\n" md_content += (
"\n!["
+ (llm_description or alt_text or shape.name)
+ "]("
+ filename
+ ")\n"
)
# Tables # Tables
if self._is_table(shape): if self._is_table(shape):
md_content += self._convert_table_to_markdown(shape.table) html_table = "<html><body><table>"
first_row = True
for row in shape.table.rows:
html_table += "<tr>"
for cell in row.cells:
if first_row:
html_table += "<th>" + html.escape(cell.text) + "</th>"
else:
html_table += "<td>" + html.escape(cell.text) + "</td>"
html_table += "</tr>"
first_row = False
html_table += "</table></body></html>"
md_content += (
"\n" + self._convert(html_table).text_content.strip() + "\n"
)
# Charts # Charts
if shape.has_chart: if shape.has_chart:
@@ -158,14 +140,6 @@ class PptxConverter(DocumentConverter):
else: else:
md_content += shape.text + "\n" md_content += shape.text + "\n"
# Group Shapes
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
for subshape in shape.shapes:
get_shape_content(subshape, **kwargs)
for shape in slide.shapes:
get_shape_content(shape, **kwargs)
md_content = md_content.strip() md_content = md_content.strip()
if slide.has_notes_slide: if slide.has_notes_slide:
@@ -175,7 +149,10 @@ class PptxConverter(DocumentConverter):
md_content += notes_frame.text md_content += notes_frame.text
md_content = md_content.strip() md_content = md_content.strip()
return DocumentConverterResult(markdown=md_content.strip()) return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)
def _is_picture(self, shape): def _is_picture(self, shape):
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE: if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE:
@@ -190,23 +167,6 @@ class PptxConverter(DocumentConverter):
return True return True
return False return False
def _convert_table_to_markdown(self, table):
# Write the table as HTML, then convert it to Markdown
html_table = "<html><body><table>"
first_row = True
for row in table.rows:
html_table += "<tr>"
for cell in row.cells:
if first_row:
html_table += "<th>" + html.escape(cell.text) + "</th>"
else:
html_table += "<td>" + html.escape(cell.text) + "</td>"
html_table += "</tr>"
first_row = False
html_table += "</table></body></html>"
return self._html_converter.convert_string(html_table).markdown.strip() + "\n"
def _convert_chart_to_markdown(self, chart): def _convert_chart_to_markdown(self, chart):
md = "\n\n### Chart" md = "\n\n### Chart"
if chart.has_title: if chart.has_title:


@@ -1,100 +1,61 @@
from xml.dom import minidom from xml.dom import minidom
from typing import BinaryIO, Any, Union from typing import Union
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from ._markdownify import _CustomMarkdownify from ._markdownify import _CustomMarkdownify
from .._stream_info import StreamInfo from ._base import DocumentConverter, DocumentConverterResult
from .._base_converter import DocumentConverter, DocumentConverterResult from ._converter_input import ConverterInput
PRECISE_MIME_TYPE_PREFIXES = [
"application/rss",
"application/atom",
]
PRECISE_FILE_EXTENSIONS = [".rss", ".atom"]
CANDIDATE_MIME_TYPE_PREFIXES = [
"text/xml",
"application/xml",
]
CANDIDATE_FILE_EXTENSIONS = [
".xml",
]
class RssConverter(DocumentConverter): class RssConverter(DocumentConverter):
"""Convert RSS / Atom type to markdown""" """Convert RSS / Atom type to markdown"""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
# Check for precise mimetypes and file extensions def convert(
if extension in PRECISE_FILE_EXTENSIONS: self, input: ConverterInput, **kwargs
return True ) -> Union[None, DocumentConverterResult]:
# Bail if not RSS type
extension = kwargs.get("file_extension", "")
if extension.lower() not in [".xml", ".rss", ".atom"]:
return None
# Read file object from input
file_obj = input.read_file(mode="rb")
for prefix in PRECISE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Check for precise mimetypes and file extensions
if extension in CANDIDATE_FILE_EXTENSIONS:
return self._check_xml(file_stream)
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return self._check_xml(file_stream)
return False
def _check_xml(self, file_stream: BinaryIO) -> bool:
cur_pos = file_stream.tell()
try: try:
doc = minidom.parse(file_stream) doc = minidom.parse(file_obj)
return self._feed_type(doc) is not None
except BaseException as _: except BaseException as _:
pass return None
finally: file_obj.close()
file_stream.seek(cur_pos)
return False
def _feed_type(self, doc: Any) -> str: result = None
if doc.getElementsByTagName("rss"): if doc.getElementsByTagName("rss"):
return "rss" # A RSS feed must have a root element of <rss>
result = self._parse_rss_type(doc)
elif doc.getElementsByTagName("feed"): elif doc.getElementsByTagName("feed"):
root = doc.getElementsByTagName("feed")[0] root = doc.getElementsByTagName("feed")[0]
if root.getElementsByTagName("entry"): if root.getElementsByTagName("entry"):
# An Atom feed must have a root element of <feed> and at least one <entry> # An Atom feed must have a root element of <feed> and at least one <entry>
return "atom" result = self._parse_atom_type(doc)
else:
return None
else:
# not rss or atom
return None return None
def convert( return result
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
doc = minidom.parse(file_stream)
feed_type = self._feed_type(doc)
if feed_type == "rss": def _parse_atom_type(
return self._parse_rss_type(doc) self, doc: minidom.Document
elif feed_type == "atom": ) -> Union[None, DocumentConverterResult]:
return self._parse_atom_type(doc)
else:
raise ValueError("Unknown feed type")
def _parse_atom_type(self, doc: minidom.Document) -> DocumentConverterResult:
"""Parse the type of an Atom feed. """Parse the type of an Atom feed.
Returns None if the feed type is not recognized or something goes wrong. Returns None if the feed type is not recognized or something goes wrong.
""" """
try:
root = doc.getElementsByTagName("feed")[0] root = doc.getElementsByTagName("feed")[0]
title = self._get_data_by_tag_name(root, "title") title = self._get_data_by_tag_name(root, "title")
subtitle = self._get_data_by_tag_name(root, "subtitle") subtitle = self._get_data_by_tag_name(root, "subtitle")
@@ -118,15 +79,20 @@ class RssConverter(DocumentConverter):
md_text += self._parse_content(entry_content) md_text += self._parse_content(entry_content)
return DocumentConverterResult( return DocumentConverterResult(
markdown=md_text,
title=title, title=title,
text_content=md_text,
) )
except BaseException as _:
return None
def _parse_rss_type(self, doc: minidom.Document) -> DocumentConverterResult: def _parse_rss_type(
self, doc: minidom.Document
) -> Union[None, DocumentConverterResult]:
"""Parse the type of an RSS feed. """Parse the type of an RSS feed.
Returns None if the feed type is not recognized or something goes wrong. Returns None if the feed type is not recognized or something goes wrong.
""" """
try:
root = doc.getElementsByTagName("rss")[0] root = doc.getElementsByTagName("rss")[0]
channel = root.getElementsByTagName("channel") channel = root.getElementsByTagName("channel")
if not channel: if not channel:
@@ -157,9 +123,12 @@ class RssConverter(DocumentConverter):
md_text += self._parse_content(content) md_text += self._parse_content(content)
return DocumentConverterResult( return DocumentConverterResult(
markdown=md_text,
title=channel_title, title=channel_title,
text_content=md_text,
) )
except BaseException as _:
print(traceback.format_exc())
return None
def _parse_content(self, content: str) -> str: def _parse_content(self, content: str) -> str:
"""Parse the content of an RSS feed item""" """Parse the content of an RSS feed item"""


@@ -1,43 +0,0 @@
import io
import sys
from typing import BinaryIO
from .._exceptions import MissingDependencyException
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import speech_recognition as sr
import pydub
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav") -> str:
# Check for installed dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
"Speech transcription requires installing MarkItdown with the [audio-transcription] optional dependencies. E.g., `pip install markitdown[audio-transcription]` or `pip install markitdown[all]`"
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
if audio_format in ["wav", "aiff", "flac"]:
audio_source = file_stream
elif audio_format in ["mp3", "mp4"]:
audio_segment = pydub.AudioSegment.from_file(file_stream, format=audio_format)
audio_source = io.BytesIO()
audio_segment.export(audio_source, format="wav")
audio_source.seek(0)
else:
raise ValueError(f"Unsupported audio format: {audio_format}")
recognizer = sr.Recognizer()
with sr.AudioFile(audio_source) as source:
audio = recognizer.record(source)
transcript = recognizer.recognize_google(audio).strip()
return "[No speech detected]" if transcript == "" else transcript


@@ -0,0 +1,80 @@
from typing import Union
from ._base import DocumentConverter, DocumentConverterResult
from ._media_converter import MediaConverter
from ._converter_input import ConverterInput
# Optional Transcription support
IS_AUDIO_TRANSCRIPTION_CAPABLE = False
try:
import speech_recognition as sr
IS_AUDIO_TRANSCRIPTION_CAPABLE = True
except ModuleNotFoundError:
pass
class WavConverter(MediaConverter):
"""
Converts WAV files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
"""
def __init__(
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
):
super().__init__(priority=priority)
def convert(
self, input: ConverterInput, **kwargs
) -> Union[None, DocumentConverterResult]:
# Bail if not a WAV
extension = kwargs.get("file_extension", "")
if extension.lower() != ".wav":
return None
# Bail if a local path was not provided
if input.input_type != "filepath":
return None
local_path = input.filepath
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
"Duration",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Transcribe
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
try:
transcript = self._transcribe_audio(local_path)
md_content += "\n\n### Audio Transcript:\n" + (
"[No speech detected]" if transcript == "" else transcript
)
except Exception:
md_content += (
"\n\n### Audio Transcript:\nError. Could not transcribe this audio."
)
return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)
def _transcribe_audio(self, local_path) -> str:
recognizer = sr.Recognizer()
with sr.AudioFile(local_path) as source:
audio = recognizer.record(source)
return recognizer.recognize_google(audio).strip()
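The `try`/`except ModuleNotFoundError` guard at the top of this file is one way to detect an optional dependency. An alternative sketch that probes without importing, using `importlib.util.find_spec`; `is_module_available` is a hypothetical helper and is not used anywhere in this change:

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Report whether a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken parent packages or modules
        # whose __spec__ is unset; treat both as "not available".
        return False
```

A probe like this only answers whether the module can be found; an import may still fail later if the package itself is broken.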


@@ -1,63 +1,37 @@
import io
import re import re
from typing import Any, BinaryIO, Optional
from typing import Any, Union
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult from ._base import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from ._markdownify import _CustomMarkdownify from ._markdownify import _CustomMarkdownify
from ._converter_input import ConverterInput
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class WikipediaConverter(DocumentConverter): class WikipediaConverter(DocumentConverter):
"""Handle Wikipedia pages separately, focusing only on the main document content.""" """Handle Wikipedia pages separately, focusing only on the main document content."""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Make sure we're dealing with HTML content *from* Wikipedia.
"""
url = stream_info.url or ""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
# Not a Wikipedia URL
return False
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Not HTML content
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not Wikipedia
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() not in [".html", ".htm"]:
# Parse the stream return None
encoding = "utf-8" if stream_info.charset is None else stream_info.charset url = kwargs.get("url", "")
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding) if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
return None
# Parse the file
soup = None
file_obj = input.read_file(mode="rt", encoding="utf-8")
soup = BeautifulSoup(file_obj.read(), "html.parser")
file_obj.close()
# Remove javascript and style blocks # Remove javascript and style blocks
for script in soup(["script", "style"]): for script in soup(["script", "style"]):
@@ -84,6 +58,6 @@ class WikipediaConverter(DocumentConverter):
webpage_text = _CustomMarkdownify().convert_soup(soup) webpage_text = _CustomMarkdownify().convert_soup(soup)
return DocumentConverterResult( return DocumentConverterResult(
markdown=webpage_text,
title=main_title, title=main_title,
text_content=webpage_text,
) )
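The URL pattern used by this converter leaves the dot in `wikipedia.org` unescaped, so it also matches hosts such as `en.wikipediaXorg`. A sketch of a stricter check, assuming the same two-or-three-letter language subdomain rule; `is_wikipedia_url` is a hypothetical name:

```python
import re

# Same shape as the converter's pattern, with the dot before "org" escaped.
WIKIPEDIA_URL = re.compile(r"^https?://[a-zA-Z]{2,3}\.wikipedia\.org/")


def is_wikipedia_url(url: str) -> bool:
    """Return True for http(s) URLs on a two- or three-letter Wikipedia subdomain."""
    return WIKIPEDIA_URL.search(url or "") is not None
```

Forward slashes need no escaping in Python regular expressions, so the `\/` sequences in the original can be written as plain `/`.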


@@ -1,153 +1,70 @@
import sys from typing import Union
from typing import BinaryIO, Any
import pandas as pd
from ._base import DocumentConverter, DocumentConverterResult
from ._html_converter import HtmlConverter from ._html_converter import HtmlConverter
from .._base_converter import DocumentConverter, DocumentConverterResult from ._converter_input import ConverterInput
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
from .._stream_info import StreamInfo
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_xlsx_dependency_exc_info = None
try:
import pandas as pd
import openpyxl
except ImportError:
_xlsx_dependency_exc_info = sys.exc_info()
_xls_dependency_exc_info = None
try:
import pandas as pd
import xlrd
except ImportError:
_xls_dependency_exc_info = sys.exc_info()
ACCEPTED_XLSX_MIME_TYPE_PREFIXES = [
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
]
ACCEPTED_XLSX_FILE_EXTENSIONS = [".xlsx"]
ACCEPTED_XLS_MIME_TYPE_PREFIXES = [
"application/vnd.ms-excel",
"application/excel",
]
ACCEPTED_XLS_FILE_EXTENSIONS = [".xls"]
class XlsxConverter(DocumentConverter): class XlsxConverter(HtmlConverter):
""" """
Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table. Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
""" """
def __init__(self): def __init__(
super().__init__() self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
self._html_converter = HtmlConverter() ):
super().__init__(priority=priority)
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_XLSX_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_XLSX_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a XLSX
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".xlsx":
# Check the dependencies return None
if _xlsx_dependency_exc_info is not None:
raise MissingDependencyException( file_obj = input.read_file(mode="rb")
MISSING_DEPENDENCY_MESSAGE.format( sheets = pd.read_excel(file_obj, sheet_name=None, engine="openpyxl")
converter=type(self).__name__, file_obj.close()
extension=".xlsx",
feature="xlsx",
)
) from _xlsx_dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_xlsx_dependency_exc_info[2]
)
sheets = pd.read_excel(file_stream, sheet_name=None, engine="openpyxl")
md_content = "" md_content = ""
for s in sheets: for s in sheets:
md_content += f"## {s}\n" md_content += f"## {s}\n"
html_content = sheets[s].to_html(index=False) html_content = sheets[s].to_html(index=False)
md_content += ( md_content += self._convert(html_content).text_content.strip() + "\n\n"
self._html_converter.convert_string(html_content).markdown.strip()
+ "\n\n" return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
) )
return DocumentConverterResult(markdown=md_content.strip())
class XlsConverter(HtmlConverter):
class XlsConverter(DocumentConverter):
""" """
Converts XLS files to Markdown, with each sheet presented as a separate Markdown table. Converts XLS files to Markdown, with each sheet presented as a separate Markdown table.
""" """
def __init__(self):
super().__init__()
self._html_converter = HtmlConverter()
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_XLS_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_XLS_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a XLS
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".xls":
# Load the dependencies return None
if _xls_dependency_exc_info is not None:
raise MissingDependencyException( file_obj = input.read_file(mode="rb")
MISSING_DEPENDENCY_MESSAGE.format( sheets = pd.read_excel(file_obj, sheet_name=None, engine="xlrd")
converter=type(self).__name__, file_obj.close()
extension=".xls",
feature="xls",
)
) from _xls_dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_xls_dependency_exc_info[2]
)
sheets = pd.read_excel(file_stream, sheet_name=None, engine="xlrd")
md_content = "" md_content = ""
for s in sheets: for s in sheets:
md_content += f"## {s}\n" md_content += f"## {s}\n"
html_content = sheets[s].to_html(index=False) html_content = sheets[s].to_html(index=False)
md_content += ( md_content += self._convert(html_content).text_content.strip() + "\n\n"
self._html_converter.convert_string(html_content).markdown.strip()
+ "\n\n"
)
return DocumentConverterResult(markdown=md_content.strip()) return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)


@@ -1,15 +1,13 @@
import sys
import json
import time
import io
import re import re
from typing import Any, BinaryIO, Optional, Dict, List, Union import json
from urllib.parse import parse_qs, urlparse, unquote
from typing import Any, Union, Dict, List
from urllib.parse import parse_qs, urlparse
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult from ._base import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo from ._converter_input import ConverterInput
from ._markdownify import _CustomMarkdownify
# Optional YouTube transcription support # Optional YouTube transcription support
try: try:
@@ -17,89 +15,58 @@ try:
IS_YOUTUBE_TRANSCRIPT_CAPABLE = True IS_YOUTUBE_TRANSCRIPT_CAPABLE = True
except ModuleNotFoundError: except ModuleNotFoundError:
IS_YOUTUBE_TRANSCRIPT_CAPABLE = False pass
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class YouTubeConverter(DocumentConverter): class YouTubeConverter(DocumentConverter):
"""Handle YouTube specially, focusing on the video title, description, and transcript.""" """Handle YouTube specially, focusing on the video title, description, and transcript."""
def accepts( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
file_stream: BinaryIO, ):
stream_info: StreamInfo, super().__init__(priority=priority)
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Make sure we're dealing with HTML content *from* YouTube.
"""
url = stream_info.url or ""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
url = unquote(url)
url = url.replace(r"\?", "?").replace(r"\=", "=")
if not url.startswith("https://www.youtube.com/watch?"):
# Not a YouTube URL
return False
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Not HTML content
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not YouTube
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() not in [".html", ".htm"]:
# Parse the stream return None
encoding = "utf-8" if stream_info.charset is None else stream_info.charset url = kwargs.get("url", "")
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding) if not url.startswith("https://www.youtube.com/watch?"):
return None
# Parse the file
soup = None
file_obj = input.read_file(mode="rt", encoding="utf-8")
soup = BeautifulSoup(file_obj.read(), "html.parser")
file_obj.close()
# Read the meta tags # Read the meta tags
assert soup.title is not None and soup.title.string is not None
metadata: Dict[str, str] = {"title": soup.title.string} metadata: Dict[str, str] = {"title": soup.title.string}
for meta in soup(["meta"]): for meta in soup(["meta"]):
for a in meta.attrs: for a in meta.attrs:
if a in ["itemprop", "property", "name"]: if a in ["itemprop", "property", "name"]:
content = meta.get("content", "") metadata[meta[a]] = meta.get("content", "")
if content: # Only add non-empty content
metadata[meta[a]] = content
break break
# Try reading the description # We can also try to read the full description. This is more prone to breaking, since it reaches into the page implementation
try: try:
for script in soup(["script"]): for script in soup(["script"]):
if not script.string: # Skip empty scripts content = script.text
continue
content = script.string
if "ytInitialData" in content: if "ytInitialData" in content:
match = re.search(r"var ytInitialData = ({.*?});", content) lines = re.split(r"\r?\n", content)
if match: obj_start = lines[0].find("{")
data = json.loads(match.group(1)) obj_end = lines[0].rfind("}")
attrdesc = self._findKey(data, "attributedDescriptionBodyText") if obj_start >= 0 and obj_end >= 0:
if attrdesc and isinstance(attrdesc, dict): data = json.loads(lines[0][obj_start : obj_end + 1])
metadata["description"] = str(attrdesc.get("content", "")) attrdesc = self._findKey(data, "attributedDescriptionBodyText") # type: ignore
if attrdesc:
metadata["description"] = str(attrdesc["content"])
break break
except Exception as e: except Exception:
print(f"Error extracting description: {e}")
pass pass
# Start preparing the page # Start preparing the page
@@ -133,31 +100,23 @@ class YouTubeConverter(DocumentConverter):
if IS_YOUTUBE_TRANSCRIPT_CAPABLE: if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
transcript_text = "" transcript_text = ""
parsed_url = urlparse(stream_info.url) # type: ignore parsed_url = urlparse(url) # type: ignore
params = parse_qs(parsed_url.query) # type: ignore params = parse_qs(parsed_url.query) # type: ignore
if "v" in params and params["v"][0]: if "v" in params:
assert isinstance(params["v"][0], str)
video_id = str(params["v"][0]) video_id = str(params["v"][0])
try: try:
youtube_transcript_languages = kwargs.get( youtube_transcript_languages = kwargs.get(
"youtube_transcript_languages", ("en",) "youtube_transcript_languages", ("en",)
) )
# Retry the transcript fetching operation # Must be a single transcript.
transcript = self._retry_operation( transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages) # type: ignore
lambda: YouTubeTranscriptApi.get_transcript( transcript_text = " ".join([part["text"] for part in transcript]) # type: ignore
video_id, languages=youtube_transcript_languages
),
retries=3, # Retry 3 times
delay=2, # 2 seconds delay between retries
)
if transcript:
transcript_text = " ".join(
[part["text"] for part in transcript]
) # type: ignore
# Alternative formatting: # Alternative formatting:
# formatter = TextFormatter() # formatter = TextFormatter()
# formatter.format_transcript(transcript) # formatter.format_transcript(transcript)
except Exception as e: except Exception:
print(f"Error fetching transcript: {e}") pass
if transcript_text: if transcript_text:
webpage_text += f"\n### Transcript\n{transcript_text}\n" webpage_text += f"\n### Transcript\n{transcript_text}\n"
@@ -165,8 +124,8 @@ class YouTubeConverter(DocumentConverter):
assert isinstance(title, str) assert isinstance(title, str)
return DocumentConverterResult( return DocumentConverterResult(
markdown=webpage_text,
title=title, title=title,
text_content=webpage_text,
) )
def _get( def _get(
@@ -175,37 +134,23 @@ class YouTubeConverter(DocumentConverter):
keys: List[str], keys: List[str],
default: Union[str, None] = None, default: Union[str, None] = None,
) -> Union[str, None]: ) -> Union[str, None]:
"""Get first non-empty value from metadata matching given keys."""
for k in keys: for k in keys:
if k in metadata: if k in metadata:
return metadata[k] return metadata[k]
return default return default
def _findKey(self, json: Any, key: str) -> Union[str, None]: # TODO: Fix json type def _findKey(self, json: Any, key: str) -> Union[str, None]: # TODO: Fix json type
"""Recursively search for a key in nested dictionary/list structures."""
if isinstance(json, list): if isinstance(json, list):
for elm in json: for elm in json:
ret = self._findKey(elm, key) ret = self._findKey(elm, key)
if ret is not None: if ret is not None:
return ret return ret
elif isinstance(json, dict): elif isinstance(json, dict):
for k, v in json.items(): for k in json:
if k == key: if k == key:
return json[k] return json[k]
if result := self._findKey(v, key): else:
return result ret = self._findKey(json[k], key)
if ret is not None:
return ret
return None return None
def _retry_operation(self, operation, retries=3, delay=2):
"""Retries the operation if it fails."""
attempt = 0
while attempt < retries:
try:
return operation() # Attempt the operation
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < retries - 1:
time.sleep(delay) # Wait before retrying
attempt += 1
# If all attempts fail, raise the last exception
raise Exception(f"Operation failed after {retries} attempts.")
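The two sides of the `_findKey` diff above propagate a nested hit differently: one tests the recursive result with `if result := ...` (truthiness), the other with `is not None`. A truthiness test passes over falsy hits such as an empty string and keeps searching. The sketch below is a standalone version of the explicit-`None` variant; the function name `find_key` and the sample data in the usage lines are illustrative assumptions.

```python
from typing import Any, Optional


def find_key(node: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` in nested dicts and lists; None if absent."""
    if isinstance(node, list):
        for item in node:
            found = find_key(item, key)
            if found is not None:
                return found
    elif isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                return v
            found = find_key(v, key)
            # Explicit None check: falsy hits such as "" or 0 are still returned.
            if found is not None:
                return found
    return None


# Illustrative data: the first "target" encountered is an empty string.
sample = {"a": [{"b": {"target": ""}}, {"target": "later"}]}
first_hit = find_key(sample, "target")
```

With the truthiness check the same search would skip the empty string and return the later value, which may or may not be what a caller wants.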


@@ -1,23 +1,10 @@
import sys
import zipfile
import io
import os import os
import zipfile
import shutil
from typing import Any, Union
from typing import BinaryIO, Any, TYPE_CHECKING from ._base import DocumentConverter, DocumentConverterResult
from ._converter_input import ConverterInput
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import UnsupportedFormatException, FileConversionException
# Break otherwise circular import for type hinting
if TYPE_CHECKING:
from .._markitdown import MarkItDown
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/zip",
]
ACCEPTED_FILE_EXTENSIONS = [".zip"]
class ZipConverter(DocumentConverter): class ZipConverter(DocumentConverter):
@@ -60,58 +47,104 @@ class ZipConverter(DocumentConverter):
""" """
def __init__( def __init__(
self, self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
*,
markitdown: "MarkItDown",
): ):
super().__init__() super().__init__(priority=priority)
self._markitdown = markitdown
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert( def convert(
self, self, input: ConverterInput, **kwargs: Any
file_stream: BinaryIO, ) -> Union[None, DocumentConverterResult]:
stream_info: StreamInfo, # Bail if not a ZIP
**kwargs: Any, # Options to pass to the converter extension = kwargs.get("file_extension", "")
) -> DocumentConverterResult: if extension.lower() != ".zip":
file_path = stream_info.url or stream_info.local_path or stream_info.filename return None
md_content = f"Content from the zip file `{file_path}`:\n\n"
# Bail if a local path is not provided
if input.input_type != "filepath":
return None
local_path = input.filepath
# Get parent converters list if available
parent_converters = kwargs.get("_parent_converters", [])
if not parent_converters:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] No converters available to process zip contents from: {local_path}",
)
extracted_zip_folder_name = (
f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
)
extraction_dir = os.path.normpath(
os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
)
md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
with zipfile.ZipFile(file_stream, "r") as zipObj:
for name in zipObj.namelist():
try: try:
z_file_stream = io.BytesIO(zipObj.read(name)) # Extract the zip file safely
z_file_stream_info = StreamInfo( with zipfile.ZipFile(local_path, "r") as zipObj:
extension=os.path.splitext(name)[1], # Safeguard against path traversal
filename=os.path.basename(name), for member in zipObj.namelist():
member_path = os.path.normpath(os.path.join(extraction_dir, member))
if (
not os.path.commonprefix([extraction_dir, member_path])
== extraction_dir
):
raise ValueError(
f"Path traversal detected in zip file: {member}"
) )
result = self._markitdown.convert_stream(
stream=z_file_stream,
stream_info=z_file_stream_info,
)
if result is not None:
md_content += f"## File: {name}\n\n"
md_content += result.markdown + "\n\n"
except UnsupportedFormatException:
pass
except FileConversionException:
pass
return DocumentConverterResult(markdown=md_content.strip()) # Extract all files safely
zipObj.extractall(path=extraction_dir)
# Process each extracted file
for root, dirs, files in os.walk(extraction_dir):
for name in files:
file_path = os.path.join(root, name)
relative_path = os.path.relpath(file_path, extraction_dir)
# Get file extension
_, file_extension = os.path.splitext(name)
# Update kwargs for the file
file_kwargs = kwargs.copy()
file_kwargs["file_extension"] = file_extension
file_kwargs["_parent_converters"] = parent_converters
# Try converting the file using available converters
for converter in parent_converters:
# Skip the zip converter to avoid infinite recursion
if isinstance(converter, ZipConverter):
continue
# Create a ConverterInput for the parent converter and attempt conversion
input = ConverterInput(
input_type="filepath", filepath=file_path
)
result = converter.convert(input, **file_kwargs)
if result is not None:
md_content += f"\n## File: {relative_path}\n\n"
md_content += result.text_content + "\n\n"
break
# Clean up extracted files if specified
if kwargs.get("cleanup_extracted", True):
shutil.rmtree(extraction_dir)
return DocumentConverterResult(title=None, text_content=md_content.strip())
except zipfile.BadZipFile:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
)
except ValueError as ve:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Security error in zip file {local_path}: {str(ve)}",
)
except Exception as e:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Failed to process zip file {local_path}: {str(e)}",
)
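The path traversal guard in the zip converter diff above compares paths with `os.path.commonprefix`, which works character by character rather than by path component, so a sibling directory whose name merely starts with the extraction directory's name can pass the check. The sketch below shows a component-aware alternative using `os.path.commonpath`. The helper name `is_within_directory` and the example paths are hypothetical, and this is an illustration, not the project's code.

```python
import os


def is_within_directory(base_dir: str, candidate: str) -> bool:
    """Return True if `candidate`, resolved against `base_dir`, stays inside it."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, candidate))
    # commonpath compares whole path components, not raw characters.
    return os.path.commonpath([base, target]) == base


# Character-wise comparison treats a look-alike sibling as a match:
prefix = os.path.commonprefix(["/tmp/extracted_a", "/tmp/extracted_a_evil/x"])

inside = is_within_directory("/tmp/extracted_a", "docs/file.txt")
escaped = is_within_directory("/tmp/extracted_a", "../extracted_a_evil/x")
```

Here `prefix` equals `/tmp/extracted_a` even though the second path lies outside that directory, while `is_within_directory` rejects the same member.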


@@ -7,7 +7,7 @@ from markitdown import __version__
try: try:
from .test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS from .test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
except ImportError: except ImportError:
from test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS # type: ignore from test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
@pytest.fixture(scope="session") @pytest.fixture(scope="session")


@@ -23,7 +23,7 @@
} }
], ],
"source": [ "source": [
"print(\"markitdown\")" "print('markitdown')"
] ]
}, },
{ {


@@ -2,20 +2,13 @@
import io import io
import os import os
import shutil import shutil
import openai
import pytest import pytest
import requests import requests
import warnings from warnings import catch_warnings, resetwarnings
from markitdown import ( from markitdown import MarkItDown
MarkItDown,
UnsupportedFormatException,
FileConversionException,
StreamInfo,
)
from markitdown._stream_info import _guess_stream_info_from_stream
skip_remote = ( skip_remote = (
True if os.environ.get("GITHUB_ACTIONS") else False True if os.environ.get("GITHUB_ACTIONS") else False
@@ -42,13 +35,6 @@ JPG_TEST_EXIFTOOL = {
"DateTimeOriginal": "2024:03:14 22:10:00", "DateTimeOriginal": "2024:03:14 22:10:00",
} }
MP3_TEST_EXIFTOOL = {
"Title": "f67a499e-a7d0-4ca3-a49b-358bd934ae3e",
"Artist": "Artist Name Test String",
"Album": "Album Name Test String",
"SampleRate": "48000",
}
PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf" PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf"
PDF_TEST_STRINGS = [ PDF_TEST_STRINGS = [
"While there is contemporaneous exploration of multi-agent approaches" "While there is contemporaneous exploration of multi-agent approaches"
@@ -176,107 +162,6 @@ def validate_strings(result, expected_strings, exclude_strings=None):
assert string not in text_content assert string not in text_content
def test_stream_info_operations() -> None:
"""Test operations performed on StreamInfo objects."""
stream_info_original = StreamInfo(
mimetype="mimetype.1",
extension="extension.1",
charset="charset.1",
filename="filename.1",
local_path="local_path.1",
url="url.1",
)
# Check updating all attributes by keyword
keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
for keyword in keywords:
updated_stream_info = stream_info_original.copy_and_update(
**{keyword: f"{keyword}.2"}
)
# Make sure the targeted attribute is updated
assert getattr(updated_stream_info, keyword) == f"{keyword}.2"
# Make sure the other attributes are unchanged
for k in keywords:
if k != keyword:
assert getattr(stream_info_original, k) == getattr(
updated_stream_info, k
)
# Check updating all attributes by passing a new StreamInfo object
keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
for keyword in keywords:
updated_stream_info = stream_info_original.copy_and_update(
StreamInfo(**{keyword: f"{keyword}.2"})
)
# Make sure the targeted attribute is updated
assert getattr(updated_stream_info, keyword) == f"{keyword}.2"
# Make sure the other attributes are unchanged
for k in keywords:
if k != keyword:
assert getattr(stream_info_original, k) == getattr(
updated_stream_info, k
)
# Check mixing and matching
updated_stream_info = stream_info_original.copy_and_update(
StreamInfo(extension="extension.2", filename="filename.2"),
mimetype="mimetype.3",
charset="charset.3",
)
assert updated_stream_info.extension == "extension.2"
assert updated_stream_info.filename == "filename.2"
assert updated_stream_info.mimetype == "mimetype.3"
assert updated_stream_info.charset == "charset.3"
assert updated_stream_info.local_path == "local_path.1"
assert updated_stream_info.url == "url.1"
# Check multiple StreamInfo objects
updated_stream_info = stream_info_original.copy_and_update(
StreamInfo(extension="extension.4", filename="filename.5"),
StreamInfo(mimetype="mimetype.6", charset="charset.7"),
)
assert updated_stream_info.extension == "extension.4"
assert updated_stream_info.filename == "filename.5"
assert updated_stream_info.mimetype == "mimetype.6"
assert updated_stream_info.charset == "charset.7"
assert updated_stream_info.local_path == "local_path.1"
assert updated_stream_info.url == "url.1"
def test_stream_info_guesses() -> None:
"""Test StreamInfo guesses based on stream content."""
test_tuples = [
(
os.path.join(TEST_FILES_DIR, "test.xlsx"),
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
),
(
os.path.join(TEST_FILES_DIR, "test.docx"),
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
),
(
os.path.join(TEST_FILES_DIR, "test.pptx"),
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
),
(os.path.join(TEST_FILES_DIR, "test.xls"), "application/vnd.ms-excel"),
]
for file_path, expected_mimetype in test_tuples:
with open(file_path, "rb") as f:
guesses = _guess_stream_info_from_stream(
f, filename_hint=os.path.basename(file_path)
)
assert len(guesses) > 0
assert guesses[0].mimetype == expected_mimetype
assert guesses[0].extension == os.path.splitext(file_path)[1]
@pytest.mark.skipif( @pytest.mark.skipif(
skip_remote, skip_remote,
reason="do not run tests that query external urls", reason="do not run tests that query external urls",
@@ -298,18 +183,15 @@ def test_markitdown_remote() -> None:
assert test_string in result.text_content assert test_string in result.text_content
# Youtube # Youtube
result = markitdown.convert(YOUTUBE_TEST_URL) # TODO: This test randomly fails for some reason. Haven't been able to repro it yet. Disabling until I can debug the issue
for test_string in YOUTUBE_TEST_STRINGS: # result = markitdown.convert(YOUTUBE_TEST_URL)
assert test_string in result.text_content # for test_string in YOUTUBE_TEST_STRINGS:
# assert test_string in result.text_content
def test_markitdown_local() -> None: def test_markitdown_local_paths() -> None:
markitdown = MarkItDown() markitdown = MarkItDown()
# Test PDF processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pdf"))
validate_strings(result, PDF_TEST_STRINGS)
# Test XLSX processing # Test XLSX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
validate_strings(result, XLSX_TEST_STRINGS) validate_strings(result, XLSX_TEST_STRINGS)
@@ -348,6 +230,10 @@ def test_markitdown_local() -> None:
) )
validate_strings(result, BLOG_TEST_STRINGS) validate_strings(result, BLOG_TEST_STRINGS)
# Test ZIP file processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
validate_strings(result, XLSX_TEST_STRINGS)
# Test Wikipedia processing # Test Wikipedia processing
result = markitdown.convert( result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
@@ -368,43 +254,27 @@ def test_markitdown_local() -> None:
for test_string in RSS_TEST_STRINGS: for test_string in RSS_TEST_STRINGS:
assert test_string in text_content assert test_string in text_content
## Test non-UTF-8 encoding
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
validate_strings(result, CSV_CP932_TEST_STRINGS)
# Test MSG (Outlook email) processing # Test MSG (Outlook email) processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
validate_strings(result, MSG_TEST_STRINGS) validate_strings(result, MSG_TEST_STRINGS)
# Test non-UTF-8 encoding
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
validate_strings(result, CSV_CP932_TEST_STRINGS)
# Test JSON processing # Test JSON processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
validate_strings(result, JSON_TEST_STRINGS) validate_strings(result, JSON_TEST_STRINGS)
# # Test ZIP file processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
validate_strings(result, DOCX_TEST_STRINGS)
validate_strings(result, XLSX_TEST_STRINGS)
validate_strings(result, BLOG_TEST_STRINGS)
# Test input from a stream
input_data = b"<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data))
assert "# Test" in result.text_content
# Test input with leading blank characters # Test input with leading blank characters
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>" input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data)) result = markitdown.convert_stream(io.BytesIO(input_data))
assert "# Test" in result.text_content assert "# Test" in result.text_content
def test_markitdown_streams() -> None: def test_markitdown_local_objects() -> None:
markitdown = MarkItDown() markitdown = MarkItDown()
# Test PDF processing
with open(os.path.join(TEST_FILES_DIR, "test.pdf"), "rb") as f:
result = markitdown.convert(f, file_extension=".pdf")
validate_strings(result, PDF_TEST_STRINGS)
# Test XLSX processing # Test XLSX processing
with open(os.path.join(TEST_FILES_DIR, "test.xlsx"), "rb") as f: with open(os.path.join(TEST_FILES_DIR, "test.xlsx"), "rb") as f:
result = markitdown.convert(f, file_extension=".xlsx") result = markitdown.convert(f, file_extension=".xlsx")
@@ -443,18 +313,24 @@ def test_markitdown_streams() -> None:
validate_strings(result, PPTX_TEST_STRINGS) validate_strings(result, PPTX_TEST_STRINGS)
# Test HTML processing # Test HTML processing
with open(os.path.join(TEST_FILES_DIR, "test_blog.html"), "rb") as f: with open(
os.path.join(TEST_FILES_DIR, "test_blog.html"), "rt", encoding="utf-8"
) as f:
result = markitdown.convert(f, file_extension=".html", url=BLOG_TEST_URL) result = markitdown.convert(f, file_extension=".html", url=BLOG_TEST_URL)
validate_strings(result, BLOG_TEST_STRINGS) validate_strings(result, BLOG_TEST_STRINGS)
# Test Wikipedia processing # Test Wikipedia processing
with open(os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), "rb") as f: with open(
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), "rt", encoding="utf-8"
) as f:
result = markitdown.convert(f, file_extension=".html", url=WIKIPEDIA_TEST_URL) result = markitdown.convert(f, file_extension=".html", url=WIKIPEDIA_TEST_URL)
text_content = result.text_content.replace("\\", "") text_content = result.text_content.replace("\\", "")
validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES) validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
# Test Bing processing # Test Bing processing
with open(os.path.join(TEST_FILES_DIR, "test_serp.html"), "rb") as f: with open(
os.path.join(TEST_FILES_DIR, "test_serp.html"), "rt", encoding="utf-8"
) as f:
result = markitdown.convert(f, file_extension=".html", url=SERP_TEST_URL) result = markitdown.convert(f, file_extension=".html", url=SERP_TEST_URL)
text_content = result.text_content.replace("\\", "") text_content = result.text_content.replace("\\", "")
validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES) validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
@@ -477,41 +353,6 @@ def test_markitdown_streams() -> None:
validate_strings(result, JSON_TEST_STRINGS) validate_strings(result, JSON_TEST_STRINGS)
@pytest.mark.skipif(
skip_remote,
reason="do not run speech transcription tests remotely",
)
def test_speech_transcription() -> None:
markitdown = MarkItDown()
# Test WAV, MP3, and M4A files
for file_name in ["test.wav", "test.mp3", "test.m4a"]:
result = markitdown.convert(os.path.join(TEST_FILES_DIR, file_name))
result_lower = result.text_content.lower()
assert (
("1" in result_lower or "one" in result_lower)
and ("2" in result_lower or "two" in result_lower)
and ("3" in result_lower or "three" in result_lower)
and ("4" in result_lower or "four" in result_lower)
and ("5" in result_lower or "five" in result_lower)
)
def test_exceptions() -> None:
# Check that an exception is raised when trying to convert an unsupported format
markitdown = MarkItDown()
with pytest.raises(UnsupportedFormatException):
markitdown.convert(os.path.join(TEST_FILES_DIR, "random.bin"))
# Check that an exception is raised when trying to convert a file that is corrupted
with pytest.raises(FileConversionException) as exc_info:
markitdown.convert(
os.path.join(TEST_FILES_DIR, "random.bin"), file_extension=".pptx"
)
assert len(exc_info.value.attempts) == 1
assert type(exc_info.value.attempts[0].converter).__name__ == "PptxConverter"
@pytest.mark.skipif( @pytest.mark.skipif(
skip_exiftool, skip_exiftool,
reason="do not run if exiftool is not installed", reason="do not run if exiftool is not installed",
@@ -520,20 +361,17 @@ def test_markitdown_exiftool() -> None:
# Test the automatic discovery of exiftool throws a warning # Test the automatic discovery of exiftool throws a warning
# and is disabled # and is disabled
try: try:
warnings.simplefilter("default") with catch_warnings(record=True) as w:
with warnings.catch_warnings(record=True) as w:
markitdown = MarkItDown() markitdown = MarkItDown()
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
assert len(w) == 1 assert len(w) == 1
assert w[0].category is DeprecationWarning assert w[0].category is DeprecationWarning
assert result.text_content.strip() == "" assert result.text_content.strip() == ""
finally: finally:
warnings.resetwarnings() resetwarnings()
which_exiftool = shutil.which("exiftool")
assert which_exiftool is not None
# Test explicitly setting the location of exiftool # Test explicitly setting the location of exiftool
which_exiftool = shutil.which("exiftool")
markitdown = MarkItDown(exiftool_path=which_exiftool) markitdown = MarkItDown(exiftool_path=which_exiftool)
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
for key in JPG_TEST_EXIFTOOL: for key in JPG_TEST_EXIFTOOL:
@@ -548,12 +386,6 @@ def test_markitdown_exiftool() -> None:
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}" target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
assert target in result.text_content assert target in result.text_content
# Test some other media types
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.mp3"))
for key in MP3_TEST_EXIFTOOL:
target = f"{key}: {MP3_TEST_EXIFTOOL[key]}"
assert target in result.text_content
@pytest.mark.skipif( @pytest.mark.skipif(
skip_llm, skip_llm,
@@ -564,6 +396,7 @@ def test_markitdown_llm() -> None:
markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o") markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg")) result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
for test_string in LLM_TEST_STRINGS: for test_string in LLM_TEST_STRINGS:
assert test_string in result.text_content assert test_string in result.text_content
@@ -572,24 +405,12 @@ def test_markitdown_llm() -> None:
for test_string in ["red", "circle", "blue", "square"]: for test_string in ["red", "circle", "blue", "square"]:
assert test_string in result.text_content.lower() assert test_string in result.text_content.lower()
# Images embedded in PPTX files
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
# LLM Captions are included
for test_string in LLM_TEST_STRINGS:
assert test_string in result.text_content
# Standard alt text is included
validate_strings(result, PPTX_TEST_STRINGS)
if __name__ == "__main__": if __name__ == "__main__":
"""Runs this file's tests from the command line.""" """Runs this file's tests from the command line."""
test_stream_info_operations()
test_stream_info_guesses()
test_markitdown_remote() test_markitdown_remote()
test_markitdown_local() test_markitdown_local_paths()
test_markitdown_streams() test_markitdown_local_objects()
test_speech_transcription()
test_exceptions()
test_markitdown_exiftool() test_markitdown_exiftool()
test_markitdown_llm() # test_markitdown_llm()
print("All tests passed!") print("All tests passed!")
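The exiftool test in the diff above records warnings with `catch_warnings(record=True)` and then asserts on the count and category. The sketch below shows that capture pattern in isolation. `legacy_call` and `call_and_capture` are hypothetical names, and setting the filter to `"always"` inside the context is an assumption made so the warning is recorded regardless of earlier filter state.

```python
import warnings


def legacy_call() -> str:
    """Hypothetical function that emits a DeprecationWarning when used."""
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def call_and_capture():
    """Run legacy_call and return its result together with the recorded warnings."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning inside this block, whatever filters were set before.
        warnings.simplefilter("always")
        result = legacy_call()
    return result, caught


result, caught = call_and_capture()
```

Because `catch_warnings` restores the previous filter state when the block exits, a sketch like this does not need a separate `resetwarnings()` call in a `finally` clause.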