Compare commits: kennyzhang...v0.1.0a4 (40 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | a93e0567e6 | | |
| | c5f70b904f | | |
| | 53834fdd24 | | |
| | 5c565b7d79 | | |
| | a78857bd43 | | |
| | 09df7fe8df | | |
| | 6a9f09b153 | | |
| | 0b815fb916 | | |
| | 12620f1545 | | |
| | 5f75e16d20 | | |
| | 75140a90e2 | | |
| | af1be36e0c | | |
| | 2a2ccc86aa | | |
| | 2e51ba22e7 | | |
| | 8f8e58c9bb | | |
| | 8e73a325c6 | | |
| | 2405f201af | | |
| | 99d8e562db | | |
| | 515fa854bf | | |
| | 0229ff6cb7 | | |
| | 82d84e3edd | | |
| | 36c4bc9ec3 | | |
| | 80baa5db18 | | |
| | 00a65e8f8b | | |
| | 6bedf6d950 | | |
| | 9380112892 | | |
| | 784c293579 | | |
| | 70e9f8c3c0 | | |
| | e921497f79 | | |
| | 1d2f231146 | | |
| | c5cd659f63 | | |
| | f01c6c5277 | | |
| | 43bd79adc9 | | |
| | 9182923375 | | |
| | 9a19fdd134 | | |
| | e82e0c1372 | | |
| | a394cc7c27 | | |
| | a87fbf01ee | | |
| | d0ed74fdf4 | | |
| | e4b419ba40 | | |
@@ -1 +1,2 @@
|
||||
*
|
||||
!packages/
|
||||
|
||||
3
.gitattributes
vendored
@@ -1 +1,2 @@
|
||||
tests/test_files/** linguist-vendored
|
||||
packages/markitdown/tests/test_files/** linguist-vendored
|
||||
packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored
|
||||
|
||||
30
Dockerfile
@@ -1,22 +1,32 @@
|
||||
FROM python:3.13-slim-bullseye
|
||||
|
||||
USER root
|
||||
|
||||
ARG INSTALL_GIT=false
|
||||
RUN if [ "$INSTALL_GIT" = "true" ]; then \
|
||||
apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
|
||||
fi
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
ENV EXIFTOOL_PATH=/usr/bin/exiftool
|
||||
ENV FFMPEG_PATH=/usr/bin/ffmpeg
|
||||
|
||||
# Runtime dependency
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
ffmpeg \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
exiftool
|
||||
|
||||
RUN pip install markitdown
|
||||
ARG INSTALL_GIT=false
|
||||
RUN if [ "$INSTALL_GIT" = "true" ]; then \
|
||||
apt-get install -y --no-install-recommends \
|
||||
git; \
|
||||
fi
|
||||
|
||||
# Cleanup
|
||||
RUN rm -rf /var/lib/apt/lists/*
|
||||
|
||||
WORKDIR /app
|
||||
COPY . /app
|
||||
RUN pip --no-cache-dir install \
|
||||
/app/packages/markitdown[all] \
|
||||
/app/packages/markitdown-sample-plugin
|
||||
|
||||
# Default USERID and GROUPID
|
||||
ARG USERID=10000
|
||||
ARG GROUPID=10000
|
||||
ARG USERID=nobody
|
||||
ARG GROUPID=nogroup
|
||||
|
||||
USER $USERID:$GROUPID
|
||||
|
||||
|
||||
75
README.md
@@ -5,10 +5,14 @@
|
||||
|
||||
|
||||
> [!IMPORTANT]
|
||||
> MarkItDown 0.0.2 alpha 1 (0.0.2a1) introduces a plugin-based architecture. As much as was possible, command-line and Python interfaces have remained the same as 0.0.1a3 to support backward compatibility. Please report any issues you encounter. Some interface changes may yet occur as we continue to refine MarkItDown to a first non-alpha release.
|
||||
> Breaking changes from 0.0.1 to 0.1.0:
|
||||
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]~=0.1.0a1'` to have backward-compatible behavior.
|
||||
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
|
||||
|
||||
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including headings, lists, tables, links, etc.). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
|
||||
|
||||
At present, MarkItDown supports:
|
||||
|
||||
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.).
|
||||
It supports:
|
||||
- PDF
|
||||
- PowerPoint
|
||||
- Word
|
||||
@@ -18,14 +22,27 @@ It supports:
|
||||
- HTML
|
||||
- Text-based formats (CSV, JSON, XML)
|
||||
- ZIP files (iterates over contents)
|
||||
- YouTube URLs
|
||||
- EPubs
|
||||
- ... and more!
|
||||
|
||||
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source:
|
||||
## Why Markdown?
|
||||
|
||||
Markdown is extremely close to plain text, with minimal markup or formatting, but still
|
||||
provides a way to represent important document structure. Mainstream LLMs, such as
|
||||
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
|
||||
responses unprompted. This suggests that they have been trained on vast amounts of
|
||||
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
|
||||
are also highly token-efficient.
|
||||
|
||||
## Installation
|
||||
|
||||
To install MarkItDown, use pip: `pip install 'markitdown[all]~=0.1.0a1'`. Alternatively, you can install it from the source:
|
||||
|
||||
```bash
|
||||
git clone git@github.com:microsoft/markitdown.git
|
||||
cd markitdown
|
||||
pip install -e packages/markitdown
|
||||
pip install -e 'packages/markitdown[all]'
|
||||
```
|
||||
|
||||
## Usage
|
||||
@@ -48,6 +65,28 @@ You can also pipe content:
|
||||
cat path-to-file.pdf | markitdown
|
||||
```
|
||||
|
||||
### Optional Dependencies
|
||||
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:
|
||||
|
||||
```bash
|
||||
pip install 'markitdown[pdf,docx,pptx]'
|
||||
```
|
||||
|
||||
will install only the dependencies for PDF, DOCX, and PPTX files.
|
||||
|
||||
At the moment, the following optional dependencies are available:
|
||||
|
||||
* `[all]` Installs all optional dependencies
|
||||
* `[pptx]` Installs dependencies for PowerPoint files
|
||||
* `[docx]` Installs dependencies for Word files
|
||||
* `[xlsx]` Installs dependencies for Excel files
|
||||
* `[xls]` Installs dependencies for older Excel files
|
||||
* `[pdf]` Installs dependencies for PDF files
|
||||
* `[outlook]` Installs dependencies for Outlook messages
|
||||
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
|
||||
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
|
||||
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
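
As a rough illustration of how these feature groups map onto importable modules, the sketch below reports which groups appear to be satisfied in the current environment. `FEATURE_MODULES` and `available_features` are hypothetical names for this example, and the module names are assumptions based on each package's usual import name; MarkItDown itself does not expose this helper.

```python
from importlib.util import find_spec

# Assumed mapping from feature group to the top-level modules it relies on.
# The module names are the packages' usual import names, not taken from MarkItDown.
FEATURE_MODULES = {
    "pptx": ["pptx"],
    "docx": ["mammoth"],
    "xlsx": ["pandas", "openpyxl"],
    "xls": ["pandas", "xlrd"],
    "pdf": ["pdfminer"],
    "outlook": ["olefile"],
    "audio-transcription": ["pydub", "speech_recognition"],
    "youtube-transcription": ["youtube_transcript_api"],
}


def available_features(feature_modules=FEATURE_MODULES):
    """Return {feature: bool}; True when every listed module can be located."""
    return {
        feature: all(find_spec(name) is not None for name in modules)
        for feature, modules in feature_modules.items()
    }


if __name__ == "__main__":
    for feature, ok in available_features().items():
        print(f"[{feature}] {'installed' if ok else 'missing'}")
```

A group reported as missing can then be added with the corresponding extra, for example `pip install 'markitdown[pdf]'`.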
|
||||
|
||||
### Plugins
|
||||
|
||||
MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
|
||||
@@ -74,7 +113,6 @@ markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoin
|
||||
|
||||
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
|
||||
|
||||
|
||||
### Python API
|
||||
|
||||
Basic usage in Python:
|
||||
@@ -97,25 +135,6 @@ result = md.convert("test.pdf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
MarkItDown also supports converting file objects directly:
|
||||
|
||||
```python
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown()
|
||||
|
||||
# Providing the file extension when converting via file objects is recommended for most consistent results
|
||||
# Binary Mode
|
||||
with open("test.docx", 'rb') as file:
|
||||
result = md.convert(file, file_extension=".docx")
|
||||
print(result.text_content)
|
||||
|
||||
# Non-Binary Mode
|
||||
with open("sample.ipynb", 'rt', encoding="utf-8") as file:
|
||||
result = md.convert(file, file_extension=".ipynb")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
|
||||
|
||||
```python
|
||||
@@ -153,11 +172,10 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
|
||||
|
||||
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are, of course, just suggestions and you are welcome to contribute in any way you like.
|
||||
|
||||
|
||||
<div align="center">
|
||||
|
||||
| | All | Especially Needs Help from Community |
|
||||
|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
|
||||
| ---------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
|
||||
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
|
||||
|
||||
@@ -172,6 +190,7 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
|
||||
```
|
||||
|
||||
- Install `hatch` in your environment and run tests:
|
||||
|
||||
```sh
|
||||
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
|
||||
hatch shell
|
||||
@@ -179,6 +198,7 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
|
||||
```
|
||||
|
||||
(Alternative) Use the Devcontainer which has all the dependencies installed:
|
||||
|
||||
```sh
|
||||
# Reopen the project in Devcontainer and run:
|
||||
hatch test
|
||||
@@ -190,7 +210,6 @@ You can help by looking at issues or helping review PRs. Any issue or PR is welc
|
||||
|
||||
You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.
|
||||
|
||||
|
||||
## Trademarks
|
||||
|
||||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
|
||||
|
||||
@@ -10,23 +10,38 @@ This project shows how to create a sample plugin for MarkItDown. The most import
|
||||
Next, implement your custom DocumentConverter:
|
||||
|
||||
```python
|
||||
from typing import Union
|
||||
from markitdown import DocumentConverter, DocumentConverterResult
|
||||
from typing import BinaryIO, Any
|
||||
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult, StreamInfo
|
||||
|
||||
class RtfConverter(DocumentConverter):
|
||||
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not an RTF file
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".rtf":
|
||||
return None
|
||||
|
||||
# Implement the conversion logic here ...
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
|
||||
# Return the result
|
||||
return DocumentConverterResult(
|
||||
title=title,
|
||||
text_content=text_content,
|
||||
)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> bool:
|
||||
|
||||
# Implement logic to check if the file stream is an RTF file
|
||||
# ...
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
|
||||
# Implement logic to convert the file stream to Markdown
|
||||
# ...
|
||||
raise NotImplementedError()
|
||||
```
|
||||
|
||||
Next, make sure your package implements and exports the following:
|
||||
@@ -71,10 +86,10 @@ Once the plugin package is installed, verify that it is available to MarkItDown
|
||||
markitdown --list-plugins
|
||||
```
|
||||
|
||||
To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert a PDF:
|
||||
To use the plugin for a conversion use the `--use-plugins` flag. For example, to convert an RTF file:
|
||||
|
||||
```bash
|
||||
markitdown --use-plugins path-to-file.pdf
|
||||
markitdown --use-plugins path-to-file.rtf
|
||||
```
|
||||
|
||||
In Python, plugins can be enabled as follows:
|
||||
@@ -83,7 +98,7 @@ In Python, plugins can be enabled as follows:
|
||||
from markitdown import MarkItDown
|
||||
|
||||
md = MarkItDown(enable_plugins=True)
|
||||
result = md.convert("path-to-file.pdf")
|
||||
result = md.convert("path-to-file.rtf")
|
||||
print(result.text_content)
|
||||
```
|
||||
|
||||
|
||||
@@ -24,7 +24,7 @@ classifiers = [
|
||||
"Programming Language :: Python :: Implementation :: PyPy",
|
||||
]
|
||||
dependencies = [
|
||||
"markitdown",
|
||||
"markitdown>=0.1.0a1",
|
||||
"striprtf",
|
||||
]
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
__version__ = "0.0.1a2"
|
||||
__version__ = "0.1.0a1"
|
||||
|
||||
@@ -1,12 +1,26 @@
|
||||
from typing import Union
|
||||
import locale
|
||||
from typing import BinaryIO, Any
|
||||
from striprtf.striprtf import rtf_to_text
|
||||
|
||||
from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
|
||||
from markitdown import (
|
||||
MarkItDown,
|
||||
DocumentConverter,
|
||||
DocumentConverterResult,
|
||||
StreamInfo,
|
||||
)
|
||||
|
||||
|
||||
__plugin_interface_version__ = (
|
||||
1 # The version of the plugin interface that this plugin uses
|
||||
)
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/rtf",
|
||||
"application/rtf",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".rtf"]
|
||||
|
||||
|
||||
def register_converters(markitdown: MarkItDown, **kwargs):
|
||||
"""
|
||||
@@ -22,18 +36,36 @@ class RtfConverter(DocumentConverter):
|
||||
Converts an RTF file in the simplest possible way.
|
||||
"""
|
||||
|
||||
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not an RTF file
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".rtf":
|
||||
return None
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
# Read the RTF file
|
||||
with open(local_path, "r") as f:
|
||||
rtf = f.read()
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
# Read the file stream into a str using the provided charset encoding, or using the system default
|
||||
encoding = stream_info.charset or locale.getpreferredencoding()
|
||||
stream_data = file_stream.read().decode(encoding)
|
||||
|
||||
# Return the result
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=rtf_to_text(rtf),
|
||||
markdown=rtf_to_text(stream_data),
|
||||
)
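
The converter in this hunk decides whether to accept a stream from its metadata alone, then decodes the bytes using the hinted charset. A minimal standalone sketch of those two steps follows. It uses a hypothetical `StreamHints` dataclass in place of MarkItDown's `StreamInfo` and a hypothetical `read_text` helper, and it leaves out the actual RTF parsing.

```python
import io
import locale
from dataclasses import dataclass
from typing import BinaryIO, Optional

ACCEPTED_MIME_TYPE_PREFIXES = ["text/rtf", "application/rtf"]
ACCEPTED_FILE_EXTENSIONS = [".rtf"]


@dataclass
class StreamHints:
    """Hypothetical stand-in for StreamInfo, with only the fields used here."""

    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None


def accepts(hints: StreamHints) -> bool:
    """Accept on a known extension, or on a MIME type with a known prefix."""
    mimetype = (hints.mimetype or "").lower()
    extension = (hints.extension or "").lower()
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True
    return any(mimetype.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)


def read_text(file_stream: BinaryIO, hints: StreamHints) -> str:
    """Decode the whole stream with the hinted charset, else the system default."""
    encoding = hints.charset or locale.getpreferredencoding()
    return file_stream.read().decode(encoding)


if __name__ == "__main__":
    hints = StreamHints(mimetype="text/rtf", extension=".rtf", charset="utf-8")
    print(accepts(hints), read_text(io.BytesIO(b"hello"), hints))
```

Matching on a MIME prefix rather than the full string lets values with parameters, such as `application/rtf; charset=utf-8`, still be accepted.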
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
import os
|
||||
import pytest
|
||||
|
||||
from markitdown import MarkItDown
|
||||
from markitdown import MarkItDown, StreamInfo
|
||||
from markitdown_sample_plugin import RtfConverter
|
||||
|
||||
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
|
||||
@@ -15,9 +15,13 @@ RTF_TEST_STRINGS = {
|
||||
|
||||
def test_converter() -> None:
|
||||
"""Tests the RTF converter directly."""
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.rtf"), "rb") as file_stream:
|
||||
converter = RtfConverter()
|
||||
result = converter.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test.rtf"), file_extension=".rtf"
|
||||
file_stream=file_stream,
|
||||
stream_info=StreamInfo(
|
||||
mimetype="text/rtf", extension=".rtf", filename="test.rtf"
|
||||
),
|
||||
)
|
||||
|
||||
for test_string in RTF_TEST_STRINGS:
|
||||
@@ -26,7 +30,7 @@ def test_converter() -> None:
|
||||
|
||||
def test_markitdown() -> None:
|
||||
"""Tests that MarkItDown correctly loads the plugin."""
|
||||
md = MarkItDown()
|
||||
md = MarkItDown(enable_plugins=True)
|
||||
result = md.convert(os.path.join(TEST_FILES_DIR, "test.rtf"))
|
||||
|
||||
for test_string in RTF_TEST_STRINGS:
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
From PyPI:
|
||||
|
||||
```bash
|
||||
pip install markitdown
|
||||
pip install 'markitdown[all]'
|
||||
```
|
||||
|
||||
From source:
|
||||
@@ -18,7 +18,7 @@ From source:
|
||||
```bash
|
||||
git clone git@github.com:microsoft/markitdown.git
|
||||
cd markitdown
|
||||
pip install -e packages/markitdown
|
||||
pip install -e 'packages/markitdown[all]'
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
@@ -26,25 +26,35 @@ classifiers = [
|
||||
dependencies = [
|
||||
"beautifulsoup4",
|
||||
"requests",
|
||||
"mammoth",
|
||||
"markdownify",
|
||||
"numpy",
|
||||
"magika>=0.6.1rc3",
|
||||
"charset-normalizer",
|
||||
]
|
||||
|
||||
[project.optional-dependencies]
|
||||
all = [
|
||||
"python-pptx",
|
||||
"mammoth",
|
||||
"pandas",
|
||||
"openpyxl",
|
||||
"xlrd",
|
||||
"pdfminer.six",
|
||||
"puremagic",
|
||||
"pydub",
|
||||
"olefile",
|
||||
"youtube-transcript-api",
|
||||
"pydub",
|
||||
"SpeechRecognition",
|
||||
"pathvalidate",
|
||||
"charset-normalizer",
|
||||
"openai",
|
||||
"youtube-transcript-api",
|
||||
"azure-ai-documentintelligence",
|
||||
"azure-identity"
|
||||
]
|
||||
pptx = ["python-pptx"]
|
||||
docx = ["mammoth"]
|
||||
xlsx = ["pandas", "openpyxl"]
|
||||
xls = ["pandas", "xlrd"]
|
||||
pdf = ["pdfminer.six"]
|
||||
outlook = ["olefile"]
|
||||
audio-transcription = ["pydub", "SpeechRecognition"]
|
||||
youtube-transcription = ["youtube-transcript-api"]
|
||||
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
|
||||
|
||||
[project.urls]
|
||||
Documentation = "https://github.com/microsoft/markitdown#readme"
|
||||
@@ -57,12 +67,24 @@ path = "src/markitdown/__about__.py"
|
||||
[project.scripts]
|
||||
markitdown = "markitdown.__main__:main"
|
||||
|
||||
[tool.hatch.envs.types]
|
||||
[tool.hatch.envs.default]
|
||||
features = ["all"]
|
||||
|
||||
[tool.hatch.envs.hatch-test]
|
||||
features = ["all"]
|
||||
extra-dependencies = [
|
||||
"openai",
|
||||
]
|
||||
|
||||
[tool.hatch.envs.types]
|
||||
features = ["all"]
|
||||
extra-dependencies = [
|
||||
"openai",
|
||||
"mypy>=1.0.0",
|
||||
]
|
||||
|
||||
[tool.hatch.envs.types.scripts]
|
||||
check = "mypy --install-types --non-interactive {args:src/markitdown tests}"
|
||||
check = "mypy --install-types --non-interactive --ignore-missing-imports {args:src/markitdown tests}"
|
||||
|
||||
[tool.coverage.run]
|
||||
source_pkgs = ["markitdown", "tests"]
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
|
||||
#
|
||||
# SPDX-License-Identifier: MIT
|
||||
__version__ = "0.0.2a1"
|
||||
__version__ = "0.1.0a4"
|
||||
|
||||
@@ -3,14 +3,20 @@
|
||||
# SPDX-License-Identifier: MIT
|
||||
|
||||
from .__about__ import __version__
|
||||
from ._markitdown import MarkItDown
|
||||
from ._markitdown import (
|
||||
MarkItDown,
|
||||
PRIORITY_SPECIFIC_FILE_FORMAT,
|
||||
PRIORITY_GENERIC_FILE_FORMAT,
|
||||
)
|
||||
from ._base_converter import DocumentConverterResult, DocumentConverter
|
||||
from ._stream_info import StreamInfo
|
||||
from ._exceptions import (
|
||||
MarkItDownException,
|
||||
ConverterPrerequisiteException,
|
||||
MissingDependencyException,
|
||||
FailedConversionAttempt,
|
||||
FileConversionException,
|
||||
UnsupportedFormatException,
|
||||
)
|
||||
from .converters import DocumentConverter, DocumentConverterResult
|
||||
|
||||
__all__ = [
|
||||
"__version__",
|
||||
@@ -18,7 +24,11 @@ __all__ = [
|
||||
"DocumentConverter",
|
||||
"DocumentConverterResult",
|
||||
"MarkItDownException",
|
||||
"ConverterPrerequisiteException",
|
||||
"MissingDependencyException",
|
||||
"FailedConversionAttempt",
|
||||
"FileConversionException",
|
||||
"UnsupportedFormatException",
|
||||
"StreamInfo",
|
||||
"PRIORITY_SPECIFIC_FILE_FORMAT",
|
||||
"PRIORITY_GENERIC_FILE_FORMAT",
|
||||
]
|
||||
|
||||
@@ -3,10 +3,11 @@
|
||||
# SPDX-License-Identifier: MIT
|
||||
import argparse
|
||||
import sys
|
||||
import codecs
|
||||
from textwrap import dedent
|
||||
from importlib.metadata import entry_points
|
||||
from .__about__ import __version__
|
||||
from ._markitdown import MarkItDown, DocumentConverterResult
|
||||
from ._markitdown import MarkItDown, StreamInfo, DocumentConverterResult
|
||||
|
||||
|
||||
def main():
|
||||
@@ -58,6 +59,24 @@ def main():
|
||||
help="Output file name. If not provided, output is written to stdout.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-x",
|
||||
"--extension",
|
||||
help="Provide a hint about the file extension (e.g., when reading from stdin).",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-m",
|
||||
"--mime-type",
|
||||
help="Provide a hint about the file's MIME type.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-c",
|
||||
"--charset",
|
||||
help="Provide a hint about the file's charset (e.g., UTF-8).",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-d",
|
||||
"--use-docintel",
|
||||
@@ -88,6 +107,48 @@ def main():
|
||||
parser.add_argument("filename", nargs="?")
|
||||
args = parser.parse_args()
|
||||
|
||||
# Parse the extension hint
|
||||
extension_hint = args.extension
|
||||
if extension_hint is not None:
|
||||
extension_hint = extension_hint.strip().lower()
|
||||
if len(extension_hint) > 0:
|
||||
if not extension_hint.startswith("."):
|
||||
extension_hint = "." + extension_hint
|
||||
else:
|
||||
extension_hint = None
|
||||
|
||||
# Parse the mime type
|
||||
mime_type_hint = args.mime_type
|
||||
if mime_type_hint is not None:
|
||||
mime_type_hint = mime_type_hint.strip()
|
||||
if len(mime_type_hint) > 0:
|
||||
if mime_type_hint.count("/") != 1:
|
||||
_exit_with_error(f"Invalid MIME type: {mime_type_hint}")
|
||||
else:
|
||||
mime_type_hint = None
|
||||
|
||||
# Parse the charset
|
||||
charset_hint = args.charset
|
||||
if charset_hint is not None:
|
||||
charset_hint = charset_hint.strip()
|
||||
if len(charset_hint) > 0:
|
||||
try:
|
||||
charset_hint = codecs.lookup(charset_hint).name
|
||||
except LookupError:
|
||||
_exit_with_error(f"Invalid charset: {charset_hint}")
|
||||
else:
|
||||
charset_hint = None
|
||||
|
||||
stream_info = None
|
||||
if (
|
||||
extension_hint is not None
|
||||
or mime_type_hint is not None
|
||||
or charset_hint is not None
|
||||
):
|
||||
stream_info = StreamInfo(
|
||||
extension=extension_hint, mimetype=mime_type_hint, charset=charset_hint
|
||||
)
|
||||
|
||||
if args.list_plugins:
|
||||
# List installed plugins, then exit
|
||||
print("Installed MarkItDown 3rd-party Plugins:\n")
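
The hint parsing added in this hunk can be sketched in isolation. The helper names below (`normalize_extension`, `normalize_mime_type`, `normalize_charset`) are hypothetical, and unlike the CLI code, which exits via `_exit_with_error`, this sketch raises `ValueError` so that it can be exercised outside a command-line program.

```python
import codecs
from typing import Optional


def normalize_extension(hint: Optional[str]) -> Optional[str]:
    """Lower-case the hint and make sure it starts with a dot; empty means no hint."""
    if hint is None:
        return None
    hint = hint.strip().lower()
    if not hint:
        return None
    return hint if hint.startswith(".") else "." + hint


def normalize_mime_type(hint: Optional[str]) -> Optional[str]:
    """Require exactly one '/' in a non-empty MIME type hint."""
    if hint is None:
        return None
    hint = hint.strip()
    if not hint:
        return None
    if hint.count("/") != 1:
        raise ValueError(f"Invalid MIME type: {hint}")
    return hint


def normalize_charset(hint: Optional[str]) -> Optional[str]:
    """Map a charset alias to Python's canonical codec name."""
    if hint is None:
        return None
    hint = hint.strip()
    if not hint:
        return None
    try:
        return codecs.lookup(hint).name
    except LookupError:
        raise ValueError(f"Invalid charset: {hint}") from None


if __name__ == "__main__":
    print(normalize_extension("PDF"), normalize_mime_type("text/html"), normalize_charset("UTF-8"))
```

Normalizing the charset through `codecs.lookup` means aliases such as `UTF-8` and `utf8` resolve to the same codec name before they reach a converter.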
|
||||
@@ -107,11 +168,12 @@ def main():
|
||||
|
||||
if args.use_docintel:
|
||||
if args.endpoint is None:
|
||||
raise ValueError(
|
||||
_exit_with_error(
|
||||
"Document Intelligence Endpoint is required when using Document Intelligence."
|
||||
)
|
||||
elif args.filename is None:
|
||||
raise ValueError("Filename is required when using Document Intelligence.")
|
||||
_exit_with_error("Filename is required when using Document Intelligence.")
|
||||
|
||||
markitdown = MarkItDown(
|
||||
enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
|
||||
)
|
||||
@@ -119,9 +181,9 @@ def main():
|
||||
markitdown = MarkItDown(enable_plugins=args.use_plugins)
|
||||
|
||||
if args.filename is None:
|
||||
result = markitdown.convert_stream(sys.stdin.buffer)
|
||||
result = markitdown.convert_stream(sys.stdin.buffer, stream_info=stream_info)
|
||||
else:
|
||||
result = markitdown.convert(args.filename)
|
||||
result = markitdown.convert(args.filename, stream_info=stream_info)
|
||||
|
||||
_handle_output(args, result)
|
||||
|
||||
@@ -135,5 +197,10 @@ def _handle_output(args, result: DocumentConverterResult):
|
||||
print(result.text_content)
|
||||
|
||||
|
||||
def _exit_with_error(message: str):
|
||||
print(message)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
108
packages/markitdown/src/markitdown/_base_converter.py
Normal file
@@ -0,0 +1,108 @@
|
||||
import os
|
||||
import tempfile
|
||||
from warnings import warn
|
||||
from typing import Any, Union, BinaryIO, Optional, List
|
||||
from ._stream_info import StreamInfo
|
||||
|
||||
|
||||
class DocumentConverterResult:
|
||||
"""The result of converting a document to Markdown."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
markdown: str,
|
||||
*,
|
||||
title: Optional[str] = None,
|
||||
):
|
||||
"""
|
||||
Initialize the DocumentConverterResult.
|
||||
|
||||
The only required parameter is the converted Markdown text.
|
||||
The title, and any other metadata that may be added in the future, are optional.
|
||||
|
||||
Parameters:
|
||||
- markdown: The converted Markdown text.
|
||||
- title: Optional title of the document.
|
||||
"""
|
||||
self.markdown = markdown
|
||||
self.title = title
|
||||
|
||||
@property
|
||||
def text_content(self) -> str:
|
||||
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
|
||||
return self.markdown
|
||||
|
||||
@text_content.setter
|
||||
def text_content(self, markdown: str):
|
||||
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
|
||||
self.markdown = markdown
|
||||
|
||||
def __str__(self) -> str:
|
||||
"""Return the converted Markdown text."""
|
||||
return self.markdown
|
||||
|
||||
|
||||
class DocumentConverter:
|
||||
"""Abstract superclass of all DocumentConverters."""
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
"""
|
||||
Return a quick determination of whether the converter should attempt to convert the document.
|
||||
This is primarily based on `stream_info` (typically, `stream_info.mimetype`, `stream_info.extension`).
|
||||
In cases where the data is retrieved via HTTP, the `stream_info.url` might also be referenced to
|
||||
make a determination (e.g., special converters for Wikipedia, YouTube etc).
|
||||
Finally, it is conceivable that the `stream_info.filename` might be used in cases
|
||||
where the filename is well-known (e.g., `Dockerfile`, `Makefile`, etc)
|
||||
|
||||
NOTE: The method signature is designed to match that of the convert() method. This provides some
|
||||
assurance that, if accepts() returns True, the convert() method will also be able to handle the document.
|
||||
|
||||
IMPORTANT: In rare cases, (e.g., OutlookMsgConverter) we need to read more from the stream to make a final
|
||||
determination. Read operations inevitably advance the position in file_stream. In these cases, the position
|
||||
MUST be reset before returning. This is because the convert() method may be called immediately
|
||||
after accepts(), and will expect the file_stream to be at the original position.
|
||||
|
||||
E.g.,
|
||||
cur_pos = file_stream.tell() # Save the current position
|
||||
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
|
||||
file_stream.seek(cur_pos) # Reset the position to the original position
|
||||
|
||||
Parameters:
|
||||
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
|
||||
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, etc.)
|
||||
- kwargs: Additional keyword arguments for the converter.
|
||||
|
||||
Returns:
|
||||
- bool: True if the converter can handle the document, False otherwise.
|
||||
"""
|
||||
raise NotImplementedError(
|
||||
f"The subclass, {type(self).__name__}, must implement the accepts() method to determine if they can handle the document."
|
||||
)
|
||||
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
"""
|
||||
Convert a document to Markdown text.
|
||||
|
||||
Parameters:
|
||||
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
|
||||
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, etc.)
|
||||
- kwargs: Additional keyword arguments for the converter.
|
||||
|
||||
Returns:
|
||||
- DocumentConverterResult: The result of the conversion, which includes the title and markdown content.
|
||||
|
||||
Raises:
|
||||
- FileConversionException: If the mimetype is recognized, but the conversion fails for some other reason.
|
||||
- MissingDependencyException: If the converter requires a dependency that is not installed.
|
||||
"""
|
||||
raise NotImplementedError("Subclasses must implement this method")
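
The `accepts()` docstring above requires that any peek into the stream restore the original position before returning. A minimal sketch of that pattern follows, using only the standard library. The `looks_like_rtf` helper and the `RTF_MAGIC` constant are hypothetical names for illustration; the check relies on RTF documents beginning with the bytes `{\rtf`.

```python
import io
from typing import BinaryIO

RTF_MAGIC = b"{\\rtf"  # the five bytes that open an RTF document


def looks_like_rtf(file_stream: BinaryIO) -> bool:
    """Peek at the first bytes, then put the stream back where it was."""
    cur_pos = file_stream.tell()  # save the current position
    try:
        header = file_stream.read(len(RTF_MAGIC))  # peek at the leading bytes
    finally:
        file_stream.seek(cur_pos)  # restore the position even if read() fails
    return header == RTF_MAGIC


if __name__ == "__main__":
    stream = io.BytesIO(b"{\\rtf1\\ansi Hello}")
    print(looks_like_rtf(stream), stream.tell())
```

Wrapping the read in `try`/`finally` is a design choice of this sketch: it keeps the position guarantee even when the read raises, so a later `convert()` call still starts from the expected offset.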
|
||||
@@ -1,4 +1,14 @@
|
||||
class MarkItDownException(BaseException):
|
||||
from typing import Optional, List, Any
|
||||
|
||||
MISSING_DEPENDENCY_MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency [{feature}] or [all] when installing MarkItDown. For example:
|
||||
|
||||
* pip install markitdown[{feature}]
|
||||
* pip install markitdown[all]
|
||||
* pip install markitdown[{feature}, ...]
|
||||
* etc."""
|
||||
|
||||
|
||||
class MarkItDownException(Exception):
|
||||
"""
|
||||
Base exception class for MarkItDown.
|
||||
"""
|
||||
@@ -6,24 +16,16 @@ class MarkItDownException(BaseException):
|
||||
pass
|
||||
|
||||
|
||||
class ConverterPrerequisiteException(MarkItDownException):
|
||||
class MissingDependencyException(MarkItDownException):
|
||||
"""
|
||||
Thrown when instantiating a DocumentConverter in cases where
|
||||
a required library or dependency is not installed, an API key
|
||||
is not set, or some other prerequisite is not met.
|
||||
Converters shipped with MarkItDown may depend on optional
|
||||
dependencies. This exception is thrown when a converter's
|
||||
convert() method is called, but the required dependency is not
|
||||
installed. This is not necessarily a fatal error, as the converter
|
||||
will simply be skipped (an error will bubble up only if no other
|
||||
suitable converter is found).
|
||||
|
||||
This is not necessarily a fatal error. If thrown during
|
||||
MarkItDown's plugin loading phase, the converter will simply be
|
||||
skipped, and a warning will be issued.
|
||||
"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class FileConversionException(MarkItDownException):
|
||||
"""
|
||||
Thrown when a suitable converter was found, but the conversion
|
||||
process fails for any reason.
|
||||
Error messages should clearly indicate which dependency is missing.
|
||||
"""
|
||||
|
||||
pass
|
||||
@@ -35,3 +37,40 @@ class UnsupportedFormatException(MarkItDownException):
|
||||
"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class FailedConversionAttempt(object):
|
||||
"""
|
||||
Represents a single attempt to convert a file.
|
||||
"""
|
||||
|
||||
def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
|
||||
self.converter = converter
|
||||
self.exc_info = exc_info
|
||||
|
||||
|
||||
class FileConversionException(MarkItDownException):
|
||||
"""
|
||||
Thrown when a suitable converter was found, but the conversion
|
||||
process fails for any reason.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: Optional[str] = None,
|
||||
attempts: Optional[List[FailedConversionAttempt]] = None,
|
||||
):
|
||||
self.attempts = attempts
|
||||
|
||||
if message is None:
|
||||
if attempts is None:
|
||||
message = "File conversion failed."
|
||||
else:
|
||||
message = f"File conversion failed after {len(attempts)} attempts:\n"
|
||||
for attempt in attempts:
|
||||
if attempt.exc_info is None:
|
||||
message += f" - {type(attempt.converter).__name__} provided no execution info.\n"
|
||||
else:
|
||||
message += f" - {type(attempt.converter).__name__} threw {attempt.exc_info[0].__name__} with message: {attempt.exc_info[1]}\n"
|
||||
|
||||
super().__init__(message)
|
||||
|
||||
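The way `FileConversionException` folds a list of failed attempts into one message can be sketched as follows. This is a trimmed, self-contained re-creation of the two classes in the hunk above, not the packaged module. `DemoConverter` and `collect_failure` are hypothetical names introduced only for illustration, and the sketch ends every entry with a newline so entries stay on separate lines.

```python
import sys
from typing import Any, List, Optional


class FailedConversionAttempt:
    """One failed attempt: the converter plus the sys.exc_info() tuple of the failure."""

    def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
        self.converter = converter
        self.exc_info = exc_info


class FileConversionException(Exception):
    """Trimmed copy of the exception in the hunk above, for illustration only."""

    def __init__(
        self,
        message: Optional[str] = None,
        attempts: Optional[List[FailedConversionAttempt]] = None,
    ):
        self.attempts = attempts
        if message is None:
            if attempts is None:
                message = "File conversion failed."
            else:
                message = f"File conversion failed after {len(attempts)} attempts:\n"
                for attempt in attempts:
                    name = type(attempt.converter).__name__
                    if attempt.exc_info is None:
                        message += f" - {name} provided no execution info.\n"
                    else:
                        message += (
                            f" - {name} threw {attempt.exc_info[0].__name__} "
                            f"with message: {attempt.exc_info[1]}\n"
                        )
        super().__init__(message)


class DemoConverter:  # hypothetical converter, not part of MarkItDown
    pass


def collect_failure() -> FailedConversionAttempt:
    """Raise and capture an error the same way _convert records failures."""
    try:
        raise ValueError("bad header")
    except ValueError:
        return FailedConversionAttempt(DemoConverter(), sys.exc_info())


error = FileConversionException(attempts=[collect_failure()])
print(error)
```

Keeping the raw `exc_info` tuples on `attempts` lets a caller inspect each failure programmatically instead of parsing the message text.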
@@ -2,23 +2,26 @@ import copy
|
||||
import mimetypes
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import shutil
|
||||
import tempfile
|
||||
import warnings
|
||||
import traceback
|
||||
import io
|
||||
from dataclasses import dataclass
|
||||
from importlib.metadata import entry_points
|
||||
from typing import Any, List, Optional, Union
|
||||
from typing import Any, List, Optional, Union, BinaryIO
|
||||
from pathlib import Path
|
||||
from urllib.parse import urlparse
|
||||
from warnings import warn
|
||||
from io import BufferedIOBase, TextIOBase, BytesIO
|
||||
|
||||
# File-format detection
|
||||
import puremagic
|
||||
import requests
|
||||
import magika
|
||||
import charset_normalizer
|
||||
import codecs
|
||||
|
||||
from ._stream_info import StreamInfo
|
||||
|
||||
from .converters import (
|
||||
DocumentConverter,
|
||||
DocumentConverterResult,
|
||||
PlainTextConverter,
|
||||
HtmlConverter,
|
||||
RssConverter,
|
||||
@@ -32,27 +35,35 @@ from .converters import (
|
||||
XlsConverter,
|
||||
PptxConverter,
|
||||
ImageConverter,
|
||||
WavConverter,
|
||||
Mp3Converter,
|
||||
AudioConverter,
|
||||
OutlookMsgConverter,
|
||||
ZipConverter,
|
||||
EpubConverter,
|
||||
DocumentIntelligenceConverter,
|
||||
ConverterInput,
|
||||
)
|
||||
|
||||
from ._base_converter import DocumentConverter, DocumentConverterResult
|
||||
|
||||
from ._exceptions import (
|
||||
FileConversionException,
|
||||
UnsupportedFormatException,
|
||||
ConverterPrerequisiteException,
|
||||
FailedConversionAttempt,
|
||||
)
|
||||
|
||||
# Override mimetype for csv to fix issue on windows
|
||||
mimetypes.add_type("text/csv", ".csv")
|
||||
|
||||
_plugins: Union[None | List[Any]] = None
|
||||
# Lower priority values are tried first.
|
||||
PRIORITY_SPECIFIC_FILE_FORMAT = (
|
||||
0.0 # e.g., .docx, .pdf, .xlsx, Or specific pages, e.g., wikipedia
|
||||
)
|
||||
PRIORITY_GENERIC_FILE_FORMAT = (
|
||||
10.0 # Near catch-all converters for mimetypes like text/*, etc.
|
||||
)
|
||||
|
||||
|
||||
def _load_plugins() -> Union[None | List[Any]]:
|
||||
_plugins: Union[None, List[Any]] = None # If None, plugins have not been loaded yet.
|
||||
|
||||
|
||||
def _load_plugins() -> Union[None, List[Any]]:
|
||||
"""Lazy load plugins, exiting early if already loaded."""
|
||||
global _plugins
|
||||
|
||||
@@ -72,6 +83,14 @@ def _load_plugins() -> Union[None | List[Any]]:
|
||||
return _plugins
|
||||
|
||||
|
||||
@dataclass(kw_only=True, frozen=True)
|
||||
class ConverterRegistration:
|
||||
"""A registration of a converter with its priority and other metadata."""
|
||||
|
||||
converter: DocumentConverter
|
||||
priority: float
|
||||
|
||||
|
||||
class MarkItDown:
|
||||
"""(In preview) An extremely simple text-based document reader, suitable for LLM use.
|
||||
This reader will convert common file-types or webpages to Markdown."""
|
||||
@@ -92,14 +111,16 @@ class MarkItDown:
|
||||
else:
|
||||
self._requests_session = requests_session
|
||||
|
||||
self._magika = magika.Magika()
|
||||
|
||||
# TODO - remove these (see enable_builtins)
|
||||
self._llm_client = None
|
||||
self._llm_model = None
|
||||
self._exiftool_path = None
|
||||
self._style_map = None
|
||||
self._llm_client: Any = None
|
||||
self._llm_model: Union[str, None] = None
|
||||
self._exiftool_path: Union[str, None] = None
|
||||
self._style_map: Union[str, None] = None
|
||||
|
||||
# Register the converters
|
||||
self._page_converters: List[DocumentConverter] = []
|
||||
self._converters: List[ConverterRegistration] = []
|
||||
|
||||
if (
|
||||
enable_builtins is None or enable_builtins
|
||||
@@ -121,15 +142,43 @@ class MarkItDown:
|
||||
self._llm_model = kwargs.get("llm_model")
|
||||
self._exiftool_path = kwargs.get("exiftool_path")
|
||||
self._style_map = kwargs.get("style_map")
|
||||
|
||||
if self._exiftool_path is None:
|
||||
self._exiftool_path = os.getenv("EXIFTOOL_PATH")
|
||||
|
||||
# Still none? Check well-known paths
|
||||
if self._exiftool_path is None:
|
||||
candidate = shutil.which("exiftool")
|
||||
if candidate:
|
||||
candidate = os.path.abspath(candidate)
|
||||
if any(
|
||||
d == os.path.dirname(candidate)
|
||||
for d in [
|
||||
"/usr/bin",
|
||||
"/usr/local/bin",
|
||||
"/opt",
|
||||
"/opt/bin",
|
||||
"/opt/local/bin",
|
||||
"/opt/homebrew/bin",
|
||||
"C:\\Windows\\System32",
|
||||
"C:\\Program Files",
|
||||
"C:\\Program Files (x86)",
|
||||
]
|
||||
):
|
||||
self._exiftool_path = candidate
|
||||
|
||||
# Register converters for successful browsing operations
|
||||
# Later registrations are tried first / take higher priority than earlier registrations
|
||||
# To this end, the most specific converters should appear below the most generic converters
|
||||
self.register_converter(PlainTextConverter())
|
||||
self.register_converter(ZipConverter())
|
||||
self.register_converter(HtmlConverter())
|
||||
self.register_converter(
|
||||
PlainTextConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
|
||||
)
|
||||
self.register_converter(
|
||||
ZipConverter(markitdown=self), priority=PRIORITY_GENERIC_FILE_FORMAT
|
||||
)
|
||||
self.register_converter(
|
||||
HtmlConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
|
||||
)
|
||||
self.register_converter(RssConverter())
|
||||
self.register_converter(WikipediaConverter())
|
||||
self.register_converter(YouTubeConverter())
|
||||
@@ -138,12 +187,12 @@ class MarkItDown:
|
||||
self.register_converter(XlsxConverter())
|
||||
self.register_converter(XlsConverter())
|
||||
self.register_converter(PptxConverter())
|
||||
self.register_converter(WavConverter())
|
||||
self.register_converter(Mp3Converter())
|
||||
self.register_converter(AudioConverter())
|
||||
self.register_converter(ImageConverter())
|
||||
self.register_converter(IpynbConverter())
|
||||
self.register_converter(PdfConverter())
|
||||
self.register_converter(OutlookMsgConverter())
|
||||
self.register_converter(EpubConverter())
|
||||
|
||||
# Register Document Intelligence converter at the top of the stack if endpoint is provided
|
||||
docintel_endpoint = kwargs.get("docintel_endpoint")
|
||||
@@ -164,7 +213,9 @@ class MarkItDown:
|
||||
"""
|
||||
if not self._plugins_enabled:
|
||||
# Load plugins
|
||||
for plugin in _load_plugins():
|
||||
plugins = _load_plugins()
|
||||
assert plugins is not None
|
||||
for plugin in plugins:
|
||||
try:
|
||||
plugin.register_converters(self, **kwargs)
|
||||
except Exception:
|
||||
@@ -176,14 +227,18 @@ class MarkItDown:
|
||||
|
||||
def convert(
|
||||
self,
|
||||
source: Union[str, requests.Response, Path, BufferedIOBase, TextIOBase],
|
||||
source: Union[str, requests.Response, Path, BinaryIO],
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult: # TODO: deal with kwargs
|
||||
"""
|
||||
Args:
|
||||
- source: can be a string representing a path either as string pathlib path object or url, a requests.response object, or a file object (TextIO or BinaryIO)
|
||||
- extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
|
||||
- source: can be a path (str or Path), a URL, a requests.Response object, or a binary file-like stream
|
||||
- stream_info: optional stream info to use for the conversion. If None, infer from source
|
||||
- kwargs: additional arguments to pass to the converter
|
||||
"""
|
||||
|
||||
# Local path or url
|
||||
if isinstance(source, str):
|
||||
if (
|
||||
@@ -191,177 +246,237 @@ class MarkItDown:
|
||||
or source.startswith("https://")
|
||||
or source.startswith("file://")
|
||||
):
|
||||
return self.convert_url(source, **kwargs)
|
||||
# Rename the url argument to mock_url
|
||||
# (Deprecated -- use stream_info)
|
||||
_kwargs = {k: v for k, v in kwargs.items()}
|
||||
if "url" in _kwargs:
|
||||
_kwargs["mock_url"] = _kwargs["url"]
|
||||
del _kwargs["url"]
|
||||
|
||||
return self.convert_url(source, stream_info=stream_info, **_kwargs)
|
||||
else:
|
||||
return self.convert_local(source, **kwargs)
|
||||
return self.convert_local(source, stream_info=stream_info, **kwargs)
|
||||
# Path object
|
||||
elif isinstance(source, Path):
|
||||
return self.convert_local(source, stream_info=stream_info, **kwargs)
|
||||
# Request response
|
||||
elif isinstance(source, requests.Response):
|
||||
return self.convert_response(source, **kwargs)
|
||||
elif isinstance(source, Path):
|
||||
return self.convert_local(source, **kwargs)
|
||||
# File object
|
||||
elif isinstance(source, BufferedIOBase) or isinstance(source, TextIOBase):
|
||||
return self.convert_file_object(source, **kwargs)
|
||||
return self.convert_response(source, stream_info=stream_info, **kwargs)
|
||||
# Binary stream
|
||||
elif (
|
||||
hasattr(source, "read")
|
||||
and callable(source.read)
|
||||
and not isinstance(source, io.TextIOBase)
|
||||
):
|
||||
return self.convert_stream(source, stream_info=stream_info, **kwargs)
|
||||
else:
|
||||
raise TypeError(
|
||||
f"Invalid source type: {type(source)}. Expected str, Path, requests.Response, or BinaryIO."
|
||||
)
|
||||
|
||||
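The source dispatch in `convert()` can be illustrated with a small stand-alone sketch. It assumes only the branches visible in this hunk: the `requests.Response` branch is left out to keep the example standard-library only, and only the URL prefixes shown in the hunk are checked. `classify_source` and its return labels are hypothetical names, not part of MarkItDown.

```python
import io
from pathlib import Path


def classify_source(source) -> str:
    """Return a label for the dispatch branch a source would take (labels are illustrative)."""
    if isinstance(source, str):
        # Only the prefixes visible in the hunk are checked here.
        if source.startswith(("https://", "file://")):
            return "url"
        return "local"
    if isinstance(source, Path):
        return "local"
    # Binary stream: anything with a callable read() that is not a text stream.
    if (
        hasattr(source, "read")
        and callable(source.read)
        and not isinstance(source, io.TextIOBase)
    ):
        return "stream"
    raise TypeError(f"Invalid source type: {type(source)}")


print(classify_source("https://example.com/report.pdf"))
print(classify_source(Path("report.docx")))
print(classify_source(io.BytesIO(b"raw bytes")))
```

The duck-typed `read` check accepts any binary file-like object, while the explicit `io.TextIOBase` exclusion rejects text streams such as `io.StringIO`.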
def convert_local(
|
||||
self, path: Union[str, Path], **kwargs: Any
|
||||
) -> DocumentConverterResult: # TODO: deal with kwargs
|
||||
self,
|
||||
path: Union[str, Path],
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None, # Deprecated -- use stream_info
|
||||
url: Optional[str] = None, # Deprecated -- use stream_info
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
if isinstance(path, Path):
|
||||
path = str(path)
|
||||
# Prepare a list of extensions to try (in order of priority)
|
||||
ext = kwargs.get("file_extension")
|
||||
extensions = [ext] if ext is not None else []
|
||||
|
||||
# Get extension alternatives from the path and puremagic
|
||||
base, ext = os.path.splitext(path)
|
||||
self._append_ext(extensions, ext)
|
||||
# Build a base StreamInfo object from which to start guesses
|
||||
base_guess = StreamInfo(
|
||||
local_path=path,
|
||||
extension=os.path.splitext(path)[1],
|
||||
filename=os.path.basename(path),
|
||||
)
|
||||
|
||||
for g in self._guess_ext_magic(source=path):
|
||||
self._append_ext(extensions, g)
|
||||
# Extend the base_guess with any additional info from the arguments
|
||||
if stream_info is not None:
|
||||
base_guess = base_guess.copy_and_update(stream_info)
|
||||
|
||||
# Create the ConverterInput object
|
||||
input = ConverterInput(input_type="filepath", filepath=path)
|
||||
if file_extension is not None:
|
||||
# Deprecated -- use stream_info
|
||||
base_guess = base_guess.copy_and_update(extension=file_extension)
|
||||
|
||||
# Convert
|
||||
return self._convert(input, extensions, **kwargs)
|
||||
if url is not None:
|
||||
# Deprecated -- use stream_info
|
||||
base_guess = base_guess.copy_and_update(url=url)
|
||||
|
||||
def convert_file_object(
|
||||
self, file_object: Union[BufferedIOBase, TextIOBase], **kwargs: Any
|
||||
) -> DocumentConverterResult: # TODO: deal with kwargs
|
||||
# Prepare a list of extensions to try (in order of priority)
|
||||
ext = kwargs.get("file_extension")
|
||||
extensions = [ext] if ext is not None else []
|
||||
with open(path, "rb") as fh:
|
||||
guesses = self._get_stream_info_guesses(
|
||||
file_stream=fh, base_guess=base_guess
|
||||
)
|
||||
return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
|
||||
|
||||
# TODO: Currently, there are some ongoing issues with passing direct file objects to puremagic (incorrect guesses, unsupported file type errors, etc.)
|
||||
# Only use puremagic as a last resort if no extensions were provided
|
||||
if extensions == []:
|
||||
for g in self._guess_ext_magic(source=file_object):
|
||||
self._append_ext(extensions, g)
|
||||
|
||||
# Create the ConverterInput object
|
||||
input = ConverterInput(input_type="object", file_object=file_object)
|
||||
|
||||
# Convert
|
||||
return self._convert(input, extensions, **kwargs)
|
||||
|
||||
# TODO what should stream's type be?
|
||||
def convert_stream(
|
||||
self, stream: Any, **kwargs: Any
|
||||
) -> DocumentConverterResult: # TODO: deal with kwargs
|
||||
# Prepare a list of extensions to try (in order of priority)
|
||||
ext = kwargs.get("file_extension")
|
||||
extensions = [ext] if ext is not None else []
|
||||
self,
|
||||
stream: BinaryIO,
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None, # Deprecated -- use stream_info
|
||||
url: Optional[str] = None, # Deprecated -- use stream_info
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
guesses: List[StreamInfo] = []
|
||||
|
||||
# Save the file locally to a temporary file. It will be deleted before this method exits
|
||||
handle, temp_path = tempfile.mkstemp()
|
||||
fh = os.fdopen(handle, "wb")
|
||||
result = None
|
||||
try:
|
||||
# Write to the temporary file
|
||||
content = stream.read()
|
||||
if isinstance(content, str):
|
||||
fh.write(content.encode("utf-8"))
|
||||
# Do we have anything on which to base a guess?
|
||||
base_guess = None
|
||||
if stream_info is not None or file_extension is not None or url is not None:
|
||||
# Start with a non-None base guess
|
||||
if stream_info is None:
|
||||
base_guess = StreamInfo()
|
||||
else:
|
||||
fh.write(content)
|
||||
fh.close()
|
||||
base_guess = stream_info
|
||||
|
||||
# Use puremagic to check for more extension options
|
||||
for g in self._guess_ext_magic(source=temp_path):
|
||||
self._append_ext(extensions, g)
|
||||
if file_extension is not None:
|
||||
# Deprecated -- use stream_info
|
||||
assert base_guess is not None # for mypy
|
||||
base_guess = base_guess.copy_and_update(extension=file_extension)
|
||||
|
||||
# Create the ConverterInput object
|
||||
input = ConverterInput(input_type="filepath", filepath=temp_path)
|
||||
if url is not None:
|
||||
# Deprecated -- use stream_info
|
||||
assert base_guess is not None # for mypy
|
||||
base_guess = base_guess.copy_and_update(url=url)
|
||||
|
||||
# Convert
|
||||
result = self._convert(input, extensions, **kwargs)
|
||||
# Clean up
|
||||
finally:
|
||||
try:
|
||||
fh.close()
|
||||
except Exception:
|
||||
pass
|
||||
os.unlink(temp_path)
|
||||
# Check if we have a seekable stream. If not, load the entire stream into memory.
|
||||
if not stream.seekable():
|
||||
buffer = io.BytesIO()
|
||||
while True:
|
||||
chunk = stream.read(4096)
|
||||
if not chunk:
|
||||
break
|
||||
buffer.write(chunk)
|
||||
buffer.seek(0)
|
||||
stream = buffer
|
||||
|
||||
return result
|
||||
# Add guesses based on stream content
|
||||
guesses = self._get_stream_info_guesses(
|
||||
file_stream=stream, base_guess=base_guess or StreamInfo()
|
||||
)
|
||||
return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
|
||||
|
||||
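The non-seekable fallback in `convert_stream` can be sketched in isolation. This is a minimal example under stated assumptions: `OneWayStream` is a hypothetical stand-in for a pipe or socket, and `ensure_seekable` is an illustrative helper name that does not exist in MarkItDown.

```python
import io
from typing import BinaryIO


class OneWayStream(io.RawIOBase):
    """Hypothetical non-seekable binary stream, standing in for a pipe or socket."""

    def __init__(self, payload: bytes):
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        # RawIOBase.read() is implemented on top of readinto().
        return self._inner.readinto(b)


def ensure_seekable(stream: BinaryIO, chunk_size: int = 4096) -> BinaryIO:
    """Return the stream unchanged if seekable, otherwise copy it into a BytesIO."""
    if stream.seekable():
        return stream
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)
    return buffer


seekable = ensure_seekable(OneWayStream(b"x" * 10_000))
print(seekable.seekable(), len(seekable.read()))
```

Buffering trades memory for the ability to rewind, which the converter loop needs because every attempt must start from the same stream position.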
def convert_url(
|
||||
self, url: str, **kwargs: Any
|
||||
self,
|
||||
url: str,
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None, # Deprecated -- use stream_info
|
||||
mock_url: Optional[
|
||||
str
|
||||
] = None, # Mock the request as if it came from a different URL
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult: # TODO: fix kwargs type
|
||||
# Send an HTTP request to the URL
|
||||
response = self._requests_session.get(url, stream=True)
|
||||
response.raise_for_status()
|
||||
return self.convert_response(response, **kwargs)
|
||||
return self.convert_response(
|
||||
response,
|
||||
stream_info=stream_info,
|
||||
file_extension=file_extension,
|
||||
url=mock_url,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def convert_response(
|
||||
self, response: requests.Response, **kwargs: Any
|
||||
) -> DocumentConverterResult: # TODO fix kwargs type
|
||||
# Prepare a list of extensions to try (in order of priority)
|
||||
ext = kwargs.get("file_extension")
|
||||
extensions = [ext] if ext is not None else []
|
||||
self,
|
||||
response: requests.Response,
|
||||
*,
|
||||
stream_info: Optional[StreamInfo] = None,
|
||||
file_extension: Optional[str] = None, # Deprecated -- use stream_info
|
||||
url: Optional[str] = None, # Deprecated -- use stream_info
|
||||
**kwargs: Any,
|
||||
) -> DocumentConverterResult:
|
||||
# If there is a content-type header, get the mimetype and charset (if present)
|
||||
mimetype: Optional[str] = None
|
||||
charset: Optional[str] = None
|
||||
|
||||
# Guess from the mimetype
|
||||
content_type = response.headers.get("content-type", "").split(";")[0]
|
||||
self._append_ext(extensions, mimetypes.guess_extension(content_type))
|
||||
if "content-type" in response.headers:
|
||||
parts = response.headers["content-type"].split(";")
|
||||
mimetype = parts.pop(0).strip()
|
||||
for part in parts:
|
||||
if part.strip().startswith("charset="):
|
||||
_charset = part.split("=")[1].strip()
|
||||
if len(_charset) > 0:
|
||||
charset = _charset
|
||||
|
||||
# Read the content disposition if there is one
|
||||
content_disposition = response.headers.get("content-disposition", "")
|
||||
m = re.search(r"filename=([^;]+)", content_disposition)
|
||||
# If there is a content-disposition header, get the filename and possibly the extension
|
||||
filename: Optional[str] = None
|
||||
extension: Optional[str] = None
|
||||
if "content-disposition" in response.headers:
|
||||
m = re.search(r"filename=([^;]+)", response.headers["content-disposition"])
|
||||
if m:
|
||||
base, ext = os.path.splitext(m.group(1).strip("\"'"))
|
||||
self._append_ext(extensions, ext)
|
||||
filename = m.group(1).strip("\"'")
|
||||
_, _extension = os.path.splitext(filename)
|
||||
if len(_extension) > 0:
|
||||
extension = _extension
|
||||
|
||||
# Read from the extension from the path
|
||||
base, ext = os.path.splitext(urlparse(response.url).path)
|
||||
self._append_ext(extensions, ext)
|
||||
# If there is still no filename, try to read it from the url
|
||||
if filename is None:
|
||||
parsed_url = urlparse(response.url)
|
||||
_, _extension = os.path.splitext(parsed_url.path)
|
||||
if len(_extension) > 0: # Looks like this might be a file!
|
||||
filename = os.path.basename(parsed_url.path)
|
||||
extension = _extension
|
||||
|
||||
# Save the file locally to a temporary file. It will be deleted before this method exits
|
||||
handle, temp_path = tempfile.mkstemp()
|
||||
fh = os.fdopen(handle, "wb")
|
||||
result = None
|
||||
try:
|
||||
# Download the file
|
||||
# Create an initial guess from all this information
|
||||
base_guess = StreamInfo(
|
||||
mimetype=mimetype,
|
||||
charset=charset,
|
||||
filename=filename,
|
||||
extension=extension,
|
||||
url=response.url,
|
||||
)
|
||||
|
||||
# Update with any additional info from the arguments
|
||||
if stream_info is not None:
|
||||
base_guess = base_guess.copy_and_update(stream_info)
|
||||
if file_extension is not None:
|
||||
# Deprecated -- use stream_info
|
||||
base_guess = base_guess.copy_and_update(extension=file_extension)
|
||||
if url is not None:
|
||||
# Deprecated -- use stream_info
|
||||
base_guess = base_guess.copy_and_update(url=url)
|
||||
|
||||
# Read into BytesIO
|
||||
buffer = io.BytesIO()
|
||||
for chunk in response.iter_content(chunk_size=512):
|
||||
fh.write(chunk)
|
||||
fh.close()
|
||||
|
||||
# Use puremagic to check for more extension options
|
||||
for g in self._guess_ext_magic(source=temp_path):
|
||||
self._append_ext(extensions, g)
|
||||
|
||||
# Create the ConverterInput object
|
||||
input = ConverterInput(input_type="filepath", filepath=temp_path)
|
||||
buffer.write(chunk)
|
||||
buffer.seek(0)
|
||||
|
||||
# Convert
|
||||
result = self._convert(input, extensions, url=response.url, **kwargs)
|
||||
# Clean up
|
||||
finally:
|
||||
try:
|
||||
fh.close()
|
||||
except Exception:
|
||||
pass
|
||||
os.unlink(temp_path)
|
||||
|
||||
return result
|
||||
guesses = self._get_stream_info_guesses(
|
||||
file_stream=buffer, base_guess=base_guess
|
||||
)
|
||||
return self._convert(file_stream=buffer, stream_info_guesses=guesses, **kwargs)
|
||||
|
||||
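The header parsing in `convert_response` can be tried without making a request. The sketch below copies the parsing logic from the hunk into two hypothetical helper functions and feeds them plain dictionaries. It assumes lower-case header keys, whereas `requests` exposes a case-insensitive mapping.

```python
import os
import re
from typing import Mapping, Optional, Tuple


def parse_content_type(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (mimetype, charset) from a content-type header, if present."""
    mimetype: Optional[str] = None
    charset: Optional[str] = None
    if "content-type" in headers:
        parts = headers["content-type"].split(";")
        mimetype = parts.pop(0).strip()
        for part in parts:
            if part.strip().startswith("charset="):
                _charset = part.split("=")[1].strip()
                if len(_charset) > 0:
                    charset = _charset
    return mimetype, charset


def parse_content_disposition(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (filename, extension) from a content-disposition header, if present."""
    filename: Optional[str] = None
    extension: Optional[str] = None
    if "content-disposition" in headers:
        m = re.search(r"filename=([^;]+)", headers["content-disposition"])
        if m:
            filename = m.group(1).strip("\"'")
            _, _extension = os.path.splitext(filename)
            if len(_extension) > 0:
                extension = _extension
    return filename, extension


headers = {
    "content-type": "text/html; charset=utf-8",
    "content-disposition": 'attachment; filename="report.pdf"',
}
print(parse_content_type(headers))
print(parse_content_disposition(headers))
```

Each field stays `None` when its header is absent, so later guesses can fill the gaps instead of being overridden by empty strings.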
def _convert(
|
||||
self, input: ConverterInput, extensions: List[Union[str, None]], **kwargs
|
||||
self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
|
||||
) -> DocumentConverterResult:
|
||||
error_trace = ""
|
||||
res: Union[None, DocumentConverterResult] = None
|
||||
|
||||
# Keep track of which converters throw exceptions
|
||||
failed_attempts: List[FailedConversionAttempt] = []
|
||||
|
||||
# Create a copy of the page_converters list, sorted by priority.
|
||||
# We do this with each call to _convert because the priority of converters may change between calls.
|
||||
# The sort is guaranteed to be stable, so converters with the same priority will remain in the same order.
|
||||
sorted_converters = sorted(self._page_converters, key=lambda x: x.priority)
|
||||
sorted_registrations = sorted(self._converters, key=lambda x: x.priority)
|
||||
|
||||
for ext in extensions + [None]: # Try last with no extension
|
||||
for converter in sorted_converters:
|
||||
_kwargs = copy.deepcopy(kwargs)
|
||||
# Remember the initial stream position so that we can return to it
|
||||
cur_pos = file_stream.tell()
|
||||
|
||||
# Overwrite file_extension appropriately
|
||||
if ext is None:
|
||||
if "file_extension" in _kwargs:
|
||||
del _kwargs["file_extension"]
|
||||
else:
|
||||
_kwargs.update({"file_extension": ext})
|
||||
for stream_info in stream_info_guesses + [StreamInfo()]:
|
||||
for converter_registration in sorted_registrations:
|
||||
converter = converter_registration.converter
|
||||
# Sanity check -- make sure the cur_pos is still the same
|
||||
assert (
|
||||
cur_pos == file_stream.tell()
|
||||
), "File stream position should NOT change between guess iterations"
|
||||
|
||||
_kwargs = {k: v for k, v in kwargs.items()}
|
||||
|
||||
# Copy any additional global options
|
||||
if "llm_client" not in _kwargs and self._llm_client is not None:
|
||||
@@ -377,13 +492,40 @@ class MarkItDown:
|
||||
_kwargs["exiftool_path"] = self._exiftool_path
|
||||
|
||||
# Add the list of converters for nested processing
|
||||
_kwargs["_parent_converters"] = self._page_converters
|
||||
_kwargs["_parent_converters"] = self._converters
|
||||
|
||||
# If we hit an error log it and keep trying
|
||||
# Add legacy kwargs
|
||||
if stream_info is not None:
|
||||
if stream_info.extension is not None:
|
||||
_kwargs["file_extension"] = stream_info.extension
|
||||
|
||||
if stream_info.url is not None:
|
||||
_kwargs["url"] = stream_info.url
|
||||
|
||||
# Check if the converter will accept the file, and if so, try to convert it
|
||||
_accepts = False
|
||||
try:
|
||||
res = converter.convert(input, **_kwargs)
|
||||
_accepts = converter.accepts(file_stream, stream_info, **_kwargs)
|
||||
except NotImplementedError:
|
||||
pass
|
||||
|
||||
# accepts() should not have changed the file stream position
|
||||
assert (
|
||||
cur_pos == file_stream.tell()
|
||||
), f"{type(converter).__name__}.accepts() should NOT change the file_stream position"
|
||||
|
||||
# Attempt the conversion
|
||||
if _accepts:
|
||||
try:
|
||||
res = converter.convert(file_stream, stream_info, **_kwargs)
|
||||
except Exception:
|
||||
error_trace = ("\n\n" + traceback.format_exc()).strip()
|
||||
failed_attempts.append(
|
||||
FailedConversionAttempt(
|
||||
converter=converter, exc_info=sys.exc_info()
|
||||
)
|
||||
)
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
if res is not None:
|
||||
# Normalize the content
|
||||
@@ -391,81 +533,17 @@ class MarkItDown:
|
||||
[line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
|
||||
)
|
||||
res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)
|
||||
|
||||
# Todo
|
||||
return res
|
||||
|
||||
# If we got this far without success, report any exceptions
|
||||
if len(error_trace) > 0:
|
||||
raise FileConversionException(
|
||||
f"Could not convert '{input.filepath}' to Markdown. File type was recognized as {extensions}. While converting the file, the following error was encountered:\n\n{error_trace}"
|
||||
)
|
||||
if len(failed_attempts) > 0:
|
||||
raise FileConversionException(attempts=failed_attempts)
|
||||
|
||||
# Nothing can handle it!
|
||||
raise UnsupportedFormatException(
|
||||
f"Could not convert '{input.filepath}' to Markdown. The formats {extensions} are not supported."
|
||||
"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
|
||||
)
|
||||
|
||||
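The accept-then-convert loop in `_convert`, including the rewind after every attempt, can be sketched with toy converters. This simplified version omits the `stream_info` guesses and keyword arguments. `PdfLikeConverter`, `FallbackTextConverter`, and `try_converters` are hypothetical names used only for this illustration.

```python
import io
import sys
from typing import BinaryIO, List, Optional, Tuple


class PdfLikeConverter:  # hypothetical: accepts "%PDF" input, then fails
    def accepts(self, stream: BinaryIO) -> bool:
        pos = stream.tell()
        try:
            return stream.read(4) == b"%PDF"
        finally:
            stream.seek(pos)  # accepts() must leave the position untouched

    def convert(self, stream: BinaryIO) -> str:
        raise RuntimeError("simulated parser failure")


class FallbackTextConverter:  # hypothetical catch-all converter
    def accepts(self, stream: BinaryIO) -> bool:
        return True

    def convert(self, stream: BinaryIO) -> str:
        return stream.read().decode("utf-8", errors="replace")


def try_converters(stream: BinaryIO, converters: list) -> Tuple[Optional[str], List[tuple]]:
    """Try converters in order, recording failures and rewinding after each attempt."""
    failures: List[tuple] = []
    start = stream.tell()
    for converter in converters:
        if not converter.accepts(stream):
            continue
        try:
            return converter.convert(stream), failures
        except Exception:
            failures.append((type(converter).__name__, sys.exc_info()[1]))
        finally:
            stream.seek(start)  # every attempt starts from the same position
    return None, failures


stream = io.BytesIO(b"%PDF-1.7 hello")
text, failures = try_converters(stream, [PdfLikeConverter(), FallbackTextConverter()])
print(text, [name for name, _ in failures])
```

Because the rewind happens in `finally`, a converter that raises halfway through reading cannot leave the stream in a state that breaks the next attempt.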
def _append_ext(self, extensions, ext):
|
||||
"""Append a unique non-None, non-empty extension to a list of extensions."""
|
||||
if ext is None:
|
||||
return
|
||||
ext = ext.strip()
|
||||
if ext == "":
|
||||
return
|
||||
# if ext not in extensions:
|
||||
extensions.append(ext)
|
||||
|
||||
def _guess_ext_magic(self, source):
|
||||
"""Use puremagic (a Python implementation of libmagic) to guess a file's extension based on the first few bytes."""
|
||||
# Use puremagic to guess
|
||||
try:
|
||||
guesses = []
|
||||
|
||||
# Guess extensions for filepaths
|
||||
if isinstance(source, str):
|
||||
guesses = puremagic.magic_file(source)
|
||||
|
||||
# Fix for: https://github.com/microsoft/markitdown/issues/222
|
||||
# If there are no guesses, then try again after trimming leading ASCII whitespaces.
|
||||
# ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
|
||||
# (space, tab, newline, carriage return, vertical tab, form feed).
|
||||
if len(guesses) == 0:
|
||||
with open(source, "rb") as file:
|
||||
while True:
|
||||
char = file.read(1)
|
||||
if not char: # End of file
|
||||
break
|
||||
if not char.isspace():
|
||||
file.seek(file.tell() - 1)
|
||||
break
|
||||
try:
|
||||
guesses = puremagic.magic_stream(file)
|
||||
except puremagic.main.PureError:
|
||||
pass
|
||||
|
||||
# Guess extensions for file objects. Note that the puremagic's magic_stream function requires a BytesIO-like file source
|
||||
# TODO: Figure out how to guess extensions for TextIO-like file sources (manually converting to BytesIO does not work)
|
||||
elif isinstance(source, BufferedIOBase):
|
||||
guesses = puremagic.magic_stream(source)
|
||||
|
||||
extensions = list()
|
||||
for g in guesses:
|
||||
ext = g.extension.strip()
|
||||
if len(ext) > 0:
|
||||
if not ext.startswith("."):
|
||||
ext = "." + ext
|
||||
if ext not in extensions:
|
||||
extensions.append(ext)
|
||||
return extensions
|
||||
except FileNotFoundError:
|
||||
pass
|
||||
except IsADirectoryError:
|
||||
pass
|
||||
except PermissionError:
|
||||
pass
|
||||
return []
|
||||
|
||||
def register_page_converter(self, converter: DocumentConverter) -> None:
|
||||
"""DEPRECATED: Use register_converter instead."""
|
||||
warn(
|
||||
@@ -474,6 +552,146 @@ class MarkItDown:
|
||||
)
|
||||
self.register_converter(converter)
|
||||
|
||||
def register_converter(self, converter: DocumentConverter) -> None:
|
||||
"""Register a page text converter."""
|
||||
self._page_converters.insert(0, converter)
|
||||
def register_converter(
|
||||
self,
|
||||
converter: DocumentConverter,
|
||||
*,
|
||||
priority: float = PRIORITY_SPECIFIC_FILE_FORMAT,
|
||||
) -> None:
|
||||
"""
|
||||
Register a DocumentConverter with a given priority.
|
||||
|
||||
Priorities work as follows: By default, most converters get priority
|
||||
PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
|
||||
is the PlainTextConverter, HtmlConverter, and ZipConverter, which get
|
||||
priority PRIORITY_GENERIC_FILE_FORMAT (== 10), with lower values
|
||||
being tried first (i.e., higher priority).
|
||||
|
||||
Just prior to conversion, the converters are sorted by priority, using
|
||||
a stable sort. This means that converters with the same priority will
|
||||
remain in the same order, with the most recently registered converters
|
||||
appearing first.
|
||||
|
||||
We have tight control over the order of built-in converters, but
|
||||
plugins can register converters in any order. The registration's priority
|
||||
field reasserts some control over the order of converters.
|
||||
|
||||
Plugins can register converters with any priority, to appear before or
|
||||
after the built-ins. For example, a plugin with priority 9 will run
|
||||
before the PlainTextConverter, but after the more specific built-in converters.
|
||||
"""
|
||||
self._converters.insert(
|
||||
0, ConverterRegistration(converter=converter, priority=priority)
|
||||
)
|
||||
|
||||
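The interaction between `insert(0, ...)` and the stable sort described in the `register_converter` docstring can be checked with a small sketch. `Registration`, `registry`, and `register` are hypothetical stand-ins for `ConverterRegistration`, `self._converters`, and `register_converter`, with strings used in place of converter instances.

```python
from dataclasses import dataclass
from typing import List

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0
PRIORITY_GENERIC_FILE_FORMAT = 10.0


@dataclass(frozen=True)
class Registration:
    name: str  # stand-in for the converter instance
    priority: float


registry: List[Registration] = []


def register(name: str, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT) -> None:
    # Newest registrations go to the front, as in register_converter.
    registry.insert(0, Registration(name, priority))


register("PlainText", PRIORITY_GENERIC_FILE_FORMAT)
register("Html", PRIORITY_GENERIC_FILE_FORMAT)
register("Docx")
register("Pdf")
register("PluginAt9", 9.0)

# sorted() is stable: ties keep list order, so newer registrations come first.
order = [r.name for r in sorted(registry, key=lambda r: r.priority)]
print(order)  # ['Pdf', 'Docx', 'PluginAt9', 'Html', 'PlainText']
```

The plugin registered at priority 9 lands after both priority-0 converters and before the generic ones, regardless of when it was registered.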
def _get_stream_info_guesses(
|
||||
self, file_stream: BinaryIO, base_guess: StreamInfo
|
||||
) -> List[StreamInfo]:
|
||||
"""
|
||||
Given a base guess, attempt to guess or expand on the stream info using the stream content (via magika).
|
||||
"""
|
||||
guesses: List[StreamInfo] = []
|
||||
|
||||
# Enhance the base guess with information based on the extension or mimetype
|
||||
enhanced_guess = base_guess.copy_and_update()
|
||||
|
||||
# If there's an extension and no mimetype, try to guess the mimetype
|
||||
if base_guess.mimetype is None and base_guess.extension is not None:
|
||||
_m, _ = mimetypes.guess_type(
|
||||
"placeholder" + base_guess.extension, strict=False
|
||||
)
|
||||
if _m is not None:
|
||||
enhanced_guess = enhanced_guess.copy_and_update(mimetype=_m)
|
||||
|
||||
# If there's a mimetype and no extension, try to guess the extension
|
||||
if base_guess.mimetype is not None and base_guess.extension is None:
|
||||
_e = mimetypes.guess_all_extensions(base_guess.mimetype, strict=False)
|
||||
if len(_e) > 0:
|
||||
enhanced_guess = enhanced_guess.copy_and_update(extension=_e[0])
|
||||
|
||||
# Call magika to guess from the stream
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
result = self._magika.identify_stream(file_stream)
|
||||
if result.status == "ok" and result.prediction.output.label != "unknown":
|
||||
# If it's text, also guess the charset
|
||||
charset = None
|
||||
if result.prediction.output.is_text:
|
||||
# Read the first 4k to guess the charset
|
||||
file_stream.seek(cur_pos)
|
||||
stream_page = file_stream.read(4096)
|
||||
charset_result = charset_normalizer.from_bytes(stream_page).best()
|
||||
|
||||
if charset_result is not None:
|
||||
charset = self._normalize_charset(charset_result.encoding)
|
||||
|
||||
# Normalize the first extension listed
|
||||
guessed_extension = None
|
||||
if len(result.prediction.output.extensions) > 0:
|
||||
guessed_extension = "." + result.prediction.output.extensions[0]
|
||||
|
||||
# Determine if the guess is compatible with the base guess
|
||||
compatible = True
|
||||
if (
|
||||
base_guess.mimetype is not None
|
||||
and base_guess.mimetype != result.prediction.output.mime_type
|
||||
):
|
||||
compatible = False
|
||||
|
||||
if (
|
||||
base_guess.extension is not None
|
||||
and base_guess.extension.lstrip(".")
|
||||
not in result.prediction.output.extensions
|
||||
):
|
||||
compatible = False
|
||||
|
||||
if (
|
||||
base_guess.charset is not None
|
||||
and self._normalize_charset(base_guess.charset) != charset
|
||||
):
|
||||
compatible = False
|
||||
|
||||
if compatible:
|
||||
# Add the compatible base guess
|
||||
guesses.append(
|
||||
StreamInfo(
|
||||
mimetype=base_guess.mimetype
|
||||
or result.prediction.output.mime_type,
|
||||
extension=base_guess.extension or guessed_extension,
|
||||
charset=base_guess.charset or charset,
|
||||
filename=base_guess.filename,
|
||||
local_path=base_guess.local_path,
|
||||
url=base_guess.url,
|
||||
)
|
||||
)
|
||||
else:
|
||||
# The magika guess was incompatible with the base guess, so add both guesses
|
||||
guesses.append(enhanced_guess)
|
||||
guesses.append(
|
||||
StreamInfo(
|
||||
mimetype=result.prediction.output.mime_type,
|
||||
extension=guessed_extension,
|
||||
charset=charset,
|
||||
filename=base_guess.filename,
|
||||
local_path=base_guess.local_path,
|
||||
url=base_guess.url,
|
||||
)
|
||||
)
|
||||
else:
|
||||
# There were no other guesses, so just add the base guess
|
||||
guesses.append(enhanced_guess)
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
return guesses
|
||||
|
||||
def _normalize_charset(self, charset: str | None) -> str | None:
|
||||
"""
|
||||
Normalize a charset string to a canonical form.
|
||||
"""
|
||||
if charset is None:
|
||||
return None
|
||||
try:
|
||||
return codecs.lookup(charset).name
|
||||
except LookupError:
|
||||
return charset
|
||||
|
||||
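Review note: the charset comparison in `_get_stream_info_guesses` depends on both sides being normalized the same way. A minimal standalone sketch of that normalization follows; the free function `normalize_charset` is a hypothetical stand-in for the `_normalize_charset` method and is not part of this diff.

```python
import codecs
from typing import Optional


def normalize_charset(charset: Optional[str]) -> Optional[str]:
    """Return Python's canonical codec name, or the input if it is unknown."""
    if charset is None:
        return None
    try:
        return codecs.lookup(charset).name
    except LookupError:
        # Not in Python's codec registry: keep the caller's spelling.
        return charset


# Aliases of one codec collapse to the same name, so they compare equal.
same = normalize_charset("UTF8") == normalize_charset("utf_8")
unknown = normalize_charset("not-a-real-charset")
```

Without this step, a base guess of `UTF8` and a detected `utf_8` would be treated as incompatible even though they name the same encoding.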
32
packages/markitdown/src/markitdown/_stream_info.py
Normal file
@@ -0,0 +1,32 @@
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(kw_only=True, frozen=True)
class StreamInfo:
    """The StreamInfo class is used to store information about a file stream.
    All fields can be None, and will depend on how the stream was opened.
    """

    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[
        str
    ] = None  # From local path, url, or Content-Disposition header
    local_path: Optional[str] = None  # If read from disk
    url: Optional[str] = None  # If read from url

    def copy_and_update(self, *args, **kwargs):
        """Copy the StreamInfo object and update it with the given StreamInfo
        instance and/or other keyword arguments."""
        new_info = asdict(self)

        for si in args:
            assert isinstance(si, StreamInfo)
            new_info.update({k: v for k, v in asdict(si).items() if v is not None})

        if len(kwargs) > 0:
            new_info.update(kwargs)

        return StreamInfo(**new_info)
@@ -2,7 +2,6 @@
#
# SPDX-License-Identifier: MIT

from ._base import DocumentConverter, DocumentConverterResult
from ._plain_text_converter import PlainTextConverter
from ._html_converter import HtmlConverter
from ._rss_converter import RssConverter
@@ -15,16 +14,13 @@ from ._docx_converter import DocxConverter
from ._xlsx_converter import XlsxConverter, XlsConverter
from ._pptx_converter import PptxConverter
from ._image_converter import ImageConverter
from ._wav_converter import WavConverter
from ._mp3_converter import Mp3Converter
from ._audio_converter import AudioConverter
from ._outlook_msg_converter import OutlookMsgConverter
from ._zip_converter import ZipConverter
from ._doc_intel_converter import DocumentIntelligenceConverter
from ._converter_input import ConverterInput
from ._epub_converter import EpubConverter

__all__ = [
    "DocumentConverter",
    "DocumentConverterResult",
    "PlainTextConverter",
    "HtmlConverter",
    "RssConverter",
@@ -38,10 +34,9 @@ __all__ = [
    "XlsConverter",
    "PptxConverter",
    "ImageConverter",
    "WavConverter",
    "Mp3Converter",
    "AudioConverter",
    "OutlookMsgConverter",
    "ZipConverter",
    "DocumentIntelligenceConverter",
    "ConverterInput",
    "EpubConverter",
]

@@ -0,0 +1,102 @@
import io
from typing import Any, BinaryIO, Optional

from ._exiftool import exiftool_metadata
from ._transcribe_audio import transcribe_audio
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException

ACCEPTED_MIME_TYPE_PREFIXES = [
    "audio/x-wav",
    "audio/mpeg",
    "video/mp4",
]

ACCEPTED_FILE_EXTENSIONS = [
    ".wav",
    ".mp3",
    ".m4a",
    ".mp4",
]


class AudioConverter(DocumentConverter):
    """
    Converts audio files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
    """

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> bool:
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()

        if extension in ACCEPTED_FILE_EXTENSIONS:
            return True

        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
            if mimetype.startswith(prefix):
                return True

        return False

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        md_content = ""

        # Add metadata
        metadata = exiftool_metadata(
            file_stream, exiftool_path=kwargs.get("exiftool_path")
        )
        if metadata:
            for f in [
                "Title",
                "Artist",
                "Author",
                "Band",
                "Album",
                "Genre",
                "Track",
                "DateTimeOriginal",
                "CreateDate",
                # "Duration", -- Wrong values when read from memory
                "NumChannels",
                "SampleRate",
                "AvgBytesPerSec",
                "BitsPerSample",
            ]:
                if f in metadata:
                    md_content += f"{f}: {metadata[f]}\n"

        # Figure out the audio format for transcription
        if stream_info.extension == ".wav" or stream_info.mimetype == "audio/x-wav":
            audio_format = "wav"
        elif stream_info.extension == ".mp3" or stream_info.mimetype == "audio/mpeg":
            audio_format = "mp3"
        elif (
            stream_info.extension in [".mp4", ".m4a"]
            or stream_info.mimetype == "video/mp4"
        ):
            audio_format = "mp4"
        else:
            audio_format = None

        # Transcribe
        if audio_format:
            try:
                transcript = transcribe_audio(file_stream, audio_format=audio_format)
                if transcript:
                    md_content += "\n\n### Audio Transcript:\n" + transcript
            except MissingDependencyException:
                pass

        # Return the result
        return DocumentConverterResult(markdown=md_content.strip())
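Review note: every converter in this change set uses the same `accepts()` shape, an exact extension match followed by a MIME-type prefix match. A minimal sketch of that check, written as a hypothetical free function over plain strings rather than a `StreamInfo`:

```python
from typing import Optional

ACCEPTED_MIME_TYPE_PREFIXES = ["audio/x-wav", "audio/mpeg", "video/mp4"]
ACCEPTED_FILE_EXTENSIONS = [".wav", ".mp3", ".m4a", ".mp4"]


def accepts(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical free-function version of the accepts() check."""
    mimetype = (mimetype or "").lower()
    extension = (extension or "").lower()

    # Extensions must match exactly (after lower-casing).
    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True

    # MIME types match by prefix, so trailing parameters are tolerated.
    return any(mimetype.startswith(p) for p in ACCEPTED_MIME_TYPE_PREFIXES)
```

Note that `convert()` in this file compares `stream_info.extension` and `stream_info.mimetype` without lower-casing, so an input accepted as `.MP3` would fall through to `audio_format = None` unless the MIME type also matches.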
@@ -1,63 +0,0 @@
from typing import Any, Union


class DocumentConverterResult:
    """The result of converting a document to text."""

    def __init__(self, title: Union[str, None] = None, text_content: str = ""):
        self.title: Union[str, None] = title
        self.text_content: str = text_content


class DocumentConverter:
    """Abstract superclass of all DocumentConverters."""

    # Lower priority values are tried first.
    PRIORITY_SPECIFIC_FILE_FORMAT = (
        0.0  # e.g., .docx, .pdf, .xlsx, or specific pages, e.g., wikipedia
    )
    PRIORITY_GENERIC_FILE_FORMAT = (
        10.0  # Near catch-all converters for mimetypes like text/*, etc.
    )

    def __init__(self, priority: float = PRIORITY_SPECIFIC_FILE_FORMAT):
        """
        Initialize the DocumentConverter with a given priority.

        Priorities work as follows: By default, most converters get priority
        DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
        is the PlainTextConverter, which gets priority PRIORITY_GENERIC_FILE_FORMAT (== 10),
        with lower values being tried first (i.e., higher priority).

        Just prior to conversion, the converters are sorted by priority, using
        a stable sort. This means that converters with the same priority will
        remain in the same order, with the most recently registered converters
        appearing first.

        We have tight control over the order of built-in converters, but
        plugins can register converters in any order. A converter's priority
        field reasserts some control over the order of converters.

        Plugins can register converters with any priority, to appear before or
        after the built-ins. For example, a plugin with priority 9 will run
        before the PlainTextConverter, but after the built-in converters.
        """
        self._priority = priority

    def convert(
        self, local_path: str, **kwargs: Any
    ) -> Union[None, DocumentConverterResult]:
        raise NotImplementedError("Subclasses must implement this method")

    @property
    def priority(self) -> float:
        """Priority of the converter in markitdown's converter list. Lower priority values are tried first."""
        return self._priority

    @priority.setter
    def priority(self, value: float):
        self._priority = value

    @priority.deleter
    def priority(self):
        raise AttributeError("Cannot delete the priority attribute")
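Review note: the ordering rule in the docstring above (front insertion plus a stable sort by priority) can be checked with a small sketch. `Registration`, `register`, and the converter names used as strings are hypothetical stand-ins, not markitdown's actual registry code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Registration:
    # Hypothetical stand-in for a registered converter.
    name: str
    priority: float


registrations: List[Registration] = []


def register(name: str, priority: float = 0.0) -> None:
    # New registrations go to the front of the list.
    registrations.insert(0, Registration(name, priority))


register("PlainTextConverter", priority=10.0)
register("DocxConverter")
register("PdfConverter")
register("PluginConverter", priority=9.0)

# sorted() is stable: entries with equal priority keep their list order,
# so the most recently registered one among them is tried first.
order = [r.name for r in sorted(registrations, key=lambda r: r.priority)]
```

Here the plugin with priority 9 lands after both priority-0 converters and before the priority-10 catch-all, matching the example in the docstring.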
@@ -1,14 +1,24 @@
|
||||
# type: ignore
|
||||
import base64
|
||||
import io
|
||||
import re
|
||||
|
||||
from typing import Union
|
||||
import base64
|
||||
import binascii
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/html",
|
||||
"application/xhtml",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
|
||||
class BingSerpConverter(DocumentConverter):
|
||||
@@ -17,31 +27,49 @@ class BingSerpConverter(DocumentConverter):
|
||||
NOTE: It is better to use the Bing API
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
"""
|
||||
Make sure we're dealing with HTML content *from* Bing.
|
||||
"""
|
||||
|
||||
url = stream_info.url or ""
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if not re.search(r"^https://www\.bing\.com/search\?q=", url):
|
||||
# Not a Bing SERP URL
|
||||
return False
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
# Not HTML content
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a Bing SERP
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".html", ".htm"]:
|
||||
return None
|
||||
url = kwargs.get("url", "")
|
||||
if not re.search(r"^https://www\.bing\.com/search\?q=", url):
|
||||
return None
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
assert stream_info.url is not None
|
||||
|
||||
# Parse the query parameters
|
||||
parsed_params = parse_qs(urlparse(url).query)
|
||||
parsed_params = parse_qs(urlparse(stream_info.url).query)
|
||||
query = parsed_params.get("q", [""])[0]
|
||||
|
||||
# Parse the file
|
||||
soup = None
|
||||
file_obj = input.read_file(mode="rt", encoding="utf-8")
|
||||
soup = BeautifulSoup(file_obj.read(), "html.parser")
|
||||
file_obj.close()
|
||||
# Parse the stream
|
||||
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
|
||||
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
|
||||
|
||||
# Clean up some formatting
|
||||
for tptt in soup.find_all(class_="tptt"):
|
||||
@@ -54,6 +82,9 @@ class BingSerpConverter(DocumentConverter):
|
||||
_markdownify = _CustomMarkdownify()
|
||||
results = list()
|
||||
for result in soup.find_all(class_="b_algo"):
|
||||
if not hasattr(result, "find_all"):
|
||||
continue
|
||||
|
||||
# Rewrite redirect urls
|
||||
for a in result.find_all("a", href=True):
|
||||
parsed_href = urlparse(a["href"])
|
||||
@@ -85,6 +116,6 @@ class BingSerpConverter(DocumentConverter):
|
||||
)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=webpage_text,
|
||||
title=None if soup.title is None else soup.title.string,
|
||||
text_content=webpage_text,
|
||||
)
|
||||
|
||||
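Review note: the Bing check and the query extraction use different levels of strictness. The sketch below uses made-up example URLs to show both calls from this file side by side.

```python
import re
from urllib.parse import parse_qs, urlparse

BING_SERP_PATTERN = r"^https://www\.bing\.com/search\?q="

url = "https://www.bing.com/search?q=markitdown+converters&form=QBLH"
reordered = "https://www.bing.com/search?form=QBLH&q=markitdown"

# The regex requires "q" to be the first query parameter.
is_bing_serp = re.search(BING_SERP_PATTERN, url) is not None
reordered_matches = re.search(BING_SERP_PATTERN, reordered) is not None

# parse_qs decodes "+" to a space and returns a list per parameter.
query = parse_qs(urlparse(url).query).get("q", [""])[0]
```

A results URL whose `q` parameter is not first is therefore rejected by `accepts()` even though `parse_qs` could still read the query from it.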
@@ -1,30 +0,0 @@
|
||||
from typing import Any, Union
|
||||
|
||||
|
||||
class ConverterInput:
|
||||
"""
|
||||
Wrapper for inputs to converter functions.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
input_type: str = "filepath",
|
||||
filepath: Union[str, None] = None,
|
||||
file_object: Union[Any, None] = None,
|
||||
):
|
||||
if input_type not in ["filepath", "object"]:
|
||||
raise ValueError(f"Invalid converter input type: {input_type}")
|
||||
|
||||
self.input_type = input_type
|
||||
self.filepath = filepath
|
||||
self.file_object = file_object
|
||||
|
||||
def read_file(
|
||||
self,
|
||||
mode: str = "rb",
|
||||
encoding: Union[str, None] = None,
|
||||
) -> Any:
|
||||
if self.input_type == "object":
|
||||
return self.file_object
|
||||
|
||||
return open(self.filepath, mode=mode, encoding=encoding)
|
||||
@@ -1,7 +1,17 @@
|
||||
from typing import Any, Union
|
||||
import sys
|
||||
import re
|
||||
|
||||
# Azure imports
|
||||
from typing import BinaryIO, Any, List
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
from azure.ai.documentintelligence import DocumentIntelligenceClient
|
||||
from azure.ai.documentintelligence.models import (
|
||||
AnalyzeDocumentRequest,
|
||||
@@ -9,9 +19,9 @@ from azure.ai.documentintelligence.models import (
|
||||
DocumentAnalysisFeature,
|
||||
)
|
||||
from azure.identity import DefaultAzureCredential
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
|
||||
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
|
||||
@@ -19,17 +29,62 @@ from ._converter_input import ConverterInput
|
||||
CONTENT_FORMAT = "markdown"
|
||||
|
||||
|
||||
OFFICE_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
"application/vnd.openxmlformats-officedocument.presentationml",
|
||||
"application/xhtml",
|
||||
"text/html",
|
||||
]
|
||||
|
||||
OTHER_MIME_TYPE_PREFIXES = [
|
||||
"application/pdf",
|
||||
"application/x-pdf",
|
||||
"text/html",
|
||||
"image/",
|
||||
]
|
||||
|
||||
OFFICE_FILE_EXTENSIONS = [
|
||||
".docx",
|
||||
".xlsx",
|
||||
".pptx",
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
OTHER_FILE_EXTENSIONS = [
|
||||
".pdf",
|
||||
".jpeg",
|
||||
".jpg",
|
||||
".png",
|
||||
".bmp",
|
||||
".tiff",
|
||||
".heif",
|
||||
]
|
||||
|
||||
|
||||
class DocumentIntelligenceConverter(DocumentConverter):
|
||||
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT,
|
||||
endpoint: str,
|
||||
api_version: str = "2024-07-31-preview",
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
super().__init__()
|
||||
|
||||
# Raise an error if the dependencies are not available.
|
||||
# This is different than other converters since this one isn't even instantiated
|
||||
# unless explicitly requested.
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
"DocumentIntelligenceConverter requires the optional dependency [az-doc-intel] (or [all]) to be installed. E.g., `pip install markitdown[az-doc-intel]`"
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
self.endpoint = endpoint
|
||||
self.api_version = api_version
|
||||
@@ -39,54 +94,61 @@ class DocumentIntelligenceConverter(DocumentConverter):
|
||||
credential=DefaultAzureCredential(),
|
||||
)
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if extension is not supported by Document Intelligence
|
||||
extension = kwargs.get("file_extension", "")
|
||||
docintel_extensions = [
|
||||
".pdf",
|
||||
".docx",
|
||||
".xlsx",
|
||||
".pptx",
|
||||
".html",
|
||||
".jpeg",
|
||||
".jpg",
|
||||
".png",
|
||||
".bmp",
|
||||
".tiff",
|
||||
".heif",
|
||||
]
|
||||
if extension.lower() not in docintel_extensions:
|
||||
return None
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
# Get the bytestring from the converter input
|
||||
file_obj = input.read_file(mode="rb")
|
||||
file_bytes = file_obj.read()
|
||||
file_obj.close()
|
||||
if extension in OFFICE_FILE_EXTENSIONS + OTHER_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
# Certain document analysis features are not available for office filetypes (.xlsx, .pptx, .html, .docx)
|
||||
if extension.lower() in [".xlsx", ".pptx", ".html", ".docx"]:
|
||||
analysis_features = []
|
||||
else:
|
||||
analysis_features = [
|
||||
for prefix in OFFICE_MIME_TYPE_PREFIXES + OTHER_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def _analysis_features(self, stream_info: StreamInfo) -> List[str]:
|
||||
"""
|
||||
Helper needed to determine which analysis features to use.
|
||||
Certain document analysis features are not available for
|
||||
office filetypes (.xlsx, .pptx, .html, .docx)
|
||||
"""
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in OFFICE_FILE_EXTENSIONS:
|
||||
return []
|
||||
|
||||
for prefix in OFFICE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return []
|
||||
|
||||
return [
|
||||
DocumentAnalysisFeature.FORMULAS, # enable formula extraction
|
||||
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
|
||||
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
|
||||
]
|
||||
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Extract the text using Azure Document Intelligence
|
||||
poller = self.doc_intel_client.begin_analyze_document(
|
||||
model_id="prebuilt-layout",
|
||||
body=AnalyzeDocumentRequest(bytes_source=file_bytes),
|
||||
features=analysis_features,
|
||||
body=AnalyzeDocumentRequest(bytes_source=file_stream.read()),
|
||||
features=self._analysis_features(stream_info),
|
||||
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
|
||||
)
|
||||
result: AnalyzeResult = poller.result()
|
||||
|
||||
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string
|
||||
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=markdown_text,
|
||||
)
|
||||
return DocumentConverterResult(markdown=markdown_text)
|
||||
|
||||
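Review note: the comment-stripping step at the end of `convert()` relies on a non-greedy match with `re.DOTALL`. A minimal sketch follows; the sample `content` string is invented for illustration and is not real Document Intelligence output.

```python
import re

content = (
    "# Title\n"
    '<!-- PageHeader="Report" -->\n'
    "Body text\n"
    "<!-- multi\nline -->\n"
    "End"
)

# ".*?" is non-greedy, so each comment is removed on its own;
# re.DOTALL lets "." cross newlines inside a multi-line comment.
markdown_text = re.sub(r"<!--.*?-->", "", content, flags=re.DOTALL)
```

Only the comments are removed; the newline after each one stays, so blank lines remain where comments used to be.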
@@ -1,14 +1,27 @@
|
||||
from typing import Union
|
||||
import sys
|
||||
|
||||
import mammoth
|
||||
from typing import BinaryIO, Any
|
||||
|
||||
from ._base import (
|
||||
DocumentConverterResult,
|
||||
)
|
||||
|
||||
from ._base import DocumentConverter
|
||||
from ._html_converter import HtmlConverter
|
||||
from ._converter_input import ConverterInput
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import mammoth
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".docx"]
|
||||
|
||||
|
||||
class DocxConverter(HtmlConverter):
|
||||
@@ -16,25 +29,49 @@ class DocxConverter(HtmlConverter):
|
||||
Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._html_converter = HtmlConverter()
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a DOCX
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".docx":
|
||||
return None
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Check: the dependencies
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".docx",
|
||||
feature="docx",
|
||||
)
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
result = None
|
||||
style_map = kwargs.get("style_map", None)
|
||||
file_obj = input.read_file(mode="rb")
|
||||
result = mammoth.convert_to_html(file_obj, style_map=style_map)
|
||||
file_obj.close()
|
||||
html_content = result.value
|
||||
result = self._convert(html_content)
|
||||
|
||||
return result
|
||||
return self._html_converter.convert_string(
|
||||
mammoth.convert_to_html(file_stream, style_map=style_map).value
|
||||
)
|
||||
|
||||
147
packages/markitdown/src/markitdown/converters/_epub_converter.py
Normal file
@@ -0,0 +1,147 @@
import os
import zipfile
import xml.dom.minidom as minidom

from typing import BinaryIO, Any, Dict, List

from ._html_converter import HtmlConverter
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo

ACCEPTED_MIME_TYPE_PREFIXES = [
    "application/epub",
    "application/epub+zip",
    "application/x-epub+zip",
]

ACCEPTED_FILE_EXTENSIONS = [".epub"]

MIME_TYPE_MAPPING = {
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
}


class EpubConverter(HtmlConverter):
    """
    Converts EPUB files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
    """

    def __init__(self):
        super().__init__()
        self._html_converter = HtmlConverter()

    def accepts(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> bool:
        mimetype = (stream_info.mimetype or "").lower()
        extension = (stream_info.extension or "").lower()

        if extension in ACCEPTED_FILE_EXTENSIONS:
            return True

        for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
            if mimetype.startswith(prefix):
                return True

        return False

    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,  # Options to pass to the converter
    ) -> DocumentConverterResult:
        with zipfile.ZipFile(file_stream, "r") as z:
            # Extract metadata (title, authors, language, publisher, date, description, identifier) from the EPUB file

            # Locate content.opf
            container_dom = minidom.parse(z.open("META-INF/container.xml"))
            opf_path = container_dom.getElementsByTagName("rootfile")[0].getAttribute(
                "full-path"
            )

            # Parse content.opf
            opf_dom = minidom.parse(z.open(opf_path))
            metadata: Dict[str, Any] = {
                "title": self._get_text_from_node(opf_dom, "dc:title"),
                "authors": self._get_all_texts_from_nodes(opf_dom, "dc:creator"),
                "language": self._get_text_from_node(opf_dom, "dc:language"),
                "publisher": self._get_text_from_node(opf_dom, "dc:publisher"),
                "date": self._get_text_from_node(opf_dom, "dc:date"),
                "description": self._get_text_from_node(opf_dom, "dc:description"),
                "identifier": self._get_text_from_node(opf_dom, "dc:identifier"),
            }

            # Extract manifest items (ID → href mapping)
            manifest = {
                item.getAttribute("id"): item.getAttribute("href")
                for item in opf_dom.getElementsByTagName("item")
            }

            # Extract spine order (ID refs)
            spine_items = opf_dom.getElementsByTagName("itemref")
            spine_order = [item.getAttribute("idref") for item in spine_items]

            # Convert spine order to actual file paths
            base_path = "/".join(
                opf_path.split("/")[:-1]
            )  # Get base directory of content.opf
            spine = [
                f"{base_path}/{manifest[item_id]}" if base_path else manifest[item_id]
                for item_id in spine_order
                if item_id in manifest
            ]

            # Extract and convert the content
            markdown_content: List[str] = []
            for file in spine:
                if file in z.namelist():
                    with z.open(file) as f:
                        filename = os.path.basename(file)
                        extension = os.path.splitext(filename)[1].lower()
                        mimetype = MIME_TYPE_MAPPING.get(extension)
                        converted_content = self._html_converter.convert(
                            f,
                            StreamInfo(
                                mimetype=mimetype,
                                extension=extension,
                                filename=filename,
                            ),
                        )
                        markdown_content.append(converted_content.markdown.strip())

            # Format and add the metadata
            metadata_markdown = []
            for key, value in metadata.items():
                if isinstance(value, list):
                    value = ", ".join(value)
                if value:
                    metadata_markdown.append(f"**{key.capitalize()}:** {value}")

            markdown_content.insert(0, "\n".join(metadata_markdown))

            return DocumentConverterResult(
                markdown="\n\n".join(markdown_content), title=metadata["title"]
            )

    def _get_text_from_node(self, dom: minidom.Document, tag_name: str) -> str | None:
        """Convenience function to extract a single occurrence of a tag (e.g., title)."""
        texts = self._get_all_texts_from_nodes(dom, tag_name)
        if len(texts) > 0:
            return texts[0]
        else:
            return None

    def _get_all_texts_from_nodes(
        self, dom: minidom.Document, tag_name: str
    ) -> List[str]:
        """Helper function to extract all occurrences of a tag (e.g., multiple authors)."""
        texts: List[str] = []
        for node in dom.getElementsByTagName(tag_name):
            if node.firstChild and hasattr(node.firstChild, "nodeValue"):
                texts.append(node.firstChild.nodeValue.strip())
        return texts
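Review note: reading order in the EPUB converter comes from the spine, not from manifest order. The sketch below replays the manifest and spine steps on a small in-memory OPF document; the `OPF` string and the `OEBPS/content.opf` path are invented for illustration.

```python
import xml.dom.minidom as minidom

OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf">
  <manifest>
    <item id="ch2" href="text/ch2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
    <itemref idref="missing"/>
  </spine>
</package>"""

# Assumed value; the converter reads it from META-INF/container.xml.
opf_path = "OEBPS/content.opf"
opf_dom = minidom.parseString(OPF)

# Manifest: item ID -> href, relative to the OPF file's directory.
manifest = {
    item.getAttribute("id"): item.getAttribute("href")
    for item in opf_dom.getElementsByTagName("item")
}

# Spine: item IDs in reading order.
spine_order = [
    ref.getAttribute("idref") for ref in opf_dom.getElementsByTagName("itemref")
]

# Resolve each spine entry to a path inside the archive, skipping unknown IDs.
base_path = "/".join(opf_path.split("/")[:-1])
spine = [
    f"{base_path}/{manifest[item_id]}" if base_path else manifest[item_id]
    for item_id in spine_order
    if item_id in manifest
]
```

The stylesheet is in the manifest but not in the spine, so it is never converted, and the dangling `missing` reference is dropped rather than raising a `KeyError`.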
34
packages/markitdown/src/markitdown/converters/_exiftool.py
Normal file
@@ -0,0 +1,34 @@
import json
import subprocess
import locale
import sys
import shutil
import os
import warnings
from typing import BinaryIO, Any, Union


def exiftool_metadata(
    file_stream: BinaryIO,
    *,
    exiftool_path: Union[str, None],
) -> Any:  # Need a better type for json data
    # Nothing to do
    if not exiftool_path:
        return {}

    # Run exiftool
    cur_pos = file_stream.tell()
    try:
        output = subprocess.run(
            [exiftool_path, "-json", "-"],
            input=file_stream.read(),
            capture_output=True,
            text=False,
        ).stdout

        return json.loads(
            output.decode(locale.getpreferredencoding(False)),
        )[0]
    finally:
        file_stream.seek(cur_pos)
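Review note: `exiftool_metadata` and `_get_stream_info_guesses` both record the stream position, read, and restore it in a `finally` block so the next consumer sees an untouched stream. A minimal sketch of that pattern, using a hypothetical `peek` helper and made-up bytes:

```python
import io
from typing import BinaryIO


def peek(stream: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes without moving the caller's stream position."""
    cur_pos = stream.tell()
    try:
        return stream.read(size)
    finally:
        # Runs on success and on error, so the position is always restored.
        stream.seek(cur_pos)


stream = io.BytesIO(b"RIFF....WAVEfmt ")
stream.seek(4)
head = peek(stream, 4)
position_after = stream.tell()
```

This only works for seekable streams; a non-seekable input would need to be buffered first.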
@@ -1,39 +1,52 @@
|
||||
from typing import Any, Union
|
||||
import io
|
||||
from typing import Any, BinaryIO, Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/html",
|
||||
"application/xhtml",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
|
||||
class HtmlConverter(DocumentConverter):
|
||||
"""Anything with content type text/html"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not html
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".html", ".htm"]:
|
||||
return None
|
||||
|
||||
result = None
|
||||
file_obj = input.read_file(mode="rt", encoding="utf-8")
|
||||
result = self._convert(file_obj.read())
|
||||
file_obj.close()
|
||||
|
||||
return result
|
||||
|
||||
def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
|
||||
"""Helper function that converts an HTML string."""
|
||||
|
||||
# Parse the string
|
||||
soup = BeautifulSoup(html_content, "html.parser")
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Parse the stream
|
||||
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
|
||||
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
|
||||
|
||||
# Remove javascript and style blocks
|
||||
for script in soup(["script", "style"]):
|
||||
@@ -53,6 +66,25 @@ class HtmlConverter(DocumentConverter):
|
||||
webpage_text = webpage_text.strip()
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=webpage_text,
|
||||
title=None if soup.title is None else soup.title.string,
|
||||
text_content=webpage_text,
|
||||
)
|
||||
|
||||
def convert_string(
|
||||
self, html_content: str, *, url: Optional[str] = None, **kwargs
|
||||
) -> DocumentConverterResult:
|
||||
"""
|
||||
Non-standard convenience method to convert a string to markdown.
|
||||
Given that many converters produce HTML as intermediate output, this
|
||||
allows for easy conversion of HTML to markdown.
|
||||
"""
|
||||
return self.convert(
|
||||
file_stream=io.BytesIO(html_content.encode("utf-8")),
|
||||
stream_info=StreamInfo(
|
||||
mimetype="text/html",
|
||||
extension=".html",
|
||||
charset="utf-8",
|
||||
url=url,
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
|
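Reviewer note: `convert_string` adapts a `str` to the new stream-based `convert` signature by encoding it and declaring the charset in the stream info. The sketch below shows the same adapter shape with stand-ins only: `StreamInfoSketch` and the toy `convert` are hypothetical and do not reproduce the package's `StreamInfo` or the BeautifulSoup parsing.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class StreamInfoSketch:
    """Hypothetical stand-in for the StreamInfo fields used in this diff."""

    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    url: Optional[str] = None


def convert(file_stream: BinaryIO, stream_info: StreamInfoSketch) -> str:
    """Toy stream-based converter: decode using the declared charset."""
    encoding = stream_info.charset or "utf-8"
    return file_stream.read().decode(encoding)


def convert_string(text: str, *, url: Optional[str] = None) -> str:
    """Wrap a str in a byte stream and describe it, then delegate."""
    return convert(
        io.BytesIO(text.encode("utf-8")),
        StreamInfoSketch(
            mimetype="text/html", extension=".html", charset="utf-8", url=url
        ),
    )


result = convert_string("<p>caf\u00e9</p>")
```

Declaring `charset="utf-8"` alongside the encoded bytes keeps the round trip lossless, since the stream side never has to guess the encoding.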
||||
@@ -1,32 +1,53 @@
|
||||
from typing import Union
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._media_converter import MediaConverter
|
||||
from ._converter_input import ConverterInput
|
||||
from typing import BinaryIO, Any, Union
|
||||
import base64
|
||||
import mimetypes
|
||||
from ._exiftool import exiftool_metadata
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"image/jpeg",
|
||||
"image/png",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".jpg", ".jpeg", ".png"]
|
||||
|
||||
|
||||
class ImageConverter(MediaConverter):
|
||||
class ImageConverter(DocumentConverter):
|
||||
"""
|
||||
Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured).
|
||||
Converts images to markdown via extraction of metadata (if `exiftool` is installed), and description via a multimodal LLM (if an llm_client is configured).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not an image
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".jpg", ".jpeg", ".png"]:
|
||||
return None
|
||||
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
md_content = ""
|
||||
|
||||
# Add metadata if a local path is provided
|
||||
if input.input_type == "filepath":
|
||||
metadata = self._get_metadata(input.filepath, kwargs.get("exiftool_path"))
|
||||
# Add metadata
|
||||
metadata = exiftool_metadata(
|
||||
file_stream, exiftool_path=kwargs.get("exiftool_path")
|
||||
)
|
||||
|
||||
if metadata:
|
||||
for f in [
|
||||
@@ -44,42 +65,59 @@ class ImageConverter(MediaConverter):
|
||||
if f in metadata:
|
||||
md_content += f"{f}: {metadata[f]}\n"
|
||||
|
||||
# Try describing the image with GPTV
|
||||
# Try describing the image with GPT
|
||||
llm_client = kwargs.get("llm_client")
|
||||
llm_model = kwargs.get("llm_model")
|
||||
if llm_client is not None and llm_model is not None:
|
||||
md_content += (
|
||||
"\n# Description:\n"
|
||||
+ self._get_llm_description(
|
||||
input,
|
||||
extension,
|
||||
llm_client,
|
||||
llm_model,
|
||||
llm_description = self._get_llm_description(
|
||||
file_stream,
|
||||
stream_info,
|
||||
client=llm_client,
|
||||
model=llm_model,
|
||||
prompt=kwargs.get("llm_prompt"),
|
||||
).strip()
|
||||
+ "\n"
|
||||
)
|
||||
|
||||
if llm_description is not None:
|
||||
md_content += "\n# Description:\n" + llm_description.strip() + "\n"
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content,
|
||||
markdown=md_content,
|
||||
)
|
||||
|
||||
def _get_llm_description(
|
||||
self, input: ConverterInput, extension, client, model, prompt=None
|
||||
):
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
*,
|
||||
client,
|
||||
model,
|
||||
prompt=None,
|
||||
) -> Union[None, str]:
|
||||
if prompt is None or prompt.strip() == "":
|
||||
prompt = "Write a detailed caption for this image."
|
||||
|
||||
data_uri = ""
|
||||
content_type, encoding = mimetypes.guess_type("_dummy" + extension)
|
||||
if content_type is None:
|
||||
content_type = "image/jpeg"
|
||||
image_file = input.read_file(mode="rb")
|
||||
image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
|
||||
image_file.close()
|
||||
data_uri = f"data:{content_type};base64,{image_base64}"
|
||||
# Get the content type
|
||||
content_type = stream_info.mimetype
|
||||
if not content_type:
|
||||
content_type, _ = mimetypes.guess_type(
|
||||
"_dummy" + (stream_info.extension or "")
|
||||
)
|
||||
if not content_type:
|
||||
content_type = "application/octet-stream"
|
||||
|
||||
# Convert to base64
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
|
||||
except Exception as e:
|
||||
return None
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
# Prepare the data-uri
|
||||
data_uri = f"data:{content_type};base64,{base64_image}"
|
||||
|
||||
# Prepare the OpenAI API request
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
@@ -95,5 +133,6 @@ class ImageConverter(MediaConverter):
|
||||
}
|
||||
]
|
||||
|
||||
# Call the OpenAI API
|
||||
response = client.chat.completions.create(model=model, messages=messages)
|
||||
return response.choices[0].message.content
|
||||
|
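Reviewer note: the `accepts` methods added in this change all follow one shape: normalise the mimetype and extension, accept on a known extension, otherwise accept on a mimetype prefix. A standalone sketch of that check for the image case follows. It takes plain strings instead of a `StreamInfo`, which is a simplification made for illustration.

```python
from typing import Optional

ACCEPTED_MIME_TYPE_PREFIXES = ["image/jpeg", "image/png"]
ACCEPTED_FILE_EXTENSIONS = [".jpg", ".jpeg", ".png"]


def accepts(mimetype: Optional[str], extension: Optional[str]) -> bool:
    """Return True if either the extension or the mimetype prefix matches."""
    mimetype = (mimetype or "").lower()  # None-safe and case-insensitive
    extension = (extension or "").lower()

    if extension in ACCEPTED_FILE_EXTENSIONS:
        return True

    # Prefix matching tolerates parameters such as "; charset=binary"
    return any(mimetype.startswith(prefix) for prefix in ACCEPTED_MIME_TYPE_PREFIXES)
```

Matching on a prefix rather than on equality means a header value like `image/jpeg; charset=binary` is still accepted.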
||||
@@ -1,41 +1,62 @@
|
||||
from typing import BinaryIO, Any
|
||||
import json
|
||||
from typing import Any, Union
|
||||
|
||||
from ._base import (
|
||||
DocumentConverter,
|
||||
DocumentConverterResult,
|
||||
)
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._exceptions import FileConversionException
|
||||
from ._converter_input import ConverterInput
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
CANDIDATE_MIME_TYPE_PREFIXES = [
|
||||
"application/json",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".ipynb"]
|
||||
|
||||
|
||||
class IpynbConverter(DocumentConverter):
|
||||
"""Converts Jupyter Notebook (.ipynb) files to Markdown."""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
# Read further to see if it's a notebook
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
encoding = stream_info.charset or "utf-8"
|
||||
notebook_content = file_stream.read().decode(encoding)
|
||||
return (
|
||||
"nbformat" in notebook_content
|
||||
and "nbformat_minor" in notebook_content
|
||||
)
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not ipynb
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".ipynb":
|
||||
return None
|
||||
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Parse and convert the notebook
|
||||
result = None
|
||||
file_obj = input.read_file(mode="rt", encoding="utf-8")
|
||||
notebook_content = json.load(file_obj)
|
||||
file_obj.close()
|
||||
result = self._convert(notebook_content)
|
||||
|
||||
return result
|
||||
encoding = stream_info.charset or "utf-8"
|
||||
notebook_content = file_stream.read().decode(encoding=encoding)
|
||||
return self._convert(json.loads(notebook_content))
|
||||
|
||||
def _convert(self, notebook_content: dict) -> Union[None, DocumentConverterResult]:
|
||||
def _convert(self, notebook_content: dict) -> DocumentConverterResult:
|
||||
"""Helper function that converts notebook JSON content to Markdown."""
|
||||
try:
|
||||
md_output = []
|
||||
@@ -67,8 +88,8 @@ class IpynbConverter(DocumentConverter):
|
||||
title = notebook_content.get("metadata", {}).get("title", title)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=md_text,
|
||||
title=title,
|
||||
text_content=md_text,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
|
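Reviewer note: for `application/json` inputs, `IpynbConverter.accepts` peeks at the content to decide whether the JSON is a notebook, then rewinds the stream. The sketch below isolates that content sniffing. The function name `looks_like_notebook` is hypothetical, and the charset is passed directly rather than read from a `StreamInfo`.

```python
import io
import json
from typing import BinaryIO


def looks_like_notebook(file_stream: BinaryIO, charset: str = "utf-8") -> bool:
    """Peek at the stream for notebook markers without consuming it."""
    cur_pos = file_stream.tell()
    try:
        text = file_stream.read().decode(charset)
        return "nbformat" in text and "nbformat_minor" in text
    finally:
        file_stream.seek(cur_pos)  # leave the stream as we found it


notebook = io.BytesIO(
    json.dumps({"cells": [], "nbformat": 4, "nbformat_minor": 5}).encode("utf-8")
)
plain_json = io.BytesIO(b'{"name": "not a notebook"}')
```

This is a substring test, not a JSON parse, so it is cheap but can give false positives on any JSON that merely mentions those keys. Note also that `"nbformat_minor"` contains `"nbformat"`, so the second condition implies the first.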
||||
@@ -0,0 +1,50 @@
|
||||
from typing import BinaryIO, Any, Union
|
||||
import base64
|
||||
import mimetypes
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
|
||||
def llm_caption(
|
||||
file_stream: BinaryIO, stream_info: StreamInfo, *, client, model, prompt=None
|
||||
) -> Union[None, str]:
|
||||
if prompt is None or prompt.strip() == "":
|
||||
prompt = "Write a detailed caption for this image."
|
||||
|
||||
# Get the content type
|
||||
content_type = stream_info.mimetype
|
||||
if not content_type:
|
||||
content_type, _ = mimetypes.guess_type("_dummy" + (stream_info.extension or ""))
|
||||
if not content_type:
|
||||
content_type = "application/octet-stream"
|
||||
|
||||
# Convert to base64
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
|
||||
except Exception as e:
|
||||
return None
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
# Prepare the data-uri
|
||||
data_uri = f"data:{content_type};base64,{base64_image}"
|
||||
|
||||
# Prepare the OpenAI API request
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": prompt},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": data_uri,
|
||||
},
|
||||
},
|
||||
],
|
||||
}
|
||||
]
|
||||
|
||||
# Call the OpenAI API
|
||||
response = client.chat.completions.create(model=model, messages=messages)
|
||||
return response.choices[0].message.content
|
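Reviewer note: `llm_caption` resolves a content type in three steps (declared mimetype, then a guess from the extension, then `application/octet-stream`) before building a base64 data URI. A sketch of just that step, without any LLM client, follows. The function name `build_data_uri` is hypothetical.

```python
import base64
import mimetypes
from typing import Optional


def build_data_uri(
    data: bytes, mimetype: Optional[str] = None, extension: Optional[str] = None
) -> str:
    """Build a data URI, falling back from mimetype to extension to a default."""
    content_type = mimetype
    if not content_type:
        # guess_type works on names, so prepend a dummy stem to the extension
        content_type, _ = mimetypes.guess_type("_dummy" + (extension or ""))
    if not content_type:
        content_type = "application/octet-stream"

    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{content_type};base64,{encoded}"
```

For example, `build_data_uri(b"hi", mimetype="image/png")` yields `data:image/png;base64,aGk=`.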
||||
@@ -1,7 +1,7 @@
|
||||
import re
|
||||
import markdownify
|
||||
|
||||
from typing import Any
|
||||
from typing import Any, Optional
|
||||
from urllib.parse import quote, unquote, urlparse, urlunparse
|
||||
|
||||
|
||||
@@ -20,7 +20,14 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
# Explicitly cast options to the expected type if necessary
|
||||
super().__init__(**options)
|
||||
|
||||
def convert_hn(self, n: int, el: Any, text: str, convert_as_inline: bool) -> str:
|
||||
def convert_hn(
|
||||
self,
|
||||
n: int,
|
||||
el: Any,
|
||||
text: str,
|
||||
convert_as_inline: Optional[bool] = False,
|
||||
**kwargs,
|
||||
) -> str:
|
||||
"""Same as usual, but be sure to start with a new line"""
|
||||
if not convert_as_inline:
|
||||
if not re.search(r"^\n", text):
|
||||
@@ -28,7 +35,13 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
|
||||
return super().convert_hn(n, el, text, convert_as_inline) # type: ignore
|
||||
|
||||
def convert_a(self, el: Any, text: str, convert_as_inline: bool):
|
||||
def convert_a(
|
||||
self,
|
||||
el: Any,
|
||||
text: str,
|
||||
convert_as_inline: Optional[bool] = False,
|
||||
**kwargs,
|
||||
):
|
||||
"""Same as usual converter, but removes Javascript links and escapes URIs."""
|
||||
prefix, suffix, text = markdownify.chomp(text) # type: ignore
|
||||
if not text:
|
||||
@@ -68,7 +81,13 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
|
||||
else text
|
||||
)
|
||||
|
||||
def convert_img(self, el: Any, text: str, convert_as_inline: bool) -> str:
|
||||
def convert_img(
|
||||
self,
|
||||
el: Any,
|
||||
text: str,
|
||||
convert_as_inline: Optional[bool] = False,
|
||||
**kwargs,
|
||||
) -> str:
|
||||
"""Same as usual converter, but removes data URIs"""
|
||||
|
||||
alt = el.attrs.get("alt", None) or ""
|
||||
|
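Reviewer note: the overrides in `_CustomMarkdownify` now give `convert_as_inline` a default and accept `**kwargs`. One plausible motivation (an assumption, not stated in the diff) is to keep the overrides callable when the base class passes a different set of keyword arguments. The sketch below uses hypothetical toy classes, not markdownify itself, to show the effect. The `parent_tags` keyword and the data-URI truncation rule are illustrative assumptions.

```python
from typing import Any, Dict, Optional


class BaseConverter:
    """Hypothetical base class whose dispatcher passes an extra keyword."""

    def convert_img(
        self,
        el: Dict[str, str],
        text: str,
        convert_as_inline: Optional[bool] = False,
        **kwargs: Any,
    ) -> str:
        return f"![{text}]({el['src']})"

    def process(self, el: Dict[str, str]) -> str:
        # The caller omits convert_as_inline and adds a keyword of its own
        return self.convert_img(el, el.get("alt", ""), parent_tags={"p"})


class NoDataUriConverter(BaseConverter):
    def convert_img(
        self,
        el: Dict[str, str],
        text: str,
        convert_as_inline: Optional[bool] = False,
        **kwargs: Any,  # swallow keywords this override does not use
    ) -> str:
        src = el["src"]
        if src.startswith("data:"):
            src = src.split(",")[0] + "..."  # keep the header, drop the payload
        return f"![{text}]({src})"


rendered = NoDataUriConverter().process(
    {"src": "data:image/png;base64,AAAA", "alt": "logo"}
)
```

With the older three-positional signature, the `process` call above would raise a `TypeError` for the missing argument and the unexpected keyword.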
||||
@@ -1,41 +0,0 @@
|
||||
import subprocess
|
||||
import shutil
|
||||
import json
|
||||
from warnings import warn
|
||||
|
||||
from ._base import DocumentConverter
|
||||
|
||||
|
||||
class MediaConverter(DocumentConverter):
|
||||
"""
|
||||
Abstract class for multi-modal media (e.g., images and audio)
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
|
||||
def _get_metadata(self, local_path, exiftool_path=None):
|
||||
if not exiftool_path:
|
||||
which_exiftool = shutil.which("exiftool")
|
||||
if which_exiftool:
|
||||
warn(
|
||||
f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown constructor. E.g.,
|
||||
|
||||
md = MarkItDown(exiftool_path="{which_exiftool}")
|
||||
|
||||
This warning will be removed in future releases.
|
||||
""",
|
||||
DeprecationWarning,
|
||||
)
|
||||
|
||||
return None
|
||||
else:
|
||||
if True:
|
||||
result = subprocess.run(
|
||||
[exiftool_path, "-json", local_path], capture_output=True, text=True
|
||||
).stdout
|
||||
return json.loads(result)[0]
|
||||
# except Exception:
|
||||
# return None
|
||||
@@ -1,98 +0,0 @@
|
||||
import tempfile
|
||||
import os
|
||||
from typing import Union
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._wav_converter import WavConverter
|
||||
from warnings import resetwarnings, catch_warnings
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
# Optional Transcription support
|
||||
IS_AUDIO_TRANSCRIPTION_CAPABLE = False
|
||||
try:
|
||||
# Using warnings' catch_warnings to catch
|
||||
# pydub's warning of ffmpeg or avconv missing
|
||||
with catch_warnings(record=True) as w:
|
||||
import pydub
|
||||
|
||||
if w:
|
||||
raise ModuleNotFoundError
|
||||
import speech_recognition as sr
|
||||
|
||||
IS_AUDIO_TRANSCRIPTION_CAPABLE = True
|
||||
except ModuleNotFoundError:
|
||||
pass
|
||||
finally:
|
||||
resetwarnings()
|
||||
|
||||
|
||||
class Mp3Converter(WavConverter):
|
||||
"""
|
||||
Converts MP3 files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` AND `pydub` are installed).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a MP3
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".mp3":
|
||||
return None
|
||||
|
||||
# Bail if a local path was not provided
|
||||
if input.input_type != "filepath":
|
||||
return None
|
||||
local_path = input.filepath
|
||||
|
||||
md_content = ""
|
||||
|
||||
# Add metadata
|
||||
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
|
||||
if metadata:
|
||||
for f in [
|
||||
"Title",
|
||||
"Artist",
|
||||
"Author",
|
||||
"Band",
|
||||
"Album",
|
||||
"Genre",
|
||||
"Track",
|
||||
"DateTimeOriginal",
|
||||
"CreateDate",
|
||||
"Duration",
|
||||
]:
|
||||
if f in metadata:
|
||||
md_content += f"{f}: {metadata[f]}\n"
|
||||
|
||||
# Transcribe
|
||||
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
|
||||
handle, temp_path = tempfile.mkstemp(suffix=".wav")
|
||||
os.close(handle)
|
||||
try:
|
||||
sound = pydub.AudioSegment.from_mp3(local_path)
|
||||
sound.export(temp_path, format="wav")
|
||||
|
||||
_args = dict()
|
||||
_args.update(kwargs)
|
||||
_args["file_extension"] = ".wav"
|
||||
|
||||
try:
|
||||
transcript = super()._transcribe_audio(temp_path).strip()
|
||||
md_content += "\n\n### Audio Transcript:\n" + (
|
||||
"[No speech detected]" if transcript == "" else transcript
|
||||
)
|
||||
except Exception:
|
||||
md_content += "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
|
||||
|
||||
finally:
|
||||
os.unlink(temp_path)
|
||||
|
||||
# Return the result
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content.strip(),
|
||||
)
|
||||
@@ -1,7 +1,24 @@
|
||||
import olefile
|
||||
from typing import Any, Union
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
import sys
|
||||
from typing import Any, Union, BinaryIO
|
||||
from .._stream_info import StreamInfo
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
olefile = None
|
||||
try:
|
||||
import olefile # type: ignore[no-redef]
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.ms-outlook",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".msg"]
|
||||
|
||||
|
||||
class OutlookMsgConverter(DocumentConverter):
|
||||
@@ -12,22 +29,71 @@ class OutlookMsgConverter(DocumentConverter):
|
||||
- Email body content
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
# Check the extension and mimetype
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
# Brute force, check if we have an OLE file
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
if olefile and not olefile.isOleFile(file_stream):
|
||||
return False
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
# Brute force, check if it's an Outlook file
|
||||
try:
|
||||
if olefile is not None:
|
||||
msg = olefile.OleFileIO(file_stream)
|
||||
toc = "\n".join([str(stream) for stream in msg.listdir()])
|
||||
return (
|
||||
"__properties_version1.0" in toc
|
||||
and "__recip_version1.0_#00000000" in toc
|
||||
)
|
||||
except Exception as e:
|
||||
pass
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a MSG file
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".msg":
|
||||
return None
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Check the dependencies
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".msg",
|
||||
feature="outlook",
|
||||
)
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
try:
|
||||
file_obj = input.read_file(mode="rb")
|
||||
msg = olefile.OleFileIO(file_obj)
|
||||
assert (
|
||||
olefile is not None
|
||||
) # If we made it this far, olefile should be available
|
||||
msg = olefile.OleFileIO(file_stream)
|
||||
|
||||
# Extract email metadata
|
||||
md_content = "# Email Message\n\n"
|
||||
@@ -52,21 +118,19 @@ class OutlookMsgConverter(DocumentConverter):
|
||||
md_content += body
|
||||
|
||||
msg.close()
|
||||
file_obj.close()
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=headers.get("Subject"), text_content=md_content.strip()
|
||||
markdown=md_content.strip(),
|
||||
title=headers.get("Subject"),
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
raise FileConversionException(
|
||||
f"Could not convert MSG file '{input.filepath}': {str(e)}"
|
||||
)
|
||||
|
||||
def _get_stream_data(
|
||||
self, msg: olefile.OleFileIO, stream_path: str
|
||||
) -> Union[str, None]:
|
||||
def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]:
|
||||
"""Helper to safely extract and decode stream data from the MSG file."""
|
||||
assert olefile is not None
|
||||
assert isinstance(
|
||||
msg, olefile.OleFileIO
|
||||
) # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)
|
||||
|
||||
try:
|
||||
if msg.exists(stream_path):
|
||||
data = msg.openstream(stream_path).read()
|
||||
|
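Reviewer note: when neither extension nor mimetype matches, `OutlookMsgConverter.accepts` falls back to inspecting the bytes, first asking `olefile.isOleFile` whether the stream is an OLE container at all. The sketch below is a simplified stand-in for that first check, not olefile's implementation: it only compares the leading 8 bytes against the Compound File Binary signature and rewinds afterwards. The function name `has_ole_signature` is hypothetical.

```python
import io
from typing import BinaryIO

# Signature at the start of every OLE2 / Compound File Binary container
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def has_ole_signature(file_stream: BinaryIO) -> bool:
    """Check the leading bytes for the OLE signature, then rewind."""
    cur_pos = file_stream.tell()
    try:
        return file_stream.read(len(OLE_MAGIC)) == OLE_MAGIC
    finally:
        file_stream.seek(cur_pos)


ole_like = io.BytesIO(OLE_MAGIC + b"\x00" * 504)
not_ole = io.BytesIO(b"%PDF-1.7\n")
```

A signature match only says the file is an OLE container. Telling a `.msg` apart from, say, a legacy `.doc` still needs the stream-listing check that follows in the diff.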
||||
@@ -1,9 +1,32 @@
|
||||
import sys
|
||||
import io
|
||||
|
||||
from typing import BinaryIO, Any
|
||||
|
||||
|
||||
from ._html_converter import HtmlConverter
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import pdfminer
|
||||
import pdfminer.high_level
|
||||
from typing import Union
|
||||
from io import StringIO
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"application/pdf",
|
||||
"application/x-pdf",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".pdf"]
|
||||
|
||||
|
||||
class PdfConverter(DocumentConverter):
|
||||
@@ -11,25 +34,45 @@ class PdfConverter(DocumentConverter):
|
||||
Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a PDF
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".pdf":
|
||||
return None
|
||||
|
||||
output = StringIO()
|
||||
file_obj = input.read_file(mode="rb")
|
||||
pdfminer.high_level.extract_text_to_fp(file_obj, output)
|
||||
file_obj.close()
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=output.getvalue(),
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Check the dependencies
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".pdf",
|
||||
feature="pdf",
|
||||
)
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
assert isinstance(file_stream, io.IOBase) # for mypy
|
||||
return DocumentConverterResult(
|
||||
markdown=pdfminer.high_level.extract_text(file_stream),
|
||||
)
|
||||
|
||||
@@ -1,43 +1,62 @@
|
||||
import mimetypes
|
||||
import sys
|
||||
|
||||
from charset_normalizer import from_path, from_bytes
|
||||
from typing import Any, Union
|
||||
from typing import BinaryIO, Any
|
||||
from charset_normalizer import from_bytes
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import mammoth
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/",
|
||||
"application/json",
|
||||
]
|
||||
|
||||
# Mimetypes to ignore (commonly confused extensions)
|
||||
IGNORE_MIME_TYPE_PREFIXES = [
|
||||
"text/vnd.in3d.spot", # .spo which is confused with xls, doc, etc.
|
||||
"text/vnd.graphviz", # .dot which is confused with xls, doc, etc.
|
||||
]
|
||||
|
||||
|
||||
class PlainTextConverter(DocumentConverter):
|
||||
"""Anything with content type text/plain"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_GENERIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
for prefix in IGNORE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return False
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Read file object from input
|
||||
file_obj = input.read_file(mode="rb")
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
if stream_info.charset:
|
||||
text_content = file_stream.read().decode(stream_info.charset)
|
||||
else:
|
||||
text_content = str(from_bytes(file_stream.read()).best())
|
||||
|
||||
# Guess the content type from any file extension that might be around
|
||||
content_type, _ = mimetypes.guess_type(
|
||||
"__placeholder" + kwargs.get("file_extension", "")
|
||||
)
|
||||
|
||||
# Only accept text files
|
||||
if content_type is None:
|
||||
return None
|
||||
elif all(
|
||||
not content_type.lower().startswith(type_prefix)
|
||||
for type_prefix in ["text/", "application/json"]
|
||||
):
|
||||
return None
|
||||
|
||||
text_content = str(from_bytes(file_obj.read()).best())
|
||||
file_obj.close()
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=text_content,
|
||||
)
|
||||
return DocumentConverterResult(markdown=text_content)
|
||||
|
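Reviewer note: the new `PlainTextConverter.convert` trusts `stream_info.charset` when it is set and only falls back to detection otherwise. The sketch below shows the same two-branch decision using only the standard library. The fallback here (try UTF-8, then Latin-1) is a deliberately simpler substitute for `charset_normalizer.from_bytes(...).best()`, and `decode_text` is a hypothetical name.

```python
from typing import Optional


def decode_text(data: bytes, charset: Optional[str] = None) -> str:
    """Decode with the declared charset, else fall back to a simple guess."""
    if charset:
        return data.decode(charset)  # the declared charset wins

    # Stand-in for real detection: UTF-8 first, then Latin-1
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail
        return data.decode("latin-1")
```

The Latin-1 fallback always yields a string, but it may be the wrong one. That is the gap a statistical detector such as charset_normalizer is meant to narrow.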
||||
@@ -1,68 +1,86 @@
|
||||
import sys
|
||||
import base64
|
||||
import pptx
|
||||
import os
|
||||
import io
|
||||
import re
|
||||
import html
|
||||
|
||||
from typing import Union
|
||||
from typing import BinaryIO, Any
|
||||
from operator import attrgetter
|
||||
|
||||
from ._base import DocumentConverterResult, DocumentConverter
|
||||
from ._html_converter import HtmlConverter
|
||||
from ._converter_input import ConverterInput
|
||||
from ._llm_caption import llm_caption
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
import pptx
|
||||
except ImportError:
|
||||
# Preserve the error and stack trace for later
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
|
||||
class PptxConverter(HtmlConverter):
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.openxmlformats-officedocument.presentationml",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".pptx"]
|
||||
|
||||
|
||||
class PptxConverter(DocumentConverter):
|
||||
"""
|
||||
Converts PPTX files to Markdown. Supports headings, tables, and images with alt text.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._html_converter = HtmlConverter()
|
||||
|
||||
def _get_llm_description(
|
||||
self, llm_client, llm_model, image_blob, content_type, prompt=None
|
||||
):
|
||||
if prompt is None or prompt.strip() == "":
|
||||
prompt = "Write a detailed alt text for this image with less than 50 words."
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
image_base64 = base64.b64encode(image_blob).decode("utf-8")
|
||||
data_uri = f"data:{content_type};base64,{image_base64}"
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": data_uri,
|
||||
},
|
||||
},
|
||||
{"type": "text", "text": prompt},
|
||||
],
|
||||
}
|
||||
]
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
response = llm_client.chat.completions.create(
|
||||
model=llm_model, messages=messages
|
||||
)
|
||||
return response.choices[0].message.content
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a PPTX
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".pptx":
|
||||
return None
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Check the dependencies
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".pptx",
|
||||
feature="pptx",
|
||||
)
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
# Perform the conversion
|
||||
presentation = pptx.Presentation(file_stream)
|
||||
md_content = ""
|
||||
|
||||
file_obj = input.read_file(mode="rb")
|
||||
presentation = pptx.Presentation(file_obj)
|
||||
file_obj.close()
|
||||
|
||||
slide_num = 0
|
||||
for slide in presentation.slides:
|
||||
slide_num += 1
|
||||
@@ -70,64 +88,65 @@ class PptxConverter(HtmlConverter):
|
||||
md_content += f"\n\n<!-- Slide number: {slide_num} -->\n"
|
||||
|
||||
title = slide.shapes.title
|
||||
for shape in slide.shapes:
|
||||
|
||||
def get_shape_content(shape, **kwargs):
|
||||
nonlocal md_content
|
||||
# Pictures
|
||||
if self._is_picture(shape):
|
||||
# https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069
|
||||
|
||||
llm_description = None
|
||||
alt_text = None
|
||||
llm_description = ""
|
||||
alt_text = ""
|
||||
|
||||
# Potentially generate a description using an LLM
|
||||
llm_client = kwargs.get("llm_client")
|
||||
llm_model = kwargs.get("llm_model")
|
||||
if llm_client is not None and llm_model is not None:
|
||||
# Prepare a file_stream and stream_info for the image data
|
||||
image_filename = shape.image.filename
|
||||
image_extension = None
|
||||
if image_filename:
|
||||
image_extension = os.path.splitext(image_filename)[1]
|
||||
image_stream_info = StreamInfo(
|
||||
mimetype=shape.image.content_type,
|
||||
extension=image_extension,
|
||||
filename=image_filename,
|
||||
)
|
||||
|
||||
image_stream = io.BytesIO(shape.image.blob)
|
||||
|
||||
# Caption the image
|
||||
try:
|
||||
llm_description = self._get_llm_description(
|
||||
llm_client,
|
||||
llm_model,
|
||||
shape.image.blob,
|
||||
shape.image.content_type,
|
||||
llm_description = llm_caption(
|
||||
image_stream,
|
||||
image_stream_info,
|
||||
client=llm_client,
|
||||
model=llm_model,
|
||||
prompt=kwargs.get("llm_prompt"),
|
||||
)
|
||||
except Exception:
|
||||
# Unable to describe with LLM
|
||||
# Unable to generate a description
|
||||
pass
|
||||
|
||||
if not llm_description:
|
||||
# Also grab any description embedded in the deck
|
||||
try:
|
||||
alt_text = shape._element._nvXxPr.cNvPr.attrib.get(
|
||||
"descr", ""
|
||||
)
|
||||
alt_text = shape._element._nvXxPr.cNvPr.attrib.get("descr", "")
|
||||
except Exception:
|
||||
# Unable to get alt text
|
||||
pass
|
||||
|
||||
# Prepare the alt, escaping any special characters
|
||||
alt_text = "\n".join([llm_description, alt_text]) or shape.name
|
||||
alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
|
||||
alt_text = re.sub(r"\s+", " ", alt_text).strip()
|
||||
|
||||
# A placeholder name
|
||||
filename = re.sub(r"\W", "", shape.name) + ".jpg"
|
||||
md_content += (
|
||||
"\n\n"
|
||||
)
|
||||
md_content += "\n\n"
|
||||
|
||||
# Tables
|
||||
if self._is_table(shape):
|
||||
html_table = "<html><body><table>"
|
||||
first_row = True
|
||||
for row in shape.table.rows:
|
||||
html_table += "<tr>"
|
||||
for cell in row.cells:
|
||||
if first_row:
|
||||
html_table += "<th>" + html.escape(cell.text) + "</th>"
|
||||
else:
|
||||
html_table += "<td>" + html.escape(cell.text) + "</td>"
|
||||
html_table += "</tr>"
|
||||
first_row = False
|
||||
html_table += "</table></body></html>"
|
||||
md_content += (
|
||||
"\n" + self._convert(html_table).text_content.strip() + "\n"
|
||||
)
|
||||
md_content += self._convert_table_to_markdown(shape.table)
|
||||
|
||||
# Charts
|
||||
if shape.has_chart:
|
||||
@@ -140,6 +159,16 @@ class PptxConverter(HtmlConverter):
|
||||
else:
|
||||
md_content += shape.text + "\n"
|
||||
|
||||
# Group Shapes
|
||||
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
|
||||
sorted_shapes = sorted(shape.shapes, key=attrgetter("top", "left"))
|
||||
for subshape in sorted_shapes:
|
||||
get_shape_content(subshape, **kwargs)
|
||||
|
||||
sorted_shapes = sorted(slide.shapes, key=attrgetter("top", "left"))
|
||||
for shape in sorted_shapes:
|
||||
get_shape_content(shape, **kwargs)
|
||||
|
||||
md_content = md_content.strip()
|
||||
|
||||
if slide.has_notes_slide:
|
||||
@@ -149,10 +178,7 @@ class PptxConverter(HtmlConverter):
|
||||
md_content += notes_frame.text
|
||||
md_content = md_content.strip()
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content.strip(),
|
||||
)
|
||||
return DocumentConverterResult(markdown=md_content.strip())
|
||||
|
||||
def _is_picture(self, shape):
|
||||
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE:
|
||||
@@ -167,7 +193,25 @@ class PptxConverter(HtmlConverter):
|
||||
return True
|
||||
return False
|
||||
|
||||
def _convert_table_to_markdown(self, table):
|
||||
# Write the table as HTML, then convert it to Markdown
|
||||
html_table = "<html><body><table>"
|
||||
first_row = True
|
||||
for row in table.rows:
|
||||
html_table += "<tr>"
|
||||
for cell in row.cells:
|
||||
if first_row:
|
||||
html_table += "<th>" + html.escape(cell.text) + "</th>"
|
||||
else:
|
||||
html_table += "<td>" + html.escape(cell.text) + "</td>"
|
||||
html_table += "</tr>"
|
||||
first_row = False
|
||||
html_table += "</table></body></html>"
|
||||
|
||||
return self._html_converter.convert_string(html_table).markdown.strip() + "\n"
|
||||
|
||||
def _convert_chart_to_markdown(self, chart):
|
||||
try:
|
||||
md = "\n\n### Chart"
|
||||
if chart.has_title:
|
||||
md += f": {chart.chart_title.text_frame.text}"
|
||||
@@ -189,3 +233,10 @@ class PptxConverter(HtmlConverter):
|
||||
header = markdown_table[0]
|
||||
separator = "|" + "|".join(["---"] * len(data[0])) + "|"
|
||||
return md + "\n".join([header, separator] + markdown_table[1:])
|
||||
except ValueError as e:
|
||||
# Handle the specific error for unsupported chart types
|
||||
if "unsupported plot type" in str(e):
|
||||
return "\n\n[unsupported chart]\n\n"
|
||||
except Exception:
|
||||
# Catch any other exceptions that might occur
|
||||
return "\n\n[unsupported chart]\n\n"
|
||||
|
||||
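The PPTX changes above visit shapes sorted by `("top", "left")` and recurse into group shapes. The sketch below illustrates that traversal order only. `Shape` is a hypothetical stand-in for a python-pptx shape, and `reading_order` is an illustrative name that does not appear in the diff; unlike `get_shape_content`, it records leaf names rather than emitting Markdown.

```python
from dataclasses import dataclass, field
from operator import attrgetter
from typing import List


@dataclass
class Shape:
    """Hypothetical stand-in for a python-pptx shape."""

    name: str
    top: int
    left: int
    shapes: List["Shape"] = field(default_factory=list)  # non-empty for groups


def reading_order(shapes: List[Shape]) -> List[str]:
    """Return leaf shape names top-to-bottom, then left-to-right."""
    names: List[str] = []
    for shape in sorted(shapes, key=attrgetter("top", "left")):
        if shape.shapes:
            # A group: descend and order its children the same way.
            names.extend(reading_order(shape.shapes))
        else:
            names.append(shape.name)
    return names
```

Sorting on a `(top, left)` tuple means two shapes on the same row are ordered by their horizontal position, which approximates reading order for simple layouts.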
@@ -1,61 +1,102 @@
|
||||
from xml.dom import minidom
|
||||
from typing import Union
|
||||
from typing import BinaryIO, Any, Union
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
from .._stream_info import StreamInfo
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
|
||||
PRECISE_MIME_TYPE_PREFIXES = [
|
||||
"application/rss",
|
||||
"application/rss+xml",
|
||||
"application/atom",
|
||||
"application/atom+xml",
|
||||
]
|
||||
|
||||
PRECISE_FILE_EXTENSIONS = [".rss", ".atom"]
|
||||
|
||||
CANDIDATE_MIME_TYPE_PREFIXES = [
|
||||
"text/xml",
|
||||
"application/xml",
|
||||
]
|
||||
|
||||
CANDIDATE_FILE_EXTENSIONS = [
|
||||
".xml",
|
||||
]
|
||||
|
||||
|
||||
class RssConverter(DocumentConverter):
|
||||
"""Convert RSS / Atom type to markdown"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not RSS type
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".xml", ".rss", ".atom"]:
|
||||
return None
|
||||
# Read file object from input
|
||||
file_obj = input.read_file(mode="rb")
|
||||
# Check for precise mimetypes and file extensions
|
||||
if extension in PRECISE_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in PRECISE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
# Check for candidate mimetypes and file extensions (these require inspecting the XML)
|
||||
if extension in CANDIDATE_FILE_EXTENSIONS:
|
||||
return self._check_xml(file_stream)
|
||||
|
||||
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return self._check_xml(file_stream)
|
||||
|
||||
return False
|
||||
|
||||
def _check_xml(self, file_stream: BinaryIO) -> bool:
|
||||
cur_pos = file_stream.tell()
|
||||
try:
|
||||
doc = minidom.parse(file_obj)
|
||||
doc = minidom.parse(file_stream)
|
||||
return self._feed_type(doc) is not None
|
||||
except BaseException as _:
|
||||
return None
|
||||
file_obj.close()
|
||||
pass
|
||||
finally:
|
||||
file_stream.seek(cur_pos)
|
||||
return False
|
||||
|
||||
result = None
|
||||
def _feed_type(self, doc: Any) -> str | None:
|
||||
if doc.getElementsByTagName("rss"):
|
||||
# An RSS feed must have a root element of <rss>
|
||||
result = self._parse_rss_type(doc)
|
||||
return "rss"
|
||||
elif doc.getElementsByTagName("feed"):
|
||||
root = doc.getElementsByTagName("feed")[0]
|
||||
if root.getElementsByTagName("entry"):
|
||||
# An Atom feed must have a root element of <feed> and at least one <entry>
|
||||
result = self._parse_atom_type(doc)
|
||||
else:
|
||||
return None
|
||||
else:
|
||||
# not rss or atom
|
||||
return "atom"
|
||||
return None
|
||||
|
||||
return result
|
||||
def convert(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
doc = minidom.parse(file_stream)
|
||||
feed_type = self._feed_type(doc)
|
||||
|
||||
def _parse_atom_type(
|
||||
self, doc: minidom.Document
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
if feed_type == "rss":
|
||||
return self._parse_rss_type(doc)
|
||||
elif feed_type == "atom":
|
||||
return self._parse_atom_type(doc)
|
||||
else:
|
||||
raise ValueError("Unknown feed type")
|
||||
|
||||
def _parse_atom_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
||||
"""Parse the type of an Atom feed.
|
||||
|
||||
Returns None if the feed type is not recognized or something goes wrong.
|
||||
"""
|
||||
try:
|
||||
root = doc.getElementsByTagName("feed")[0]
|
||||
title = self._get_data_by_tag_name(root, "title")
|
||||
subtitle = self._get_data_by_tag_name(root, "subtitle")
|
||||
@@ -79,25 +120,20 @@ class RssConverter(DocumentConverter):
|
||||
md_text += self._parse_content(entry_content)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=md_text,
|
||||
title=title,
|
||||
text_content=md_text,
|
||||
)
|
||||
except BaseException as _:
|
||||
return None
|
||||
|
||||
def _parse_rss_type(
|
||||
self, doc: minidom.Document
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
def _parse_rss_type(self, doc: minidom.Document) -> DocumentConverterResult:
|
||||
"""Parse the type of an RSS feed.
|
||||
|
||||
Returns None if the feed type is not recognized or something goes wrong.
|
||||
"""
|
||||
try:
|
||||
root = doc.getElementsByTagName("rss")[0]
|
||||
channel = root.getElementsByTagName("channel")
|
||||
if not channel:
|
||||
return None
|
||||
channel = channel[0]
|
||||
channel_list = root.getElementsByTagName("channel")
|
||||
if not channel_list:
|
||||
raise ValueError("No channel found in RSS feed")
|
||||
channel = channel_list[0]
|
||||
channel_title = self._get_data_by_tag_name(channel, "title")
|
||||
channel_description = self._get_data_by_tag_name(channel, "description")
|
||||
items = channel.getElementsByTagName("item")
|
||||
@@ -105,8 +141,6 @@ class RssConverter(DocumentConverter):
|
||||
md_text = f"# {channel_title}\n"
|
||||
if channel_description:
|
||||
md_text += f"{channel_description}\n"
|
||||
if not items:
|
||||
items = []
|
||||
for item in items:
|
||||
title = self._get_data_by_tag_name(item, "title")
|
||||
description = self._get_data_by_tag_name(item, "description")
|
||||
@@ -123,12 +157,9 @@ class RssConverter(DocumentConverter):
|
||||
md_text += self._parse_content(content)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=md_text,
|
||||
title=channel_title,
|
||||
text_content=md_text,
|
||||
)
|
||||
except BaseException as _:
|
||||
print(traceback.format_exc())
|
||||
return None
|
||||
|
||||
def _parse_content(self, content: str) -> str:
|
||||
"""Parse the content of an RSS feed item"""
|
||||
@@ -150,5 +181,6 @@ class RssConverter(DocumentConverter):
|
||||
return None
|
||||
fc = nodes[0].firstChild
|
||||
if fc:
|
||||
if hasattr(fc, "data"):
|
||||
return fc.data
|
||||
return None
|
||||
|
||||
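The `_check_xml` helper above peeks at the stream to decide whether generic XML is really a feed, then restores the read position so `convert` can parse from the start. A self-contained sketch of that idea, using only the standard library, might look like this. The function names `feed_type` and `looks_like_feed` are illustrative, and the sketch catches `Exception` rather than `BaseException`.

```python
import io
from typing import Any, BinaryIO, Optional
from xml.dom import minidom


def feed_type(doc: Any) -> Optional[str]:
    """Classify a parsed document as 'rss', 'atom', or None."""
    if doc.getElementsByTagName("rss"):
        return "rss"
    feeds = doc.getElementsByTagName("feed")
    if feeds and feeds[0].getElementsByTagName("entry"):
        return "atom"
    return None


def looks_like_feed(file_stream: BinaryIO) -> bool:
    """Peek at the stream without consuming it from the caller's view."""
    cur_pos = file_stream.tell()
    try:
        doc = minidom.parse(file_stream)
        return feed_type(doc) is not None
    except Exception:
        # Malformed XML is simply "not a feed" for this check.
        return False
    finally:
        # Always rewind so a later parse starts where the caller left off.
        file_stream.seek(cur_pos)
```

The `finally` clause is what makes the check safe to run inside an `accepts`-style method: the stream position is restored whether parsing succeeds or fails.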
@@ -0,0 +1,55 @@
|
||||
import io
import sys
from typing import BinaryIO
from .._exceptions import MissingDependencyException

# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
    # Suppress some deprecation warnings from the speech_recognition library
    import warnings

    warnings.filterwarnings(
        "ignore", category=DeprecationWarning, module="speech_recognition"
    )
    warnings.filterwarnings(
        "ignore",
        category=SyntaxWarning,
        module="pydub",  # TODO: Migrate away from pydub
    )
    import speech_recognition as sr

    import pydub
except ImportError:
    # Preserve the error and stack trace for later
    _dependency_exc_info = sys.exc_info()


def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav") -> str:
    # Check for installed dependencies
    if _dependency_exc_info is not None:
        raise MissingDependencyException(
            "Speech transcription requires installing MarkItDown with the [audio-transcription] optional dependencies. E.g., `pip install markitdown[audio-transcription]` or `pip install markitdown[all]`"
        ) from _dependency_exc_info[
            1
        ].with_traceback(  # type: ignore[union-attr]
            _dependency_exc_info[2]
        )

    if audio_format in ["wav", "aiff", "flac"]:
        audio_source = file_stream
    elif audio_format in ["mp3", "mp4"]:
        audio_segment = pydub.AudioSegment.from_file(file_stream, format=audio_format)

        audio_source = io.BytesIO()
        audio_segment.export(audio_source, format="wav")
        audio_source.seek(0)
    else:
        raise ValueError(f"Unsupported audio format: {audio_format}")

    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_source) as source:
        audio = recognizer.record(source)
        transcript = recognizer.recognize_google(audio).strip()
        return "[No speech detected]" if transcript == "" else transcript
|
||||
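The module above records a failed import at load time and only raises when the feature is actually used. The sketch below reproduces that deferred-error pattern in isolation. `_hypothetical_optional_dependency`, `MissingDependencyError`, and `convert` are made-up names for illustration, not part of the library.

```python
import sys

# Try the optional import once, at module load, and remember any failure.
_dependency_exc_info = None
try:
    import _hypothetical_optional_dependency  # deliberately not installed
except ImportError:
    # Keep the exception and traceback so they can be chained later.
    _dependency_exc_info = sys.exc_info()


class MissingDependencyError(Exception):
    """Hypothetical stand-in for the library's own exception type."""


def convert(data: bytes) -> str:
    # Fail only when the feature is used, with the original error attached.
    if _dependency_exc_info is not None:
        raise MissingDependencyError(
            "This converter needs an optional dependency that is not installed."
        ) from _dependency_exc_info[1].with_traceback(_dependency_exc_info[2])
    return data.decode("utf-8")
```

Chaining with `from` keeps the original `ImportError` reachable as `__cause__`, so a user sees both the friendly message and the underlying failure.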
@@ -1,80 +0,0 @@
|
||||
from typing import Union
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._media_converter import MediaConverter
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
# Optional Transcription support
|
||||
IS_AUDIO_TRANSCRIPTION_CAPABLE = False
|
||||
try:
|
||||
import speech_recognition as sr
|
||||
|
||||
IS_AUDIO_TRANSCRIPTION_CAPABLE = True
|
||||
except ModuleNotFoundError:
|
||||
pass
|
||||
|
||||
|
||||
class WavConverter(MediaConverter):
|
||||
"""
|
||||
Converts WAV files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a WAV
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".wav":
|
||||
return None
|
||||
|
||||
# Bail if a local path was not provided
|
||||
if input.input_type != "filepath":
|
||||
return None
|
||||
local_path = input.filepath
|
||||
|
||||
md_content = ""
|
||||
|
||||
# Add metadata
|
||||
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
|
||||
if metadata:
|
||||
for f in [
|
||||
"Title",
|
||||
"Artist",
|
||||
"Author",
|
||||
"Band",
|
||||
"Album",
|
||||
"Genre",
|
||||
"Track",
|
||||
"DateTimeOriginal",
|
||||
"CreateDate",
|
||||
"Duration",
|
||||
]:
|
||||
if f in metadata:
|
||||
md_content += f"{f}: {metadata[f]}\n"
|
||||
|
||||
# Transcribe
|
||||
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
|
||||
try:
|
||||
transcript = self._transcribe_audio(local_path)
|
||||
md_content += "\n\n### Audio Transcript:\n" + (
|
||||
"[No speech detected]" if transcript == "" else transcript
|
||||
)
|
||||
except Exception:
|
||||
md_content += (
|
||||
"\n\n### Audio Transcript:\nError. Could not transcribe this audio."
|
||||
)
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content.strip(),
|
||||
)
|
||||
|
||||
def _transcribe_audio(self, local_path) -> str:
|
||||
recognizer = sr.Recognizer()
|
||||
with sr.AudioFile(local_path) as source:
|
||||
audio = recognizer.record(source)
|
||||
return recognizer.recognize_google(audio).strip()
|
||||
@@ -1,37 +1,63 @@
|
||||
import io
|
||||
import re
|
||||
import bs4
|
||||
from typing import Any, BinaryIO, Optional
|
||||
|
||||
from typing import Any, Union
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/html",
|
||||
"application/xhtml",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
|
||||
class WikipediaConverter(DocumentConverter):
|
||||
"""Handle Wikipedia pages separately, focusing only on the main document content."""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
"""
|
||||
Make sure we're dealing with HTML content *from* Wikipedia.
|
||||
"""
|
||||
|
||||
url = stream_info.url or ""
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia\.org\/", url):
|
||||
# Not a Wikipedia URL
|
||||
return False
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
# Not HTML content
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not Wikipedia
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".html", ".htm"]:
|
||||
return None
|
||||
url = kwargs.get("url", "")
|
||||
if not re.search(r"^https?:\/\/[a-zA-Z]{2,3}\.wikipedia.org\/", url):
|
||||
return None
|
||||
|
||||
# Parse the file
|
||||
soup = None
|
||||
file_obj = input.read_file(mode="rt", encoding="utf-8")
|
||||
soup = BeautifulSoup(file_obj.read(), "html.parser")
|
||||
file_obj.close()
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Parse the stream
|
||||
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
|
||||
soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
|
||||
|
||||
# Remove javascript and style blocks
|
||||
for script in soup(["script", "style"]):
|
||||
@@ -46,9 +72,8 @@ class WikipediaConverter(DocumentConverter):
|
||||
|
||||
if body_elm:
|
||||
# What's the title
|
||||
if title_elm and len(title_elm) > 0:
|
||||
main_title = title_elm.string # type: ignore
|
||||
assert isinstance(main_title, str)
|
||||
if title_elm and isinstance(title_elm, bs4.Tag):
|
||||
main_title = title_elm.string
|
||||
|
||||
# Convert the page
|
||||
webpage_text = f"# {main_title}\n\n" + _CustomMarkdownify().convert_soup(
|
||||
@@ -58,6 +83,6 @@ class WikipediaConverter(DocumentConverter):
|
||||
webpage_text = _CustomMarkdownify().convert_soup(soup)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=webpage_text,
|
||||
title=main_title,
|
||||
text_content=webpage_text,
|
||||
)
|
||||
|
||||
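The Wikipedia `accepts` method above combines a URL test with the extension and MIME-prefix checks shared by the other converters. As a rough, standalone sketch of that decision: the function below takes plain strings instead of a `StreamInfo`, the name `accepts_wikipedia` is hypothetical, and the pattern escapes both dots in the host name, which is an assumption of this sketch.

```python
import re
from typing import Optional

ACCEPTED_MIME_TYPE_PREFIXES = ["text/html", "application/xhtml"]
ACCEPTED_FILE_EXTENSIONS = [".html", ".htm"]

# Assumption: both dots are escaped so only the literal host name matches.
WIKIPEDIA_URL = re.compile(r"^https?://[a-zA-Z]{2,3}\.wikipedia\.org/")


def accepts_wikipedia(
    url: Optional[str], mimetype: Optional[str], extension: Optional[str]
) -> bool:
    """Accept only HTML content served from a Wikipedia language subdomain."""
    if not WIKIPEDIA_URL.search(url or ""):
        return False  # Not a Wikipedia URL
    if (extension or "").lower() in ACCEPTED_FILE_EXTENSIONS:
        return True
    mimetype = (mimetype or "").lower()
    # Prefix matching tolerates parameters such as "; charset=utf-8".
    return any(mimetype.startswith(p) for p in ACCEPTED_MIME_TYPE_PREFIXES)
```

Checking the URL first means a non-Wikipedia page falls through to the generic HTML converter even though its MIME type would otherwise match.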
@@ -1,70 +1,153 @@
|
||||
from typing import Union
|
||||
|
||||
import pandas as pd
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
import sys
|
||||
from typing import BinaryIO, Any
|
||||
from ._html_converter import HtmlConverter
|
||||
from ._converter_input import ConverterInput
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
|
||||
from .._stream_info import StreamInfo
|
||||
|
||||
# Try loading optional (but in this case, required) dependencies
|
||||
# Save reporting of any exceptions for later
|
||||
_xlsx_dependency_exc_info = None
|
||||
try:
|
||||
import pandas as pd
|
||||
import openpyxl
|
||||
except ImportError:
|
||||
_xlsx_dependency_exc_info = sys.exc_info()
|
||||
|
||||
_xls_dependency_exc_info = None
|
||||
try:
|
||||
import pandas as pd
|
||||
import xlrd
|
||||
except ImportError:
|
||||
_xls_dependency_exc_info = sys.exc_info()
|
||||
|
||||
ACCEPTED_XLSX_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
|
||||
]
|
||||
ACCEPTED_XLSX_FILE_EXTENSIONS = [".xlsx"]
|
||||
|
||||
ACCEPTED_XLS_MIME_TYPE_PREFIXES = [
|
||||
"application/vnd.ms-excel",
|
||||
"application/excel",
|
||||
]
|
||||
ACCEPTED_XLS_FILE_EXTENSIONS = [".xls"]
|
||||
|
||||
|
||||
class XlsxConverter(HtmlConverter):
|
||||
class XlsxConverter(DocumentConverter):
|
||||
"""
|
||||
Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._html_converter = HtmlConverter()
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_XLSX_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_XLSX_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a XLSX
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".xlsx":
|
||||
return None
|
||||
|
||||
file_obj = input.read_file(mode="rb")
|
||||
sheets = pd.read_excel(file_obj, sheet_name=None, engine="openpyxl")
|
||||
file_obj.close()
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Check the dependencies
|
||||
if _xlsx_dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".xlsx",
|
||||
feature="xlsx",
|
||||
)
|
||||
) from _xlsx_dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_xlsx_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
sheets = pd.read_excel(file_stream, sheet_name=None, engine="openpyxl")
|
||||
md_content = ""
|
||||
for s in sheets:
|
||||
md_content += f"## {s}\n"
|
||||
html_content = sheets[s].to_html(index=False)
|
||||
md_content += self._convert(html_content).text_content.strip() + "\n\n"
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content.strip(),
|
||||
md_content += (
|
||||
self._html_converter.convert_string(html_content).markdown.strip()
|
||||
+ "\n\n"
|
||||
)
|
||||
|
||||
return DocumentConverterResult(markdown=md_content.strip())
|
||||
|
||||
class XlsConverter(HtmlConverter):
|
||||
|
||||
class XlsConverter(DocumentConverter):
|
||||
"""
|
||||
Converts XLS files to Markdown, with each sheet presented as a separate Markdown table.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._html_converter = HtmlConverter()
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_XLS_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_XLS_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a XLS
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".xls":
|
||||
return None
|
||||
|
||||
file_obj = input.read_file(mode="rb")
|
||||
sheets = pd.read_excel(file_obj, sheet_name=None, engine="xlrd")
|
||||
file_obj.close()
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Load the dependencies
|
||||
if _xls_dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
MISSING_DEPENDENCY_MESSAGE.format(
|
||||
converter=type(self).__name__,
|
||||
extension=".xls",
|
||||
feature="xls",
|
||||
)
|
||||
) from _xls_dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_xls_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
sheets = pd.read_excel(file_stream, sheet_name=None, engine="xlrd")
|
||||
md_content = ""
|
||||
for s in sheets:
|
||||
md_content += f"## {s}\n"
|
||||
html_content = sheets[s].to_html(index=False)
|
||||
md_content += self._convert(html_content).text_content.strip() + "\n\n"
|
||||
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=md_content.strip(),
|
||||
md_content += (
|
||||
self._html_converter.convert_string(html_content).markdown.strip()
|
||||
+ "\n\n"
|
||||
)
|
||||
|
||||
return DocumentConverterResult(markdown=md_content.strip())
|
||||
|
||||
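Both spreadsheet converters above emit one `## <sheet name>` heading per sheet followed by that sheet's table. The sketch below shows the same loop shape without pandas. It writes the pipe table directly rather than going through `to_html` and the HTML converter as the diff does, and `rows_to_markdown` and `sheets_to_markdown` are hypothetical helper names.

```python
from typing import Dict, List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table; the first row is treated as the header."""
    if not rows:
        return ""
    header = "| " + " | ".join(rows[0]) + " |"
    separator = "| " + " | ".join(["---"] * len(rows[0])) + " |"
    body = ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join([header, separator] + body)


def sheets_to_markdown(sheets: Dict[str, List[List[str]]]) -> str:
    """Emit one '## name' section per sheet, mirroring the converter loop."""
    md_content = ""
    for name, rows in sheets.items():
        md_content += f"## {name}\n"
        md_content += rows_to_markdown(rows).strip() + "\n\n"
    return md_content.strip()
```

A call such as `sheets_to_markdown({"Sheet1": [["a", "b"], ["1", "2"]]})` yields a heading line followed by a three-line table.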
@@ -1,72 +1,121 @@
|
||||
import re
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import io
|
||||
import re
|
||||
import bs4
|
||||
import warnings
|
||||
from typing import Any, BinaryIO, Optional, Dict, List, Union
|
||||
from urllib.parse import parse_qs, urlparse, unquote
|
||||
|
||||
from typing import Any, Union, Dict, List
|
||||
from urllib.parse import parse_qs, urlparse
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from ._markdownify import _CustomMarkdownify
|
||||
|
||||
# Optional YouTube transcription support
|
||||
try:
|
||||
warnings.filterwarnings(
|
||||
"ignore",
|
||||
category=SyntaxWarning,
|
||||
module="youtube_transcript_api", # Patch submitted to youtube-transcript-api
|
||||
)
|
||||
from youtube_transcript_api import YouTubeTranscriptApi
|
||||
|
||||
IS_YOUTUBE_TRANSCRIPT_CAPABLE = True
|
||||
except ModuleNotFoundError:
|
||||
pass
|
||||
IS_YOUTUBE_TRANSCRIPT_CAPABLE = False
|
||||
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"text/html",
|
||||
"application/xhtml",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [
|
||||
".html",
|
||||
".htm",
|
||||
]
|
||||
|
||||
|
||||
class YouTubeConverter(DocumentConverter):
|
||||
"""Handle YouTube specially, focusing on the video title, description, and transcript."""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
"""
|
||||
Make sure we're dealing with HTML content *from* YouTube.
|
||||
"""
|
||||
url = stream_info.url or ""
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
url = unquote(url)
|
||||
url = url.replace(r"\?", "?").replace(r"\=", "=")
|
||||
|
||||
if not url.startswith("https://www.youtube.com/watch?"):
|
||||
# Not a YouTube URL
|
||||
return False
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
# Not HTML content
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not YouTube
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() not in [".html", ".htm"]:
|
||||
return None
|
||||
url = kwargs.get("url", "")
|
||||
if not url.startswith("https://www.youtube.com/watch?"):
|
||||
return None
|
||||
|
||||
# Parse the file
|
||||
soup = None
|
||||
file_obj = input.read_file(mode="rt", encoding="utf-8")
|
||||
soup = BeautifulSoup(file_obj.read(), "html.parser")
|
||||
file_obj.close()
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
# Parse the stream
|
||||
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
|
||||
soup = bs4.BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
|
||||
|
||||
# Read the meta tags
|
||||
assert soup.title is not None and soup.title.string is not None
|
||||
metadata: Dict[str, str] = {"title": soup.title.string}
|
||||
metadata: Dict[str, str] = {}
|
||||
|
||||
if soup.title and soup.title.string:
|
||||
metadata["title"] = soup.title.string
|
||||
|
||||
for meta in soup(["meta"]):
|
||||
if not isinstance(meta, bs4.Tag):
|
||||
continue
|
||||
|
||||
for a in meta.attrs:
|
||||
if a in ["itemprop", "property", "name"]:
|
||||
metadata[meta[a]] = meta.get("content", "")
|
||||
key = str(meta.get(a, ""))
|
||||
content = str(meta.get("content", ""))
|
||||
if key and content: # Only add non-empty content
|
||||
metadata[key] = content
|
||||
break
|
||||
|
||||
# We can also try to read the full description. This is more prone to breaking, since it reaches into the page implementation
|
||||
# Try reading the description
|
||||
try:
|
||||
for script in soup(["script"]):
|
||||
content = script.text
|
||||
if not isinstance(script, bs4.Tag):
|
||||
continue
|
||||
if not script.string: # Skip empty scripts
|
||||
continue
|
||||
content = script.string
|
||||
if "ytInitialData" in content:
|
||||
lines = re.split(r"\r?\n", content)
|
||||
obj_start = lines[0].find("{")
|
||||
obj_end = lines[0].rfind("}")
|
||||
if obj_start >= 0 and obj_end >= 0:
|
||||
data = json.loads(lines[0][obj_start : obj_end + 1])
|
||||
attrdesc = self._findKey(data, "attributedDescriptionBodyText") # type: ignore
|
||||
if attrdesc:
|
||||
metadata["description"] = str(attrdesc["content"])
|
||||
match = re.search(r"var ytInitialData = ({.*?});", content)
|
||||
if match:
|
||||
data = json.loads(match.group(1))
|
||||
attrdesc = self._findKey(data, "attributedDescriptionBodyText")
|
||||
if attrdesc and isinstance(attrdesc, dict):
|
||||
metadata["description"] = str(attrdesc.get("content", ""))
|
||||
break
|
||||
except Exception:
|
||||
except Exception as e:
|
||||
print(f"Error extracting description: {e}")
|
||||
pass
|
||||
|
||||
# Start preparing the page
|
||||
@@ -100,32 +149,40 @@ class YouTubeConverter(DocumentConverter):
|
||||
|
||||
if IS_YOUTUBE_TRANSCRIPT_CAPABLE:
|
||||
transcript_text = ""
|
||||
parsed_url = urlparse(url) # type: ignore
|
||||
parsed_url = urlparse(stream_info.url) # type: ignore
|
||||
params = parse_qs(parsed_url.query) # type: ignore
|
||||
if "v" in params:
|
||||
assert isinstance(params["v"][0], str)
|
||||
if "v" in params and params["v"][0]:
|
||||
video_id = str(params["v"][0])
|
||||
try:
|
||||
youtube_transcript_languages = kwargs.get(
|
||||
"youtube_transcript_languages", ("en",)
|
||||
)
|
||||
# Must be a single transcript.
|
||||
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages) # type: ignore
|
||||
transcript_text = " ".join([part["text"] for part in transcript]) # type: ignore
|
||||
# Retry the transcript fetching operation
|
||||
transcript = self._retry_operation(
|
||||
lambda: YouTubeTranscriptApi.get_transcript(
|
||||
video_id, languages=youtube_transcript_languages
|
||||
),
|
||||
retries=3, # Retry 3 times
|
||||
delay=2, # 2 seconds delay between retries
|
||||
)
|
||||
if transcript:
|
||||
transcript_text = " ".join(
|
||||
[part["text"] for part in transcript]
|
||||
) # type: ignore
|
||||
# Alternative formatting:
|
||||
# formatter = TextFormatter()
|
||||
# formatter.format_transcript(transcript)
|
||||
except Exception:
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f"Error fetching transcript: {e}")
|
||||
if transcript_text:
|
||||
webpage_text += f"\n### Transcript\n{transcript_text}\n"
|
||||
|
||||
title = title if title else soup.title.string
|
||||
title = title if title else (soup.title.string if soup.title else "")
|
||||
assert isinstance(title, str)
|
||||
|
||||
return DocumentConverterResult(
|
||||
markdown=webpage_text,
|
||||
title=title,
|
||||
text_content=webpage_text,
|
||||
)
|
||||
|
||||
def _get(
|
||||
@@ -134,23 +191,37 @@ class YouTubeConverter(DocumentConverter):
|
||||
keys: List[str],
|
||||
default: Union[str, None] = None,
|
||||
) -> Union[str, None]:
|
||||
"""Get first non-empty value from metadata matching given keys."""
|
||||
for k in keys:
|
||||
if k in metadata:
|
||||
return metadata[k]
|
||||
return default
|
||||
|
||||
def _findKey(self, json: Any, key: str) -> Union[str, None]: # TODO: Fix json type
|
||||
"""Recursively search for a key in nested dictionary/list structures."""
|
||||
if isinstance(json, list):
|
||||
for elm in json:
|
||||
ret = self._findKey(elm, key)
|
||||
if ret is not None:
|
||||
return ret
|
||||
elif isinstance(json, dict):
|
||||
for k in json:
|
||||
for k, v in json.items():
|
||||
if k == key:
|
||||
return json[k]
|
||||
else:
|
||||
ret = self._findKey(json[k], key)
|
||||
if ret is not None:
|
||||
return ret
|
||||
if result := self._findKey(v, key):
|
||||
return result
|
||||
return None
|
||||
|
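The revised `_findKey` appears to test the recursive result for truthiness (`if result := ...`), so a nested match whose value is falsy (an empty string, `0`, an empty dict) would be treated as "not found". The following is a minimal standalone sketch of the same depth-first search that avoids this by using a sentinel. The names `find_key`, `_find`, and `_MISSING` are hypothetical and not part of the converter.

```python
from typing import Any, Optional

# Sentinel so that falsy values (0, "", {}) still count as "found".
_MISSING = object()


def _find(data: Any, key: str) -> Any:
    """Depth-first search; returns _MISSING when the key is absent."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                return v
            found = _find(v, key)
            if found is not _MISSING:
                return found
    elif isinstance(data, list):
        for item in data:
            found = _find(item, key)
            if found is not _MISSING:
                return found
    return _MISSING


def find_key(data: Any, key: str, default: Optional[Any] = None) -> Any:
    """Return the first value stored under `key`, or `default` if absent."""
    result = _find(data, key)
    return default if result is _MISSING else result
```

With this shape, `find_key({"x": {"n": 0}}, "n")` returns the stored `0` rather than falling through to the default.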
||||
def _retry_operation(self, operation, retries=3, delay=2):
|
||||
"""Retries the operation if it fails."""
|
||||
attempt = 0
|
||||
while attempt < retries:
|
||||
try:
|
||||
return operation() # Attempt the operation
|
||||
except Exception as e:
|
||||
print(f"Attempt {attempt + 1} failed: {e}")
|
||||
if attempt < retries - 1:
|
||||
time.sleep(delay) # Wait before retrying
|
||||
attempt += 1
|
||||
# If all attempts fail, raise a generic exception (earlier errors are only printed)
|
||||
raise Exception(f"Operation failed after {retries} attempts.")
|
||||
|
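As written, `_retry_operation` prints each failure and finally raises a new generic `Exception`, so the original error is not attached to what the caller sees. Below is a hedged sketch of one alternative that chains the last error onto the final exception. The function name `retry_operation` and the choice of `RuntimeError` are assumptions for illustration, not the converter's behavior.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_operation(
    operation: Callable[[], T], retries: int = 3, delay: float = 2.0
) -> T:
    """Call `operation` up to `retries` times, sleeping `delay` seconds between tries."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt will follow.
            if attempt < retries - 1:
                time.sleep(delay)
    # Chain the last failure so it is visible as __cause__ in the traceback.
    raise RuntimeError(f"Operation failed after {retries} attempts.") from last_error
```

Chaining with `from` keeps the summary message while letting callers inspect `__cause__` to see why the final attempt failed.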
||||
@@ -1,10 +1,23 @@
|
||||
import os
|
||||
import sys
|
||||
import zipfile
|
||||
import shutil
|
||||
from typing import Any, Union
|
||||
import io
|
||||
import os
|
||||
|
||||
from ._base import DocumentConverter, DocumentConverterResult
|
||||
from ._converter_input import ConverterInput
|
||||
from typing import BinaryIO, Any, TYPE_CHECKING
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import UnsupportedFormatException, FileConversionException
|
||||
|
||||
# Break otherwise circular import for type hinting
|
||||
if TYPE_CHECKING:
|
||||
from .._markitdown import MarkItDown
|
||||
|
||||
ACCEPTED_MIME_TYPE_PREFIXES = [
|
||||
"application/zip",
|
||||
]
|
||||
|
||||
ACCEPTED_FILE_EXTENSIONS = [".zip"]
|
||||
|
||||
|
||||
class ZipConverter(DocumentConverter):
|
||||
@@ -47,104 +60,58 @@ class ZipConverter(DocumentConverter):
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, priority: float = DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT
|
||||
self,
|
||||
*,
|
||||
markitdown: "MarkItDown",
|
||||
):
|
||||
super().__init__(priority=priority)
|
||||
super().__init__()
|
||||
self._markitdown = markitdown
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> bool:
|
||||
mimetype = (stream_info.mimetype or "").lower()
|
||||
extension = (stream_info.extension or "").lower()
|
||||
|
||||
if extension in ACCEPTED_FILE_EXTENSIONS:
|
||||
return True
|
||||
|
||||
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
|
||||
if mimetype.startswith(prefix):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def convert(
|
||||
self, input: ConverterInput, **kwargs: Any
|
||||
) -> Union[None, DocumentConverterResult]:
|
||||
# Bail if not a ZIP
|
||||
extension = kwargs.get("file_extension", "")
|
||||
if extension.lower() != ".zip":
|
||||
return None
|
||||
|
||||
# Bail if a local path is not provided
|
||||
if input.input_type != "filepath":
|
||||
return None
|
||||
local_path = input.filepath
|
||||
|
||||
# Get parent converters list if available
|
||||
parent_converters = kwargs.get("_parent_converters", [])
|
||||
if not parent_converters:
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=f"[ERROR] No converters available to process zip contents from: {local_path}",
|
||||
)
|
||||
|
||||
extracted_zip_folder_name = (
|
||||
f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
|
||||
)
|
||||
extraction_dir = os.path.normpath(
|
||||
os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
|
||||
)
|
||||
md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any, # Options to pass to the converter
|
||||
) -> DocumentConverterResult:
|
||||
file_path = stream_info.url or stream_info.local_path or stream_info.filename
|
||||
md_content = f"Content from the zip file `{file_path}`:\n\n"
|
||||
|
||||
with zipfile.ZipFile(file_stream, "r") as zipObj:
|
||||
for name in zipObj.namelist():
|
||||
try:
|
||||
# Extract the zip file safely
|
||||
with zipfile.ZipFile(local_path, "r") as zipObj:
|
||||
# Safeguard against path traversal
|
||||
for member in zipObj.namelist():
|
||||
member_path = os.path.normpath(os.path.join(extraction_dir, member))
|
||||
if (
|
||||
not os.path.commonprefix([extraction_dir, member_path])
|
||||
== extraction_dir
|
||||
):
|
||||
raise ValueError(
|
||||
f"Path traversal detected in zip file: {member}"
|
||||
z_file_stream = io.BytesIO(zipObj.read(name))
|
||||
z_file_stream_info = StreamInfo(
|
||||
extension=os.path.splitext(name)[1],
|
||||
filename=os.path.basename(name),
|
||||
)
|
||||
|
||||
# Extract all files safely
|
||||
zipObj.extractall(path=extraction_dir)
|
||||
|
||||
# Process each extracted file
|
||||
for root, dirs, files in os.walk(extraction_dir):
|
||||
for name in files:
|
||||
file_path = os.path.join(root, name)
|
||||
relative_path = os.path.relpath(file_path, extraction_dir)
|
||||
|
||||
# Get file extension
|
||||
_, file_extension = os.path.splitext(name)
|
||||
|
||||
# Update kwargs for the file
|
||||
file_kwargs = kwargs.copy()
|
||||
file_kwargs["file_extension"] = file_extension
|
||||
file_kwargs["_parent_converters"] = parent_converters
|
||||
|
||||
# Try converting the file using available converters
|
||||
for converter in parent_converters:
|
||||
# Skip the zip converter to avoid infinite recursion
|
||||
if isinstance(converter, ZipConverter):
|
||||
continue
|
||||
|
||||
# Create a ConverterInput for the parent converter and attempt conversion
|
||||
input = ConverterInput(
|
||||
input_type="filepath", filepath=file_path
|
||||
result = self._markitdown.convert_stream(
|
||||
stream=z_file_stream,
|
||||
stream_info=z_file_stream_info,
|
||||
)
|
||||
result = converter.convert(input, **file_kwargs)
|
||||
if result is not None:
|
||||
md_content += f"\n## File: {relative_path}\n\n"
|
||||
md_content += result.text_content + "\n\n"
|
||||
break
|
||||
md_content += f"## File: {name}\n\n"
|
||||
md_content += result.markdown + "\n\n"
|
||||
except UnsupportedFormatException:
|
||||
pass
|
||||
except FileConversionException:
|
||||
pass
|
||||
|
||||
# Clean up extracted files if specified
|
||||
if kwargs.get("cleanup_extracted", True):
|
||||
shutil.rmtree(extraction_dir)
|
||||
|
||||
return DocumentConverterResult(title=None, text_content=md_content.strip())
|
||||
|
||||
except zipfile.BadZipFile:
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
|
||||
)
|
||||
except ValueError as ve:
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=f"[ERROR] Security error in zip file {local_path}: {str(ve)}",
|
||||
)
|
||||
except Exception as e:
|
||||
return DocumentConverterResult(
|
||||
title=None,
|
||||
text_content=f"[ERROR] Failed to process zip file {local_path}: {str(e)}",
|
||||
)
|
||||
return DocumentConverterResult(markdown=md_content.strip())
|
||||
|
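The rewritten `ZipConverter` reads each archive member into memory instead of extracting to disk, which sidesteps the path-traversal checks and the cleanup step of the old version. The core of that approach can be sketched with only the standard library. The helper name `iter_zip_members` is hypothetical, and the real converter hands each stream to `MarkItDown.convert_stream` rather than yielding it.

```python
import io
import os
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_zip_members(file_stream: BinaryIO) -> Iterator[Tuple[str, str, io.BytesIO]]:
    """Yield (name, extension, in-memory stream) for each file in the archive."""
    with zipfile.ZipFile(file_stream, "r") as archive:
        for info in archive.infolist():
            # Directory entries carry no content to convert.
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1]
            yield info.filename, extension, io.BytesIO(archive.read(info))
```

Because nothing is written to the filesystem, member names such as `../evil.txt` are only ever used as labels. Reading every member fully into memory is a trade-off that may not suit very large archives.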
||||
232
packages/markitdown/tests/_test_vectors.py
Normal file
@@ -0,0 +1,232 @@
|
||||
import dataclasses
|
||||
from typing import List
|
||||
|
||||
|
||||
@dataclasses.dataclass(frozen=True, kw_only=True)
|
||||
class FileTestVector(object):
|
||||
filename: str
|
||||
mimetype: str | None
|
||||
charset: str | None
|
||||
url: str | None
|
||||
must_include: List[str]
|
||||
must_not_include: List[str]
|
||||
|
||||
|
||||
GENERAL_TEST_VECTORS = [
|
||||
FileTestVector(
|
||||
filename="test.docx",
|
||||
mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.xlsx",
|
||||
mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"## 09060124-b5e7-4717-9d07-3c046eb",
|
||||
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
|
||||
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.xls",
|
||||
mimetype="application/vnd.ms-excel",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"## 09060124-b5e7-4717-9d07-3c046eb",
|
||||
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
|
||||
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.pptx",
|
||||
mimetype="application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
|
||||
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
|
||||
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
|
||||
"1b92870d-e3b5-4e65-8153-919f4ff45592",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
|
||||
"2003", # chart value
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_outlook_msg.msg",
|
||||
mimetype="application/vnd.ms-outlook",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"# Email Message",
|
||||
"**From:** test.sender@example.com",
|
||||
"**To:** test.recipient@example.com",
|
||||
"**Subject:** Test Email Message",
|
||||
"## Content",
|
||||
"This is the body of the test email message",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.pdf",
|
||||
mimetype="application/pdf",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"While there is contemporaneous exploration of multi-agent approaches"
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_blog.html",
|
||||
mimetype="text/html",
|
||||
charset="utf-8",
|
||||
url="https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math",
|
||||
must_include=[
|
||||
"Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
|
||||
"an example where high cost can easily prevent a generic complex",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_wikipedia.html",
|
||||
mimetype="text/html",
|
||||
charset="utf-8",
|
||||
url="https://en.wikipedia.org/wiki/Microsoft",
|
||||
must_include=[
|
||||
"Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
|
||||
'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
|
||||
],
|
||||
must_not_include=[
|
||||
"You are encouraged to create an account and log in",
|
||||
"154 languages",
|
||||
"move to sidebar",
|
||||
],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_serp.html",
|
||||
mimetype="text/html",
|
||||
charset="utf-8",
|
||||
url="https://www.bing.com/search?q=microsoft+wikipedia",
|
||||
must_include=[
|
||||
"](https://en.wikipedia.org/wiki/Microsoft",
|
||||
"Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
|
||||
"1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
|
||||
],
|
||||
must_not_include=[
|
||||
"https://www.bing.com/ck/a?!&&p=",
|
||||
"data:image/svg+xml,%3Csvg%20width%3D",
|
||||
],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_mskanji.csv",
|
||||
mimetype="text/csv",
|
||||
charset="cp932",
|
||||
url=None,
|
||||
must_include=[
|
||||
"名前,年齢,住所",
|
||||
"佐藤太郎,30,東京",
|
||||
"三木英子,25,大阪",
|
||||
"髙橋淳,35,名古屋",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.json",
|
||||
mimetype="application/json",
|
||||
charset="ascii",
|
||||
url=None,
|
||||
must_include=[
|
||||
"5b64c88c-b3c3-4510-bcb8-da0b200602d8",
|
||||
"9700dc99-6685-40b4-9a3a-5e406dcb37f3",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_rss.xml",
|
||||
mimetype="text/xml",
|
||||
charset="utf-8",
|
||||
url=None,
|
||||
must_include=[
|
||||
"# The Official Microsoft Blog",
|
||||
"## Ignite 2024: Why nearly 70% of the Fortune 500 now use Microsoft 365 Copilot",
|
||||
"In the case of AI, it is absolutely true that the industry is moving incredibly fast",
|
||||
],
|
||||
must_not_include=["<rss", "<feed"],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_notebook.ipynb",
|
||||
mimetype="application/json",
|
||||
charset="ascii",
|
||||
url=None,
|
||||
must_include=[
|
||||
"# Test Notebook",
|
||||
"```python",
|
||||
'print("markitdown")',
|
||||
"```",
|
||||
"## Code Cell Below",
|
||||
],
|
||||
must_not_include=[
|
||||
"nbformat",
|
||||
"nbformat_minor",
|
||||
],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test_files.zip",
|
||||
mimetype="application/zip",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
|
||||
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
|
||||
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
|
||||
"1b92870d-e3b5-4e65-8153-919f4ff45592",
|
||||
"## 09060124-b5e7-4717-9d07-3c046eb",
|
||||
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
|
||||
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
|
||||
"Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
|
||||
'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
FileTestVector(
|
||||
filename="test.epub",
|
||||
mimetype="application/epub+zip",
|
||||
charset=None,
|
||||
url=None,
|
||||
must_include=[
|
||||
"**Authors:** Test Author",
|
||||
"A test EPUB document for MarkItDown testing",
|
||||
"# Chapter 1: Test Content",
|
||||
"This is a **test** paragraph with some formatting",
|
||||
"* A bullet point",
|
||||
"* Another point",
|
||||
"# Chapter 2: More Content",
|
||||
"*different* style",
|
||||
"> This is a blockquote for testing",
|
||||
],
|
||||
must_not_include=[],
|
||||
),
|
||||
]
|
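Each `FileTestVector` pairs a file with strings that must, and must not, appear in the converted output. A minimal sketch of how such a vector could be checked is shown below. `TextExpectation` and `check_output` are hypothetical names, and the dataclass omits `kw_only=True` so that the sketch also runs on Python versions before 3.10.

```python
import dataclasses
from typing import List


@dataclasses.dataclass(frozen=True)
class TextExpectation:
    """Subset of the test-vector fields needed to check converted text."""

    must_include: List[str]
    must_not_include: List[str]


def check_output(output: str, expectation: TextExpectation) -> List[str]:
    """Return a list of human-readable problems; an empty list means success."""
    problems = []
    for text in expectation.must_include:
        if text not in output:
            problems.append(f"missing: {text!r}")
    for text in expectation.must_not_include:
        if text in output:
            problems.append(f"unexpected: {text!r}")
    return problems
```

Collecting all problems, rather than stopping at the first failed assertion, can make it easier to see how far a converter's output has drifted from a vector.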
||||
@@ -1,119 +0,0 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import os
|
||||
import subprocess
|
||||
import pytest
|
||||
from markitdown import __version__
|
||||
|
||||
try:
|
||||
from .test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
|
||||
except ImportError:
|
||||
from test_markitdown import TEST_FILES_DIR, DOCX_TEST_STRINGS
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def shared_tmp_dir(tmp_path_factory):
|
||||
return tmp_path_factory.mktemp("pytest_tmp")
|
||||
|
||||
|
||||
def test_version(shared_tmp_dir) -> None:
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", "--version"], capture_output=True, text=True
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"
|
||||
|
||||
|
||||
def test_invalid_flag(shared_tmp_dir) -> None:
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
|
||||
)
|
||||
|
||||
assert result.returncode != 0, f"CLI exited with error: {result.stderr}"
|
||||
assert (
|
||||
"unrecognized arguments" in result.stderr
|
||||
), f"Expected 'unrecognized arguments' to appear in STDERR"
|
||||
assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
|
||||
|
||||
|
||||
def test_output_to_stdout(shared_tmp_dir) -> None:
|
||||
# DOC X
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", os.path.join(TEST_FILES_DIR, "test.docx")],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
for test_string in DOCX_TEST_STRINGS:
|
||||
assert (
|
||||
test_string in result.stdout
|
||||
), f"Expected string not found in output: {test_string}"
|
||||
|
||||
|
||||
def test_output_to_file(shared_tmp_dir) -> None:
|
||||
# DOC X, flag -o at the end
|
||||
docx_output_file_1 = os.path.join(shared_tmp_dir, "test_docx_1.md")
|
||||
result = subprocess.run(
|
||||
[
|
||||
"python",
|
||||
"-m",
|
||||
"markitdown",
|
||||
os.path.join(TEST_FILES_DIR, "test.docx"),
|
||||
"-o",
|
||||
docx_output_file_1,
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
assert os.path.exists(
|
||||
docx_output_file_1
|
||||
), f"Output file not created: {docx_output_file_1}"
|
||||
|
||||
with open(docx_output_file_1, "r") as f:
|
||||
output = f.read()
|
||||
for test_string in DOCX_TEST_STRINGS:
|
||||
assert (
|
||||
test_string in output
|
||||
), f"Expected string not found in output: {test_string}"
|
||||
|
||||
# DOC X, flag -o at the beginning
|
||||
docx_output_file_2 = os.path.join(shared_tmp_dir, "test_docx_2.md")
|
||||
result = subprocess.run(
|
||||
[
|
||||
"python",
|
||||
"-m",
|
||||
"markitdown",
|
||||
"-o",
|
||||
docx_output_file_2,
|
||||
os.path.join(TEST_FILES_DIR, "test.docx"),
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
assert os.path.exists(
|
||||
docx_output_file_2
|
||||
), f"Output file not created: {docx_output_file_2}"
|
||||
|
||||
with open(docx_output_file_2, "r") as f:
|
||||
output = f.read()
|
||||
for test_string in DOCX_TEST_STRINGS:
|
||||
assert (
|
||||
test_string in output
|
||||
), f"Expected string not found in output: {test_string}"
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
"""Runs this file's tests from the command line."""
|
||||
import tempfile
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
test_version(tmp_dir)
|
||||
test_invalid_flag(tmp_dir)
|
||||
test_output_to_stdout(tmp_dir)
|
||||
test_output_to_file(tmp_dir)
|
||||
print("All tests passed!")
|
||||
35
packages/markitdown/tests/test_cli_misc.py
Normal file
@@ -0,0 +1,35 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import subprocess
|
||||
import pytest
|
||||
from markitdown import __version__
|
||||
|
||||
# This file contains CLI tests that are not directly tested by the FileTestVectors.
|
||||
# This includes things like help messages, version numbers, and invalid flags.
|
||||
|
||||
|
||||
def test_version() -> None:
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", "--version"], capture_output=True, text=True
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
assert __version__ in result.stdout, f"Version not found in output: {result.stdout}"
|
||||
|
||||
|
||||
def test_invalid_flag() -> None:
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", "--foobar"], capture_output=True, text=True
|
||||
)
|
||||
|
||||
assert result.returncode != 0, "CLI unexpectedly succeeded with an invalid flag"
|
||||
assert (
|
||||
"unrecognized arguments" in result.stderr
|
||||
), f"Expected 'unrecognized arguments' to appear in STDERR"
|
||||
assert "SYNTAX" in result.stderr, f"Expected 'SYNTAX' to appear in STDERR"
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
"""Runs this file's tests from the command line."""
|
||||
test_version()
|
||||
test_invalid_flag()
|
||||
print("All tests passed!")
|
||||
172
packages/markitdown/tests/test_cli_vectors.py
Normal file
@@ -0,0 +1,172 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import os
|
||||
import time
|
||||
import pytest
|
||||
import subprocess
|
||||
import locale
|
||||
from typing import List
|
||||
|
||||
if __name__ == "__main__":
|
||||
from _test_vectors import GENERAL_TEST_VECTORS, FileTestVector
|
||||
else:
|
||||
from ._test_vectors import GENERAL_TEST_VECTORS, FileTestVector
|
||||
|
||||
from markitdown import (
|
||||
MarkItDown,
|
||||
UnsupportedFormatException,
|
||||
FileConversionException,
|
||||
StreamInfo,
|
||||
)
|
||||
|
||||
skip_remote = (
|
||||
True if os.environ.get("GITHUB_ACTIONS") else False
|
||||
) # Don't run these tests in CI
|
||||
|
||||
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
|
||||
TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"
|
||||
|
||||
|
||||
# Prepare CLI test vectors (remove vectors that require mocking the URL)
|
||||
CLI_TEST_VECTORS: List[FileTestVector] = []
|
||||
for test_vector in GENERAL_TEST_VECTORS:
|
||||
if test_vector.url is not None:
|
||||
continue
|
||||
CLI_TEST_VECTORS.append(test_vector)
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def shared_tmp_dir(tmp_path_factory):
|
||||
return tmp_path_factory.mktemp("pytest_tmp")
|
||||
|
||||
|
||||
@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
|
||||
def test_output_to_stdout(shared_tmp_dir, test_vector) -> None:
|
||||
"""Test that the CLI outputs to stdout correctly."""
|
||||
|
||||
result = subprocess.run(
|
||||
[
|
||||
"python",
|
||||
"-m",
|
||||
"markitdown",
|
||||
os.path.join(TEST_FILES_DIR, test_vector.filename),
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
for test_string in test_vector.must_include:
|
||||
assert test_string in result.stdout
|
||||
for test_string in test_vector.must_not_include:
|
||||
assert test_string not in result.stdout
|
||||
|
||||
|
||||
@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
|
||||
def test_output_to_file(shared_tmp_dir, test_vector) -> None:
|
||||
"""Test that the CLI outputs to a file correctly."""
|
||||
|
||||
output_file = os.path.join(shared_tmp_dir, test_vector.filename + ".output")
|
||||
result = subprocess.run(
|
||||
[
|
||||
"python",
|
||||
"-m",
|
||||
"markitdown",
|
||||
"-o",
|
||||
output_file,
|
||||
os.path.join(TEST_FILES_DIR, test_vector.filename),
|
||||
],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
assert os.path.exists(output_file), f"Output file not created: {output_file}"
|
||||
|
||||
with open(output_file, "r") as f:
|
||||
output_data = f.read()
|
||||
for test_string in test_vector.must_include:
|
||||
assert test_string in output_data
|
||||
for test_string in test_vector.must_not_include:
|
||||
assert test_string not in output_data
|
||||
|
||||
os.remove(output_file)
|
||||
assert not os.path.exists(output_file), f"Output file not deleted: {output_file}"
|
||||
|
||||
|
||||
@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
|
||||
def test_input_from_stdin_without_hints(shared_tmp_dir, test_vector) -> None:
|
||||
"""Test that the CLI readds from stdin correctly."""
|
||||
|
||||
test_input = b""
|
||||
with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
|
||||
test_input = stream.read()
|
||||
|
||||
result = subprocess.run(
|
||||
[
|
||||
"python",
|
||||
"-m",
|
||||
"markitdown",
|
||||
os.path.join(TEST_FILES_DIR, test_vector.filename),
|
||||
],
|
||||
input=test_input,
|
||||
capture_output=True,
|
||||
text=False,
|
||||
)
|
||||
|
||||
stdout = result.stdout.decode(locale.getpreferredencoding())
|
||||
assert (
|
||||
result.returncode == 0
|
||||
), f"CLI exited with error: {result.stderr.decode('utf-8')}"
|
||||
for test_string in test_vector.must_include:
|
||||
assert test_string in stdout
|
||||
for test_string in test_vector.must_not_include:
|
||||
assert test_string not in stdout
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_remote,
|
||||
reason="do not run tests that query external urls",
|
||||
)
|
||||
@pytest.mark.parametrize("test_vector", CLI_TEST_VECTORS)
|
||||
def test_convert_url(shared_tmp_dir, test_vector):
|
||||
"""Test the conversion of a stream with no stream info."""
|
||||
# Note: shared_tmp_dir is not used here, but is needed to match the signature
|
||||
|
||||
markitdown = MarkItDown()
|
||||
|
||||
time.sleep(1) # Ensure we don't hit rate limits
|
||||
result = subprocess.run(
|
||||
["python", "-m", "markitdown", TEST_FILES_URL + "/" + test_vector.filename],
|
||||
capture_output=True,
|
||||
text=False,
|
||||
)
|
||||
|
||||
stdout = result.stdout.decode(locale.getpreferredencoding())
|
||||
assert result.returncode == 0, f"CLI exited with error: {result.stderr}"
|
||||
for test_string in test_vector.must_include:
|
||||
assert test_string in stdout
|
||||
for test_string in test_vector.must_not_include:
|
||||
assert test_string not in stdout
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
import tempfile
|
||||
|
||||
"""Runs this file's tests from the command line."""
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
for test_function in [
|
||||
test_output_to_stdout,
|
||||
test_output_to_file,
|
||||
test_input_from_stdin_without_hints,
|
||||
test_convert_url,
|
||||
]:
|
||||
for test_vector in CLI_TEST_VECTORS:
|
||||
print(
|
||||
f"Running {test_function.__name__} on {test_vector.filename}...",
|
||||
end="",
|
||||
)
|
||||
test_function(tmp_dir, test_vector)
|
||||
print("OK")
|
||||
print("All tests passed!")
|
||||
BIN
packages/markitdown/tests/test_files/random.bin
vendored
Normal file
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.epub
vendored
Normal file
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.m4a
vendored
Executable file
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.mp3
vendored
Normal file
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.pdf
vendored
Normal file
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.pptx
vendored
Binary file not shown.
BIN
packages/markitdown/tests/test_files/test.wav
vendored
Normal file
Binary file not shown.
@@ -23,7 +23,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print('markitdown')"
|
||||
"print(\"markitdown\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -1,416 +0,0 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import io
|
||||
import os
|
||||
import shutil
|
||||
|
||||
import pytest
|
||||
import requests
|
||||
|
||||
from warnings import catch_warnings, resetwarnings
|
||||
|
||||
from markitdown import MarkItDown
|
||||
|
||||
skip_remote = (
|
||||
True if os.environ.get("GITHUB_ACTIONS") else False
|
||||
) # Don't run these tests in CI
|
||||
|
||||
|
||||
# Don't run the llm tests without a key and the client library
|
||||
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
|
||||
try:
|
||||
import openai
|
||||
except ModuleNotFoundError:
|
||||
skip_llm = True
|
||||
|
||||
# Skip exiftool tests if not installed
|
||||
skip_exiftool = shutil.which("exiftool") is None
|
||||
|
||||
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
|
||||
|
||||
JPG_TEST_EXIFTOOL = {
|
||||
"Author": "AutoGen Authors",
|
||||
"Title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"Description": "AutoGen enables diverse LLM-based applications",
|
||||
"ImageSize": "1615x1967",
|
||||
"DateTimeOriginal": "2024:03:14 22:10:00",
|
||||
}
|
||||
|
||||
PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf"
|
||||
PDF_TEST_STRINGS = [
|
||||
"While there is contemporaneous exploration of multi-agent approaches"
|
||||
]
|
||||
|
||||
YOUTUBE_TEST_URL = "https://www.youtube.com/watch?v=V2qZ_lgxTzg"
|
||||
YOUTUBE_TEST_STRINGS = [
|
||||
"## AutoGen FULL Tutorial with Python (Step-By-Step)",
|
||||
"This is an intermediate tutorial for installing and using AutoGen locally",
|
||||
"PT15M4S",
|
||||
"the model we're going to be using today is GPT 3.5 turbo", # From the transcript
|
||||
]
|
||||
|
||||
XLSX_TEST_STRINGS = [
|
||||
"## 09060124-b5e7-4717-9d07-3c046eb",
|
||||
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
|
||||
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
|
||||
]
|
||||
|
||||
XLS_TEST_STRINGS = [
|
||||
"## 09060124-b5e7-4717-9d07-3c046eb",
|
||||
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
|
||||
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
|
||||
]
|
||||
|
||||
DOCX_TEST_STRINGS = [
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
]
|
||||
|
||||
MSG_TEST_STRINGS = [
|
||||
"# Email Message",
|
||||
"**From:** test.sender@example.com",
|
||||
"**To:** test.recipient@example.com",
|
||||
"**Subject:** Test Email Message",
|
||||
"## Content",
|
||||
"This is the body of the test email message",
|
||||
]
|
||||
|
||||
DOCX_COMMENT_TEST_STRINGS = [
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"This is a test comment. 12df-321a",
|
||||
"Yet another comment in the doc. 55yiyi-asd09",
|
||||
]
|
||||
|
||||
PPTX_TEST_STRINGS = [
|
||||
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
|
||||
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
|
||||
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
|
||||
"1b92870d-e3b5-4e65-8153-919f4ff45592",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
|
||||
"2003", # chart value
|
||||
]
|
||||
|
||||
BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
|
||||
BLOG_TEST_STRINGS = [
|
||||
"Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
|
||||
"an example where high cost can easily prevent a generic complex",
|
||||
]
|
||||
|
||||
|
||||
RSS_TEST_STRINGS = [
|
||||
"The Official Microsoft Blog",
|
||||
"In the case of AI, it is absolutely true that the industry is moving incredibly fast",
|
||||
]
|
||||
|
||||
|
||||
WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
|
||||
WIKIPEDIA_TEST_STRINGS = [
|
||||
"Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
|
||||
'Microsoft was founded by [Bill Gates](/wiki/Bill_Gates "Bill Gates")',
|
||||
]
|
||||
WIKIPEDIA_TEST_EXCLUDES = [
|
||||
"You are encouraged to create an account and log in",
|
||||
"154 languages",
|
||||
"move to sidebar",
|
||||
]
|
||||
|
||||
SERP_TEST_URL = "https://www.bing.com/search?q=microsoft+wikipedia"
|
||||
SERP_TEST_STRINGS = [
|
||||
"](https://en.wikipedia.org/wiki/Microsoft",
|
||||
"Microsoft Corporation is **an American multinational corporation and technology company headquartered** in Redmond",
|
||||
"1995–2007: Foray into the Web, Windows 95, Windows XP, and Xbox",
|
||||
]
|
||||
SERP_TEST_EXCLUDES = [
|
||||
"https://www.bing.com/ck/a?!&&p=",
|
||||
"data:image/svg+xml,%3Csvg%20width%3D",
|
||||
]
|
||||
|
||||
CSV_CP932_TEST_STRINGS = [
|
||||
"名前,年齢,住所",
|
||||
"佐藤太郎,30,東京",
|
||||
"三木英子,25,大阪",
|
||||
"髙橋淳,35,名古屋",
|
||||
]
|
||||
|
||||
LLM_TEST_STRINGS = [
|
||||
"5bda1dd6",
|
||||
]
|
||||
|
||||
JSON_TEST_STRINGS = [
|
||||
"5b64c88c-b3c3-4510-bcb8-da0b200602d8",
|
||||
"9700dc99-6685-40b4-9a3a-5e406dcb37f3",
|
||||
]
|
||||
|
||||
|
||||
# --- Helper Functions ---
|
||||
def validate_strings(result, expected_strings, exclude_strings=None):
|
||||
"""Validate presence or absence of specific strings."""
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
for string in expected_strings:
|
||||
assert string in text_content
|
||||
if exclude_strings:
|
||||
for string in exclude_strings:
|
||||
assert string not in text_content
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_remote,
|
||||
reason="do not run tests that query external urls",
|
||||
)
|
||||
def test_markitdown_remote() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# By URL
|
||||
result = markitdown.convert(PDF_TEST_URL)
|
||||
for test_string in PDF_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
# By stream
|
||||
response = requests.get(PDF_TEST_URL)
|
||||
result = markitdown.convert_stream(
|
||||
io.BytesIO(response.content), file_extension=".pdf", url=PDF_TEST_URL
|
||||
)
|
||||
for test_string in PDF_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
# Youtube
|
||||
# TODO: This test randomly fails for some reason. Haven't been able to repro it yet. Disabling until I can debug the issue
|
||||
# result = markitdown.convert(YOUTUBE_TEST_URL)
|
||||
# for test_string in YOUTUBE_TEST_STRINGS:
|
||||
# assert test_string in result.text_content
|
||||
|
||||
|
||||
def test_markitdown_local_paths() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# Test XLSX processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
|
||||
validate_strings(result, XLSX_TEST_STRINGS)
|
||||
|
||||
# Test XLS processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
|
||||
for test_string in XLS_TEST_STRINGS:
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
assert test_string in text_content
|
||||
|
||||
# Test DOCX processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
|
||||
validate_strings(result, DOCX_TEST_STRINGS)
|
||||
|
||||
# Test DOCX processing, with comments
|
||||
result = markitdown.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
|
||||
style_map="comment-reference => ",
|
||||
)
|
||||
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
|
||||
|
||||
# Test DOCX processing, with comments and setting style_map on init
|
||||
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
|
||||
result = markitdown_with_style_map.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
|
||||
)
|
||||
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
|
||||
|
||||
# Test PPTX processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
|
||||
validate_strings(result, PPTX_TEST_STRINGS)
|
||||
|
||||
# Test HTML processing
|
||||
result = markitdown.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
|
||||
)
|
||||
validate_strings(result, BLOG_TEST_STRINGS)
|
||||
|
||||
# Test ZIP file processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
|
||||
validate_strings(result, XLSX_TEST_STRINGS)
|
||||
|
||||
# Test Wikipedia processing
|
||||
result = markitdown.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
|
||||
)
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
|
||||
|
||||
# Test Bing processing
|
||||
result = markitdown.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
|
||||
)
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
|
||||
|
||||
# Test RSS processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
for test_string in RSS_TEST_STRINGS:
|
||||
assert test_string in text_content
|
||||
|
||||
# Test non-UTF-8 encoding
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
|
||||
validate_strings(result, CSV_CP932_TEST_STRINGS)
|
||||
|
||||
# Test MSG (Outlook email) processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
|
||||
validate_strings(result, MSG_TEST_STRINGS)
|
||||
|
||||
# Test JSON processing
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
|
||||
validate_strings(result, JSON_TEST_STRINGS)
|
||||
|
||||
# Test input with leading blank characters
|
||||
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
|
||||
result = markitdown.convert_stream(io.BytesIO(input_data))
|
||||
assert "# Test" in result.text_content
|
||||
|
||||
|
||||
def test_markitdown_local_objects() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# Test XLSX processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.xlsx"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".xlsx")
|
||||
validate_strings(result, XLSX_TEST_STRINGS)
|
||||
|
||||
# Test XLS processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.xls"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".xls")
|
||||
for test_string in XLS_TEST_STRINGS:
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
assert test_string in text_content
|
||||
|
||||
# Test DOCX processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.docx"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".docx")
|
||||
validate_strings(result, DOCX_TEST_STRINGS)
|
||||
|
||||
# Test DOCX processing, with comments
|
||||
with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
|
||||
result = markitdown.convert(
|
||||
f,
|
||||
file_extension=".docx",
|
||||
style_map="comment-reference => ",
|
||||
)
|
||||
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
|
||||
|
||||
# Test DOCX processing, with comments and setting style_map on init
|
||||
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
|
||||
with open(os.path.join(TEST_FILES_DIR, "test_with_comment.docx"), "rb") as f:
|
||||
result = markitdown_with_style_map.convert(f, file_extension=".docx")
|
||||
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
|
||||
|
||||
# Test PPTX processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.pptx"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".pptx")
|
||||
validate_strings(result, PPTX_TEST_STRINGS)
|
||||
|
||||
# Test HTML processing
|
||||
with open(
|
||||
os.path.join(TEST_FILES_DIR, "test_blog.html"), "rt", encoding="utf-8"
|
||||
) as f:
|
||||
result = markitdown.convert(f, file_extension=".html", url=BLOG_TEST_URL)
|
||||
validate_strings(result, BLOG_TEST_STRINGS)
|
||||
|
||||
# Test Wikipedia processing
|
||||
with open(
|
||||
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), "rt", encoding="utf-8"
|
||||
) as f:
|
||||
result = markitdown.convert(f, file_extension=".html", url=WIKIPEDIA_TEST_URL)
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
|
||||
|
||||
# Test Bing processing
|
||||
with open(
|
||||
os.path.join(TEST_FILES_DIR, "test_serp.html"), "rt", encoding="utf-8"
|
||||
) as f:
|
||||
result = markitdown.convert(f, file_extension=".html", url=SERP_TEST_URL)
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
|
||||
|
||||
# Test RSS processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test_rss.xml"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".xml")
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
for test_string in RSS_TEST_STRINGS:
|
||||
assert test_string in text_content
|
||||
|
||||
# Test MSG (Outlook email) processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".msg")
|
||||
validate_strings(result, MSG_TEST_STRINGS)
|
||||
|
||||
# Test JSON processing
|
||||
with open(os.path.join(TEST_FILES_DIR, "test.json"), "rb") as f:
|
||||
result = markitdown.convert(f, file_extension=".json")
|
||||
validate_strings(result, JSON_TEST_STRINGS)
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_exiftool,
|
||||
reason="do not run if exiftool is not installed",
|
||||
)
|
||||
def test_markitdown_exiftool() -> None:
|
||||
# Test that the automatic discovery of exiftool raises a warning
|
||||
# and is disabled
|
||||
try:
|
||||
with catch_warnings(record=True) as w:
|
||||
markitdown = MarkItDown()
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
|
||||
assert len(w) == 1
|
||||
assert w[0].category is DeprecationWarning
|
||||
assert result.text_content.strip() == ""
|
||||
finally:
|
||||
resetwarnings()
|
||||
|
||||
# Test explicitly setting the location of exiftool
|
||||
which_exiftool = shutil.which("exiftool")
|
||||
markitdown = MarkItDown(exiftool_path=which_exiftool)
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
|
||||
for key in JPG_TEST_EXIFTOOL:
|
||||
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
|
||||
assert target in result.text_content
|
||||
|
||||
# Test setting the exiftool path through an environment variable
|
||||
os.environ["EXIFTOOL_PATH"] = which_exiftool
|
||||
markitdown = MarkItDown()
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
|
||||
for key in JPG_TEST_EXIFTOOL:
|
||||
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
|
||||
assert target in result.text_content
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_llm,
|
||||
reason="do not run llm tests without a key",
|
||||
)
|
||||
def test_markitdown_llm() -> None:
|
||||
client = openai.OpenAI()
|
||||
markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
|
||||
|
||||
for test_string in LLM_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
# This is not super precise. It would also accept "red square", "blue circle",
|
||||
# "the square is not blue", etc. But it's sufficient for this test.
|
||||
for test_string in ["red", "circle", "blue", "square"]:
|
||||
assert test_string in result.text_content.lower()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
"""Runs this file's tests from the command line."""
|
||||
test_markitdown_remote()
|
||||
test_markitdown_local_paths()
|
||||
test_markitdown_local_objects()
|
||||
test_markitdown_exiftool()
|
||||
# test_markitdown_llm()
|
||||
print("All tests passed!")
|
||||
328
packages/markitdown/tests/test_module_misc.py
Normal file
@@ -0,0 +1,328 @@
|
||||
#!/usr/bin/env python3 -m pytest
|
||||
import io
|
||||
import os
|
||||
import shutil
|
||||
import pytest
|
||||
|
||||
from markitdown import (
|
||||
MarkItDown,
|
||||
UnsupportedFormatException,
|
||||
FileConversionException,
|
||||
StreamInfo,
|
||||
)
|
||||
|
||||
# This file contains module tests that are not directly tested by the FileTestVectors.
|
||||
# This includes things like helper functions and runtime conversion options
|
||||
# (e.g., LLM clients, exiftool path, transcription services, etc.)
|
||||
|
||||
skip_remote = (
|
||||
True if os.environ.get("GITHUB_ACTIONS") else False
|
||||
) # Don't run these tests in CI
|
||||
|
||||
|
||||
# Don't run the llm tests without a key and the client library
|
||||
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
|
||||
try:
|
||||
import openai
|
||||
except ModuleNotFoundError:
|
||||
skip_llm = True
|
||||
|
||||
# Skip exiftool tests if not installed
|
||||
skip_exiftool = shutil.which("exiftool") is None
|
||||
|
||||
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
|
||||
|
||||
JPG_TEST_EXIFTOOL = {
|
||||
"Author": "AutoGen Authors",
|
||||
"Title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"Description": "AutoGen enables diverse LLM-based applications",
|
||||
"ImageSize": "1615x1967",
|
||||
"DateTimeOriginal": "2024:03:14 22:10:00",
|
||||
}
|
||||
|
||||
MP3_TEST_EXIFTOOL = {
|
||||
"Title": "f67a499e-a7d0-4ca3-a49b-358bd934ae3e",
|
||||
"Artist": "Artist Name Test String",
|
||||
"Album": "Album Name Test String",
|
||||
"SampleRate": "48000",
|
||||
}
|
||||
|
||||
PDF_TEST_URL = "https://arxiv.org/pdf/2308.08155v2.pdf"
|
||||
PDF_TEST_STRINGS = [
|
||||
"While there is contemporaneous exploration of multi-agent approaches"
|
||||
]
|
||||
|
||||
YOUTUBE_TEST_URL = "https://www.youtube.com/watch?v=V2qZ_lgxTzg"
|
||||
YOUTUBE_TEST_STRINGS = [
|
||||
"## AutoGen FULL Tutorial with Python (Step-By-Step)",
|
||||
"This is an intermediate tutorial for installing and using AutoGen locally",
|
||||
"PT15M4S",
|
||||
"the model we're going to be using today is GPT 3.5 turbo", # From the transcript
|
||||
]
|
||||
|
||||
DOCX_COMMENT_TEST_STRINGS = [
|
||||
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
|
||||
"49e168b7-d2ae-407f-a055-2167576f39a1",
|
||||
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
|
||||
"# Abstract",
|
||||
"# Introduction",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"This is a test comment. 12df-321a",
|
||||
"Yet another comment in the doc. 55yiyi-asd09",
|
||||
]
|
||||
|
||||
BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
|
||||
BLOG_TEST_STRINGS = [
|
||||
"Large language models (LLMs) are powerful tools that can generate natural language texts for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state of the art LLM in the world. Is model selection irrelevant? What about inference parameters?",
|
||||
"an example where high cost can easily prevent a generic complex",
|
||||
]
|
||||
|
||||
LLM_TEST_STRINGS = [
|
||||
"5bda1dd6",
|
||||
]
|
||||
|
||||
PPTX_TEST_STRINGS = [
|
||||
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
|
||||
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
|
||||
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
|
||||
"1b92870d-e3b5-4e65-8153-919f4ff45592",
|
||||
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
|
||||
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
|
||||
"2003", # chart value
|
||||
]
|
||||
|
||||
|
||||
# --- Helper Functions ---
|
||||
def validate_strings(result, expected_strings, exclude_strings=None):
|
||||
"""Validate presence or absence of specific strings."""
|
||||
text_content = result.text_content.replace("\\", "")
|
||||
for string in expected_strings:
|
||||
assert string in text_content
|
||||
if exclude_strings:
|
||||
for string in exclude_strings:
|
||||
assert string not in text_content
|
||||
|
||||
|
||||
def test_stream_info_operations() -> None:
|
||||
"""Test operations performed on StreamInfo objects."""
|
||||
|
||||
stream_info_original = StreamInfo(
|
||||
mimetype="mimetype.1",
|
||||
extension="extension.1",
|
||||
charset="charset.1",
|
||||
filename="filename.1",
|
||||
local_path="local_path.1",
|
||||
url="url.1",
|
||||
)
|
||||
|
||||
# Check updating all attributes by keyword
|
||||
keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
|
||||
for keyword in keywords:
|
||||
updated_stream_info = stream_info_original.copy_and_update(
|
||||
**{keyword: f"{keyword}.2"}
|
||||
)
|
||||
|
||||
# Make sure the targeted attribute is updated
|
||||
assert getattr(updated_stream_info, keyword) == f"{keyword}.2"
|
||||
|
||||
# Make sure the other attributes are unchanged
|
||||
for k in keywords:
|
||||
if k != keyword:
|
||||
assert getattr(stream_info_original, k) == getattr(
|
||||
updated_stream_info, k
|
||||
)
|
||||
|
||||
# Check updating all attributes by passing a new StreamInfo object
|
||||
keywords = ["mimetype", "extension", "charset", "filename", "local_path", "url"]
|
||||
for keyword in keywords:
|
||||
updated_stream_info = stream_info_original.copy_and_update(
|
||||
StreamInfo(**{keyword: f"{keyword}.2"})
|
||||
)
|
||||
|
||||
# Make sure the targeted attribute is updated
|
||||
assert getattr(updated_stream_info, keyword) == f"{keyword}.2"
|
||||
|
||||
# Make sure the other attributes are unchanged
|
||||
for k in keywords:
|
||||
if k != keyword:
|
||||
assert getattr(stream_info_original, k) == getattr(
|
||||
updated_stream_info, k
|
||||
)
|
||||
|
||||
# Check mixing and matching
|
||||
updated_stream_info = stream_info_original.copy_and_update(
|
||||
StreamInfo(extension="extension.2", filename="filename.2"),
|
||||
mimetype="mimetype.3",
|
||||
charset="charset.3",
|
||||
)
|
||||
assert updated_stream_info.extension == "extension.2"
|
||||
assert updated_stream_info.filename == "filename.2"
|
||||
assert updated_stream_info.mimetype == "mimetype.3"
|
||||
assert updated_stream_info.charset == "charset.3"
|
||||
assert updated_stream_info.local_path == "local_path.1"
|
||||
assert updated_stream_info.url == "url.1"
|
||||
|
||||
# Check multiple StreamInfo objects
|
||||
updated_stream_info = stream_info_original.copy_and_update(
|
||||
StreamInfo(extension="extension.4", filename="filename.5"),
|
||||
StreamInfo(mimetype="mimetype.6", charset="charset.7"),
|
||||
)
|
||||
assert updated_stream_info.extension == "extension.4"
|
||||
assert updated_stream_info.filename == "filename.5"
|
||||
assert updated_stream_info.mimetype == "mimetype.6"
|
||||
assert updated_stream_info.charset == "charset.7"
|
||||
assert updated_stream_info.local_path == "local_path.1"
|
||||
assert updated_stream_info.url == "url.1"
|
||||
|
||||
|
||||
def test_docx_comments() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# Test DOCX processing, with comments and setting style_map on init
|
||||
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
|
||||
result = markitdown_with_style_map.convert(
|
||||
os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
|
||||
)
|
||||
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
|
||||
|
||||
|
||||
def test_input_as_strings() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# Test input from a stream
|
||||
input_data = b"<html><body><h1>Test</h1></body></html>"
|
||||
result = markitdown.convert_stream(io.BytesIO(input_data))
|
||||
assert "# Test" in result.text_content
|
||||
|
||||
# Test input with leading blank characters
|
||||
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
|
||||
result = markitdown.convert_stream(io.BytesIO(input_data))
|
||||
assert "# Test" in result.text_content
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_remote,
|
||||
reason="do not run tests that query external urls",
|
||||
)
|
||||
def test_markitdown_remote() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# By URL
|
||||
result = markitdown.convert(PDF_TEST_URL)
|
||||
for test_string in PDF_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
# Youtube
|
||||
result = markitdown.convert(YOUTUBE_TEST_URL)
|
||||
for test_string in YOUTUBE_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_remote,
|
||||
reason="do not run speech transcription tests that call remote services",
|
||||
)
|
||||
def test_speech_transcription() -> None:
|
||||
markitdown = MarkItDown()
|
||||
|
||||
# Test WAV, MP3, and M4A files
|
||||
for file_name in ["test.wav", "test.mp3", "test.m4a"]:
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, file_name))
|
||||
result_lower = result.text_content.lower()
|
||||
assert (
|
||||
("1" in result_lower or "one" in result_lower)
|
||||
and ("2" in result_lower or "two" in result_lower)
|
||||
and ("3" in result_lower or "three" in result_lower)
|
||||
and ("4" in result_lower or "four" in result_lower)
|
||||
and ("5" in result_lower or "five" in result_lower)
|
||||
)
|
||||
|
||||
|
||||
def test_exceptions() -> None:
|
||||
# Check that an exception is raised when trying to convert an unsupported format
|
||||
markitdown = MarkItDown()
|
||||
with pytest.raises(UnsupportedFormatException):
|
||||
markitdown.convert(os.path.join(TEST_FILES_DIR, "random.bin"))
|
||||
|
||||
# Check that an exception is raised when trying to convert a file that is corrupted
|
||||
with pytest.raises(FileConversionException) as exc_info:
|
||||
markitdown.convert(
|
||||
os.path.join(TEST_FILES_DIR, "random.bin"), file_extension=".pptx"
|
||||
)
|
||||
assert len(exc_info.value.attempts) == 1
|
||||
assert type(exc_info.value.attempts[0].converter).__name__ == "PptxConverter"
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_exiftool,
|
||||
reason="do not run if exiftool is not installed",
|
||||
)
|
||||
def test_markitdown_exiftool() -> None:
|
||||
which_exiftool = shutil.which("exiftool")
|
||||
assert which_exiftool is not None
|
||||
|
||||
# Test explicitly setting the location of exiftool
|
||||
markitdown = MarkItDown(exiftool_path=which_exiftool)
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
|
||||
for key in JPG_TEST_EXIFTOOL:
|
||||
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
|
||||
assert target in result.text_content
|
||||
|
||||
# Test setting the exiftool path through an environment variable
|
||||
os.environ["EXIFTOOL_PATH"] = which_exiftool
|
||||
markitdown = MarkItDown()
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
|
||||
for key in JPG_TEST_EXIFTOOL:
|
||||
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
|
||||
assert target in result.text_content
|
||||
|
||||
# Test some other media types
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.mp3"))
|
||||
for key in MP3_TEST_EXIFTOOL:
|
||||
target = f"{key}: {MP3_TEST_EXIFTOOL[key]}"
|
||||
assert target in result.text_content
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
skip_llm,
|
||||
reason="do not run llm tests without a key",
|
||||
)
|
||||
def test_markitdown_llm() -> None:
|
||||
client = openai.OpenAI()
|
||||
markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
|
||||
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
|
||||
for test_string in LLM_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
|
||||
# This is not super precise. It would also accept "red square", "blue circle",
|
||||
# "the square is not blue", etc. But it's sufficient for this test.
|
||||
for test_string in ["red", "circle", "blue", "square"]:
|
||||
assert test_string in result.text_content.lower()
|
||||
|
||||
# Images embedded in PPTX files
|
||||
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
|
||||
# LLM Captions are included
|
||||
for test_string in LLM_TEST_STRINGS:
|
||||
assert test_string in result.text_content
|
||||
# Standard alt text is included
|
||||
validate_strings(result, PPTX_TEST_STRINGS)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
"""Runs this file's tests from the command line."""
|
||||
for test in [
|
||||
test_stream_info_operations,
|
||||
test_docx_comments,
|
||||
test_input_as_strings,
|
||||
test_markitdown_remote,
|
||||
test_speech_transcription,
|
||||
test_exceptions,
|
||||
test_markitdown_exiftool,
|
||||
test_markitdown_llm,
|
||||
]:
|
||||
print(f"Running {test.__name__}...", end="")
|
||||
test()
|
||||
print("OK")
|
||||
print("All tests passed!")
|
||||
144
packages/markitdown/tests/test_module_vectors.py
Normal file
@@ -0,0 +1,144 @@
#!/usr/bin/env python3 -m pytest
import os
import time
import pytest
import codecs


if __name__ == "__main__":
    from _test_vectors import GENERAL_TEST_VECTORS
else:
    from ._test_vectors import GENERAL_TEST_VECTORS

from markitdown import (
    MarkItDown,
    UnsupportedFormatException,
    FileConversionException,
    StreamInfo,
)

skip_remote = (
    True if os.environ.get("GITHUB_ACTIONS") else False
)  # Don't run these tests in CI

TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
TEST_FILES_URL = "https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/markitdown/tests/test_files"


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_guess_stream_info(test_vector):
    """Test the ability to guess stream info."""
    markitdown = MarkItDown()

    local_path = os.path.join(TEST_FILES_DIR, test_vector.filename)
    expected_extension = os.path.splitext(test_vector.filename)[1]

    with open(local_path, "rb") as stream:
        guesses = markitdown._get_stream_info_guesses(
            stream,
            base_guess=StreamInfo(
                filename=os.path.basename(test_vector.filename),
                local_path=local_path,
                extension=expected_extension,
            ),
        )

        # For some limited exceptions, we can't guarantee the exact
        # mimetype or extension, so we'll special-case them here.
        if test_vector.filename in [
            "test_outlook_msg.msg",
        ]:
            return

        assert guesses[0].mimetype == test_vector.mimetype
        assert guesses[0].extension == expected_extension
        assert guesses[0].charset == test_vector.charset


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_local(test_vector):
    """Test the conversion of a local file."""
    markitdown = MarkItDown()

    result = markitdown.convert(
        os.path.join(TEST_FILES_DIR, test_vector.filename), url=test_vector.url
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_stream_with_hints(test_vector):
    """Test the conversion of a stream with full stream info."""
    markitdown = MarkItDown()

    stream_info = StreamInfo(
        extension=os.path.splitext(test_vector.filename)[1],
        mimetype=test_vector.mimetype,
        charset=test_vector.charset,
    )

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(
            stream, stream_info=stream_info, url=test_vector.url
        )
        for string in test_vector.must_include:
            assert string in result.markdown
        for string in test_vector.must_not_include:
            assert string not in result.markdown


@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_stream_without_hints(test_vector):
    """Test the conversion of a stream with no stream info."""
    markitdown = MarkItDown()

    with open(os.path.join(TEST_FILES_DIR, test_vector.filename), "rb") as stream:
        result = markitdown.convert(stream, url=test_vector.url)
        for string in test_vector.must_include:
            assert string in result.markdown
        for string in test_vector.must_not_include:
            assert string not in result.markdown


@pytest.mark.skipif(
    skip_remote,
    reason="do not run tests that query external urls",
)
@pytest.mark.parametrize("test_vector", GENERAL_TEST_VECTORS)
def test_convert_url(test_vector):
    """Test the conversion of a file fetched from a remote URL."""
    markitdown = MarkItDown()

    time.sleep(1)  # Ensure we don't hit rate limits

    result = markitdown.convert(
        TEST_FILES_URL + "/" + test_vector.filename,
        url=test_vector.url,  # Mock where this file would be found
    )
    for string in test_vector.must_include:
        assert string in result.markdown
    for string in test_vector.must_not_include:
        assert string not in result.markdown


if __name__ == "__main__":
    import sys

    """Runs this file's tests from the command line."""
    for test_function in [
        test_guess_stream_info,
        test_convert_local,
        test_convert_stream_with_hints,
        test_convert_stream_without_hints,
        test_convert_url,
    ]:
        for test_vector in GENERAL_TEST_VECTORS:
            print(
                f"Running {test_function.__name__} on {test_vector.filename}...", end=""
            )
            test_function(test_vector)
            print("OK")
    print("All tests passed!")