86 Commits

Author SHA1 Message Date
afourney
3ce21a47ab Merge pull request #102 from microsoft/bump_version
Bump version.
2024-12-17 13:55:12 -08:00
Adam Fourney
9518c01d4e Bump version. 2024-12-17 13:51:13 -08:00
afourney
22504551ef Merge pull request #101 from microsoft/add_deprecation_warnings
Added deprecation warnings for mlm_* arguments.
2024-12-17 13:49:44 -08:00
Adam Fourney
95188a4a27 Merge main. 2024-12-17 13:46:26 -08:00
afourney
e69d012b86 Merge pull request #100 from microsoft/add_llm_tests 2024-12-17 13:36:36 -08:00
Adam Fourney
03a7843a0a Added deprecation warnings for mlm_* arguments. 2024-12-17 13:22:48 -08:00
Adam Fourney
248d64edd0 Added llm tests to the local test set. 2024-12-17 12:13:19 -08:00
gagb
ad5d4fb139 Merge pull request #77 from microsoft/kevinclb/main
Kevinclb/main
2024-12-16 18:14:09 -08:00
gagb
ad29122592 run precommit 2024-12-16 18:09:48 -08:00
gagb
898bfd4774 Merge branch 'main' into main 2024-12-16 18:00:26 -08:00
gagb
c8980d9f41 Merge pull request #75 from microsoft/cybernobie/main
Cybernobie/main
2024-12-16 17:40:13 -08:00
gagb
24b52b2b8f Improve readme 2024-12-16 17:35:47 -08:00
gagb
09159aa04e Merge branch 'main' into main 2024-12-16 17:24:47 -08:00
gagb
77f620b568 Merge pull request #67 from DIMAX99/issue#65
fix issue #65
2024-12-16 17:18:53 -08:00
gagb
825d3bbb77 Merge branch 'main' into issue#65 2024-12-16 17:09:53 -08:00
gagb
c0127af120 Merge pull request #72 from CharlesCNorton/patch-1
Fix LLM terms
2024-12-16 17:06:24 -08:00
gagb
33cb5015eb Merge branch 'main' into patch-1 2024-12-16 17:04:44 -08:00
gagb
cf13b7e657 Merge pull request #73 from CharlesCNorton/patch-2
Fix LLM terminology in code
2024-12-16 17:04:33 -08:00
gagb
874eba6265 Merge branch 'main' into patch-2 2024-12-16 16:59:22 -08:00
gagb
c3fa2934b9 Run pre-commit 2024-12-16 16:56:52 -08:00
gagb
736e7d9a7e Merge branch 'main' into patch-1 2024-12-16 16:53:58 -08:00
gagb
19c111251b Merge pull request #60 from madduci/main
Added Dockerfile
2024-12-16 16:42:26 -08:00
gagb
360c2dd95f Merge branch 'main' into main 2024-12-16 16:35:50 -08:00
kevinbabou
87846cf5f8 rm setup.py 2024-12-16 16:28:44 -08:00
kevinbabou
33638f1fe6 feature: add argument parsing and setup.py file for cli tool capability 2024-12-16 16:28:44 -08:00
gagb
73776b2c0f Merge pull request #50 from narumiruna/youtube-transcript-languages
Support specifying YouTube transcript language
2024-12-16 16:23:20 -08:00
gagb
2d3ffeade1 Merge branch 'main' into youtube-transcript-languages 2024-12-16 16:20:35 -08:00
gagb
51c1453699 Merge pull request #48 from Soulter/main
Fix: pass the kwargs to _convert method when converting an url file
2024-12-16 16:19:09 -08:00
gagb
ae4669107c Merge branch 'main' into main 2024-12-16 16:01:59 -08:00
gagb
b0115cf971 Merge branch 'main' into youtube-transcript-languages 2024-12-16 15:47:38 -08:00
gagb
5cf8474f37 Merge pull request #44 from Y-Kim-64/main
Exclude test files from language statistics using linguist-vendored
2024-12-16 15:35:19 -08:00
gagb
83dc81170b Merge branch 'main' into main 2024-12-16 15:29:33 -08:00
gagb
e7a2e20d93 Merge pull request #39 from SH4DOW4RE/main
Catching pydub's warning of ffmpeg or avconv missing
2024-12-16 15:28:53 -08:00
gagb
980abd3a60 Merge branch 'main' into main 2024-12-16 15:24:58 -08:00
afourney
6587e0f097 Merge branch 'main' into patch-1 2024-12-16 14:27:26 -08:00
afourney
978c8763aa Merge pull request #38 from VillePuuska/support-comments-in-docx
Add passing style_map kwarg to Mammoth when converting docx to allow keeping comments
2024-12-16 14:26:55 -08:00
afourney
e7636656d8 Merge branch 'main' into support-comments-in-docx 2024-12-16 14:23:14 -08:00
afourney
ddc1bebea4 Merge branch 'main' into patch-2 2024-12-16 14:20:16 -08:00
afourney
fa1f496d51 Merge branch 'main' into patch-1 2024-12-16 14:18:20 -08:00
afourney
da779dd125 Merge pull request #33 from nyosegawa/feature/add-pptx-chart-support
Add PPTX chart support
2024-12-16 14:11:49 -08:00
afourney
12ce5e95b2 Merge branch 'main' into feature/add-pptx-chart-support 2024-12-16 14:06:14 -08:00
gagb
6dad1cca96 Merge pull request #22 from Josh-XT/main
Add zip handling
2024-12-16 13:56:25 -08:00
gagb
9e6a19987b Merge branch 'main' into main 2024-12-16 13:51:39 -08:00
gagb
ed91e8b534 Merge pull request #19 from brc-dd/fix/18
Fix character decoding issues with text-like files
2024-12-16 13:49:48 -08:00
gagb
aeff2cb5ae Merge branch 'main' into fix/18 2024-12-16 13:46:17 -08:00
gagb
c9c7d98d30 Merge pull request #11 from simonw/patch-2
CLI usage instructions
2024-12-16 13:45:05 -08:00
gagb
e7d9b5546a Merge branch 'main' into patch-2 2024-12-16 13:42:28 -08:00
CharlesCNorton
ed651aeb16 Fix LLM terminology in code
Replaced all occurrences of mlm_client and mlm_model with llm_client and llm_model for consistent terminology when referencing Large Language Models (LLMs).
2024-12-16 16:23:52 -05:00
CharlesCNorton
3d9f3f3e5b Fix LLM terms
Updated all instances of mlm_client and mlm_model to llm_client and llm_model in the readme. The previous terms (mlm_client and mlm_model) are incorrect in the context of configuring Large Language Models (LLMs), as "MLM" typically refers to Masked Language Models, which is unrelated to the intended functionality. This change aligns the documentation with standard naming conventions for LLM configuration parameters and improves clarity for users integrating with LLMs like OpenAI's GPT models.
2024-12-16 16:23:03 -05:00
Divit
ad01da308d fix issue #65 2024-12-16 21:48:33 +05:30
CyberNobie
010f841008 Ensure hatch is installed before running tests 2024-12-16 18:47:24 +05:30
Michele Adduci
5fc03b6415 Added UID as argument 2024-12-16 13:11:13 +01:00
Michele Adduci
013b022427 Added Docker Image for using markitdown in a sandboxed environment 2024-12-16 13:08:15 +01:00
narumi
695100d5d8 Support specifying YouTube transcript language 2024-12-16 13:16:00 +08:00
Soulter
d66ef5fcca Update README to introduce the customized mlm_prompt 2024-12-16 12:08:51 +08:00
Soulter
c168703d5e Pass the kwargs to _convert method when converting an url file 2024-12-16 11:41:39 +08:00
Yeonjun
3548c96dd3 Create .gitattributes
Mark test files as linguist-vendored
2024-12-16 09:21:07 +09:00
SH4DOW4RE
1559d9d163 pre-commit ran 2024-12-15 22:15:20 +01:00
SH4DOW4RE
b7f5662ffd PR: Catching pydub's warning of ffmpeg or avconv missing 2024-12-15 17:29:14 +01:00
Ville Puuska
0a7203b876 add style_map prop to MarkItDown class 2024-12-15 17:23:57 +02:00
Ville Puuska
0704b0b6ff pass 'style_map' kwarg to mammoth when converting docx 2024-12-15 16:59:21 +02:00
sakasegawa
0dd4e95584 Remove _is_chart 2024-12-15 21:14:58 +09:00
sakasegawa
93130b5ba5 Add PPTX chart support 2024-12-15 20:42:55 +09:00
Divyansh Singh
52b723724c Fix character decoding issues with text-like files 2024-12-15 10:37:59 +05:30
Josh XT
a55c3d525c Merge branch 'main' into main 2024-12-14 23:09:30 -05:00
gagb
81e3f24acd Merge pull request #29 from microsoft/gagb-patch-1
Update README.md
2024-12-14 19:17:54 -08:00
gagb
b84294620a Update README.md 2024-12-14 19:05:51 -08:00
gagb
60c495d609 Merge branch 'main' into patch-2 2024-12-14 18:57:11 -08:00
gagb
71123a4df3 Merge pull request #7 from microsoft/gagb/improve-readme
Improve the readme with contributing guidelines
2024-12-14 18:54:28 -08:00
gagb
5753e553fe Fix conflicts 2024-12-14 18:47:34 -08:00
gagb
752dd897b9 Merge pull request #28 from pawarbi/main
Update README.md
2024-12-14 18:44:52 -08:00
gagb
1aa4abe90f Merge branch 'gagb/improve-readme' into main 2024-12-14 18:44:33 -08:00
gagb
ea7c6dcc40 Merge pull request #27 from haesleinhuepf/patch-1
Add installation instructions from haesleinhuepf:patch-1
2024-12-14 18:39:51 -08:00
gagb
a31c0a13e7 Merge branch 'main' into gagb/improve-readme 2024-12-14 18:34:27 -08:00
Sandeep Pawar
30ab78fe9e Update README.md
I have updated the readme with three changes:
- Created sections for Installation and Usage to help users
- Added installation instruction
- Added additional example of using LLM. This will be the primary use case and will help users.
2024-12-14 19:15:10 -06:00
gagb
559b1fc62a Merge branch 'main' into patch-2 2024-12-14 15:02:42 -08:00
Josh XT
df03382218 Improve docustring 2024-12-14 17:55:22 -05:00
Robert Haase
18301edcd0 Add installation instructions 2024-12-14 23:22:54 +01:00
Josh XT
4987201ef6 test 2024-12-14 08:49:03 -05:00
Josh XT
571c5bbc0e add test 2024-12-14 08:45:51 -05:00
Josh XT
e8ea8b6f3d Update readme 2024-12-14 08:41:07 -05:00
Josh XT
7e634acf5f import zipfile 2024-12-14 08:24:44 -05:00
Josh XT
862c39029e add zip handling 2024-12-14 06:34:47 -05:00
Simon Willison
33ce17954d Note about piping 2024-12-13 11:09:03 -08:00
Simon Willison
6ebef5af0c CLI usage instructions
Plus added  a PyPI badge
2024-12-13 11:06:11 -08:00
gagb
3f9ba06418 Improve the readme with contributing guidelines
Addresses issue https://github.com/microsoft/markitdown/issues/6

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown?shareId=XXXX-XXXX-XXXX-XXXX).
2024-12-12 15:17:18 -08:00
17 changed files with 466 additions and 42 deletions

1
.dockerignore Normal file

@@ -0,0 +1 @@
*

1
.gitattributes vendored Normal file

@@ -0,0 +1 @@
tests/test_files/** linguist-vendored

16
Dockerfile Normal file

@@ -0,0 +1,16 @@
FROM python:3.13-alpine
USER root
# Runtime dependency
RUN apk add --no-cache ffmpeg
RUN pip install markitdown
# Default USERID and GROUPID
ARG USERID=10000
ARG GROUPID=10000
USER $USERID:$GROUPID
ENTRYPOINT [ "markitdown" ]

View File

@@ -1,5 +1,7 @@
# MarkItDown
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
It presently supports:
@@ -12,7 +14,23 @@ It presently supports:
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
# Installation
You can install `markitdown` using pip:
```python
pip install markitdown
```
or from the source
```sh
pip install -e .
```
# Usage
The API is simple:
```python
@@ -23,6 +41,44 @@ result = markitdown.convert("test.xlsx")
print(result.text_content)
```
To use this as a command-line utility, install it and then run it like this:
```bash
markitdown path-to-file.pdf
```
This will output Markdown to standard output. You can save it like this:
```bash
markitdown path-to-file.pdf > document.md
```
You can pipe content to standard input by omitting the argument:
```bash
cat path-to-file.pdf | markitdown
```
You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client.
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```
You can also use the project as Docker Image:
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -37,6 +93,24 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### Running Tests
To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install).
```sh
pip install hatch
hatch shell
hatch test
```
### Running Pre-commit Checks
Please run the pre-commit checks before submitting a PR.
```sh
pre-commit run --all-files
```
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft

View File

@@ -38,6 +38,8 @@ dependencies = [
"youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
"charset-normalizer",
"openai",
]
[project.urls]
@@ -76,3 +78,6 @@ exclude_lines = [
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
]
[tool.hatch.build.targets.sdist]
only-include = ["src/markitdown"]

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
- __version__ = "0.0.1a1"
+ __version__ = "0.0.1a3"

View File

@@ -2,21 +2,15 @@
#
# SPDX-License-Identifier: MIT
import sys
import argparse
from ._markitdown import MarkItDown
def main():
if len(sys.argv) == 1:
markitdown = MarkItDown()
result = markitdown.convert_stream(sys.stdin.buffer)
print(result.text_content)
elif len(sys.argv) == 2:
markitdown = MarkItDown()
result = markitdown.convert(sys.argv[1])
print(result.text_content)
else:
sys.stderr.write(
"""
parser = argparse.ArgumentParser(
description="Convert various file formats to markdown.",
formatter_class=argparse.RawDescriptionHelpFormatter,
usage="""
SYNTAX:
markitdown <OPTIONAL: FILENAME>
@@ -33,10 +27,21 @@ EXAMPLE:
OR
markitdown < example.pdf
""".strip()
+ "\n"
""".strip(),
)
parser.add_argument("filename", nargs="?")
args = parser.parse_args()
if args.filename is None:
markitdown = MarkItDown()
result = markitdown.convert_stream(sys.stdin.buffer)
print(result.text_content)
else:
markitdown = MarkItDown()
result = markitdown.convert(args.filename)
print(result.text_content)
if __name__ == "__main__":
main()
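
The move from counting `sys.argv` entries to `argparse` rests on `nargs="?"`, which makes the positional `filename` optional. A minimal sketch of the two cases the new `main()` distinguishes, assuming a hypothetical helper `parse_cli` (not part of the project) that takes the argument list explicitly:

```python
import argparse


def parse_cli(argv):
    """Hypothetical helper: same parser setup as main(), but argv is passed in."""
    parser = argparse.ArgumentParser(
        description="Convert various file formats to markdown."
    )
    # nargs="?" accepts zero or one value; when absent, the attribute is None.
    parser.add_argument("filename", nargs="?")
    return parser.parse_args(argv)


# No argument: main() would read from standard input.
stdin_args = parse_cli([])
# One argument: main() would convert the named file.
file_args = parse_cli(["example.pdf"])
print(stdin_args.filename, file_args.filename)
```

With this setup, a second positional argument is rejected by `argparse` itself rather than by the hand-written usage message of the old code.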

View File

@@ -12,8 +12,10 @@ import subprocess
import sys
import tempfile
import traceback
import zipfile
from typing import Any, Dict, List, Optional, Union
from urllib.parse import parse_qs, quote, unquote, urlparse, urlunparse
from warnings import warn, resetwarnings, catch_warnings
import mammoth
import markdownify
@@ -26,15 +28,24 @@ import pptx
import puremagic
import requests
from bs4 import BeautifulSoup
from charset_normalizer import from_path
# Optional Transcription support
try:
# Using warnings' catch_warnings to catch
# pydub's warning of ffmpeg or avconv missing
with catch_warnings(record=True) as w:
import pydub
if w:
raise ModuleNotFoundError
import speech_recognition as sr
IS_AUDIO_TRANSCRIPTION_CAPABLE = True
except ModuleNotFoundError:
pass
finally:
resetwarnings()
# Optional YouTube transcription support
try:
@@ -161,9 +172,7 @@ class PlainTextConverter(DocumentConverter):
elif "text/" not in content_type.lower():
return None
text_content = ""
with open(local_path, "rt", encoding="utf-8") as fh:
text_content = fh.read()
text_content = str(from_path(local_path).best())
return DocumentConverterResult(
title=None,
text_content=text_content,
@@ -344,8 +353,11 @@ class YouTubeConverter(DocumentConverter):
assert isinstance(params["v"][0], str)
video_id = str(params["v"][0])
try:
youtube_transcript_languages = kwargs.get(
"youtube_transcript_languages", ("en",)
)
# Must be a single transcript.
- transcript = YouTubeTranscriptApi.get_transcript(video_id) # type: ignore
+ transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages) # type: ignore
transcript_text = " ".join([part["text"] for part in transcript]) # type: ignore
# Alternative formatting:
# formatter = TextFormatter()
@@ -492,7 +504,9 @@ class DocxConverter(HtmlConverter):
result = None
with open(local_path, "rb") as docx_file:
- result = mammoth.convert_to_html(docx_file)
+ style_map = kwargs.get("style_map", None)
+ result = mammoth.convert_to_html(docx_file, style_map=style_map)
html_content = result.value
result = self._convert(html_content)
@@ -582,6 +596,10 @@ class PptxConverter(HtmlConverter):
"\n" + self._convert(html_table).text_content.strip() + "\n"
)
# Charts
if shape.has_chart:
md_content += self._convert_chart_to_markdown(shape.chart)
# Text areas
elif shape.has_text_frame:
if shape == title:
@@ -616,6 +634,29 @@ class PptxConverter(HtmlConverter):
return True
return False
def _convert_chart_to_markdown(self, chart):
md = "\n\n### Chart"
if chart.has_title:
md += f": {chart.chart_title.text_frame.text}"
md += "\n\n"
data = []
category_names = [c.label for c in chart.plots[0].categories]
series_names = [s.name for s in chart.series]
data.append(["Category"] + series_names)
for idx, category in enumerate(category_names):
row = [category]
for series in chart.series:
row.append(series.values[idx])
data.append(row)
markdown_table = []
for row in data:
markdown_table.append("| " + " | ".join(map(str, row)) + " |")
header = markdown_table[0]
separator = "|" + "|".join(["---"] * len(data[0])) + "|"
return md + "\n".join([header, separator] + markdown_table[1:])
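
The table layout produced by `_convert_chart_to_markdown` can be tried without `python-pptx`. The following is a sketch only: `chart_rows_to_markdown` is a hypothetical stand-in that takes plain lists and a dict instead of a chart object, and the sample data is invented for illustration.

```python
def chart_rows_to_markdown(title, categories, series):
    """Hypothetical stand-in for _convert_chart_to_markdown using plain data.

    `series` maps a series name to its values, one value per category.
    """
    md = "\n\n### Chart"
    if title:
        md += f": {title}"
    md += "\n\n"

    # Header row first, then one row per category.
    rows = [["Category"] + list(series)]
    for idx, category in enumerate(categories):
        rows.append([category] + [values[idx] for values in series.values()])

    lines = ["| " + " | ".join(map(str, row)) + " |" for row in rows]
    separator = "|" + "|".join(["---"] * len(rows[0])) + "|"
    return md + "\n".join([lines[0], separator] + lines[1:])


# Invented sample data: two categories, two series.
table = chart_rows_to_markdown(
    "Sales", ["2003", "2004"], {"East": [10, 12], "West": [7, 9]}
)
print(table)
```

As in the diff, each series is assumed to hold exactly one value per category; a chart with missing points would need extra handling.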
class MediaConverter(DocumentConverter):
"""
@@ -754,7 +795,7 @@ class Mp3Converter(WavConverter):
class ImageConverter(MediaConverter):
"""
- Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an mlm_client is configured).
+ Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured).
"""
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
@@ -784,17 +825,17 @@ class ImageConverter(MediaConverter):
md_content += f"{f}: {metadata[f]}\n"
# Try describing the image with GPTV
- mlm_client = kwargs.get("mlm_client")
- mlm_model = kwargs.get("mlm_model")
- if mlm_client is not None and mlm_model is not None:
+ llm_client = kwargs.get("llm_client")
+ llm_model = kwargs.get("llm_model")
+ if llm_client is not None and llm_model is not None:
md_content += (
"\n# Description:\n"
- + self._get_mlm_description(
+ + self._get_llm_description(
local_path,
extension,
- mlm_client,
- mlm_model,
- prompt=kwargs.get("mlm_prompt"),
+ llm_client,
+ llm_model,
+ prompt=kwargs.get("llm_prompt"),
).strip()
+ "\n"
)
@@ -804,12 +845,10 @@ class ImageConverter(MediaConverter):
text_content=md_content,
)
- def _get_mlm_description(self, local_path, extension, client, model, prompt=None):
+ def _get_llm_description(self, local_path, extension, client, model, prompt=None):
if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image."
- sys.stderr.write(f"MLM Prompt:\n{prompt}\n")
data_uri = ""
with open(local_path, "rb") as image_file:
content_type, encoding = mimetypes.guess_type("_dummy" + extension)
@@ -837,6 +876,124 @@ class ImageConverter(MediaConverter):
return response.choices[0].message.content
class ZipConverter(DocumentConverter):
"""Converts ZIP files to markdown by extracting and converting all contained files.
The converter extracts the ZIP contents to a temporary directory, processes each file
using appropriate converters based on file extensions, and then combines the results
into a single markdown document. The temporary directory is cleaned up after processing.
Example output format:
```markdown
Content from the zip file `example.zip`:
## File: docs/readme.txt
This is the content of readme.txt
Multiple lines are preserved
## File: images/example.jpg
ImageSize: 1920x1080
DateTimeOriginal: 2024-02-15 14:30:00
Description: A beautiful landscape photo
## File: data/report.xlsx
## Sheet1
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| data1 | data2 | data3 |
| data4 | data5 | data6 |
```
Key features:
- Maintains original file structure in headings
- Processes nested files recursively
- Uses appropriate converters for each file type
- Preserves formatting of converted content
- Cleans up temporary files after processing
"""
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
# Bail if not a ZIP
extension = kwargs.get("file_extension", "")
if extension.lower() != ".zip":
return None
# Get parent converters list if available
parent_converters = kwargs.get("_parent_converters", [])
if not parent_converters:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] No converters available to process zip contents from: {local_path}",
)
extracted_zip_folder_name = (
f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
)
new_folder = os.path.normpath(
os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
)
md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
# Safety check for path traversal
if not new_folder.startswith(os.path.dirname(local_path)):
return DocumentConverterResult(
title=None, text_content=f"[ERROR] Invalid zip file path: {local_path}"
)
try:
# Extract the zip file
with zipfile.ZipFile(local_path, "r") as zipObj:
zipObj.extractall(path=new_folder)
# Process each extracted file
for root, dirs, files in os.walk(new_folder):
for name in files:
file_path = os.path.join(root, name)
relative_path = os.path.relpath(file_path, new_folder)
# Get file extension
_, file_extension = os.path.splitext(name)
# Update kwargs for the file
file_kwargs = kwargs.copy()
file_kwargs["file_extension"] = file_extension
file_kwargs["_parent_converters"] = parent_converters
# Try converting the file using available converters
for converter in parent_converters:
# Skip the zip converter to avoid infinite recursion
if isinstance(converter, ZipConverter):
continue
result = converter.convert(file_path, **file_kwargs)
if result is not None:
md_content += f"\n## File: {relative_path}\n\n"
md_content += result.text_content + "\n\n"
break
# Clean up extracted files if specified
if kwargs.get("cleanup_extracted", True):
shutil.rmtree(new_folder)
return DocumentConverterResult(title=None, text_content=md_content.strip())
except zipfile.BadZipFile:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
)
except Exception as e:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Failed to process zip file {local_path}: {str(e)}",
)
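
The extract, walk, and clean-up flow of `ZipConverter.convert` can be exercised with the standard library alone. This is a minimal sketch under stated assumptions: `list_zip_as_markdown` is a hypothetical function, and it reads every extracted file as UTF-8 text where the real converter dispatches to the registered converters.

```python
import os
import shutil
import tempfile
import zipfile


def list_zip_as_markdown(zip_path):
    """Hypothetical sketch: extract a ZIP next to itself, emit one heading per file."""
    folder = os.path.join(
        os.path.dirname(zip_path),
        f"extracted_{os.path.basename(zip_path).replace('.zip', '_zip')}",
    )
    md = f"Content from the zip file `{os.path.basename(zip_path)}`:\n\n"
    try:
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(path=folder)
        for root, _dirs, files in os.walk(folder):
            for name in sorted(files):
                file_path = os.path.join(root, name)
                relative_path = os.path.relpath(file_path, folder)
                # Assumption: plain UTF-8 text; the real code tries each converter.
                with open(file_path, "r", encoding="utf-8") as fh:
                    md += f"\n## File: {relative_path}\n\n{fh.read()}\n\n"
    finally:
        # Remove the extraction folder even if reading a file fails.
        shutil.rmtree(folder, ignore_errors=True)
    return md.strip()


# Build a small archive in a temporary directory to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("docs/readme.txt", "hello")
    result = list_zip_as_markdown(zip_path)
print(result)
```

One deliberate difference: the sketch cleans up in a `finally` block, whereas the code in the diff appears to call `shutil.rmtree` only on the success path.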
class FileConversionException(BaseException):
pass
@@ -852,16 +1009,50 @@ class MarkItDown:
def __init__(
self,
requests_session: Optional[requests.Session] = None,
llm_client: Optional[Any] = None,
llm_model: Optional[str] = None,
style_map: Optional[str] = None,
# Deprecated
mlm_client: Optional[Any] = None,
- mlm_model: Optional[Any] = None,
+ mlm_model: Optional[str] = None,
):
if requests_session is None:
self._requests_session = requests.Session()
else:
self._requests_session = requests_session
- self._mlm_client = mlm_client
- self._mlm_model = mlm_model
# Handle deprecation notices
#############################
if mlm_client is not None:
if llm_client is None:
warn(
"'mlm_client' is deprecated, and was renamed 'llm_client'.",
DeprecationWarning,
)
llm_client = mlm_client
mlm_client = None
else:
raise ValueError(
"'mlm_client' is deprecated, and was renamed 'llm_client'. Do not use both at the same time. Just use 'llm_client' instead."
)
if mlm_model is not None:
if llm_model is None:
warn(
"'mlm_model' is deprecated, and was renamed 'llm_model'.",
DeprecationWarning,
)
llm_model = mlm_model
mlm_model = None
else:
raise ValueError(
"'mlm_model' is deprecated, and was renamed 'llm_model'. Do not use both at the same time. Just use 'llm_model' instead."
)
#############################
self._llm_client = llm_client
self._llm_model = llm_model
self._style_map = style_map
self._page_converters: List[DocumentConverter] = []
@@ -880,6 +1071,7 @@ class MarkItDown:
self.register_page_converter(Mp3Converter())
self.register_page_converter(ImageConverter())
self.register_page_converter(PdfConverter())
self.register_page_converter(ZipConverter())
def convert(
self, source: Union[str, requests.Response], **kwargs: Any
@@ -1003,7 +1195,7 @@ class MarkItDown:
self._append_ext(extensions, g)
# Convert
- result = self._convert(temp_path, extensions, url=response.url)
+ result = self._convert(temp_path, extensions, url=response.url, **kwargs)
# Clean up
finally:
try:
@@ -1030,11 +1222,17 @@ class MarkItDown:
_kwargs.update({"file_extension": ext})
# Copy any additional global options
- if "mlm_client" not in _kwargs and self._mlm_client is not None:
- _kwargs["mlm_client"] = self._mlm_client
+ if "llm_client" not in _kwargs and self._llm_client is not None:
+ _kwargs["llm_client"] = self._llm_client
- if "mlm_model" not in _kwargs and self._mlm_model is not None:
- _kwargs["mlm_model"] = self._mlm_model
+ if "llm_model" not in _kwargs and self._llm_model is not None:
+ _kwargs["llm_model"] = self._llm_model
# Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._page_converters
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map
# If we hit an error log it and keep trying
try:
@@ -1071,7 +1269,6 @@ class MarkItDown:
if ext == "":
return
# if ext not in extensions:
if True:
extensions.append(ext)
def _guess_ext_magic(self, path):

0
tests/test_files/test.docx vendored Executable file → Normal file

0
tests/test_files/test.jpg vendored Executable file → Normal file
Image size before and after: 463 KiB

BIN
tests/test_files/test.pptx vendored Executable file → Normal file

Binary file not shown.

0
tests/test_files/test.xlsx vendored Executable file → Normal file

BIN
tests/test_files/test_files.zip vendored Normal file

Binary file not shown.

BIN
tests/test_files/test_llm.jpg vendored Normal file

Binary file not shown.

Image size: 145 KiB

4
tests/test_files/test_mskanji.csv vendored Normal file

@@ -0,0 +1,4 @@
名前,年齢,住所
佐藤太郎,30,東京
三木英子,25,大阪
髙橋淳,35,名古屋
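
This fixture backs the `PlainTextConverter` change from a hard-coded UTF-8 read to encoding detection. A stdlib-only sketch of why the old read fails on it, assuming the file is cp932 (as the test constant `CSV_CP932_TEST_STRINGS` suggests) and decoding with that codec explicitly rather than detecting it the way `charset_normalizer` does:

```python
# First two rows of the fixture, as asserted in the tests.
text = "名前,年齢,住所\n佐藤太郎,30,東京\n"
raw = text.encode("cp932")  # bytes as they would be stored on disk

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # This mirrors the pre-change path: open(..., encoding="utf-8") raises here.
    utf8_ok = False

# Decoding with the right codec recovers the original text.
decoded = raw.decode("cp932")
print(utf8_ok, decoded == text)
```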

BIN
tests/test_files/test_with_comment.docx vendored Normal file

Binary file not shown.

View File

@@ -6,11 +6,23 @@ import shutil
import pytest
import requests
from warnings import catch_warnings, resetwarnings
from markitdown import MarkItDown
skip_remote = (
True if os.environ.get("GITHUB_ACTIONS") else False
) # Don't run these tests in CI
# Don't run the llm tests without a key and the client library
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
try:
import openai
except ModuleNotFoundError:
skip_llm = True
# Skip exiftool tests if not installed
skip_exiftool = shutil.which("exiftool") is None
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
@@ -51,12 +63,25 @@ DOCX_TEST_STRINGS = [
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
]
DOCX_COMMENT_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
"# Abstract",
"# Introduction",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"This is a test comment. 12df-321a",
"Yet another comment in the doc. 55yiyi-asd09",
]
PPTX_TEST_STRINGS = [
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
"1b92870d-e3b5-4e65-8153-919f4ff45592",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
"2003", # chart value
]
BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
@@ -87,6 +112,17 @@ SERP_TEST_EXCLUDES = [
"data:image/svg+xml,%3Csvg%20width%3D",
]
CSV_CP932_TEST_STRINGS = [
"名前,年齢,住所",
"佐藤太郎,30,東京",
"三木英子,25,大阪",
"髙橋淳,35,名古屋",
]
LLM_TEST_STRINGS = [
"5bda1dd6",
]
@pytest.mark.skipif(
skip_remote,
@@ -130,6 +166,24 @@ def test_markitdown_local() -> None:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test DOCX processing, with comments
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
style_map="comment-reference => ",
)
for test_string in DOCX_COMMENT_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test DOCX processing, with comments and setting style_map on init
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
result = markitdown_with_style_map.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
)
for test_string in DOCX_COMMENT_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test PPTX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
for test_string in PPTX_TEST_STRINGS:
@@ -144,6 +198,12 @@ def test_markitdown_local() -> None:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test ZIP file processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
for test_string in DOCX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test Wikipedia processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
@@ -164,6 +224,12 @@ def test_markitdown_local() -> None:
for test_string in SERP_TEST_STRINGS:
assert test_string in text_content
## Test non-UTF-8 encoding
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
text_content = result.text_content.replace("\\", "")
for test_string in CSV_CP932_TEST_STRINGS:
assert test_string in text_content
@pytest.mark.skipif(
skip_exiftool,
@@ -179,8 +245,63 @@ def test_markitdown_exiftool() -> None:
assert target in result.text_content
def test_markitdown_deprecation() -> None:
try:
with catch_warnings(record=True) as w:
test_client = object()
markitdown = MarkItDown(mlm_client=test_client)
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert markitdown._llm_client == test_client
finally:
resetwarnings()
try:
with catch_warnings(record=True) as w:
markitdown = MarkItDown(mlm_model="gpt-4o")
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert markitdown._llm_model == "gpt-4o"
finally:
resetwarnings()
try:
test_client = object()
markitdown = MarkItDown(mlm_client=test_client, llm_client=test_client)
assert False
except ValueError:
pass
try:
markitdown = MarkItDown(mlm_model="gpt-4o", llm_model="gpt-4o")
assert False
except ValueError:
pass
@pytest.mark.skipif(
skip_llm,
reason="do not run llm tests without a key",
)
def test_markitdown_llm() -> None:
client = openai.OpenAI()
markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
for test_string in LLM_TEST_STRINGS:
assert test_string in result.text_content
# This is not super precise. It would also accept "red square", "blue circle",
# "the square is not blue", etc. But it's sufficient for this test.
for test_string in ["red", "circle", "blue", "square"]:
assert test_string in result.text_content.lower()
if __name__ == "__main__":
"""Runs this file's tests from the command line."""
test_markitdown_remote()
test_markitdown_local()
test_markitdown_exiftool()
test_markitdown_deprecation()
test_markitdown_llm()
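
The rename handling that `test_markitdown_deprecation` checks can be isolated from `MarkItDown.__init__`. The sketch below assumes a hypothetical helper `resolve_renamed`, which is not in the project; the `stacklevel=2` argument and the `simplefilter("always")` call are additions made for this example.

```python
import warnings


def resolve_renamed(old_value, new_value, old_name, new_name):
    """Hypothetical helper: the rename logic from __init__ for one argument pair."""
    if old_value is None:
        return new_value
    if new_value is not None:
        raise ValueError(
            f"'{old_name}' is deprecated, and was renamed '{new_name}'. "
            "Do not use both at the same time."
        )
    warnings.warn(
        f"'{old_name}' is deprecated, and was renamed '{new_name}'.",
        DeprecationWarning,
        stacklevel=2,
    )
    return old_value


with warnings.catch_warnings(record=True) as caught:
    # Record the warning regardless of the filters active in the caller.
    warnings.simplefilter("always")
    model = resolve_renamed("gpt-4o", None, "mlm_model", "llm_model")

print(model, len(caught), caught[0].category.__name__)
```

Passing both the old and the new name raises `ValueError`, matching the last two cases in the test.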