9 Commits

Author SHA1 Message Date
gagb
0c25a086e7 Merge branch 'main' into gagb/add-github-issue-conversion 2024-12-14 18:34:18 -08:00
gagb
8a30fca732 Add support for GH prs as well 2024-12-13 14:57:39 -08:00
gagb
0b6554738c Move github handling from convert to convert_url 2024-12-13 14:16:56 -08:00
gagb
f1274dca87 Run pre-commit 2024-12-13 13:58:24 -08:00
gagb
778fca3f70 Fix code scanning alert no. 1: Incomplete URL substring sanitization
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2024-12-13 13:57:03 -08:00
gagb
7979eecfef SHift to Documentconverter class 2024-12-13 13:52:37 -08:00
gagb
8f16f32d53 Add tests 2024-12-12 23:10:23 +00:00
gagb
28af7ad341 Run pre-commit 2024-12-12 22:39:03 +00:00
gagb
9d047103d5 Add method to convert GitHub issue to markdown
Add support for converting GitHub issues to markdown.

* Add `convert_github_issue` method in `src/markitdown/_markitdown.py` to handle GitHub issue conversion.
* Use `PyGithub` to fetch issue details using the provided token.
* Convert the issue details to markdown format and return as `DocumentConverterResult`.
* Add optional GitHub issue support with `IS_GITHUB_ISSUE_CAPABLE` flag.
2024-12-12 13:41:31 -08:00
17 changed files with 207 additions and 457 deletions

View File

@@ -1 +0,0 @@
*

1
.gitattributes vendored
View File

@@ -1 +0,0 @@
tests/test_files/** linguist-vendored

View File

@@ -1,16 +0,0 @@
FROM python:3.13-alpine
USER root
# Runtime dependency
RUN apk add --no-cache ffmpeg
RUN pip install markitdown
# Default USERID and GROUPID
ARG USERID=10000
ARG GROUPID=10000
USER $USERID:$GROUPID
ENTRYPOINT [ "markitdown" ]

View File

@@ -1,7 +1,5 @@
# MarkItDown # MarkItDown
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.) The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
It presently supports: It presently supports:
@@ -14,23 +12,7 @@ It presently supports:
- Audio (EXIF metadata, and speech transcription) - Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.) - HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.) - Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
# Installation
You can install `markitdown` using pip:
```python
pip install markitdown
```
or from the source
```sh
pip install -e .
```
# Usage
The API is simple: The API is simple:
```python ```python
@@ -41,44 +23,6 @@ result = markitdown.convert("test.xlsx")
print(result.text_content) print(result.text_content)
``` ```
To use this as a command-line utility, install it and then run it like this:
```bash
markitdown path-to-file.pdf
```
This will output Markdown to standard output. You can save it like this:
```bash
markitdown path-to-file.pdf > document.md
```
You can pipe content to standard input by omitting the argument:
```bash
cat path-to-file.pdf | markitdown
```
You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client.
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```
You can also use the project as Docker Image:
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
## Contributing ## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -93,24 +37,6 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### Running Tests
To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install).
```sh
pip install hatch
hatch shell
hatch test
```
### Running Pre-commit Checks
Please run the pre-commit checks before submitting a PR.
```sh
pre-commit run --all-files
```
## Trademarks ## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft

View File

@@ -38,8 +38,7 @@ dependencies = [
"youtube-transcript-api", "youtube-transcript-api",
"SpeechRecognition", "SpeechRecognition",
"pathvalidate", "pathvalidate",
"charset-normalizer", "pygithub"
"openai",
] ]
[project.urls] [project.urls]
@@ -78,6 +77,3 @@ exclude_lines = [
"if __name__ == .__main__.:", "if __name__ == .__main__.:",
"if TYPE_CHECKING:", "if TYPE_CHECKING:",
] ]
[tool.hatch.build.targets.sdist]
only-include = ["src/markitdown"]

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com> # SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
# #
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
__version__ = "0.0.1a3" __version__ = "0.0.1a1"

View File

@@ -2,15 +2,21 @@
# #
# SPDX-License-Identifier: MIT # SPDX-License-Identifier: MIT
import sys import sys
import argparse
from ._markitdown import MarkItDown from ._markitdown import MarkItDown
def main(): def main():
parser = argparse.ArgumentParser( if len(sys.argv) == 1:
description="Convert various file formats to markdown.", markitdown = MarkItDown()
formatter_class=argparse.RawDescriptionHelpFormatter, result = markitdown.convert_stream(sys.stdin.buffer)
usage=""" print(result.text_content)
elif len(sys.argv) == 2:
markitdown = MarkItDown()
result = markitdown.convert(sys.argv[1])
print(result.text_content)
else:
sys.stderr.write(
"""
SYNTAX: SYNTAX:
markitdown <OPTIONAL: FILENAME> markitdown <OPTIONAL: FILENAME>
@@ -27,20 +33,9 @@ EXAMPLE:
OR OR
markitdown < example.pdf markitdown < example.pdf
""".strip(), """.strip()
) + "\n"
)
parser.add_argument("filename", nargs="?")
args = parser.parse_args()
if args.filename is None:
markitdown = MarkItDown()
result = markitdown.convert_stream(sys.stdin.buffer)
print(result.text_content)
else:
markitdown = MarkItDown()
result = markitdown.convert(args.filename)
print(result.text_content)
if __name__ == "__main__": if __name__ == "__main__":
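
Note on the hunk above: the two sides differ mainly in how the optional filename is read. One side declares it as an optional `argparse` positional, the other inspects `sys.argv` directly. A minimal sketch of the `argparse` form follows; `build_parser` and `describe_source` are hypothetical names used only for illustration, and the sketch does not import `markitdown`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper for illustration; not part of the diff.
    parser = argparse.ArgumentParser(
        prog="markitdown",
        description="Convert various file formats to markdown.",
    )
    # nargs="?" makes the positional optional; it is None when omitted,
    # which the CLI treats as "read the document from standard input".
    parser.add_argument("filename", nargs="?")
    return parser


def describe_source(argv: list[str]) -> str:
    # Report which input the CLI would read for a given argument list.
    args = build_parser().parse_args(argv)
    return "stdin" if args.filename is None else f"file:{args.filename}"


if __name__ == "__main__":
    print(describe_source([]))
    print(describe_source(["example.pdf"]))
```

With `argparse`, extra arguments and `--help` are handled by the parser, so the hand-written usage message on the other side of the diff is not needed.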

View File

@@ -12,10 +12,8 @@ import subprocess
import sys import sys
import tempfile import tempfile
import traceback import traceback
import zipfile
from typing import Any, Dict, List, Optional, Union from typing import Any, Dict, List, Optional, Union
from urllib.parse import parse_qs, quote, unquote, urlparse, urlunparse from urllib.parse import parse_qs, quote, unquote, urlparse, urlunparse
from warnings import warn, resetwarnings, catch_warnings
import mammoth import mammoth
import markdownify import markdownify
@@ -28,24 +26,15 @@ import pptx
import puremagic import puremagic
import requests import requests
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from charset_normalizer import from_path
# Optional Transcription support # Optional Transcription support
try: try:
# Using warnings' catch_warnings to catch import pydub
# pydub's warning of ffmpeg or avconv missing
with catch_warnings(record=True) as w:
import pydub
if w:
raise ModuleNotFoundError
import speech_recognition as sr import speech_recognition as sr
IS_AUDIO_TRANSCRIPTION_CAPABLE = True IS_AUDIO_TRANSCRIPTION_CAPABLE = True
except ModuleNotFoundError: except ModuleNotFoundError:
pass pass
finally:
resetwarnings()
# Optional YouTube transcription support # Optional YouTube transcription support
try: try:
@@ -55,6 +44,14 @@ try:
except ModuleNotFoundError: except ModuleNotFoundError:
pass pass
# Optional GitHub issue support
try:
from github import Github
IS_GITHUB_ISSUE_CAPABLE = True
except ModuleNotFoundError:
IS_GITHUB_ISSUE_CAPABLE = False
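
Note on the block above: it sets a module-level capability flag from a guarded import. The same optional-dependency pattern can be sketched with a small helper; `optional_import` is a hypothetical name, and the module names below are placeholders chosen so the example runs without PyGithub installed.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed.

    Hypothetical helper; the diff uses an inline try/except instead.
    """
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None


# "json" ships with Python; the second name is assumed not to exist.
IS_JSON_CAPABLE = optional_import("json") is not None
IS_MISSING_CAPABLE = optional_import("no_such_module_for_demo") is not None
```

Setting the flag to `False` explicitly in the failure branch, as the GitHub block does, means later `if not IS_..._CAPABLE` checks never hit an undefined name.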
class _CustomMarkdownify(markdownify.MarkdownConverter): class _CustomMarkdownify(markdownify.MarkdownConverter):
""" """
@@ -172,7 +169,9 @@ class PlainTextConverter(DocumentConverter):
elif "text/" not in content_type.lower(): elif "text/" not in content_type.lower():
return None return None
text_content = str(from_path(local_path).best()) text_content = ""
with open(local_path, "rt", encoding="utf-8") as fh:
text_content = fh.read()
return DocumentConverterResult( return DocumentConverterResult(
title=None, title=None,
text_content=text_content, text_content=text_content,
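
Note on the hunk above: one side detects the file encoding with `charset_normalizer`, while the other always decodes as UTF-8. A standard-library sketch of why that matters for non-UTF-8 input, using a header row taken from the CP932 test strings elsewhere in this diff:

```python
# A CSV header row encoded as CP932 (Shift_JIS), as a legacy tool might write it.
raw = "名前,年齢,住所".encode("cp932")

# Decoding with the matching codec round-trips cleanly.
decoded = raw.decode("cp932")

# Decoding the same bytes strictly as UTF-8 raises instead of returning text.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

A converter that opens files with `encoding="utf-8"` would therefore fail on such a file unless the encoding is detected or passed in.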
@@ -353,11 +352,8 @@ class YouTubeConverter(DocumentConverter):
assert isinstance(params["v"][0], str) assert isinstance(params["v"][0], str)
video_id = str(params["v"][0]) video_id = str(params["v"][0])
try: try:
youtube_transcript_languages = kwargs.get(
"youtube_transcript_languages", ("en",)
)
# Must be a single transcript. # Must be a single transcript.
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=youtube_transcript_languages) # type: ignore transcript = YouTubeTranscriptApi.get_transcript(video_id) # type: ignore
transcript_text = " ".join([part["text"] for part in transcript]) # type: ignore transcript_text = " ".join([part["text"] for part in transcript]) # type: ignore
# Alternative formatting: # Alternative formatting:
# formatter = TextFormatter() # formatter = TextFormatter()
@@ -504,9 +500,7 @@ class DocxConverter(HtmlConverter):
result = None result = None
with open(local_path, "rb") as docx_file: with open(local_path, "rb") as docx_file:
style_map = kwargs.get("style_map", None) result = mammoth.convert_to_html(docx_file)
result = mammoth.convert_to_html(docx_file, style_map=style_map)
html_content = result.value html_content = result.value
result = self._convert(html_content) result = self._convert(html_content)
@@ -596,10 +590,6 @@ class PptxConverter(HtmlConverter):
"\n" + self._convert(html_table).text_content.strip() + "\n" "\n" + self._convert(html_table).text_content.strip() + "\n"
) )
# Charts
if shape.has_chart:
md_content += self._convert_chart_to_markdown(shape.chart)
# Text areas # Text areas
elif shape.has_text_frame: elif shape.has_text_frame:
if shape == title: if shape == title:
@@ -634,29 +624,6 @@ class PptxConverter(HtmlConverter):
return True return True
return False return False
def _convert_chart_to_markdown(self, chart):
md = "\n\n### Chart"
if chart.has_title:
md += f": {chart.chart_title.text_frame.text}"
md += "\n\n"
data = []
category_names = [c.label for c in chart.plots[0].categories]
series_names = [s.name for s in chart.series]
data.append(["Category"] + series_names)
for idx, category in enumerate(category_names):
row = [category]
for series in chart.series:
row.append(series.values[idx])
data.append(row)
markdown_table = []
for row in data:
markdown_table.append("| " + " | ".join(map(str, row)) + " |")
header = markdown_table[0]
separator = "|" + "|".join(["---"] * len(data[0])) + "|"
return md + "\n".join([header, separator] + markdown_table[1:])
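
Note on `_convert_chart_to_markdown` above: its last step turns a list of rows into a pipe table with a separator after the header. That step can be sketched on its own as below; `rows_to_markdown_table` is a hypothetical standalone helper and the sample data is made up.

```python
from typing import List, Sequence


def rows_to_markdown_table(rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a pipe table; the first row is the header.

    Hypothetical helper mirroring the table-building step in the diff.
    """
    lines: List[str] = ["| " + " | ".join(map(str, row)) + " |" for row in rows]
    separator = "|" + "|".join(["---"] * len(rows[0])) + "|"
    return "\n".join([lines[0], separator] + lines[1:])


table = rows_to_markdown_table(
    [
        ["Category", "Series 1", "Series 2"],  # made-up sample data
        ["2003", 4.3, 2.4],
        ["2004", 2.5, 4.4],
    ]
)
print(table)
```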
class MediaConverter(DocumentConverter): class MediaConverter(DocumentConverter):
""" """
@@ -795,7 +762,7 @@ class Mp3Converter(WavConverter):
class ImageConverter(MediaConverter): class ImageConverter(MediaConverter):
""" """
Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an llm_client is configured). Converts images to markdown via extraction of metadata (if `exiftool` is installed), OCR (if `easyocr` is installed), and description via a multimodal LLM (if an mlm_client is configured).
""" """
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]: def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
@@ -825,17 +792,17 @@ class ImageConverter(MediaConverter):
md_content += f"{f}: {metadata[f]}\n" md_content += f"{f}: {metadata[f]}\n"
# Try describing the image with GPTV # Try describing the image with GPTV
llm_client = kwargs.get("llm_client") mlm_client = kwargs.get("mlm_client")
llm_model = kwargs.get("llm_model") mlm_model = kwargs.get("mlm_model")
if llm_client is not None and llm_model is not None: if mlm_client is not None and mlm_model is not None:
md_content += ( md_content += (
"\n# Description:\n" "\n# Description:\n"
+ self._get_llm_description( + self._get_mlm_description(
local_path, local_path,
extension, extension,
llm_client, mlm_client,
llm_model, mlm_model,
prompt=kwargs.get("llm_prompt"), prompt=kwargs.get("mlm_prompt"),
).strip() ).strip()
+ "\n" + "\n"
) )
@@ -845,10 +812,12 @@ class ImageConverter(MediaConverter):
text_content=md_content, text_content=md_content,
) )
def _get_llm_description(self, local_path, extension, client, model, prompt=None): def _get_mlm_description(self, local_path, extension, client, model, prompt=None):
if prompt is None or prompt.strip() == "": if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image." prompt = "Write a detailed caption for this image."
sys.stderr.write(f"MLM Prompt:\n{prompt}\n")
data_uri = "" data_uri = ""
with open(local_path, "rb") as image_file: with open(local_path, "rb") as image_file:
content_type, encoding = mimetypes.guess_type("_dummy" + extension) content_type, encoding = mimetypes.guess_type("_dummy" + extension)
@@ -876,122 +845,126 @@ class ImageConverter(MediaConverter):
return response.choices[0].message.content return response.choices[0].message.content
class ZipConverter(DocumentConverter): class GitHubIssueConverter(DocumentConverter):
"""Converts ZIP files to markdown by extracting and converting all contained files. """Converts GitHub issues and pull requests to Markdown."""
The converter extracts the ZIP contents to a temporary directory, processes each file def convert(self, github_url, github_token) -> Union[None, DocumentConverterResult]:
using appropriate converters based on file extensions, and then combines the results # Bail if not a valid GitHub issue or pull request URL
into a single markdown document. The temporary directory is cleaned up after processing. if github_url:
parsed_url = urlparse(github_url)
path_parts = parsed_url.path.strip("/").split("/")
if len(path_parts) < 4 or path_parts[2] not in ["issues", "pull"]:
return None
Example output format: if not github_token:
```markdown raise ValueError(
Content from the zip file `example.zip`: "GitHub token is not set. Cannot convert GitHub issue or pull request."
)
## File: docs/readme.txt if path_parts[2] == "issues":
return self._convert_github_issue(github_url, github_token)
elif path_parts[2] == "pull":
return self._convert_github_pr(github_url, github_token)
This is the content of readme.txt return None
Multiple lines are preserved
## File: images/example.jpg def _convert_github_issue(
self, issue_url: str, github_token: str
ImageSize: 1920x1080 ) -> DocumentConverterResult:
DateTimeOriginal: 2024-02-15 14:30:00 """
Description: A beautiful landscape photo Convert a GitHub issue to a markdown document.
Args:
## File: data/report.xlsx issue_url (str): The URL of the GitHub issue to convert.
github_token (str): A GitHub token with access to the repository.
## Sheet1 Returns:
| Column1 | Column2 | Column3 | DocumentConverterResult: The result containing the issue title and markdown content.
|---------|---------|---------| Raises:
| data1 | data2 | data3 | ImportError: If the PyGithub library is not installed.
| data4 | data5 | data6 | ValueError: If the provided URL is not a valid GitHub issue URL.
``` """
if not IS_GITHUB_ISSUE_CAPABLE:
Key features: raise ImportError(
- Maintains original file structure in headings "PyGithub is not installed. Please install it to use this feature."
- Processes nested files recursively
- Uses appropriate converters for each file type
- Preserves formatting of converted content
- Cleans up temporary files after processing
"""
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
# Bail if not a ZIP
extension = kwargs.get("file_extension", "")
if extension.lower() != ".zip":
return None
# Get parent converters list if available
parent_converters = kwargs.get("_parent_converters", [])
if not parent_converters:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] No converters available to process zip contents from: {local_path}",
) )
extracted_zip_folder_name = ( # Parse the issue URL
f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}" parsed_url = urlparse(issue_url)
path_parts = parsed_url.path.strip("/").split("/")
if len(path_parts) < 4 or path_parts[2] != "issues":
raise ValueError("Invalid GitHub issue URL")
owner, repo, _, issue_number = path_parts[:4]
# Authenticate with GitHub
g = Github(github_token)
repo = g.get_repo(f"{owner}/{repo}")
issue = repo.get_issue(int(issue_number))
# Convert issue details to markdown
markdown_content = f"# {issue.title}\n\n{issue.body}\n\n"
markdown_content += f"**State:** {issue.state}\n"
markdown_content += f"**Created at:** {issue.created_at}\n"
markdown_content += f"**Updated at:** {issue.updated_at}\n"
markdown_content += f"**Comments:**\n"
for comment in issue.get_comments():
markdown_content += (
f"- {comment.user.login} ({comment.created_at}): {comment.body}\n"
)
return DocumentConverterResult(
title=issue.title,
text_content=markdown_content,
) )
new_folder = os.path.normpath(
os.path.join(os.path.dirname(local_path), extracted_zip_folder_name) def _convert_github_pr(
self, pr_url: str, github_token: str
) -> DocumentConverterResult:
"""
Convert a GitHub pull request to a markdown document.
Args:
pr_url (str): The URL of the GitHub pull request to convert.
github_token (str): A GitHub token with access to the repository.
Returns:
DocumentConverterResult: The result containing the pull request title and markdown content.
Raises:
ImportError: If the PyGithub library is not installed.
ValueError: If the provided URL is not a valid GitHub pull request URL.
"""
if not IS_GITHUB_ISSUE_CAPABLE:
raise ImportError(
"PyGithub is not installed. Please install it to use this feature."
)
# Parse the pull request URL
parsed_url = urlparse(pr_url)
path_parts = parsed_url.path.strip("/").split("/")
if len(path_parts) < 4 or path_parts[2] != "pull":
raise ValueError("Invalid GitHub pull request URL")
owner, repo, _, pr_number = path_parts[:4]
# Authenticate with GitHub
g = Github(github_token)
repo = g.get_repo(f"{owner}/{repo}")
pr = repo.get_pull(int(pr_number))
# Convert pull request details to markdown
markdown_content = f"# {pr.title}\n\n{pr.body}\n\n"
markdown_content += f"**State:** {pr.state}\n"
markdown_content += f"**Created at:** {pr.created_at}\n"
markdown_content += f"**Updated at:** {pr.updated_at}\n"
markdown_content += f"**Comments:**\n"
for comment in pr.get_issue_comments():
markdown_content += (
f"- {comment.user.login} ({comment.created_at}): {comment.body}\n"
)
return DocumentConverterResult(
title=pr.title,
text_content=markdown_content,
) )
md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
# Safety check for path traversal
if not new_folder.startswith(os.path.dirname(local_path)):
return DocumentConverterResult(
title=None, text_content=f"[ERROR] Invalid zip file path: {local_path}"
)
try:
# Extract the zip file
with zipfile.ZipFile(local_path, "r") as zipObj:
zipObj.extractall(path=new_folder)
# Process each extracted file
for root, dirs, files in os.walk(new_folder):
for name in files:
file_path = os.path.join(root, name)
relative_path = os.path.relpath(file_path, new_folder)
# Get file extension
_, file_extension = os.path.splitext(name)
# Update kwargs for the file
file_kwargs = kwargs.copy()
file_kwargs["file_extension"] = file_extension
file_kwargs["_parent_converters"] = parent_converters
# Try converting the file using available converters
for converter in parent_converters:
# Skip the zip converter to avoid infinite recursion
if isinstance(converter, ZipConverter):
continue
result = converter.convert(file_path, **file_kwargs)
if result is not None:
md_content += f"\n## File: {relative_path}\n\n"
md_content += result.text_content + "\n\n"
break
# Clean up extracted files if specified
if kwargs.get("cleanup_extracted", True):
shutil.rmtree(new_folder)
return DocumentConverterResult(title=None, text_content=md_content.strip())
except zipfile.BadZipFile:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
)
except Exception as e:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Failed to process zip file {local_path}: {str(e)}",
)
class FileConversionException(BaseException): class FileConversionException(BaseException):
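
Note on `GitHubIssueConverter` above: both private methods repeat the same URL parsing (`owner/repo/{issues|pull}/number`). A sketch of that parsing as one function, assuming nothing beyond the standard library; `GitHubRef` and `parse_github_ref` are hypothetical names, and the hostname and integer checks are additions not present in the converter's own methods.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class GitHubRef(NamedTuple):
    # Hypothetical container for illustration; the diff unpacks a list instead.
    owner: str
    repo: str
    kind: str  # "issues" or "pull"
    number: int


def parse_github_ref(url: str) -> Optional[GitHubRef]:
    """Return the parsed reference, or None if the URL is not an issue or PR."""
    parsed = urlparse(url)
    if parsed.hostname != "github.com":
        return None
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 4 or parts[2] not in ("issues", "pull"):
        return None
    owner, repo, kind, number = parts[:4]
    try:
        return GitHubRef(owner, repo, kind, int(number))
    except ValueError:
        # Added guard: a non-numeric fourth segment is treated as "not a match".
        return None
```

Parsing once and dispatching on `kind` would remove the duplicated validation in `_convert_github_issue` and `_convert_github_pr`.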
@@ -1009,50 +982,16 @@ class MarkItDown:
def __init__( def __init__(
self, self,
requests_session: Optional[requests.Session] = None, requests_session: Optional[requests.Session] = None,
llm_client: Optional[Any] = None,
llm_model: Optional[str] = None,
style_map: Optional[str] = None,
# Deprecated
mlm_client: Optional[Any] = None, mlm_client: Optional[Any] = None,
mlm_model: Optional[str] = None, mlm_model: Optional[Any] = None,
): ):
if requests_session is None: if requests_session is None:
self._requests_session = requests.Session() self._requests_session = requests.Session()
else: else:
self._requests_session = requests_session self._requests_session = requests_session
# Handle deprecation notices self._mlm_client = mlm_client
############################# self._mlm_model = mlm_model
if mlm_client is not None:
if llm_client is None:
warn(
"'mlm_client' is deprecated, and was renamed 'llm_client'.",
DeprecationWarning,
)
llm_client = mlm_client
mlm_client = None
else:
raise ValueError(
"'mlm_client' is deprecated, and was renamed 'llm_client'. Do not use both at the same time. Just use 'llm_client' instead."
)
if mlm_model is not None:
if llm_model is None:
warn(
"'mlm_model' is deprecated, and was renamed 'llm_model'.",
DeprecationWarning,
)
llm_model = mlm_model
mlm_model = None
else:
raise ValueError(
"'mlm_model' is deprecated, and was renamed 'llm_model'. Do not use both at the same time. Just use 'llm_model' instead."
)
#############################
self._llm_client = llm_client
self._llm_model = llm_model
self._style_map = style_map
self._page_converters: List[DocumentConverter] = [] self._page_converters: List[DocumentConverter] = []
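
Note on the hunk above: one side accepts `mlm_client` and `mlm_model` as deprecated aliases for `llm_client` and `llm_model`, warning on use and rejecting calls that pass both. The repeated branches can be sketched as a single helper; `resolve_renamed` is a hypothetical name and is not part of either side of the diff.

```python
import warnings
from typing import Any, Optional


def resolve_renamed(
    new_value: Optional[Any],
    old_value: Optional[Any],
    new_name: str,
    old_name: str,
) -> Optional[Any]:
    """Hypothetical helper: accept a deprecated keyword as an alias for its new name."""
    if old_value is None:
        return new_value
    if new_value is not None:
        raise ValueError(
            f"'{old_name}' is deprecated, and was renamed '{new_name}'. "
            "Do not use both at the same time."
        )
    warnings.warn(
        f"'{old_name}' is deprecated, and was renamed '{new_name}'.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
    return old_value
```

A constructor could then call it once per renamed parameter, for example `llm_client = resolve_renamed(llm_client, mlm_client, "llm_client", "mlm_client")`.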
@@ -1071,7 +1010,6 @@ class MarkItDown:
self.register_page_converter(Mp3Converter()) self.register_page_converter(Mp3Converter())
self.register_page_converter(ImageConverter()) self.register_page_converter(ImageConverter())
self.register_page_converter(PdfConverter()) self.register_page_converter(PdfConverter())
self.register_page_converter(ZipConverter())
def convert( def convert(
self, source: Union[str, requests.Response], **kwargs: Any self, source: Union[str, requests.Response], **kwargs: Any
@@ -1081,7 +1019,6 @@ class MarkItDown:
- source: can be a string representing a path or url, or a requests.response object - source: can be a string representing a path or url, or a requests.response object
- extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.) - extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
""" """
# Local path or url # Local path or url
if isinstance(source, str): if isinstance(source, str):
if ( if (
@@ -1096,6 +1033,28 @@ class MarkItDown:
elif isinstance(source, requests.Response): elif isinstance(source, requests.Response):
return self.convert_response(source, **kwargs) return self.convert_response(source, **kwargs)
def convert_url(
self, url: str, **kwargs: Any
) -> DocumentConverterResult: # TODO: fix kwargs type
# Handle GitHub issue and pull request URLs directly
parsed_url = urlparse(url)
if parsed_url.hostname == "github.com" and any(
x in parsed_url.path for x in ["/issues/", "/pull/"]
):
github_token = kwargs.get("github_token", os.getenv("GITHUB_TOKEN"))
if not github_token:
raise ValueError(
"GitHub token is required for GitHub issue or pull request conversion."
)
return GitHubIssueConverter().convert(
github_url=url, github_token=github_token
)
# Send a HTTP request to the URL
response = self._requests_session.get(url, stream=True)
response.raise_for_status()
return self.convert_response(response, **kwargs)
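
Note on `convert_url` above: it compares the parsed hostname to `github.com` exactly. Commit 778fca3f70 is titled "Incomplete URL substring sanitization"; the sketch below shows the kind of look-alike URL that a plain substring test would accept. Both function names are hypothetical, and `naive_check` is an assumed example for contrast, not code from this branch.

```python
from urllib.parse import urlparse


def is_github_issue_or_pr(url: str) -> bool:
    """Compare the parsed hostname exactly rather than searching the raw URL."""
    parsed = urlparse(url)
    return parsed.hostname == "github.com" and any(
        marker in parsed.path for marker in ("/issues/", "/pull/")
    )


def naive_check(url: str) -> bool:
    # Shown only for contrast: a substring test accepts look-alike URLs.
    return "github.com" in url and "/issues/" in url


real = "https://github.com/microsoft/autogen/issues/1421"
lookalike = "https://example.com/github.com/issues/1"
print(is_github_issue_or_pr(real), is_github_issue_or_pr(lookalike), naive_check(lookalike))
```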
def convert_local( def convert_local(
self, path: str, **kwargs: Any self, path: str, **kwargs: Any
) -> DocumentConverterResult: # TODO: deal with kwargs ) -> DocumentConverterResult: # TODO: deal with kwargs
@@ -1150,14 +1109,6 @@ class MarkItDown:
return result return result
def convert_url(
self, url: str, **kwargs: Any
) -> DocumentConverterResult: # TODO: fix kwargs type
# Send a HTTP request to the URL
response = self._requests_session.get(url, stream=True)
response.raise_for_status()
return self.convert_response(response, **kwargs)
def convert_response( def convert_response(
self, response: requests.Response, **kwargs: Any self, response: requests.Response, **kwargs: Any
) -> DocumentConverterResult: # TODO fix kwargs type ) -> DocumentConverterResult: # TODO fix kwargs type
@@ -1195,7 +1146,7 @@ class MarkItDown:
self._append_ext(extensions, g) self._append_ext(extensions, g)
# Convert # Convert
result = self._convert(temp_path, extensions, url=response.url, **kwargs) result = self._convert(temp_path, extensions, url=response.url)
# Clean up # Clean up
finally: finally:
try: try:
@@ -1222,17 +1173,11 @@ class MarkItDown:
_kwargs.update({"file_extension": ext}) _kwargs.update({"file_extension": ext})
# Copy any additional global options # Copy any additional global options
if "llm_client" not in _kwargs and self._llm_client is not None: if "mlm_client" not in _kwargs and self._mlm_client is not None:
_kwargs["llm_client"] = self._llm_client _kwargs["mlm_client"] = self._mlm_client
if "llm_model" not in _kwargs and self._llm_model is not None: if "mlm_model" not in _kwargs and self._mlm_model is not None:
_kwargs["llm_model"] = self._llm_model _kwargs["mlm_model"] = self._mlm_model
# Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._page_converters
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map
# If we hit an error log it and keep trying # If we hit an error log it and keep trying
try: try:
@@ -1269,7 +1214,8 @@ class MarkItDown:
if ext == "": if ext == "":
return return
# if ext not in extensions: # if ext not in extensions:
extensions.append(ext) if True:
extensions.append(ext)
def _guess_ext_magic(self, path): def _guess_ext_magic(self, path):
"""Use puremagic (a Python implementation of libmagic) to guess a file's extension based on the first few bytes.""" """Use puremagic (a Python implementation of libmagic) to guess a file's extension based on the first few bytes."""

0
tests/test_files/test.docx Normal file → Executable file
View File

0
tests/test_files/test.jpg Normal file → Executable file
View File

[Image preview omitted: 463 KiB both before and after.]

BIN
tests/test_files/test.pptx Normal file → Executable file

Binary file not shown.

0
tests/test_files/test.xlsx Normal file → Executable file
View File

Binary file not shown.

Binary file not shown.

[Image preview omitted: 145 KiB, shown on the "before" side only.]

View File

@@ -1,4 +0,0 @@
名前,年齢,住所
佐藤太郎,30,東京
三木英子,25,大阪
髙橋淳,35,名古屋

View File

@@ -6,23 +6,11 @@ import shutil
import pytest import pytest
import requests import requests
from warnings import catch_warnings, resetwarnings
from markitdown import MarkItDown from markitdown import MarkItDown
skip_remote = ( skip_remote = (
True if os.environ.get("GITHUB_ACTIONS") else False True if os.environ.get("GITHUB_ACTIONS") else False
) # Don't run these tests in CI ) # Don't run these tests in CI
# Don't run the llm tests without a key and the client library
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
try:
import openai
except ModuleNotFoundError:
skip_llm = True
# Skip exiftool tests if not installed
skip_exiftool = shutil.which("exiftool") is None skip_exiftool = shutil.which("exiftool") is None
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files") TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
@@ -63,25 +51,12 @@ DOCX_TEST_STRINGS = [
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation", "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
] ]
DOCX_COMMENT_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
"# Abstract",
"# Introduction",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"This is a test comment. 12df-321a",
"Yet another comment in the doc. 55yiyi-asd09",
]
PPTX_TEST_STRINGS = [ PPTX_TEST_STRINGS = [
"2cdda5c8-e50e-4db4-b5f0-9722a649f455", "2cdda5c8-e50e-4db4-b5f0-9722a649f455",
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12", "04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a", "44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
"1b92870d-e3b5-4e65-8153-919f4ff45592", "1b92870d-e3b5-4e65-8153-919f4ff45592",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation", "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
"2003", # chart value
] ]
BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math" BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
@@ -112,16 +87,9 @@ SERP_TEST_EXCLUDES = [
"data:image/svg+xml,%3Csvg%20width%3D", "data:image/svg+xml,%3Csvg%20width%3D",
] ]
CSV_CP932_TEST_STRINGS = [ GITHUB_ISSUE_URL = "https://github.com/microsoft/autogen/issues/1421"
"名前,年齢,住所", GITHUB_PR_URL = "https://github.com/microsoft/autogen/pull/194"
"佐藤太郎,30,東京", GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")
"三木英子,25,大阪",
"髙橋淳,35,名古屋",
]
LLM_TEST_STRINGS = [
"5bda1dd6",
]
@pytest.mark.skipif( @pytest.mark.skipif(
@@ -166,24 +134,6 @@ def test_markitdown_local() -> None:
         text_content = result.text_content.replace("\\", "")
         assert test_string in text_content

-    # Test DOCX processing, with comments
-    result = markitdown.convert(
-        os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
-        style_map="comment-reference => ",
-    )
-    for test_string in DOCX_COMMENT_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
-
-    # Test DOCX processing, with comments and setting style_map on init
-    markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
-    result = markitdown_with_style_map.convert(
-        os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
-    )
-    for test_string in DOCX_COMMENT_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
-
     # Test PPTX processing
     result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
     for test_string in PPTX_TEST_STRINGS:
@@ -198,12 +148,6 @@ def test_markitdown_local() -> None:
         text_content = result.text_content.replace("\\", "")
         assert test_string in text_content

-    # Test ZIP file processing
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
-    for test_string in DOCX_TEST_STRINGS:
-        text_content = result.text_content.replace("\\", "")
-        assert test_string in text_content
-
     # Test Wikipedia processing
     result = markitdown.convert(
         os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
@@ -224,12 +168,6 @@ def test_markitdown_local() -> None:
     for test_string in SERP_TEST_STRINGS:
         assert test_string in text_content

-    ## Test non-UTF-8 encoding
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
-    text_content = result.text_content.replace("\\", "")
-    for test_string in CSV_CP932_TEST_STRINGS:
-        assert test_string in text_content
-

 @pytest.mark.skipif(
     skip_exiftool,
@@ -245,57 +183,28 @@ def test_markitdown_exiftool() -> None:
         assert target in result.text_content


-def test_markitdown_deprecation() -> None:
-    try:
-        with catch_warnings(record=True) as w:
-            test_client = object()
-            markitdown = MarkItDown(mlm_client=test_client)
-            assert len(w) == 1
-            assert w[0].category is DeprecationWarning
-            assert markitdown._llm_client == test_client
-    finally:
-        resetwarnings()
-
-    try:
-        with catch_warnings(record=True) as w:
-            markitdown = MarkItDown(mlm_model="gpt-4o")
-            assert len(w) == 1
-            assert w[0].category is DeprecationWarning
-            assert markitdown._llm_model == "gpt-4o"
-    finally:
-        resetwarnings()
-
-    try:
-        test_client = object()
-        markitdown = MarkItDown(mlm_client=test_client, llm_client=test_client)
-        assert False
-    except ValueError:
-        pass
-
-    try:
-        markitdown = MarkItDown(mlm_model="gpt-4o", llm_model="gpt-4o")
-        assert False
-    except ValueError:
-        pass
+@pytest.mark.skipif(
+    not GITHUB_TOKEN,
+    reason="GitHub token not provided",
+)
+def test_markitdown_github_issue() -> None:
+    markitdown = MarkItDown()
+    result = markitdown.convert(GITHUB_ISSUE_URL, github_token=GITHUB_TOKEN)
+    print(result.text_content)
+    assert "User-Defined Functions" in result.text_content
+    assert "closed" in result.text_content
+    assert "Comments:" in result.text_content


 @pytest.mark.skipif(
-    skip_llm,
-    reason="do not run llm tests without a key",
+    not GITHUB_TOKEN,
+    reason="GitHub token not provided",
 )
-def test_markitdown_llm() -> None:
-    client = openai.OpenAI()
-    markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
-
-    result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
-
-    for test_string in LLM_TEST_STRINGS:
-        assert test_string in result.text_content
-
-    # This is not super precise. It would also accept "red square", "blue circle",
-    # "the square is not blue", etc. But it's sufficient for this test.
-    for test_string in ["red", "circle", "blue", "square"]:
-        assert test_string in result.text_content.lower()
+def test_markitdown_github_pr() -> None:
+    markitdown = MarkItDown()
+    result = markitdown.convert(GITHUB_PR_URL, github_token=GITHUB_TOKEN)
+    print(result.text_content)
+    assert "faq" in result.text_content


 if __name__ == "__main__":
@@ -303,5 +212,5 @@ if __name__ == "__main__":
     test_markitdown_remote()
     test_markitdown_local()
     test_markitdown_exiftool()
-    test_markitdown_deprecation()
-    test_markitdown_llm()
+    test_markitdown_github_issue()
+    test_markitdown_github_pr()
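
Review note: the new tests pass a GitHub issue URL and a pull-request URL straight to `convert`, and one commit on this branch fixes an "Incomplete URL substring sanitization" alert. The sketch below shows one way such URLs could be recognised by comparing the parsed hostname rather than searching for a substring. `parse_github_url` is a hypothetical helper written for illustration; it is not taken from `_markitdown.py`, and the actual implementation on this branch may differ.

```python
from urllib.parse import urlparse


# Hypothetical helper for illustration only; not part of markitdown.
def parse_github_url(url: str):
    """Return (owner, repo, kind, number) for an issue or PR URL, else None."""
    parsed = urlparse(url)
    # Compare the whole hostname instead of testing for a substring, so a
    # host such as "github.com.example.org" is rejected.
    if parsed.scheme != "https" or parsed.hostname != "github.com":
        return None
    # Expected path shape: /<owner>/<repo>/issues/<n> or /<owner>/<repo>/pull/<n>
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 4 or parts[2] not in ("issues", "pull"):
        return None
    if not parts[3].isdecimal():
        return None
    kind = "issue" if parts[2] == "issues" else "pr"
    return parts[0], parts[1], kind, int(parts[3])
```

With the two URLs used by the tests, this helper would yield `("microsoft", "autogen", "issue", 1421)` and `("microsoft", "autogen", "pr", 194)`; a URL that merely contains `github.com` in its query string or as a hostname prefix yields `None`.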