97 Commits

Author SHA1 Message Date
Adam Fourney
326d17b802 Bump version. 2025-02-28 07:29:12 -08:00
Hieu Lam
519fe172aa Unable to convert HTML to Markdown (#1072)
* feat: issue where inherited function from `markdownify.MarkdownConverter` doesn't have `current_tags` leading to error using `kwargs`, also set default value for `convert_as_inline`
2025-02-28 00:57:41 -08:00
Adam Fourney
abe9752438 Bumped version 2025-02-10 16:01:17 -08:00
wunde005
73ba69d8cd For csv files mimetypes.guess_type is returning "application/vnd.ms-excel" on windows causing an invalid mime type in plaintextconverter. In reference to issue: https://github.com/microsoft/markitdown/issues/150 (#273) 2025-02-08 20:58:13 -08:00
Werner Robitza
2a4f7bb6a8 fix: argparse CLI option ordering, fixes #268 (#290)
* fix: argparse CLI option ordering, fixes #268
* Fixed formatting.
2025-02-08 20:50:38 -08:00
masquare
7cf5e0bb23 feat(pptx): support image description with LLM for pptx files (#306) 2025-02-08 20:37:34 -08:00
James Hickey
3090917a49 Typo fixed (#270) 2025-02-08 20:30:13 -08:00
ZeyuTeng96
7bea2672a0 remove leading and trailing \n for HtmlConverter (#262) 2025-02-08 20:28:35 -08:00
KennyZhang1
bf6a15e9b5 Kennyzhang/docintel docs (#312)
* updated docs to include doc intelligence

* include reference to doc intel setup docs
2025-01-31 22:23:26 -08:00
KennyZhang1
bfde857420 Add support for conversion via Document Intelligence (#303)
* added cli params for doc intel

* added DocumentIntelligenceConverter class implementation

* initialized doc intel client instance field

* added isolated doc_intel main conversion function

* temp fix for ContentFormat import bug

* ran tests for docintel and offline for many filetypes

* push doc intel converter to the top of the stack

* formatting changes

* modified project toml file
2025-01-24 14:09:32 -08:00
afourney
f58a864951 Set exiftool path explicitly. (#267) 2025-01-06 12:43:47 -08:00
afourney
265aea2edf Removed the holiday away message from README.md (#266) 2025-01-06 09:06:21 -08:00
afourney
05b78e7ce1 Recognize json as plain text (if no other handlers are present). (#261)
* Recognize json as plain text (if no other handlers are present).
2025-01-03 16:40:43 -08:00
afourney
436407288f If puremagic has no guesses, try again after ltrim. (#260) 2025-01-03 16:03:11 -08:00
afourney
731b39e7f5 Added a test for leading spaces. (#258) 2025-01-03 14:34:33 -08:00
yeungadrian
08ed32869e Feature/ Add xls support (#169)
* add xlrd
* add xls converter with tests
2025-01-03 13:58:17 -08:00
Murat Can Kurtuluş
d248621ba4 feat: outlook ".msg" file converter (#196)
* feat: outlook .msg converter
* add test, adjust docstring
2025-01-03 13:34:39 -08:00
AbSadiki
4678c8a2a4 fix(transcription): IS_AUDIO_TRANSCRIPTION_CAPABLE should be iniztialized (#194) 2025-01-03 13:29:26 -08:00
Ikko Eltociear Ashimine
125e206047 docs: update README.md (#182)
faciliate -> facilitate
2024-12-21 01:51:30 -08:00
numekudi
f94d09990e feat: enable Git support in devcontainer (#136)
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 18:09:17 -08:00
lumin
cfd2319c14 feat: add version option to markitdown CLI (#172)
Add a `--version` option to the markitdown command-line interface 
that displays the current version number.
2024-12-20 16:24:45 -08:00
dependabot[bot]
73161982ff Bump actions/setup-python from 2 to 5 (#179)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:20:22 -08:00
dependabot[bot]
9b69467772 Bump actions/cache from 3 to 4 (#178)
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:17:43 -08:00
gagb
857a2d160d Update README.md (#180) 2024-12-20 14:49:20 -08:00
Soulter
1123392306 fix: support -o param to avoid encoding issues (#116)
* perf: cli supports -o param

* doc: update README

---------

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:43:00 -08:00
dependabot[bot]
377a7eaa7d Bump actions/checkout from 2 to 4 (#177)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 14:36:48 -08:00
lumin
c1a0d3deaf chore: configure Dependabot for GitHub Actions updates (#112)
Sets up Dependabot to automatically check for updates to 
GitHub Actions on a weekly basis, ensuring that the project 
remains up-to-date with the latest dependencies and security 
fixes.

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:28:55 -08:00
SigireddyBalasai
5276616ba1 Added support to use Pathlib (#93)
* Add support for Path objects in MarkItDown conversion methods

* Remove unnecessary blank line in test_markitdown_exiftool function

* Remove unnecessary blank line in test_markitdown_exiftool function

* remove pathlib path in test file

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:12:48 -08:00
gagb
7e6c36c5d4 docs: add contribution guidelines to README (#176) 2024-12-20 14:08:58 -08:00
lumin
52d73080c7 refactor(tests): add helper function for tests (#87)
* refactor(tests): simplify string validation in tests

Introduce a helper function `validate_strings` to streamline the 
validation of expected and excluded strings in test cases. Replace 
repetitive string assertions in the `test_markitdown_local` function 
with calls to this new helper, improving code readability and 
maintainability.

* run pre-commit

---------

Co-authored-by: lumin <71011125+l-melon@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 11:42:32 -08:00
afourney
e6421000e3 Merge pull request #160 from sugatoray/support_type_hinting
Add support for type-hinting (PEP-561)
2024-12-20 10:54:43 -08:00
Sugato Ray
08a25345e3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:37:10 +00:00
Sugato Ray
8921fe7304 ignore .vscode folder
- avoid local developer vscode editor settings
2024-12-20 02:18:14 +00:00
Sugato Ray
613825d5b3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:12:24 +00:00
gagb
18e3f1d428 Merge pull request #91 from PetrAPConsulting/patch-1
Update README.md
2024-12-19 14:02:47 -08:00
gagb
c295dee5e4 Merge branch 'main' into patch-1 2024-12-19 13:22:51 -08:00
gagb
dd87dd5e36 Merge pull request #156 from microsoft/afourney-patch-1
Added holiday notice.
2024-12-19 11:18:24 -08:00
afourney
535147b2e8 Added holiday notice.
Added holiday notice.
2024-12-19 11:11:54 -08:00
gagb
5c776bda70 Update README.md 2024-12-19 10:30:53 -08:00
gagb
423a01844a Merge branch 'main' into patch-1 2024-12-19 10:30:10 -08:00
gagb
7147bef7d5 Merge pull request #130 from sugatoray/update_commandline_help
Update CLI helpdoc formatting to allow indentation in code
2024-12-19 10:20:23 -08:00
Sugato Ray
a5f39d6922 Merge branch 'main' into update_commandline_help 2024-12-19 07:58:48 -05:00
gagb
925c4499f7 Merge pull request #121 from l-lumin/add-project-description 2024-12-19 00:53:54 -08:00
Petr@AP Consulting
b28f380a47 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-19 09:23:15 +01:00
lumin
c86287b7e3 feat: add project description in pyproject.toml 2024-12-19 13:02:47 +09:00
Sugato Ray
6f3c762526 Merge branch 'main' into update_commandline_help 2024-12-18 17:50:07 -05:00
gagb
cb66b35f11 Merge pull request #132 from microsoft/gagb-patch-1
Add downloads badge
2024-12-18 14:30:09 -08:00
gagb
a2743a5314 Add downloads badge 2024-12-18 14:26:36 -08:00
Sugato Ray
277480066a Merge branch 'update_commandline_help' of https://github.com/sugatoray/markitdown into update_commandline_help 2024-12-18 21:53:54 +00:00
gagb
6e1b9a7402 Run precommit 2024-12-18 13:46:10 -08:00
Sugato Ray
1384e80725 update .gitignore to exclude .vscode folder 2024-12-18 21:46:06 +00:00
Sugato Ray
356e895306 update formatting with pre-commit 2024-12-18 21:45:23 +00:00
Petr@AP Consulting
f6e75c46d4 Update README.md
I changed command for running script from Mac version (python3) to Windows version (python)
2024-12-18 21:17:47 +01:00
afourney
8bc1bee18b Merge pull request #129 from finchy/main
Safeguard against path traversal for ZipConverter
2024-12-18 12:11:00 -08:00
Petr@AP Consulting
f4471d96e2 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:08:10 +01:00
Petr@AP Consulting
088007338d Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:07:55 +01:00
Petr@AP Consulting
bb929629f3 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:05:36 +01:00
Petr@AP Consulting
233ba679b8 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:05:04 +01:00
gagb
46b7f043d3 Merge branch 'main' into patch-1 2024-12-18 11:57:57 -08:00
gagb
5fc70864f2 Run pre-commit 2024-12-18 11:46:39 -08:00
Sugato Ray
39410d01df Update CLI helpdoc formatting to allow indentation in code
Use `textwrap.dedent()` to allow indented cli-helpdoc in `__main__.py` file. The indentation increases readability, while `textwrap.dedent` helps maintain the same functionality without breaking code.
2024-12-18 14:22:58 -05:00
Joel Esler
6e4caac70d Safeguard against path traversal for ZipConverter
fix: prevent path traversal vulnerabilities in ZipConverter

Added a secure check for path traversal vulnerabilities in the ZipConverter class.
Now validates extracted file paths using `os.path.commonprefix` to ensure all files
remain within the intended extraction directory. Raises a `ValueError` if a
path traversal attempt is detected.

- Normalized file paths using `os.path.normpath`.
- Added specific exception handling for `zipfile.BadZipFile` and traversal errors.
- Ensured cleanup of extracted files after processing when `cleanup_extracted` is enabled.
2024-12-18 13:12:55 -05:00
Petr@AP Consulting
224f1df0fc Update README.md
I collapsed section about batch processing as was suggested
2024-12-18 09:28:18 +01:00
gagb
1deaba1c6c Merge pull request #98 from waterimp/feature/fix-code-comments
fix incorrect comments for "bail if not ..." for WAV and image cases.
2024-12-17 17:57:25 -08:00
gagb
09cb048cbe Merge branch 'main' into feature/fix-code-comments 2024-12-17 17:34:53 -08:00
gagb
b029ae1cd4 Merge pull request #108 from microsoft/gagb-readme
Simplify README
2024-12-17 17:30:49 -08:00
gagb
524aa0da75 Update README.md 2024-12-17 17:25:40 -08:00
gagb
de1b54d79f Update README.md 2024-12-17 17:25:13 -08:00
gagb
1e7806a7ac Simplify 2024-12-17 17:21:39 -08:00
gagb
1163aa2b4e Merge pull request #106 from microsoft/gagb-patch-1
Update README.md
2024-12-17 16:57:32 -08:00
gagb
3bcf2bdae7 Update README.md 2024-12-17 16:54:17 -08:00
gagb
41a10b9a35 Merge pull request #64 from l-lumin/add-devcontainer-config
feat(devcontainer): Add DevContainer Configuration for Easier Contribution Setup
2024-12-17 16:52:50 -08:00
gagb
f1e399eee4 Merge branch 'main' into add-devcontainer-config 2024-12-17 16:50:32 -08:00
gagb
8b02c0bf9f Merge pull request #80 from diya155/main
Update README.md
2024-12-17 16:49:58 -08:00
gagb
1dda535330 Merge branch 'main' into main 2024-12-17 16:46:23 -08:00
gagb
362214323e Merge branch 'main' into feature/fix-code-comments 2024-12-17 16:38:47 -08:00
lumin
457b6234e6 Merge branch 'main' into add-devcontainer-config 2024-12-18 09:14:31 +09:00
afourney
790031409b Merge pull request #71 from AumGupta/main
feat: Add IpynbConverter
2024-12-17 15:41:51 -08:00
afourney
9e546a8588 Merge branch 'main' into main 2024-12-17 15:37:28 -08:00
afourney
ddf695cf81 Merge pull request #97 from Soulter/main
feat: Add RSSConverter
2024-12-17 15:34:22 -08:00
Adam Fourney
8d5f16ecd2 Fixed formatting. 2024-12-17 15:27:06 -08:00
afourney
a571021199 Merge branch 'main' into main 2024-12-17 15:12:59 -08:00
afourney
9add517510 Merge branch 'main' into feature/fix-code-comments 2024-12-17 14:56:16 -08:00
Lee Bush
05a49ca129 fix incorrect comments for "bail if not ..." for WAV and image cases. 2024-12-17 08:10:53 -07:00
Soulter
752fbd333c feat: add tests of rss convertor 2024-12-17 22:45:27 +08:00
Soulter
7dc2695b96 feat: support convert atom to markdown 2024-12-17 21:44:50 +08:00
Soulter
53fad6eb31 feat: add rss converter 2024-12-17 21:22:27 +08:00
Petr@AP Consulting
f398f3d443 Update README.md
I added description and script for batch of files processing
2024-12-17 10:26:09 +01:00
lumin
e0a30295ff docs: update README with Devcontainer instructions
Add instructions for using Dev to run tests.Remove the install script it is no longer needed. 
Update trademark section for clarity.
2024-12-17 17:04:31 +09:00
lumin
07fe457a90 feat: add devcontainer configuration and installation script
Add a devcontainer configuration to streamline the development 
environment setup. Introduce an `install.sh` script to install 
the project in editable mode. Update the Dockerfile to use 
the `python:3.13-slim-bullseye` base image and install 
dependencies using `apt-get` for better compatibility.
2024-12-17 17:04:31 +09:00
Om Gupta
60c4a62917 Merge branch 'microsoft:main' into main 2024-12-17 10:33:40 +05:30
Om Gupta
3eb8cf385b Merge branch 'main' of https://github.com/AumGupta/markitdown 2024-12-17 10:24:30 +05:30
Om Gupta
8c91c11ea8 pre-commit run 2024-12-17 10:24:25 +05:30
diya155
14bd8d319a Update README.md 2024-12-17 09:16:40 +05:30
gagb
dbc727615d Merge branch 'main' into main 2024-12-16 15:48:49 -08:00
afourney
afaff11ef0 Merge branch 'main' into main 2024-12-16 14:40:58 -08:00
Om Gupta
a3208f2bd0 feat: Add IpynbConverter
- Implemented IpynbConverter class for converting Jupyter Notebook (.ipynb) files into Markdown format.
- Supports markdown cells, code cells and raw cells.
- First markdown heading is used as the title if no title is found in notebook metadata.
- Created a test notebook (`test_notebook.ipynb`) to verify the functionality of the converter.
2024-12-17 01:00:41 +05:30
18 changed files with 1014 additions and 162 deletions

View File

@@ -0,0 +1,32 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
"name": "Existing Dockerfile",
"build": {
// Sets the run context to one level up instead of the .devcontainer folder.
"context": "..",
// Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
"dockerfile": "../Dockerfile",
"args": {
"INSTALL_GIT": "true"
}
},
// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
"features": {
"ghcr.io/devcontainers-extra/features/hatch:2": {}
},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Uncomment the next line to run commands after the container is created.
// "postCreateCommand": "cat /etc/os-release",
// Configure tool-specific properties.
// "customizations": {},
// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
"remoteUser": "root"
}

6
.github/dependabot.yml vendored Normal file
View File

@@ -0,0 +1,6 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"

View File

@@ -5,9 +5,9 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v2
uses: actions/setup-python@v5
with:
python-version: "3.x"

View File

@@ -5,8 +5,8 @@ jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: |
3.10
@@ -14,7 +14,7 @@ jobs:
3.12
- name: Set up pip cache
if: runner.os == 'Linux'
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}

4
.gitignore vendored
View File

@@ -1,3 +1,5 @@
.vscode
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -160,3 +162,5 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
src/.DS_Store
.DS_Store

View File

@@ -1,9 +1,16 @@
FROM python:3.13-alpine
FROM python:3.13-slim-bullseye
USER root
ARG INSTALL_GIT=false
RUN if [ "$INSTALL_GIT" = "true" ]; then \
apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
fi
# Runtime dependency
RUN apk add --no-cache ffmpeg
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
RUN pip install markitdown

176
README.md
View File

@@ -1,66 +1,75 @@
# MarkItDown
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
It presently supports:
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .`
# Installation
## Usage
You can install `markitdown` using pip:
```python
pip install markitdown
```
or from the source
```sh
pip install -e .
```
# Usage
The API is simple:
```python
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
```
To use this as a command-line utility, install it and then run it like this:
```bash
markitdown path-to-file.pdf
```
This will output Markdown to standard output. You can save it like this:
### Command-Line
```bash
markitdown path-to-file.pdf > document.md
```
You can pipe content to standard input by omitting the argument:
Or use `-o` to specify the output file:
```bash
markitdown path-to-file.pdf -o document.md
```
To use Document Intelligence conversion:
```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```
You can also pipe content:
```bash
cat path-to-file.pdf | markitdown
```
You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client.
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
### Python API
Basic usage in Python:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```
Document Intelligence conversion in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
@@ -72,13 +81,49 @@ result = md.convert("example.jpg")
print(result.text_content)
```
You can also use the project as Docker Image:
### Docker
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
<details>
<summary>Batch Processing Multiple Files</summary>
This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
```python convert.py
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI(api_key="your-api-key-here")
md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
for file in files_to_convert:
print(f"\nConverting {file}...")
try:
md_file = os.path.splitext(file)[0] + '.md'
result = md.convert(file)
with open(md_file, 'w') as f:
f.write(result.text_content)
print(f"Successfully converted {file} to {md_file}")
except Exception as e:
print(f"Error converting {file}: {str(e)}")
print("\nAll conversions completed!")
```
2. Place the script in the same directory as your files
3. Install required packages: like openai
4. Run script ```bash python convert.py ```
Note that original files will remain unchanged and new markdown files are created with the same base name.
</details>
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -93,28 +138,41 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### Running Tests
### How to Contribute
To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install).
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
```sh
pip install hatch
hatch shell
hatch test
```
### Running Pre-commit Checks
<div align="center">
Please run the pre-commit checks before submitting a PR.
| | All | Especially Needs Help from Community |
|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
```sh
pre-commit run --all-files
```
</div>
### Running Tests and Checks
- Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```
(Alternative) Use the Devcontainer which has all the dependencies installed:
```sh
# Reopen the project in Devcontainer and run:
hatch test
```
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.

View File

@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
[project]
name = "markitdown"
dynamic = ["version"]
description = ''
description = 'Utility tool for converting various files to Markdown'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
@@ -32,14 +32,18 @@ dependencies = [
"python-pptx",
"pandas",
"openpyxl",
"xlrd",
"pdfminer.six",
"puremagic",
"pydub",
"olefile",
"youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
"charset-normalizer",
"openai",
"azure-ai-documentintelligence",
"azure-identity"
]
[project.urls]

View File

@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.0.1a3"
__version__ = "0.0.1a5"

View File

@@ -1,45 +1,109 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
import sys
import argparse
from ._markitdown import MarkItDown
import sys
import shutil
from textwrap import dedent
from .__about__ import __version__
from ._markitdown import MarkItDown, DocumentConverterResult
def main():
parser = argparse.ArgumentParser(
description="Convert various file formats to markdown.",
prog="markitdown",
formatter_class=argparse.RawDescriptionHelpFormatter,
usage="""
SYNTAX:
usage=dedent(
"""
SYNTAX:
markitdown <OPTIONAL: FILENAME>
If FILENAME is empty, markitdown reads from stdin.
EXAMPLE:
markitdown example.pdf
OR
cat example.pdf | markitdown
OR
markitdown < example.pdf
OR to save to a file use
markitdown <OPTIONAL: FILENAME>
If FILENAME is empty, markitdown reads from stdin.
markitdown example.pdf -o example.md
OR
markitdown example.pdf > example.md
"""
).strip(),
)
EXAMPLE:
markitdown example.pdf
OR
parser.add_argument(
"-v",
"--version",
action="version",
version=f"%(prog)s {__version__}",
help="show the version number and exit",
)
cat example.pdf | markitdown
parser.add_argument(
"-o",
"--output",
help="Output file name. If not provided, output is written to stdout.",
)
OR
parser.add_argument(
"-d",
"--use-docintel",
action="store_true",
help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
)
markitdown < example.pdf
""".strip(),
parser.add_argument(
"-e",
"--endpoint",
type=str,
help="Document Intelligence Endpoint. Required if using Document Intelligence.",
)
parser.add_argument("filename", nargs="?")
args = parser.parse_args()
if args.filename is None:
markitdown = MarkItDown()
result = markitdown.convert_stream(sys.stdin.buffer)
print(result.text_content)
which_exiftool = shutil.which("exiftool")
if args.use_docintel:
if args.endpoint is None:
raise ValueError(
"Document Intelligence Endpoint is required when using Document Intelligence."
)
elif args.filename is None:
raise ValueError("Filename is required when using Document Intelligence.")
markitdown = MarkItDown(
exiftool_path=which_exiftool, docintel_endpoint=args.endpoint
)
else:
markitdown = MarkItDown(exiftool_path=which_exiftool)
if args.filename is None:
result = markitdown.convert_stream(sys.stdin.buffer)
else:
markitdown = MarkItDown()
result = markitdown.convert(args.filename)
_handle_output(args, result)
def _handle_output(args, result: DocumentConverterResult):
"""Handle output to stdout or file"""
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(result.text_content)
else:
print(result.text_content)

View File

@@ -13,12 +13,15 @@ import sys
import tempfile
import traceback
import zipfile
from xml.dom import minidom
from typing import Any, Dict, List, Optional, Union
from pathlib import Path
from urllib.parse import parse_qs, quote, unquote, urlparse, urlunparse
from warnings import warn, resetwarnings, catch_warnings
import mammoth
import markdownify
import olefile
import pandas as pd
import pdfminer
import pdfminer.high_level
@@ -30,7 +33,24 @@ import requests
from bs4 import BeautifulSoup
from charset_normalizer import from_path
# Azure imports
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
AnalyzeDocumentRequest,
AnalyzeResult,
DocumentAnalysisFeature,
)
from azure.identity import DefaultAzureCredential
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
# This constant is a temporary fix until the bug is resolved.
CONTENT_FORMAT = "markdown"
# Override mimetype for csv to fix issue on windows
mimetypes.add_type("text/csv", ".csv")
# Optional Transcription support
IS_AUDIO_TRANSCRIPTION_CAPABLE = False
try:
# Using warnings' catch_warnings to catch
# pydub's warning of ffmpeg or avconv missing
@@ -71,7 +91,14 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
# Explicitly cast options to the expected type if necessary
super().__init__(**options)
def convert_hn(self, n: int, el: Any, text: str, convert_as_inline: bool) -> str:
def convert_hn(
self,
n: int,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Same as usual, but be sure to start with a new line"""
if not convert_as_inline:
if not re.search(r"^\n", text):
@@ -79,7 +106,9 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
return super().convert_hn(n, el, text, convert_as_inline) # type: ignore
def convert_a(self, el: Any, text: str, convert_as_inline: bool):
def convert_a(
self, el: Any, text: str, convert_as_inline: Optional[bool] = False, **kwargs
):
"""Same as usual converter, but removes Javascript links and escapes URIs."""
prefix, suffix, text = markdownify.chomp(text) # type: ignore
if not text:
@@ -115,7 +144,9 @@ class _CustomMarkdownify(markdownify.MarkdownConverter):
else text
)
def convert_img(self, el: Any, text: str, convert_as_inline: bool) -> str:
def convert_img(
self, el: Any, text: str, convert_as_inline: Optional[bool] = False, **kwargs
) -> str:
"""Same as usual converter, but removes data URIs"""
alt = el.attrs.get("alt", None) or ""
@@ -169,7 +200,10 @@ class PlainTextConverter(DocumentConverter):
# Only accept text files
if content_type is None:
return None
elif "text/" not in content_type.lower():
elif all(
not content_type.lower().startswith(type_prefix)
for type_prefix in ["text/", "application/json"]
):
return None
text_content = str(from_path(local_path).best())
@@ -197,7 +231,7 @@ class HtmlConverter(DocumentConverter):
return result
def _convert(self, html_content: str) -> Union[None, DocumentConverterResult]:
"""Helper function that converts and HTML string."""
"""Helper function that converts an HTML string."""
# Parse the string
soup = BeautifulSoup(html_content, "html.parser")
@@ -216,12 +250,152 @@ class HtmlConverter(DocumentConverter):
assert isinstance(webpage_text, str)
# remove leading and trailing \n
webpage_text = webpage_text.strip()
return DocumentConverterResult(
title=None if soup.title is None else soup.title.string,
text_content=webpage_text,
)
class RSSConverter(DocumentConverter):
"""Convert RSS / Atom type to markdown"""
def convert(
self, local_path: str, **kwargs
) -> Union[None, DocumentConverterResult]:
# Bail if not RSS type
extension = kwargs.get("file_extension", "")
if extension.lower() not in [".xml", ".rss", ".atom"]:
return None
try:
doc = minidom.parse(local_path)
except BaseException as _:
return None
result = None
if doc.getElementsByTagName("rss"):
# A RSS feed must have a root element of <rss>
result = self._parse_rss_type(doc)
elif doc.getElementsByTagName("feed"):
root = doc.getElementsByTagName("feed")[0]
if root.getElementsByTagName("entry"):
# An Atom feed must have a root element of <feed> and at least one <entry>
result = self._parse_atom_type(doc)
else:
return None
else:
# not rss or atom
return None
return result
def _parse_atom_type(
self, doc: minidom.Document
) -> Union[None, DocumentConverterResult]:
"""Parse the type of an Atom feed.
Returns None if the feed type is not recognized or something goes wrong.
"""
try:
root = doc.getElementsByTagName("feed")[0]
title = self._get_data_by_tag_name(root, "title")
subtitle = self._get_data_by_tag_name(root, "subtitle")
entries = root.getElementsByTagName("entry")
md_text = f"# {title}\n"
if subtitle:
md_text += f"{subtitle}\n"
for entry in entries:
entry_title = self._get_data_by_tag_name(entry, "title")
entry_summary = self._get_data_by_tag_name(entry, "summary")
entry_updated = self._get_data_by_tag_name(entry, "updated")
entry_content = self._get_data_by_tag_name(entry, "content")
if entry_title:
md_text += f"\n## {entry_title}\n"
if entry_updated:
md_text += f"Updated on: {entry_updated}\n"
if entry_summary:
md_text += self._parse_content(entry_summary)
if entry_content:
md_text += self._parse_content(entry_content)
return DocumentConverterResult(
title=title,
text_content=md_text,
)
except BaseException as _:
return None
def _parse_rss_type(
self, doc: minidom.Document
) -> Union[None, DocumentConverterResult]:
"""Parse the type of an RSS feed.
Returns None if the feed type is not recognized or something goes wrong.
"""
try:
root = doc.getElementsByTagName("rss")[0]
channel = root.getElementsByTagName("channel")
if not channel:
return None
channel = channel[0]
channel_title = self._get_data_by_tag_name(channel, "title")
channel_description = self._get_data_by_tag_name(channel, "description")
items = channel.getElementsByTagName("item")
if channel_title:
md_text = f"# {channel_title}\n"
if channel_description:
md_text += f"{channel_description}\n"
if not items:
items = []
for item in items:
title = self._get_data_by_tag_name(item, "title")
description = self._get_data_by_tag_name(item, "description")
pubDate = self._get_data_by_tag_name(item, "pubDate")
content = self._get_data_by_tag_name(item, "content:encoded")
if title:
md_text += f"\n## {title}\n"
if pubDate:
md_text += f"Published on: {pubDate}\n"
if description:
md_text += self._parse_content(description)
if content:
md_text += self._parse_content(content)
return DocumentConverterResult(
title=channel_title,
text_content=md_text,
)
except BaseException as _:
print(traceback.format_exc())
return None
def _parse_content(self, content: str) -> str:
"""Parse the content of an RSS feed item"""
try:
# using bs4 because many RSS feeds have HTML-styled content
soup = BeautifulSoup(content, "html.parser")
return _CustomMarkdownify().convert_soup(soup)
except BaseException as _:
return content
def _get_data_by_tag_name(
self, element: minidom.Element, tag_name: str
) -> Union[str, None]:
"""Get data from first child element with the given tag name.
Returns None when no such element is found.
"""
nodes = element.getElementsByTagName(tag_name)
if not nodes:
return None
fc = nodes[0].firstChild
if fc:
return fc.data
return None
class WikipediaConverter(DocumentConverter):
"""Handle Wikipedia pages separately, focusing only on the main document content."""
@@ -403,6 +577,67 @@ class YouTubeConverter(DocumentConverter):
return None
class IpynbConverter(DocumentConverter):
"""Converts Jupyter Notebook (.ipynb) files to Markdown."""
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
# Bail if not ipynb
extension = kwargs.get("file_extension", "")
if extension.lower() != ".ipynb":
return None
# Parse and convert the notebook
result = None
with open(local_path, "rt", encoding="utf-8") as fh:
notebook_content = json.load(fh)
result = self._convert(notebook_content)
return result
def _convert(self, notebook_content: dict) -> Union[None, DocumentConverterResult]:
"""Helper function that converts notebook JSON content to Markdown."""
try:
md_output = []
title = None
for cell in notebook_content.get("cells", []):
cell_type = cell.get("cell_type", "")
source_lines = cell.get("source", [])
if cell_type == "markdown":
md_output.append("".join(source_lines))
# Extract the first # heading as title if not already found
if title is None:
for line in source_lines:
if line.startswith("# "):
title = line.lstrip("# ").strip()
break
elif cell_type == "code":
# Code cells are wrapped in Markdown code blocks
md_output.append(f"```python\n{''.join(source_lines)}\n```")
elif cell_type == "raw":
md_output.append(f"```\n{''.join(source_lines)}\n```")
md_text = "\n\n".join(md_output)
# Check for title in notebook metadata
title = notebook_content.get("metadata", {}).get("title", title)
return DocumentConverterResult(
title=title,
text_content=md_text,
)
except Exception as e:
raise FileConversionException(
f"Error converting .ipynb file: {str(e)}"
) from e
class BingSerpConverter(DocumentConverter):
"""
Handle Bing results pages (only the organic search results).
@@ -524,7 +759,31 @@ class XlsxConverter(HtmlConverter):
if extension.lower() != ".xlsx":
return None
sheets = pd.read_excel(local_path, sheet_name=None)
sheets = pd.read_excel(local_path, sheet_name=None, engine="openpyxl")
md_content = ""
for s in sheets:
md_content += f"## {s}\n"
html_content = sheets[s].to_html(index=False)
md_content += self._convert(html_content).text_content.strip() + "\n\n"
return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)
class XlsConverter(HtmlConverter):
"""
Converts XLS files to Markdown, with each sheet presented as a separate Markdown table.
"""
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a XLS
extension = kwargs.get("file_extension", "")
if extension.lower() != ".xls":
return None
sheets = pd.read_excel(local_path, sheet_name=None, engine="xlrd")
md_content = ""
for s in sheets:
md_content += f"## {s}\n"
@@ -542,6 +801,35 @@ class PptxConverter(HtmlConverter):
Converts PPTX files to Markdown. Supports heading, tables and images with alt text.
"""
def _get_llm_description(
self, llm_client, llm_model, image_blob, content_type, prompt=None
):
if prompt is None or prompt.strip() == "":
prompt = "Write a detailed alt text for this image with less than 50 words."
image_base64 = base64.b64encode(image_blob).decode("utf-8")
data_uri = f"data:{content_type};base64,{image_base64}"
messages = [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": data_uri,
},
},
{"type": "text", "text": prompt},
],
}
]
response = llm_client.chat.completions.create(
model=llm_model, messages=messages
)
return response.choices[0].message.content
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a PPTX
extension = kwargs.get("file_extension", "")
@@ -562,17 +850,38 @@ class PptxConverter(HtmlConverter):
# Pictures
if self._is_picture(shape):
# https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069
alt_text = ""
try:
alt_text = shape._element._nvXxPr.cNvPr.attrib.get("descr", "")
except Exception:
pass
llm_description = None
alt_text = None
llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model")
if llm_client is not None and llm_model is not None:
try:
llm_description = self._get_llm_description(
llm_client,
llm_model,
shape.image.blob,
shape.image.content_type,
)
except Exception:
# Unable to describe with LLM
pass
if not llm_description:
try:
alt_text = shape._element._nvXxPr.cNvPr.attrib.get(
"descr", ""
)
except Exception:
# Unable to get alt text
pass
# A placeholder name
filename = re.sub(r"\W", "", shape.name) + ".jpg"
md_content += (
"\n!["
+ (alt_text if alt_text else shape.name)
+ (llm_description or alt_text or shape.name)
+ "]("
+ filename
+ ")\n"
@@ -663,14 +972,25 @@ class MediaConverter(DocumentConverter):
Abstract class for multi-modal media (e.g., images and audio)
"""
def _get_metadata(self, local_path):
exiftool = shutil.which("exiftool")
if not exiftool:
def _get_metadata(self, local_path, exiftool_path=None):
if not exiftool_path:
which_exiftool = shutil.which("exiftool")
if which_exiftool:
warn(
f"""Implicit discovery of 'exiftool' is disabled. If you would like to continue to use exiftool in MarkItDown, please set the exiftool_path parameter in the MarkItDown consructor. E.g.,
md = MarkItDown(exiftool_path="{which_exiftool}")
This warning will be removed in future releases.
""",
DeprecationWarning,
)
return None
else:
try:
result = subprocess.run(
[exiftool, "-json", local_path], capture_output=True, text=True
[exiftool_path, "-json", local_path], capture_output=True, text=True
).stdout
return json.loads(result)[0]
except Exception:
@@ -683,7 +1003,7 @@ class WavConverter(MediaConverter):
"""
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a XLSX
# Bail if not a WAV
extension = kwargs.get("file_extension", "")
if extension.lower() != ".wav":
return None
@@ -691,7 +1011,7 @@ class WavConverter(MediaConverter):
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path)
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
if metadata:
for f in [
"Title",
@@ -746,7 +1066,7 @@ class Mp3Converter(WavConverter):
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path)
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
if metadata:
for f in [
"Title",
@@ -799,7 +1119,7 @@ class ImageConverter(MediaConverter):
"""
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a XLSX
# Bail if not an image
extension = kwargs.get("file_extension", "")
if extension.lower() not in [".jpg", ".jpeg", ".png"]:
return None
@@ -807,7 +1127,7 @@ class ImageConverter(MediaConverter):
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path)
metadata = self._get_metadata(local_path, kwargs.get("exiftool_path"))
if metadata:
for f in [
"ImageSize",
@@ -876,6 +1196,79 @@ class ImageConverter(MediaConverter):
return response.choices[0].message.content
class OutlookMsgConverter(DocumentConverter):
"""Converts Outlook .msg files to markdown by extracting email metadata and content.
Uses the olefile package to parse the .msg file structure and extract:
- Email headers (From, To, Subject)
- Email body content
"""
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
# Bail if not a MSG file
extension = kwargs.get("file_extension", "")
if extension.lower() != ".msg":
return None
try:
msg = olefile.OleFileIO(local_path)
# Extract email metadata
md_content = "# Email Message\n\n"
# Get headers
headers = {
"From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
"To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
"Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
}
# Add headers to markdown
for key, value in headers.items():
if value:
md_content += f"**{key}:** {value}\n"
md_content += "\n## Content\n\n"
# Get email body
body = self._get_stream_data(msg, "__substg1.0_1000001F")
if body:
md_content += body
msg.close()
return DocumentConverterResult(
title=headers.get("Subject"), text_content=md_content.strip()
)
except Exception as e:
raise FileConversionException(
f"Could not convert MSG file '{local_path}': {str(e)}"
)
def _get_stream_data(
self, msg: olefile.OleFileIO, stream_path: str
) -> Union[str, None]:
"""Helper to safely extract and decode stream data from the MSG file."""
try:
if msg.exists(stream_path):
data = msg.openstream(stream_path).read()
# Try UTF-16 first (common for .msg files)
try:
return data.decode("utf-16-le").strip()
except UnicodeDecodeError:
# Fall back to UTF-8
try:
return data.decode("utf-8").strip()
except UnicodeDecodeError:
# Last resort - ignore errors
return data.decode("utf-8", errors="ignore").strip()
except Exception:
pass
return None
class ZipConverter(DocumentConverter):
"""Converts ZIP files to markdown by extracting and converting all contained files.
@@ -934,27 +1327,33 @@ class ZipConverter(DocumentConverter):
extracted_zip_folder_name = (
f"extracted_{os.path.basename(local_path).replace('.zip', '_zip')}"
)
new_folder = os.path.normpath(
extraction_dir = os.path.normpath(
os.path.join(os.path.dirname(local_path), extracted_zip_folder_name)
)
md_content = f"Content from the zip file `{os.path.basename(local_path)}`:\n\n"
# Safety check for path traversal
if not new_folder.startswith(os.path.dirname(local_path)):
return DocumentConverterResult(
title=None, text_content=f"[ERROR] Invalid zip file path: {local_path}"
)
try:
# Extract the zip file
# Extract the zip file safely
with zipfile.ZipFile(local_path, "r") as zipObj:
zipObj.extractall(path=new_folder)
# Safeguard against path traversal
for member in zipObj.namelist():
member_path = os.path.normpath(os.path.join(extraction_dir, member))
if (
not os.path.commonprefix([extraction_dir, member_path])
== extraction_dir
):
raise ValueError(
f"Path traversal detected in zip file: {member}"
)
# Extract all files safely
zipObj.extractall(path=extraction_dir)
# Process each extracted file
for root, dirs, files in os.walk(new_folder):
for root, dirs, files in os.walk(extraction_dir):
for name in files:
file_path = os.path.join(root, name)
relative_path = os.path.relpath(file_path, new_folder)
relative_path = os.path.relpath(file_path, extraction_dir)
# Get file extension
_, file_extension = os.path.splitext(name)
@@ -978,7 +1377,7 @@ class ZipConverter(DocumentConverter):
# Clean up extracted files if specified
if kwargs.get("cleanup_extracted", True):
shutil.rmtree(new_folder)
shutil.rmtree(extraction_dir)
return DocumentConverterResult(title=None, text_content=md_content.strip())
@@ -987,6 +1386,11 @@ class ZipConverter(DocumentConverter):
title=None,
text_content=f"[ERROR] Invalid or corrupted zip file: {local_path}",
)
except ValueError as ve:
return DocumentConverterResult(
title=None,
text_content=f"[ERROR] Security error in zip file {local_path}: {str(ve)}",
)
except Exception as e:
return DocumentConverterResult(
title=None,
@@ -994,6 +1398,74 @@ class ZipConverter(DocumentConverter):
)
class DocumentIntelligenceConverter(DocumentConverter):
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
def __init__(
self,
endpoint: str,
api_version: str = "2024-07-31-preview",
):
self.endpoint = endpoint
self.api_version = api_version
self.doc_intel_client = DocumentIntelligenceClient(
endpoint=self.endpoint,
api_version=self.api_version,
credential=DefaultAzureCredential(),
)
def convert(
self, local_path: str, **kwargs: Any
) -> Union[None, DocumentConverterResult]:
# Bail if extension is not supported by Document Intelligence
extension = kwargs.get("file_extension", "")
docintel_extensions = [
".pdf",
".docx",
".xlsx",
".pptx",
".html",
".jpeg",
".jpg",
".png",
".bmp",
".tiff",
".heif",
]
if extension.lower() not in docintel_extensions:
return None
# Get the bytestring for the local path
with open(local_path, "rb") as f:
file_bytes = f.read()
# Certain document analysis features are not availiable for filetypes (.xlsx, .pptx, .html)
if extension.lower() in [".xlsx", ".pptx", ".html"]:
analysis_features = []
else:
analysis_features = [
DocumentAnalysisFeature.FORMULAS, # enable formula extraction
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
]
# Extract the text using Azure Document Intelligence
poller = self.doc_intel_client.begin_analyze_document(
model_id="prebuilt-layout",
body=AnalyzeDocumentRequest(bytes_source=file_bytes),
features=analysis_features,
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
)
result: AnalyzeResult = poller.result()
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
return DocumentConverterResult(
title=None,
text_content=markdown_text,
)
class FileConversionException(BaseException):
pass
@@ -1012,6 +1484,8 @@ class MarkItDown:
llm_client: Optional[Any] = None,
llm_model: Optional[str] = None,
style_map: Optional[str] = None,
exiftool_path: Optional[str] = None,
docintel_endpoint: Optional[str] = None,
# Deprecated
mlm_client: Optional[Any] = None,
mlm_model: Optional[str] = None,
@@ -1021,6 +1495,9 @@ class MarkItDown:
else:
self._requests_session = requests_session
if exiftool_path is None:
exiftool_path = os.environ.get("EXIFTOOL_PATH")
# Handle deprecation notices
#############################
if mlm_client is not None:
@@ -1053,6 +1530,7 @@ class MarkItDown:
self._llm_client = llm_client
self._llm_model = llm_model
self._style_map = style_map
self._exiftool_path = exiftool_path
self._page_converters: List[DocumentConverter] = []
@@ -1061,24 +1539,34 @@ class MarkItDown:
# To this end, the most specific converters should appear below the most generic converters
self.register_page_converter(PlainTextConverter())
self.register_page_converter(HtmlConverter())
self.register_page_converter(RSSConverter())
self.register_page_converter(WikipediaConverter())
self.register_page_converter(YouTubeConverter())
self.register_page_converter(BingSerpConverter())
self.register_page_converter(DocxConverter())
self.register_page_converter(XlsxConverter())
self.register_page_converter(XlsConverter())
self.register_page_converter(PptxConverter())
self.register_page_converter(WavConverter())
self.register_page_converter(Mp3Converter())
self.register_page_converter(ImageConverter())
self.register_page_converter(IpynbConverter())
self.register_page_converter(PdfConverter())
self.register_page_converter(ZipConverter())
self.register_page_converter(OutlookMsgConverter())
# Register Document Intelligence converter at the top of the stack if endpoint is provided
if docintel_endpoint is not None:
self.register_page_converter(
DocumentIntelligenceConverter(endpoint=docintel_endpoint)
)
def convert(
self, source: Union[str, requests.Response], **kwargs: Any
self, source: Union[str, requests.Response, Path], **kwargs: Any
) -> DocumentConverterResult: # TODO: deal with kwargs
"""
Args:
- source: can be a string representing a path or url, or a requests.response object
- source: can be a string representing a path either as string pathlib path object or url, or a requests.response object
- extension: specifies the file extension to use when interpreting the file. If None, infer from source (path, uri, content-type, etc.)
"""
@@ -1095,10 +1583,14 @@ class MarkItDown:
# Request response
elif isinstance(source, requests.Response):
return self.convert_response(source, **kwargs)
elif isinstance(source, Path):
return self.convert_local(source, **kwargs)
def convert_local(
self, path: str, **kwargs: Any
self, path: Union[str, Path], **kwargs: Any
) -> DocumentConverterResult: # TODO: deal with kwargs
if isinstance(path, Path):
path = str(path)
# Prepare a list of extensions to try (in order of priority)
ext = kwargs.get("file_extension")
extensions = [ext] if ext is not None else []
@@ -1228,12 +1720,15 @@ class MarkItDown:
if "llm_model" not in _kwargs and self._llm_model is not None:
_kwargs["llm_model"] = self._llm_model
# Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._page_converters
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map
if "exiftool_path" not in _kwargs and self._exiftool_path is not None:
_kwargs["exiftool_path"] = self._exiftool_path
# Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._page_converters
# If we hit an error log it and keep trying
try:
res = converter.convert(local_path, **_kwargs)
@@ -1268,6 +1763,8 @@ class MarkItDown:
ext = ext.strip()
if ext == "":
return
if ext in extensions:
return
# if ext not in extensions:
extensions.append(ext)
@@ -1276,6 +1773,25 @@ class MarkItDown:
# Use puremagic to guess
try:
guesses = puremagic.magic_file(path)
# Fix for: https://github.com/microsoft/markitdown/issues/222
# If there are no guesses, then try again after trimming leading ASCII whitespaces.
# ASCII whitespace characters are those byte values in the sequence b' \t\n\r\x0b\f'
# (space, tab, newline, carriage return, vertical tab, form feed).
if len(guesses) == 0:
with open(path, "rb") as file:
while True:
char = file.read(1)
if not char: # End of file
break
if not char.isspace():
file.seek(file.tell() - 1)
break
try:
guesses = puremagic.magic_stream(file)
except puremagic.main.PureError:
pass
extensions = list()
for g in guesses:
ext = g.extension.strip()

0
src/markitdown/py.typed Normal file
View File

10
tests/test_files/test.json vendored Normal file
View File

@@ -0,0 +1,10 @@
{
"key1": "string_value",
"key2": 1234,
"key3": [
"list_value1",
"list_value2"
],
"5b64c88c-b3c3-4510-bcb8-da0b200602d8": "uuid_key",
"uuid_value": "9700dc99-6685-40b4-9a3a-5e406dcb37f3"
}

BIN
tests/test_files/test.xls vendored Normal file

Binary file not shown.

89
tests/test_files/test_notebook.ipynb vendored Normal file
View File

@@ -0,0 +1,89 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0f61db80",
"metadata": {},
"source": [
"# Test Notebook"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3f2a5bbd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"markitdown\n"
]
}
],
"source": [
"print('markitdown')"
]
},
{
"cell_type": "markdown",
"id": "9b9c0468",
"metadata": {},
"source": [
"## Code Cell Below"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "37d8088a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"# comment in code\n",
"print(42)"
]
},
{
"cell_type": "markdown",
"id": "2e3177bd",
"metadata": {},
"source": [
"End\n",
"\n",
"---"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
},
"title": "Test Notebook Title"
},
"nbformat": 4,
"nbformat_minor": 5
}

BIN
tests/test_files/test_outlook_msg.msg vendored Normal file

Binary file not shown.

1
tests/test_files/test_rss.xml vendored Normal file

File diff suppressed because one or more lines are too long

View File

@@ -54,6 +54,12 @@ XLSX_TEST_STRINGS = [
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]
XLS_TEST_STRINGS = [
"## 09060124-b5e7-4717-9d07-3c046eb",
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]
DOCX_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -63,6 +69,15 @@ DOCX_TEST_STRINGS = [
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
]
MSG_TEST_STRINGS = [
"# Email Message",
"**From:** test.sender@example.com",
"**To:** test.recipient@example.com",
"**Subject:** Test Email Message",
"## Content",
"This is the body of the test email message",
]
DOCX_COMMENT_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -90,6 +105,13 @@ BLOG_TEST_STRINGS = [
"an example where high cost can easily prevent a generic complex",
]
RSS_TEST_STRINGS = [
"The Official Microsoft Blog",
"In the case of AI, it is absolutely true that the industry is moving incredibly fast",
]
WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
WIKIPEDIA_TEST_STRINGS = [
"Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
@@ -123,6 +145,22 @@ LLM_TEST_STRINGS = [
"5bda1dd6",
]
JSON_TEST_STRINGS = [
"5b64c88c-b3c3-4510-bcb8-da0b200602d8",
"9700dc99-6685-40b4-9a3a-5e406dcb37f3",
]
# --- Helper Functions ---
def validate_strings(result, expected_strings, exclude_strings=None):
"""Validate presence or absence of specific strings."""
text_content = result.text_content.replace("\\", "")
for string in expected_strings:
assert string in text_content
if exclude_strings:
for string in exclude_strings:
assert string not in text_content
@pytest.mark.skipif(
skip_remote,
@@ -156,79 +194,82 @@ def test_markitdown_local() -> None:
# Test XLSX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
for test_string in XLSX_TEST_STRINGS:
validate_strings(result, XLSX_TEST_STRINGS)
# Test XLS processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
for test_string in XLS_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test DOCX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
for test_string in DOCX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, DOCX_TEST_STRINGS)
# Test DOCX processing, with comments
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
style_map="comment-reference => ",
)
for test_string in DOCX_COMMENT_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
# Test DOCX processing, with comments and setting style_map on init
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
result = markitdown_with_style_map.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
)
for test_string in DOCX_COMMENT_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
# Test PPTX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
for test_string in PPTX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, PPTX_TEST_STRINGS)
# Test HTML processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
)
for test_string in BLOG_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, BLOG_TEST_STRINGS)
# Test ZIP file processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
for test_string in DOCX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, XLSX_TEST_STRINGS)
# Test Wikipedia processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
)
text_content = result.text_content.replace("\\", "")
for test_string in WIKIPEDIA_TEST_EXCLUDES:
assert test_string not in text_content
for test_string in WIKIPEDIA_TEST_STRINGS:
assert test_string in text_content
validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
# Test Bing processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
)
text_content = result.text_content.replace("\\", "")
for test_string in SERP_TEST_EXCLUDES:
assert test_string not in text_content
for test_string in SERP_TEST_STRINGS:
validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
# Test RSS processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
text_content = result.text_content.replace("\\", "")
for test_string in RSS_TEST_STRINGS:
assert test_string in text_content
## Test non-UTF-8 encoding
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
text_content = result.text_content.replace("\\", "")
for test_string in CSV_CP932_TEST_STRINGS:
assert test_string in text_content
validate_strings(result, CSV_CP932_TEST_STRINGS)
# Test MSG (Outlook email) processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
validate_strings(result, MSG_TEST_STRINGS)
# Test JSON processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
validate_strings(result, JSON_TEST_STRINGS)
# Test input with leading blank characters
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data))
assert "# Test" in result.text_content
@pytest.mark.skipif(
@@ -236,9 +277,29 @@ def test_markitdown_local() -> None:
reason="do not run if exiftool is not installed",
)
def test_markitdown_exiftool() -> None:
markitdown = MarkItDown()
# Test the automatic discovery of exiftool throws a warning
# and is disabled
try:
with catch_warnings(record=True) as w:
markitdown = MarkItDown()
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert result.text_content.strip() == ""
finally:
resetwarnings()
# Test JPG metadata processing
# Test explicitly setting the location of exiftool
which_exiftool = shutil.which("exiftool")
markitdown = MarkItDown(exiftool_path=which_exiftool)
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
for key in JPG_TEST_EXIFTOOL:
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
assert target in result.text_content
# Test setting the exiftool path through an environment variable
os.environ["EXIFTOOL_PATH"] = which_exiftool
markitdown = MarkItDown()
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
for key in JPG_TEST_EXIFTOOL:
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
@@ -300,8 +361,8 @@ def test_markitdown_llm() -> None:
if __name__ == "__main__":
"""Runs this file's tests from the command line."""
test_markitdown_remote()
test_markitdown_local()
# test_markitdown_remote()
# test_markitdown_local()
test_markitdown_exiftool()
test_markitdown_deprecation()
test_markitdown_llm()
# test_markitdown_deprecation()
# test_markitdown_llm()