183 Commits

Author SHA1 Message Date
Adam Fourney
abe9752438 Bumped version 2025-02-10 16:01:17 -08:00
wunde005
73ba69d8cd For csv files mimetypes.guess_type is returning "application/vnd.ms-excel" on windows causing an invalid mime type in plaintextconverter. In reference to issue: https://github.com/microsoft/markitdown/issues/150 (#273) 2025-02-08 20:58:13 -08:00
Werner Robitza
2a4f7bb6a8 fix: argparse CLI option ordering, fixes #268 (#290)
* fix: argparse CLI option ordering, fixes #268
* Fixed formatting.
2025-02-08 20:50:38 -08:00
masquare
7cf5e0bb23 feat(pptx): support image description with LLM for pptx files (#306) 2025-02-08 20:37:34 -08:00
James Hickey
3090917a49 Typo fixed (#270) 2025-02-08 20:30:13 -08:00
ZeyuTeng96
7bea2672a0 remove leading and trailing \n for HtmlConverter (#262) 2025-02-08 20:28:35 -08:00
KennyZhang1
bf6a15e9b5 Kennyzhang/docintel docs (#312)
* updated docs to include doc intelligence

* include reference to doc intel setup docs
2025-01-31 22:23:26 -08:00
KennyZhang1
bfde857420 Add support for conversion via Document Intelligence (#303)
* added cli params for doc intel

* added DocumentIntelligenceConverter class implementation

* initialized doc intel client instance field

* added isolated doc_intel main conversion function

* temp fix for ContentFormat import bug

* ran tests for docintel and offline for many filetypes

* push doc intel converter to the top of the stack

* formatting changes

* modified project toml file
2025-01-24 14:09:32 -08:00
afourney
f58a864951 Set exiftool path explicitly. (#267) 2025-01-06 12:43:47 -08:00
afourney
265aea2edf Removed the holiday away message from README.md (#266) 2025-01-06 09:06:21 -08:00
afourney
05b78e7ce1 Recognize json as plain text (if no other handlers are present). (#261)
* Recognize json as plain text (if no other handlers are present).
2025-01-03 16:40:43 -08:00
afourney
436407288f If puremagic has no guesses, try again after ltrim. (#260) 2025-01-03 16:03:11 -08:00
afourney
731b39e7f5 Added a test for leading spaces. (#258) 2025-01-03 14:34:33 -08:00
yeungadrian
08ed32869e Feature/ Add xls support (#169)
* add xlrd
* add xls converter with tests
2025-01-03 13:58:17 -08:00
Murat Can Kurtuluş
d248621ba4 feat: outlook ".msg" file converter (#196)
* feat: outlook .msg converter
* add test, adjust docstring
2025-01-03 13:34:39 -08:00
AbSadiki
4678c8a2a4 fix(transcription): IS_AUDIO_TRANSCRIPTION_CAPABLE should be iniztialized (#194) 2025-01-03 13:29:26 -08:00
Ikko Eltociear Ashimine
125e206047 docs: update README.md (#182)
faciliate -> facilitate
2024-12-21 01:51:30 -08:00
numekudi
f94d09990e feat: enable Git support in devcontainer (#136)
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 18:09:17 -08:00
lumin
cfd2319c14 feat: add version option to markitdown CLI (#172)
Add a `--version` option to the markitdown command-line interface 
that displays the current version number.
2024-12-20 16:24:45 -08:00
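A minimal sketch of how a `--version` flag like the one described above could be wired with `argparse`. The `__version__` value and the `build_version_parser` helper are hypothetical names for illustration; the project's actual CLI code is not shown in this log.

```python
import argparse

# Hypothetical version string; the real project derives its version elsewhere.
__version__ = "0.0.0-example"


def build_version_parser() -> argparse.ArgumentParser:
    """Build a parser with a --version flag (illustrative only)."""
    parser = argparse.ArgumentParser(prog="markitdown")
    # argparse's built-in "version" action prints the string and exits with code 0.
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )
    # Optional positional argument so the parser also accepts piped input.
    parser.add_argument("filename", nargs="?")
    return parser
```

Using the built-in `version` action keeps the flag independent of the other arguments: it is handled as soon as it is parsed, so no filename is required.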
dependabot[bot]
73161982ff Bump actions/setup-python from 2 to 5 (#179)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:20:22 -08:00
dependabot[bot]
9b69467772 Bump actions/cache from 3 to 4 (#178)
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:17:43 -08:00
gagb
857a2d160d Update README.md (#180) 2024-12-20 14:49:20 -08:00
Soulter
1123392306 fix: support -o param to avoid encoding issues (#116)
* perf: cli supports -o param

* doc: update README

---------

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:43:00 -08:00
dependabot[bot]
377a7eaa7d Bump actions/checkout from 2 to 4 (#177)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 14:36:48 -08:00
lumin
c1a0d3deaf chore: configure Dependabot for GitHub Actions updates (#112)
Sets up Dependabot to automatically check for updates to 
GitHub Actions on a weekly basis, ensuring that the project 
remains up-to-date with the latest dependencies and security 
fixes.

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:28:55 -08:00
SigireddyBalasai
5276616ba1 Added support to use Pathlib (#93)
* Add support for Path objects in MarkItDown conversion methods

* Remove unnecessary blank line in test_markitdown_exiftool function

* Remove unnecessary blank line in test_markitdown_exiftool function

* remove pathlib path in test file

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:12:48 -08:00
gagb
7e6c36c5d4 docs: add contribution guidelines to README (#176) 2024-12-20 14:08:58 -08:00
lumin
52d73080c7 refactor(tests): add helper function for tests (#87)
* refactor(tests): simplify string validation in tests

Introduce a helper function `validate_strings` to streamline the 
validation of expected and excluded strings in test cases. Replace 
repetitive string assertions in the `test_markitdown_local` function 
with calls to this new helper, improving code readability and 
maintainability.

* run pre-commit

---------

Co-authored-by: lumin <71011125+l-melon@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 11:42:32 -08:00
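A rough sketch of what a helper in the spirit of `validate_strings` might look like. The signature below (plain text plus two iterables) is an assumption made for illustration; the helper in the test suite may take different arguments.

```python
from typing import Iterable


def validate_strings(
    text: str,
    expected: Iterable[str],
    excluded: Iterable[str] = (),
) -> None:
    """Assert that every expected string is in text and no excluded string is."""
    for needle in expected:
        assert needle in text, f"missing expected string: {needle!r}"
    for needle in excluded:
        assert needle not in text, f"found excluded string: {needle!r}"
```

A test can then replace a run of repeated `assert ... in ...` lines with one call per converted document.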
afourney
e6421000e3 Merge pull request #160 from sugatoray/support_type_hinting
Add support for type-hinting (PEP-561)
2024-12-20 10:54:43 -08:00
Sugato Ray
08a25345e3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:37:10 +00:00
Sugato Ray
8921fe7304 ignore .vscode folder
- avoid local developer vscode editor settings
2024-12-20 02:18:14 +00:00
Sugato Ray
613825d5b3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:12:24 +00:00
gagb
18e3f1d428 Merge pull request #91 from PetrAPConsulting/patch-1
Update README.md
2024-12-19 14:02:47 -08:00
gagb
c295dee5e4 Merge branch 'main' into patch-1 2024-12-19 13:22:51 -08:00
gagb
dd87dd5e36 Merge pull request #156 from microsoft/afourney-patch-1
Added holiday notice.
2024-12-19 11:18:24 -08:00
afourney
535147b2e8 Added holiday notice.
Added holiday notice.
2024-12-19 11:11:54 -08:00
gagb
5c776bda70 Update README.md 2024-12-19 10:30:53 -08:00
gagb
423a01844a Merge branch 'main' into patch-1 2024-12-19 10:30:10 -08:00
gagb
7147bef7d5 Merge pull request #130 from sugatoray/update_commandline_help
Update CLI helpdoc formatting to allow indentation in code
2024-12-19 10:20:23 -08:00
Sugato Ray
a5f39d6922 Merge branch 'main' into update_commandline_help 2024-12-19 07:58:48 -05:00
gagb
925c4499f7 Merge pull request #121 from l-lumin/add-project-description 2024-12-19 00:53:54 -08:00
Petr@AP Consulting
b28f380a47 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-19 09:23:15 +01:00
lumin
c86287b7e3 feat: add project description in pyproject.toml 2024-12-19 13:02:47 +09:00
Sugato Ray
6f3c762526 Merge branch 'main' into update_commandline_help 2024-12-18 17:50:07 -05:00
gagb
cb66b35f11 Merge pull request #132 from microsoft/gagb-patch-1
Add downloads badge
2024-12-18 14:30:09 -08:00
gagb
a2743a5314 Add downloads badge 2024-12-18 14:26:36 -08:00
Sugato Ray
277480066a Merge branch 'update_commandline_help' of https://github.com/sugatoray/markitdown into update_commandline_help 2024-12-18 21:53:54 +00:00
gagb
6e1b9a7402 Run precommit 2024-12-18 13:46:10 -08:00
Sugato Ray
1384e80725 update .gitignore to exclude .vscode folder 2024-12-18 21:46:06 +00:00
Sugato Ray
356e895306 update formatting with pre-commit 2024-12-18 21:45:23 +00:00
Petr@AP Consulting
f6e75c46d4 Update README.md
I changed command for running script from Mac version (python3) to Windows version (python)
2024-12-18 21:17:47 +01:00
afourney
8bc1bee18b Merge pull request #129 from finchy/main
Safeguard against path traversal for ZipConverter
2024-12-18 12:11:00 -08:00
Petr@AP Consulting
f4471d96e2 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:08:10 +01:00
Petr@AP Consulting
088007338d Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:07:55 +01:00
Petr@AP Consulting
bb929629f3 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:05:36 +01:00
Petr@AP Consulting
233ba679b8 Update README.md
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-18 21:05:04 +01:00
gagb
46b7f043d3 Merge branch 'main' into patch-1 2024-12-18 11:57:57 -08:00
gagb
5fc70864f2 Run pre-commit 2024-12-18 11:46:39 -08:00
Sugato Ray
39410d01df Update CLI helpdoc formatting to allow indentation in code
Use `textwrap.dedent()` to allow indented cli-helpdoc in `__main__.py` file. The indentation increases readability, while `textwrap.dedent` helps maintain the same functionality without breaking code.
2024-12-18 14:22:58 -05:00
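The `textwrap.dedent()` approach described above can be sketched as follows. The parser name, program name, and help text here are placeholders, not the contents of the project's `__main__.py`.

```python
import argparse
import textwrap


def build_help_parser() -> argparse.ArgumentParser:
    """Build a parser whose help text is written indented in the source."""
    # The text is indented to match the surrounding code; dedent() strips
    # the common leading whitespace so the rendered help starts at column 0.
    description = textwrap.dedent(
        """
        Convert a file to Markdown.

        Example:
            tool path-to-file.pdf > document.md
        """
    ).strip()
    return argparse.ArgumentParser(
        prog="tool",
        description=description,
        # Keep the line breaks and relative indentation of the description.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
```

Only the indentation shared by all lines is removed, so the example command keeps its four-space offset under `Example:`.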
Joel Esler
6e4caac70d Safeguard against path traversal for ZipConverter
fix: prevent path traversal vulnerabilities in ZipConverter

Added a secure check for path traversal vulnerabilities in the ZipConverter class.
Now validates extracted file paths using `os.path.commonprefix` to ensure all files
remain within the intended extraction directory. Raises a `ValueError` if a
path traversal attempt is detected.

- Normalized file paths using `os.path.normpath`.
- Added specific exception handling for `zipfile.BadZipFile` and traversal errors.
- Ensured cleanup of extracted files after processing when `cleanup_extracted` is enabled.
2024-12-18 13:12:55 -05:00
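A minimal sketch of this kind of containment check, not the code from the commit. Note one deliberate swap: the commit message names `os.path.commonprefix`, which compares strings character by character, while this sketch uses `os.path.commonpath`, which compares whole path components. The function name `safe_extract_path` is hypothetical.

```python
import os


def safe_extract_path(extraction_dir: str, member_name: str) -> str:
    """Return the absolute target path for a zip member, or raise ValueError."""
    base = os.path.abspath(extraction_dir)
    # Normalize so that "a/../../b" style names are resolved before checking.
    target = os.path.abspath(os.path.normpath(os.path.join(base, member_name)))
    # The member is safe only if the extraction directory is an ancestor of it.
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"Path traversal detected in zip member: {member_name!r}")
    return target
```

The component-wise comparison matters for sibling directories: `/tmp/out` and `/tmp/out_evil` share the character prefix `/tmp/out`, but their common path is `/tmp`.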
Petr@AP Consulting
224f1df0fc Update README.md
I collapsed section about batch processing as was suggested
2024-12-18 09:28:18 +01:00
gagb
1deaba1c6c Merge pull request #98 from waterimp/feature/fix-code-comments
fix incorrect comments for "bail if not ..." for WAV and image cases.
2024-12-17 17:57:25 -08:00
gagb
09cb048cbe Merge branch 'main' into feature/fix-code-comments 2024-12-17 17:34:53 -08:00
gagb
b029ae1cd4 Merge pull request #108 from microsoft/gagb-readme
Simplify README
2024-12-17 17:30:49 -08:00
gagb
524aa0da75 Update README.md 2024-12-17 17:25:40 -08:00
gagb
de1b54d79f Update README.md 2024-12-17 17:25:13 -08:00
gagb
1e7806a7ac Simplify 2024-12-17 17:21:39 -08:00
gagb
1163aa2b4e Merge pull request #106 from microsoft/gagb-patch-1
Update README.md
2024-12-17 16:57:32 -08:00
gagb
3bcf2bdae7 Update README.md 2024-12-17 16:54:17 -08:00
gagb
41a10b9a35 Merge pull request #64 from l-lumin/add-devcontainer-config
feat(devcontainer): Add DevContainer Configuration for Easier Contribution Setup
2024-12-17 16:52:50 -08:00
gagb
f1e399eee4 Merge branch 'main' into add-devcontainer-config 2024-12-17 16:50:32 -08:00
gagb
8b02c0bf9f Merge pull request #80 from diya155/main
Update README.md
2024-12-17 16:49:58 -08:00
gagb
1dda535330 Merge branch 'main' into main 2024-12-17 16:46:23 -08:00
gagb
362214323e Merge branch 'main' into feature/fix-code-comments 2024-12-17 16:38:47 -08:00
lumin
457b6234e6 Merge branch 'main' into add-devcontainer-config 2024-12-18 09:14:31 +09:00
afourney
790031409b Merge pull request #71 from AumGupta/main
feat: Add IpynbConverter
2024-12-17 15:41:51 -08:00
afourney
9e546a8588 Merge branch 'main' into main 2024-12-17 15:37:28 -08:00
afourney
ddf695cf81 Merge pull request #97 from Soulter/main
feat: Add RSSConverter
2024-12-17 15:34:22 -08:00
Adam Fourney
8d5f16ecd2 Fixed formatting. 2024-12-17 15:27:06 -08:00
afourney
a571021199 Merge branch 'main' into main 2024-12-17 15:12:59 -08:00
afourney
9add517510 Merge branch 'main' into feature/fix-code-comments 2024-12-17 14:56:16 -08:00
afourney
3ce21a47ab Merge pull request #102 from microsoft/bump_version
Bump version.
2024-12-17 13:55:12 -08:00
Adam Fourney
9518c01d4e Bump version. 2024-12-17 13:51:13 -08:00
afourney
22504551ef Merge pull request #101 from microsoft/add_deprecation_warnings
Added deprecation warnings for mlm_* arguments.
2024-12-17 13:49:44 -08:00
Adam Fourney
95188a4a27 Merge main. 2024-12-17 13:46:26 -08:00
afourney
e69d012b86 Merge pull request #100 from microsoft/add_llm_tests 2024-12-17 13:36:36 -08:00
Adam Fourney
03a7843a0a Added deprecation warnings for mlm_* arguments. 2024-12-17 13:22:48 -08:00
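One common way to deprecate renamed keyword arguments such as `mlm_client` and `mlm_model` is sketched below. `LegacyAwareConverter` and `_resolve_deprecated` are hypothetical stand-ins; the log does not show how the project implements its warnings.

```python
import warnings
from typing import Any, Optional


def _resolve_deprecated(new_value: Any, old_value: Any, new_name: str, old_name: str) -> Any:
    """Prefer the new argument; accept the old one with a DeprecationWarning."""
    if old_value is None:
        return new_value
    # stacklevel=3 points the warning at the code that constructed the object.
    warnings.warn(
        f"'{old_name}' is deprecated; use '{new_name}' instead.",
        DeprecationWarning,
        stacklevel=3,
    )
    if new_value is not None:
        raise ValueError(f"Pass either '{new_name}' or '{old_name}', not both.")
    return old_value


class LegacyAwareConverter:
    """Hypothetical class that accepts both the new and the old argument names."""

    def __init__(
        self,
        llm_client: Optional[Any] = None,
        llm_model: Optional[str] = None,
        mlm_client: Optional[Any] = None,  # deprecated alias
        mlm_model: Optional[str] = None,  # deprecated alias
    ) -> None:
        self.llm_client = _resolve_deprecated(llm_client, mlm_client, "llm_client", "mlm_client")
        self.llm_model = _resolve_deprecated(llm_model, mlm_model, "llm_model", "mlm_model")
```

Existing callers keep working while being told which name to switch to, and passing both names at once is rejected rather than silently resolved.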
Adam Fourney
248d64edd0 Added llm tests to the local test set. 2024-12-17 12:13:19 -08:00
Lee Bush
05a49ca129 fix incorrect comments for "bail if not ..." for WAV and image cases. 2024-12-17 08:10:53 -07:00
Soulter
752fbd333c feat: add tests of rss convertor 2024-12-17 22:45:27 +08:00
Soulter
7dc2695b96 feat: support convert atom to markdown 2024-12-17 21:44:50 +08:00
Soulter
53fad6eb31 feat: add rss converter 2024-12-17 21:22:27 +08:00
Petr@AP Consulting
f398f3d443 Update README.md
I added description and script for batch of files processing
2024-12-17 10:26:09 +01:00
lumin
e0a30295ff docs: update README with Devcontainer instructions
Add instructions for using Dev to run tests.Remove the install script it is no longer needed. 
Update trademark section for clarity.
2024-12-17 17:04:31 +09:00
lumin
07fe457a90 feat: add devcontainer configuration and installation script
Add a devcontainer configuration to streamline the development 
environment setup. Introduce an `install.sh` script to install 
the project in editable mode. Update the Dockerfile to use 
the `python:3.13-slim-bullseye` base image and install 
dependencies using `apt-get` for better compatibility.
2024-12-17 17:04:31 +09:00
Om Gupta
60c4a62917 Merge branch 'microsoft:main' into main 2024-12-17 10:33:40 +05:30
Om Gupta
3eb8cf385b Merge branch 'main' of https://github.com/AumGupta/markitdown 2024-12-17 10:24:30 +05:30
Om Gupta
8c91c11ea8 pre-commit run 2024-12-17 10:24:25 +05:30
diya155
14bd8d319a Update README.md 2024-12-17 09:16:40 +05:30
gagb
ad5d4fb139 Merge pull request #77 from microsoft/kevinclb/main
Kevinclb/main
2024-12-16 18:14:09 -08:00
gagb
ad29122592 run precommit 2024-12-16 18:09:48 -08:00
gagb
898bfd4774 Merge branch 'main' into main 2024-12-16 18:00:26 -08:00
gagb
c8980d9f41 Merge pull request #75 from microsoft/cybernobie/main
Cybernobie/main
2024-12-16 17:40:13 -08:00
gagb
24b52b2b8f Improve readme 2024-12-16 17:35:47 -08:00
gagb
09159aa04e Merge branch 'main' into main 2024-12-16 17:24:47 -08:00
gagb
77f620b568 Merge pull request #67 from DIMAX99/issue#65
fix issue #65
2024-12-16 17:18:53 -08:00
gagb
825d3bbb77 Merge branch 'main' into issue#65 2024-12-16 17:09:53 -08:00
gagb
c0127af120 Merge pull request #72 from CharlesCNorton/patch-1
Fix LLM terms
2024-12-16 17:06:24 -08:00
gagb
33cb5015eb Merge branch 'main' into patch-1 2024-12-16 17:04:44 -08:00
gagb
cf13b7e657 Merge pull request #73 from CharlesCNorton/patch-2
Fix LLM terminology in code
2024-12-16 17:04:33 -08:00
gagb
874eba6265 Merge branch 'main' into patch-2 2024-12-16 16:59:22 -08:00
gagb
c3fa2934b9 Run pre-commit 2024-12-16 16:56:52 -08:00
gagb
736e7d9a7e Merge branch 'main' into patch-1 2024-12-16 16:53:58 -08:00
gagb
19c111251b Merge pull request #60 from madduci/main
Added Dockerfile
2024-12-16 16:42:26 -08:00
gagb
360c2dd95f Merge branch 'main' into main 2024-12-16 16:35:50 -08:00
kevinbabou
87846cf5f8 rm setup.py 2024-12-16 16:28:44 -08:00
kevinbabou
33638f1fe6 feature: add argument parsing and setup.py file for cli tool capability 2024-12-16 16:28:44 -08:00
gagb
73776b2c0f Merge pull request #50 from narumiruna/youtube-transcript-languages
Support specifying YouTube transcript language
2024-12-16 16:23:20 -08:00
gagb
2d3ffeade1 Merge branch 'main' into youtube-transcript-languages 2024-12-16 16:20:35 -08:00
gagb
51c1453699 Merge pull request #48 from Soulter/main
Fix: pass the kwargs to _convert method when converting an url file
2024-12-16 16:19:09 -08:00
gagb
ae4669107c Merge branch 'main' into main 2024-12-16 16:01:59 -08:00
gagb
dbc727615d Merge branch 'main' into main 2024-12-16 15:48:49 -08:00
gagb
b0115cf971 Merge branch 'main' into youtube-transcript-languages 2024-12-16 15:47:38 -08:00
gagb
5cf8474f37 Merge pull request #44 from Y-Kim-64/main
Exclude test files from language statistics using linguist-vendored
2024-12-16 15:35:19 -08:00
gagb
83dc81170b Merge branch 'main' into main 2024-12-16 15:29:33 -08:00
gagb
e7a2e20d93 Merge pull request #39 from SH4DOW4RE/main
Catching pydub's warning of ffmpeg or avconv missing
2024-12-16 15:28:53 -08:00
gagb
980abd3a60 Merge branch 'main' into main 2024-12-16 15:24:58 -08:00
afourney
afaff11ef0 Merge branch 'main' into main 2024-12-16 14:40:58 -08:00
afourney
6587e0f097 Merge branch 'main' into patch-1 2024-12-16 14:27:26 -08:00
afourney
978c8763aa Merge pull request #38 from VillePuuska/support-comments-in-docx
Add passing style_map kwarg to Mammoth when converting docx to allow keeping comments
2024-12-16 14:26:55 -08:00
afourney
e7636656d8 Merge branch 'main' into support-comments-in-docx 2024-12-16 14:23:14 -08:00
afourney
ddc1bebea4 Merge branch 'main' into patch-2 2024-12-16 14:20:16 -08:00
afourney
fa1f496d51 Merge branch 'main' into patch-1 2024-12-16 14:18:20 -08:00
afourney
da779dd125 Merge pull request #33 from nyosegawa/feature/add-pptx-chart-support
Add PPTX chart support
2024-12-16 14:11:49 -08:00
afourney
12ce5e95b2 Merge branch 'main' into feature/add-pptx-chart-support 2024-12-16 14:06:14 -08:00
gagb
6dad1cca96 Merge pull request #22 from Josh-XT/main
Add zip handling
2024-12-16 13:56:25 -08:00
gagb
9e6a19987b Merge branch 'main' into main 2024-12-16 13:51:39 -08:00
gagb
ed91e8b534 Merge pull request #19 from brc-dd/fix/18
Fix character decoding issues with text-like files
2024-12-16 13:49:48 -08:00
gagb
aeff2cb5ae Merge branch 'main' into fix/18 2024-12-16 13:46:17 -08:00
gagb
c9c7d98d30 Merge pull request #11 from simonw/patch-2
CLI usage instructions
2024-12-16 13:45:05 -08:00
gagb
e7d9b5546a Merge branch 'main' into patch-2 2024-12-16 13:42:28 -08:00
CharlesCNorton
ed651aeb16 Fix LLM terminology in code
Replaced all occurrences of mlm_client and mlm_model with llm_client and llm_model for consistent terminology when referencing Large Language Models (LLMs).
2024-12-16 16:23:52 -05:00
CharlesCNorton
3d9f3f3e5b Fix LLM terms
Updated all instances of mlm_client and mlm_model to llm_client and llm_model in the readme. The previous terms (mlm_client and mlm_model) are incorrect in the context of configuring Large Language Models (LLMs), as "MLM" typically refers to Masked Language Models, which is unrelated to the intended functionality. This change aligns the documentation with standard naming conventions for LLM configuration parameters and improves clarity for users integrating with LLMs like OpenAI's GPT models.
2024-12-16 16:23:03 -05:00
Om Gupta
a3208f2bd0 feat: Add IpynbConverter
- Implemented IpynbConverter class for converting Jupyter Notebook (.ipynb) files into Markdown format.
- Supports markdown cells, code cells and raw cells.
- First markdown heading is used as the title if no title is found in notebook metadata.
- Created a test notebook (`test_notebook.ipynb`) to verify the functionality of the converter.
2024-12-17 01:00:41 +05:30
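The behaviour listed above can be approximated with the standard library alone. This is a simplified sketch, not the `IpynbConverter` class itself: `notebook_to_markdown` and its return shape are assumptions, and the `title` metadata key is taken from the commit description.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


def notebook_to_markdown(notebook_json: str) -> dict:
    """Convert notebook JSON text into a dict with 'title' and 'text' keys."""
    notebook = json.loads(notebook_json)
    title = notebook.get("metadata", {}).get("title")
    parts = []
    for cell in notebook.get("cells", []):
        # Cell sources are stored as a list of lines (or a single string).
        source = "".join(cell.get("source", []))
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            parts.append(source)
            # Fall back to the first level-1 heading when metadata has no title.
            if title is None:
                for line in source.splitlines():
                    if line.startswith("# "):
                        title = line[2:].strip()
                        break
        elif cell_type == "code":
            parts.append(f"{FENCE}python\n{source}\n{FENCE}")
        elif cell_type == "raw":
            parts.append(f"{FENCE}\n{source}\n{FENCE}")
    return {"title": title, "text": "\n\n".join(parts)}
```

Code cells are fenced as Python here for simplicity; a fuller converter could read the kernel language from the notebook metadata instead.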
Divit
ad01da308d fix issue #65 2024-12-16 21:48:33 +05:30
CyberNobie
010f841008 Ensure hatch is installed before running tests 2024-12-16 18:47:24 +05:30
Michele Adduci
5fc03b6415 Added UID as argument 2024-12-16 13:11:13 +01:00
Michele Adduci
013b022427 Added Docker Image for using markitdown in a sandboxed environment 2024-12-16 13:08:15 +01:00
narumi
695100d5d8 Support specifying YouTube transcript language 2024-12-16 13:16:00 +08:00
Soulter
d66ef5fcca Update README to introduce the customized mlm_prompt 2024-12-16 12:08:51 +08:00
Soulter
c168703d5e Pass the kwargs to _convert method when converting an url file 2024-12-16 11:41:39 +08:00
Yeonjun
3548c96dd3 Create .gitattributes
Mark test files as linguist-vendored
2024-12-16 09:21:07 +09:00
SH4DOW4RE
1559d9d163 pre-commit ran 2024-12-15 22:15:20 +01:00
SH4DOW4RE
b7f5662ffd PR: Catching pydub's warning of ffmpeg or avconv missing 2024-12-15 17:29:14 +01:00
Ville Puuska
0a7203b876 add style_map prop to MarkItDown class 2024-12-15 17:23:57 +02:00
Ville Puuska
0704b0b6ff pass 'style_map' kwarg to mammoth when converting docx 2024-12-15 16:59:21 +02:00
sakasegawa
0dd4e95584 Remove _is_chart 2024-12-15 21:14:58 +09:00
sakasegawa
93130b5ba5 Add PPTX chart support 2024-12-15 20:42:55 +09:00
Divyansh Singh
52b723724c Fix character decoding issues with text-like files 2024-12-15 10:37:59 +05:30
Josh XT
a55c3d525c Merge branch 'main' into main 2024-12-14 23:09:30 -05:00
gagb
81e3f24acd Merge pull request #29 from microsoft/gagb-patch-1
Update README.md
2024-12-14 19:17:54 -08:00
gagb
b84294620a Update README.md 2024-12-14 19:05:51 -08:00
gagb
60c495d609 Merge branch 'main' into patch-2 2024-12-14 18:57:11 -08:00
gagb
71123a4df3 Merge pull request #7 from microsoft/gagb/improve-readme
Improve the readme with contributing guidelines
2024-12-14 18:54:28 -08:00
gagb
5753e553fe Fix conflicts 2024-12-14 18:47:34 -08:00
gagb
752dd897b9 Merge pull request #28 from pawarbi/main
Update README.md
2024-12-14 18:44:52 -08:00
gagb
1aa4abe90f Merge branch 'gagb/improve-readme' into main 2024-12-14 18:44:33 -08:00
gagb
ea7c6dcc40 Merge pull request #27 from haesleinhuepf/patch-1
Add installation instructions from haesleinhuepf:patch-1
2024-12-14 18:39:51 -08:00
gagb
a31c0a13e7 Merge branch 'main' into gagb/improve-readme 2024-12-14 18:34:27 -08:00
Sandeep Pawar
30ab78fe9e Update README.md
I have updated the readme with three changes:
- Created sections for Installation and Usage to help users
- Added installation instruction
- Added additional example of using LLM. This will be the primary use case and will help users.
2024-12-14 19:15:10 -06:00
gagb
559b1fc62a Merge branch 'main' into patch-2 2024-12-14 15:02:42 -08:00
Josh XT
df03382218 Improve docustring 2024-12-14 17:55:22 -05:00
Robert Haase
18301edcd0 Add installation instructions 2024-12-14 23:22:54 +01:00
Josh XT
4987201ef6 test 2024-12-14 08:49:03 -05:00
Josh XT
571c5bbc0e add test 2024-12-14 08:45:51 -05:00
Josh XT
e8ea8b6f3d Update readme 2024-12-14 08:41:07 -05:00
Josh XT
7e634acf5f import zipfile 2024-12-14 08:24:44 -05:00
Josh XT
862c39029e add zip handling 2024-12-14 06:34:47 -05:00
afourney
70ab149ff1 Merge pull request #10 from simonw/patch-1
Remove invalid classifiers
2024-12-13 21:10:53 -08:00
Simon Willison
33ce17954d Note about piping 2024-12-13 11:09:03 -08:00
Simon Willison
6ebef5af0c CLI usage instructions
Plus added  a PyPI badge
2024-12-13 11:06:11 -08:00
Simon Willison
3b88696777 Remove invalid classifiers
requires-python says 3.10 and higher only
2024-12-13 10:53:35 -08:00
gagb
3f9ba06418 Improve the readme with contributing guidelines
Addresses issue https://github.com/microsoft/markitdown/issues/6

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown?shareId=XXXX-XXXX-XXXX-XXXX).
2024-12-12 15:17:18 -08:00
28 changed files with 1378 additions and 118 deletions

View File

@@ -0,0 +1,32 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
    "name": "Existing Dockerfile",
    "build": {
        // Sets the run context to one level up instead of the .devcontainer folder.
        "context": "..",
        // Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
        "dockerfile": "../Dockerfile",
        "args": {
            "INSTALL_GIT": "true"
        }
    },
    // Features to add to the dev container. More info: https://containers.dev/features.
    // "features": {},
    "features": {
        "ghcr.io/devcontainers-extra/features/hatch:2": {}
    },
    // Use 'forwardPorts' to make a list of ports inside the container available locally.
    // "forwardPorts": [],
    // Uncomment the next line to run commands after the container is created.
    // "postCreateCommand": "cat /etc/os-release",
    // Configure tool-specific properties.
    // "customizations": {},
    // Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
    "remoteUser": "root"
}

1
.dockerignore Normal file

@@ -0,0 +1 @@
*

1
.gitattributes vendored Normal file

@@ -0,0 +1 @@
tests/test_files/** linguist-vendored

6
.github/dependabot.yml vendored Normal file

@@ -0,0 +1,6 @@
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"

View File

@@ -5,9 +5,9 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v2
uses: actions/setup-python@v5
with:
python-version: "3.x"

View File

@@ -5,8 +5,8 @@ jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: |
3.10
@@ -14,7 +14,7 @@ jobs:
3.12
- name: Set up pip cache
if: runner.os == 'Linux'
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}

4
.gitignore vendored

@@ -1,3 +1,5 @@
.vscode
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -160,3 +162,5 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
src/.DS_Store
.DS_Store

23
Dockerfile Normal file

@@ -0,0 +1,23 @@
FROM python:3.13-slim-bullseye
USER root
ARG INSTALL_GIT=false
RUN if [ "$INSTALL_GIT" = "true" ]; then \
    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
    fi
# Runtime dependency
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*
RUN pip install markitdown
# Default USERID and GROUPID
ARG USERID=10000
ARG GROUPID=10000
USER $USERID:$GROUPID
ENTRYPOINT [ "markitdown" ]

158
README.md

@@ -1,28 +1,129 @@
# MarkItDown
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
It presently supports:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
The API is simple:
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .`
## Usage
### Command-Line
```bash
markitdown path-to-file.pdf > document.md
```
Or use `-o` to specify the output file:
```bash
markitdown path-to-file.pdf -o document.md
```
To use Document Intelligence conversion:
```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```
You can also pipe content:
```bash
cat path-to-file.pdf | markitdown
```
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
### Python API
Basic usage in Python:
```python
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```
Document Intelligence conversion in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```
### Docker
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
<details>
<summary>Batch Processing Multiple Files</summary>
This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
```python convert.py
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI(api_key="your-api-key-here")
md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
for file in files_to_convert:
    print(f"\nConverting {file}...")
    try:
        md_file = os.path.splitext(file)[0] + '.md'
        result = md.convert(file)
        with open(md_file, 'w') as f:
            f.write(result.text_content)
        print(f"Successfully converted {file} to {md_file}")
    except Exception as e:
        print(f"Error converting {file}: {str(e)}")
print("\nAll conversions completed!")
```
2. Place the script in the same directory as your files.
3. Install the required packages, such as `openai`.
4. Run the script: `python convert.py`
Note that original files will remain unchanged and new markdown files are created with the same base name.
</details>
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -37,6 +138,37 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### How to Contribute
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are, of course, just suggestions, and you are welcome to contribute in any way you like.
<div align="center">
| | All | Especially Needs Help from Community |
|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
</div>
### Running Tests and Checks
- Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```
(Alternative) Use the Devcontainer which has all the dependencies installed:
```sh
# Reopen the project in Devcontainer and run:
hatch test
```
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft


@@ -5,7 +5,7 @@ build-backend = "hatchling.build"
[project]
name = "markitdown"
dynamic = ["version"]
description = ''
description = 'Utility tool for converting various files to Markdown'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
@@ -16,11 +16,10 @@ authors = [
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
@@ -33,12 +32,18 @@ dependencies = [
"python-pptx",
"pandas",
"openpyxl",
"xlrd",
"pdfminer.six",
"puremagic",
"pydub",
"olefile",
"youtube-transcript-api",
"SpeechRecognition",
"pathvalidate",
"charset-normalizer",
"openai",
"azure-ai-documentintelligence",
"azure-identity"
]
[project.urls]
@@ -77,3 +82,6 @@ exclude_lines = [
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
]
[tool.hatch.build.targets.sdist]
only-include = ["src/markitdown"]


@@ -1,4 +1,4 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.0.1a1"
__version__ = "0.0.1a4"


@@ -1,41 +1,108 @@
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
import argparse
import sys
from ._markitdown import MarkItDown
import shutil
from textwrap import dedent
from .__about__ import __version__
from ._markitdown import MarkItDown, DocumentConverterResult
def main():
if len(sys.argv) == 1:
markitdown = MarkItDown()
result = markitdown.convert_stream(sys.stdin.buffer)
print(result.text_content)
elif len(sys.argv) == 2:
markitdown = MarkItDown()
result = markitdown.convert(sys.argv[1])
print(result.text_content)
else:
sys.stderr.write(
parser = argparse.ArgumentParser(
description="Convert various file formats to markdown.",
prog="markitdown",
formatter_class=argparse.RawDescriptionHelpFormatter,
usage=dedent(
"""
SYNTAX:
SYNTAX:
markitdown <OPTIONAL: FILENAME>
If FILENAME is empty, markitdown reads from stdin.
markitdown <OPTIONAL: FILENAME>
If FILENAME is empty, markitdown reads from stdin.
EXAMPLE:
EXAMPLE:
markitdown example.pdf
markitdown example.pdf
OR
OR
cat example.pdf | markitdown
cat example.pdf | markitdown
OR
OR
markitdown < example.pdf
""".strip()
+ "\n"
)
markitdown < example.pdf
OR to save to a file use
markitdown example.pdf -o example.md
OR
markitdown example.pdf > example.md
"""
).strip(),
)
parser.add_argument(
"-v",
"--version",
action="version",
version=f"%(prog)s {__version__}",
help="show the version number and exit",
)
parser.add_argument(
"-o",
"--output",
help="Output file name. If not provided, output is written to stdout.",
)
parser.add_argument(
"-d",
"--use-docintel",
action="store_true",
help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
)
parser.add_argument(
"-e",
"--endpoint",
type=str,
help="Document Intelligence Endpoint. Required if using Document Intelligence.",
)
parser.add_argument("filename", nargs="?")
args = parser.parse_args()
which_exiftool = shutil.which("exiftool")
if args.use_docintel:
if args.endpoint is None:
raise ValueError(
"Document Intelligence Endpoint is required when using Document Intelligence."
)
elif args.filename is None:
raise ValueError("Filename is required when using Document Intelligence.")
markitdown = MarkItDown(exiftool_path=which_exiftool, docintel_endpoint=args.endpoint)
else:
markitdown = MarkItDown(exiftool_path=which_exiftool)
if args.filename is None:
result = markitdown.convert_stream(sys.stdin.buffer)
else:
result = markitdown.convert(args.filename)
_handle_output(args, result)
def _handle_output(args, result: DocumentConverterResult):
"""Handle output to stdout or file"""
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(result.text_content)
else:
print(result.text_content)
if __name__ == "__main__":

File diff suppressed because it is too large

0
src/markitdown/py.typed Normal file

0
tests/test_files/test.docx vendored Executable file → Normal file

0
tests/test_files/test.jpg vendored Executable file → Normal file

10
tests/test_files/test.json vendored Normal file

@@ -0,0 +1,10 @@
{
"key1": "string_value",
"key2": 1234,
"key3": [
"list_value1",
"list_value2"
],
"5b64c88c-b3c3-4510-bcb8-da0b200602d8": "uuid_key",
"uuid_value": "9700dc99-6685-40b4-9a3a-5e406dcb37f3"
}

BIN
tests/test_files/test.pptx vendored Executable file → Normal file

Binary file not shown.

BIN
tests/test_files/test.xls vendored Normal file

Binary file not shown.

0
tests/test_files/test.xlsx vendored Executable file → Normal file

BIN
tests/test_files/test_files.zip vendored Normal file

Binary file not shown.

BIN
tests/test_files/test_llm.jpg vendored Normal file

Binary file not shown.


4
tests/test_files/test_mskanji.csv vendored Normal file

@@ -0,0 +1,4 @@
名前,年齢,住所
佐藤太郎,30,東京
三木英子,25,大阪
髙橋淳,35,名古屋

89
tests/test_files/test_notebook.ipynb vendored Normal file

@@ -0,0 +1,89 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0f61db80",
"metadata": {},
"source": [
"# Test Notebook"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "3f2a5bbd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"markitdown\n"
]
}
],
"source": [
"print('markitdown')"
]
},
{
"cell_type": "markdown",
"id": "9b9c0468",
"metadata": {},
"source": [
"## Code Cell Below"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "37d8088a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"# comment in code\n",
"print(42)"
]
},
{
"cell_type": "markdown",
"id": "2e3177bd",
"metadata": {},
"source": [
"End\n",
"\n",
"---"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
},
"title": "Test Notebook Title"
},
"nbformat": 4,
"nbformat_minor": 5
}

BIN
tests/test_files/test_outlook_msg.msg vendored Normal file

Binary file not shown.

1
tests/test_files/test_rss.xml vendored Normal file

File diff suppressed because one or more lines are too long

BIN
tests/test_files/test_with_comment.docx vendored Normal file

Binary file not shown.


@@ -6,11 +6,23 @@ import shutil
import pytest
import requests
from warnings import catch_warnings, resetwarnings
from markitdown import MarkItDown
skip_remote = (
True if os.environ.get("GITHUB_ACTIONS") else False
) # Don't run these tests in CI
# Don't run the llm tests without a key and the client library
skip_llm = False if os.environ.get("OPENAI_API_KEY") else True
try:
import openai
except ModuleNotFoundError:
skip_llm = True
# Skip exiftool tests if not installed
skip_exiftool = shutil.which("exiftool") is None
TEST_FILES_DIR = os.path.join(os.path.dirname(__file__), "test_files")
@@ -42,6 +54,12 @@ XLSX_TEST_STRINGS = [
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]
XLS_TEST_STRINGS = [
"## 09060124-b5e7-4717-9d07-3c046eb",
"6ff4173b-42a5-4784-9b19-f49caff4d93d",
"affc7dad-52dc-4b98-9b5d-51e65d8a8ad0",
]
DOCX_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
@@ -51,12 +69,34 @@ DOCX_TEST_STRINGS = [
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
]
MSG_TEST_STRINGS = [
"# Email Message",
"**From:** test.sender@example.com",
"**To:** test.recipient@example.com",
"**Subject:** Test Email Message",
"## Content",
"This is the body of the test email message",
]
DOCX_COMMENT_TEST_STRINGS = [
"314b0a30-5b04-470b-b9f7-eed2c2bec74a",
"49e168b7-d2ae-407f-a055-2167576f39a1",
"## d666f1f7-46cb-42bd-9a39-9a39cf2a509f",
"# Abstract",
"# Introduction",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"This is a test comment. 12df-321a",
"Yet another comment in the doc. 55yiyi-asd09",
]
PPTX_TEST_STRINGS = [
"2cdda5c8-e50e-4db4-b5f0-9722a649f455",
"04191ea8-5c73-4215-a1d3-1cfb43aaaf12",
"44bf7d06-5e7a-4a40-a2e1-a2e42ef28c8a",
"1b92870d-e3b5-4e65-8153-919f4ff45592",
"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation",
"a3f6004b-6f4f-4ea8-bee3-3741f4dc385f", # chart title
"2003", # chart value
]
BLOG_TEST_URL = "https://microsoft.github.io/autogen/blog/2023/04/21/LLM-tuning-math"
@@ -65,6 +105,13 @@ BLOG_TEST_STRINGS = [
"an example where high cost can easily prevent a generic complex",
]
RSS_TEST_STRINGS = [
"The Official Microsoft Blog",
"In the case of AI, it is absolutely true that the industry is moving incredibly fast",
]
WIKIPEDIA_TEST_URL = "https://en.wikipedia.org/wiki/Microsoft"
WIKIPEDIA_TEST_STRINGS = [
"Microsoft entered the operating system (OS) business in 1980 with its own version of [Unix]",
@@ -87,6 +134,33 @@ SERP_TEST_EXCLUDES = [
"data:image/svg+xml,%3Csvg%20width%3D",
]
CSV_CP932_TEST_STRINGS = [
"名前,年齢,住所",
"佐藤太郎,30,東京",
"三木英子,25,大阪",
"髙橋淳,35,名古屋",
]
LLM_TEST_STRINGS = [
"5bda1dd6",
]
JSON_TEST_STRINGS = [
"5b64c88c-b3c3-4510-bcb8-da0b200602d8",
"9700dc99-6685-40b4-9a3a-5e406dcb37f3",
]
# --- Helper Functions ---
def validate_strings(result, expected_strings, exclude_strings=None):
"""Validate presence or absence of specific strings."""
text_content = result.text_content.replace("\\", "")
for string in expected_strings:
assert string in text_content
if exclude_strings:
for string in exclude_strings:
assert string not in text_content
@pytest.mark.skipif(
skip_remote,
@@ -120,67 +194,175 @@ def test_markitdown_local() -> None:
# Test XLSX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xlsx"))
for test_string in XLSX_TEST_STRINGS:
validate_strings(result, XLSX_TEST_STRINGS)
# Test XLS processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.xls"))
for test_string in XLS_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
# Test DOCX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.docx"))
for test_string in DOCX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, DOCX_TEST_STRINGS)
# Test DOCX processing, with comments
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx"),
style_map="comment-reference => ",
)
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
# Test DOCX processing, with comments and setting style_map on init
markitdown_with_style_map = MarkItDown(style_map="comment-reference => ")
result = markitdown_with_style_map.convert(
os.path.join(TEST_FILES_DIR, "test_with_comment.docx")
)
validate_strings(result, DOCX_COMMENT_TEST_STRINGS)
# Test PPTX processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.pptx"))
for test_string in PPTX_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, PPTX_TEST_STRINGS)
# Test HTML processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_blog.html"), url=BLOG_TEST_URL
)
for test_string in BLOG_TEST_STRINGS:
text_content = result.text_content.replace("\\", "")
assert test_string in text_content
validate_strings(result, BLOG_TEST_STRINGS)
# Test ZIP file processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_files.zip"))
validate_strings(result, XLSX_TEST_STRINGS)
# Test Wikipedia processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_wikipedia.html"), url=WIKIPEDIA_TEST_URL
)
text_content = result.text_content.replace("\\", "")
for test_string in WIKIPEDIA_TEST_EXCLUDES:
assert test_string not in text_content
for test_string in WIKIPEDIA_TEST_STRINGS:
assert test_string in text_content
validate_strings(result, WIKIPEDIA_TEST_STRINGS, WIKIPEDIA_TEST_EXCLUDES)
# Test Bing processing
result = markitdown.convert(
os.path.join(TEST_FILES_DIR, "test_serp.html"), url=SERP_TEST_URL
)
text_content = result.text_content.replace("\\", "")
for test_string in SERP_TEST_EXCLUDES:
assert test_string not in text_content
for test_string in SERP_TEST_STRINGS:
validate_strings(result, SERP_TEST_STRINGS, SERP_TEST_EXCLUDES)
# Test RSS processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_rss.xml"))
text_content = result.text_content.replace("\\", "")
for test_string in RSS_TEST_STRINGS:
assert test_string in text_content
## Test non-UTF-8 encoding
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_mskanji.csv"))
validate_strings(result, CSV_CP932_TEST_STRINGS)
# Test MSG (Outlook email) processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_outlook_msg.msg"))
validate_strings(result, MSG_TEST_STRINGS)
# Test JSON processing
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.json"))
validate_strings(result, JSON_TEST_STRINGS)
# Test input with leading blank characters
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data))
assert "# Test" in result.text_content
@pytest.mark.skipif(
skip_exiftool,
reason="do not run if exiftool is not installed",
)
def test_markitdown_exiftool() -> None:
markitdown = MarkItDown()
# Test the automatic discovery of exiftool throws a warning
# and is disabled
try:
with catch_warnings(record=True) as w:
markitdown = MarkItDown()
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert result.text_content.strip() == ""
finally:
resetwarnings()
# Test JPG metadata processing
# Test explicitly setting the location of exiftool
which_exiftool = shutil.which("exiftool")
markitdown = MarkItDown(exiftool_path=which_exiftool)
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
for key in JPG_TEST_EXIFTOOL:
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
assert target in result.text_content
# Test setting the exiftool path through an environment variable
os.environ["EXIFTOOL_PATH"] = which_exiftool
markitdown = MarkItDown()
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test.jpg"))
for key in JPG_TEST_EXIFTOOL:
target = f"{key}: {JPG_TEST_EXIFTOOL[key]}"
assert target in result.text_content
def test_markitdown_deprecation() -> None:
try:
with catch_warnings(record=True) as w:
test_client = object()
markitdown = MarkItDown(mlm_client=test_client)
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert markitdown._llm_client == test_client
finally:
resetwarnings()
try:
with catch_warnings(record=True) as w:
markitdown = MarkItDown(mlm_model="gpt-4o")
assert len(w) == 1
assert w[0].category is DeprecationWarning
assert markitdown._llm_model == "gpt-4o"
finally:
resetwarnings()
try:
test_client = object()
markitdown = MarkItDown(mlm_client=test_client, llm_client=test_client)
assert False
except ValueError:
pass
try:
markitdown = MarkItDown(mlm_model="gpt-4o", llm_model="gpt-4o")
assert False
except ValueError:
pass
@pytest.mark.skipif(
skip_llm,
reason="do not run llm tests without a key",
)
def test_markitdown_llm() -> None:
client = openai.OpenAI()
markitdown = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = markitdown.convert(os.path.join(TEST_FILES_DIR, "test_llm.jpg"))
for test_string in LLM_TEST_STRINGS:
assert test_string in result.text_content
# This is not super precise. It would also accept "red square", "blue circle",
# "the square is not blue", etc. But it's sufficient for this test.
for test_string in ["red", "circle", "blue", "square"]:
assert test_string in result.text_content.lower()
if __name__ == "__main__":
"""Runs this file's tests from the command line."""
test_markitdown_remote()
test_markitdown_local()
# test_markitdown_remote()
# test_markitdown_local()
test_markitdown_exiftool()
# test_markitdown_deprecation()
# test_markitdown_llm()