270 Commits

Author SHA1 Message Date
Yi-Cheng Wang
131f0c7739 feat: add Document Intelligence API version selection via kwargs (#1253)
Co-authored-by: Yi-Cheng Wang <yicheng.wang@heph-ai.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:22:08 -07:00
JoshClark-git
56f7579ce2 FIX YouTube transcript errors (#1241)
* FIX YouTube transcript errors

* Fixed formatting.

---------

Co-authored-by: Josh <jca351@sfu.ca>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:17:57 -07:00
t3tra
cb421cf9ea Chore: Make linter happy (#1256)
* refactor: remove unused imports

* fix: replace NotImplemented with NotImplementedError

* refactor: resolve E722 (do not use bare 'except')

* refactor: remove unused variable

* refactor: remove unused imports

* refactor: ignore unused imports that will be used in the future

* refactor: resolve W293 (blank line contains whitespace)

* refactor: resolve F541 (f-string is missing placeholders)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:02:16 -07:00
kira-offgrid
39e7252940 fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse-packages-markitdown-src-markitdown-converter_utils-docx-math-omml.py (#1251) 2025-05-21 09:57:21 -07:00
afourney
bbcf876b18 Switched from the stdlib minidom parser to defusedxml. (#1259) 2025-05-21 09:47:14 -07:00
createcentury
041be54471 Update README.md (#1187)
updated subtle misspelling.
2025-04-13 09:31:40 -07:00
lentil32
ebe2684b3d chore: fix typo in README.md (#1175)
* chore: fix typo in README.md
2025-04-13 09:29:16 -07:00
Turdıbek
8576f1d915 Add CSV to Markdown table conversion - fixes #1144 (#1176)
* feat: Add CSV to Markdown table converter

- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class

----

Thanks also to @benny123tw who submitted a very similar PR in #1171
2025-04-13 09:19:00 -07:00
Sathindu
3fcd48cdfc feat: render math equations in .docx documents (#1160)
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
2025-03-28 15:36:38 -07:00
afourney
9e067c42b6 Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151)
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
2025-03-26 10:44:11 -07:00
afourney
9a951055f0 Update readme to point to the mcp package. (#1158)
* Updated readme with link to the MCP package.
2025-03-25 15:00:04 -07:00
afourney
73b9d57312 Update badges (#1157)
* Update badges in subpackages.
2025-03-25 14:52:24 -07:00
afourney
3ca57986ef Basic SSE MCP Server for MarkItDown (#1155)
* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.
2025-03-25 14:38:22 -07:00
afourney
c1f9a323ee Bump version. (#1154) v0.1.1 2025-03-24 23:26:30 -07:00
afourney
e928b43afb convert_url renamed to convert_uri, and now handles data and file URIs (#1153) 2025-03-24 21:43:04 -07:00
afourney
2ffe6ea591 Bump version. (#1150) v0.1.0 2025-03-22 11:21:32 -07:00
afourney
efc55b260d Bump version and resolve a console encoding error. (#1149) v0.1.0a6 2025-03-21 09:27:25 -07:00
Yuzhong Zhang
52432bd228 Add support for preserving base64 encoded images (#1140)
* Optionally preserve base64 strings in markdown _CustomMarkdownify and pptx
* Add parameter support to other converters
* fix linter
* Use *kwarg to pass the keep_data_uri parameter.
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
2025-03-20 18:50:23 -07:00
afourney
c0a511ecff Updated docx file to include an image. (#1146) 2025-03-20 12:25:56 -07:00
afourney
cd6aa41361 Adjust warning filters and update dependencies (#1143)
Adjusts warning filters to be more contextual
Updates dependencies for magika and youtube-transcript-api
Updates the version to 0.1.0a5 in __about__.py
v0.1.0a5
2025-03-19 22:09:14 -07:00
afourney
716f74dcb9 Consider anything with a charset as plain text-convertible. (#1142) 2025-03-19 20:46:35 -07:00
afourney
a93e0567e6 EPub Support. Adapted #123 to not use epublib. (#1131)
* Adapted #123 to not use epublib.
* Updated README.md
v0.1.0a4
2025-03-17 07:48:15 -07:00
afourney
c5f70b904f Have magika read from the stream. (#1136) 2025-03-17 07:39:19 -07:00
afourney
53834fdd24 Investigate and silence warnings. (#1133) 2025-03-15 23:41:35 -07:00
afourney
5c565b7d79 Fix remaining mypy errors. (#1132) 2025-03-15 23:12:48 -07:00
afourney
a78857bd43 Added epub test file. (#1130) 2025-03-15 18:34:51 -07:00
afourney
09df7fe8df Small fixes for autogen integration. (#1124) 2025-03-12 19:18:11 -07:00
Adam Fourney
6a9f09b153 Updated Magika dependency. 2025-03-12 16:15:33 -07:00
afourney
0b815fb916 Bumping version to 0.1.0a2 (#1123) 2025-03-12 11:44:19 -07:00
Emanuele Meazzo
12620f1545 Handle not supported plot type in pptx (#1122)
* Handle not supported plot type in pptx
* Fixed formatting.
2025-03-12 11:26:23 -07:00
afourney
5f75e16d20 Refactored tests. (#1120)
* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.
2025-03-12 11:08:06 -07:00
yushihang
75140a90e2 fix: correct f-string formatting in FileConversionException (#1121) 2025-03-12 10:15:09 -07:00
afourney
af1be36e0c Added CLI options for extension, mimetypes, and charset. (#1115) 2025-03-11 13:16:33 -07:00
Adam Fourney
2a2ccc86aa Added mimetypes to _rss_converter 2025-03-10 16:17:41 -07:00
Adam Fourney
2e51ba22e7 Enhance type guessing. 2025-03-10 16:05:41 -07:00
afourney
8f8e58c9bb Minimize guesses when guesses are compatible. (#1114)
* Minimize guesses when guesses are compatible.
2025-03-10 15:30:44 -07:00
afourney
8e73a325c6 Switch from puremagic to magika. (#1108) 2025-03-10 12:49:52 -07:00
Mohit Agarwal
2405f201af fix typo in well-known path list (#1109) 2025-03-08 19:32:44 -08:00
afourney
99d8e562db Fix exiftool in well-known paths. (#1106) 2025-03-07 21:47:20 -08:00
Sebastian Yaghoubi
515fa854bf feat(docker): improve dockerfile build (#220)
* refactor(docker): remove unnecessary root user

The USER root directive isn't needed directly after FROM

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): use generic nobody nogroup default instead of uid gid

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): build app from source locally instead of installing package

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): use correct files in dockerignore

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* chore(docker): dont install recommended packages with git

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): run apt as non-interactive

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* Update Dockerfile to new package structure, and fix streaming bugs.

---------

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-07 20:07:40 -08:00
Richard Ye
0229ff6cb7 feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order (#1104)
* Sort PPTX shapes to be read in top-to-bottom, left-to-right order

Referenced from 39bef65b31/pptx2md/parser.py (L249)

* Update README.md
* Fixed formatting.
* Added missing import
2025-03-07 15:45:14 -08:00
afourney
82d84e3edd Fixed formatting. (#1098) 2025-03-05 23:30:29 -08:00
scalabreseGD
36c4bc9ec3 Fixed deepcopy failure when passing llm_client (#1089)
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-05 23:25:37 -08:00
Andrea Pietrobon
80baa5db18 fix(README): correct pip install command formatting (#1090)
Added missing quotes around `markitdown[all]` in the installation command  
to ensure proper package resolution by pip.
2025-03-05 23:21:10 -08:00
Adam Fourney
00a65e8f8b Fixed version in README. 2025-03-05 23:10:21 -08:00
afourney
6bedf6d950 Fixed version. (#1097) v0.1.0a1 2025-03-05 22:52:52 -08:00
afourney
9380112892 Fixed loading of plugins. (#1096) 2025-03-05 22:24:08 -08:00
Adam Fourney
784c293579 Bump plugin version. 2025-03-05 21:55:20 -08:00
afourney
70e9f8c3c0 Bump version. (#1094) 2025-03-05 21:26:06 -08:00
afourney
e921497f79 Update converter API, use streams rather than file paths (#1088)
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.

---------

Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00