Commit Graph

17 Commits

Author SHA1 Message Date
afourney
e928b43afb convert_url renamed to convert_uri, and now handles data and file URIs (#1153) 2025-03-24 21:43:04 -07:00
Yuzhong Zhang
52432bd228 Add support for preserving base64 encoded images (#1140)
* optional reserve base64 string in markdown _CustomMarkdownify and pptx
* add other converter para support
* fix linter
* Use *kwarg to pass keep_data_uri para.
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
2025-03-20 18:50:23 -07:00
afourney
c0a511ecff Updated docx file to include an image. (#1146) 2025-03-20 12:25:56 -07:00
afourney
a93e0567e6 EPub Support. Adapted #123 to not use epublib. (#1131)
* Adapted #123 to not use epublib.
* Updated README.md
2025-03-17 07:48:15 -07:00
afourney
c5f70b904f Have magika read from the stream. (#1136) 2025-03-17 07:39:19 -07:00
afourney
5c565b7d79 Fix remaining mypy errors. (#1132) 2025-03-15 23:12:48 -07:00
afourney
a78857bd43 Added epub test file. (#1130) 2025-03-15 18:34:51 -07:00
afourney
5f75e16d20 Refactored tests. (#1120)
* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.
2025-03-12 11:08:06 -07:00
afourney
8f8e58c9bb Minimize guesses when guesses are compatible. (#1114)
* Minimize guesses when guesses are compatible.
2025-03-10 15:30:44 -07:00
afourney
99d8e562db Fix exiftool in well-known paths. (#1106) 2025-03-07 21:47:20 -08:00
afourney
e921497f79 Update converter API, user streams rather than file paths (#1088)
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.

---------

Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00
afourney
43bd79adc9 Print and log better exceptions when file conversions fail. (#1080)
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
Matthew Powers
e82e0c1372 Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) (#331)
* Adds support for Shape Groups

* Update to Test PPtx for nested shape

* This line was accidentally removed and is added back here
2025-02-27 23:21:51 -08:00
Nima Akbarzadeh
a394cc7c27 fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035)
* fix: add error handling, refactor _findKey to use json.items()

* fix: improve metadata and description extraction logic

* fix: improve YouTube transcript extraction reliability

* fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue

* fix(readme): add youtube URLs as markitdown supports
2025-02-27 23:17:54 -08:00
afourney
dbdf2c0c10 Added CLI tests. (#327) 2025-02-11 20:42:50 -08:00
afourney
935da9976c Added priority argument to all converter constructors. (#324)
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
afourney
c73afcffea Cleanup and refactor, in preparation for plugin support. (#318)
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.
2025-02-10 15:21:44 -08:00