Commit Graph

24 Commits

Author SHA1 Message Date
t3tra
cb421cf9ea Chore: Make linter happy (#1256)
* refactor: remove unused imports

* fix: replace NotImplemented with NotImplementedError

* refactor: resolve E722 (do not use bare 'except')

* refactor: remove unused variable

* refactor: remove unused imports

* refactor: ignore unused imports that will be used in the future

* refactor: resolve W293 (blank line contains whitespace)

* refactor: resolve F541 (f-string is missing placeholders)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
2025-05-21 10:02:16 -07:00
Turdıbek
8576f1d915 Add CSV to Markdown table conversion - fixes #1144 (#1176)
* feat: Add CSV to Markdown table converter

- Add new CsvConverter class to convert CSV files to Markdown tables\n- Support text/csv and application/csv MIME types\n- Preserve table structure with headers and data rows\n- Handle edge cases like empty cells and mismatched columns\n- Fix Azure Document Intelligence dependency handling\n- Register CsvConverter in MarkItDown class

----

Thanks also to @benny123tw who submitted a very similar PR in #1171
2025-04-13 09:19:00 -07:00
afourney
9e067c42b6 Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (#1151)
* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.
2025-03-26 10:44:11 -07:00
afourney
e928b43afb convert_url renamed to convert_uri, and now handles data and file URIs (#1153) 2025-03-24 21:43:04 -07:00
afourney
a93e0567e6 EPub Support. Adapted #123 to not use epublib. (#1131)
* Adapted #123 to not use epublib.
* Updated README.md
2025-03-17 07:48:15 -07:00
afourney
c5f70b904f Have magika read from the stream. (#1136) 2025-03-17 07:39:19 -07:00
afourney
5f75e16d20 Refactored tests. (#1120)
* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.
2025-03-12 11:08:06 -07:00
afourney
af1be36e0c Added CLI options for extension, mimetypes, and charset. (#1115) 2025-03-11 13:16:33 -07:00
Adam Fourney
2e51ba22e7 Enhance type guessing. 2025-03-10 16:05:41 -07:00
afourney
8f8e58c9bb Minimize guesses when guesses are compatible. (#1114)
* Minimize guesses when guesses are compatible.
2025-03-10 15:30:44 -07:00
afourney
8e73a325c6 Switch from puremagic to magika. (#1108) 2025-03-10 12:49:52 -07:00
Mohit Agarwal
2405f201af fix typo in well-known path list (#1109) 2025-03-08 19:32:44 -08:00
afourney
99d8e562db Fix exiftool in well-known paths. (#1106) 2025-03-07 21:47:20 -08:00
Sebastian Yaghoubi
515fa854bf feat(docker): improve dockerfile build (#220)
* refactor(docker): remove unnecessary root user

The USER root directive isn't needed directly after FROM

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): use generic nobody nogroup default instead of uid gid

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): build app from source locally instead of installing package

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): use correct files in dockerignore

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* chore(docker): dont install recommended packages with git

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* fix(docker): run apt as non-interactive

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>

* Update Dockerfile to new package structure, and fix streaming bugs.

---------

Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-07 20:07:40 -08:00
afourney
82d84e3edd Fixed formatting. (#1098) 2025-03-05 23:30:29 -08:00
scalabreseGD
36c4bc9ec3 Fixed deepcopy failure when passing llm_client (#1089)
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-05 23:25:37 -08:00
afourney
9380112892 Fixed loading of plugins. (#1096) 2025-03-05 22:24:08 -08:00
afourney
e921497f79 Update converter API, user streams rather than file paths (#1088)
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.

---------

Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00
afourney
c5cd659f63 Exploring ways to allow Optional dependencies (#1079)
* Enable optional dependencies. Starting with pptx.
* Fix CLI tests.... have them install [all]
* Added .docx to optional dependencies
* Reuse error messages for missing dependencies.
* Added xlsx and xls
* Added pdfs
* Added Ole files.
* Updated READMEs, and finished remaining feature-categories.
* Move OpenAI to hatch-test environment.
2025-03-03 09:06:19 -08:00
afourney
43bd79adc9 Print and log better exceptions when file conversions fail. (#1080)
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
afourney
9a19fdd134 Make sure extensions are unique in MarkItDown's convert methods. (#1076) 2025-02-28 07:43:03 -08:00
André Menezes
d0ed74fdf4 Fix UnboundLocalError in MarkItDown._convert (#1038)
Initialize `res` at the beginning of `_convert`. If the first converter raises an exception, then the `res` variable was not initialized and we got an error when checking `if res is not None`
2025-02-27 23:11:27 -08:00
afourney
935da9976c Added priority argument to all converter constructors. (#324)
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
afourney
c73afcffea Cleanup and refactor, in preparation for plugin support. (#318)
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.
2025-02-10 15:21:44 -08:00