Commit Graph

278 Commits

Author SHA1 Message Date
scalabreseGD
36c4bc9ec3 Fixed deepcopy failure when passing llm_client (#1089)
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-05 23:25:37 -08:00
Andrea Pietrobon
80baa5db18 fix(README): correct pip install command formatting (#1090)
Added missing quotes around `markitdown[all]` in the installation command  
to ensure proper package resolution by pip.
2025-03-05 23:21:10 -08:00
Adam Fourney
00a65e8f8b Fixed version in README. 2025-03-05 23:10:21 -08:00
afourney
6bedf6d950 Fixed version. (#1097) v0.1.0a1 2025-03-05 22:52:52 -08:00
afourney
9380112892 Fixed loading of plugins. (#1096) 2025-03-05 22:24:08 -08:00
Adam Fourney
784c293579 Bump plugin version. 2025-03-05 21:55:20 -08:00
afourney
70e9f8c3c0 Bump version. (#1094) 2025-03-05 21:26:06 -08:00
afourney
e921497f79 Update converter API, user streams rather than file paths (#1088)
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.

---------

Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00
afourney
1d2f231146 Fixed property name (#1085) 2025-03-03 09:45:36 -08:00
afourney
c5cd659f63 Exploring ways to allow Optional dependencies (#1079)
* Enable optional dependencies. Starting with pptx.
* Fix CLI tests.... have them install [all]
* Added .docx to optional dependencies
* Reuse error messages for missing dependencies.
* Added xlsx and xls
* Added pdfs
* Added Ole files.
* Updated READMEs, and finished remaining feature-categories.
* Move OpenAI to hatch-test environment.
2025-03-03 09:06:19 -08:00
afourney
f01c6c5277 Exceptions should subclass Exception not BaseException. (#1082) 2025-02-28 16:28:35 -08:00
afourney
43bd79adc9 Print and log better exceptions when file conversions fail. (#1080)
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
afourney
9182923375 Don't have ZipConverter accept OOXML files. This will never yield a good result. (#1078) 2025-02-28 09:54:19 -08:00
afourney
9a19fdd134 Make sure extensions are unique in MarkItDown's convert methods. (#1076) 2025-02-28 07:43:03 -08:00
Matthew Powers
e82e0c1372 Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) (#331)
* Adds support for Shape Groups

* Update to Test PPtx for nested shape

* This line was accidentally removed and is added back here
2025-02-27 23:21:51 -08:00
Nima Akbarzadeh
a394cc7c27 fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035)
* fix: add error handling, refactor _findKey to use json.items()

* fix: improve metadata and description extraction logic

* fix: improve YouTube transcript extraction reliability

* fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue

* fix(readme): add youtube URLs as markitdown supports
2025-02-27 23:17:54 -08:00
tanreinama
a87fbf01ee add necessary imports (#861)
* add necessary imports
2025-02-27 23:16:09 -08:00
André Menezes
d0ed74fdf4 Fix UnboundLocalError in MarkItDown._convert (#1038)
Initialize `res` at the beginning of `_convert`. If the first converter raises an exception, then the `res` variable was not initialized and we got an error when checking `if res is not None`
2025-02-27 23:11:27 -08:00
afourney
e4b419ba40 Pin Markdownify version. (#1069)
* Pin markdownify version. TODO: update code for compatibility with Markdownify 1.0.0
2025-02-27 23:09:33 -08:00
afourney
dbdf2c0c10 Added CLI tests. (#327) 2025-02-11 20:42:50 -08:00
KennyZhang1
97eeed5f32 Doc Intelligence fixes for refactored code (#325)
* added priority flag to doc intel converter constructor
* fixed analysis features bug for docx
2025-02-11 16:01:46 -08:00
afourney
935da9976c Added priority argument to all converter constructors. (#324)
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
Ruijun Gao
5ce85c236c Fix a typo in sample RTF plugin (#320) 2025-02-11 10:33:52 -08:00
Tomasz Kalinowski
3a5ca22a8d Don't generate md links in 'pre' blocks (#322) 2025-02-11 07:13:17 -08:00
Adam Fourney
4b62506451 Small typo in README. 2025-02-10 15:24:28 -08:00
afourney
c73afcffea Cleanup and refactor, in preparation for plugin support. (#318)
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.
2025-02-10 15:21:44 -08:00
wunde005
73ba69d8cd For csv files mimetypes.guess_type is returning "application/vnd.ms-excel" on windows causing an invalid mime type in plaintextconverter. In reference to issue: https://github.com/microsoft/markitdown/issues/150 (#273) 2025-02-08 20:58:13 -08:00
Werner Robitza
2a4f7bb6a8 fix: argparse CLI option ordering, fixes #268 (#290)
* fix: argparse CLI option ordering, fixes #268
* Fixed formatting.
2025-02-08 20:50:38 -08:00
masquare
7cf5e0bb23 feat(pptx): support image description with LLM for pptx files (#306) 2025-02-08 20:37:34 -08:00
James Hickey
3090917a49 Typo fixed (#270) 2025-02-08 20:30:13 -08:00
ZeyuTeng96
7bea2672a0 remove leading and trailing \n for HtmlConverter (#262) 2025-02-08 20:28:35 -08:00
KennyZhang1
bf6a15e9b5 Kennyzhang/docintel docs (#312)
* updated docs to include doc intelligence

* include reference to doc intel setup docs
2025-01-31 22:23:26 -08:00
KennyZhang1
bfde857420 Add support for conversion via Document Intelligence (#303)
* added cli params for doc intel

* added DocumentIntelligenceConverter class implementation

* initialized doc intel client instance field

* added isolated doc_intel main conversion function

* temp fix for ContentFormat import bug

* ran tests for docintel and offline for many filetypes

* push doc intel converter to the top of the stack

* formatting changes

* modified project toml file
2025-01-24 14:09:32 -08:00
afourney
f58a864951 Set exiftool path explicitly. (#267) 2025-01-06 12:43:47 -08:00
afourney
265aea2edf Removed the holiday away message from README.md (#266) 2025-01-06 09:06:21 -08:00
afourney
05b78e7ce1 Recognize json as plain text (if no other handlers are present). (#261)
* Recognize json as plain text (if no other handlers are present).
2025-01-03 16:40:43 -08:00
afourney
436407288f If puremagic has no guesses, try again after ltrim. (#260) 2025-01-03 16:03:11 -08:00
afourney
731b39e7f5 Added a test for leading spaces. (#258) 2025-01-03 14:34:33 -08:00
yeungadrian
08ed32869e Feature/ Add xls support (#169)
* add xlrd
* add xls converter with tests
2025-01-03 13:58:17 -08:00
Murat Can Kurtuluş
d248621ba4 feat: outlook ".msg" file converter (#196)
* feat: outlook .msg converter
* add test, adjust docstring
2025-01-03 13:34:39 -08:00
AbSadiki
4678c8a2a4 fix(transcription): IS_AUDIO_TRANSCRIPTION_CAPABLE should be iniztialized (#194) 2025-01-03 13:29:26 -08:00
Ikko Eltociear Ashimine
125e206047 docs: update README.md (#182)
faciliate -> facilitate
2024-12-21 01:51:30 -08:00
numekudi
f94d09990e feat: enable Git support in devcontainer (#136)
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 18:09:17 -08:00
lumin
cfd2319c14 feat: add version option to markitdown CLI (#172)
Add a `--version` option to the markitdown command-line interface 
that displays the current version number.
2024-12-20 16:24:45 -08:00
dependabot[bot]
73161982ff Bump actions/setup-python from 2 to 5 (#179)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:20:22 -08:00
dependabot[bot]
9b69467772 Bump actions/cache from 3 to 4 (#178)
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:17:43 -08:00
gagb
857a2d160d Update README.md (#180) 2024-12-20 14:49:20 -08:00
Soulter
1123392306 fix: support -o param to avoid encoding issues (#116)
* perf: cli supports -o param

* doc: update README

---------

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:43:00 -08:00
dependabot[bot]
377a7eaa7d Bump actions/checkout from 2 to 4 (#177)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 14:36:48 -08:00
lumin
c1a0d3deaf chore: configure Dependabot for GitHub Actions updates (#112)
Sets up Dependabot to automatically check for updates to 
GitHub Actions on a weekly basis, ensuring that the project 
remains up-to-date with the latest dependencies and security 
fixes.

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:28:55 -08:00