Commit Graph

57 Commits

Author SHA1 Message Date
Soulter
1123392306 fix: support -o param to avoid encoding issues (#116)
* perf: cli supports -o param

* doc: update README

---------

Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:43:00 -08:00
SigireddyBalasai
5276616ba1 Added support to use Pathlib (#93)
* Add support for Path objects in MarkItDown conversion methods

* Remove unnecessary blank line in test_markitdown_exiftool function

* Remove unnecessary blank line in test_markitdown_exiftool function

* remove pathlib path in test file

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 14:12:48 -08:00
Sugato Ray
08a25345e3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:37:10 +00:00
Sugato Ray
613825d5b3 [feat]: add support for type-hinting for PEP-561 2024-12-20 02:12:24 +00:00
Sugato Ray
6f3c762526 Merge branch 'main' into update_commandline_help 2024-12-18 17:50:07 -05:00
Sugato Ray
356e895306 update formatting with pre-commit 2024-12-18 21:45:23 +00:00
gagb
5fc70864f2 Run pre-commit 2024-12-18 11:46:39 -08:00
Sugato Ray
39410d01df Update CLI helpdoc formatting to allow indentation in code
Use `textwrap.dedent()` to allow indented cli-helpdoc in `__main__.py` file. The indentation increases readability, while `textwrap.dedent` helps maintain the same functionality without breaking code.
2024-12-18 14:22:58 -05:00
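The effect this commit describes can be illustrated with a minimal sketch. It assumes an `argparse`-based CLI; the `build_parser` helper, the help text, and the example command are hypothetical and not taken from `__main__.py`.

```python
import argparse
import textwrap


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical help text, indented so it lines up with the surrounding code.
    description = textwrap.dedent(
        """\
        Convert a file to Markdown.

        Example:
            markitdown example.pdf
        """
    )
    # RawDescriptionHelpFormatter keeps the line breaks of the dedented text
    # instead of re-wrapping it into a single paragraph.
    return argparse.ArgumentParser(
        prog="markitdown",
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )


parser = build_parser()
```

`textwrap.dedent` strips only the whitespace common to every line, so the relative indentation of the example command survives while the left margin added for readability is removed.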
Joel Esler
6e4caac70d Safeguard against path traversal for ZipConverter
fix: prevent path traversal vulnerabilities in ZipConverter

Added a secure check for path traversal vulnerabilities in the ZipConverter class.
Now validates extracted file paths using `os.path.commonprefix` to ensure all files
remain within the intended extraction directory. Raises a `ValueError` if a
path traversal attempt is detected.

- Normalized file paths using `os.path.normpath`.
- Added specific exception handling for `zipfile.BadZipFile` and traversal errors.
- Ensured cleanup of extracted files after processing when `cleanup_extracted` is enabled.
2024-12-18 13:12:55 -05:00
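A containment check of the kind this commit describes might look like the sketch below. It is not the code from the commit: the helper name `safe_extract_path` is hypothetical, and it uses `os.path.commonpath` rather than the `os.path.commonprefix` named in the message, because `commonprefix` compares strings character by character and would treat `/tmp/out_evil` as sharing the prefix `/tmp/out`.

```python
import os


def safe_extract_path(extract_dir: str, member_name: str) -> str:
    """Return the normalized target path, or raise ValueError on traversal."""
    base = os.path.abspath(extract_dir)
    # normpath collapses ".." segments so the comparison sees the real target.
    target = os.path.normpath(os.path.join(base, member_name))
    # commonpath compares whole path components, so a sibling directory such
    # as "out_evil" is not mistaken for something inside "out".
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"Path traversal detected: {member_name}")
    return target
```

Each archive member name would be passed through such a check before extraction, and a `ValueError` would stop processing of the archive.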
gagb
362214323e Merge branch 'main' into feature/fix-code-comments 2024-12-17 16:38:47 -08:00
afourney
9e546a8588 Merge branch 'main' into main 2024-12-17 15:37:28 -08:00
Adam Fourney
8d5f16ecd2 Fixed formatting. 2024-12-17 15:27:06 -08:00
afourney
a571021199 Merge branch 'main' into main 2024-12-17 15:12:59 -08:00
afourney
9add517510 Merge branch 'main' into feature/fix-code-comments 2024-12-17 14:56:16 -08:00
Adam Fourney
9518c01d4e Bump version. 2024-12-17 13:51:13 -08:00
Adam Fourney
95188a4a27 Merge main. 2024-12-17 13:46:26 -08:00
Adam Fourney
03a7843a0a Added deprecation warnings for mlm_* arguments. 2024-12-17 13:22:48 -08:00
Adam Fourney
248d64edd0 Added llm tests to the local test set. 2024-12-17 12:13:19 -08:00
Lee Bush
05a49ca129 fix incorrect comments for "bail if not ..." for WAV and image cases. 2024-12-17 08:10:53 -07:00
Soulter
752fbd333c feat: add tests of rss convertor 2024-12-17 22:45:27 +08:00
Soulter
7dc2695b96 feat: support convert atom to markdown 2024-12-17 21:44:50 +08:00
Soulter
53fad6eb31 feat: add rss converter 2024-12-17 21:22:27 +08:00
Om Gupta
60c4a62917 Merge branch 'microsoft:main' into main 2024-12-17 10:33:40 +05:30
Om Gupta
3eb8cf385b Merge branch 'main' of https://github.com/AumGupta/markitdown 2024-12-17 10:24:30 +05:30
Om Gupta
8c91c11ea8 pre-commit run 2024-12-17 10:24:25 +05:30
gagb
ad29122592 run precommit 2024-12-16 18:09:48 -08:00
gagb
898bfd4774 Merge branch 'main' into main 2024-12-16 18:00:26 -08:00
gagb
825d3bbb77 Merge branch 'main' into issue#65 2024-12-16 17:09:53 -08:00
gagb
874eba6265 Merge branch 'main' into patch-2 2024-12-16 16:59:22 -08:00
gagb
c3fa2934b9 Run pre-commit 2024-12-16 16:56:52 -08:00
kevinbabou
33638f1fe6 feature: add argument parsing and setup.py file for cli tool capability 2024-12-16 16:28:44 -08:00
gagb
dbc727615d Merge branch 'main' into main 2024-12-16 15:48:49 -08:00
gagb
b0115cf971 Merge branch 'main' into youtube-transcript-languages 2024-12-16 15:47:38 -08:00
gagb
980abd3a60 Merge branch 'main' into main 2024-12-16 15:24:58 -08:00
afourney
afaff11ef0 Merge branch 'main' into main 2024-12-16 14:40:58 -08:00
afourney
e7636656d8 Merge branch 'main' into support-comments-in-docx 2024-12-16 14:23:14 -08:00
afourney
ddc1bebea4 Merge branch 'main' into patch-2 2024-12-16 14:20:16 -08:00
afourney
12ce5e95b2 Merge branch 'main' into feature/add-pptx-chart-support 2024-12-16 14:06:14 -08:00
gagb
9e6a19987b Merge branch 'main' into main 2024-12-16 13:51:39 -08:00
CharlesCNorton
ed651aeb16 Fix LLM terminology in code
Replaced all occurrences of mlm_client and mlm_model with llm_client and llm_model for consistent terminology when referencing Large Language Models (LLMs).
2024-12-16 16:23:52 -05:00
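A rename like this is often paired with a compatibility shim so that callers still passing the old names keep working, which is what the later "Added deprecation warnings for mlm_* arguments" commit suggests. The sketch below is an assumption about how such a shim could be written; `resolve_llm_kwargs` is a hypothetical helper, not a function from the repository.

```python
import warnings


def resolve_llm_kwargs(**kwargs):
    """Map legacy mlm_* keyword arguments onto their llm_* replacements."""
    resolved = dict(kwargs)
    for old, new in (("mlm_client", "llm_client"), ("mlm_model", "llm_model")):
        if old in resolved:
            warnings.warn(
                f"'{old}' is deprecated; use '{new}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            value = resolved.pop(old)
            # An explicitly passed new-style argument takes precedence.
            resolved.setdefault(new, value)
    return resolved
```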
Om Gupta
a3208f2bd0 feat: Add IpynbConverter
- Implemented IpynbConverter class for converting Jupyter Notebook (.ipynb) files into Markdown format.
- Supports markdown cells, code cells and raw cells.
- First markdown heading is used as the title if no title is found in notebook metadata.
- Created a test notebook (`test_notebook.ipynb`) to verify the functionality of the converter.
2024-12-17 01:00:41 +05:30
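The behavior listed in this commit can be sketched with the standard library alone. This is an illustration, not the `IpynbConverter` implementation: the function name `notebook_to_markdown`, the `title` metadata key, and the choice to fence code cells as Python are all assumptions.

```python
import json
from typing import Optional, Tuple


def notebook_to_markdown(nb_json: str) -> Tuple[Optional[str], str]:
    """Return (title, markdown) for a notebook given as a JSON string."""
    nb = json.loads(nb_json)
    # Assumed metadata key; the commit only says "notebook metadata".
    title = nb.get("metadata", {}).get("title")
    fence = "`" * 3
    parts = []
    for cell in nb.get("cells", []):
        # "source" may be a string or a list of strings; join handles both.
        source = "".join(cell.get("source", []))
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            parts.append(source)
            if title is None:
                # Fall back to the first level-one markdown heading.
                for line in source.splitlines():
                    if line.startswith("# "):
                        title = line[2:].strip()
                        break
        elif cell_type == "code":
            # Assumes Python code cells for the fence language.
            parts.append(f"{fence}python\n{source}\n{fence}")
        elif cell_type == "raw":
            parts.append(f"{fence}\n{source}\n{fence}")
    return title, "\n\n".join(parts)
```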
Divit
ad01da308d fix issue #65 2024-12-16 21:48:33 +05:30
narumi
695100d5d8 Support specifying YouTube transcript language 2024-12-16 13:16:00 +08:00
SH4DOW4RE
1559d9d163 pre-commit ran 2024-12-15 22:15:20 +01:00
SH4DOW4RE
b7f5662ffd PR: Catching pydub's warning of ffmpeg or avconv missing 2024-12-15 17:29:14 +01:00
Ville Puuska
0a7203b876 add style_map prop to MarkItDown class 2024-12-15 17:23:57 +02:00
Ville Puuska
0704b0b6ff pass 'style_map' kwarg to mammoth when converting docx 2024-12-15 16:59:21 +02:00
sakasegawa
0dd4e95584 Remove _is_chart 2024-12-15 21:14:58 +09:00
sakasegawa
93130b5ba5 Add PPTX chart support 2024-12-15 20:42:55 +09:00
Divyansh Singh
52b723724c Fix character decoding issues with text-like files 2024-12-15 10:37:59 +05:30