From f398f3d4434ef1664250e1032b7f2733944dd6d5 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting"
Date: Tue, 17 Dec 2024 10:26:09 +0100
Subject: [PATCH 1/9] Update README.md

I added description and script for batch of files processing
---
 README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/README.md b/README.md
index 7079dbf..01ceb71 100644
--- a/README.md
+++ b/README.md
@@ -78,7 +78,51 @@ You can also use the project as Docker Image:
 docker build -t markitdown:latest .
 docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 ```
 
+Batch Processing Multiple Files
+This extension allows you to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
+
+Features
+
+- Converts multiple files in one operation
+- Supports various file formats (.pptx, .docx, .pdf, .jpg, .jpeg, .png etc. you can change it)
+- Maintains original filenames (changes extension to .md)
+- Includes GPT-4o-latest image descriptions when available
+- Continues processing if individual files fail
+
+Usage
+1. Create a Python script (e.g., convert.py):
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+client = OpenAI(api_key="your-api-key-here")
+md = MarkItDown(mlm_client=client, mlm_model="gpt-4o-2024-11-20")
+supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
+files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
+for file in files_to_convert:
+    print(f"\nConverting {file}...")
+    try:
+        md_file = os.path.splitext(file)[0] + '.md'
+        result = md.convert(file)
+        with open(md_file, 'w') as f:
+            f.write(result.text_content)
+
+        print(f"Successfully converted {file} to {md_file}")
+    except Exception as e:
+        print(f"Error converting {file}: {str(e)}")
+
+print("\nAll conversions completed!")
+```
+2. Place the script in the same directory as your files
+3. Install required packages: like openai
+4. Run script ```bash python3 convert.py ```
+
+- The script processes all supported files in the current directory
+- Original files remain unchanged
+- New markdown files are created with the same base name
+- Progress and any errors are displayed during conversion
+
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a

From 224f1df0fc33e83d49825e3d8b947d945787ad7d Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting"
Date: Wed, 18 Dec 2024 09:28:18 +0100
Subject: [PATCH 2/9] Update README.md

I collapsed section about batch processing as was suggested
---
 README.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 01ceb71..669caa2 100644
--- a/README.md
+++ b/README.md
@@ -78,11 +78,13 @@ You can also use the project as Docker Image:
 docker build -t markitdown:latest .
 docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 ```
 
-Batch Processing Multiple Files
+
+
+Batch Processing Multiple Files
 This extension allows you to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
 
-Features
+### Features
 
 - Converts multiple files in one operation
 - Supports various file formats (.pptx, .docx, .pdf, .jpg, .jpeg, .png etc. you can change it)
@@ -90,7 +92,7 @@ Features
 - Includes GPT-4o-latest image descriptions when available
 - Continues processing if individual files fail
 
-Usage
+### Usage
 1. Create a Python script (e.g., convert.py):
 ```python
 from markitdown import MarkItDown
@@ -122,6 +124,8 @@ print("\nAll conversions completed!")
 - Original files remain unchanged
 - New markdown files are created with the same base name
 - Progress and any errors are displayed during conversion
+
+
 ## Contributing
 

From 233ba679b88389fb53aded8a15f7b967f93f5af3 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Wed, 18 Dec 2024 21:05:04 +0100
Subject: [PATCH 3/9] Update README.md

Co-authored-by: gagb
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ebeca7f..f160d86 100644
--- a/README.md
+++ b/README.md
@@ -64,7 +64,7 @@ docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 
 Batch Processing Multiple Files
 
-This extension allows you to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
+This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
 
 ### Features
 

From bb929629f3c573adef8fcdf15ae7112f052299d5 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Wed, 18 Dec 2024 21:05:36 +0100
Subject: [PATCH 4/9] Update README.md

Co-authored-by: gagb
---
 README.md | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/README.md b/README.md
index f160d86..627243d 100644
--- a/README.md
+++ b/README.md
@@ -66,13 +66,6 @@ docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 
 This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
 
-### Features
-
-- Converts multiple files in one operation
-- Supports various file formats (.pptx, .docx, .pdf, .jpg, .jpeg, .png etc. you can change it)
-- Maintains original filenames (changes extension to .md)
-- Includes GPT-4o-latest image descriptions when available
-- Continues processing if individual files fail
 
 ### Usage
 1. Create a Python script (e.g., convert.py):

From 088007338d1567299a6654cf95fb4616413f7131 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Wed, 18 Dec 2024 21:07:55 +0100
Subject: [PATCH 5/9] Update README.md

Co-authored-by: gagb
---
 README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/README.md b/README.md
index 627243d..62de2e6 100644
--- a/README.md
+++ b/README.md
@@ -67,8 +67,6 @@ docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
 
 
-### Usage
-1. Create a Python script (e.g., convert.py):
 ```python
 from markitdown import MarkItDown
 from openai import OpenAI

From f4471d96e2b61de672e3e1f4bf95222191844274 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Wed, 18 Dec 2024 21:08:10 +0100
Subject: [PATCH 6/9] Update README.md

Co-authored-by: gagb
---
 README.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 62de2e6..70a1b5c 100644
--- a/README.md
+++ b/README.md
@@ -93,10 +93,7 @@ print("\nAll conversions completed!")
 3. Install required packages: like openai
 4. Run script ```bash python3 convert.py ```
 
-- The script processes all supported files in the current directory
-- Original files remain unchanged
-- New markdown files are created with the same base name
-- Progress and any errors are displayed during conversion
+Note that original files will remain unchanged and new markdown files are created with the same base name.
 

From f6e75c46d4f08a073f5fc07dd0bc122138f52436 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Wed, 18 Dec 2024 21:17:47 +0100
Subject: [PATCH 7/9] Update README.md

I changed command for running script from Mac version (python3) to Windows version (python)
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 70a1b5c..b82e5fc 100644
--- a/README.md
+++ b/README.md
@@ -91,7 +91,7 @@ print("\nAll conversions completed!")
 ```
 2. Place the script in the same directory as your files
 3. Install required packages: like openai
-4. Run script ```bash python3 convert.py ```
+4. Run script ```bash python convert.py ```
 
 Note that original files will remain unchanged and new markdown files are created with the same base name.
 

From b28f380a4768bfb88f9bd209cadc97ae73b7a5b8 Mon Sep 17 00:00:00 2001
From: "Petr@AP Consulting" <173082609+PetrAPConsulting@users.noreply.github.com>
Date: Thu, 19 Dec 2024 09:23:15 +0100
Subject: [PATCH 8/9] Update README.md

Co-authored-by: gagb
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b82e5fc..d0201d4 100644
--- a/README.md
+++ b/README.md
@@ -67,7 +67,7 @@ docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
 This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.
 
 
-```python
+```python convert.py
 from markitdown import MarkItDown
 from openai import OpenAI
 import os

From 5c776bda70619e1a59ec8178fae1e3bdb12ff17b Mon Sep 17 00:00:00 2001
From: gagb
Date: Thu, 19 Dec 2024 10:30:53 -0800
Subject: [PATCH 9/9] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a91768a..6ffe8ff 100644
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ from markitdown import MarkItDown
 from openai import OpenAI
 import os
 client = OpenAI(api_key="your-api-key-here")
-md = MarkItDown(mlm_client=client, mlm_model="gpt-4o-2024-11-20")
+md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
 supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
 files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
 for file in files_to_convert:
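
Reviewer note (not part of the patch series): the batch loop that PATCH 1/9 adds to the README can also be written as a small reusable function. The version below is a minimal sketch under stated assumptions, not the README's script. `convert_directory`, its `convert` parameter, and the returned report dictionary are hypothetical names introduced here for illustration; the converter is passed in as a callable so the loop can be exercised without `markitdown` installed or an OpenAI key. The extension list and the continue-on-error behaviour follow the script in the diff.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Extensions copied from the README script in the diff; adjust as needed.
SUPPORTED_EXTENSIONS = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')


def convert_directory(
    directory: Path,
    convert: Callable[[Path], str],
    extensions: Iterable[str] = SUPPORTED_EXTENSIONS,
) -> Dict[str, List[str]]:
    """Convert each supported file in `directory` to <base name>.md beside it.

    `convert` is a hypothetical callable (an assumption of this sketch, not a
    markitdown API) that takes a path and returns Markdown text. Failures are
    recorded and reported instead of raised, so one bad file does not stop
    the rest of the batch.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    report: Dict[str, List[str]] = {"converted": [], "failed": []}
    # Take a sorted snapshot first so files written below are not re-listed.
    for path in sorted(directory.iterdir()):
        if not path.is_file() or not path.name.lower().endswith(suffixes):
            continue
        try:
            text = convert(path)
            # Explicit UTF-8 avoids depending on the platform default encoding.
            path.with_suffix(".md").write_text(text, encoding="utf-8")
            report["converted"].append(path.name)
        except Exception as exc:  # keep going, as the README script does
            print(f"Error converting {path.name}: {exc}")
            report["failed"].append(path.name)
    return report


# Possible wiring, using the names from the README script as of PATCH 9/9:
#   md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
#   convert_directory(Path("."), lambda p: md.convert(str(p)).text_content)
```

As in the README script, two inputs that share a base name (for example `report.pdf` and `report.docx`) would write to the same `.md` file, so the later one overwrites the earlier one.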