Merge branch 'main' into feature/fix-code-comments

gagb
2024-12-17 17:34:53 -08:00
committed by GitHub
3 changed files with 76 additions and 64 deletions

@@ -0,0 +1,29 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
"name": "Existing Dockerfile",
"build": {
// Sets the run context to one level up instead of the .devcontainer folder.
"context": "..",
// Update the 'dockerfile' property if you aren't using the standard 'Dockerfile' filename.
"dockerfile": "../Dockerfile"
},
// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
"features": {
"ghcr.io/devcontainers-extra/features/hatch:2": {}
},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Uncomment the next line to run commands after the container is created.
// "postCreateCommand": "cat /etc/os-release",
// Configure tool-specific properties.
// "customizations": {},
// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
"remoteUser": "root"
}

@@ -1,9 +1,11 @@
FROM python:3.13-alpine FROM python:3.13-slim-bullseye
USER root USER root
# Runtime dependency # Runtime dependency
RUN apk add --no-cache ffmpeg RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
RUN pip install markitdown RUN pip install markitdown

101
README.md

@@ -2,65 +2,47 @@
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/) [![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.) MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.).
It supports:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
It presently supports: To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source: `pip install -e .`
- PDF (.pdf) ## Usage
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (Iterates over contents and converts each file)
# Installation ### Command-Line
You can install `markitdown` using pip:
```python
pip install markitdown
```
or from the source
```sh
pip install -e .
```
# Usage
The API is simple:
```python
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
```
To use this as a command-line utility, install it and then run it like this:
```bash
markitdown path-to-file.pdf
```
This will output Markdown to standard output. You can save it like this:
```bash ```bash
markitdown path-to-file.pdf > document.md markitdown path-to-file.pdf > document.md
``` ```
You can pipe content to standard input by omitting the argument: You can also pipe content:
```bash ```bash
cat path-to-file.pdf | markitdown cat path-to-file.pdf | markitdown
``` ```
You can also configure markitdown to use Large Language Models to describe images. To do so you must provide `llm_client` and `llm_model` parameters to MarkItDown object, according to your specific client. ### Python API
Basic usage in Python:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```
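The command-line redirect shown earlier (`markitdown path-to-file.pdf > document.md`) can probably be reproduced from Python as well. The following is a minimal sketch that assumes only what the basic usage example shows: a `convert(path)` call returning an object with a `text_content` attribute. The helper names `save_markdown` and `default_converter` are hypothetical and not part of MarkItDown.

```python
from pathlib import Path


def default_converter():
    """Build the converter from the README example (requires `pip install markitdown`)."""
    # Imported lazily so save_markdown can be used with any compatible object.
    from markitdown import MarkItDown

    return MarkItDown()


def save_markdown(converter, source: str, destination: str) -> int:
    """Convert `source` and write the resulting Markdown to `destination`.

    `converter` only needs a `convert(path)` method whose result exposes
    `text_content`, which is all the README example relies on.
    Returns the number of characters written.
    """
    result = converter.convert(source)
    return Path(destination).write_text(result.text_content, encoding="utf-8")


# Hypothetical usage, mirroring `markitdown path-to-file.pdf > document.md`:
# save_markdown(default_converter(), "path-to-file.pdf", "document.md")
```

Passing the converter in as an argument keeps the helper easy to exercise with a stand-in object, so it can be tested without installing the package or touching real input files.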
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
```python ```python
from markitdown import MarkItDown from markitdown import MarkItDown
@@ -72,7 +54,7 @@ result = md.convert("example.jpg")
print(result.text_content) print(result.text_content)
``` ```
You can also use the project as Docker Image: ### Docker
```sh ```sh
docker build -t markitdown:latest . docker build -t markitdown:latest .
@@ -93,23 +75,22 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### Running Tests ### Running Tests and Checks
To run tests, install `hatch` using `pip` or other methods as described [here](https://hatch.pypa.io/dev/install). - Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```
```sh (Alternative) Use the Devcontainer which has all the dependencies installed:
pip install hatch ```sh
hatch shell # Reopen the project in Devcontainer and run:
hatch test hatch test
``` ```
### Running Pre-commit Checks - Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
Please run the pre-commit checks before submitting a PR.
```sh
pre-commit run --all-files
```
## Trademarks ## Trademarks