---
name: doc
displayName: Legacy Word (.doc) Reader & Converter
description: Read and convert legacy OLE/CFB binary Word .doc files to plain
  text, JSON, or CSV. Use when you need to open or extract content from pre-2007
  Word documents.
tags:
  - document
  - word
  - doc
  - ole
  - cfb
  - convert
  - extract
  - legacy
capabilities:
  - ReadDoc
  - ConvertToText
  - ConvertToJson
  - ConvertToCsv
  - ExtractMetadata
representativeQueries:
  - Read a .doc file and show me the text
  - Convert a legacy Word doc to plain text
  - Extract content from a .doc file to JSON or CSV
  - Open an old Word document and get the text out
  - Convert a pre-2007 .doc file to a readable format
version: 0.1.0
tier: curated
---

# Legacy Word (.doc) Reader & Converter

Reads and converts legacy Microsoft Word binary files (`.doc`, OLE/CFB format — pre-Office 2007) to plain text, JSON, or CSV. These files use the Compound File Binary format and cannot be opened with standard XML/ZIP tools or `python-docx`.

## When to use

- You have a `.doc` file (not `.docx`) and need its text content.
- You need to batch-extract or convert old Word documents programmatically.
- A user asks to "open", "read", or "convert" a `.doc` file.
- You want to export document paragraphs as structured JSON or CSV rows.

## Steps

1. **Detect format.** Confirm the file is `.doc` (OLE/CFB binary), not `.docx` (OOXML ZIP). A CFB file starts with the magic bytes `D0 CF 11 E0` at offset 0, whereas a `.docx` starts with the ZIP signature `PK`. If the file is actually a `.docx`, redirect to the docx skill.
2. **Choose extraction backend.** The bundled script `scripts/doc_converter.py` tries `textract` first, then falls back to `antiword` CLI. If neither is available, it prints an install hint and exits non-zero.
3. **Run the converter.** Pass the file path and desired output format (`text`, `json`, or `csv`) as arguments.
4. **Handle errors.** Password-protected files, corrupted OLE structures, and missing libraries each exit with status 1 and print a distinct human-readable message to stderr.
5. **Inspect output.** Text is printed to stdout; redirect or pipe as needed.
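
Step 1 can be illustrated with a short signature check. This is a minimal sketch, not the bundled script's code; `sniff_container` is a hypothetical helper name. It compares the leading bytes against the full 8-byte CFB signature (of which `D0 CF 11 E0` is the prefix) and the ZIP local-file header used by `.docx`.

```python
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # full 8-byte Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header, used by .docx (OOXML)


def sniff_container(path):
    """Return 'cfb', 'zip', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(CFB_MAGIC):
        return "cfb"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

The CFB signature is necessary but not sufficient: legacy `.xls` and `.ppt` files use the same container, so a `cfb` result only rules out `.docx`.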

## Operations

| Capability | CRUD | Resource | Tool |
|---|---|---|---|
| `ReadDoc` | READ | .doc file text | `scripts/doc_converter.py` |
| `ConvertToText` | READ | plain text output | `scripts/doc_converter.py --format text` |
| `ConvertToJson` | READ | JSON object (paragraphs + metadata) | `scripts/doc_converter.py --format json` |
| `ConvertToCsv` | READ | CSV rows of paragraphs | `scripts/doc_converter.py --format csv` |
| `ExtractMetadata` | READ | author/title/dates | `scripts/doc_converter.py --format json` |
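
Every row above goes through the same script, so a caller only varies the `--format` value. The wrapper below is a minimal sketch under two assumptions that this document does not confirm: the file path is a positional argument placed before `--format`, and the script runs under the current Python interpreter. `build_command` and `run_converter` are hypothetical helper names.

```python
import subprocess
import sys

SCRIPT = "scripts/doc_converter.py"
FORMATS = ("text", "json", "csv")


def build_command(doc_path, fmt="text"):
    """Assemble the argv list for the converter (argument order is assumed)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return [sys.executable, SCRIPT, str(doc_path), "--format", fmt]


def run_converter(doc_path, fmt="text"):
    """Run the converter and return stdout; raise with stderr on a non-zero exit."""
    proc = subprocess.run(build_command(doc_path, fmt), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or f"converter exited with {proc.returncode}")
    return proc.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.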

## Output

- **text**: raw paragraph text, one paragraph per line, written to stdout.
- **json**: a JSON object `{"paragraphs": [...], "metadata": {...}}` with one string per paragraph and any available OLE summary metadata.
- **csv**: `paragraph_index,text` rows (UTF-8, comma-separated, LF line endings) written to stdout.
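
The three shapes can be sketched with the standard library alone. This is an illustration of the formats described above, not the script's actual serializer; `render` is a hypothetical name, and the CSV header row and 0-based `paragraph_index` are assumptions.

```python
import csv
import io
import json


def render(paragraphs, metadata, fmt):
    """Serialize a list of paragraph strings in one of the three output shapes."""
    if fmt == "text":
        return "\n".join(paragraphs) + "\n"
    if fmt == "json":
        return json.dumps({"paragraphs": paragraphs, "metadata": metadata}, ensure_ascii=False)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")  # LF endings instead of csv's default CRLF
        writer.writerow(["paragraph_index", "text"])    # header row is an assumption
        for i, para in enumerate(paragraphs):           # 0-based index is an assumption
            writer.writerow([i, para])
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")
```

Using `csv.writer` rather than joining strings by hand means paragraphs that contain commas or quotes are quoted correctly.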

## Notes

- `.doc` files are the OLE/CFB binary format (magic bytes `D0 CF 11 E0`). Do NOT use `python-docx` — it only handles `.docx`.
- `textract` is the recommended library: `pip install textract`. For `.doc` input, textract shells out to the antiword CLI, so install antiword as well (`brew install antiword` or `apt-get install antiword`).
- LibreOffice headless is an alternative for bulk conversion: `soffice --headless --convert-to txt *.doc`.
- Embedded macros, revision marks, and OLE objects may not survive extraction. This is expected behavior.
- See `references/doc-format-notes.md` for details on OLE/CFB gotchas and encoding edge cases.
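
The textract-then-antiword order used by the converter can be sketched as a chain of backends tried in turn. This is a rough outline under stated assumptions, not the contents of `scripts/doc_converter.py`: the function names are hypothetical, and decoding the output as UTF-8 with replacement is a simplification.

```python
import shutil
import subprocess


def via_textract(path):
    """Extract text with the textract library (raises ImportError if it is missing)."""
    import textract  # imported lazily so a missing package only disables this backend
    return textract.process(path).decode("utf-8", errors="replace")


def via_antiword(path):
    """Extract text by calling the antiword CLI directly."""
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword not found on PATH")
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def extract_text(path, backends=(via_textract, via_antiword)):
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(path)
        except Exception as exc:  # broad on purpose: any failure moves on to the next backend
            errors.append(f"{backend.__name__}: {exc}")
    raise RuntimeError("no backend could read the file; " + "; ".join(errors))
```

Collecting each backend's error message keeps the final failure informative, which helps tell a missing dependency apart from a corrupted or password-protected file.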

<!-- runner-fallback -->
## Remote runner (MCP)
Can't run this locally (no setup, missing dependency)? The StealthStack runner exposes the **same code** as server-side MCP tools — no local install needed: `doc_to_text`, `doc_to_json`, `doc_to_csv`. Call the `application/mcp` catalog twin of this skill (its `runnerTwin`).
