HTML to Markdown Test Files for lightfeed-extract
This repository contains test files for validating the conversion of HTML into LLM-extractor-ready Markdown. It covers three conversion variants:
- Basic Conversion - Converting all HTML content to Markdown (without images)
- Main Content Extraction - Extracting and converting only the main content from HTML files (without images)
- Conversion with Images - Converting all HTML content to Markdown including images
├── html/ # Source HTML files
│ ├── forum/ # Forum HTML samples
│ │ ├── tech-0.html
│ │ └── ...
│ └── ...
│
└── groundtruth/ # Expected Markdown output files
  ├── forum/ # Expected forum conversion results
  │ ├── tech-0.md # Basic conversion expected output
  │ ├── tech-0.main.md # Main-content-only expected output
  │ ├── tech-0.images.md # Conversion with images expected output
  │ └── ...
  └── ...
Files follow a specific naming pattern to clearly indicate their purpose:
- html/[category]/[file-name].html - Original HTML source files
- groundtruth/[category]/[file-name].md - Expected output for basic HTML conversion
- groundtruth/[category]/[file-name].main.md - Expected output for main content extraction
- groundtruth/[category]/[file-name].images.md - Expected output for conversion with images
For example:
- html/forum/tech-0.html - Original forum HTML file
- groundtruth/forum/tech-0.md - Expected Markdown after basic conversion (no images)
- groundtruth/forum/tech-0.main.md - Expected Markdown when only extracting main content (no images)
- groundtruth/forum/tech-0.images.md - Expected Markdown with images included
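
The naming pattern can also be applied programmatically, for example when a test harness needs to find the expected outputs for a given source file. The following is a minimal sketch in Python, not part of this repository: the function name `groundtruth_paths` and the variant labels `basic`, `main`, and `images` are assumptions chosen for illustration, and only the path layout comes from the pattern above.

```python
from pathlib import PurePosixPath

# File suffixes for the three conversion variants.
# The dictionary keys are hypothetical labels used only in this sketch.
VARIANT_SUFFIXES = {
    "basic": ".md",
    "main": ".main.md",
    "images": ".images.md",
}


def groundtruth_paths(html_path: str) -> dict[str, str]:
    """Map html/[category]/[file-name].html to its expected-output paths."""
    path = PurePosixPath(html_path)
    parts = path.parts
    # Require at least html/<category>/<file>.html.
    if len(parts) < 3 or parts[0] != "html" or path.suffix != ".html":
        raise ValueError(
            f"not an html/[category]/[file-name].html path: {html_path}"
        )
    category = PurePosixPath(*parts[1:-1])  # everything between html/ and the file
    stem = path.stem  # file name without the .html extension
    return {
        variant: str(PurePosixPath("groundtruth") / category / f"{stem}{suffix}")
        for variant, suffix in VARIANT_SUFFIXES.items()
    }


if __name__ == "__main__":
    for variant, expected in groundtruth_paths("html/forum/tech-0.html").items():
        print(f"{variant}: {expected}")
```

Calling `groundtruth_paths("html/forum/tech-0.html")` yields the three `groundtruth/forum/tech-0*.md` paths listed in the example above. Paths that do not start with `html/` or do not end in `.html` raise a `ValueError` rather than being guessed at.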