SourceLibrary: A Vision for AI-Assisted Translation
Half a million Renaissance Latin texts await translation. We're building the tools to make that possible—not by replacing scholars, but by empowering them.
The Problem We're Solving
The Universal Short Title Catalogue records 533,000 Latin works printed between 1450 and 1700. Only about 3% have English translations. At the current rate of academic translation—perhaps a few dozen significant works per year—it would take millennia to translate even a fraction of what exists.
But translation isn't the only barrier. Most of these texts aren't even readable in any practical sense:
- ~18% have been digitized (scans exist)
- ~8% have searchable text (OCR or transcription)
- ~3% have any English translation at all
Even scholars who read Latin can only access a tiny slice of this heritage. The rest is locked away in rare book rooms, or buried in unreadable image scans.
Our Approach: Expert-Driven, AI-Assisted
We're not building a “push button, get translation” system. That would produce garbage at scale. Instead, we're building tools that put subject matter experts in control while handling the mechanical work that currently makes translation so slow.
The key insight: prompt refinement is where expertise lives. A scholar who understands Ficino's Neoplatonism can tune the translation prompts to use the correct philosophical terminology, to explain obscure references, to maintain consistency across 300 pages.
The Three-Stage Pipeline
Each page passes through three stages, each with customizable prompts:
Stage 1: OCR
Renaissance typography is challenging. Ligatures, abbreviations, the long ‘s’, mixed Latin and Greek—traditional OCR fails on most of it. Vision-language models (GPT-4o, Claude) can read these texts, but they need guidance.
Our OCR prompts tell the model:
- What typeface to expect (Roman, Gothic, mixed)
- How to handle abbreviations (expand? mark uncertain?)
- What to do with marginalia and annotations
- How to preserve structure (tables, headings, verse)
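As a rough illustration of how those four choices could be turned into prompt text, here is a minimal sketch. The OcrOptions class, the build_ocr_prompt function, and the exact wording of each instruction are hypothetical and are not taken from the project's actual templates.

```python
from dataclasses import dataclass


@dataclass
class OcrOptions:
    """Hypothetical per-book OCR settings chosen by the expert."""
    typeface: str = "Roman"            # "Roman", "Gothic", or "mixed"
    abbreviations: str = "expand"      # "expand" or "mark"
    marginalia: str = "transcribe separately"
    preserve_structure: bool = True


def build_ocr_prompt(opts: OcrOptions) -> str:
    """Assemble one instruction per setting into a single prompt string."""
    lines = [
        "Transcribe the Latin text on this page image.",
        f"Expected typeface: {opts.typeface}.",
    ]
    if opts.abbreviations == "expand":
        lines.append("Expand abbreviations and mark uncertain expansions with [?].")
    else:
        lines.append("Keep abbreviations as printed and mark uncertain readings with [?].")
    lines.append(f"Marginalia and annotations: {opts.marginalia}.")
    if opts.preserve_structure:
        lines.append("Preserve tables, headings, and verse line breaks.")
    return "\n".join(lines)


prompt = build_ocr_prompt(OcrOptions(typeface="Gothic", abbreviations="mark"))
print(prompt)
```

Keeping each decision as a named setting means an expert can change one answer (say, how abbreviations are handled) without rewriting the whole prompt.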
Stage 2: Translation
Translation is where domain expertise matters most. The default prompts produce readable English, but “readable” isn't “accurate.” An expert refines the prompts to specify:
- Terminology — anima mundi should be “world soul,” not “soul of the world”
- Audience — Graduate students? General readers? Specialists?
- Tone — Preserve archaic flavor, or modernize freely?
- Context — What references need explanation?
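One way to make a terminology preference enforceable is to keep it as data, generate the prompt instruction from it, and check the output against it. The sketch below assumes a hypothetical TERMINOLOGY table and helper functions. Only the anima mundi example comes from the text above.

```python
import re

# Hypothetical terminology sheet maintained by the domain expert.
TERMINOLOGY = {
    "anima mundi": {"preferred": "world soul", "avoid": ["soul of the world"]},
}


def terminology_instructions(terms: dict) -> str:
    """Render the sheet as plain-language lines for the translation prompt."""
    lines = []
    for latin, rule in terms.items():
        avoid = ", ".join(f'"{bad}"' for bad in rule["avoid"])
        lines.append(f'Translate {latin} as "{rule["preferred"]}", not {avoid}.')
    return "\n".join(lines)


def find_violations(translation: str, terms: dict) -> list:
    """Return (latin, rejected rendering, preferred rendering) for each miss."""
    violations = []
    for latin, rule in terms.items():
        for bad in rule["avoid"]:
            if re.search(re.escape(bad), translation, flags=re.IGNORECASE):
                violations.append((latin, bad, rule["preferred"]))
    return violations


print(terminology_instructions(TERMINOLOGY))
print(find_violations("The Soul of the World animates all things.", TERMINOLOGY))
```

The same sheet drives both the instruction and the check, so a refinement made by the expert is applied in one place.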
Stage 3: Extraction
Every page generates structured metadata: key terms (Latin→English), people mentioned, concepts introduced, connections to previous pages. This enables:
- Automatic glossary generation
- Indices of persons and concepts
- Section-by-section summaries
- Cross-references between pages
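To show how per-page metadata could roll up into a glossary and an index of persons, here is a minimal sketch. The PageMetadata fields and both helper functions are assumptions made for illustration, and the sample pages are invented. The project's real schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PageMetadata:
    """Hypothetical shape of the structured output for one page."""
    page: int
    terms: dict = field(default_factory=dict)    # Latin -> English
    people: list = field(default_factory=list)   # names mentioned on the page


def build_glossary(pages: list) -> dict:
    """Merge page-level terms; the earliest page's rendering wins."""
    glossary = {}
    for meta in sorted(pages, key=lambda m: m.page):
        for latin, english in meta.terms.items():
            glossary.setdefault(latin, english)
    return dict(sorted(glossary.items()))


def build_person_index(pages: list) -> dict:
    """Map each person to the sorted list of pages that mention them."""
    index = {}
    for meta in pages:
        for person in meta.people:
            index.setdefault(person, []).append(meta.page)
    return {name: sorted(set(nums)) for name, nums in sorted(index.items())}


pages = [
    PageMetadata(2, {"spiritus": "breath"}, ["Plotinus"]),
    PageMetadata(1, {"anima mundi": "world soul", "spiritus": "spirit"}, ["Plato", "Plotinus"]),
]
glossary = build_glossary(pages)
person_index = build_person_index(pages)
print(glossary)
print(person_index)
```

In this sketch a conflicting later rendering ("breath" on page 2) is silently ignored. A real tool would more likely flag it for the reviewer.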
Context Continuity
Books aren't isolated pages. Arguments span chapters, terminology must remain consistent, sentences break across pages. Our system maintains context:
- Each page receives the previous page's OCR and translation
- A running glossary tracks every translated term
- Continuations are flagged: “[[continues from previous page]]”
- Book-level metadata informs every page
This means page 300 knows what happened on page 1. The translation of spiritus stays consistent throughout. A sentence that starts on page 45 and ends on page 46 gets translated correctly.
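The four context sources listed above can be sketched as a small state object that is updated after every page. All names here (BookContext, PageResult, build_page_context, record_page), the book title, and the sample text are hypothetical. Only the “[[continues from previous page]]” marker is quoted from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageResult:
    ocr: str
    translation: str


@dataclass
class BookContext:
    """Hypothetical state carried from page to page."""
    title: str                                    # book-level metadata
    glossary: dict = field(default_factory=dict)  # running glossary
    previous: Optional[PageResult] = None         # previous page's output


def build_page_context(ctx: BookContext, continues: bool) -> str:
    """Build the context text that accompanies the next page's prompt."""
    parts = [f"Book: {ctx.title}"]
    if ctx.glossary:
        entries = "; ".join(f"{la} = {en}" for la, en in sorted(ctx.glossary.items()))
        parts.append(f"Glossary so far: {entries}")
    if ctx.previous is not None:
        parts.append(f"Previous page (Latin): {ctx.previous.ocr}")
        parts.append(f"Previous page (English): {ctx.previous.translation}")
    if continues:
        parts.append("[[continues from previous page]]")
    return "\n".join(parts)


def record_page(ctx: BookContext, result: PageResult, new_terms: dict) -> None:
    """Store the finished page; existing glossary entries are never overwritten."""
    for latin, english in new_terms.items():
        ctx.glossary.setdefault(latin, english)
    ctx.previous = result


ctx = BookContext(title="Example book")
first_context = build_page_context(ctx, continues=False)
record_page(ctx, PageResult("spiritus est", "spirit is"), {"spiritus": "spirit"})
record_page(ctx, PageResult("vinculum animae", "the bond of the soul"), {"spiritus": "breath"})
next_context = build_page_context(ctx, continues=True)
print(next_context)
```

Because the glossary accumulates across the whole book while only one previous page is carried forward, the context stays small even on page 300.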
Why This Matters
Consider what becomes possible:
What We've Built
The system currently includes:
- Web interface — Browse the Internet Archive's Latin collection, upload PDFs, manage translation projects
- Claude Code commands — CLI tools for power users: /latin-book-preview, /latin-book-run, /latin-book-compile
- Customizable prompts — Templates for OCR, translation, and extraction, with guidance on refinement
- Structured output — Markdown for readability, JSON for processing, export to PDF/EPUB
What's Next
We're actively developing:
- Background processing — Queue large jobs that run while you sleep
- Collaborative review — Multiple experts contributing to a single project
- Quality metrics — Automated confidence scoring and human validation
- Scholarly apparatus — Footnotes, critical apparatus, variant readings
Get Involved
This is an open project. The code is on GitHub. The data is freely available. We need:
- Domain experts — Scholars who can test and refine prompts for their fields
- Latinists — To validate translations and catch errors
- Developers — To improve the pipeline and interface
- Early users — To try translating texts and report what works
Half a million books are waiting. The technology exists. The question is whether we can organize the expertise to use it responsibly.