SourceLibrary: A Vision for AI-Assisted Translation

Half a million Renaissance Latin texts await translation. We're building the tools to make that possible—not by replacing scholars, but by empowering them.

The Problem We're Solving

The Universal Short Title Catalogue records 533,000 Latin works printed between 1450 and 1700. Only about 3% have English translations. At the current rate of academic translation—perhaps a few dozen significant works per year—it would take millennia to translate even a fraction of what exists.

But translation isn't the only barrier. Most of these texts aren't even readable in any practical sense:

~18% have been digitized (scans exist)
~8% have searchable text (OCR or transcription)
~3% have any English translation at all

Even scholars who read Latin can only access a tiny slice of this heritage. The rest is locked away in rare book rooms, or buried in unreadable image scans.
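To make the scale concrete, here is the arithmetic behind those figures as a short Python sketch. The catalogue total and the percentages come from the text above. The rate of 36 translations per year is our assumption, standing in for "a few dozen", and is not a measured figure.

```python
# Figures from the text above. TRANSLATIONS_PER_YEAR is an assumed
# stand-in for "a few dozen" and is not a measured rate.
TOTAL_WORKS = 533_000
TRANSLATIONS_PER_YEAR = 36


def share(percent: float) -> int:
    """Approximate number of works for a given percentage of the catalogue."""
    return round(TOTAL_WORKS * percent / 100)


digitized = share(18)
searchable = share(8)
translated = share(3)
untranslated = TOTAL_WORKS - translated

# Years needed to translate just one tenth of the untranslated works.
years_for_tenth = untranslated / 10 / TRANSLATIONS_PER_YEAR

print(f"digitized:    ~{digitized:,}")
print(f"searchable:   ~{searchable:,}")
print(f"translated:   ~{translated:,}")
print(f"untranslated: ~{untranslated:,}")
print(f"years to translate 10% of the backlog: ~{years_for_tenth:,.0f}")
```

Under that assumed rate, even a tenth of the backlog takes well over a thousand years.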

Our Approach: Expert-Driven, AI-Assisted

We're not building a “push button, get translation” system. That would produce garbage at scale. Instead, we're building tools that put subject matter experts in control while handling the mechanical work that currently makes translation so slow.

The Expert-Driven Workflow

Preview — Generate a 15-page sample with default prompts

Review — Expert evaluates OCR accuracy and translation quality

Refine — Adjust prompts for this specific text: terminology, audience, tone

Produce — Process the full book with expert-refined prompts

Compile — Generate glossary, indices, and summaries

Publish — Export for distribution, with proper scholarly apparatus

The key insight: prompt refinement is where expertise lives. A scholar who understands Ficino's Neoplatonism can tune the translation prompts to use the correct philosophical terminology, to explain obscure references, to maintain consistency across 300 pages.
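A minimal sketch of the preview-and-refine loop, assuming prompts are plain strings grouped per stage. The names PromptSet, preview_pages, and refine are hypothetical and are not taken from the project's code, and choosing the first 15 pages for the sample is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromptSet:
    """One prompt per pipeline stage (hypothetical structure)."""
    ocr: str
    translation: str
    extraction: str


# Placeholder default prompts, written for this illustration only.
DEFAULT_PROMPTS = PromptSet(
    ocr="Transcribe the page exactly as printed.",
    translation="Translate the Latin into clear modern English.",
    extraction="List key terms, people, and concepts on this page.",
)


def preview_pages(total_pages: int, sample_size: int = 15) -> list[int]:
    """Page numbers for the preview sample (assumed: the first pages)."""
    return list(range(1, min(total_pages, sample_size) + 1))


def refine(prompts: PromptSet, **notes: str) -> PromptSet:
    """Append expert notes to the named stage prompts; other stages are unchanged."""
    updates = {
        stage: f"{getattr(prompts, stage)}\n{note}" for stage, note in notes.items()
    }
    return replace(prompts, **updates)


# An expert reviews the sample, then adds a terminology rule for this book.
refined = refine(
    DEFAULT_PROMPTS,
    translation='Render "anima mundi" as "world soul".',
)
print(preview_pages(300))
print(refined.translation)
```

Keeping the defaults immutable and returning a new PromptSet means each book can carry its own refinements without changing the shared templates.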

The Three-Stage Pipeline

Each page passes through three stages, each with customizable prompts:

Stage 1: OCR

Renaissance typography is challenging. Ligatures, abbreviations, the long ‘s’, mixed Latin and Greek—traditional OCR fails on most of it. Vision-language models (GPT-4o, Claude) can read these texts, but they need guidance.

Our OCR prompts tell the model:

What typeface to expect (Roman, Gothic, mixed)
How to handle abbreviations (expand? mark uncertain?)
What to do with marginalia and annotations
How to preserve structure (tables, headings, verse)
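As an illustration, the four kinds of guidance above could be captured as settings and rendered into a prompt. This is a sketch under our own assumptions: OcrSettings, build_ocr_prompt, and the prompt wording are hypothetical, not the project's actual templates.

```python
from dataclasses import dataclass


@dataclass
class OcrSettings:
    """Hypothetical per-book OCR guidance."""
    typeface: str = "Roman"               # "Roman", "Gothic", or "mixed"
    expand_abbreviations: bool = True     # False: transcribe them as printed
    mark_uncertain: bool = True           # flag doubtful readings
    marginalia: str = "transcribe separately after the main text"
    preserve_structure: bool = True       # tables, headings, verse


def build_ocr_prompt(s: OcrSettings) -> str:
    """Render the settings as plain-language instructions for the model."""
    lines = [
        "Transcribe this page from a Renaissance Latin book.",
        f"Expected typeface: {s.typeface}.",
        "Expand abbreviations." if s.expand_abbreviations
        else "Keep abbreviations exactly as printed.",
    ]
    if s.mark_uncertain:
        lines.append("Mark uncertain readings as [?word].")
    lines.append(f"Marginalia and annotations: {s.marginalia}.")
    if s.preserve_structure:
        lines.append("Preserve tables, headings, and verse line breaks.")
    return "\n".join(lines)


prompt = build_ocr_prompt(OcrSettings(typeface="Gothic", expand_abbreviations=False))
print(prompt)
```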

Stage 2: Translation

Translation is where domain expertise matters most. The default prompts produce readable English, but “readable” isn't “accurate.” An expert refines the prompts to specify:

Terminology — anima mundi should be “world soul,” not “soul of the world”
Audience — Graduate students? General readers? Specialists?
Tone — Preserve archaic flavor, or modernize freely?
Context — What references need explanation?
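One way to picture this is a small "translation brief" that is rendered into the prompt and also used to check the output during review. The sketch below is illustrative only: TranslationBrief, build_translation_prompt, and rejected_renderings are hypothetical names, and the substring check is a deliberately simple stand-in for real review.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationBrief:
    """Hypothetical expert instructions for one book."""
    audience: str = "graduate students"
    tone: str = "modern, readable English"
    terminology: dict[str, str] = field(default_factory=dict)  # Latin -> preferred English
    rejected: list[str] = field(default_factory=list)          # renderings to avoid
    explain: list[str] = field(default_factory=list)           # references needing a note


def build_translation_prompt(brief: TranslationBrief) -> str:
    """Render the brief as plain-language instructions for the model."""
    lines = [
        "Translate the following Latin into English.",
        f"Audience: {brief.audience}.",
        f"Tone: {brief.tone}.",
    ]
    for latin, english in brief.terminology.items():
        lines.append(f'Always translate "{latin}" as "{english}".')
    for topic in brief.explain:
        lines.append(f"Add a brief note explaining: {topic}.")
    return "\n".join(lines)


def rejected_renderings(text: str, brief: TranslationBrief) -> list[str]:
    """Return any rejected phrases that still appear in a translation."""
    lowered = text.lower()
    return [bad for bad in brief.rejected if bad.lower() in lowered]


brief = TranslationBrief(
    terminology={"anima mundi": "world soul"},
    rejected=["soul of the world"],
)
print(build_translation_prompt(brief))
print(rejected_renderings("The Soul of the World moves all things.", brief))
```

The check gives a reviewer a quick list of pages where the model ignored a terminology rule. It does not replace reading the translation.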

Stage 3: Extraction

Every page generates structured metadata: key terms (Latin→English), people mentioned, concepts introduced, connections to previous pages. This enables:

Automatic glossary generation
Indices of persons and concepts
Section-by-section summaries
Cross-references between pages
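A minimal sketch of how per-page metadata might roll up into a glossary and an index. The PageMetadata fields mirror the categories named above, but the structure, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageMetadata:
    """Hypothetical structured output of the extraction stage for one page."""
    page: int
    terms: dict[str, str] = field(default_factory=dict)  # Latin -> English
    people: list[str] = field(default_factory=list)
    concepts: list[str] = field(default_factory=list)


def build_glossary(pages: list[PageMetadata]) -> dict[str, set[str]]:
    """Map each Latin term to every English rendering seen in the book."""
    glossary = defaultdict(set)
    for meta in pages:
        for latin, english in meta.terms.items():
            glossary[latin].add(english)
    return dict(glossary)


def build_index(pages: list[PageMetadata], attribute: str) -> dict[str, list[int]]:
    """Map each name in `attribute` ("people" or "concepts") to its page numbers."""
    index = defaultdict(list)
    for meta in pages:
        for name in getattr(meta, attribute):
            index[name].append(meta.page)
    return dict(sorted(index.items()))


# Invented sample data, not taken from a real run.
pages = [
    PageMetadata(1, {"spiritus": "spirit"}, ["Plato"]),
    PageMetadata(2, {"spiritus": "breath", "anima mundi": "world soul"}, ["Plato", "Plotinus"]),
]

glossary = build_glossary(pages)
people = build_index(pages, "people")
# Terms with more than one rendering are candidates for expert review.
inconsistent = {latin: en for latin, en in glossary.items() if len(en) > 1}

print(people)
print(inconsistent)
```

Collecting every rendering, rather than only the first, makes inconsistent terminology visible to the reviewer instead of hiding it.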

Context Continuity

Books aren't isolated pages. Arguments span chapters, terminology must remain consistent, sentences break across pages. Our system maintains context:

Each page receives the previous page's OCR and translation
A running glossary tracks every translated term
Continuations are flagged: “[[continues from previous page]]”
Book-level metadata informs every page

This means page 300 still has access to decisions made on page 1, through the running glossary and book-level metadata. The translation of spiritus stays consistent throughout. A sentence that starts on page 45 and ends on page 46 gets translated correctly.
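The context mechanism can be sketched as a small state object that travels from page to page. Everything here is an assumption for illustration: the BookContext structure, the rule that the first rendering of a term wins, and the punctuation heuristic for detecting a sentence that breaks across pages.

```python
from dataclasses import dataclass, field

CONTINUATION_MARK = "[[continues from previous page]]"


@dataclass
class BookContext:
    """Hypothetical state carried from one page to the next."""
    title: str
    glossary: dict[str, str] = field(default_factory=dict)  # Latin -> English
    prev_ocr: str = ""
    prev_translation: str = ""


def ends_mid_sentence(text: str) -> bool:
    """Heuristic (assumed): no terminal punctuation means the sentence continues."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in ".?!"


def build_page_context(ctx: BookContext) -> str:
    """Assemble the context text sent along with the next page."""
    parts = [f"Book: {ctx.title}"]
    if ctx.glossary:
        terms = "; ".join(f"{la} = {en}" for la, en in sorted(ctx.glossary.items()))
        parts.append(f"Glossary so far: {terms}")
    if ctx.prev_ocr:
        parts.append(f"Previous page (Latin): {ctx.prev_ocr}")
        parts.append(f"Previous page (English): {ctx.prev_translation}")
        if ends_mid_sentence(ctx.prev_ocr):
            parts.append(CONTINUATION_MARK)
    return "\n".join(parts)


def advance(ctx: BookContext, ocr: str, translation: str, new_terms: dict[str, str]) -> None:
    """Update the context after a page; the first rendering of a term is kept."""
    for latin, english in new_terms.items():
        ctx.glossary.setdefault(latin, english)
    ctx.prev_ocr, ctx.prev_translation = ocr, translation


ctx = BookContext(title="Sample book")
advance(ctx, "Spiritus est vinculum animae et", "Spirit is the bond of the soul and",
        {"spiritus": "spirit"})
context = build_page_context(ctx)
print(context)
```

Because the glossary keeps the first rendering of each term, a later page that proposes a different translation of spiritus does not silently override the earlier choice.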

Why This Matters

Consider what becomes possible:

A scholar of Renaissance medicine could translate 50 medical treatises in a year—with careful review—instead of spending a career on one.

A graduate student could access primary sources that were previously impossible to read, even with Latin training.

The general public could finally explore the intellectual heritage of the Renaissance, not just the famous names.

The Internet Archive's 200,000+ Latin scans could become searchable, readable, accessible.

What We've Built

The system currently includes:

Web interface — Browse the Internet Archive's Latin collection, upload PDFs, manage translation projects
Claude Code commands — CLI tools for power users: /latin-book-preview, /latin-book-run, /latin-book-compile
Customizable prompts — Templates for OCR, translation, and extraction, with guidance on refinement
Structured output — Markdown for readability, JSON for processing, export to PDF/EPUB

What's Next

We're actively developing:

Background processing — Queue large jobs that run while you sleep
Collaborative review — Multiple experts contributing to a single project
Quality metrics — Automated confidence scoring and human validation
Scholarly apparatus — Footnotes, critical apparatus, variant readings

Get Involved

This is an open project. The code is on GitHub. The data is freely available. We need:

Domain experts — Scholars who can test and refine prompts for their fields
Latinists — To validate translations and catch errors
Developers — To improve the pipeline and interface
Early users — To try translating texts and report what works

Half a million books are waiting. The technology exists. The question is whether we can organize the expertise to use it responsibly.

Try the translation dashboard →

View the source on GitHub →