Teaching AI to Read 630,000 Renaissance Book Titles

The Universal Short Title Catalogue contains 1.6 million records of books printed between 1450 and 1700. But the metadata is sparse: just a title, author, year, and place. We used large language models to enrich 630,000 of these records with English translations, subject classifications, and religious affiliations—making the intellectual landscape of the Renaissance finally navigable.

The Problem: Titles Without Context

Here's a typical USTC record:

Title: Disputatio de potestate papae in rebus temporalibus
Author: Bellarmine, Roberto
Year: 1610
Place: Rome
Language: Latin

If you read Latin, you know this is a treatise on papal authority over temporal matters—a key text in the Counter-Reformation debate over church and state. But the USTC doesn't tell you that. It doesn't tell you it's Catholic, that it's responding to Protestant arguments, or that it belongs to the genre of political theology.

Multiply this by 1.6 million records. Scholars who want to study, say, Protestant responses to Catholic natural philosophy in the 1590s have no way to filter the catalogue. There's no subject search, no religious affiliation field, no way to find commentaries on Aristotle versus original treatises.

The Solution: LLM Enrichment at Scale

We ran every title in the intellectually substantive categories through Claude Haiku and Gemini Flash in parallel, asking each model to extract structured metadata:

Field	Description
english_title	Translation of the title to English
detected_language	Actual language (Latin, French, German, etc.)
work_type	original, commentary, translation, edition, sermon, treatise...
original_author	For commentaries: who is being commented on (Aristotle, Galen...)
subject_tags	1-3 specific tags (astronomy, medicine, theology...)
religious_tradition	Catholic, Protestant, Lutheran, Calvinist, secular...
classical_source	If based on a Greek/Roman work

The same Bellarmine record now becomes:

{
  "english_title": "Disputation on the Power of the Pope in Temporal Affairs",
  "detected_language": "Latin",
  "work_type": "treatise",
  "original_author": null,
  "subject_tags": ["political theology", "papal authority", "church-state relations"],
  "religious_tradition": "Catholic",
  "classical_source": null
}
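The shape of that record can be pinned down in code. What follows is a minimal sketch, not the project's actual code: the `Enrichment` class and the `parse_enrichment` helper are hypothetical names, and the checks (exact key set, one to three tags) simply mirror the field table above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


# Field names follow the table above; the class itself is illustrative.
@dataclass
class Enrichment:
    english_title: str
    detected_language: str
    work_type: str
    original_author: Optional[str]
    subject_tags: List[str]
    religious_tradition: Optional[str]
    classical_source: Optional[str]


def parse_enrichment(raw: str) -> Enrichment:
    """Parse one model reply (a JSON object) and apply basic sanity checks."""
    data = json.loads(raw)
    record = Enrichment(**data)  # TypeError on missing or unexpected keys
    if not 1 <= len(record.subject_tags) <= 3:
        raise ValueError("expected 1-3 subject tags")
    return record


reply = """{
  "english_title": "Disputation on the Power of the Pope in Temporal Affairs",
  "detected_language": "Latin",
  "work_type": "treatise",
  "original_author": null,
  "subject_tags": ["political theology", "papal authority", "church-state relations"],
  "religious_tradition": "Catholic",
  "classical_source": null
}"""
bellarmine = parse_enrichment(reply)
print(bellarmine.work_type, bellarmine.subject_tags)
```

Rejecting replies with missing keys or too many tags at parse time is one way to keep malformed model output out of the dataset; whether the real pipeline validates this strictly is not stated here.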

Now you can search. You can filter. You can ask questions like: “Show me all Protestant treatises on natural philosophy published in Germany between 1550 and 1600.” The catalogue becomes a research tool instead of just an inventory.
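As a rough illustration, that question reduces to a handful of field comparisons once each enrichment is joined back to its USTC record. The sketch below assumes each row is a plain dictionary; the `country` and `year` keys are hypothetical (the USTC record shown earlier gives a city, not a country), and the two sample rows are made up purely to exercise the filter.

```python
from typing import Dict, Iterable, List


def protestant_natural_philosophy(records: Iterable[Dict]) -> List[Dict]:
    """Protestant treatises tagged 'natural philosophy', Germany, 1550-1600."""
    return [
        r for r in records
        if r["religious_tradition"] == "Protestant"
        and r["work_type"] == "treatise"
        and "natural philosophy" in r["subject_tags"]
        and r["country"] == "Germany"   # hypothetical field, see lead-in
        and 1550 <= r["year"] <= 1600
    ]


# Made-up rows, for illustration only.
sample = [
    {"title": "made-up example A", "year": 1572, "country": "Germany",
     "work_type": "treatise", "religious_tradition": "Protestant",
     "subject_tags": ["natural philosophy"]},
    {"title": "made-up example B", "year": 1610, "country": "Italy",
     "work_type": "treatise", "religious_tradition": "Catholic",
     "subject_tags": ["political theology"]},
]
hits = protestant_natural_philosophy(sample)
print([r["title"] for r in hits])
```

Because the tradition field distinguishes Protestant, Lutheran, and Calvinist, a broader query would probably need to accept several values rather than the single string used here.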

Why 630,000 Out of 1.6 Million?

The full USTC contains 1,628,578 records. We enriched 630,862 of them—about 39%. This wasn't random sampling; we deliberately filtered to focus on intellectually significant works while excluding material that wouldn't benefit from LLM enrichment.

What We Included

We selected USTC categories representing substantive intellectual content:

Category	Records	Why Include
Religious	355,980	Theology, devotional works, church debates
Jurisprudence	92,904	Legal commentaries, case law, political theory
History & Chronicles	34,988	Historical scholarship, primary sources
Medical Texts	28,099	Scientific and practical medicine
Educational Books	25,261	Textbooks, pedagogical works
Classical Authors	23,867	Editions and translations of ancient texts
Literature	31,791	Poetry, drama, prose fiction
Philosophy & Morality	15,029	Philosophical treatises, ethics
Science	7,632	Natural philosophy, astronomy, mathematics

What We Excluded

Nearly a million records were excluded for specific reasons:

Category	Records	Why Exclude
English language	~164,000	No translation needed
Newspapers	213,294	Ephemeral; titles don't convey content
University Publications	153,327	Mostly administrative (theses lists, calendars)
Ordinances & Edicts	134,340	Government documents; formulaic titles
News Books	57,147	Ephemeral current events
Funeral Orations	55,029	Formulaic; "Funeral oration for X"
Wedding Pamphlets	16,616	Formulaic occasional literature
Almanacs	18,011	Ephemeral calendrical material

The excluded categories share common traits: either the titles are formulaic (“Funeral oration for Johann Schmidt, 1623”), the content is ephemeral (newspapers, almanacs), or the material is administrative rather than intellectual (university calendars, government edicts). LLM enrichment adds little value here—you can't infer subject tags or religious affiliation from “Prussian News, Issue 47.”

English-language works (~164,000) were excluded because our primary goal was translation—a book already in English doesn't need its title translated. Otherwise, we enriched all languages: Latin (40%), French (16%), German (14%), Spanish (11%), Italian (9%), Dutch (5%), plus Greek, Portuguese, Hebrew, Polish, and others.
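The selection rule described in this section amounts to a two-part test per record. Here is a minimal sketch, assuming each USTC record exposes `category` and `language` keys (hypothetical names; the category labels are copied from the inclusion table above):

```python
INCLUDED_CATEGORIES = {
    "Religious", "Jurisprudence", "History & Chronicles", "Medical Texts",
    "Educational Books", "Classical Authors", "Literature",
    "Philosophy & Morality", "Science",
}


def should_enrich(record: dict) -> bool:
    """Keep records in an included category that are not in English."""
    return (
        record["category"] in INCLUDED_CATEGORIES
        and record["language"] != "English"
    )


# Share of the catalogue that passed the filter, from the totals quoted above.
TOTAL_RECORDS = 1_628_578
ENRICHED_RECORDS = 630_862
share = ENRICHED_RECORDS / TOTAL_RECORDS
print(f"{share:.1%} enriched, {TOTAL_RECORDS - ENRICHED_RECORDS:,} excluded")
```

The arithmetic confirms the figures in the text: the enriched share rounds to 38.7%, and the excluded remainder is 997,716 records, which is the "nearly a million" mentioned earlier.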

The Results

630,862 records enriched
249,120 Latin titles
240,327 Catholic works
33,571 commentaries

Explore the Data

The visualization below shows the complete enriched dataset. Hover over elements to see details; click the legend to filter by language.

What We Found

Latin Dominance, Then Decline

Latin represents 40% of all enriched titles (249,120 works). But the timeline shows its gradual displacement by vernacular languages. French, German, and Spanish all grow steadily from the mid-1500s, while Latin's share shrinks. By 1700, Latin is no longer the majority language of intellectual publishing in most of Europe.
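A timeline like this comes down to counting titles per language within each decade and dividing by the decade's total. The sketch below assumes rows with `year` and `detected_language` keys (the join to the USTC year is an assumption), and the four toy rows are invented to show the mechanics, not real counts.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def language_share_by_decade(records: Iterable[dict]) -> Dict[int, Dict[str, float]]:
    """Map each decade (e.g. 1590) to the fraction of titles per language."""
    counts = defaultdict(Counter)
    for r in records:
        decade = r["year"] // 10 * 10
        counts[decade][r["detected_language"]] += 1
    shares = {}
    for decade, per_language in sorted(counts.items()):
        total = sum(per_language.values())
        shares[decade] = {lang: n / total for lang, n in per_language.items()}
    return shares


# Toy rows, invented for illustration.
toy_rows = [
    {"year": 1503, "detected_language": "Latin"},
    {"year": 1508, "detected_language": "Latin"},
    {"year": 1691, "detected_language": "Latin"},
    {"year": 1695, "detected_language": "French"},
]
decade_shares = language_share_by_decade(toy_rows)
print(decade_shares)
```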

The Catholic-Secular Split

38% of works are explicitly Catholic; 34% are secular (no religious affiliation). Protestant traditions combined (Protestant, Lutheran, Calvinist, Anglican) account for about 15%. This reflects both the geography of printing (Italy, France, and Spain were Catholic) and the nature of what got printed (much scholarly work was religiously neutral).

Commentary Culture

33,571 works are classified as commentaries—works that explain or expand upon another author's text. The most commented-upon authors: Aristotle (philosophy, natural science), Galen (medicine), Justinian (law), and Augustine (theology). This commentary culture is largely invisible without enrichment; titles like “In libros Physicorum” don't tell you it's an Aristotle commentary unless you already know.
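Once `work_type` and `original_author` exist as fields, ranking the most commented-upon authors is a short counting exercise. This is a minimal sketch with a hypothetical helper name and invented toy rows; it does not reproduce the real counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_commented_authors(records: Iterable[dict], n: int = 4) -> List[Tuple[str, int]]:
    """Count commentaries per original author and return the n most frequent."""
    counts = Counter(
        r["original_author"]
        for r in records
        if r["work_type"] == "commentary" and r["original_author"]
    )
    return counts.most_common(n)


# Toy rows, invented for illustration.
toy_commentaries = [
    {"work_type": "commentary", "original_author": "Aristotle"},
    {"work_type": "commentary", "original_author": "Aristotle"},
    {"work_type": "commentary", "original_author": "Galen"},
    {"work_type": "treatise", "original_author": None},
]
ranking = top_commented_authors(toy_commentaries)
print(ranking)
```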

3D Semantic Maps

We also embedded book titles using sentence transformers and visualized them in 3D space using UMAP dimensionality reduction. But we added a twist: time as the vertical axis.

In these visualizations, each point is a book. Books with similar titles cluster together horizontally (semantic similarity), but they're stacked vertically by publication year. This means you can literally see the lifespan of ideas:

Vertical columns represent works that were reprinted over decades—a column of points rising through time shows a text that stayed in print for 50, 100, or 200 years.
Isolated points represent works that were printed once and never again—ideas that didn't catch on, or specialized texts with limited audiences.
Semantic drift becomes visible when clusters shift horizontally over time—the same subject matter gets discussed differently as intellectual frameworks evolve.

Aristotelian natural philosophy, for example, forms a dense vertical column from 1450 to 1650, the product of constant reprinting of commentaries on the Physics and De Anima. But around 1650, the column thins dramatically as Cartesian and Newtonian frameworks displace scholastic natural philosophy.
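The "time as vertical axis" idea can be sketched without any plotting library. The code below takes one reading of the description: two horizontal coordinates per title are assumed to have been computed already (for instance by UMAP over sentence embeddings), and the publication year becomes the third coordinate. The grid-cell bucketing used to measure a column's lifespan is an illustrative simplification, not the method behind the visualization, and the coordinates are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, int]


def add_time_axis(coords: List[Point2D], years: List[int]) -> List[Point3D]:
    """Stack 2D semantic coordinates vertically by publication year."""
    return [(x, y, year) for (x, y), year in zip(coords, years)]


def column_lifespans(points: List[Point3D], cell: float = 1.0) -> Dict[Tuple[int, int], int]:
    """Bucket points into horizontal grid cells; lifespan = last year - first year."""
    years_by_cell = defaultdict(list)
    for x, y, year in points:
        years_by_cell[(int(x // cell), int(y // cell))].append(year)
    return {c: max(ys) - min(ys) for c, ys in years_by_cell.items()}


# Made-up coordinates: three near-identical titles and one isolated one.
coords = [(0.2, 0.3), (0.4, 0.1), (0.3, 0.2), (5.5, 7.1)]
years = [1480, 1560, 1640, 1612]
points = add_time_axis(coords, years)
lifespans = column_lifespans(points)
print(lifespans)
```

In this toy example the first cell spans 160 years (a tall column, like a long-reprinted text) while the isolated point has a lifespan of zero.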

3D semantic map with time as vertical axis. Rotate to explore; hover for details. Dense vertical columns indicate works reprinted across decades.

Technical Details

The enrichment pipeline ran in parallel across Claude Haiku and Gemini Flash:

# 4 Claude workers + 4 Gemini workers, processing 10k titles per batch
PYTHONUNBUFFERED=1 python scripts/enrich_safe.py 10000 4 4

# Each batch takes ~8-10 minutes
# Saves incrementally after every 10 records (a crash loses only the unsaved few)
# Syncs to Supabase every 100 records

The system processed approximately 10,000 titles per batch, with each batch taking 8-10 minutes. Incremental saves after every 10 records meant that even if the process crashed (which it did, multiple times), we never lost more than a few seconds of work.
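The save-every-10-records behaviour can be approximated with a small checkpoint file. This is a sketch under stated assumptions, not the contents of `enrich_safe.py`: the function names, the single JSON checkpoint file, and the `fake_enrich` stand-in for a model call are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple


def _save(done: Dict[str, dict], path: str) -> None:
    """Write to a temp file, then swap it in, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp_path, path)


def enrich_batch(titles: List[Tuple[str, str]], enrich_one: Callable[[str], dict],
                 path: str, save_every: int = 10) -> Dict[str, dict]:
    """Enrich (record_id, title) pairs, skipping ids already in the checkpoint."""
    done: Dict[str, dict] = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            done = json.load(f)
    unsaved = 0
    for record_id, title in titles:
        if record_id in done:
            continue  # already enriched in an earlier run
        done[record_id] = enrich_one(title)
        unsaved += 1
        if unsaved >= save_every:
            _save(done, path)
            unsaved = 0
    _save(done, path)
    return done


calls = []

def fake_enrich(title: str) -> dict:
    calls.append(title)
    return {"english_title": title.upper()}  # stand-in for a model call


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    all_titles = [(str(i), f"title {i}") for i in range(25)]
    enrich_batch(all_titles[:20], fake_enrich, checkpoint)       # first run stops early
    result = enrich_batch(all_titles, fake_enrich, checkpoint)   # rerun does only the last 5
print(len(result), len(calls))
```

The rerun touches only the five unfinished records, which is the property that makes repeated crashes cheap.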

Processing all 630,862 records took approximately 63 batches. At 8-10 minutes per batch that is roughly 8 to 11 hours of model time, which we spread over several days, running in the background while we worked on other things.
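For readers curious how a 4 + 4 worker split might look, here is a minimal sketch using two thread pools. It assumes titles are divided between the two providers rather than sent to both, which the description above does not settle, and `call_claude` and `call_gemini` are placeholders that return dummy dictionaries instead of calling any API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def call_claude(title: str) -> Dict[str, str]:
    return {"model": "claude", "title": title}  # placeholder, no API call


def call_gemini(title: str) -> Dict[str, str]:
    return {"model": "gemini", "title": title}  # placeholder, no API call


def run_batch(titles: List[str], claude_workers: int = 4,
              gemini_workers: int = 4) -> List[Dict[str, str]]:
    """Alternate titles between two pools; results come back in input order."""
    with ThreadPoolExecutor(max_workers=claude_workers) as claude_pool, \
         ThreadPoolExecutor(max_workers=gemini_workers) as gemini_pool:
        futures = []
        for i, title in enumerate(titles):
            if i % 2 == 0:
                futures.append(claude_pool.submit(call_claude, title))
            else:
                futures.append(gemini_pool.submit(call_gemini, title))
        return [f.result() for f in futures]


batch_results = run_batch([f"title {i}" for i in range(6)])
print([r["model"] for r in batch_results])
```

Separate pools keep one provider's rate limits or slowdowns from starving the other, which is one plausible reason to size the workers per provider.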

What's Next

The enriched data is now available via API. Other researchers can query it directly:

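# Search for Protestant astronomy texts
# -G sends the parameters as a query string; --data-urlencode escapes the [ ] and " characters
curl -G "https://ykhxaecbbxaaqlujuzde.supabase.co/rest/v1/ustc_enrichments" \
  --data-urlencode 'religious_tradition=eq.Protestant' \
  --data-urlencode 'subject_tags=cs.["astronomy"]' \
  -H "apikey: [key]"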

# Get all commentaries on Aristotle
curl ".../ustc_enrichments?work_type=eq.commentary&original_author=ilike.*Aristotle*"
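The same queries can be issued from Python with only the standard library. The sketch below builds the URL with proper percent-encoding; `build_query_url` and `fetch` are hypothetical helper names, the filter strings are taken from the curl examples above, and `fetch` is defined but not called because it needs a real API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://ykhxaecbbxaaqlujuzde.supabase.co/rest/v1/ustc_enrichments"


def build_query_url(filters: dict) -> str:
    """Percent-encode the filter values and append them as a query string."""
    return f"{BASE_URL}?{urlencode(filters)}"


def fetch(url: str, api_key: str) -> list:
    """Send the request with the apikey header shown in the curl example."""
    request = Request(url, headers={"apikey": api_key})
    with urlopen(request) as response:
        return json.load(response)


astronomy_url = build_query_url({
    "religious_tradition": "eq.Protestant",
    "subject_tags": 'cs.["astronomy"]',
})
aristotle_url = build_query_url({
    "work_type": "eq.commentary",
    "original_author": "ilike.*Aristotle*",
})
print(astronomy_url)
print(aristotle_url)
```

Encoding the brackets and quotes avoids the shell-quoting and URL-globbing problems that raw `[` and `"` characters tend to cause on the command line.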

We're also working on cross-referencing with digitization databases to identify which of these 630,000 works are actually available online. Because knowing what exists is only the first step—the next is making it readable.

Notes & Sources

USTC data from Universal Short Title Catalogue. Total catalogue: 1,628,578 records. Enriched subset: 630,862 records (intellectual categories, non-English).
Enrichment models: Claude 3 Haiku (Anthropic) and Gemini 2.0 Flash (Google). Both models showed strong performance on Latin, French, German, Spanish, and Italian titles. Hebrew and Greek titles had higher error rates.
3D embeddings generated using all-MiniLM-L6-v2 sentence transformer, reduced to 3 dimensions via UMAP. Time axis added as third dimension for temporal visualization.
All data available via Supabase REST API. Contact for access credentials.