RESEARCH | December 2025

Teaching AI to Read 630,000 Renaissance Book Titles

The Universal Short Title Catalogue contains 1.6 million records of books printed between 1450 and 1700. But the metadata is sparse: just a title, author, year, and place. We used large language models to enrich 630,000 of these records with English translations, subject classifications, and religious affiliations—making the intellectual landscape of the Renaissance finally navigable.

The Problem: Titles Without Context

Here's a typical USTC record:

Title: Disputatio de potestate papae in rebus temporalibus
Author: Bellarmine, Roberto
Year: 1610
Place: Rome
Language: Latin

If you read Latin, you know this is a treatise on papal authority over temporal matters—a key text in the Counter-Reformation debate over church and state. But the USTC doesn't tell you that. It doesn't tell you it's Catholic, that it's responding to Protestant arguments, or that it belongs to the genre of political theology.

Multiply this by 1.6 million records. Scholars who want to study, say, Protestant responses to Catholic natural philosophy in the 1590s have no way to filter the catalogue. There's no subject search, no religious affiliation field, no way to find commentaries on Aristotle versus original treatises.

The Solution: LLM Enrichment at Scale

We ran every intellectual title through Claude Haiku and Gemini Flash in parallel, asking each model to extract structured metadata:

| Field | Description |
| --- | --- |
| english_title | Translation of the title to English |
| detected_language | Actual language (Latin, French, German, etc.) |
| work_type | original, commentary, translation, edition, sermon, treatise... |
| original_author | For commentaries: who is being commented on (Aristotle, Galen...) |
| subject_tags | 1-3 specific tags (astronomy, medicine, theology...) |
| religious_tradition | Catholic, Protestant, Lutheran, Calvinist, secular... |
| classical_source | If based on a Greek/Roman work |

The same Bellarmine record now becomes:

{
  "english_title": "Disputation on the Power of the Pope in Temporal Affairs",
  "detected_language": "Latin",
  "work_type": "treatise",
  "original_author": null,
  "subject_tags": ["political theology", "papal authority", "church-state relations"],
  "religious_tradition": "Catholic",
  "classical_source": null
}
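Model output in this shape is easy to check before it is saved. The sketch below is one way that check could look, not the pipeline's actual code: `Enrichment` and `parse_enrichment` are hypothetical names, and the only rules enforced are the ones stated in the field table (seven fields, one to three subject tags).

```python
import json
from dataclasses import dataclass
from typing import Optional

# The seven field names come from the field table; everything else is illustrative.
FIELDS = (
    "english_title", "detected_language", "work_type", "original_author",
    "subject_tags", "religious_tradition", "classical_source",
)


@dataclass(frozen=True)
class Enrichment:
    english_title: str
    detected_language: str
    work_type: str
    original_author: Optional[str]
    subject_tags: tuple
    religious_tradition: str
    classical_source: Optional[str]


def parse_enrichment(raw: str) -> Enrichment:
    """Parse one model response and reject anything that breaks the schema."""
    data = json.loads(raw)
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    tags = data["subject_tags"]
    if not isinstance(tags, list) or not 1 <= len(tags) <= 3:
        raise ValueError("subject_tags must be a list of 1-3 tags")
    values = {name: data[name] for name in FIELDS}
    values["subject_tags"] = tuple(tags)  # tuples keep the record immutable
    return Enrichment(**values)


bellarmine = parse_enrichment("""{
  "english_title": "Disputation on the Power of the Pope in Temporal Affairs",
  "detected_language": "Latin",
  "work_type": "treatise",
  "original_author": null,
  "subject_tags": ["political theology", "papal authority", "church-state relations"],
  "religious_tradition": "Catholic",
  "classical_source": null
}""")
print(bellarmine.work_type, bellarmine.subject_tags)
```

A response that omits a field or returns five tags raises `ValueError`, so it can be retried instead of silently stored.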

Now you can search. You can filter. You can ask questions like: “Show me all Protestant treatises on natural philosophy published in Germany between 1550 and 1600.” The catalogue becomes a research tool instead of just an inventory.
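As a rough illustration, that example question reduces to a handful of field comparisons once the enrichment is joined back to the catalogue record. The sketch assumes each joined record is a dict carrying a `year` and a `country` alongside the enriched fields; both names, and the sample rows, are invented for the example.

```python
# Traditions the article groups together when it reports Protestant totals.
PROTESTANT = {"Protestant", "Lutheran", "Calvinist", "Anglican"}


def protestant_natural_philosophy(records, country="Germany", start=1550, end=1600):
    """Yield records matching the example question in the text."""
    for rec in records:
        if (
            rec["religious_tradition"] in PROTESTANT
            and rec["work_type"] == "treatise"
            and "natural philosophy" in rec["subject_tags"]
            and rec["country"] == country
            and start <= rec["year"] <= end
        ):
            yield rec


# Invented sample rows, for illustration only.
sample = [
    {"id": 1, "religious_tradition": "Lutheran", "work_type": "treatise",
     "subject_tags": ["natural philosophy"], "country": "Germany", "year": 1572},
    {"id": 2, "religious_tradition": "Catholic", "work_type": "treatise",
     "subject_tags": ["natural philosophy"], "country": "Germany", "year": 1580},
    {"id": 3, "religious_tradition": "Calvinist", "work_type": "treatise",
     "subject_tags": ["natural philosophy"], "country": "Germany", "year": 1620},
]
matches = [rec["id"] for rec in protestant_natural_philosophy(sample)]
print(matches)
```

Only the first row passes: the second is Catholic and the third falls outside the date range.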

Why 630,000 Out of 1.6 Million?

The full USTC contains 1,628,578 records. We enriched 630,862 of them—about 39%. This wasn't random sampling; we deliberately filtered to focus on intellectually significant works while excluding material that wouldn't benefit from LLM enrichment.

What We Included

We selected USTC categories representing substantive intellectual content:

| Category | Records | Why Include |
| --- | --- | --- |
| Religious | 355,980 | Theology, devotional works, church debates |
| Jurisprudence | 92,904 | Legal commentaries, case law, political theory |
| History & Chronicles | 34,988 | Historical scholarship, primary sources |
| Medical Texts | 28,099 | Scientific and practical medicine |
| Educational Books | 25,261 | Textbooks, pedagogical works |
| Classical Authors | 23,867 | Editions and translations of ancient texts |
| Literature | 31,791 | Poetry, drama, prose fiction |
| Philosophy & Morality | 15,029 | Philosophical treatises, ethics |
| Science | 7,632 | Natural philosophy, astronomy, mathematics |

What We Excluded

Nearly a million records were excluded for specific reasons:

| Category | Records | Why Exclude |
| --- | --- | --- |
| English language | ~164,000 | No translation needed |
| Newspapers | 213,294 | Ephemeral; titles don't convey content |
| University Publications | 153,327 | Mostly administrative (theses lists, calendars) |
| Ordinances & Edicts | 134,340 | Government documents; formulaic titles |
| News Books | 57,147 | Ephemeral current events |
| Funeral Orations | 55,029 | Formulaic; "Funeral oration for X" |
| Wedding Pamphlets | 16,616 | Formulaic occasional literature |
| Almanacs | 18,011 | Ephemeral calendrical material |

The excluded categories share common traits: either the titles are formulaic (“Funeral oration for Johann Schmidt, 1623”), the content is ephemeral (newspapers, almanacs), or the material is administrative rather than intellectual (university calendars, government edicts). LLM enrichment adds little value here—you can't infer subject tags or religious affiliation from “Prussian News, Issue 47.”

English-language works (~164,000) were excluded because our primary goal was translation—a book already in English doesn't need its title translated. Otherwise, we enriched all languages: Latin (40%), French (16%), German (14%), Spanish (11%), Italian (9%), Dutch (5%), plus Greek, Portuguese, Hebrew, Polish, and others.
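The selection rule described in this section can be sketched as a single predicate, followed by the arithmetic behind the 39% figure. This is an illustration under stated assumptions: the field names `category` and `language` are hypothetical, the category labels are copied from the inclusion table, and the real filter may have handled edge cases differently.

```python
# Category labels as listed in the "What We Included" table.
INCLUDED_CATEGORIES = {
    "Religious", "Jurisprudence", "History & Chronicles", "Medical Texts",
    "Educational Books", "Classical Authors", "Literature",
    "Philosophy & Morality", "Science",
}


def should_enrich(record: dict) -> bool:
    """Keep intellectual categories, but skip titles already in English."""
    return (
        record["category"] in INCLUDED_CATEGORIES
        and record["language"] != "English"
    )


# Totals quoted in the article.
TOTAL_RECORDS = 1_628_578
ENRICHED_RECORDS = 630_862

coverage = ENRICHED_RECORDS / TOTAL_RECORDS
excluded = TOTAL_RECORDS - ENRICHED_RECORDS
print(f"{coverage:.1%} enriched, {excluded:,} excluded")
```

With the article's totals this gives a coverage of 38.7% and 997,716 excluded records, which matches "about 39%" and "nearly a million".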

The Results

  • 630,862 records enriched
  • 249,120 Latin titles
  • 240,327 Catholic works
  • 33,571 commentaries

Explore the Data

The visualization below shows the complete enriched dataset. Hover over elements to see details; click the legend to filter by language.

What We Found

Latin Dominance, Then Decline

Latin represents 40% of all enriched titles (249,120 works). But the timeline shows its gradual displacement by vernacular languages. French, German, and Spanish all grow steadily from the mid-1500s, while Latin's share shrinks. By 1700, Latin is no longer the majority language of intellectual publishing in most of Europe.
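A trend like this comes from grouping titles by decade and dividing Latin titles by all titles in each group. The helper below is a minimal sketch of that calculation; `language_share_by_decade` is a hypothetical name and the sample rows are invented, so the numbers it prints say nothing about the real dataset.

```python
from collections import Counter


def language_share_by_decade(records, language="Latin"):
    """Map each decade to the fraction of its titles written in `language`.

    `records` is an iterable of (year, detected_language) pairs.
    """
    totals = Counter()
    hits = Counter()
    for year, lang in records:
        decade = year // 10 * 10
        totals[decade] += 1
        if lang == language:
            hits[decade] += 1
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


# Invented rows, for illustration only.
sample = [
    (1501, "Latin"), (1505, "Latin"), (1509, "French"),
    (1690, "Latin"), (1695, "French"), (1698, "German"), (1699, "Dutch"),
]
shares = language_share_by_decade(sample)
for decade, share in shares.items():
    print(decade, f"{share:.0%}")
```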

The Catholic-Secular Split

38% of works are explicitly Catholic; 34% are secular (no religious affiliation). Protestant traditions combined (Protestant, Lutheran, Calvinist, Anglican) account for about 15%. This reflects both the geography of printing (Italy, France, and Spain were Catholic) and the nature of what got printed (much scholarly work was religiously neutral).

Commentary Culture

33,571 works are classified as commentaries—works that explain or expand upon another author's text. The most commented-upon authors: Aristotle (philosophy, natural science), Galen (medicine), Justinian (law), and Augustine (theology). This commentary culture is largely invisible without enrichment; titles like “In libros Physicorum” don't tell you it's an Aristotle commentary unless you already know.
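Once `work_type` and `original_author` exist, a ranking like this is a filter plus a tally. A minimal sketch, assuming enriched records are available as dicts with those two fields (the sample rows are invented and `most_commented_authors` is a hypothetical helper):

```python
from collections import Counter


def most_commented_authors(records, top_n=4):
    """Count commentaries per original author and return the most frequent."""
    counts = Counter(
        rec["original_author"]
        for rec in records
        if rec["work_type"] == "commentary" and rec["original_author"]
    )
    return counts.most_common(top_n)


# Invented rows, for illustration only.
sample = [
    {"work_type": "commentary", "original_author": "Aristotle"},
    {"work_type": "commentary", "original_author": "Aristotle"},
    {"work_type": "commentary", "original_author": "Aristotle"},
    {"work_type": "commentary", "original_author": "Galen"},
    {"work_type": "commentary", "original_author": "Galen"},
    {"work_type": "commentary", "original_author": None},
    {"work_type": "treatise", "original_author": None},
]
ranking = most_commented_authors(sample)
print(ranking)
```

Commentaries with no detected author are skipped and never counted under `None`.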

3D Semantic Maps

We also embedded book titles using sentence transformers and visualized them in 3D space using UMAP dimensionality reduction. But we added a twist: time as the vertical axis.

In these visualizations, each point is a book. Books with similar titles cluster together horizontally (semantic similarity), but they're stacked vertically by publication year. This means you can literally see the lifespan of ideas:

  • Vertical columns represent works that were reprinted over decades—a column of points rising through time shows a text that stayed in print for 50, 100, or 200 years.
  • Isolated points represent works that were printed once and never again—ideas that didn't catch on, or specialized texts with limited audiences.
  • Semantic drift becomes visible when clusters shift horizontally over time—the same subject matter gets discussed differently as intellectual frameworks evolve.

Aristotelian natural philosophy, for example, forms a dense vertical column from 1450 to 1650, reflecting constant reprinting of commentaries on the Physics and De Anima. But around 1650, the column thins dramatically as Cartesian and, later, Newtonian frameworks displace scholastic natural philosophy.

3D semantic map with time as vertical axis. Rotate to explore; hover for details. Dense vertical columns indicate works reprinted across decades.
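The "vertical column" idea can also be checked numerically. The sketch below assumes each book already has horizontal coordinates from the embedding and UMAP step, paired with its publication year. It then buckets the points into a coarse grid and reports cells whose titles span many decades. The grid heuristic, the `find_columns` name, and the `cell` and `min_span` defaults are stand-ins chosen for the example, not the method used to build the visualization.

```python
from collections import defaultdict


def find_columns(points, cell=0.5, min_span=50):
    """Return grid cells whose points cover at least `min_span` years.

    `points` is an iterable of (x, y, year) tuples, where x and y are the
    horizontal (semantic) coordinates and year is the vertical axis.
    """
    years_by_cell = defaultdict(list)
    for x, y, year in points:
        key = (int(x // cell), int(y // cell))  # neighbouring titles share a cell
        years_by_cell[key].append(year)

    columns = {}
    for key, years in years_by_cell.items():
        span = max(years) - min(years)
        if len(years) > 1 and span >= min_span:
            columns[key] = (min(years), max(years))
    return columns


# Invented coordinates, for illustration only.
sample = [
    (1.1, 2.1, 1480), (1.2, 2.2, 1550), (1.3, 2.0, 1640),  # long-lived cluster
    (-3.0, 0.4, 1601),                                     # printed once
]
columns = find_columns(sample)
print(columns)
```

A fixed grid is crude, since a cluster can straddle a cell boundary, but it is enough to separate long-lived clusters from isolated points.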

Technical Details

The enrichment pipeline ran in parallel across Claude Haiku and Gemini Flash:

# 4 Claude workers + 4 Gemini workers, processing 10k titles per batch
PYTHONUNBUFFERED=1 python scripts/enrich_safe.py 10000 4 4

# Each batch takes ~8 minutes
# Saves incrementally after every 10 records (no data loss on crash)
# Syncs to Supabase every 100 records

The system processed approximately 10,000 titles per batch, with each batch taking 8-10 minutes. Incremental saves after every 10 records meant that even if the process crashed (which it did, multiple times), we never lost more than a few seconds of work.

Total processing time for 630,000 records: approximately 63 batches over several days, running in the background while we worked on other things.
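The crash-tolerance described above comes down to two habits: write results as you go, and skip anything already written when restarting. The sketch below shows that pattern with a thread pool and a JSON Lines file. It is not `enrich_safe.py`: `enrich_stub` stands in for the real model call, the file format is an assumption, and the Supabase sync is left out.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def enrich_stub(title: str) -> dict:
    """Placeholder for the real model call (hypothetical)."""
    return {"title": title, "english_title": None}


def load_done(out_path: str) -> set:
    """Collect titles already saved, ignoring a half-written final line."""
    done = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["title"])
                except (ValueError, KeyError, TypeError):
                    continue
    return done


def run_batch(titles, out_path, workers=8, save_every=10):
    """Enrich unfinished titles in parallel, flushing to disk every few records."""
    done = load_done(out_path)
    pending = [t for t in titles if t not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(out_path, "a", encoding="utf-8") as fh:
        for count, result in enumerate(pool.map(enrich_stub, pending), start=1):
            fh.write(json.dumps(result) + "\n")
            if count % save_every == 0:
                fh.flush()
                os.fsync(fh.fileno())  # force the last records onto disk
    return len(pending)


path = os.path.join(tempfile.mkdtemp(), "enriched.jsonl")
titles = [f"title {i}" for i in range(25)]
first_run = run_batch(titles, path)
second_run = run_batch(titles, path)  # simulates a restart after a crash
print(first_run, second_run)
```

The second call finds all 25 titles already on disk and does no work, which is the property that makes repeated crashes cheap.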

What's Next

The enriched data is now available via API. Other researchers can query it directly:

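# Search for Protestant astronomy texts.
# -G sends the --data-urlencode pairs as a URL-encoded query string, so the
# brackets and quotes in the filter survive the shell and curl's URL globbing.
curl -G "https://ykhxaecbbxaaqlujuzde.supabase.co/rest/v1/ustc_enrichments" \
  --data-urlencode 'religious_tradition=eq.Protestant' \
  --data-urlencode 'subject_tags=cs.["astronomy"]' \
  -H "apikey: [key]"

# Get all commentaries on Aristotle
curl -G ".../ustc_enrichments" \
  --data-urlencode 'work_type=eq.commentary' \
  --data-urlencode 'original_author=ilike.*Aristotle*' \
  -H "apikey: [key]"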
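For researchers working in Python, the same queries can be assembled with the standard library. This is a sketch and has not been tested against the live API: `build_query` and `build_request` are hypothetical helpers, the filter strings are copied from the curl examples, and the API key placeholder must be replaced with real credentials before anything is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://ykhxaecbbxaaqlujuzde.supabase.co/rest/v1/ustc_enrichments"


def build_query(filters: dict) -> str:
    """Return the endpoint URL with percent-encoded filter parameters."""
    return f"{BASE_URL}?{urlencode(filters)}"


def build_request(filters: dict, api_key: str) -> Request:
    """Prepare (but do not send) a GET request carrying the apikey header."""
    return Request(build_query(filters), headers={"apikey": api_key})


url = build_query({
    "religious_tradition": "eq.Protestant",
    "subject_tags": 'cs.["astronomy"]',
})
print(url)

request = build_request(
    {"work_type": "eq.commentary", "original_author": "ilike.*Aristotle*"},
    api_key="[key]",  # placeholder, as in the curl examples
)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it. Encoding the filters up front avoids the quoting problems that brackets and quotes cause on the command line.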

We're also working on cross-referencing with digitization databases to identify which of these 630,000 works are actually available online. Because knowing what exists is only the first step—the next is making it readable.


Notes & Sources

  1. USTC data from Universal Short Title Catalogue. Total catalogue: 1,628,578 records. Enriched subset: 630,862 records (intellectual categories, non-English).
  2. Enrichment models: Claude 3 Haiku (Anthropic) and Gemini 2.0 Flash (Google). Both models showed strong performance on Latin, French, German, Spanish, and Italian titles. Hebrew and Greek titles had higher error rates.
  3. 3D embeddings generated using all-MiniLM-L6-v2 sentence transformer, reduced to 3 dimensions via UMAP. Time axis added as third dimension for temporal visualization.
  4. All data available via Supabase REST API. Contact for access credentials.
