From 2% to 26%: The Journey to Match Historical Catalogs
When we first tried to match the Bibliotheca Philosophica Hermetica catalog against the Internet Archive, we found only 2% of works. Through iterative refinement (fuzzy matching, semantic embeddings, and multi-signal validation) we reached 26% for early modern Latin works. This is the story of why matching historical metadata is harder than it looks.
The Challenge: Two Catalogs, One Question
The Bibliotheca Philosophica Hermetica (BPH) in Amsterdam holds one of the world's finest collections of esoteric literature—30,000+ works on Hermeticism, alchemy, Kabbalah, and Rosicrucianism. The Internet Archive has digitized over 222,000 Latin texts. A simple question: how many BPH works are already available online?
Answering this question required solving a fundamental problem: how do you match 16th-century book titles across two independently created catalogs?
Attempt 1: Exact Prefix Matching (2%)
Our first attempt was naive: compare the first 50 characters of each title, normalized for case and punctuation. The result was devastating—only 2.1% of BPH works matched.
Manual inspection quickly revealed why. The same work might appear as:
- BPH: “De occulta philosophia libri tres”
- IA: “Henrici Cornelii Agrippae ab Nettesheym De occulta philosophia libri tres”
The IA version includes the author's full name and place of origin in the title—standard practice for early modern title pages, but fatal for prefix matching.
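The failure is easy to reproduce. The following is a minimal sketch of prefix matching as described above, not our production code: the helper names `normalize` and `prefix_match` are hypothetical, and the exact normalization rules (lowercasing, replacing punctuation with spaces, collapsing whitespace) are an assumption.

```python
import re

PREFIX_LEN = 50  # number of leading characters compared, as described above


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def prefix_match(a: str, b: str, n: int = PREFIX_LEN) -> bool:
    """True when the first n normalized characters of both titles are equal."""
    return normalize(a)[:n] == normalize(b)[:n]


bph = "De occulta philosophia libri tres"
ia = "Henrici Cornelii Agrippae ab Nettesheym De occulta philosophia libri tres"

# The author's name at the start of the IA title breaks the comparison.
print(prefix_match(bph, ia))
# Differences in case and punctuation alone are handled.
print(prefix_match(bph, "De Occulta Philosophia, libri tres."))
```

Any words placed before the standardized title shift the whole prefix, so no amount of normalization rescues this approach.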
Attempt 2: Fuzzy String Matching (18.6%)
We implemented fuzzy matching using the rapidfuzz library's token set ratio scorer. It splits each title into a set of words and compares the words the two titles share against each title's remaining words, so differences in word order and extra leading or trailing words are tolerated. With a threshold of 85%, match rates jumped to 18.6%.
The improvement came from matching cases like:
- BPH: “Summa theologiae”
- IA: “Divi Thomae Aquinatis Summa theologiae”
But fuzzy matching still struggled with fundamental differences in cataloging philosophy. BPH uses standardized titles; IA transcribes title pages verbatim. These aren't spelling variations—they're different conventions for describing the same book.
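To show why token-set scoring rescues titles like the Aquinas example, here is a simplified stand-in written with only the standard library. It is an approximation of the idea, not rapidfuzz's implementation; with rapidfuzz installed, `rapidfuzz.fuzz.token_set_ratio` would take its place. The 85 threshold comes from the text; everything else is illustrative.

```python
from difflib import SequenceMatcher

THRESHOLD = 85.0  # match threshold from the text


def _ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Simplified token-set score: compare shared words against leftovers."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    # If one title's words are a subset of the other's, the first or second
    # comparison is between identical strings and the score is 100.
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


score = token_set_ratio("Summa theologiae", "Divi Thomae Aquinatis Summa theologiae")
print(score >= THRESHOLD)
```

Because every word of the BPH title appears in the IA title, the extra author words no longer count against the match.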
[Figure: Semantic matching (65%) had the highest recall but the lowest precision; multi-signal matching (26%) trades recall for confidence.]
Attempt 3: Semantic Embeddings (65%)
The breakthrough came from treating titles not as strings but as meaning. Using the paraphrase-multilingual-MiniLM-L12-v2 sentence transformer, we embedded all 10,683 BPH Latin works and 222,407 IA Latin texts into a 384-dimensional vector space. Similar titles cluster together regardless of exact wording.
Building a FAISS index over the IA embeddings, we found the nearest semantic neighbors for each BPH work. With a cosine similarity threshold of 0.75, match rates soared to 65%.
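The nearest-neighbor step can be sketched without any external libraries. The example below is a brute-force stand-in for the FAISS lookup, under the assumption that similarity is the cosine between embedding vectors. The three-dimensional vectors are invented for illustration (real embeddings have 384 dimensions and come from the sentence transformer), and `nearest` is a hypothetical helper name.

```python
import math

SIM_THRESHOLD = 0.75  # cosine similarity threshold from the text


def unit(vec):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(query, corpus):
    """Return (index, cosine similarity) of the corpus vector closest to query."""
    q = unit(query)
    best_idx, best_sim = -1, -1.0
    for i, vec in enumerate(corpus):
        sim = sum(a * b for a, b in zip(q, unit(vec)))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx, best_sim


# Toy stand-ins for IA title embeddings and one BPH title embedding.
ia_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.2, 0.9]]
bph_vector = [0.8, 0.2, 0.1]

idx, sim = nearest(bph_vector, ia_vectors)
is_candidate = sim >= SIM_THRESHOLD
print(idx, round(sim, 2), is_candidate)
```

A linear scan like this is fine for a demonstration, but comparing every BPH title against every IA title is what makes an approximate or indexed search worthwhile at the scale of 222,407 records.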
But there was a problem. Manual inspection revealed many false positives:
- BPH: “Tractatus de lapide philosophorum” (On the philosopher's stone)
- IA: “Tractatus de praeparatione lapidis” (On preparing stones—a geology text)
Both titles share Latin formulaic language (“Tractatus de...”) and discuss “stones,” but they're completely different works. The embedding model captured topical similarity but couldn't distinguish alchemical philosophy from geology.
Attempt 4: Multi-Signal Matching (26%)
The solution was to require corroborating signals beyond title similarity. We combined:
- Title embeddings (semantic similarity ≥ 0.75)
- Author matching (fuzzy surname comparison ≥ 80%)
- Year tolerance (publication dates within ±30 years)
A match required either high title confidence (≥0.85) alone, or medium title confidence (0.75-0.85) plus at least one corroborating signal.
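The decision rule can be written down in a few lines. This is a sketch of the logic as stated above, not our pipeline code: the function name `is_match` and its parameters are hypothetical, the author score is assumed to be a precomputed 0-100 fuzzy surname score, and missing author or year data is assumed to count as "no corroboration".

```python
from typing import Optional

TITLE_HIGH = 0.85     # title similarity that is accepted on its own
TITLE_MEDIUM = 0.75   # minimum title similarity to be considered at all
AUTHOR_MIN = 80.0     # fuzzy surname score threshold (0-100)
YEAR_TOLERANCE = 30   # maximum gap between publication years


def is_match(
    title_sim: float,
    author_score: Optional[float],
    bph_year: Optional[int],
    ia_year: Optional[int],
) -> bool:
    """Accept high title confidence alone, or medium plus one other signal."""
    if title_sim >= TITLE_HIGH:
        return True
    if title_sim < TITLE_MEDIUM:
        return False
    author_ok = author_score is not None and author_score >= AUTHOR_MIN
    year_ok = (
        bph_year is not None
        and ia_year is not None
        and abs(bph_year - ia_year) <= YEAR_TOLERANCE
    )
    return author_ok or year_ok


print(is_match(0.90, None, None, None))   # high title confidence alone
print(is_match(0.80, None, 1533, 1550))   # medium title, year corroborates
print(is_match(0.80, 50.0, 1533, 1700))   # medium title, nothing corroborates
```

Treating missing metadata as "no corroboration" keeps precision high, at the cost of rejecting true matches whose IA records lack author or year data.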
For the early modern period (1400-1700), search-based matching found 650 high-confidence matches (25.7%). That's lower than embedding-only matching, but far more reliable. The strongest matches—title plus author plus year—account for only 4.4%, but these are essentially certain to be correct.
The Precision-Recall Tradeoff
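Each approach represents a different point on the precision-recall curve:
- Exact prefix matching: 2.1% matched. Extra words at the start of a title break the comparison.
- Fuzzy string matching: 18.6% matched at an 85% threshold.
- Semantic embeddings: 65% matched at a 0.75 cosine similarity threshold. Highest recall, lowest precision.
- Multi-signal matching: 25.7% matched for 1400-1700. Lower recall, higher confidence.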
There's no “correct” answer—it depends on your use case. For finding digitization candidates to verify manually, high-recall semantic matching makes sense. For producing statistics about digitization coverage, high-precision multi-signal matching is essential.
Century Patterns
Matching success varies significantly by century. Works from the 15th and 16th centuries show match rates around 34%, while the 17th century drops to 21%. This likely reflects both the bibliographic attention and standardization given to incunabula and the explosion of printing in the 17th century, which created a larger haystack of undigitized works.
What the Numbers Mean
Our 26% match rate for 1400-1700 is a lower bound on true digitization coverage. The actual number of BPH works available in the Internet Archive is likely higher because:
- Anthology problem: Many esoteric works appear inside collected volumes with different titles
- Metadata gaps: Some IA records lack author/year data, preventing signal confirmation
- OCR limitations: Poor OCR in IA metadata may prevent correct matching
But importantly, roughly 74% of the BPH's early modern Latin works have no confirmed match in the world's largest open digital library. Whether they're truly absent or just unfindable, they remain effectively inaccessible to researchers.
Lessons Learned
This journey taught us several things about matching historical metadata:
- String matching is deceptively hard. Early modern titles follow different conventions than modern metadata, and the same work can have radically different title-page transcriptions.
- Semantic similarity isn't identity. Two works can be semantically similar (both about alchemy, both in Latin, both “Tractatus de...”) without being the same work.
- Corroborating signals are essential. Author names and publication years provide crucial disambiguation, even when imperfect.
- There's no single right answer. Different use cases demand different tradeoffs between finding more matches and being confident in matches found.
What's Next: Human Validation
With 650 high-confidence matches now in our database, we've built a validation interface where humans can verify whether each match is correct. The interface shows the BPH catalog entry alongside the Internet Archive metadata, fetched live from the IA API, allowing validators to quickly assess whether they're looking at the same work—or even the same edition.
Validation captures three outcomes: same edition (exact match); same work, different edition (e.g., a 1680 work matched to a 1911 reprint); or different works (false positive). This nuanced categorization lets us distinguish between “available online” and “available in the exact edition the BPH holds.”
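One way to model these outcomes and turn them into the two availability figures is sketched below. The enum, the `summarize` helper, and the sample verdicts are all hypothetical and exist only to illustrate the distinction; they are not our validation schema or real results.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SAME_EDITION = "same_edition"
    SAME_WORK = "same_work_different_edition"
    DIFFERENT_WORK = "different_work"


def summarize(outcomes):
    """Return the share of validated matches falling into each reading."""
    total = len(outcomes)
    if total == 0:
        return {"available_online": 0.0, "exact_edition": 0.0, "false_positive": 0.0}
    counts = Counter(outcomes)
    same_edition = counts[Outcome.SAME_EDITION]
    same_work = counts[Outcome.SAME_WORK]
    return {
        # Any edition of the work counts as "available online".
        "available_online": (same_edition + same_work) / total,
        # Only the edition the BPH holds counts here.
        "exact_edition": same_edition / total,
        "false_positive": counts[Outcome.DIFFERENT_WORK] / total,
    }


# Made-up verdicts for four matches, purely for illustration.
sample = [
    Outcome.SAME_EDITION,
    Outcome.SAME_WORK,
    Outcome.SAME_WORK,
    Outcome.DIFFERENT_WORK,
]
summary = summarize(sample)
print(summary)
```

The false-positive share doubles as a direct precision estimate for whichever matching method produced the validated set.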
We're also expanding our year range to include 18th-century works (1701-1800), which adds another 1,030 Latin works to match. Early results show a similar ~25% match rate for this period.
The ultimate goal isn't just counting matches—it's identifying which works most need digitization and translation, so the esoteric traditions that shaped Renaissance thought can be accessible to modern scholars.
Want to help validate our matches? Visit our validation page to verify BPH-IA matches. Every validation improves our understanding of what's really available in digital archives.