2025/01/07

Closing the language gap: automated language identification in British Library catalogue records

Words in English, Hungarian and Volapuk shown above the appropriate language 'bucket'

The British Library faced the challenge of cataloguing over 4.7 million records that lacked language information. To address this, a statistical model based on Bayesian methods was used to predict the language of each resource from catalogue metadata such as titles. The model analysed patterns of word usage in records whose language was already known and used them to generate a probability for each candidate language. Results varied by language: precision was high, but recall varied. The project added language codes to over 2 million records, supporting curatorial work and resource discovery. Ongoing work aims to code millions more records.
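
To make the idea of predicting language from title words concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is an illustration under stated assumptions, not the Library's actual implementation: the function names (`train`, `predict`, `assign_code`), the whitespace tokenisation, the add-one smoothing, the confidence threshold and the four toy training titles are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Count word usage per language from (title, language_code) pairs
    whose language is already known."""
    word_counts = defaultdict(Counter)  # language -> word -> count
    lang_counts = Counter()             # language -> number of records
    vocab = set()
    for title, lang in records:
        words = title.lower().split()   # assumption: simple whitespace tokens
        word_counts[lang].update(words)
        lang_counts[lang] += 1
        vocab.update(words)
    return word_counts, lang_counts, vocab


def predict(title, model, alpha=1.0):
    """Return a dict mapping each language to its posterior probability."""
    word_counts, lang_counts, vocab = model
    total_records = sum(lang_counts.values())
    log_scores = {}
    for lang in lang_counts:
        total_words = sum(word_counts[lang].values())
        # Prior: share of known records in this language.
        score = math.log(lang_counts[lang] / total_records)
        # Likelihood: smoothed frequency of each title word in this language.
        for word in title.lower().split():
            count = word_counts[lang][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        log_scores[lang] = score
    # Normalise in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def assign_code(title, model, threshold=0.9):
    """Assign a language code only when the model is confident enough.
    The threshold is a hypothetical parameter for this sketch."""
    probs = predict(title, model)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None


# Toy training data, invented for illustration only.
known_records = [
    ("the history of england", "eng"),
    ("a history of the english language", "eng"),
    ("a magyar nyelv története", "hun"),
    ("magyar népmesék és mondák", "hun"),
]
model = train(known_records)
print(predict("the history of hungary", model))
print(assign_code("magyar nyelv", model, threshold=0.8))
```

Declining to assign a code below a threshold is one common way to favour precision over recall: fewer records receive a code, but those that do are more likely to be right. Whether the project used such a cut-off is not stated here, so treat that part of the sketch as an assumption.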