
Transcribing Thai in Transkribus

  • Writer: Joseph Nockels
  • Apr 7
  • 3 min read

Updated: Jul 28

Beginning in the spring, we welcomed Veepattra Siroros, an MA English Language and Linguistics student, to the Digital Humanities Institute for a work placement module. We've run these placements for several years but, this time around, pitched an AI and Automated Text Recognition (ATR) project. Initially, we proposed a placement focused on recognising 19th-century newspaper structures and text from The Spiritualist (1869-1882), held as open data on the National Library of Scotland's (NLS) Data Foundry (The Spiritualist – Data Foundry). Vee helped tremendously in trialling Transkribus's newly minted field models (Field Models - Transkribus AI) on this material, enabling the training of scalable print models for the NLS that recognise poetic stanzas, titles, subtitles, speech and paragraphed text (more on this work to come). It soon became clear, though, that a more meaningful avenue (not that newspaper material is devoid of meaning) was to test Transkribus on Thai documentation.


Vee began by transcribing 25 pages of sutras, regimented chant scripts printed in Thai, from the Wat Pa Buddhapojhariphunchai (Phutthaphot Hariphunchai) Temple in Ton Thong, Mueang Lamphun District, Lamphun 51000, Thailand. In learning the ropes of being a GLAM (galleries, libraries, archives and museums) curator, Vee also managed the temple connection, advocating for the worth of automatic transcription training and keeping the temple updated with our progress. The eventual model, 'Thai Transcriber', achieved 91.89% character accuracy with limited training data: only 1,650 words, well below the recommended 15,000 words for handwritten material and c. 5,000 for print, the incantations being short, tightly bounded textual datasets. We then asked ourselves how well a model trained on regimented text, in relatively regimented print, could perform against other Thai materials.
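For readers new to these figures: Transkribus evaluates models with a Character Error Rate (CER), and the 91.89% character accuracy above is simply one minus that rate. As a rough illustration (the function and the sample strings are our own sketch, not Transkribus code), character accuracy can be derived from Levenshtein edit distance:

```python
# Minimal sketch: character accuracy as 1 - CER, where CER is the
# Levenshtein edit distance divided by the reference length.
# The Thai sample pair below is illustrative only.

def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ref: str, hyp: str) -> float:
    """1 - CER, computed against the ground-truth length."""
    return 1 - levenshtein(ref, hyp) / len(ref)

# e.g. the model reads เห็น as เห้น: one substituted tone/vowel mark
print(round(char_accuracy("เห็น", "เห้น"), 2))  # 0.75
```

Note that Thai combining marks count as characters in their own right, so a single misplaced tone mark registers as a full substitution.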




Ground truth transcriptions, within the Transkribus environment, of Phutthaphot Hariphunchai sutras


It turns out not particularly well. We applied the Thai Transcriber to an openly accessible text entitled "Science of Breath", a Buddhist staple out of copyright and uploaded to Scribd by Burin Kim (Burin Kim (burin7kim) | Scribd). Although this text is also printed, just slightly more irregular, Vee noted the following errors consistently made by the model:


  1. เ -> ง

  2. ง -> เย / บ : เ / ย

  3. ึ -> ิ / ี -> ิ / ี่ -> ี

  4. ็ -> ้ / ิ / ้ -> ั / ์ -> ั and vice versa

  5. เห็น -> เห้น

  6. ใ -> โ / ไ -> ใ

  7. ก -> ข

  8. ฒ -> ผ

  9. ) -> า

  10. ล -> ต

  11. ด -> ต

  12. สืบ -> สิบ

  13. ซ่อมแซม -> ช

  14. ของ -> ยอย

  15. Numerals: 2-3 -> วว / 73 -> กว

  16. ศ -> เต

  17. ษ -> ช / ช -> ษ

  18. ฤ -> ถุ / ถ

  19. ษ -> ข and vice versa

  20. ซ -> ช / ข / ข -> ช / ช -> ข

  21. ษ -> ข

  22. ข -> ษ / ช / ช -> ข

  23. ฐ์ -> ส

  24. ทั้งทางกาย -> ทับทเยกาย

  25. ร่ายการ -> ร่างกาย

  26. ซิ -> ขิ

  27. หรือ

  28. ๊ -> ็

  29. เข้าใจ -> เชา

  30. ล -> ฉ

  31. พิจารณา

  32. จ่าย -> ง

  33. โลหิต -> ใด / ต -> ผ

  34. ๆ -> ต

  35. จ -> ค

  36. ด -> ต

So instead, we transcribed another 20 pages (approx. 1,400 words) of the "Science of Breath" and retrained, using the public Mixed Line Orientation model for baseline recognition and our pre-trained Thai Transcriber as a base model. What came back was a model with 99.55% character accuracy! Further testing on other fonts is still needed, and character-level error rates typically run much lower than f-1 precision/recall or Word Error Rate (WER) scores on the same output, so the figure should be read with that caveat. Still, not half bad for a three-month placement.
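That divergence between character- and word-level scoring is easy to see: one wrong character costs a single edit at character level but invalidates the whole word at word level. A quick sketch, with an invented English pair rather than our Thai data:

```python
# Sketch: the same Levenshtein function applied over characters and over
# words shows why CER and WER diverge. Strings here are illustrative only.

def edit_distance(ref, hyp):
    """Levenshtein distance over any pair of sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

ref, hyp = "science of breath", "science of breadth"
cer = edit_distance(ref, hyp) / len(ref)                          # characters
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())  # words
print(f"CER {cer:.1%} vs WER {wer:.1%}")  # one inserted letter: 5.9% vs 33.3%
```

The single inserted letter yields a CER of about 5.9% but a WER of 33.3%, which is why a 99.55% character accuracy does not promise equally clean words.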



CER training/validation for 'Multi-font Thai Transcriber', as well as the model description within Transkribus


Feel free to get in touch with Vee at vsiroros1@sheffield.ac.uk for access to our models, training and validation datasets.



© 2025 by Joe Nockels. All rights reserved.
