Transcribing Thai in Transkribus
- Joseph Nockels
- Apr 7
Updated: Jul 28
In the spring, we welcomed Veepattra Siroros, an MA English Language and Linguistics student, to the Digital Humanities Institute for a work placement module. We've run these placements for several years but, this time around, pitched an AI and Automated Text Recognition (ATR) project. Initially, we proposed a placement focused on recognising 19th-century newspaper structures and text from The Spiritualist (1869-1882), held as open data on the National Library of Scotland's (NLS) Data Foundry (The Spiritualist – Data Foundry). Vee helped tremendously in trialling Transkribus's newly minted field models (Field Models - Transkribus AI) on this material, enabling the training of scalable print models for the NLS that recognise poetic stanzas, titles, subtitles, speech and paragraphed text (more on this work incoming). Even so, it became clear that a more meaningful avenue (not that newspaper material is devoid of meaning) was to test Transkribus on Thai documentation.
Vee began by transcribing 25 pages of sutras (regimented chant scripts) printed in Thai, from the Wat Pa Buddhapojhariphunchai (Phutthaphot Hariphunchai) Temple in Ton Thong, Mueang Lamphun District, Lamphun 51000, Thailand. In learning the ropes of being a GLAM (galleries, libraries, archives and museums) curator, Vee also managed the temple connection, advocating for the worth of automatic transcription training and keeping the temple updated on our progress. The eventual model, 'Thai Transcriber', achieved 91.89% character accuracy with limited training data: only 1,650 words, far below the recognised guideline of 15,000 words for handwritten material and c. 5,000 for print, because the incantations are short, tightly structured texts. We then asked ourselves: how well could a model trained on regimented text, in relatively regimented print, perform on other Thai materials?

Ground truth transcriptions, within the Transkribus environment, of Phutthaphot Hariphunchai sutras
It turns out, not particularly well. We applied Thai Transcriber to an openly accessible text entitled "Science of Breath", a Buddhist staple that is out of copyright and was uploaded to Scribd by Burin Kim (Burin Kim (burin7kim) | Scribd). Although this text is also printed, just slightly more irregularly, Vee noted the following errors made consistently by the model:
เ -> ง
ง -> เย / บ : เ / ย
ึ-> ิ / ี -> ิ / ี่ -> ี
็ -> ้ / ิ / ้ -> ั / ์ -> ั and vice versa
เห็น -> เห้น
ใ -> โ / ไ -> ใ
ก -> ข
ฒ -> ผ
) -> า
ล -> ต
ด -> ต
สืบ -> สิบ
ซ่อมแซม -> ช
ของ -> ยอย
Numerals: 2-3 -> วว / 73 -> กว
ศ -> เต
ษ -> ช / ช -> ษ
ฤ -> ถุ / ถ
ษ -> ข and vice versa
ซ -> ช / ข // ข -> ช / ช -> ข
ษ -> ข
ข -> ษ / ช // ช -> ข
ฐ์ -> ส /
ทั้งทางกาย -> ทับทเยกาย
ร่ายการ -> ร่างกาย
ซิ -> ขิ
หรือ
๊ -> ็
เข้าใจ -> เชา
ล -> ฉ
พิจารณา
จ่าย -> ง
โลหิต -> ใด / ต -> ผ
ๆ -> ต
จ -> ค
ด -> ต
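A list like the one above can also be tallied automatically once ground truth and model output sit side by side. The following is a minimal sketch of that idea, not how Vee compiled her notes: it uses `difflib` from the Python standard library to align two strings and count the substitutions, and the function name `character_confusions` is our own invention for illustration. It compares code point by code point, so Thai vowel and tone marks count as characters in their own right.

```python
from collections import Counter
from difflib import SequenceMatcher


def character_confusions(pairs):
    """Count (expected, recognised) substitutions across text pairs.

    `pairs` is an iterable of (ground_truth, model_output) strings.
    Only aligned replacements are counted; insertions and deletions
    are ignored in this sketch.
    """
    counts = Counter()
    for truth, output in pairs:
        matcher = SequenceMatcher(None, truth, output, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(truth[i1:i2], output[j1:j2])] += 1
    return counts


# Two word-level errors taken from the list above.
examples = [("เห็น", "เห้น"), ("สืบ", "สิบ")]
for (expected, got), n in character_confusions(examples).most_common():
    print(f"{expected!r} -> {got!r}: {n}")
```

Run over whole pages rather than single words, a tally like this would show which confusions dominate and therefore which characters most need extra training examples.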
So instead, we transcribed another 20 pages (approx. 1,400 words) of the "Science of Breath" and retrained, using the public Mixed Line Orientation model for baseline recognition and our pre-trained Thai Transcriber as a base model. What was returned was a model with 99.55% accuracy! Further testing on other fonts is still needed, and character-level scores tend to look much more favourable than F1 (precision/recall) scoring or Word Error Rates (WERs). Still, not half bad for a three-month placement.
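To see why character-level and word-level scores can diverge so sharply, here is a rough sketch of how the two error rates are commonly computed: the edit (Levenshtein) distance between reference and output, divided by the length of the reference. This assumes accuracy is reported as one minus the Character Error Rate (CER), and the helper names are ours rather than anything from Transkribus. The example is in English because Thai is written without spaces between words, so a WER for Thai would first need a word segmenter.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edits needed, divided by the length of the reference."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "science of breath"
hypothesis = "science of breadth"

cer = error_rate(reference, hypothesis)                  # per character
wer = error_rate(reference.split(), hypothesis.split())  # per whitespace token
print(f"CER {cer:.2%}, WER {wer:.2%}")
```

A single stray character is one error out of seventeen characters but one error out of three words, which is why a high character accuracy can sit alongside a noticeably worse word-level score.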


CER training/validation for 'Multi-font Thai Transcriber', as well as the model description within Transkribus
Feel free to get in touch with Vee at vsiroros1@sheffield.ac.uk for access to our models and our training and validation datasets.

