Digital Folklore or Fakelore? History and Archives in Practice, Western Bank Library, April 16th 2025
A few weeks ago, The National Archives (TNA) came to town, with our University of Sheffield Special Collections hosting History and Archives in Practice (HAP). Bridging archivists and historians through close conversation, HAP is a key TNA event demonstrating the range of work being done on historical collections, with an emphasis on hands-on research.
Of course, applying digital methods to collections is central to archival practice, and offers a way to enhance the co-creative aspects of both historical research and curation.
It was my job, then, to showcase some of these techniques and demonstrate why attendees may benefit from their use on their own material. I decided to spotlight the University's National Centre for English Cultural Tradition (NATCET) Collection (1975 - 2008).
Though closed in 2008, NATCET's folklorists directed the first national Survey of English Language and Folklore in 1964. Nowadays, contacts at Sheffield Hallam University are responding to this germinal work through a modern survey of their own, identifying shifts between NATCET's original data collection and 21st century evidence of cultural customs.
Led by John Widdowson, then NATCET director, the Survey in 1964 covered ‘all parts of the country, both urban and rural, and from representatives of all ages and social and ethnic groups’. It therefore broke with the idea that urbanisation was antithetical to prioritising ancestral culture, and remains a rich resource for understanding cultural customs, changes to them, and their inherent fluidity at the national, organisational, family and individual level.
The HAP session focused on the Survey’s B cards for May, previously used by our School of Education for engaging local school groups based on spring tradition. The original cards, 126 in total, were made accessible to HAP attendees. They contain structured information from respondents: their given folklore story, where the story originated and from whom it was first learnt, along with respondents’ general demographic information.
These cards contain a mixture of handwritten and printed information, are written on paper of variable quality, and carry annotations from Widdowson’s own research team. Given this materiality, they form a particularly good case study for trialling digital techniques and workflows, namely Automatic Text Recognition (ATR) as a foundational data generation method: the process of converting images of manuscripts into computer-readable text, enabling further digital analysis (Pinche & Stokes, 2024).
A Digital Approach, Folklore or Fakelore?
Digital approaches at scale rely on structured data, thereby following folklorist approaches in how data is gathered, with traditional surveying required to systemise and interpret large-scale, diverse, contradictory and challenging material (Davies & Houlbrook, 2025: 6-7). Folklore is also multi-national, influenced by migration, and studied across communities, requiring a level of structure to undergo any broad sense-making.
Do digital techniques further enable such systematisation and structured research? Or do they present an issue in flattening inherently complicated material, and the culture it contains? These remain open questions, relevant but beyond our scope here.
Digital workflows certainly have potential to exacerbate what folklorists have called ‘fakelore’, coined by Dorson (1968) as the distortion of serious cultural subjects through misinformation. Such concerns follow the general ethos of the Digital Humanities Institute, where I am based, which seeks to critique technology as a method of inquiry and dissemination rather than viewing digital tools as givens. Nonetheless, while digital society and global multimedia may seem to endanger traditional folklore, other folklorists view such paradigm shifts as creating new digital folklore traditions to be studied (Krawczyk-Wasilewska, 2016).
In balancing these arguments, our HAP session took a critical approach in using digital workflows to make folkloric heritage more accessible and interactive. As Domokos (2014) suggests, some folklorists are yet to fully accept computer-mediated interpretation; however, our session demonstrated ways in which critical digital communication may be suitable for transmitting folklore, a practice increasingly called ‘electronic folklore’ (Krawczyk-Wasilewska, 2016).
Of course, produced for a twenty-minute slot and brief demonstration, these workflows are in need of further refinement. However, they show the potential of using digital techniques to capture, present and study heritage folklore collections.
In doing so, we also began to answer:
1) How do May places relate in the survey?
2) Where do most folktales originate, looking at the from_whom influences within the May survey entries?
3) What main cultural customs emerge, and decline, within the May survey entries?
Method
To briefly outline the workflows covered, the results of which we displayed around the Digital Scholarship Suite, we used automatic transcription, natural language processing, spatial mapping, and network analysis. These workflows were pre-built and made up of common tools, code libraries or applications, accessible freely or at low cost. No internal development was needed for this brief demonstration, with community-based tools prioritised, in keeping with the values of the research and archival sector.
Automated Transcription
As noted, the survey cards were variable in layout and contained a mixture of print and handwritten information. We therefore began by converting them to grayscale and thresholding them, increasing image contrast to aid our digital transcription efforts. This was done using a simple Python script.
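The script itself is not reproduced in this post. As a minimal sketch of the two steps involved, the example below works on a tiny, hypothetical grid of RGB pixels in plain Python; in practice an imaging library would load and save the scans. The luminance weights are the common ITU-R BT.601 values and the cutoff of 128 is an assumption, not the setting used on the cards.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def threshold(gray, cutoff=128):
    """Binarise: values at or above the cutoff become white (255), the rest black (0)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in gray]


# Hypothetical 2x2 "scan": faded ink on off-white paper.
card = [
    [(240, 235, 220), (90, 80, 70)],
    [(60, 60, 60), (250, 250, 245)],
]

binary = threshold(to_grayscale(card))
print(binary)
```

A fixed cutoff is the simplest choice; unevenly lit or discoloured paper may respond better to an adaptive threshold.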
We then used Transkribus, the largest consumer-level ATR system, to manually label the text regions and basic content across the first 50 cards. For instance, the ‘from_whom’ question was labelled as such, which allowed certain fields to be transcribed and exported separately, lessening the need for data wrangling at the interpretation stage. These structural tags provided training data for our first Transkribus model, known as a ‘field’ model, which returned a mean average precision (mAP) of 62.31%, a measure of how accurately detected regions match a set of reference bounding boxes (Bourne et al., 2026).
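How Transkribus calculates its mAP figure is not detailed here. As a hedged illustration of the usual building block behind such scores, the sketch below computes intersection over union (IoU), the overlap between a predicted region and a hand-labelled one. The box coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union for two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


# Hypothetical 'from_whom' field: hand-labelled region vs. model prediction.
labelled = (0, 0, 100, 50)
predicted = (10, 0, 110, 50)
print(round(iou(labelled, predicted), 3))
```

A prediction typically counts as correct when its IoU with a labelled region passes a chosen threshold; average precision is then derived from those matches.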
These regions were manually corrected, before segmenting each card into lines, to enable corresponding digital transcription. Due to image noise and pixelation, even with our pre-processing, our training of an automated layout and line segmentation model produced more broken lines than expected. Therefore, this work was carried out manually.

Survey of English Language and Folklore card, marked-up in structural XML, segmented and transcribed for export.
Finally, after mapping text regions and segmenting image lines, we used Transkribus’ transformer-based Text Titan model. This proved mostly accurate, with approximately five transcription corrections needed per card. Common errors included ‘Holiday’ being read as ‘Friday’, and misreadings of the word ‘flower’. Of course, in keeping with Human-in-the-Loop approaches, these card transcriptions were manually checked.
Although some manual correction was needed, this transcription work demonstrated how ATR can foreground folklore research, both from a researcher and archive perspective. HAP participants were also able to experiment with Transkribus, through set-up laptops showing the survey cards within the ATR environment. Paper scans of each step were also laid out for inspection.
After finalising our ATR transcription corrections, we moved to interpret the survey contents through a range of digital techniques:
1) Spatial Analysis
To better understand the locations mentioned within the cards, we used the Natural Language Processing library spaCy, within Python, to extract place names, before making any corrections needed. These place names were then given latitude and longitude coordinates (geocoded) using Geopy, again within Python. This information (the card number, place name, date and geocode) was then presented using Palladio, an accessible, low-threshold mapping tool, shown below.
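The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for the session: the spaCy model name (`en_core_web_sm`), the choice of Geopy's Nominatim geocoder, the column names and the example rows are all assumptions, and the single-field 'latitude,longitude' column reflects Palladio's usual coordinate format as we understand it. The demonstration at the bottom uses an offline stand-in for the geocoder so that it runs without network access.

```python
import csv
import io
from types import SimpleNamespace

# spaCy entity labels that usually cover place names.
PLACE_LABELS = {"GPE", "LOC", "FAC"}


def extract_places(text, nlp):
    """Return place-name strings found by a loaded spaCy pipeline."""
    return [ent.text for ent in nlp(text).ents if ent.label_ in PLACE_LABELS]


def geocode_rows(rows, geocode):
    """Add a 'latitude,longitude' string to each row ('' if the place is not found).

    `geocode` is any callable that returns an object with .latitude and
    .longitude, or None, which is how Geopy geocoders behave.
    """
    result = []
    for row in rows:
        location = geocode(row["place"])
        coords = f"{location.latitude},{location.longitude}" if location else ""
        result.append({**row, "coordinates": coords})
    return result


def to_palladio_csv(rows):
    """Serialise rows to CSV text with one coordinates column."""
    buffer = io.StringIO()
    fields = ["card", "place", "date", "coordinates"]
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_real_tools():
    """Load spaCy and a rate-limited Nominatim geocoder (needs both libraries installed)."""
    import spacy
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    nlp = spacy.load("en_core_web_sm")  # assumed model
    geolocator = Nominatim(user_agent="folklore-cards-demo")  # hypothetical agent name
    return nlp, RateLimiter(geolocator.geocode, min_delay_seconds=1)


# Offline stand-in for the geocoder, with approximate coordinates.
LOOKUP = {"Sheffield": (53.38, -1.47)}


def stub_geocode(name):
    hit = LOOKUP.get(name)
    return SimpleNamespace(latitude=hit[0], longitude=hit[1]) if hit else None


# Hypothetical rows, as they might look after place-name extraction and correction.
rows = [
    {"card": "12", "place": "Sheffield", "date": "1972-05-29"},
    {"card": "13", "place": "Nowhereton", "date": "1972-05-01"},
]
palladio_csv = to_palladio_csv(geocode_rows(rows, stub_geocode))
print(palladio_csv)
```

Unmatched places are left with empty coordinates rather than dropped, so they can be corrected by hand before mapping.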

2) Network Analysis
We were also interested in how May folk stories were transmitted. To explore this, we exported only the from_whom and learned_from survey questions, made simple through our text region tagging in Transkribus.
Again, through spaCy, we extracted the named entities mentioned (people, places, events and others, for instance certain book titles). These were then networked: if one entity (A) was mentioned within a window of ten words before another entity (B), A was seen to have a linear influence on B. This approach remains limited, and returns us to our initial discussion of the need to critique digital techniques when dealing with complex cultural information; however, it demonstrated how text extraction can aid digital interpretation.
Using a structure of nodes (entities) and edges (how the nodes relate), we mapped these connections within Gephi, a free and open network analysis tool. Here, we modelled based on in_out influence: the darker the node, the more connections it contained (greater transmission). The edge ‘weight’, again shown through darker colour, indicated the primary networks in a directional manner, with the most frequent nodes appearing centrally. These networks were also made viewable to HAP attendees through the laptop set-up.
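The ten-word window rule can be sketched as below. This is one possible reading of the rule rather than the exact code used: each entity mention is assumed to be a (word position, entity name) pair, the card contents are hypothetical, and the output uses the Source, Target and Weight column names that Gephi's spreadsheet import recognises.

```python
from collections import Counter


def directed_edges(mentions, window=10):
    """Count A -> B edges where A appears up to `window` words before B.

    `mentions` is a list of (word_position, entity_name) pairs for one card.
    """
    ordered = sorted(mentions)
    edges = Counter()
    for i, (pos_a, ent_a) in enumerate(ordered):
        for pos_b, ent_b in ordered[i + 1:]:
            if pos_b - pos_a > window:
                break  # later mentions are even further away
            if pos_b > pos_a and ent_a != ent_b:
                edges[(ent_a, ent_b)] += 1
    return edges


def to_gephi_csv(edges):
    """Render an edge counter as CSV text for Gephi."""
    lines = ["Source,Target,Weight"]
    for (source, target), weight in sorted(edges.items()):
        lines.append(f"{source},{target},{weight}")
    return "\n".join(lines) + "\n"


# Hypothetical entity mentions from three cards.
cards = {
    "card_001": [(2, "mother"), (7, "grandmother"), (25, "school")],
    "card_002": [(0, "mother"), (4, "grandmother")],
    "card_003": [(1, "teacher"), (6, "school")],
}

total = Counter()
for mentions in cards.values():
    total.update(directed_edges(mentions))

print(to_gephi_csv(total))
```

Because the rule relies only on word order and distance, it cannot tell 'learned from my mother' apart from an incidental mention, which is part of the limitation noted above.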
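Gephi calculates these measures itself, so no code is needed to reproduce the visualisation. Purely to show what the node shading is based on, here is a minimal sketch of weighted in- and out-degree, using a hypothetical edge list.

```python
from collections import defaultdict


def weighted_degrees(edges):
    """Sum edge weights leaving ('out') and arriving at ('in') each node."""
    degrees = defaultdict(lambda: {"in": 0, "out": 0})
    for (source, target), weight in edges.items():
        degrees[source]["out"] += weight
        degrees[target]["in"] += weight
    return dict(degrees)


# Hypothetical weighted, directed edges.
edges = {
    ("mother", "grandmother"): 2,
    ("mother", "school"): 1,
    ("teacher", "school"): 1,
}

degrees = weighted_degrees(edges)
for node, counts in sorted(degrees.items()):
    print(node, counts, "total:", counts["in"] + counts["out"])
```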

Directed network of the from_whom and learned_from survey fields (May), highlighting the central prominence of 'mother' as well as key figures involved in survey coordination.
3) Topic Modelling
Lastly, to better understand how cultural customs emerged and declined in the May cards, we used our digital transcriptions to perform topic modelling. First, we standardised the dates into the ISO format YYYY-MM-DD, and removed ‘stop words’ (very common words that carry little meaning, such as ‘the’ and ‘and’). We also lemmatised the data, so that ‘maypole dancing’ and ‘maypole dance’ were treated as the same term, avoiding duplication. To discover hidden patterns and themes, we counted the frequency of these terms and groupings, and used the model BERTopic to support dynamic topic modelling, broadly showing the evolution of themes over time (Grootendorst, 2024). The results are shown in the graph below.
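A rough sketch of these preparation steps follows. The date formats, stop-word list and card texts are assumptions for illustration, and the small lemma lookup is a toy stand-in for a proper lemmatiser such as spaCy's. The final function shows how BERTopic's `topics_over_time` might then be called; it is only defined here, not run, as it needs the `bertopic` package and a reasonably large set of documents.

```python
from collections import Counter
from datetime import datetime

DATE_FORMATS = ("%d %B %Y", "%d/%m/%Y")  # assumed formats on the cards
STOP_WORDS = {"the", "and", "a", "of", "in", "on", "to", "was", "were"}
LEMMAS = {"dancing": "dance", "dances": "dance", "nettles": "nettle"}  # toy lookup


def to_iso(raw):
    """Return a date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(text):
    """Lowercase, strip punctuation, drop stop words and apply the lemma lookup."""
    words = [word.strip(".,;:!?'\"").lower() for word in text.split()]
    return [LEMMAS.get(word, word) for word in words if word and word not in STOP_WORDS]


def terms_by_year(records):
    """Count cleaned terms per year; records are (raw_date, text) pairs."""
    counts = {}
    for raw_date, text in records:
        iso = to_iso(raw_date)
        if iso is None:
            continue  # undated cards are skipped in this sketch
        counts.setdefault(iso[:4], Counter()).update(clean(text))
    return counts


def bertopic_over_time(docs, timestamps):
    """Fit BERTopic and return its topics-over-time table (requires bertopic)."""
    from bertopic import BERTopic

    model = BERTopic()
    model.fit_transform(docs)
    return model.topics_over_time(docs, timestamps)


# Hypothetical card entries.
records = [
    ("29 May 1973", "Oak apple day and the nettles"),
    ("01/05/1970", "Maypole dancing on the green"),
    ("May-time", "Undated entry"),
]
yearly = terms_by_year(records)
print(yearly)
```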

BERTopic graph of key topic instances, showing for example that 'oak apple day' emerged in 1970 and peaked in 1973. On this day, boys would sting girls with nettles if they were not wearing an oak leaf to commemorate Charles II's exile and hiding in a tree. This holds potential insight into shifting royalist sympathies, and shows a clear gendered element to folklore tradition.
To conclude, by showcasing a mix of digital techniques, we attempted to demonstrate to HAP attendees that: 1) automated digital transcription, when critically used, can foreground further analyses of local heritage and history; 2) such work has potential for folklore research and for interpreting the spatial aspects, networks and dynamics of local cultural customs; and 3) such methods can be trialled within archives at relatively low cost.
We have plans to ingest more cards into these workflows, eventually building up to full survey digitisation, with the aid of DHI placement students, cataloguers, and Steph, our wonderful Digitisation Officer. We also hope to explore ways of further connecting different aspects of the survey: the written cards, audio recordings and file archives.
All the datasets constructed in this work are shareable; just let me know.
We also welcome advice, or comments, based on your own work.
References
Bourne, J., Simbeye, M., Govia, I. (2026). Get your COTe: A decomposable framework for evaluating Document Layout Analysis models. https://arxiv.org/abs/2603.12718
Davies, O., Houlbrook, C. (2025). Folklore: A Journey through the Past and Present. Manchester University Press: Manchester.
Domokos, M. (2014). Towards Methodological Issues in Electronic Folklore. Ethnology, 2. 273-283. Available at: https://www.ceeol.com/search/article-detail?id=137958
Dorson, R.M. (1968). The British Folklorists: A History. University of Chicago Press: Chicago.
Grootendorst, M. (2024). Dynamic Topic Modelling. Available at: https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html
Krawczyk-Wasilewska, V. (2016). Folklore in the Digital Age. Cambridge University Press: Cambridge.
Pinche, A., Stokes, P. (2024). ‘Historical documents and Automatic Text Recognition: Introduction’, Journal of Data Mining and Digital Humanities, pp. 1-11. https://doi.org/10.46298/jdmdh.13247
Widdowson, J.D.A. (2016). ‘New Beginnings: Towards a National Folklore Survey’, Folklore, 127, 264,


