Can AI assist in translating and understanding old manuscripts and documents?

Artificial Intelligence is undoubtedly helping humanity advance in many fields where human effort and the ability to adapt, interpret, and make complex decisions based on data are simply not enough. An example of this is the role of AI in translating and understanding historical documents and handwritten manuscripts. In recent years there have been many cases where researchers have relied on deep learning to decipher ancient texts that had remained a mystery to experts for decades. What is more, the continuous development of such technologies may bring paleography to a turning point, thanks to the markedly increased accuracy that deep learning tools provide. To better convey the importance of such technology, this article is dedicated to the application of AI in deciphering ancient manuscripts. We will also look at some recent examples of ancient artifacts that have been successfully read or translated with the help of text recognition technologies.


The mystery of ancient languages

Nowadays there are more than 7,000 languages used around the world. Languages are probably one of the most important aspects of our lives, since without them we would not be able to express our feelings, desires, and questions to the world that surrounds us. Since the emergence of the concept of language about 10,000 years ago, the course of humanity has changed drastically throughout the centuries, leading the human race to where we are today. Yet many people nowadays have no idea where their native languages originated. Spoken by many different communities, most of these languages are descendants of long-forgotten ancestral languages. For instance, Romance languages like French, Italian, Spanish and Romanian derive from Latin. However, most people today do not know how Latin used to sound or what it looked like. For others, on the other hand, ancient languages remain captivating for their mystery and great historical value. Many scientists have devoted their lives to studying them in the hope of unraveling the well-kept secrets behind the curious wedge-shaped symbols and glyphs.

Today we can see most of the ancient languages in old manuscripts, documents, books or even as graffiti carved on the walls of old buildings. Many of these historical writings come from classical antiquity, the period between the 8th century BC and the 6th century AD. Some of the most notable representatives of this era are languages like Sanskrit, Tamil, Ancient Greek, Hebrew, Arabic and of course Latin. Although many linguists would argue that there are no objective criteria for judging any of these languages superior to the others, one can state that all of them had a rich vocabulary and complex grammar. For instance, most of them used three genders (masculine, feminine and neuter), whereas others had six declension cases. These characteristics have been studied by many scholars over time, to the point where most of the abovementioned languages no longer pose a challenge for epigraphers. Yet over the years archaeologists have discovered numerous artifacts bearing other writing systems, some of which remain undeciphered to this day. Unfortunately, even the most advanced machine learning systems may not be able to help decode these writings, owing to the lack of known descendant languages, an insufficient number of discovered texts, or the very nature of the characters found on these objects, which might not be part of a writing system at all. That being said, many might ask whether Artificial Intelligence has any purpose at all when it comes to helping people understand old manuscripts and historical handwriting.

To answer this question, we will take a detailed look at how exactly machine learning is being used by researchers to discover different meanings hidden underneath the words and symbols written or inscribed on artifacts. The following chapter explores the most recent approaches and techniques used in deciphering ancient manuscripts through the help of neural networks.


Deciphering ancient manuscripts with the help of AI

For many years, experts have studied ancient languages with the intention of deciphering valuable historical documents whose content would contribute to the knowledge of society in one way or another. Although researchers have indeed succeeded over time in understanding subtle patterns like those of the Egyptian hieroglyphics or the Maya inscriptions, there are still many lost languages that trouble epigraphers. Luckily, with the advancement of modern technologies, historians nowadays have another powerful tool at their disposal, namely Artificial Intelligence. However, using a particular algorithm does not necessarily mean that one will achieve great results in unravelling the hidden meanings of ancient symbols. This is because some languages have been entirely isolated, and therefore no previous data has been recorded for them. So although technology can indeed contribute to the work of historians, historians will sometimes also have to help the machines by teaching them more about a particular language.

Machine learning models can be trained to decode ancient languages. Their algorithms are usually trained on massive datasets, for instance of 1.5 million characters or images, which they scan in order to learn through associations. However, for such technology to be effective, its reference points must come from a language that has already been deciphered by scholars. For this reason, researchers have trained their algorithms on a known language that shares a root with the one used in the historical document under translation. That way the AI can find words in the known language that are similar, both in the characters they use and in their meaning within a broader context, to words from the undeciphered language. In addition, other scientists have relied on architectures such as capsule networks, used to better model hierarchical relationships, or convolutional neural networks, mostly used for image recognition. Although these and similar systems do not always produce perfect translations, they can considerably reduce the error rate of the overall translation compared with manual translation done by experts.
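The idea of matching words from an unread text against a related, known language can be illustrated with a very small sketch. It is only a toy under stated assumptions: it compares spelling alone using edit distance (the research described above also takes meaning in context into account), the function names `edit_distance` and `best_match` are made up for this illustration, and Latin and Spanish words stand in for the "known" and "unknown" vocabularies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def best_match(word: str, known_vocab: list[str]) -> str:
    """Return the known-language word whose spelling is closest,
    using edit distance normalised by the longer word's length."""
    def score(candidate: str) -> float:
        return edit_distance(word, candidate) / max(len(word), len(candidate))
    return min(known_vocab, key=score)


# Illustrative data only: Latin plays the role of the known language,
# Spanish words play the role of the text to be matched.
latin = ["aqua", "terra", "luna", "sol", "mater"]
for unknown in ["agua", "tierra", "madre"]:
    print(unknown, "->", best_match(unknown, latin))
```

Spelling similarity alone produces many false matches in practice, which is one reason the approaches described above combine it with contextual information and expert knowledge.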

Using machine learning to decipher ancient artifacts is truly a promising step forward, but there are still many prerequisites, such as encyclopedic domain knowledge, parallel data or the mere digitization of ancient documents, before such technology can work independently and automatically transcribe old symbols and letters. Although deciphering all sorts of lost languages might seem unrealistic to some, it may well become possible if technology continues to develop at its current fast pace. To better demonstrate what researchers have achieved by using Artificial Intelligence to decipher old manuscripts, we provide several examples from the past few years below.


Artificial Intelligence and its achievements

Although there are many great ancient mysteries that will probably remain unsolved for years to come, more and more researchers are enthusiastically trying to change that with their high-tech tools and modern technologies. This approach has already been adopted by many scientists in the last few years and has shown great results, revealing more about different aspects of human history than ever before. Below we have gathered several interesting discoveries that were achieved with the help of Artificial Intelligence technologies.


  • The “paperwork” of Persia’s Achaemenid Empire

In 1933, archaeologists from the Oriental Institute of the University of Chicago began an expedition to the ancient city of Persepolis, whose ruins are located in Iran. There, in two small rooms in the fortification wall of the great stone terrace, they found a large number of clay tablets inscribed with cuneiform, one of the earliest systems of writing, first used by the Sumerians in Mesopotamia. The writings on the 2,087 tablets recorded what researchers refer to as the “paperwork” of Persia’s Achaemenid Empire, created twenty-five centuries ago. For many years, scientists have assiduously tried to decipher the ancient documents by carefully studying and translating the wedge-shaped marks inscribed on their surface. Yet this process was very difficult, time-consuming and prone to errors, since it was done mostly by hand without the use of advanced technologies. Even in the 1990s, when scientists tried to bring computers into the deciphering process, their success was still limited due to the three-dimensional nature of the documents and the complexity of the Elamite language. More recently, however, machine learning and Artificial Intelligence have promised a breakthrough. It grew out of a collaboration between researchers from the Oriental Institute and the Department of Computer Science, both of the University of Chicago, who worked on an AI model that would be able to ‘read’ the remaining, unanalyzed tablets in the collection. The model is being developed with the help of a training set of more than 6,000 annotated images taken from the Persepolis Fortification Archive. With the help of this technology, researchers were able to create a dictionary of the Elamite language used on the clay tablets, which can be applied to the deciphering of other ancient documents as well.


  • The 1,700-year-old En-Gedi Scroll

Unfortunately, not all artifacts found by archaeologists are in good condition. Due to centuries of natural forces such as earthquakes, floods, fires, volcanic eruptions and more, many ancient documents, artworks and buildings have degraded past the point of successful examination. Such is the case with the 1,700-year-old En-Gedi Scroll, which researchers managed to read through modern technologies and Artificial Intelligence. The document is one of the oldest fragments of the Old Testament ever to be discovered. Found in 1970 in Ein Gedi, Israel, the parchment has been dated to the third or fourth century CE and contains a portion of the Book of Leviticus, the third book of the Old Testament. Due to its fragile state, caused by a fire in a Jewish synagogue, it was impossible for researchers to study the scroll, as it disintegrated as soon as it was touched. However, several years ago, the content of the delicate artifact was finally recovered thanks to a team of computer scientists from the University of Kentucky. To read the text without physically unfurling the scroll, they used the so-called “virtual unwrapping” technology, a non-invasive method that uses a combination of scans and image-processing algorithms to visualize what lies between the layers of the scroll. Once the team had collected a series of images, they fed them into an algorithm that was able to reconstruct the scroll by determining where one layer ended and another began. The digital recovery revealed that the En-Gedi Scroll was written in Hebrew and contained 18 complete and 17 partial lines of the first two chapters of the Book of Leviticus.


  • In Codice Ratio

Many people have always wondered what treasures lie behind the closed doors of the Vatican Secret Archives. Eighty-five kilometers of shelving with thirty-five thousand volumes of catalogues, state papers, correspondence, and many other books and documents are enough to make any scholar’s eyes go wide as soon as they set foot in the archives. Although the pope owns the material held in the archive, scholars and researchers are also allowed to enter under specific conditions. They may request only three specific documents per day instead of freely browsing the contents of the archive. Such restrictions can really slow down the process of deciphering the massive collection of documents for scientific purposes, which is why a team of researchers from the Archives and Roma Tre University came up with a project that could address this issue and help solve a centuries-old problem. In Codice Ratio, as the project is called, focuses on using AI for the automatic transcription of historical handwriting. The system aims at developing new methods and techniques to support content analysis and knowledge discovery from any type of historical document. To do so, the research project used state-of-the-art technologies like deep convolutional neural networks, statistical language models, and fine-grained segmentation. When the system was tested on the Vatican Registers, it produced exact transcriptions for 65% of the word images. Although this percentage is far from perfect, the researchers believe that the results are accurate enough to provide paleographers with a tool that can significantly reduce the time and effort they need to put into transcribing historical documents, especially large volumes of them.
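To make a figure such as "65% exact transcription of the word images" concrete, here is a minimal sketch of how an exact-match rate over words can be computed. It assumes the metric simply counts a word as correct only when every character matches; the project's actual evaluation procedure is not described in this article, and the function name and the sample words are hypothetical.

```python
def exact_match_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of word images whose predicted transcription is
    identical, character for character, to the reference one."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical sample: four word images, one transcribed incorrectly.
reference = ["in", "nomine", "domini", "amen"]
predicted = ["in", "nomine", "dommi", "amen"]
print(f"{exact_match_rate(predicted, reference):.0%}")
```

Under this strict measure a single wrong letter makes the whole word count as an error, so a transcription that misses the exact-match criterion can still be close enough to help a paleographer.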


  • Medieval graffiti on the walls of St. Sophia’s Cathedral in Kiev

St. Sophia’s Cathedral is one of the most significant historical sites in Ukraine’s capital city Kiev. Known for its asymmetrical green domes with gold-capped spires, the 11th-century house of worship has some 300 pieces of medieval graffiti scratched onto its stone walls. For years experts assigned different meanings to these graffiti based on their personal interpretations, which in turn caused many debates as to which of them most accurately represented the meanings of the carved texts. One thing was for sure: the handwritten graffiti and images found on the surface of the cathedral were a very powerful source of historical information. This was also the main motive behind the creation of a machine learning model that was able to decipher the medieval graffiti. The model was developed by researchers from the National Technical University of Ukraine and Huizhou University’s School of Information Science and Technology, who applied Artificial Intelligence to the automatic recognition of the letters. The technology they used is called a capsule deep learning neural network. It was trained on a dataset of more than 4,000 images of 34 glyphs from the Glagolitic and Cyrillic alphabets, both of which were used in the graffiti. Such technology has been shown to produce low error rates even for complex handwritten graffiti, and it allows scientists to make predictions with considerably better accuracy than the technologies previously used to decipher the symbols. The results suggest that the graffiti reflected the thoughts of locals at the time. For instance, part of the text described the hopes of a young woman to attract a male suitor. Another marking, believed to be the most unique one left on the walls of St. Sophia, is an announcement of the death of Yaroslav the Wise, the son of Volodymyr the Great. Both were Grand Princes of Kiev and the people who built the cathedral.
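For readers curious about what distinguishes a capsule network, the sketch below shows the "squash" function from the standard capsule network formulation. This is an assumption about the general technique, not a description of the model built for the St. Sophia graffiti: each capsule outputs a vector, and squashing rescales it so that its length lies between 0 and 1 and can be read as the confidence that a feature (for example, a particular glyph) is present. The function names are chosen for this illustration.

```python
import math


def vector_length(v: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def squash(s: list[float]) -> list[float]:
    """Capsule 'squash' non-linearity:
    v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Keeps the direction of s, maps its length into [0, 1)."""
    norm = vector_length(s)
    if norm == 0.0:
        return [0.0 for _ in s]  # avoid division by zero
    scale = (norm * norm) / (1.0 + norm * norm) / norm
    return [scale * x for x in s]


weak = squash([0.1, 0.0])    # short input vector -> low confidence
strong = squash([3.0, 4.0])  # long input vector -> confidence near 1
print(round(vector_length(weak), 4), round(vector_length(strong), 4))
```

Long input vectors end up with a length close to 1 and short ones close to 0, while the direction of the vector is preserved to encode properties such as the orientation or stroke shape of the detected feature.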


Artificial Intelligence vs human knowledge: Can AI independently decipher lost languages?

Artificial Intelligence can truly facilitate scientists’ efforts to decipher old manuscripts and documents written in ancient languages. As seen with the examples above, a lot of advancements in paleography have been made with the incorporation of modern technology and in particular neural networks that can discover more precise meanings of the symbols inscribed on historical artifacts. Yet, as explained, these contemporary methods require machines to be taught first before they are able to examine any text. Therefore the traditional way of humans exploring the context of an old document through their knowledge of ancient languages is still necessary. Regardless of the rich dataset of millions of symbols, letters or words that can be fed to an algorithm, a lost language cannot be deciphered through machine learning unless researchers have previously assigned meanings to these resources.
