AI-powered Baybayin translator being developed by UP mathematicians


by Eunice Jean Patron, UPD-CS SciComm

Filipino mathematicians have just invented a computerized method for converting entire paragraphs and even full documents written in the ancient Filipino Baybayin writing system into text that even non-native readers can easily understand. And they’re now hard at work developing a full two-way translator.

By combining mathematics and technology, scientists from the University of the Philippines – Diliman College of Science Institute of Mathematics (UPD-CS IM) have made what is likely the world’s first paragraph-level optical character recognition (OCR) system that can distinguish between entire blocks of Baybayin and Latin characters in a text image.

Thousands of images, months of hard work

In their paper, “Block-level Optical Character Recognition System for Automatic Transliterations of Baybayin Texts Using Support Vector Machine,” master’s student Rodney Pino and associate professors Dr. Renier Mendoza and Dr. Rachelle Sambayan developed an algorithm that converts a photograph of text into binary data, which is then run through a support vector machine (SVM) character classifier to automatically determine whether the characters are Baybayin or Latin.
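
As a rough illustration of the first step, the sketch below turns a grayscale image into binary data using a fixed threshold. The article does not say how the team binarizes photographs, so the function name `binarize`, the threshold of 128, and the list-of-rows image format are all assumptions made for this example.

```python
def binarize(gray_image, threshold=128):
    """Convert a grayscale image to binary data.

    gray_image: list of rows, each a list of ints from 0 (black) to 255 (white).
    Returns the same shape with 1 for dark "ink" pixels and 0 for background.
    The fixed threshold is an assumption, not the method from the paper.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_image]


if __name__ == "__main__":
    # A tiny hypothetical 2x3 image: dark stroke on the left, light paper on the right.
    sample = [[10, 40, 250],
              [20, 240, 255]]
    for row in binarize(sample):
        print(row)
```

A binary grid like this can then be flattened into a vector of 0s and 1s and passed to a classifier.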

“SVM is a machine learning algorithm used to solve regression or classification problems,” Pino explained. “We have a dataset for Baybayin characters—let’s say character A and then character BA. SVM uses techniques or mathematical methods that can separate the two datasets to determine characters BA and A.”
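
To make Pino's description more concrete, here is a minimal sketch of a linear SVM that learns a line separating two small datasets. It is not the team's implementation: the two-number feature vectors, the labels (+1 for character A, -1 for character BA), the function names, and the training settings are all hypothetical, and the training loop is plain subgradient descent on the hinge loss rather than whatever solver the paper used.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: push the boundary away from it.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularization shrinkage.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the separating line x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy features: +1 stands for character A, -1 for character BA.
SAMPLES = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
LABELS = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(SAMPLES, LABELS)
    print("prediction for a new point:", predict(w, b, (4.0, 4.0)))
```

In a real OCR setting each sample would be a much longer vector, for example the flattened pixels of a character image, and there would be one class per character rather than two.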

It took the group more than three months to collect over a thousand images for each Baybayin character, along with a total of 110 paragraphs from different websites containing handwritten or typewritten text in Baybayin, Latin, or both. “Adding more character images improves the recognition rate of SVM,” Pino explained.

Developing a smart, two-way translator

Currently, the OCR system can spell out the Latin equivalent of the Baybayin characters on a page, thus producing a transliterated version of the text. But the researchers are looking to enable it to do so much more.

The mathematicians also plan to make the OCR system more aware of the context of Baybayin words and phrases, possibly paving the way for a full-fledged translator. They are also trying to make the system work both ways, with the ability to convert Latin words with foreign sounds into Baybayin.

“We’re trying to refine the software we developed to make it easier for future users to navigate it. We also dream of creating a mobile application that automatically and accurately translates Baybayin characters just by hovering over the phone,” Dr. Mendoza said.

However, there are some kinks to iron out: Dr. Mendoza said that it was challenging to get the OCR system to translate Baybayin words and sentences accurately. “For now the system can’t distinguish between some Baybayin characters that are similar in writing, such as E and I, and O and U. We also have a lot of words that have different Latin equivalents,” he explained. “The algorithm we used shows all possible translations of the Baybayin words.”
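
Listing every possible reading of an ambiguous word can be sketched as a Cartesian product over the vowel choices of each syllable. This is an illustration only, not the team's algorithm: the syllable encoding, the `VOWEL_OPTIONS` table, and the function `candidate_spellings` are hypothetical names introduced for this example.

```python
from itertools import product

# Hypothetical vowel classes: characters the recognizer cannot tell apart share one key.
VOWEL_OPTIONS = {
    "a": ["a"],
    "e/i": ["e", "i"],
    "o/u": ["o", "u"],
    "": [""],  # consonant with no vowel sound
}


def candidate_spellings(syllables):
    """Return every Latin spelling consistent with the recognized syllables.

    syllables: list of (consonant, vowel_class) pairs, where consonant may be ""
    for a standalone vowel and vowel_class is a key of VOWEL_OPTIONS.
    """
    choices = [
        [consonant + vowel for vowel in VOWEL_OPTIONS[vowel_class]]
        for consonant, vowel_class in syllables
    ]
    return ["".join(parts) for parts in product(*choices)]


if __name__ == "__main__":
    # ba + ba + (e or i): two candidate spellings.
    print(candidate_spellings([("b", "a"), ("b", "a"), ("", "e/i")]))
```

The number of candidates doubles with every ambiguous syllable, which is one reason context awareness would help narrow the list to the intended word.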

Preserving Filipino writing systems

Although still scant, interest in and research on Baybayin are slowly increasing, making the mathematicians hopeful that more Filipinos will become interested in protecting Baybayin through research. The team published their data to encourage more researchers to conduct studies on Baybayin and OCR. “We cleaned the data in such a way that researchers could use it in analyzing Baybayin through other algorithms,” Dr. Mendoza shared. “We made the data readily available for use, so researchers wouldn’t go through the difficulty we experienced in gathering data.”

Philippine traditional writing systems, such as Baybayin, are representations of Filipino tradition and national identity. As such, lawmakers have proposed the “Philippine Indigenous and Traditional Writing Systems Act,” which seeks to promote, protect, and preserve Baybayin and other traditional writing systems.

The proposed law urges using Baybayin as a tool for cultural development and safeguarding, therefore encouraging organizations and institutions to spearhead activities and projects that promote awareness of these traditional writing systems.

According to the scientists, Baybayin is living proof that we Filipinos have our own technically sophisticated traditions. While they are not proposing that Baybayin become the Philippines’ primary writing system, the group believes that conducting more research on Baybayin will help preserve this heritage. “This can be forgotten,” Dr. Sambayan said. “It’s important to have a record of each Baybayin character, even having digitized ones.”

Dr. Sambayan expressed concern that the number of Filipinos who can read and write Baybayin is decreasing, adding to the importance of identifying and translating Baybayin characters into Latin. “We’re hoping that through this OCR system, we could preserve and pass on the knowledge of understanding Baybayin to future Filipino generations,” she said.

Baybayin and other traditional writing systems are a part of the Philippines’ rich history. Several old Filipino documents are in Baybayin, documents that can uncover more information about Filipino culture. The scientists are encouraging more Filipinos to join them in cultivating the body of knowledge the country has on Baybayin. “If no one does this, who will? Even though its implication already has a bit of a niche, I think this is still a vital research venture,” Dr. Mendoza said.

For interview requests and other concerns, please contact media@science.upd.edu.ph.

Sources:

Pino, R., Mendoza, R., & Sambayan, R. (2022). Block-Level Optical Character Recognition System for Automatic Transliteration of Baybayin Texts using Support Vector Machine. Philippine Journal of Science, 151(1), 303-315.

Philippine Indigenous and Traditional Writing Systems Act, S. 1680, 19th Cong. (2022).
