
Showing 1–13 of 13 results for author: Prasojo, R E

  1. Investigating Text Shortening Strategy in BERT: Truncation vs Summarization

    Authors: Mirza Alim Mutasodirin, Radityo Eko Prasojo

    Abstract: The parallelism of Transformer-based models comes at the cost of their input max-length. Some studies proposed methods to overcome this limitation, but none of them reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks. Each of the two was investigated with several variation…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: The 13th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2021)

  2. Simple Hack for Transformers against Heavy Long-Text Classification on a Time- and Memory-Limited GPU Service

    Authors: Mirza Alim Mutasodirin, Radityo Eko Prasojo, Achmad F. Abka, Hanif Rasyidi

    Abstract: Many NLP researchers rely on free computational services, such as Google Colab, to fine-tune their Transformer models, causing a limitation for hyperparameter optimization (HPO) in long-text classification due to the method having quadratic complexity and needing a bigger resource. In Indonesian, only a few works were found on long-text classification using Transformers. Most only use a small amou…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: The 10th International Conference on Advanced Informatics: Concepts, Theory, and Applications (ICAICTA 2023)

  3. arXiv:2311.01012  [pdf, other]

    cs.CL

    COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances

    Authors: Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, Alham Fikri Aji

    Abstract: We present COPAL-ID, a novel, public Indonesian language common sense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances, and therefore, provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Professionally written by natives from scratch, COPAL-ID is more fluent and…

    Submitted 21 April, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: 8 pages, Camera Ready (NAACL 2024 - Main)

    MSC Class: 68T50

  4. arXiv:2306.02870  [pdf, ps, other]

    cs.CL

    On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

    Authors: Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Alham Fikri Aji, Genta Indra Winata, Radityo Eko Prasojo, Phil Blunsom, Adhiguna Kuncoro

    Abstract: This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between differ…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted at ACL 2023

  5. arXiv:2205.15960  [pdf, other]

    cs.CL

    NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

    Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

    Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing re…

    Submitted 12 April, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: EACL 2023

  6. arXiv:2205.04651  [pdf, other]

    cs.CL

    ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair

    Authors: Alham Fikri Aji, Tirana Noor Fatyanosa, Radityo Eko Prasojo, Philip Arthur, Suci Fitriany, Salma Qonitah, Nadhifa Zulfa, Tomi Santoso, Mahendra Data

    Abstract: We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, hence simple to apply. We generate multiple translation samples…

    Submitted 9 May, 2022; originally announced May 2022.

    Comments: 10 pages, 3 figures, 6 tables. Accepted at PACLIC 2021. (ACL Anthology link: https://aclanthology.org/2021.paclic-1.56/)

    MSC Class: 68T50 ACM Class: I.2.7; I.2.6

  7. arXiv:2203.15643  [pdf, other]

    cs.SD cs.CL cs.LG cs.NE eess.AS

    Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

    Authors: Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti

    Abstract: Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches non-optimum size or use a neural architecture search but often suffer training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specif…

    Submitted 5 November, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at SLT 2022 (https://slt2022.org/). Associated materials can be seen in https://github.com/rendchevi/nix-tts

    MSC Class: 68T50 (Primary) 68T07; 68T10; 68T99 (Secondary) ACM Class: I.2.7; I.2.6; H.5.5

  8. arXiv:2203.13357  [pdf, other]

    cs.CL

    One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

    Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder

    Abstract: NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian N…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted in ACL 2022

  9. arXiv:2201.00558  [pdf, other]

    cs.CL

    Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models

    Authors: Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

    Abstract: We perform knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiment involves 12 datasets grouped in two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillations including the usage of word embeddings and unlabeled…

    Submitted 3 January, 2022; originally announced January 2022.

    Comments: 14 pages, 3 figures, submitted to Elsevier

    MSC Class: 68T50 ACM Class: I.2.7; I.2.6

  10. arXiv:2012.15178  [pdf, other]

    cs.CL cs.LG

    Synthetic Source Language Augmentation for Colloquial Neural Machine Translation

    Authors: Asrul Sani Ariesandy, Mukhlis Amien, Alham Fikri Aji, Radityo Eko Prasojo

    Abstract: Neural machine translation (NMT) is typically domain-dependent and style-dependent, and it requires lots of training data. State-of-the-art NMT models often fall short in handling colloquial variations of its source language and the lack of parallel data in this regard is a challenging hurdle in systematically improving the existing models. In this work, we develop a novel colloquial Indonesian-En…

    Submitted 30 December, 2020; originally announced December 2020.

    Comments: 5 pages

    MSC Class: 68T50 ACM Class: I.2.7; I.2.6

  11. arXiv:2012.08958  [pdf, ps, other]

    cs.CL

    Costs to Consider in Adopting NLP for Your Business

    Authors: Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Radityo Eko Prasojo, Alham Fikri Aji

    Abstract: Recent advances in Natural Language Processing (NLP) have largely pushed deep transformer-based models as the go-to state-of-the-art technique without much regard to the production and utilization cost. Companies planning to adopt these methods into their business face difficulties because of the lack of machine, data, and human resources to build them. We compare both the performance and the cost…

    Submitted 14 April, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

    Comments: 12 pages, 2 figures

    MSC Class: 68T50 ACM Class: I.2.7; I.2.6

  12. arXiv:2011.03286  [pdf, other]

    cs.CL

    Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation

    Authors: Haryo Akbarianto Wibowo, Tatag Aziz Prawiro, Muhammad Ihsan, Alham Fikri Aji, Radityo Eko Prasojo, Rahmad Mahendra, Suci Fitriany

    Abstract: In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, current available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address a style-transfer from informal to formal Indonesian as a low-resource machine translation probl…

    Submitted 22 December, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages, Camera ready to be presented at IALP 2020

    MSC Class: 68T50

  13. arXiv:1604.08377  [pdf, other]

    cs.DB

    Enabling Fine-grained RDF Data Completeness Assessment

    Authors: Fariz Darari, Simon Razniewski, Radityo Eko Prasojo, Werner Nutt

    Abstract: Nowadays, more and more RDF data is becoming available on the Semantic Web. While the Semantic Web is generally incomplete by nature, on certain topics, it already contains complete information and thus, queries may return all answers that exist in reality. In this paper we develop a technique to check query completeness based on RDF data annotated with completeness information, taking into accoun…

    Submitted 28 April, 2016; originally announced April 2016.

    Comments: This is a preprint version of a paper published in the Proceedings of the 16th International Conference on Web Engineering (ICWE 2016)