Hi there! I am Ayaka, a 25-year-old researcher in computer science, historical linguistics, and mathematics.
I have made significant contributions to the open-source community. I have created numerous open-source projects on GitHub and have hosted several websites and web services at my own expense. My open-source contributions span various fields, including deep learning, natural language processing, language conservation, historical linguistics, and computational linguistics.
My expertise in deep learning is reflected in my familiarity with JAX and Google Cloud TPU. I actively submit bug reports, participate in feature discussions, and answer questions in the JAX and Google Cloud TPU communities. In addition, I created TPU Starter, a comprehensive guide that has helped many people get started with JAX and Google Cloud TPU. The guide has been translated into Korean and Chinese. To improve the JAX user experience, I also developed jax-smi, a tool that monitors the memory usage of JAX programs in real time, providing an experience similar to that of nvidia-smi. These contributions earned me the 2023 Google Open Source Peer Bonus Award.
In natural language processing, I have contributed to the Hugging Face Transformers library and released several NLP models. I have also reimplemented the BART and Llama 2 models, and collaborated on a reimplementation of the Mistral model, all from scratch in pure JAX. These projects provide high-quality open-source codebases for deep learning researchers and engineers, and demonstrate how Transformer models can be implemented in JAX and trained on Google Cloud TPUs. In addition, I implemented the BERT model from scratch in NumPy and ran its inference in the browser with Pyodide, which allowed me to create TrAVis, a BERT attention visualiser that runs entirely within a browser. The visualiser gives researchers an intuitive view of BERT's attention mechanism.
I constantly keep up with the most advanced AI technologies. I was an early adopter of ChatGPT, a state-of-the-art large language model, and have been studying it since its release. I am a co-author of the open-source Better ChatGPT website. Built on the ChatGPT API, this website offers many advanced features and greatly enhances the ChatGPT user experience. It has garnered over 8,000 stars on GitHub and is used by millions of users worldwide.
My expertise in NLP also extends to language conservation. I trained a BART model for Cantonese, a low-resource language, and released it on the Hugging Face Hub. Building upon this, I proposed TransCan, an English-to-Cantonese machine translation model that outperforms the state-of-the-art commercial machine translation system by 11.8 BLEU. The model has been released on GitHub, benefiting both the Cantonese community and the wider low-resource NLP community.
In addition to language models, I have created several datasets. In the LIHKG Scraper project, I circumvented many layers of Cloudflare's restrictions to scrape LIHKG, one of the most popular Cantonese forums in Hong Kong, resulting in a corpus of 172,937,863 unique sentences. I have also created two English-Cantonese parallel corpora, Words.hk and ABC Cantonese.
Moreover, for the conservation of Hainanese and Hakka, I engineered web-scraping programs to regularly fetch the latest TV news of Wenchang and Xingning, which are broadcast in their local dialects.
I have also made considerable contributions to the field of historical linguistics. I founded the open-source organisation nk2028, which has attracted a community of experts in historical linguistics. In nk2028, we have conducted pioneering research in the field of Middle Chinese phonology. We formalised the phonological positions of the Tshet-uinh phonological system as 6-tuples, which allowed us to accurately analyse the sound changes that have happened throughout the history of the Chinese language.
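As a rough illustration of the 6-tuple idea, a phonological position can be modelled as an immutable record with six fields. The field names below (initial, rounding, division, kind, rhyme, tone), the `describe` helper, and the sample value are assumptions made for this sketch; they are not taken from the nk2028 codebase.

```python
from typing import NamedTuple, Optional


class Position(NamedTuple):
    """Hypothetical 6-tuple for one phonological position (field names assumed)."""
    initial: str             # initial consonant category
    rounding: Optional[str]  # None when the distinction does not apply
    division: str            # division (grade) of the rhyme table
    kind: Optional[str]      # extra contrast, None when it does not apply
    rhyme: str               # rhyme category
    tone: str                # tone category


def describe(pos: Position) -> str:
    """Join the fields that are present into a compact label."""
    return "".join(field for field in pos if field is not None)


# Illustrative sample value; treat it as an assumption, not as reference data.
dong = Position(initial="端", rounding=None, division="一",
                kind=None, rhyme="東", tone="平")

print(describe(dong))
```

Because the record is immutable and hashable, positions can be compared directly or used as dictionary keys when a sound change rule maps each position to a reflex.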
Moreover, in the process of putting this system into practice, we explored different methods of representing sound change laws in computer programs. Initially, we designed a domain-specific language in PureScript and used SQLite as the database. In subsequent research, we simplified our approach by designing a novel JavaScript library, which greatly enhanced productivity.
Building on this library, we released the Tshet-uinh Autoderiver website, which allows community members to contribute sound change laws for various languages. The website has invigorated the community and attracted many people to this field. To help beginners master the Tshet-uinh phonological system, we also published a number of tools, such as a tool to automate the process of puonq-tshet, a tool to generate Tshet-uinh flashcards, and a tool to look up Tshet-uinh phonological positions.
In nk2028, I have also made contributions to other areas of linguistics. In dialectology, we took over the discontinued MCPDict project and released the Chinese Dialect Pronunciation Atlas. For classical Chinese, with the consent of the data provider, the Sou-Yun website, we published ORCHESTRA, a comprehensive dataset of classical Chinese poetry. For phonetics, we created an IPA Online Practice System and a Putonghua IPA Converter.
I have also maintained the simplified-traditional Chinese conversion project OpenCC and its successor StarCC. These projects accurately handle the one-to-many mappings that arise in simplified-traditional Chinese conversion. On top of this, drawing on my in-depth understanding of OpenType font features, I proposed a novel approach that lets simplified-to-traditional conversion fonts handle these one-to-many mappings. Based on this approach, I produced two simplified-to-traditional conversion fonts, Fan Wun Ming and Fan Wun Hak. The approach has since been adopted by other font developers in the typographic community.
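To make the one-to-many problem concrete: the simplified character 发 corresponds to 發 in 发现 but to 髮 in 头发, so a character-by-character table is not enough. One common way to handle this is to consult a phrase table first, using greedy longest match, and fall back to single characters. The sketch below uses tiny hypothetical tables (`PHRASES`, `CHARS`) and a hypothetical `to_traditional` function; it illustrates the general technique only and is not the actual implementation of OpenCC or StarCC.

```python
# Hypothetical, deliberately tiny tables for illustration only.
PHRASES = {"头发": "頭髮", "理发": "理髮", "发现": "發現"}
CHARS = {"发": "發", "头": "頭", "现": "現"}  # default single-character choices


def to_traditional(text: str) -> str:
    """Convert with greedy longest phrase match, then per-character fallback."""
    max_len = max(map(len, PHRASES))
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here: convert a single character,
            # leaving it unchanged if it is not in the table.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("发现头发"))
```

The phrase table resolves the ambiguous character by context, while the character table supplies a default for everything else.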
For Cantonese, I published cantoseg, an effective Cantonese segmentation tool. I have also created two tools, namely ToJyutping and Inject Jyutping, which aid Cantonese learners in mastering the pronunciation of Chinese characters.
I am an active contributor to the rime input method community. As a member of the CanCLID organisation, I maintain rime-cantonese, a rime input schema for Cantonese. I have also released input schemata for TUPA, Loengfan, Mandarin, and Nüshu. Drawing on my knowledge of C++ and Python, I developed librime-python, a rime Python plugin that allows users to control the behaviour of the rime program through simple Python scripts. Moreover, I have curated awesome-rime, a comprehensive list of rime schemata and configs that gathers the efforts of the rime community.
My open-source contributions extend to my other areas of interest as well. With a deep understanding of the x64 instruction set and the Windows PE file format, I crafted the smallest 64-bit PE file on Windows 10 in assembly language. The file is a Windows executable of merely 268 bytes that runs normally and pops up a message box. Moreover, I proposed the Nya Calendar, a lunisolar-mercurial calendar that takes into account the synodic period of the Earth and Mercury and has several unique properties.
In addition, I have contributed to the Arch Linux community by maintaining several AUR packages. I host several open-source websites and web services at my own expense, including the Online Nushu Dictionary website, a Graphviz server, a Telegram translation bot, and an instance of the Shieldy bot.
If you want to know more about me and explore my other passions and interests, feel free to visit my personal website!