Search | arXiv e-print repository

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu

Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can wat… ▽ More Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM. △ Less

Submitted 25 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

Comments: Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code: https://github.com/SCAI-JHU/MuMA-ToM

arXiv:2408.08992 [pdf, other]

SpreadLine: Visualizing Egocentric Dynamic Influence

Authors: Yun-Hsin Kuo, Dongyu Liu, Kwan-Liu Ma

Abstract: Egocentric networks, often visualized as node-link diagrams, portray the complex relationship (link) dynamics between an entity (node) and others. However, common analytics tasks are multifaceted, encompassing interactions among four key aspects: strength, function, structure, and content. Current node-link visualization designs may fall short, focusing narrowly on certain aspects and neglecting t… ▽ More Egocentric networks, often visualized as node-link diagrams, portray the complex relationship (link) dynamics between an entity (node) and others. However, common analytics tasks are multifaceted, encompassing interactions among four key aspects: strength, function, structure, and content. Current node-link visualization designs may fall short, focusing narrowly on certain aspects and neglecting the holistic, dynamic nature of egocentric networks. To bridge this gap, we introduce SpreadLine, a novel visualization framework designed to enable the visual exploration of egocentric networks from these four aspects at the microscopic level. Leveraging the intuitive appeal of storyline visualizations, SpreadLine adopts a storyline-based design to represent entities and their evolving relationships. We further encode essential topological information in the layout and condense the contextual information in a metro map metaphor, allowing for a more engaging and effective way to explore temporal and attribute-based information. To guide our work, with a thorough review of pertinent literature, we have distilled a task taxonomy that addresses the analytical needs specific to egocentric network exploration. Acknowledging the diverse analytical requirements of users, SpreadLine offers customizable encodings to enable users to tailor the framework for their tasks. We demonstrate the efficacy and general applicability of SpreadLine through three diverse real-world case studies (disease surveillance, social media trends, and academic career evolution) and a usability study. △ Less

Submitted 16 August, 2024; originally announced August 2024.

Comments: To appear in VIS 2024 and IEEE Transactions on Visualization and Computer Graphics

arXiv:2407.13729 [pdf, other]

Baba Is AI: Break the Rules to Beat the Benchmark

Authors: Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases, Andrei Barbu, Christopher J. Cueva

Abstract: Humans solve problems by following existing rules and procedures, and also by leaps of creativity to redefine those rules and objectives. To probe these abilities, we developed a new benchmark based on the game Baba Is You where an agent manipulates both objects in the environment and rules, represented by movable tiles with words written on them, to reach a specified goal and win the game. We tes… ▽ More Humans solve problems by following existing rules and procedures, and also by leaps of creativity to redefine those rules and objectives. To probe these abilities, we developed a new benchmark based on the game Baba Is You where an agent manipulates both objects in the environment and rules, represented by movable tiles with words written on them, to reach a specified goal and win the game. We test three state-of-the-art multi-modal large language models (OpenAI GPT-4o, Google Gemini-1.5-Pro and Gemini-1.5-Flash) and find that they fail dramatically when generalization requires that the rules of the game must be manipulated and combined. △ Less

Submitted 18 July, 2024; originally announced July 2024.

Comments: 8 pages, 8 figures

arXiv:2406.18089 [pdf, other]

A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons

Authors: Tzu-Yun Hung, Jui-Te Wu, Yu-Chia Kuo, Yo-Wei Hsiao, Ting-Wei Lin, Li Su

Abstract: Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the syn… ▽ More Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making it essential to conduct a comparative study on EMS model design. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (in total, three model variants) through objective and subjective experiments and discuss several key issues of EMS based on the results. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: 15 pages, 2 figures, 3 tables

arXiv:2406.06375 [pdf, other]

doi 10.1109/TASLP.2024.3407529

MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing

Authors: Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su

Abstract: In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music m… ▽ More In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset). △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset/tree/main and https://zenodo.org/records/11393449

arXiv:2404.07351 [pdf, other]

doi 10.1145/3649902.3653439

A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos

Authors: Suleyman Ozdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, Enkelejda Kasneci

Abstract: Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we i… ▽ More Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important. To effectively automate the process of video analysis based on eye-tracking data, it is important to accurately replicate human gaze behavior. However, this task presents significant challenges due to the inherent complexity and ambiguity of human gaze patterns. In this work, we introduce a novel method for simulating human gaze behavior. Our approach uses a transformer-based reinforcement learning algorithm to train an agent that acts as a human observer, with the primary role of watching videos and simulating human gaze behavior. We employed an eye-tracking dataset gathered from videos generated by the VirtualHome simulator, with a primary focus on activity recognition. Our experimental results demonstrate the effectiveness of our gaze prediction method by highlighting its capability to replicate human gaze behavior and its applicability for downstream tasks where real human-gaze is used as input. △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: 2024 Symposium on Eye Tracking Research and Applications (ETRA24), Glasgow, United Kingdom

arXiv:2404.07347 [pdf, other]

doi 10.1145/3649902.3653340

Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

Authors: Suleyman Ozdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, Enkelejda Kasneci

Abstract: Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial v… ▽ More Humans utilize their gaze to concentrate on essential information while perceiving and interpreting intentions in videos. Incorporating human gaze into computational algorithms can significantly enhance model performance in video understanding tasks. In this work, we address a challenging and innovative task in video understanding: predicting the actions of an agent in a video based on a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input. Our method utilizes a Graph Neural Network to recognize the agent's intention and predict the action sequence to fulfill this intention. To assess the efficiency of our approach, we collect a dataset containing household activities generated in the VirtualHome environment, accompanied by human gaze data of viewing videos. Our method outperforms state-of-the-art techniques, achieving a 7\% improvement in accuracy for 18-class intention recognition. This highlights the efficiency of our method in learning important features from human gaze data. △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: 2024 Symposium on Eye Tracking Research and Applications (ETRA24), Glasgow, United Kingdom

arXiv:2401.13280 [pdf, other]

DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection

Authors: Ming-Chang Chiu, Yingfei Wang, Yen-Ju Kuo, Pin-Yu Chen

Abstract: Skin tone as a demographic bias and inconsistent human labeling poses challenges in dermatology AI. We take another angle to investigate color contrast's impact, beyond skin tones, on malignancy detection in skin disease datasets: We hypothesize that in addition to skin tones, the color difference between the lesion area and skin also plays a role in malignancy detection performance of dermatology… ▽ More Skin tone as a demographic bias and inconsistent human labeling poses challenges in dermatology AI. We take another angle to investigate color contrast's impact, beyond skin tones, on malignancy detection in skin disease datasets: We hypothesize that in addition to skin tones, the color difference between the lesion area and skin also plays a role in malignancy detection performance of dermatology AI models. To study this, we first propose a robust labeling method to quantify color contrast scores of each image and validate our method by showing small labeling variations. More importantly, applying our method to \textit{the only} diverse-skin tone and pathologically-confirmed skin disease dataset DDI, yields \textbf{DDI-CoCo Dataset}, and we observe a performance gap between the high and low color difference groups. This disparity remains consistent across various state-of-the-art (SoTA) image classification models, which supports our hypothesis. Furthermore, we study the interaction between skin tone and color difference effects and suggest that color difference can be an additional reason behind model performance bias between skin tones. Our work provides a complementary angle to dermatology AI for improving skin disease detection. △ Less

Submitted 24 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 2 tables, Accepted to ICASSP 2024

arXiv:2401.08743 [pdf, other]

MMToM-QA: Multimodal Theory of Mind Question Answering

Authors: Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than v… ▽ More Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models. △ Less

Submitted 15 June, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: ACL 2024. 26 pages, 11 figures, 7 tables

arXiv:2309.10858 [pdf, other]

On-device Real-time Custom Hand Gesture Recognition

Authors: Esha Uboweja, David Tian, Qifei Wang, Yi-Chun Kuo, Joe Zou, Lu Wang, George Sung, Matthias Grundmann

Abstract: Most existing hand gesture recognition (HGR) systems are limited to a predefined set of gestures. However, users and developers often want to recognize new, unseen gestures. This is challenging due to the vast diversity of all plausible hand shapes, e.g. it is impossible for developers to include all hand gestures in a predefined list. In this paper, we present a user-friendly framework that lets… ▽ More Most existing hand gesture recognition (HGR) systems are limited to a predefined set of gestures. However, users and developers often want to recognize new, unseen gestures. This is challenging due to the vast diversity of all plausible hand shapes, e.g. it is impossible for developers to include all hand gestures in a predefined list. In this paper, we present a user-friendly framework that lets users easily customize and deploy their own gesture recognition pipeline. Our framework provides a pre-trained single-hand embedding model that can be fine-tuned for custom gesture recognition. Users can perform gestures in front of a webcam to collect a small amount of images per gesture. We also offer a low-code solution to train and deploy the custom gesture recognition model. This makes it easy for users with limited ML expertise to use our framework. We further provide a no-code web front-end for users without any ML expertise. This makes it even easier to build and test the end-to-end pipeline. The resulting custom HGR is then ready to be run on-device for real-time scenarios. This can be done by calling a simple function in our open-sourced model inference API, MediaPipe Tasks. This entire process only takes a few minutes. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: 5 pages, 6 figures; Accepted to ICCV Workshop on Computer Vision for Metaverse, Paris, France, 2023

arXiv:2309.05739 [pdf, other]

VisActs: Describing Intent in Communicative Visualization

Authors: Keshav Dasu, Yun-Hsin Kuo, Kwan-Liu Ma

Abstract: Data visualization can be defined as the visual communication of information. One important barometer for the success of a visualization is whether the intents of the communicator(s) are faithfully conveyed. The processes of constructing and displaying visualizations have been widely studied by our community. However, due to the lack of consistency in this literature, there is a growing acknowledg… ▽ More Data visualization can be defined as the visual communication of information. One important barometer for the success of a visualization is whether the intents of the communicator(s) are faithfully conveyed. The processes of constructing and displaying visualizations have been widely studied by our community. However, due to the lack of consistency in this literature, there is a growing acknowledgment of a need for frameworks and methodologies for classifying and formalizing the communicative component of visualization. This work focuses on intent and introduces how this concept in communicative visualization mirrors concepts in linguistics. We construct a mapping between the two spaces that enables us to leverage relevant frameworks to apply to visualization. We describe this translation as using the philosophy of language as a base for explaining communication in visualization. Furthermore, we illustrate the benefits and point out several prospective research directions. △ Less

Submitted 11 September, 2023; originally announced September 2023.

Comments: Currently pending review

arXiv:2308.11071 [pdf, other]

Neural Amortized Inference for Nested Multi-agent Reasoning

Authors: Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua B. Tenenbaum, Tianmin Shu

Abstract: Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans ef… ▽ More Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: 8 pages, 10 figures

arXiv:2308.07557 [pdf, other]

Character-Oriented Design for Visual Data Storytelling

Authors: Keshav Dasu, Yun-Hsin Kuo, Kwan-Liu Ma

Abstract: When telling a data story, an author has an intention they seek to convey to an audience. This intention can be of many forms such as to persuade, to educate, to inform, or even to entertain. In addition to expressing their intention, the story plot must balance being consumable and enjoyable while preserving scientific integrity. In data stories, numerous methods have been identified for construc… ▽ More When telling a data story, an author has an intention they seek to convey to an audience. This intention can be of many forms such as to persuade, to educate, to inform, or even to entertain. In addition to expressing their intention, the story plot must balance being consumable and enjoyable while preserving scientific integrity. In data stories, numerous methods have been identified for constructing and presenting a plot. However, there is an opportunity to expand how we think and create the visual elements that present the story. Stories are brought to life by characters; often they are what make a story captivating, enjoyable, memorable, and facilitate following the plot until the end. Through the analysis of 160 existing data stories, we systematically investigate and identify distinguishable features of characters in data stories, and we illustrate how they feed into the broader concept of "character-oriented design". We identify the roles and visual representations data characters assume as well as the types of relationships these roles have with one another. We identify characteristics of antagonists as well as define conflict in data stories. We find the need for an identifiable central character that the audience latches on to in order to follow the narrative and identify their visual representations. We then illustrate "character-oriented design" by showing how to develop data characters with common data story plots. With this work, we present a framework for data characters derived from our analysis; we then offer our extension to the data storytelling process using character-oriented design. To access our supplemental materials please visit https://chaorientdesignds.github.io/ △ Less

Submitted 14 August, 2023; originally announced August 2023.

Comments: Accepted to TVCG & VIS 2023 Pre-Print. Storytelling, Data Stories, Explanatory, Narrative visualization, Visual metaphor

arXiv:2308.00278 [pdf, other]

Classes are not Clusters: Improving Label-based Evaluation of Dimensionality Reduction

Authors: Hyeon Jeon, Yun-Hsin Kuo, Michaël Aupetit, Kwan-Liu Ma, Jinwook Seo

Abstract: A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into m… ▽ More A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures -- Label-Trustworthiness and Label-Continuity (Label-T&C) -- advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters. △ Less

Submitted 11 August, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: IEEE Transactions on Visualization and Computer Graphics (TVCG) (Proc. IEEE VIS 2023)

arXiv:2305.17600 [pdf, other]

NashFormer: Leveraging Local Nash Equilibria for Semantically Diverse Trajectory Prediction

Authors: Justin Lidard, Oswin So, Yanxia Zhang, Jonathan DeCastro, Xiongyi Cui, Xin Huang, Yen-Ling Kuo, John Leonard, Avinash Balachandran, Naomi Leonard, Guy Rosman

Abstract: Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-… ▽ More Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-theoretic inverse reinforcement learning to improve coverage of multi-modal predictions. We use a training-time game-theoretic analysis as an auxiliary loss resulting in improved coverage and accuracy without presuming a taxonomy of actions for the agents. We demonstrate our approach on the interactive split of the Waymo Open Motion Dataset, including four subsets involving scenarios with high interaction complexity. Experiment results show that our predictor produces accurate predictions while covering $33\%$ more potential interactions versus a baseline model. △ Less

Submitted 11 November, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: 8 pages, 6 figures

arXiv:2305.06447 [pdf, other]

Dynamic Graph Representation Learning for Depression Screening with Transformer

Authors: Ai-Te Kuo, Haiquan Chen, Yu-Hsuan Kuo, Wei-Shinn Ku

Abstract: Early detection of mental disorder is crucial as it enables prompt intervention and treatment, which can greatly improve outcomes for individuals suffering from debilitating mental affliction. The recent proliferation of mental health discussions on social media platforms presents research opportunities to investigate mental health and potentially detect instances of mental illness. However, exist… ▽ More Early detection of mental disorder is crucial as it enables prompt intervention and treatment, which can greatly improve outcomes for individuals suffering from debilitating mental affliction. The recent proliferation of mental health discussions on social media platforms presents research opportunities to investigate mental health and potentially detect instances of mental illness. However, existing depression detection methods are constrained due to two major limitations: (1) the reliance on feature engineering and (2) the lack of consideration for time-varying factors. Specifically, these methods require extensive feature engineering and domain knowledge, which heavily rely on the amount, quality, and type of user-generated content. Moreover, these methods ignore the important impact of time-varying factors on depression detection, such as the dynamics of linguistic patterns and interpersonal interactive behaviors over time on social media (e.g., replies, mentions, and quote-tweets). To tackle these limitations, we propose an early depression detection framework, ContrastEgo treats each user as a dynamic time-evolving attributed graph (ego-network) and leverages supervised contrastive learning to maximize the agreement of users' representations at different scales while minimizing the agreement of users' representations to differentiate between depressed and control groups. ContrastEgo embraces four modules, (1) constructing users' heterogeneous interactive graphs, (2) extracting the representations of users' interaction snapshots using graph neural networks, (3) modeling the sequences of snapshots using attention mechanism, and (4) depression detection using contrastive learning. Extensive experiments on Twitter data demonstrate that ContrastEgo significantly outperforms the state-of-the-art methods in terms of all the effectiveness metrics in various experimental settings. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Comments: 10 pages, 4 figures, 8 tables

arXiv:2301.09209 [pdf, other]

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Authors: Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang

Abstract: We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vi… ▽ More We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using language-based context summaries in a task where vision seems to suffice. Our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at https://eth-ait.github.io/transfusion-proj/. △ Less

Submitted 10 March, 2024; v1 submitted 22 January, 2023; originally announced January 2023.

arXiv:2210.01306 [pdf, other]

doi 10.1145/3680291

Robust Qubit Mapping Algorithm via Double-Source Optimal Routing on Large Quantum Circuits

Authors: Chin-Yi Cheng, Chien-Yi Yang, Yi-Hsiang Kuo, Ren-Chu Wang, Hao-Chung Cheng, Chung-Yang Ric Huang

Abstract: Qubit Mapping is a critical aspect of implementing quantum circuits on real hardware devices. Currently, the existing algorithms for qubit mapping encounter difficulties when dealing with larger circuit sizes involving hundreds of qubits. In this paper, we introduce an innovative qubit mapping algorithm, Duostra, tailored to address the challenge of implementing large-scale quantum circuits on rea… ▽ More Qubit Mapping is a critical aspect of implementing quantum circuits on real hardware devices. Currently, the existing algorithms for qubit mapping encounter difficulties when dealing with larger circuit sizes involving hundreds of qubits. In this paper, we introduce an innovative qubit mapping algorithm, Duostra, tailored to address the challenge of implementing large-scale quantum circuits on real hardware devices with limited connectivity. Duostra operates by efficiently determining optimal paths for double-qubit gates and inserting SWAP gates accordingly to implement the double-qubit operations on real devices. Together with two heuristic scheduling algorithms, the Limitedly-Exhausitive (LE) Search and the Shortest-Path (SP) Estimation, it yields results of good quality within a reasonable runtime, thereby striving toward achieving quantum advantage. Experimental results showcase our algorithm's superiority, especially for large circuits beyond the NISQ era. For example, on large circuits with more than 50 qubits, we can reduce the mapping cost on an average 21.75% over the virtual best results among QMAP, t|ket>, Qiskit and SABRE. Besides, for mid-size circuits such as the SABRE-large benchmark, we improve the mapping costs by 4.5%, 5.2%, 16.3%, 20.7%, and 25.7%, when compared to QMAP, TOQM, t|ket>, Qiskit, and SABRE, respectively. △ Less

Submitted 3 August, 2024; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Accepted by ACM Transactions on Quantum Computing

Journal ref: ACM Transactions on Quantum Computing (August 2024)

arXiv:2209.02485 [pdf, other]

Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors

Authors: Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

Abstract: We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the generalization ability towards diverse object categories and interac… ▽ More We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the generalization ability towards diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3), and applies them to perform 3D reasoning of human-object interactions. Our key insight is priors extracted from large language models can help in reasoning about human-object contacts from textural prompts only. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset and show how our method leads to better 3D reconstructions. We further qualitatively evaluate the effectiveness of our method on real images and demonstrate its generalizability towards interaction types and object categories. △ Less

Submitted 6 September, 2022; originally announced September 2022.

arXiv:2208.08878 [pdf, other]

Towards Learning in Grey Spatiotemporal Systems: A Prophet to Non-consecutive Spatiotemporal Dynamics

Authors: Zhengyang Zhou, Yang Kuo, Wei Sun, Binwu Wang, Min Zhou, Yunan Zong, Yang Wang

Abstract: Spatiotemporal forecasting is an imperative topic in data science due to its diverse and critical applications in smart cities. Existing works mostly perform consecutive predictions of following steps with observations completely and continuously obtained, where nearest observations can be exploited as key knowledge for instantaneous status estimation. However, the practical issues of early activi… ▽ More Spatiotemporal forecasting is an imperative topic in data science due to its diverse and critical applications in smart cities. Existing works mostly perform consecutive predictions of following steps with observations completely and continuously obtained, where nearest observations can be exploited as key knowledge for instantaneous status estimation. However, the practical issues of early activity planning and sensor failures elicit a brand-new task, i.e., non-consecutive forecasting. In this paper, we define spatiotemporal learning systems with missing observation as Grey Spatiotemporal Systems (G2S) and propose a Factor-Decoupled learning framework for G2S (FDG2S), where the core idea is to hierarchically decouple multi-level factors and enable both flexible aggregations and disentangled uncertainty estimations. Firstly, to compensate for missing observations, a generic semantic-neighboring sequence sampling is devised, which selects representative sequences to capture both periodical regularity and instantaneous variations. Secondly, we turn the predictions of non-consecutive statuses into inferring statuses under expected combined exogenous factors. In particular, a factor-decoupled aggregation scheme is proposed to decouple factor-induced predictive intensity and region-wise proximity by two energy functions of conditional random field. To infer region-wise proximity under flexible factor-wise combinations and enable dynamic neighborhood aggregations, we further disentangle compounded influences of exogenous factors on region-wise proximity and learn to aggregate them. Given the inherent incompleteness and critical applications of G2S, a DisEntangled Uncertainty Quantification is put forward, to identify two types of uncertainty for reliability guarantees and model interpretations. △ Less

Submitted 17 August, 2022; originally announced August 2022.

Comments: 13 pages, 6 figures and 4 tables

arXiv:2206.13891 [pdf, other]

Feature Learning for Nonlinear Dimensionality Reduction toward Maximal Extraction of Hidden Patterns

Authors: Takanori Fujiwara, Yun-Hsin Kuo, Anders Ynnerman, Kwan-Liu Ma

Abstract: Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or masked by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to genera… ▽ More Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or masked by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate a set of optimized data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, named neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments and case studies using synthetic and real-world datasets. △ Less

Submitted 24 February, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted by PacificVis 2023. The previous preprint version was titled "Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns" (arxiv:2206.13891v2)

arXiv:2205.11748 [pdf, other]

doi 10.3390/children9070996

Deep Learning-based automated classification of Chinese Speech Sound Disorders

Authors: Yao-Ming Kuo, Shanq-Jang Ruan, Yu-Chin Chen, Ya-Wen Tu

Abstract: This article describes a system for analyzing acoustic data to assist in the diagnosis and classification of children's speech sound disorders (SSDs) using a computer. The analysis concentrated on identifying and categorizing four distinct types of Chinese SSDs. The study collected and generated a speech corpus containing 2540 stopping, backing, final consonant deletion process (FCDP), and affrica… ▽ More This article describes a system for analyzing acoustic data to assist in the diagnosis and classification of children's speech sound disorders (SSDs) using a computer. The analysis concentrated on identifying and categorizing four distinct types of Chinese SSDs. The study collected and generated a speech corpus containing 2540 stopping, backing, final consonant deletion process (FCDP), and affrication samples from 90 children aged 3--6 years with normal or pathological articulatory features. Each recording was accompanied by a detailed diagnostic annotation by two speech-language pathologists (SLPs). Classification of the speech samples was accomplished using three well-established neural network models for image classification. The feature maps were created using three sets of Mel-frequency cepstral coefficients (MFCC) parameters extracted from speech sounds and aggregated into a three-dimensional data structure as model input. We employed six techniques for data augmentation to augment the available dataset while avoiding overfitting. The experiments examine the usability of four different categories of Chinese phrases and characters. Experiments with different data subsets demonstrate the system's ability to accurately detect the analyzed pronunciation disorders. The best multi-class classification using a single Chinese phrase achieves an accuracy of 74.4~percent. △ Less

Submitted 6 July, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Children 2022

Journal ref: Children 2022, 9, 996

arXiv:2202.05413 [pdf, other]

A Machine-Learning-Aided Visual Analysis Workflow for Investigating Air Pollution Data

Authors: Yun-Hsin Kuo, Takanori Fujiwara, Charles C. -K. Chou, Chun-houh Chen, Kwan-Liu Ma

Abstract: Analyzing air pollution data is challenging as there are various analysis focuses from different aspects: feature (what), space (where), and time (when). As in most geospatial analysis problems, besides high-dimensional features, the temporal and spatial dependencies of air pollution induce the complexity of performing analysis. Machine learning methods, such as dimensionality reduction, can extra… ▽ More Analyzing air pollution data is challenging as there are various analysis focuses from different aspects: feature (what), space (where), and time (when). As in most geospatial analysis problems, besides high-dimensional features, the temporal and spatial dependencies of air pollution induce the complexity of performing analysis. Machine learning methods, such as dimensionality reduction, can extract and summarize important information of the data to lift the burden of understanding such a complicated environment. In this paper, we present a methodology that utilizes multiple machine learning methods to uniformly explore these aspects. With this methodology, we develop a visual analytic system that supports a flexible analysis workflow, allowing domain experts to freely explore different aspects based on their analysis needs. We demonstrate the capability of our system and analysis workflow supporting a variety of analysis tasks with multiple use cases. △ Less

Submitted 10 February, 2022; originally announced February 2022.

Comments: To appear in the Proceedings of IEEE PacificVis 2022

arXiv:2110.11864 [pdf]

Deep learning-based NLP Data Pipeline for EHR Scanned Document Information Extraction

Authors: Enshuo Hsu, Ioannis Malagaris, Yong-Fang Kuo, Rizwana Sultana, Kirk Roberts

Abstract: Scanned documents in electronic health records (EHR) have been a challenge for decades, and are expected to stay in the foreseeable future. Current approaches for processing often include image preprocessing, optical character recognition (OCR), and text mining. However, there is limited work that evaluates the choice of image preprocessing methods, the selection of NLP models, and the role of doc… ▽ More Scanned documents in electronic health records (EHR) have been a challenge for decades, and are expected to stay in the foreseeable future. Current approaches for processing often include image preprocessing, optical character recognition (OCR), and text mining. However, there is limited work that evaluates the choice of image preprocessing methods, the selection of NLP models, and the role of document layout. The impact of each element remains unknown. We evaluated this method on a use case of two key indicators for sleep apnea, Apnea hypopnea index (AHI) and oxygen saturation (SaO2) values, from scanned sleep study reports. Our data that included 955 manually annotated reports was secondarily utilized from a previous study in the University of Texas Medical Branch. We performed image preprocessing: gray-scaling followed by 1 iteration of dilating and erode, and 20% contrast increasing. The OCR was implemented with the Tesseract OCR engine. A total of seven Bag-of-Words models (Logistic Regression, Ridge Regression, Lasso Regression, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, and Random Forest) and three deep learning-based models (BiLSTM, BERT, and Clinical BERT) were evaluated. We also evaluated the combinations of image preprocessing methods (gray-scaling, dilate & erode, increased contrast by 20%, increased contrast by 60%), and two deep learning architectures (with and without structured input that provides document layout information). Our proposed method using Clinical BERT reached an AUROC of 0.9743 and document accuracy of 94.76% for AHI, and an AUROC of 0.9523, and document accuracy of 91.61% for SaO2. We demonstrated the proper use of image preprocessing and document layout could be beneficial to scanned document processing. △ Less

Submitted 13 September, 2021; originally announced October 2021.

Comments: 6 tables, 7 figures

arXiv:2110.10298 [pdf, other]

Incorporating Rich Social Interactions Into MDPs

Authors: Ravi Tejwani, Yen-Ling Kuo, Tianmin Shu, Bennett Stankovits, Dan Gutfreund, Joshua B. Tenenbaum, Boris Katz, Andrei Barbu

Abstract: Much of what we do as humans is engage socially with other agents, a skill that robots must also eventually possess. We demonstrate that a rich theory of social interactions originating from microsociology and economics can be formalized by extending a nested MDP where agents reason about arbitrary functions of each other's hidden rewards. This extended Social MDP allows us to encode the five basi… ▽ More Much of what we do as humans is engage socially with other agents, a skill that robots must also eventually possess. We demonstrate that a rich theory of social interactions originating from microsociology and economics can be formalized by extending a nested MDP where agents reason about arbitrary functions of each other's hidden rewards. This extended Social MDP allows us to encode the five basic interactions that underlie microsociology: cooperation, conflict, coercion, competition, and exchange. The result is a robotic agent capable of executing social interactions zero-shot in new environments; like humans it can engage socially in novel ways even without a single example of that social interaction. Moreover, the judgments of these Social MDPs align closely with those of humans when considering which social interaction is taking place in an environment. This method both sheds light on the nature of social interactions, by providing concrete mathematical definitions, and brings rich social interactions into a mathematical framework that has proven to be natural for robotics, MDPs. △ Less

Submitted 7 February, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: Accepted to the 39th International Conference on Robotics and Automation (ICRA 2022)

arXiv:2110.09741 [pdf, other]

Trajectory Prediction with Linguistic Representations

Authors: Yen-Ling Kuo, Xin Huang, Andrei Barbu, Stephen G. McGill, Boris Katz, John J. Leonard, Guy Rosman

Abstract: Language allows humans to build mental models that interpret what is happening around them resulting in more accurate long-term predictions. We present a novel trajectory prediction model that uses linguistic intermediate representations to forecast trajectories, and is trained using trajectory samples with partially-annotated captions. The model learns the meaning of each of the words without dir… ▽ More Language allows humans to build mental models that interpret what is happening around them resulting in more accurate long-term predictions. We present a novel trajectory prediction model that uses linguistic intermediate representations to forecast trajectories, and is trained using trajectory samples with partially-annotated captions. The model learns the meaning of each of the words without direct per-word supervision. At inference time, it generates a linguistic description of trajectories which captures maneuvers and interactions over an extended time interval. This generated description is used to refine predictions of the trajectories of multiple agents. We train and validate our model on the Argoverse dataset, and demonstrate improved accuracy results in trajectory prediction. In addition, our model is more interpretable: it presents part of its reasoning in plain language as captions, which can aid model development and can aid in building confidence in the model before deploying it. △ Less

Submitted 9 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: Accepted in ICRA 2022

arXiv:2107.11537 [pdf]

doi 10.1109/TII.2020.3009133.

Secure Links: Secure-by-Design Communications in IEC 61499 Industrial Control Applications

Authors: Awais Tanveer, Roopak Sinha, Matthew M. Y. Kuo

Abstract: Increasing automation and external connectivity in industrial control systems (ICS) demand a greater emphasis on software-level communication security. In this article, we propose a secure-by-design development method for building ICS applications, where requirements from security standards like ISA/IEC 62443 are fulfilled by design-time abstractions called secure links. Proposed as an extension t… ▽ More Increasing automation and external connectivity in industrial control systems (ICS) demand a greater emphasis on software-level communication security. In this article, we propose a secure-by-design development method for building ICS applications, where requirements from security standards like ISA/IEC 62443 are fulfilled by design-time abstractions called secure links. Proposed as an extension to the IEC 61499 development standard, secure links incorporate both light-weight and traditional security mechanisms into applications with negligible effort. Applications containing secure links can be automatically compiled into fully IEC 61499-compliant software. Experimental results show secure links significantly reduce design and code complexity and improve application maintainability and requirements traceability. △ Less

Submitted 24 July, 2021; originally announced July 2021.

Comments: Journal paper, 11 pages, 10 figures, 3 tables

Journal ref: IEEE Transactions on Industrial Informatics 17(6)(2021), pp.3992-4002

arXiv:2105.14322 [pdf, other]

RPG: Learning Recursive Point Cloud Generation

Authors: Wei-Jan Ko, Hui-Yu Huang, Yu-Liang Kuo, Chen-Yi Chiu, Li-Heng Wang, Wei-Chen Chiu

Abstract: In this paper we propose a novel point cloud generator that is able to reconstruct and generate 3D point clouds composed of semantic parts. Given a latent representation of the target 3D model, the generation starts from a single point and gets expanded recursively to produce the high-resolution point cloud via a sequence of point expansion stages. During the recursive procedure of generation, we… ▽ More In this paper we propose a novel point cloud generator that is able to reconstruct and generate 3D point clouds composed of semantic parts. Given a latent representation of the target 3D model, the generation starts from a single point and gets expanded recursively to produce the high-resolution point cloud via a sequence of point expansion stages. During the recursive procedure of generation, we not only obtain the coarse-to-fine point clouds for the target 3D model from every expansion stage, but also unsupervisedly discover the semantic segmentation of the target model according to the hierarchical/parent-child relation between the points across expansion stages. Moreover, the expansion modules and other elements used in our recursive generator are mostly sharing weights thus making the overall framework light and efficient. Extensive experiments are conducted to demonstrate that our proposed point cloud generator has comparable or even superior performance on both generation and reconstruction tasks in comparison to various baselines, as well as provides the consistent co-segmentation among 3D instances of the same object class. △ Less

Submitted 29 May, 2021; originally announced May 2021.

arXiv:2012.12453 [pdf, other]

CholecSeg8k: A Semantic Segmentation Dataset for Laparoscopic Cholecystectomy Based on Cholec80

Authors: W. -Y. Hong, C. -L. Kao, Y. -H. Kuo, J. -R. Wang, W. -L. Chang, C. -S. Shih

Abstract: Computer-assisted surgery has been developed to enhance surgery correctness and safety. However, researchers and engineers suffer from limited annotated data to develop and train better algorithms. Consequently, the development of fundamental algorithms such as Simultaneous Localization and Mapping (SLAM) is limited. This article elaborates on the efforts of preparing the dataset for semantic segm… ▽ More Computer-assisted surgery has been developed to enhance surgery correctness and safety. However, researchers and engineers suffer from limited annotated data to develop and train better algorithms. Consequently, the development of fundamental algorithms such as Simultaneous Localization and Mapping (SLAM) is limited. This article elaborates on the efforts of preparing the dataset for semantic segmentation, which is the foundation of many computer-assisted surgery mechanisms. Based on the Cholec80 dataset [3], we extracted 8,080 laparoscopic cholecystectomy image frames from 17 video clips in Cholec80 and annotated the images. The dataset is named CholecSeg8K and its total size is 3GB. Each of these images is annotated at pixel-level for thirteen classes, which are commonly founded in laparoscopic cholecystectomy surgery. CholecSeg8k is released under the license CC BY- NC-SA 4.0. △ Less

Submitted 22 December, 2020; originally announced December 2020.

Comments: 6 pages

arXiv:2011.05451 [pdf, other]

doi 10.1016/j.jcp.2021.110527

Lattice meets lattice: Application of lattice cubature to models in lattice gauge theory

Authors: Tobias Hartung, Karl Jansen, Frances Y. Kuo, Hernan Leövey, Dirk Nuyens, Ian H. Sloan

Abstract: High dimensional integrals are abundant in many fields of research including quantum physics. The aim of this paper is to develop efficient recursive strategies to tackle a class of high dimensional integrals having a special product structure with low order couplings, motivated by models in lattice gauge theory from quantum field theory. A novel element of this work is the potential benefit in us… ▽ More High dimensional integrals are abundant in many fields of research including quantum physics. The aim of this paper is to develop efficient recursive strategies to tackle a class of high dimensional integrals having a special product structure with low order couplings, motivated by models in lattice gauge theory from quantum field theory. A novel element of this work is the potential benefit in using lattice cubature rules. The group structure within lattice rules combined with the special structure in the physics integrands may allow efficient computations based on Fast Fourier Transforms. Applications to the quantum mechanical rotor and compact $U(1)$ lattice gauge theory in two and three dimensions are considered. △ Less

Submitted 29 June, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

MSC Class: 65D30; 65D32; 65T50; 65Z05; 81T80

Journal ref: Journal of Computational Physics Volume 443, 15 October 2021, 110527

arXiv:2008.03277 [pdf, other]

Learning a natural-language to LTL executable semantic parser for grounded robotics

Authors: Christopher Wang, Candace Ross, Yen-Ling Kuo, Boris Katz, Andrei Barbu

Abstract: Children acquire their native language with apparent ease by observing how language is used in context and attempting to use it themselves. They do so without laborious annotations, negative examples, or even direct corrections. We take a step toward robots that can do the same by training a grounded semantic parser, which discovers latent linguistic representations that can be used for the execut… ▽ More Children acquire their native language with apparent ease by observing how language is used in context and attempting to use it themselves. They do so without laborious annotations, negative examples, or even direct corrections. We take a step toward robots that can do the same by training a grounded semantic parser, which discovers latent linguistic representations that can be used for the execution of natural-language commands. In particular, we focus on the difficult domain of commands with a temporal aspect, whose semantics we capture with Linear Temporal Logic, LTL. Our parser is trained with pairs of sentences and executions as well as an executor. At training time, the parser hypothesizes a meaning representation for the input as a formula in LTL. Three competing pressures allow the parser to discover meaning from language. First, any hypothesized meaning for a sentence must be permissive enough to reflect all the annotated execution trajectories. Second, the executor -- a pretrained end-to-end LTL planner -- must find that the observe trajectories are likely executions of the meaning. Finally, a generator, which reconstructs the original input, encourages the model to find representations that conserve knowledge about the command. Together these ensure that the meaning is neither too general nor too specific. Our model generalizes well, being able to parse and execute both machine-generated and human-generated commands, with near-equal accuracy, despite the fact that the human-generated sentences are much more varied and complex with an open lexicon. The approach presented here is not specific to LTL: it can be applied to any domain where sentence meanings can be hypothesized and an executor can verify these meanings, thus opening the door to many applications for robotic agents. △ Less

Submitted 16 March, 2021; v1 submitted 7 August, 2020; originally announced August 2020.

Comments: 10 pages, 2 figures, Accepted in Conference on Robot Learning (CoRL) 2020

ACM Class: I.2.7

arXiv:2008.02742 [pdf, other]

Compositional Networks Enable Systematic Generalization for Grounded Language Understanding

Authors: Yen-Ling Kuo, Boris Katz, Andrei Barbu

Abstract: Humans are remarkably flexible when understanding new sentences that include combinations of concepts they have never encountered before. Recent work has shown that while deep networks can mimic some human language abilities when presented with novel sentences, systematic variation uncovers the limitations in the language-understanding abilities of networks. We demonstrate that these limitations c… ▽ More Humans are remarkably flexible when understanding new sentences that include combinations of concepts they have never encountered before. Recent work has shown that while deep networks can mimic some human language abilities when presented with novel sentences, systematic variation uncovers the limitations in the language-understanding abilities of networks. We demonstrate that these limitations can be overcome by addressing the generalization challenges in the gSCAN dataset, which explicitly measures how well an agent is able to interpret novel linguistic commands grounded in vision, e.g., novel pairings of adjectives and nouns. The key principle we employ is compositionality: that the compositional structure of networks should reflect the compositional structure of the problem domain they address, while allowing other parameters to be learned end-to-end. We build a general-purpose mechanism that enables agents to generalize their language understanding to compositional domains. Crucially, our network has the same state-of-the-art performance as prior work while generalizing its knowledge when prior work does not. Our network also provides a level of interpretability that enables users to inspect what each part of networks learns. Robust grounded language understanding without dramatic failures and without corner cases is critical to building safe and fair robots; we demonstrate the significant role that compositionality can play in achieving that goal. △ Less

Submitted 19 October, 2021; v1 submitted 6 August, 2020; originally announced August 2020.

Comments: Accepted in Findings of EMNLP 2021

arXiv:2006.01110 [pdf, other]

Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas

Authors: Yen-Ling Kuo, Boris Katz, Andrei Barbu

Abstract: We demonstrate a reinforcement learning agent which uses a compositional recurrent neural network that takes as input an LTL formula and determines satisfying actions. The input LTL formulas have never been seen before, yet the network performs zero-shot generalization to satisfy them. This is a novel form of multi-task learning for RL agents where agents learn from one diverse set of tasks and ge… ▽ More We demonstrate a reinforcement learning agent which uses a compositional recurrent neural network that takes as input an LTL formula and determines satisfying actions. The input LTL formulas have never been seen before, yet the network performs zero-shot generalization to satisfy them. This is a novel form of multi-task learning for RL agents where agents learn from one diverse set of tasks and generalize to a new set of diverse tasks. The formulation of the network enables this capacity to generalize. We demonstrate this ability in two domains. In a symbolic domain, the agent finds a sequence of letters that is accepted. In a Minecraft-like environment, the agent finds a sequence of actions that conform to the formula. While prior work could learn to execute one formula reliably given examples of that formula, we demonstrate how to encode all formulas reliably. This could form the basis of new multitask agents that discover sub-tasks and execute them without any additional training, as well as the agents which follow more complex linguistic commands. The structures required for this generalization are specific to LTL formulas, which opens up an interesting theoretical question: what structures are required in neural networks for zero-shot generalization to different logics? △ Less

Submitted 6 August, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: Accepted in IROS 2020

arXiv:2003.03716 [pdf, other]

Investigating the Decoders of Maximum Likelihood Sequence Models: A Look-ahead Approach

Authors: Yu-Siang Wang, Yen-Ling Kuo, Boris Katz

Abstract: We demonstrate how we can practically incorporate multi-step future information into a decoder of maximum likelihood sequence models. We propose a "k-step look-ahead" module to consider the likelihood information of a rollout up to k steps. Unlike other approaches that need to train another value network to evaluate the rollouts, we can directly apply this look-ahead module to improve the decoding… ▽ More We demonstrate how we can practically incorporate multi-step future information into a decoder of maximum likelihood sequence models. We propose a "k-step look-ahead" module to consider the likelihood information of a rollout up to k steps. Unlike other approaches that need to train another value network to evaluate the rollouts, we can directly apply this look-ahead module to improve the decoding of any sequence model trained in a maximum likelihood framework. We evaluate our look-ahead module on three datasets of varying difficulties: IM2LATEX-100k OCR image to LaTeX, WMT16 multimodal machine translation, and WMT14 machine translation. Our look-ahead module improves the performance of the simpler datasets such as IM2LATEX-100k and WMT16 multimodal machine translation. However, the improvement of the more difficult dataset (e.g., containing longer sequences), WMT14 machine translation, becomes marginal. Our further investigation using the k-step look-ahead suggests that the more difficult tasks suffer from the overestimated EOS (end-of-sentence) probability. We argue that the overestimated EOS probability also causes the decreased performance of beam search when increasing its beam width. We tackle the EOS problem by integrating an auxiliary EOS loss into the training to estimate if the model should emit EOS or other words. Our experiments show that improving EOS estimation not only increases the performance of our proposed look-ahead module but also the robustness of the beam search. △ Less

Submitted 7 March, 2020; originally announced March 2020.

Comments: 7 pages, 5 figures

arXiv:2002.05201 [pdf, other]

Deep compositional robotic planners that follow natural language commands

Authors: Yen-Ling Kuo, Boris Katz, Andrei Barbu

Abstract: We demonstrate how a sampling-based robotic planner can be augmented to learn to understand a sequence of natural language commands in a continuous configuration space to move and manipulate objects. Our approach combines a deep network structured according to the parse of a complex command that includes objects, verbs, spatial relations, and attributes, with a sampling-based planner, RRT. A recur… ▽ More We demonstrate how a sampling-based robotic planner can be augmented to learn to understand a sequence of natural language commands in a continuous configuration space to move and manipulate objects. Our approach combines a deep network structured according to the parse of a complex command that includes objects, verbs, spatial relations, and attributes, with a sampling-based planner, RRT. A recurrent hierarchical deep network controls how the planner explores the environment, determines when a planned path is likely to achieve a goal, and estimates the confidence of each move to trade off exploitation and exploration between the network and the planner. Planners are designed to have near-optimal behavior when information about the task is missing, while networks learn to exploit observations which are available from the environment, making the two naturally complementary. Combining the two enables generalization to new maps, new kinds of obstacles, and more complex sentences that do not occur in the training set. Little data is required to train the model despite it jointly acquiring a CNN that extracts features from the environment as it learns the meanings of words. The model provides a level of interpretability through the use of attention maps allowing users to see its reasoning steps despite being an end-to-end model. This end-to-end model allows robots to learn to follow natural language commands in challenging continuous environments. △ Less

Submitted 19 February, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: Accepted in ICRA 2020

arXiv:1810.00804 [pdf, other]

Deep sequential models for sampling-based planning

Authors: Yen-Ling Kuo, Andrei Barbu, Boris Katz

Abstract: We demonstrate how a sequence model and a sampling-based planner can influence each other to produce efficient plans and how such a model can automatically learn to take advantage of observations of the environment. Sampling-based planners such as RRT generally know nothing of their environments even if they have traversed similar spaces many times. A sequence model, such as an HMM or LSTM, guides… ▽ More We demonstrate how a sequence model and a sampling-based planner can influence each other to produce efficient plans and how such a model can automatically learn to take advantage of observations of the environment. Sampling-based planners such as RRT generally know nothing of their environments even if they have traversed similar spaces many times. A sequence model, such as an HMM or LSTM, guides the search for good paths. The resulting model, called DeRRT*, observes the state of the planner and the local environment to bias the next move and next planner state. The neural-network-based models avoid manual feature engineering by co-training a convolutional network which processes map features and observations from sensors. We incorporate this sequence model in a manner that combines its likelihood with the existing bias for searching large unexplored Voronoi regions. This leads to more efficient trajectories with fewer rejected samples even in difficult domains such as when escaping bug traps. This model can also be used for dimensionality reduction in multi-agent environments with dynamic obstacles. Instead of planning in a high-dimensional space that includes the configurations of the other agents, we plan in a low-dimensional subspace relying on the sequence model to bias samples using the observed behavior of the other agents. The techniques presented here are general, include both graphical models and deep learning approaches, and can be adapted to a range of planners. △ Less

Submitted 1 October, 2018; originally announced October 2018.

Comments: Published in IROS 2018

arXiv:1809.08753 [pdf, other]

An Iterative Refinement Approach for Social Media Headline Prediction

Authors: Chih-Chung Hsu, Chia-Yen Lee, Ting-Xuan Liao, Jun-Yi Lee, Tsai-Yne Hou, Ying-Chu Kuo, Jing-Wen Lin, Ching-Yi Hsueh, Zhong-Xuan Zhan, Hsiang-Chin Chien

Abstract: In this study, we propose a novel iterative refinement approach to predict the popularity score of the social media meta-data effectively. With the rapid growth of the social media on the Internet, how to adequately forecast the view count or popularity becomes more important. Conventionally, the ensemble approach such as random forest regression achieves high and stable performance on various pre… ▽ More In this study, we propose a novel iterative refinement approach to predict the popularity score of the social media meta-data effectively. With the rapid growth of the social media on the Internet, how to adequately forecast the view count or popularity becomes more important. Conventionally, the ensemble approach such as random forest regression achieves high and stable performance on various prediction tasks. However, most of the regression methods may not precisely predict the extreme high or low values. To address this issue, we first predict the initial popularity score and retrieve their residues. In order to correctly compensate those extreme values, we adopt an ensemble regressor to compensate the residues to further improve the prediction performance. Comprehensive experiments are conducted to demonstrate the proposed iterative refinement approach outperforms the state-of-the-art regression approach. △ Less

Submitted 24 September, 2018; originally announced September 2018.

Comments: 5 pages, ACM Multimedia Conference 2018

arXiv:1808.08640 [pdf, other]

doi 10.1145/3269206.3271798

Detecting Outliers in Data with Correlated Measures

Authors: Yu-Hsuan Kuo, Zhenhui Li, Daniel Kifer

Abstract: Advances in sensor technology have enabled the collection of large-scale datasets. Such datasets can be extremely noisy and often contain a significant amount of outliers that result from sensor malfunction or human operation faults. In order to utilize such data for real-world applications, it is critical to detect outliers so that models built from these datasets will not be skewed by outliers.… ▽ More Advances in sensor technology have enabled the collection of large-scale datasets. Such datasets can be extremely noisy and often contain a significant amount of outliers that result from sensor malfunction or human operation faults. In order to utilize such data for real-world applications, it is critical to detect outliers so that models built from these datasets will not be skewed by outliers. In this paper, we propose a new outlier detection method that utilizes the correlations in the data (e.g., taxi trip distance vs. trip time). Different from existing outlier detection methods, we build a robust regression model that explicitly models the outliers and detects outliers simultaneously with the model fitting. We validate our approach on real-world datasets against methods specifically designed for each dataset as well as the state of the art outlier detectors. Our outlier detection method achieves better performances, demonstrating the robustness and generality of our method. Last, we report interesting case studies on some outliers that result from atypical events. △ Less

Submitted 26 August, 2018; originally announced August 2018.

Comments: 10 pages

arXiv:1804.00370 [pdf, other]

Differentially Private Hierarchical Count-of-Counts Histograms

Authors: Yu-Hsuan Kuo, Cho-Chun Chiu, Daniel Kifer, Michael Hay, Ashwin Machanavajjhala

Abstract: We consider the problem of privately releasing a class of queries that we call hierarchical count-of-counts histograms. Count-of-counts histograms partition the rows of an input table into groups (e.g., group of people in the same household), and for every integer j report the number of groups of size j. Hierarchical count-of-counts queries report count-of-counts histograms at different granularit… ▽ More We consider the problem of privately releasing a class of queries that we call hierarchical count-of-counts histograms. Count-of-counts histograms partition the rows of an input table into groups (e.g., group of people in the same household), and for every integer j report the number of groups of size j. Hierarchical count-of-counts queries report count-of-counts histograms at different granularities as per hierarchy defined on an attribute in the input data (e.g., geographical location of a household at the national, state and county levels). In this paper, we introduce this problem, along with appropriate error metrics and propose a differentially private solution that generates count-of-counts histograms that are consistent across all levels of the hierarchy. △ Less

Submitted 13 September, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

Comments: 13 pages

arXiv:1706.09541 [pdf, ps, other]

Information-Centric Wireless Networks with Mobile Edge Computing

Authors: Yuchen Zhou, F. Richard Yu, Jian Chen, Yonghong Kuo

Abstract: In order to better accommodate the dramatically increasing demand for data caching and computing services, storage and computation capabilities should be endowed to some of the intermediate nodes within the network. In this paper, we design a novel virtualized heterogeneous networks framework aiming at enabling content caching and computing. With the virtualization of the whole system, the communi… ▽ More In order to better accommodate the dramatically increasing demand for data caching and computing services, storage and computation capabilities should be endowed to some of the intermediate nodes within the network. In this paper, we design a novel virtualized heterogeneous networks framework aiming at enabling content caching and computing. With the virtualization of the whole system, the communication, computing and caching resources can be shared among all users associated with different virtual service providers. We formulate the virtual resource allocation strategy as a joint optimization problem, where the gains of not only virtualization but also caching and computing are taken into consideration in the proposed architecture. In addition, a distributed algorithm based on alternating direction method of multipliers is adopted to solve the formulated problem, in order to reduce the computational complexity and signaling overhead. Finally, extensive simulations are presented to show the effectiveness of the proposed scheme under different system parameters. △ Less

Submitted 28 June, 2017; originally announced June 2017.

arXiv:1608.05339 [pdf, other]

Photo Filter Recommendation by Category-Aware Aesthetic Learning

Authors: Wei-Tse Sun, Ting-Hsuan Chao, Yin-Hsi Kuo, Winston H. Hsu

Abstract: Nowadays, social media has become a popular platform for the public to share photos. To make photos more visually appealing, users usually apply filters on their photos without domain knowledge. However, due to the growing number of filter types, it becomes a major issue for users to choose the best filter type. For this purpose, filter recommendation for photo aesthetics takes an important role i… ▽ More Nowadays, social media has become a popular platform for the public to share photos. To make photos more visually appealing, users usually apply filters on their photos without domain knowledge. However, due to the growing number of filter types, it becomes a major issue for users to choose the best filter type. For this purpose, filter recommendation for photo aesthetics takes an important role in image quality ranking problems. In these years, several works have declared that Convolutional Neural Networks (CNNs) outperform traditional methods in image aesthetic categorization, which classifies images into high or low quality. Most of them do not consider the effect on filtered images; hence, we propose a novel image aesthetic learning for filter recommendation. Instead of binarizing image quality, we adjust the state-of-the-art CNN architectures and design a pairwise loss function to learn the embedded aesthetic responses in hidden layers for filtered images. Based on our pilot study, we observe image categories (e.g., portrait, landscape, food) will affect user preference on filter selection. We further integrate category classification into our proposed aesthetic-oriented models. To the best of our knowledge, there is no public dataset for aesthetic judgment with filtered images. We create a new dataset called Filter Aesthetic Comparison Dataset (FACD). It contains 28,160 filtered images based on the AVA dataset and 42,240 reliable image pairs with aesthetic annotations using Amazon Mechanical Turk. It is the first dataset containing filtered images and user preference labels. We conduct experiments on the collected FACD for filter recommendation, and the results show that our proposed category-aware aesthetic learning outperforms aesthetic classification methods (e.g., 12% relative improvement). △ Less

Submitted 27 March, 2017; v1 submitted 18 August, 2016; originally announced August 2016.

Comments: 11 pages, 7 figures

arXiv:1606.08999 [pdf, ps, other]

De-Hashing: Server-Side Context-Aware Feature Reconstruction for Mobile Visual Search

Authors: Yin-Hsi Kuo, Winston H. Hsu

Abstract: Due to the prevalence of mobile devices, mobile search becomes a more convenient way than desktop search. Different from the traditional desktop search, mobile visual search needs more consideration for the limited resources on mobile devices (e.g., bandwidth, computing power, and memory consumption). The state-of-the-art approaches show that bag-of-words (BoW) model is robust for image and video… ▽ More Due to the prevalence of mobile devices, mobile search becomes a more convenient way than desktop search. Different from the traditional desktop search, mobile visual search needs more consideration for the limited resources on mobile devices (e.g., bandwidth, computing power, and memory consumption). The state-of-the-art approaches show that bag-of-words (BoW) model is robust for image and video retrieval; however, the large vocabulary tree might not be able to be loaded on the mobile device. We observe that recent works mainly focus on designing compact feature representations on mobile devices for bandwidth-limited network (e.g., 3G) and directly adopt feature matching on remote servers (cloud). However, the compact (binary) representation might fail to retrieve target objects (images, videos). Based on the hashed binary codes, we propose a de-hashing process that reconstructs BoW by leveraging the computing power of remote servers. To mitigate the information loss from binary codes, we further utilize contextual information (e.g., GPS) to reconstruct a context-aware BoW for better retrieval results. Experiment results show that the proposed method can achieve competitive retrieval accuracy as BoW while only transmitting few bits from mobile devices. △ Less

Submitted 29 June, 2016; originally announced June 2016.

Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

arXiv:1512.08580 [pdf, other]

A Simple Baseline for Travel Time Estimation using Large-Scale Trip Data

Authors: Hongjian Wang, Zhenhui Li, Yu-Hsuan Kuo, Dan Kifer

Abstract: The increased availability of large-scale trajectory data around the world provides rich information for the study of urban dynamics. For example, New York City Taxi Limousine Commission regularly releases source-destination information about trips in the taxis they regulate. Taxi data provide information about traffic patterns, and thus enable the study of urban flow -- what will traffic between… ▽ More The increased availability of large-scale trajectory data around the world provides rich information for the study of urban dynamics. For example, New York City Taxi Limousine Commission regularly releases source-destination information about trips in the taxis they regulate. Taxi data provide information about traffic patterns, and thus enable the study of urban flow -- what will traffic between two locations look like at a certain date and time in the future? Existing big data methods try to outdo each other in terms of complexity and algorithmic sophistication. In the spirit of "big data beats algorithms", we present a very simple baseline which outperforms state-of-the-art approaches, including Bing Maps and Baidu Maps (whose APIs permit large scale experimentation). Such a travel time estimation baseline has several important uses, such as navigation (fast travel time estimates can serve as approximate heuristics for A search variants for path finding) and trip planning (which uses operating hours for popular destinations along with travel time estimates to create an itinerary). △ Less

Submitted 28 December, 2015; originally announced December 2015.

Comments: 12 pages

ACM Class: H.2.8; I.2.6

Showing 1–43 of 43 results for author: Kuo, Y