Showing 1–23 of 23 results for author: Dou, L

Searching in archive cs.
  1. arXiv:2408.08841  [pdf, other]

    cs.CL

    FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats

    Authors: Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Baoxin Wang, Dayong Wu, Qingfu Zhu, Wanxiang Che

    Abstract: The table reasoning task aims to answer the question according to the given table. Currently, using Large Language Models (LLMs) is the predominant method for table reasoning. Most existing methods employ a fixed tabular format to represent the table, which could limit the performance. Given that each instance requires different capabilities and models possess varying abilities, we assert that dif…

    Submitted 27 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  2. arXiv:2408.08779  [pdf, other]

    cs.CL

    DAC: Decomposed Automation Correction for Text-to-SQL

    Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

    Abstract: Text-to-SQL is an important task that helps people obtain information from databases by automatically generating SQL queries. Considering the brilliant performance, approaches based on Large Language Models (LLMs) become the mainstream for text-to-SQL. Among these approaches, automated correction is an effective approach that further enhances performance by correcting the mistakes in the generated…

    Submitted 27 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  3. arXiv:2407.13623  [pdf, other]

    cs.CL cs.AI

    Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

    Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

    Abstract: Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compu…

    Submitted 26 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: 26 pages, 12 figures. Add more related work

  4. arXiv:2407.01492  [pdf, other]

    cs.CL cs.AI

    RegMix: Data Mixture as Regression for Language Model Pre-training

    Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

    Abstract: The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance gi…

    Submitted 1 July, 2024; originally announced July 2024.

  5. arXiv:2405.05496  [pdf, other]

    cs.CL

    Boosting Large Language Models with Continual Learning for Aspect-based Sentiment Analysis

    Authors: Xuanwen Ding, Jie Zhou, Liang Dou, Qin Chen, Yuanbin Wu, Chengcai Chen, Liang He

    Abstract: Aspect-based sentiment analysis (ABSA) is an important subtask of sentiment analysis, which aims to extract the aspects and predict their sentiments. Most existing studies focus on improving the performance of the target domain by fine-tuning domain-specific models (trained on source domains) based on the target domain dataset. Few works propose continual learning tasks for ABSA, which aim to lear…

    Submitted 8 May, 2024; originally announced May 2024.

  6. arXiv:2404.03608  [pdf, other]

    cs.CL cs.AI

    Sailor: Open Language Models for South-East Asia

    Authors: Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, Min Lin

    Abstract: We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a great language model for multilingual use cases. From Qwen1.5, Sailor models accept 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Code is available at https://github.com/sail-sg/sailor-llm

  7. arXiv:2402.10666  [pdf, other]

    cs.CL

    Multi-Hop Table Retrieval for Open-Domain Text-to-SQL

    Authors: Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, Wanxiang Che

    Abstract: Open-domain text-to-SQL is an important task that retrieves question-relevant tables from massive databases and then generates SQL. However, existing retrieval methods that retrieve in a single hop do not pay attention to the text-to-SQL challenge of schema linking, which is aligning the entities in the question with table entities, reflected in two aspects: similar irrelevant entity and domain mi…

    Submitted 16 August, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  8. arXiv:2402.10663  [pdf, other]

    cs.CL

    Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL

    Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

    Abstract: Currently, the in-context learning method based on large language models (LLMs) has become the mainstream of text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from the limitations of insufficient diversity and high labeling overhead. Therefore, in this paper, we disc…

    Submitted 26 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  9. arXiv:2402.10654  [pdf, other]

    cs.CL

    Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes

    Authors: Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

    Abstract: Numerical reasoning is an essential ability for NLP systems to handle numeric information. Recent research indicates that fine-tuning a small-scale model to learn generating reasoning processes alongside answers can significantly enhance performance. However, current methods have the limitation that most methods generate reasoning processes with large language models (LLMs), which are "unreliable"…

    Submitted 16 February, 2024; originally announced February 2024.

  10. arXiv:2402.08259  [pdf, other]

    cs.CL

    A Survey of Table Reasoning with Large Language Models

    Authors: Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, Wanxiang Che

    Abstract: Table reasoning, which aims to generate the corresponding answer to the question following the user requirement according to the provided table, and optionally a text description of the table, effectively improving the efficiency of obtaining information. Recently, using Large Language Models (LLMs) has become the mainstream method for table reasoning, because it not only significantly reduces the…

    Submitted 13 February, 2024; originally announced February 2024.

  11. arXiv:2308.10585  [pdf, other]

    cs.CL

    Exploring Equation as a Better Intermediate Meaning Representation for Numerical Reasoning

    Authors: Dingzirui Wang, Longxu Dou, Wenbin Zhang, Junyu Zeng, Wanxiang Che

    Abstract: Numerical reasoning is vital for natural language processing models to understand and process numerical information in real-world scenarios. Most current methods first generate the Intermediate Meaning Representations (IMRs) of questions and then generate answers. Current SOTA methods generate programs as IMRs with large language models (LLMs). Intuitively, equations have fewer restrictions and cl…

    Submitted 21 August, 2023; originally announced August 2023.

  12. arXiv:2305.04228  [pdf, ps, other]

    cs.SE cs.AI cs.LG

    Heterogeneous Directed Hypergraph Neural Network over abstract syntax tree (AST) for Code Classification

    Authors: Guang Yang, Tiancheng Jin, Liang Dou

    Abstract: Code classification is a difficult issue in program understanding and automatic coding. Due to the elusive syntax and complicated semantics in programs, most existing studies use techniques based on abstract syntax tree (AST) and graph neural network (GNN) to create code representations for code classification. These techniques utilize the structure and semantic information of the code, but they o…

    Submitted 3 February, 2024; v1 submitted 7 May, 2023; originally announced May 2023.

    Comments: Published in the 35th International Conference on Software Engineering and Knowledge Engineering (SEKE 2023) as a regular paper

  13. arXiv:2304.13902   

    cs.CL

    Controllable Data Augmentation for Context-Dependent Text-to-SQL

    Authors: Dingzirui Wang, Longxu Dou, Wanxiang Che

    Abstract: The limited scale of annotated data constraints existing context-dependent text-to-SQL models because of the complexity of labeling. The data augmentation method is a commonly used method to solve this problem. However, the data generated by current augmentation methods often lack diversity. In this paper, we introduce ConDA, which generates interactive questions and corresponding SQL results. We…

    Submitted 27 April, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

    Comments: fix overlap

  14. arXiv:2304.09402  [pdf, other]

    cs.CL cs.LG

    MixPro: Simple yet Effective Data Augmentation for Prompt-based Learning

    Authors: Bohan Li, Longxu Dou, Yutai Hou, Yunlong Feng, Honglin Mu, Qingfu Zhu, Qinghua Sun, Wanxiang Che

    Abstract: Prompt-based learning has shown considerable promise in reformulating various downstream tasks as cloze problems by combining original input with a predetermined template. This approach demonstrates its effectiveness, especially in few-shot learning scenarios, where the model is trained on a scarce amount of data. Despite its successes, the limited templates and text in few-shot prompt-based learn…

    Submitted 11 November, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: 19 pages, 5 figures, 6 tables

  15. arXiv:2304.07995  [pdf, other]

    cs.CL cs.AI

    From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning

    Authors: Qian Liu, Fan Zhou, Zhengbao Jiang, Longxu Dou, Min Lin

    Abstract: Fine-tuning language models on tasks with instructions has demonstrated potential in facilitating zero-shot generalization to unseen tasks. In this paper, we introduce a straightforward yet effective method for enhancing instruction tuning by employing symbolic tasks. Compared to crowdsourced human tasks or model-generated tasks, symbolic tasks present a unique advantage as they can be easily gene…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: Work in Progress. The code is released at https://github.com/sail-sg/symbolic-instruction-tuning

  16. arXiv:2302.08269  [pdf, other]

    cs.CV cs.AI

    SyreaNet: A Physically Guided Underwater Image Enhancement Framework Integrating Synthetic and Real Images

    Authors: Junjie Wen, Jinqiang Cui, Zhenjun Zhao, Ruixin Yan, Zhi Gao, Lihua Dou, Ben M. Chen

    Abstract: Underwater image enhancement (UIE) is vital for high-level vision-related underwater tasks. Although learning-based UIE methods have made remarkable achievements in recent years, it's still challenging for them to consistently deal with various underwater conditions, which could be caused by: 1) the use of the simplified atmospheric image formation model in UIE may result in severe errors; 2) the…

    Submitted 25 May, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: ICRA23

  17. arXiv:2301.12344  [pdf, other]

    cs.RO eess.SY

    TJ-FlyingFish: Design and Implementation of an Aerial-Aquatic Quadrotor with Tiltable Propulsion Units

    Authors: Xuchen Liu, Minghao Dou, Dongyue Huang, Biao Wang, Jinqiang Cui, Qinyuan Ren, Lihua Dou, Zhi Gao, Jie Chen, Ben M. Chen

    Abstract: Aerial-aquatic vehicles are capable to move in the two most dominant fluids, making them more promising for a wide range of applications. We propose a prototype with special designs for propulsion and thruster configuration to cope with the vast differences in the fluid properties of water and air. For propulsion, the operating range is switched for the different mediums by the dual-speed propulsi…

    Submitted 6 February, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

    Comments: 6 pages, 9 figures, accepted to 2023 IEEE International Conference on Robotics and Automation (ICRA)

  18. arXiv:2301.01067  [pdf, other]

    cs.CL

    Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge

    Authors: Longxu Dou, Yan Gao, Xuqi Liu, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Min-Yen Kan, Jian-Guang Lou

    Abstract: In this paper, we study the problem of knowledge-intensive text-to-SQL, in which domain knowledge is necessary to parse expert questions into SQL queries over domain-specific tables. We formalize this scenario by building a new Chinese benchmark KnowSQL consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by a…

    Submitted 3 January, 2023; originally announced January 2023.

    Comments: EMNLP 2022 Main Conference

  19. arXiv:2212.13492  [pdf, other]

    cs.CL

    MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

    Authors: Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou

    Abstract: Text-to-SQL semantic parsing is an important NLP task, which greatly facilitates the interaction between users and the database and becomes the key component in many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL data…

    Submitted 27 December, 2022; originally announced December 2022.

    Comments: AAAI2023 Main Conference. Code: https://github.com/microsoft/ContextualSP

  20. arXiv:2212.13465  [pdf, other]

    cs.CL cs.AI

    A Survey on Table-and-Text HybridQA: Concepts, Methods, Challenges and Future Directions

    Authors: Dingzirui Wang, Longxu Dou, Wanxiang Che

    Abstract: Table-and-text hybrid question answering (HybridQA) is a widely used and challenging NLP task commonly applied in the financial and scientific domain. The early research focuses on migrating other QA task methods to HybridQA, while with further research, more and more HybridQA-specific methods have been present. With the rapid development of HybridQA, the systematic survey is still under-explored…

    Submitted 1 February, 2023; v1 submitted 27 December, 2022; originally announced December 2022.

    Comments: 7 pages

  21. arXiv:2203.07781  [pdf, other]

    cs.CL cs.AI cs.DB

    UniSAr: A Unified Structure-Aware Autoregressive Language Model for Text-to-SQL

    Authors: Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou

    Abstract: Existing text-to-SQL semantic parsers are typically designed for particular settings such as handling queries that span multiple tables, domains or turns which makes them ineffective when applied to different settings. We present UniSAr (Unified Structure-Aware Autoregressive Language Model), which benefits from directly using an off-the-shelf language model architecture and demonstrates consisten…

    Submitted 13 April, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Codes and checkpoints are available at https://github.com/microsoft/ContextualSP/tree/master/unified_parser_text_to_sql

  22. arXiv:1902.00647  [pdf, other]

    cs.CR

    A Large-Scale Empirical Study on Industrial Fake Apps

    Authors: Chongbin Tang, Sen Chen, Lingling Fan, Lihua Xu, Yang Liu, Zhushou Tang, Liang Dou

    Abstract: While there have been various studies towards Android apps and their development, there is limited discussion of the broader class of apps that fall in the fake area. Fake apps and their development are distinct from official apps and belong to the mobile underground industry. Due to the lack of knowledge of the mobile underground industry, fake apps, their ecosystem and nature still remain in mys…

    Submitted 9 February, 2019; v1 submitted 2 February, 2019; originally announced February 2019.

  23. arXiv:1511.01706  [pdf]

    cs.CV

    Image classification based on support vector machine and the fusion of complementary features

    Authors: Huilin Gao, Wenjie Chen, Lihua Dou

    Abstract: Image Classification based on BOW (Bag-of-words) has broad application prospect in pattern recognition field but the shortcomings are existed because of single feature and low classification accuracy. To this end we combine three ingredients: (i) Three features with functions of mutual complementation are adopted to describe the images, including PHOW (Pyramid Histogram of Words), PHOC (Pyramid Hi…

    Submitted 5 November, 2015; originally announced November 2015.

    Comments: 22 pages, 4 figures