Research Article · DOI: 10.1145/3589334.3645707

Towards Cross-Table Masked Pretraining for Web Data Mining

Published: 13 May 2024

Abstract

Tabular data pervades the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent success of large-scale pretrained models such as ChatGPT and SAM across diverse domains, applying pretraining techniques to mining tabular data on the web has emerged as a highly promising research direction. Some recent works have explored this topic, but most, if not all, are limited to a fixed-schema, single-table setting. Given the modest dataset scale and parameter counts of these prior models, we believe the "BERT moment" for ubiquitous tabular data has yet to arrive; development in this area lags significantly behind counterpart domains such as natural language processing. In this work, we first identify the crucial challenges underlying tabular data pretraining, particularly the cross-table hurdle. As a pioneering endeavor, this work (i) contributes a high-quality real-world tabular dataset, (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables with few restrictions, and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but tailored to scalable pretraining on tables. Extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining enhances a variety of downstream tasks.
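
For readers who want a concrete picture of what masked table modeling looks like in practice, below is a minimal, hypothetical PyTorch sketch; it is our own illustration under stated assumptions, not the authors' released code. The idea it demonstrates: each cell value is embedded alongside an embedding of its column name (a stand-in for CM2's semantic-aware encoding, where column names act as prompts), a random subset of cells is replaced by a learned mask token, a transformer contextualizes the row, and the model is trained to reconstruct the masked values. All names (`MaskedTableModel`, the 35% mask ratio, MSE on numeric cells) are assumptions; the actual CM2 also handles categorical and textual cells.

```python
# Illustrative sketch of masked pretraining on table rows (not the CM2 codebase).
import torch
import torch.nn as nn

class MaskedTableModel(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)           # embed scalar cell values
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                 # regress masked values back

    def forward(self, values, col_emb, mask):
        # values: (B, C) numeric cells; col_emb: (B, C, d) column-name embeddings
        # mask:   (B, C) bool, True where the cell is hidden from the encoder
        tok = self.value_proj(values.unsqueeze(-1))
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        h = self.encoder(tok + col_emb)                   # column names act as prompts
        return self.head(h).squeeze(-1)

# One pretraining step on a toy batch: hide ~35% of cells, reconstruct them.
B, C, d = 8, 5, 64
model = MaskedTableModel(d_model=d)
values = torch.randn(B, C)
col_emb = torch.randn(1, C, d).expand(B, C, d)  # e.g. from a frozen text encoder
mask = torch.rand(B, C) < 0.35
pred = model(values, col_emb, mask)
loss = ((pred - values)[mask] ** 2).mean()      # loss only on masked cells
loss.backward()
```

Because the only table-specific inputs are the column-name embeddings, the same weights can in principle be pretrained across tables with different schemas, which is the crux of the cross-table setting the paper targets.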

Supplemental Material

MP4 file: supplemental video.




Published In

WWW '24: Proceedings of the ACM Web Conference 2024
May 2024, 4826 pages
ISBN: 9798400701719
DOI: 10.1145/3589334

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. data mining
2. pretraining
3. tabular data


Conference

WWW '24: The ACM Web Conference 2024
May 13-17, 2024
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
