Sufficient Representations for Categorical Variables

Johannemann, Jonathan; Hadad, Vitor; Athey, Susan; Wager, Stefan

arXiv:1908.09874 (stat)

[Submitted on 26 Aug 2019 (v1), last revised 28 Oct 2021 (this version, v3)]

Abstract: Many learning algorithms require categorical data to be transformed into real vectors before it can be used as input. Often, categorical variables are encoded as one-hot (or dummy) vectors. However, this mode of representation can be wasteful since it adds many low-signal regressors, especially when the number of unique categories is large. In this paper, we investigate simple alternative solutions for universally consistent estimators that rely on lower-dimensional real-valued representations of categorical variables that are "sufficient" in the sense that no predictive information is lost. We then compare preexisting and proposed methods on simulated and observational datasets.
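
The dimensionality contrast described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the encoding used here (replacing each category with the within-category mean of the continuous covariates) is an assumption chosen for illustration, and the names `one_hot`, `category_means`, `groups`, and `X` are hypothetical. Whether such an encoding actually preserves all predictive information depends on modeling assumptions the abstract does not spell out.

```python
import numpy as np


def one_hot(groups, n_categories):
    """Encode integer category labels as one-hot rows (one column per category)."""
    out = np.zeros((len(groups), n_categories))
    out[np.arange(len(groups)), groups] = 1.0
    return out


def category_means(groups, X, n_categories):
    """Replace each label with the mean of X over rows sharing that label.

    Illustrative assumption only: the output has X.shape[1] columns,
    regardless of how many categories there are.
    """
    sums = np.zeros((n_categories, X.shape[1]))
    np.add.at(sums, groups, X)  # accumulate covariate sums per category
    counts = np.bincount(groups, minlength=n_categories)
    # Guard against empty categories to avoid division by zero.
    means = sums / np.maximum(counts, 1)[:, None]
    return means[groups]


# Simulated data: 1000 rows, 50 categories, 3 continuous covariates
# whose distribution shifts with the category label.
rng = np.random.default_rng(0)
n, k, p = 1000, 50, 3
groups = rng.integers(0, k, size=n)
X = rng.normal(size=(n, p)) + 0.1 * groups[:, None]

onehot = one_hot(groups, k)             # shape (n, k): grows with the number of categories
encoded = category_means(groups, X, k)  # shape (n, p): fixed by the number of covariates
```

In this sketch the one-hot matrix gains a column for every new category, while the mean-based encoding keeps the same width as the covariate matrix.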

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:1908.09874 [stat.ML]
	(or arXiv:1908.09874v3 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1908.09874

Submission history

From: Vitor Hadad
[v1] Mon, 26 Aug 2019 18:41:29 UTC (2,316 KB)
[v2] Sat, 15 Feb 2020 20:28:32 UTC (2,317 KB)
[v3] Thu, 28 Oct 2021 17:56:28 UTC (2,669 KB)
