Research Papers by Mohit Dayal
Exploratory Projection Pursuit is an algorithm similar to principal component analysis: it too aims to visualize multidimensional data by projecting it linearly into a low-dimensional (often 2- or 3-dimensional) space.
To do this, projection pursuit methods use scalar functions computed from data projections to quantify how "interesting" the view shown in the projection is.
In this paper, we propose a new algorithm for exploratory projection pursuit. The basis of the algorithm is the insight that previous approaches to projection pursuit used fairly narrow definitions of interestingness and non-interestingness. We argue that allowing these definitions to depend on the problem or data at hand is a more natural approach in an exploratory technique. This also gives our technique a much greater scope of applicability than the approaches extant in the literature.
Complementing this insight, we propose a class of projection indices based on the spatial distribution function that can make use of such information.
Finally, with the help of real datasets, we demonstrate how a range of multivariate exploratory tasks can be addressed with our algorithm. The examples further demonstrate that the proposed technique is quite capable of focusing on the interesting structure in the data, even when this structure is otherwise hard to detect or arises from very subtle patterns.
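The core loop of any projection pursuit method (compute an index on a projection, search for directions that maximize it) can be sketched in a few lines. The kurtosis-based index below is a generic stand-in, not the spatial-distribution indices proposed in the paper, and the random search is the crudest possible optimizer:

```python
import numpy as np

def projection_index(z):
    """A simple 'interestingness' index: departure of the projected data
    from normality, measured by absolute excess kurtosis. This is a
    hypothetical stand-in for the paper's spatial-distribution indices."""
    z = (z - z.mean()) / z.std()
    return abs((z**4).mean() - 3.0)

def pursue(X, n_tries=2000, rng=None):
    """Crude random-search projection pursuit: draw random unit
    directions and keep the one with the highest index value."""
    rng = np.random.default_rng(rng)
    best_a, best_val = None, -np.inf
    for _ in range(n_tries):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)
        val = projection_index(X @ a)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

# Toy data: 5-D Gaussian noise with hidden bimodal structure along axis 0
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
X[:, 0] += np.where(rng.random(500) < 0.5, -3.0, 3.0)
a, val = pursue(X, rng=1)  # the found direction concentrates on axis 0
```

A real implementation would optimize the direction by gradient ascent rather than random search; the sketch only shows the index-driven structure of the algorithm.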
Conference Presentations by Mohit Dayal
For the purpose of visualization, it is often useful to think of a linear projection of multivariate data as viewing the high-dimensional point cloud from a particular direction. Thus, if one utilizes several projections together, useful understanding of the features of the data may be gained.
DIVE (Drawing Interactively for Visual Exploration) is a new system for multivariate data visualization based on this principle. Prime features of DIVE are its easy comprehensibility and extremely high level of interactivity, unsurpassed by any such system for multivariate data.
DIVE allows the user to display an arbitrary number of (static) data projections, selected on any criteria. Next, a rich variety of point manipulation tools is presented, allowing the analyst to specify the kind of projections she wants to see next. This is akin to specifying a composite hypothesis about the features of the high-dimensional point cloud, which the system then attempts to validate.
Internally, these manipulation controls are implemented in the framework of context-driven exploratory projection pursuit.
Term Papers by Mohit Dayal
In this article we aim to predict movie grosses from script information alone. Such predictions would be useful at very early stages of a movie's production, when a money commitment has not yet been made on the script.
Our analysis was done using two different models: a multiple regression model and a regression tree. Using the ratio of the predictions to the actual box office returns, we found that both models performed quite well.
Finally, we proposed a distance function to measure closeness of scripts yet to be produced to those already produced. This gives us a practical way to apply our models.
The data at hand are weekly foreign exchange bid rates, expressed in terms of 1 Indian Rupee. The data extend from 1 Jan 2003 to 31 Dec 2009. Specifically, we look at four currencies: the US Dollar (USD), the British Pound (GBP), the Euro (EUR) and the Japanese Yen (JPY). The data have been sourced from oanda.com using their historical rates feature. Some of the questions that we entertain in respect of these data are:
• Do the rates behave like a stationary process? If yes, what are the dynamics?
• Are there structural breaks in the series?
• If the series are not stationary, are some linear combinations of them stationary?
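These questions can be illustrated on simulated data with plain numpy: an AR(1) coefficient near 1 signals a unit root (non-stationarity), while a stationary linear combination of two trending series is the essence of cointegration. The series and the `ar1_coef` helper below are illustrative constructions, not the paper's actual analysis, which would use formal tests such as augmented Dickey-Fuller:

```python
import numpy as np

def ar1_coef(x):
    """OLS estimate of phi in x_t = phi * x_{t-1} + e_t (no intercept).
    phi near 1 suggests a unit root; |phi| well below 1 suggests a
    mean-reverting (stationary) series."""
    x0, x1 = x[:-1], x[1:]
    return (x0 @ x1) / (x0 @ x0)

rng = np.random.default_rng(42)
n = 365  # roughly one weekly observation over seven years
walk = np.cumsum(rng.standard_normal(n))          # simulated non-stationary trend
usd = walk + 0.3 * rng.standard_normal(n)         # two "exchange rates" sharing
eur = 1.5 * walk + 0.3 * rng.standard_normal(n)   # a common stochastic trend
spread = eur - 1.5 * usd                          # a cointegrating combination

phi_walk = ar1_coef(usd)       # close to 1: unit root
phi_spread = ar1_coef(spread)  # close to 0: stationary combination
```

The contrast between `phi_walk` and `phi_spread` is exactly the third bullet above: the individual series are not stationary, but a linear combination of them is.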
Most modern spam filters work by first reading all the emails, from which a machine representation of the contents is created. A variety of machine representations are known: bag of words, bigram proximity matrix, etc. In the second step, a classifier is trained on this machine representation to classify emails as either spam or not spam.
One problem with all such spam filters is that, to build the machine representations, the machine must actually read all of the emails at some point. This is obviously a violation of privacy.
In this paper, I argue in favor of a spam filter that preserves the user's privacy in every aspect. In fact, the beauty of my spam filter is that it never even needs to see the emails themselves at any point.
In my design, the emails always reside on the user's computer or other such trusted device. A program on this device then extracts certain summary features from the emails, and submits these to the spam filter for classification.
The features extracted in my proposal were in fact quite simple: the number of words, the number of capital letters, the number of punctuation marks, the length of the subject line, and some others. A total of 14 such features were extracted, using nothing more than the GNU core utilities. On these features, simple classifiers were trained: the Nadaraya-Watson kernel estimator and a resampling version of linear discriminant analysis (LDA).
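A sketch of such content-free feature extraction, in Python rather than the GNU core utilities actually used, with four illustrative features standing in for the full set of 14 (the function name and example email are hypothetical):

```python
import string

def summary_features(raw_email: str) -> dict:
    """Extract content-free summary features from an email's raw text,
    so the spam filter never needs to see the words themselves."""
    lines = raw_email.splitlines()
    subject = next((l for l in lines if l.lower().startswith("subject:")), "")
    return {
        "n_words": len(raw_email.split()),
        "n_capitals": sum(c.isupper() for c in raw_email),
        "n_punctuation": sum(c in string.punctuation for c in raw_email),
        "subject_length": max(len(subject) - len("Subject:"), 0),
    }

email = "Subject: WIN BIG $$$\nCLICK now!!! Free offer, act FAST."
feats = summary_features(email)  # only counts leave the trusted device
```

Only this small dictionary of counts would be submitted to the remote classifier; the raw text never leaves the user's machine.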
Classification accuracy (measured by AUC) scaled up to 0.80, which was very encouraging given that no model of the email contents was used.
In the first part of this paper, we briefly review how neuro-dynamic programming (reinforcement learning) extends traditional dynamic programming to overcome the curse of dimensionality. In the second part, we give a brief review of ensemble methods: how they work and why we might require pruning. In the third part, we review the paper by Ioannis Partalas, Grigorios Tsoumakas and Ioannis Vlahavas ["Pruning an ensemble of classifiers via reinforcement learning", Neurocomputing 72 (2009)]. This appears to be the first paper to propose using Reinforcement Learning to prune classifier ensembles. The final part is speculative: we examine some ideas from this paper for using Reinforcement Learning to build a neighborhood map of points in high-dimensional space, and examine how best we may combine 2-D projections of multivariate data to classify or interactively explore such data via their neighborhoods.
Talks by Mohit Dayal
One is used to thinking of data as a list of numbers, but in the modern age data can take on a variety of forms. Functional Data Analysis is the branch of statistics concerned with analyzing data that arise as functions. Examples of such data include handwriting, MRI images and radar waves.
In this presentation, I review a novel technique for clustering such functional data, proposed by Frédéric Ferraty and Philippe Vieu. That is, one can put "similar" curves together and separate them from others that are different.
Such clustering is useful for several purposes. For instance, separating diseased tissues from healthy ones using MRIs, etc.
The technique reviewed in these slides was applied by the authors to study radar waveforms. A radar waveform is nothing but the reflected radio wave received back at the radar source. The very fact that the physical form of these data is a wave means that the natural domain for analyzing them is functional data analysis.
The authors were able to show that their proposed technique works well and was able to "automatically [produce] homogeneous subgroups that can be easily interpreted in terms of differences of grounds (and in the Amazonian basin in terms of river, lake, vegetation, etc.)"
This is a talk related to the term paper "Pruning Classifier Ensembles via Reinforcement Learning and its possible Application to combining Data Projections".
The slides are a gentle introduction to what an ensemble of classifiers is, why such ensembles are useful, and why simply increasing the number of models in the ensemble is not always helpful. I next give a very brief introduction to Dynamic Programming and its extension, Neuro-Dynamic Programming or Reinforcement Learning. I conclude with a brief introduction to the idea of using Reinforcement Learning to build point neighborhood maps; this last is still ongoing research of mine.
In this talk, I give a brief and very gentle introduction to what Dempster-Shafer theory is and how it relates to the usual probability theory.
In particular, I introduce and explain basic probability assignments (bpas) with examples, show what belief functions are, and illustrate how bpas can be combined.
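Combination of bpas is usually done with Dempster's rule: multiply the masses of intersecting focal elements, and renormalize away the mass that falls on conflicting (empty-intersection) pairs. A minimal sketch, over a hypothetical two-element frame:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two basic probability
    assignments over the same frame. Focal elements are frozensets;
    mass landing on the empty set (conflict) is renormalized away."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    k = 1.0 - conflict
    return {s: v / k for s, v in combined.items()}

# Two sources of evidence over the frame {rain, sun}
frame = frozenset({"rain", "sun"})
m1 = {frozenset({"rain"}): 0.6, frame: 0.4}  # 0.6 to rain, rest uncommitted
m2 = {frozenset({"rain"}): 0.3, frozenset({"sun"}): 0.2, frame: 0.5}
m = combine(m1, m2)  # combined bpa; masses again sum to 1
```

Note how mass assigned to the whole frame expresses ignorance, something a single probability distribution cannot do; this is the feature that distinguishes bpas from ordinary probabilities.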
In the last part, I review the paper "A k-nearest neighbor classification rule based on Dempster-Shafer theory" [Denoeux (1995), IEEE Transactions on Systems, Man and Cybernetics]. This is with a view to illustrating how one can use this rather unusual theory of probability to build new Machine Learning algorithms.
In this talk, I further develop the idea of using Dempster-Shafer theory as a basis for Machine Learning. In particular, I propose one way of representing Machine Learning computations as Dempster-Shafer sets. Finally, I extend the representation to build a classifier.
This talk is basically a review of the paper of the same name, authored by Ker-Chau Li and published in the Annals of Statistics, 1997.
"It is not uncommon to find nonlinear patterns in the scatterplots of regressor variables. But how such findings affect standard regression analysis remains largely unexplored. This article offers a theory on nonlinear confounding, a term for describing the situation where a certain nonlinear relationship in regressors leads to difficulties in modeling and related analysis of the data. The theory begins with a measure of nonlinearity between two regressor variables. It is then used to assess nonlinearity between any two projections from the high-dimensional regressor and a method of finding most nonlinear projections is given. Nonlinear confounding is addressed by taking a fresh new look at fundamental issues such as the validity of prediction and inference, diagnostics, regression surface approximation, model uncertainty and Fisher information loss." (Abstract of the Original Paper)
The Java language and, more broadly, the JVM platform (as a host for several languages like Scala, Clojure and JRuby) is slowly emerging as the data analysis platform of the future. It is therefore quite surprising that high-performance graphics libraries for data visualization are still lacking on the Java platform. This causes serious impediments for the data analyst.
In this talk, we propose using the high-quality and proven Qt graphics library to visualize data from the JVM. Qt is fast, reliable, and available for almost all major platforms for which Java is.
Several advantages result from this choice. First, we are saved from re-inventing the wheel, since the graphics stack and low level programming of Qt need not be re-done. Next, the extensive library of Qt widgets can be used as controls for our plot objects. Finally, since the visualization is performed outside of the JVM, any crashes of the graphics system leave us unaffected.
Lexical scoping is a feature of R whereby the definition of a variable or function is first sought in the environment where the function was defined; if it is not found there, the language undertakes a hierarchical search for the symbol's definition in each enclosing environment, starting from the most immediate and ending with the global environment.
This talk focuses on how one can use lexical scoping to write cleaner code that is more concise, and even runs faster, than comparable code written without this feature.
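R itself is not shown here, but Python closures follow the same lexical rule, so a short Python sketch can illustrate the mechanism (the function names are illustrative; R expresses the mutable-counter idiom with `<<-` where Python uses `nonlocal`):

```python
def make_power(exponent):
    """The returned function resolves `exponent` by lexical scoping:
    the lookup starts in the enclosing environment where the function
    was defined, not where it is called (the same rule R follows)."""
    def power(x):
        return x ** exponent  # `exponent` found in the enclosing scope
    return power

square, cube = make_power(2), make_power(3)

def make_counter():
    """Mutable state hidden in the enclosing environment: the closure
    idiom behind cleaner, more concise code in lexically scoped languages."""
    count = 0
    def bump():
        nonlocal count
        count += 1
        return count
    return bump

tick = make_counter()
results = (square(4), cube(2), tick(), tick())  # (16, 8, 1, 2)
```

Because `count` lives only in the closure's environment, no global variable is needed, which is precisely the kind of cleanliness the talk advocates.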
Teaching Documents by Mohit Dayal