Research Papers by Mohit Dayal
Exploratory Projection Pursuit is an algorithm similar to principal component analysis: it too aims to visualize multidimensional data by projecting it linearly into a low-dimensional (often 2- or 3-dimensional) space.
To do this, projection pursuit methods use scalar functions computed from data projections to quantify how "interesting" the view shown in the projection is.
In this paper, we propose a new algorithm for exploratory projection pursuit. The basis of the algorithm is the insight that previous approaches to projection pursuit used fairly narrow definitions of interestingness and non-interestingness. We argue that allowing these definitions to depend on the problem or data at hand is a more natural approach in an exploratory technique. This also gives our technique a much greater scope of applicability than the approaches extant in the literature.
Complementing this insight, we propose a class of projection indices based on the spatial distribution function that can make use of such information.
Finally, with the help of real datasets, we demonstrate how a range of multivariate exploratory tasks can be addressed with our algorithm. The examples further demonstrate that the proposed technique is quite capable of focusing on the interesting structure in the data, even when this structure is otherwise hard to detect or arises from very subtle patterns.
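The core loop of any projection pursuit method (compute an index on a projection, search for directions that maximize it) can be sketched in a few lines. The kurtosis-based index below is a generic stand-in, not the spatial-distribution indices proposed in the paper, and the random search is the crudest possible optimizer:

```python
import numpy as np

def projection_index(z):
    """A simple 'interestingness' index: departure of the projected data
    from normality, measured by absolute excess kurtosis. This is a
    hypothetical stand-in for the paper's spatial-distribution indices."""
    z = (z - z.mean()) / z.std()
    return abs((z**4).mean() - 3.0)

def pursue(X, n_tries=2000, rng=None):
    """Crude random-search projection pursuit: draw random unit
    directions and keep the one with the highest index value."""
    rng = np.random.default_rng(rng)
    best_a, best_val = None, -np.inf
    for _ in range(n_tries):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)
        val = projection_index(X @ a)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

# Toy data: 5-D Gaussian noise with hidden bimodal structure along axis 0
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
X[:, 0] += np.where(rng.random(500) < 0.5, -3.0, 3.0)
a, val = pursue(X, rng=1)  # the found direction concentrates on axis 0
```

A real implementation would optimize the direction by gradient ascent rather than random search; the sketch only shows the index-driven structure of the algorithm.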
Conference Presentations by Mohit Dayal
For the purpose of visualization, it is often useful to think of a linear projection of multivariate data as viewing the high-dimensional point cloud from a particular direction. Thus, if one utilizes several projections together, useful understanding of the features of the data may be gained.
DIVE (Drawing Interactively for Visual Exploration) is a new system for multivariate data visualization based on this principle. Prime features of DIVE are its easy comprehensibility and extremely high level of interactivity, unsurpassed by any such system for multivariate data.
DIVE allows the user to display an arbitrary number of (static) data projections, selected on any criteria. Next, a rich variety of point manipulation tools is presented, allowing the analyst to specify the kind of projections she wants to see next. This is akin to specifying a composite hypothesis about the features of the high-dimensional point cloud, which the system then attempts to validate.
Internally, these manipulation controls are implemented in the framework of context-driven exploratory projection pursuit.
Term Papers by Mohit Dayal
In this article we aim to predict movie grosses from script information alone. Such predictions would be useful at very early stages of a movie's production, when a money commitment has not yet been made on the script.
Our analysis was done using two different models: a multiple regression model and a regression tree. Using the ratio of the predictions to the actual box office returns, we found that both models performed quite well.
Finally, we proposed a distance function to measure closeness of scripts yet to be produced to those already produced. This gives us a practical way to apply our models.
The data at hand are weekly foreign exchange bid rates, expressed in terms of 1 Indian Rupee. The data extend from 1 Jan 2003 to 31 Dec 2009. Specifically, we look at four currencies: the US Dollar (USD), the British Pound (GBP), the Euro (EUR) and the Japanese Yen (JPY). The data have been sourced from oanda.com using their historical rates feature. Some of the questions that we entertain in respect of these data are:
• Do the rates behave like a stationary process? If yes, what are the dynamics?
• Are there structural breaks in the series?
• If the series are not stationary, are some linear combinations of them stationary?
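These questions can be illustrated on simulated data with plain numpy: an AR(1) coefficient near 1 signals a unit root (non-stationarity), while a stationary linear combination of two trending series is the essence of cointegration. The series and the `ar1_coef` helper below are illustrative constructions, not the paper's actual analysis, which would use formal tests such as augmented Dickey-Fuller:

```python
import numpy as np

def ar1_coef(x):
    """OLS estimate of phi in x_t = phi * x_{t-1} + e_t (no intercept).
    phi near 1 suggests a unit root; |phi| well below 1 suggests a
    mean-reverting (stationary) series."""
    x0, x1 = x[:-1], x[1:]
    return (x0 @ x1) / (x0 @ x0)

rng = np.random.default_rng(42)
n = 365  # roughly one weekly observation over seven years
walk = np.cumsum(rng.standard_normal(n))          # simulated non-stationary trend
usd = walk + 0.3 * rng.standard_normal(n)         # two "exchange rates" sharing
eur = 1.5 * walk + 0.3 * rng.standard_normal(n)   # a common stochastic trend
spread = eur - 1.5 * usd                          # a cointegrating combination

phi_walk = ar1_coef(usd)       # close to 1: unit root
phi_spread = ar1_coef(spread)  # close to 0: stationary combination
```

The contrast between `phi_walk` and `phi_spread` is exactly the third bullet above: the individual series are not stationary, but a linear combination of them is.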
Most modern spam filters work by first reading all the emails, from which a machine representation of the contents is created. A variety of machine representations are known: bag of words, bigram proximity matrix, etc. In the second step, a classifier is trained on this machine representation to classify emails as either spam or not spam.
One problem with all such spam filters is that, to build the machine representations, the machine must actually read all of the emails at some point. This is obviously a violation of privacy.
In this paper, I argue in favor of a spam filter that preserves the user's privacy in every aspect. In fact, the beauty of my spam filter is that it never even needs to see the emails themselves at any point.
In my design, the emails always reside on the user's computer or other such trusted device. A program on this device then extracts certain summary features from the emails, and submits these to the spam filter for classification.
The features extracted in my proposal were in fact quite simple: the number of words, the number of capital letters, the number of punctuation marks, the length of the subject line, and some others. A total of 14 such features were extracted, using nothing more than the GNU core utilities. On these features, simple classifiers were trained: the Nadaraya-Watson kernel estimator and a resampling version of linear discriminant analysis (LDA).
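A sketch of such content-free feature extraction, in Python rather than the GNU core utilities actually used, with four illustrative features standing in for the full set of 14 (the function name and example email are hypothetical):

```python
import string

def summary_features(raw_email: str) -> dict:
    """Extract content-free summary features from an email's raw text,
    so the spam filter never needs to see the words themselves."""
    lines = raw_email.splitlines()
    subject = next((l for l in lines if l.lower().startswith("subject:")), "")
    return {
        "n_words": len(raw_email.split()),
        "n_capitals": sum(c.isupper() for c in raw_email),
        "n_punctuation": sum(c in string.punctuation for c in raw_email),
        "subject_length": max(len(subject) - len("Subject:"), 0),
    }

email = "Subject: WIN BIG $$$\nCLICK now!!! Free offer, act FAST."
feats = summary_features(email)  # only counts leave the trusted device
```

Only this small dictionary of counts would be submitted to the remote classifier; the raw text never leaves the user's machine.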
Classification accuracy (measured by AUC) scaled up to 0.80, which was very encouraging given that no model of the email contents was used.
In the first part of this paper, we briefly review how neuro-dynamic programming (reinforcement learning) extends traditional dynamic programming to overcome the curse of dimensionality. In the second part, we give a brief review of ensemble methods: how they work and why we might require pruning. In the third part, we review the paper by Ioannis Partalas, Grigorios Tsoumakas and Ioannis Vlahavas ["Pruning an ensemble of classifiers via reinforcement learning", Neurocomputing 72 (2009)]. This appears to be the first paper to propose using Reinforcement Learning to prune classifier ensembles. The final part is speculative: we examine some ideas from this paper for using Reinforcement Learning to build a neighborhood map of points in high-dimensional space, and examine how best we may combine 2-D projections of multivariate data to classify or interactively explore such data via their neighborhoods.
Talks by Mohit Dayal
One is used to thinking of data as a list of numbers, but in the modern age data can take on a variety of forms. Functional Data Analysis is the branch of statistics concerned with analyzing data that arise as functions. Examples of such data include handwriting, MRI images and radar waves.
In this presentation, I review a novel technique for clustering such functional data, proposed by Frédéric Ferraty and Philippe Vieu. That is, one can put "similar" curves together and separate them from others that are different.
Such clustering is useful for several purposes. For instance, separating diseased tissues from healthy ones using MRIs, etc.
The technique reviewed in these slides was applied by the authors to study radar waveforms. A radar waveform is nothing but the reflected radio wave received back at the radar source. The very fact that the physical form of these data is a wave means that the natural domain for analyzing them is functional data analysis.
The authors were able to show that their proposed technique works well and was able to "automatically [produce] homogeneous subgroups that can be easily interpreted in terms of differences of grounds (and in the Amazonian basin in terms of river, lake, vegetation, etc.)"
This is a talk related to the term paper "Pruning Classifier Ensembles via Reinforcement Learning and its possible Application to combining Data Projections".
The slides are a gentle introduction to what an ensemble of classifiers is, why such ensembles are useful, and why simply increasing the number of models in the ensemble is not always helpful. I next give a very brief introduction to Dynamic Programming and its extension, Neuro-Dynamic Programming or Reinforcement Learning. I conclude with a brief introduction to the idea of using Reinforcement Learning to build point neighborhood maps; this last is still ongoing research of mine.
In this talk, I give a brief and very gentle introduction to what Dempster-Shafer theory is and how it relates to the usual probability theory.
In particular, I introduce and explain basic probability assignments (bpas) with examples, show what belief functions are, and illustrate how bpas can be combined.
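Combination of bpas is usually done with Dempster's rule: multiply the masses of intersecting focal elements, and renormalize away the mass that falls on conflicting (empty-intersection) pairs. A minimal sketch, over a hypothetical two-element frame:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two basic probability
    assignments over the same frame. Focal elements are frozensets;
    mass landing on the empty set (conflict) is renormalized away."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    k = 1.0 - conflict
    return {s: v / k for s, v in combined.items()}

# Two sources of evidence over the frame {rain, sun}
frame = frozenset({"rain", "sun"})
m1 = {frozenset({"rain"}): 0.6, frame: 0.4}  # 0.6 to rain, rest uncommitted
m2 = {frozenset({"rain"}): 0.3, frozenset({"sun"}): 0.2, frame: 0.5}
m = combine(m1, m2)  # combined bpa; masses again sum to 1
```

Note how mass assigned to the whole frame expresses ignorance, something a single probability distribution cannot do; this is the feature that distinguishes bpas from ordinary probabilities.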
In the last part, I review the paper "A k-nearest neighbor classification rule based on Dempster-Shafer theory" [Denoeux (1995), IEEE Transactions on Systems, Man and Cybernetics]. This is with a view to illustrating how one can use this rather unusual theory of probability to build new Machine Learning algorithms.
In this talk, I further develop the idea of using Dempster-Shafer theory as a basis for Machine Learning. In particular, I propose one way of representing Machine Learning computations as Dempster-Shafer sets. Finally, I extend the representation to build a classifier.
This talk is basically a review of the paper of the same name, authored by Ker-Chau Li and published in the Annals of Statistics, 1997.
"It is not uncommon to find nonlinear patterns in the scatterplots of regressor variables. But how such findings affect standard regression analysis remains largely unexplored. This article offers a theory on nonlinear confounding, a term for describing the situation where a certain nonlinear relationship in regressors leads to difficulties in modeling and related analysis of the data. The theory begins with a measure of nonlinearity between two regressor variables. It is then used to assess nonlinearity between any two projections from the high-dimensional regressor and a method of finding most nonlinear projections is given. Nonlinear confounding is addressed by taking a fresh new look at fundamental issues such as the validity of prediction and inference, diagnostics, regression surface approximation, model uncertainty and Fisher information loss." (Abstract of the Original Paper)
The Java language and, more broadly, the JVM platform (as a host for several languages like Scala, Clojure and JRuby) is slowly emerging as the data analysis platform of the future. It is therefore quite surprising that high-performance graphics libraries for data visualization are still lacking on the Java platform. This causes serious impediments for the data analyst.
In this talk, we propose using the high-quality and proven Qt graphics library to visualize data from the JVM. Qt is fast, reliable, and available for almost all major platforms for which Java is.
Several advantages result from this choice. First, we are saved from re-inventing the wheel, since the graphics stack and low level programming of Qt need not be re-done. Next, the extensive library of Qt widgets can be used as controls for our plot objects. Finally, since the visualization is performed outside of the JVM, any crashes of the graphics system leave us unaffected.
Lexical scoping is a feature of R whereby the definition of a variable or function is first sought in the environment where the function was defined; if it is not found there, the language undertakes a hierarchical search for the symbol's definition in each enclosing environment, starting from the most immediate and ending with the global environment.
This talk focuses on how one can use lexical scoping to write cleaner code that is more concise, and even runs faster, than comparable code written without this feature.
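R itself is not shown here, but Python closures follow the same lexical rule, so a short Python sketch can illustrate the mechanism (the function names are illustrative; R expresses the mutable-counter idiom with `<<-` where Python uses `nonlocal`):

```python
def make_power(exponent):
    """The returned function resolves `exponent` by lexical scoping:
    the lookup starts in the enclosing environment where the function
    was defined, not where it is called (the same rule R follows)."""
    def power(x):
        return x ** exponent  # `exponent` found in the enclosing scope
    return power

square, cube = make_power(2), make_power(3)

def make_counter():
    """Mutable state hidden in the enclosing environment: the closure
    idiom behind cleaner, more concise code in lexically scoped languages."""
    count = 0
    def bump():
        nonlocal count
        count += 1
        return count
    return bump

tick = make_counter()
results = (square(4), cube(2), tick(), tick())  # (16, 8, 1, 2)
```

Because `count` lives only in the closure's environment, no global variable is needed, which is precisely the kind of cleanliness the talk advocates.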
Teaching Documents by Mohit Dayal