Q4 2009 Reference Update: Part 4 of 4

Modeling Problems

By Matt Perone on January 5th, 2010 1 Comment
Categories: References

Well, we’ve missed the end of 2009, but here’s the fourth part of our Q4 2009 research update nonetheless. For this installment, we’ve uploaded 25 new papers to our reference database on some of the problems that come with trying to model a high-dimensional, noisy, non-stationary domain like the stock market. Suffice it to say there are many.

The first set of papers in this update deals with the ‘curse of dimensionality’. One of the biggest issues with modeling stock returns, ironically, is the great amount of data we have available to throw at the problem. Much of it is relevant information that a good model needs. However, many of these individual pieces of data will either have statistically significant relationships to each other, making them redundant, or turn out to have no relationship to stock returns at all. Both cases give an algorithm extra degrees of freedom, which seriously hurts its ability to find the meaningful information in a data set and relate it to stock returns. Worse, even if we include only good information in a model, many algorithms scale poorly as the size of the problem grows. With limited computing power and time, this is a big issue.
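As a rough illustration of the two cases above (features that duplicate each other and features unrelated to returns), here is a minimal Python sketch of a correlation screen. The feature names, the toy numbers, and both thresholds are invented for illustration, and this simple filter is not the method of any paper in this update.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def screen_features(features, target, min_target_corr=0.2, max_pair_corr=0.95):
    """Keep features that relate to the target and are not near-copies of a kept feature.

    Both thresholds are arbitrary assumptions chosen for this toy example.
    """
    # Consider the features most correlated with the target first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Case 1: no visible relationship to the target, so drop it.
        if abs(pearson(features[name], target)) < min_target_corr:
            continue
        # Case 2: almost perfectly correlated with a feature we already kept.
        if any(abs(pearson(features[name], features[k])) > max_pair_corr for k in kept):
            continue
        kept.append(name)
    return kept

# Hypothetical toy data: eight periods of returns (in percent) and four features.
returns = [1, -2, 3, -1, 2, -3, 1, -1]
momentum = [0.9, -2.1, 3.2, -0.8, 1.7, -2.9, 1.2, -1.1]
features = {
    "momentum": momentum,
    "momentum_x10": [10 * x for x in momentum],  # redundant rescaled copy
    "value": [1, 0, 1, -1, 0, -1, 1, -1],
    "noise": [1, -1, -1, 1, 1, 1, -1, -1],       # nearly unrelated to returns
}

kept = screen_features(features, returns)
print(kept)
```

On this toy data the screen keeps one of the two momentum series plus the value feature, and drops the noise feature. A linear, pairwise filter like this will miss non-linear relationships and interactions, which is part of what motivates the more careful methods discussed here.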

There are two major solutions to this problem that involve pre-processing data before it is sent to a classification algorithm. The first is called feature selection and involves identifying the set of features in a data set that are important for prediction and discarding the rest. This is the approach taken by Eugene Tuv et al. in their July 2009 paper “Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination”. Improving on this theme, Krupka, Navot, and Tishby choose a subset of features with the novel idea of learning the meta-properties of a good feature in “Learning to Select Features using their Properties”. The second major way to deal with a high-dimensional data set is to project the data onto a lower-dimensional manifold while trying to maintain as many properties of the full-sized data set as possible. This is typically achieved using methods like principal component analysis and random projections. Yann LeCun, of convolutional neural network fame, has a paper in this update written with Raia Hadsell and Sumit Chopra that introduces a promising new method called Dimensionality Reduction by Learning an Invariant Mapping (DrLIM).
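To make the projection idea concrete, here is a minimal Python sketch of a Gaussian random projection, one of the two standard methods named above. The dimensions (200 down to 40), the seeds, and the function names are arbitrary assumptions for illustration. It is not DrLIM, which learns its mapping rather than drawing it at random.

```python
import math
import random

def make_projection(input_dim, output_dim, seed=0):
    """Build an output_dim x input_dim matrix of scaled Gaussian entries.

    Scaling by 1/sqrt(output_dim) keeps squared distances unchanged in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
        for _ in range(output_dim)
    ]

def project(matrix, point):
    """Map one point to the lower-dimensional space (a matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: five points with 200 features each, reduced to 40 dimensions.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(5)]
matrix = make_projection(input_dim=200, output_dim=40, seed=2)
reduced = [project(matrix, p) for p in points]

# Compare one pairwise distance before and after the projection.
ratio = euclidean(reduced[0], reduced[1]) / euclidean(points[0], points[1])
print(f"distance ratio after projection: {ratio:.2f}")
```

The printed ratio shows how much one pairwise distance was stretched or shrunk. Values near 1 mean the geometry of the original data survived the reduction, and the ratio tends to move closer to 1 as the output dimension grows.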

The second set of papers in this update deals with ensembling techniques. It is increasingly apparent that the best practical approach to large, messy domains like the stock market is to combine the predictions of many heterogeneous individual prediction models into one ‘meta-model’. As evidence, ensembling techniques have won the Netflix Prize and nearly every other recent machine learning competition that we are aware of. Though this is an active field of development, a few heuristics are emerging to help guide us. For one example, see “Combining Heterogeneous Classifiers for Stock Selection” by Albanis and Batchelor, which suggests that unanimity voting trumps majority voting in our domain.
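The difference between the two voting rules can be sketched in a few lines of Python. The stock names and the votes below are made up, and the code only illustrates the rules themselves, not the exact procedure used by Albanis and Batchelor.

```python
def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def unanimity_vote(votes):
    """Return 1 only if there is at least one vote and every vote is 1, else 0."""
    return 1 if votes and all(v == 1 for v in votes) else 0

# Hypothetical buy (1) / don't buy (0) calls from three different classifiers.
predictions = {
    "stock_a": [1, 1, 1],
    "stock_b": [1, 1, 0],
    "stock_c": [0, 1, 0],
}

majority_picks = [name for name, votes in predictions.items() if majority_vote(votes)]
unanimous_picks = [name for name, votes in predictions.items() if unanimity_vote(votes)]

print("majority: ", majority_picks)   # ['stock_a', 'stock_b']
print("unanimous:", unanimous_picks)  # ['stock_a']
```

Unanimity selects fewer stocks, trading coverage for agreement: a stock is picked only when every model in the ensemble backs it.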

Read past the break for citations for a few of the most interesting papers, or continue to references for the entire set.

  1. Arnak S. Dalalyan, Anatoly Juditsky, and Vladimir Spokoiny, “A New Algorithm for Estimating the Effective Dimension-Reduction Subspace,” Journal of Machine Learning Research 9 (August 2008): 1647-1678.
  2. Raia Hadsell, Sumit Chopra, and Yann LeCun, “Dimensionality Reduction by Learning an Invariant Mapping,” February 2006.
  3. Eugene Tuv et al., “Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination,” Journal of Machine Learning Research 10 (July 2009): 1341-1366.
  4. Eyal Krupka, Amir Navot, and Naftali Tishby, “Learning to Select Features using their Properties,” Journal of Machine Learning Research 9 (October 2008): 2349-2376.
  5. Sivan Sabato and Shai Shalev-Shwartz, “Ranking Categorical Features Using Generalization Properties,” Journal of Machine Learning Research 9 (June 2008): 1083-1114.
  6. Jianqing Fan, Richard Samworth, and Yichao Wu, “Ultrahigh Dimensional Feature Selection: Beyond The Linear Model,” Journal of Machine Learning Research 10 (September 2009): 2013-2038.
  7. Luciana Ferrer, Kemal Sonmez, and Elizabeth Shriberg, “An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems,” Journal of Machine Learning Research 10 (August 2009): 2079-2114.
  8. George T. Albanis and Roy A. Batchelor, “Combining Heterogeneous Classifiers for Stock Selection,” September 1999.
  9. Hugo Jair Escalante, Manuel Montes, and Luis Enrique Sucar, “Particle Swarm Model Selection,” Journal of Machine Learning Research 10 (February 2009): 405-440.


One Response to “Modeling Problems”

  1. Liam Jobs says:

    Being a blog writer myself, I really appreciate the time you took in writing this article. I am currently reading it on my Blackberry and will scan it once I get home.
