



Large-Scale Machine Learning
Overview
In many domains, data now arrives faster than we are able to learn
from it. To avoid wasting this data, we must switch from the
traditional "one-shot" machine learning approach to systems that are
able to mine continuous, high-volume, open-ended data streams as they
arrive. We have identified a set of desiderata for such systems, and
developed an approach to building stream mining algorithms that
satisfies all of them. The approach is based on explicitly minimizing
the number of examples used in each learning step, while guaranteeing
that user-defined targets for predictive performance are met. So far,
we have applied this approach to four major (and widely differing)
types of learner: decision tree induction, Bayesian network learning,
k-means clustering, and the EM algorithm for mixtures of Gaussians.
Our versions of these algorithms are able to mine orders of magnitude
more data than the best previous algorithms (e.g., our decision tree
learner can mine on the order of a billion examples per day on an
ordinary PC). We are currently applying our approach to the difficult
problem of large-scale relational learning, and have already obtained
an order-of-magnitude speedup on a Web prediction task. We have released a
beta version of the VFML
toolkit with our current suite of stream mining algorithms. Our
ultimate goal is to develop a set of primitives (or, more generally, a
language) such that any learning algorithm built using them scales
automatically to arbitrarily large data streams.
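The idea of using only as many examples as a learning step needs can be illustrated with a small sketch. The overview above does not name the statistical test involved; the code below assumes a Hoeffding bound, and the function names, the `gap_range` parameter, and the two-candidate setup are hypothetical choices made for illustration, not the VFML API. It reads paired scores for two candidate choices from a stream and stops as soon as the observed gap between them exceeds the bound for the user-supplied error probability `delta`.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability at least 1 - delta, the mean
    of n independent samples of a variable spanning `value_range` is within
    epsilon of its true mean (Hoeffding's inequality)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_with_fewest_examples(stream, delta, gap_range=2.0):
    """Consume (score_a, score_b) pairs until one candidate is ahead by more
    than the Hoeffding bound, then stop reading the stream.

    Assumption: each score lies in [0, 1], so the per-example gap
    score_a - score_b spans a range of 2.0 (the default `gap_range`).
    Returns (winner, examples_used); winner is None if the stream ends
    before the gap becomes significant.
    """
    total_gap = 0.0
    n = 0
    for score_a, score_b in stream:
        n += 1
        total_gap += score_a - score_b
        mean_gap = total_gap / n
        if abs(mean_gap) > hoeffding_bound(gap_range, delta, n):
            return ("a" if mean_gap > 0 else "b"), n
    return None, n
```

A tighter `delta` (a stricter performance target) makes the bound larger and so requires more examples before a decision is made; a clear gap between candidates lets the step finish after very few. This sketch checks the bound after every example, which a production system would likely do less often.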
Software
VFML (Very Fast
Machine Learning)
Publications
 P. Domingos and G. Hulten,
Mining High-Speed Data Streams. Proceedings of the Sixth International
Conference on Knowledge Discovery and Data Mining (pp. 71-80), 2000.
Boston, MA: ACM Press.
 P. Domingos and G. Hulten,
A General Method for Scaling Up Machine Learning Algorithms and its
Application to Clustering. Proceedings of the Eighteenth
International Conference on Machine Learning (pp. 106-113), 2001.
Williamstown, MA: Morgan Kaufmann.
 P. Domingos and G. Hulten,
Learning from Infinite Data in Finite Time. Advances in Neural
Information Processing Systems 14 (pp. 673-680), 2002. Cambridge, MA:
MIT Press.
 G. Hulten and P. Domingos,
Mining Complex Models from Arbitrarily Large Databases in Constant
Time. Proceedings of the Eighth International Conference on
Knowledge Discovery and Data Mining (pp. 525-531), 2002. Edmonton,
Canada: ACM Press.
 G. Hulten, P. Domingos and Y. Abe,
Mining Massive Relational Databases. Proceedings of the
IJCAI-2003 Workshop on Learning Statistical Models from Relational
Data, 2003. Acapulco, Mexico: IJCAII.
 G. Hulten, P. Domingos and L. Spencer,
Mining Time-Changing Data Streams. Proceedings of the Seventh
International Conference on Knowledge Discovery and Data Mining
(pp. 97-106), 2001. San Francisco, CA: ACM Press.
