Regression analysis can imply a broader range
of techniques than is ordinarily appreciated. Statisticians
commonly define regression so that the goal is to understand
“as far as possible with the available data how the
conditional distribution of some response y varies across
subpopulations determined by the possible values of the
predictor or predictors” (Cook and Weisberg, 1999: 27). For
example, if there is a single categorical predictor such as
sex (male or female), a legitimate regression analysis has been
undertaken if one compares two income histograms, one for
men and one for women. Or, one might compare summary
statistics from the two income distributions: the mean
incomes, the median incomes, the two standard deviations of
income, and so on. One might also compare the shapes of the
two distributions with a Q-Q plot.
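The comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incomes are synthetic log-normal draws (not real data), the group sizes and parameters are arbitrary, and the names income_men, income_women, summarize, and qq_pairs are hypothetical. No model is fit and no test is run; the code simply describes how the distribution of income differs between the two subpopulations.

```python
import random
import statistics

# ASSUMPTION: synthetic, hypothetical incomes drawn from log-normal
# distributions; the parameters are arbitrary and purely illustrative.
rng = random.Random(0)
income_men = [rng.lognormvariate(10.6, 0.5) for _ in range(500)]
income_women = [rng.lognormvariate(10.4, 0.6) for _ in range(400)]


def summarize(y):
    """Summarize the distribution of y within one subpopulation."""
    return {
        "mean": statistics.mean(y),
        "median": statistics.median(y),
        "sd": statistics.stdev(y),  # sample standard deviation
    }


# One set of summary statistics per value of the categorical predictor.
summary = {"men": summarize(income_men), "women": summarize(income_women)}

# Numerical analogue of a Q-Q plot: pair the same quantiles
# (5%, 10%, ..., 95%) from the two income distributions.
qq_pairs = list(
    zip(
        statistics.quantiles(income_men, n=20),
        statistics.quantiles(income_women, n=20),
    )
)

for group, stats in summary.items():
    print(group, {name: round(value) for name, value in stats.items()})
for q_men, q_women in qq_pairs:
    print(round(q_men), round(q_women))
```

If the paired quantiles fall close to a straight line when plotted against each other, the two distributions have a similar shape; systematic departures from a line indicate differences in shape beyond location and spread.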
There is no requirement in regression
analysis for there to be a “model” by which the data were
supposed to be generated. There is no need to address cause
and effect. And there is no need to undertake statistical
tests or construct confidence intervals. The definition of a
regression analysis can be met by pure description alone.
Construction of a “model,” often coupled with causal and
statistical inference, is a supplement to a regression
analysis, not a necessary component (Berk, 2003).
Given such a definition of regression
analysis, a wide variety of techniques and approaches can be
applied. In this chapter I will consider a range of
procedures under the broad rubric of data mining.