
Why practitioners discretize their continuous data

Yihui asked this question yesterday. My supervisor, Dr. Hau, has also criticized the routine discretization of continuous data into groups. In classes in 2007 I encountered two plausible reasons for the practice: one negative, the other at least conditionally positive.

The first is a variant of the old Golden Hammer law: if the only tool is ANOVA, every continuous predictor needs discretization. The second reason is empirical: ANOVA with discretization steals degrees of freedom (df). Let's demo it with a diagram.
The red points are the population, and the black points are the sample. Which predicts the population better: the green continuous line or the discretized blue dashes? R simulation code is given.
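The diagram and the linked R code are not reproduced in this text. As a stand-in, here is a minimal Python sketch of one simulation of this kind. Everything specific in it is an assumption rather than the original setup: the curved population mean `4 * (x - 0.5) ** 2`, the noise level, the sample size, the four equal-width bins, and the function name `simulate`.

```python
import random

def simulate(seed=1, n_pop=2000, n_sample=60, n_bins=4, noise_sd=0.1):
    """Return (population MSE of the fitted line, population MSE of the bin means)."""
    rng = random.Random(seed)

    # Assumed population ("red"): x uniform on [0, 1), curved mean plus Gaussian noise.
    pop = []
    for _ in range(n_pop):
        x = rng.random()
        pop.append((x, 4 * (x - 0.5) ** 2 + rng.gauss(0, noise_sd)))

    # Sample ("black"): drawn from the population without replacement.
    sample = rng.sample(pop, n_sample)

    # "Green": ordinary least-squares line fitted to the sample.
    mx = sum(x for x, _ in sample) / n_sample
    my = sum(y for _, y in sample) / n_sample
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    sxy = sum((x - mx) * (y - my) for x, y in sample)
    slope = sxy / sxx
    intercept = my - slope * mx

    # "Blue": discretize x into equal-width bins and predict with each bin's sample mean.
    def bin_of(x):
        return min(int(x * n_bins), n_bins - 1)

    groups = [[] for _ in range(n_bins)]
    for x, y in sample:
        groups[bin_of(x)].append(y)
    # An empty bin falls back to the overall sample mean.
    means = [sum(g) / len(g) if g else my for g in groups]

    # Score both predictors against the whole population.
    mse_line = sum((y - (intercept + slope * x)) ** 2 for x, y in pop) / n_pop
    mse_bins = sum((y - means[bin_of(x)]) ** 2 for x, y in pop) / n_pop
    return mse_line, mse_bins

mse_line, mse_bins = simulate()
print(f"line: {mse_line:.4f}   bin means: {mse_bins:.4f}")
```

With a strongly curved mean like this one, the bin means should come out ahead of the straight line. Swapping in a straight-line mean is a quick way to see the condition: there the line would be expected to win, since the bins spend extra df on a curvature that is not present.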





{ 2 } Comments

  1. Yihui | March 8, 2009 at 7:26 am | Permalink

    The discretization here is essentially a kind of local smoothing technique that uses a constant kernel function. Generally speaking, local modeling can effectively improve the fit (a lower error sum of squares), but we have to be careful to avoid overfitting. If you discretize x into more intervals, the fit will be even better.
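One way to make the link to a constant kernel visible is to write both predictors as local averages. The notation below ($\hat m$, $B(x)$, $h$) is assumed for illustration and does not come from the comment: $B(x)$ is the interval that contains $x$, and $h$ is a half-width for a window centered at $x$.

```latex
% Bin means (discretization): average the y_i whose x_i fall in the same interval as x
\hat m_{\mathrm{bin}}(x) =
  \frac{\sum_{i=1}^{n} y_i \,\mathbf{1}\{x_i \in B(x)\}}
       {\sum_{i=1}^{n} \mathbf{1}\{x_i \in B(x)\}}

% Constant (boxcar) kernel smoother: same average, but the window moves with x
\hat m_{h}(x) =
  \frac{\sum_{i=1}^{n} y_i \,\mathbf{1}\{|x_i - x| \le h\}}
       {\sum_{i=1}^{n} \mathbf{1}\{|x_i - x| \le h\}}
```

The two differ only in the window: discretization fixes the intervals in advance, while the kernel version recenters the window at each $x$.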

  2. lixiaoxu | March 8, 2009 at 9:56 am | Permalink

    Residuals and errors are different. With more intervals, the squared residuals decrease while the squared errors increase. So the black points themselves, that is, discretization with the maximum number of intervals, predict the red population worst.

    Discretization fades out micro information (mostly errors) while highlighting macro information (usually nonlinear). Once LOESS is popular enough, discretization will be abandoned. Practitioners really do need local smoothing to preview the macro models they are concerned with.
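The distinction between residuals and errors can be checked with a small simulation. The Python sketch below rests on assumptions that are not from the thread: a noisy curved population, a sample of 200 points, and equal-count groups of the x-sorted sample (so that the finest split puts every sample point in its own group). The helper name `binned_errors` is hypothetical.

```python
import bisect
import random

def binned_errors(k, sample, pop):
    """Split the x-sorted sample into k consecutive groups of near-equal size and
    predict with each group's mean y. Return (mean squared residual on the sample,
    mean squared error on the population)."""
    s = sorted(sample)
    n = len(s)
    groups = [s[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(y for _, y in g) / len(g) for g in groups]
    # Cut points halfway between neighbouring groups decide which mean predicts a new x.
    cuts = [(groups[i][-1][0] + groups[i + 1][0][0]) / 2 for i in range(k - 1)]

    # Residuals: sample points against their own group mean.
    msr = sum((y - m) ** 2 for g, m in zip(groups, means) for _, y in g) / n
    # Errors: every population point against the mean of the group its x falls into.
    mse = sum((y - means[bisect.bisect_right(cuts, x)]) ** 2 for x, y in pop) / len(pop)
    return msr, mse

rng = random.Random(2009)
# Assumed population ("red"): curved mean plus fairly large Gaussian noise.
pop = [(x, 4 * (x - 0.5) ** 2 + rng.gauss(0, 0.5))
       for x in (rng.random() for _ in range(2000))]
# Sample ("black"): 200 points drawn from the population.
sample = rng.sample(pop, 200)

results = {k: binned_errors(k, sample, pop) for k in (2, 4, 8, 20, 50, 200)}
for k, (msr, mse) in results.items():
    print(f"{k:3d} groups  residual MS: {msr:.4f}  population MSE: {mse:.4f}")
```

At 200 groups every sample point is its own group, so the residuals vanish, yet the population error should be clearly larger than with a handful of groups. The direction of the population error at intermediate group counts depends on how curved the mean is relative to the noise.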

{ 2 } Trackbacks

  1. Keep on Fighting! | March 6, 2009 at 4:01 pm | Permalink

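    Discretization: an effective way to destroy information...

    If you want to obscure your data, discretize it! I don't know why so many people are so fond of discretizing continuous data. For example, when age data are clearly available, they insist on splitting them into a categorical variable such as elderly, children, youth, and adults for the analysis; when raw count data are clearly available...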

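  2. [...] Li Xiaoxu's blog: very professional, and one of the few blogs that writes its mathematical formulas in LaTeX. Mr. Li studies the details of statistical theory very carefully, much in the style of statistical researchers abroad; posts such as Why practitioners discretize their continuous data explain one of the reasons why people like to discretize continuous data. [...]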
