Log transforming variables with zero values

We often come across cases where we want to log transform a variable that has zero or negative values. The problem is that the log of zero (or a negative number) is undefined. What can you do in such cases?

One possibility is to delete all non-positive observations. This is only sensible if the occurrence of zero or negative values is random. If it is non-random, this approach introduces a selection bias. Consider, for example, the relationship between individuals’ expenditure on second homes and their wealth. Since less wealthy individuals are more likely to have zero expenditure on second homes, deleting the zero observations would narrow the sample to mostly wealthy individuals, thereby changing the scope of the analysis.
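To make the selection problem concrete, here is a minimal sketch using made-up numbers (the wealth and expenditure figures are purely hypothetical, not real data). It compares the sample before and after the zero-expenditure observations are dropped.

```python
# Hypothetical (wealth, second-home expenditure) pairs; illustrative values only.
data = [(20, 0), (35, 0), (50, 0), (80, 0),
        (120, 5), (200, 12), (350, 30), (500, 45)]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Wealth in the full sample versus the sample left after deleting zeros.
full_wealth = [w for w, _ in data]
kept_wealth = [w for w, spend in data if spend > 0]

print(f"all observations: n = {len(full_wealth)}, mean wealth = {mean(full_wealth):.1f}")
print(f"zeros deleted:    n = {len(kept_wealth)}, mean wealth = {mean(kept_wealth):.1f}")
```

In this toy sample the zeros are concentrated among the less wealthy, so the retained observations are systematically wealthier than the population the analysis set out to describe.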

A better, yet simple, solution is to add a positive constant to the variable(s) for which you have zero values. For example, if your model is log(y) = a0 + a1 x + e, you can add a positive constant to all the y-values and estimate log(y + c) = a0 + a1 x + u, where c is a positive constant that ensures that all (y + c) values are greater than zero. The choice of the value for c is arbitrary. This approach may introduce some bias, and choosing a small value for c (i.e. c close to zero) is not necessarily better than, say, c = 0.3. Nonetheless, adding a positive constant is common practice for dealing with zero values, and for dissertation purposes it is more than fine. If you use this approach, you should point out its limitations. See Bellégo and Pape (2019) for a discussion.
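The sensitivity to c can be checked directly. The sketch below uses a small hypothetical dataset and a hand-written helper, ols_slope (both are assumptions made for illustration), and estimates a1 from log(y + c) = a0 + a1 x + u for several values of c.

```python
import math

def ols_slope(x, z):
    """Slope from a simple OLS regression of z on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    return sxz / sxx

# Hypothetical data: y contains zeros, so log(y) is undefined for some rows.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 2, 4, 9, 16, 35]

slopes = {}
for c in (0.001, 0.3, 1.0):
    z = [math.log(yi + c) for yi in y]  # log(y + c) is defined for every row
    slopes[c] = ols_slope(x, z)
    print(f"c = {c}: estimated a1 = {slopes[c]:.3f}")
```

With these numbers the estimated slope changes noticeably as c changes, which is why it is worth reporting results for more than one value of c when using this approach.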

A preferable approach is to take an inverse hyperbolic sine (IHS) transformation of the variable, log(y + (y^2 + 1)^(1/2)). This transformation behaves similarly to a log transformation but is also defined for zero- and negative-valued observations. As y increases, the IHS tends to log(2y), which has led many to interpret it in the same way as a log-transformed variable. While this is fine for large values of y, for very small values of y the IHS can behave differently, which biases the estimated elasticity. See Bellemare and Wichman (2020) for an excellent discussion on when and when not to use the IHS transformation.
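The behaviour described above is easy to verify numerically. The following sketch computes the IHS from its definition for a few illustrative values of y (the values are arbitrary) and compares it with log(2y) where the latter is defined.

```python
import math

def ihs(y):
    """Inverse hyperbolic sine from its definition: log(y + sqrt(y^2 + 1)).

    In practice math.asinh(y) computes the same quantity and is numerically
    safer for large negative y.
    """
    return math.log(y + math.sqrt(y * y + 1.0))

values = [-5, -1, 0, 0.1, 1, 10, 1000]
for y in values:
    # log(2y) only exists for positive y; the IHS exists for every y.
    log_2y = f"{math.log(2 * y):.4f}" if y > 0 else "undefined"
    print(f"y = {y:>6}: IHS = {ihs(y):.4f}, log(2y) = {log_2y}")
```

The two columns are nearly identical for large y but far apart for small y, which is the region where the log-style interpretation of the coefficient is least reliable.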

Other, more novel approaches have been proposed. See, for example, Bellégo and Pape (2019), who recommend the Pseudo-Poisson Maximum Likelihood (PPML) estimator and also propose a complementary method that adds an optimal, observation-dependent positive value, c_i, to the variable and estimates the model using GMM.
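As a rough illustration of the PPML idea only (not of the observation-dependent c_i method), the sketch below fits E[y | x] = exp(a0 + a1 x) by maximising the Poisson pseudo-likelihood with Newton's method. Because y enters in levels, zeros need no special treatment. The data and the function name ppml_fit are hypothetical, and the sketch reports point estimates only; applied work would use a statistics package and robust standard errors.

```python
import math

def ppml_fit(x, y, max_iter=100, tol=1e-10):
    """Fit E[y|x] = exp(a0 + a1*x) by Newton's method on the Poisson pseudo-likelihood."""
    a0 = math.log(sum(y) / len(y))  # start at the log of the sample mean
    a1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(a0 + a1 * xi) for xi in x]
        # Score vector: sum of (y - mu) * (1, x)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian: sum of mu * [[1, x], [x, x^2]]
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H d = g
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        a0 += d0
        a1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return a0, a1

# Hypothetical data with zeros in the dependent variable.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 2, 4, 9, 16, 35]

a0, a1 = ppml_fit(x, y)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
```

Here a1 is read as the approximate proportional change in the expected value of y for a one-unit change in x, which is the same interpretation a1 has in the log-linear model.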

See Young and Young (1975) for more on deleting zero observations; MaCurdy and Pencavel (1986) for more on adding a positive constant; and Burbidge et al. (1988) for more on the IHS.

References

Bellégo, C. and Pape, L. (2019) Dealing with Logs and Zeros in Regression Models, CREST Série des Documents de Travail No. 2019-13.

Bellemare, M. F. and Wichman, C. J. (2020) Elasticities and the Inverse Hyperbolic Sine Transformation, Oxford Bulletin of Economics and Statistics 82, 50–61.

Burbidge, J. B., Magee, L. and Robb, A. L. (1988) Alternative Transformations to Handle Extreme Values of the Dependent Variable, Journal of the American Statistical Association 83, 123–127.

MaCurdy, T. E. and Pencavel, J. H. (1986) Testing between Competing Models of Wage and Employment Determination in Unionized Markets, Journal of Political Economy 94, S3–S39.

Young, K. H. and Young, L. Y. (1975) Estimation of Regressions Involving Logarithmic Transformation of Zero Values in the Dependent Variable, The American Statistician 29, 118–120.