Stata Press books

Generalized Linear Models and Extensions, Third Edition

James W. Hardin and Joseph M. Hilbe
Copyright 2012
ISBN-13: 978-1-59718-105-1
Pages 455; paperback
Price $58.00
See the back cover
Table of contents
Preface (pdf)
Author index (pdf)
Subject index (pdf)
Other supplementary materials provided by the authors
Download the datasets used in the book
Download the brochure (pdf)

Comment from the Stata Technical group

Generalized linear models (GLMs) extend linear regression to models with a non-Gaussian, or even discrete, response. GLM theory is predicated on the exponential family of distributions—a class so rich that it includes the commonly used logit, probit, and Poisson models. Although one can fit these models in Stata by using specialized commands (for example, logit for logit models), fitting them as GLMs with Stata’s glm command offers some advantages. For example, model diagnostics may be calculated and interpreted similarly regardless of the assumed distribution.
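To make the equivalence concrete, here is a minimal sketch of fitting the same binary-response model both ways; the variables y (a 0/1 outcome) and x1, x2 are hypothetical placeholders, not from the book's datasets:

```stata
* Fit a logit model with the specialized command
logit y x1 x2

* Fit the same model as a GLM: binomial family, logit link
* (point estimates and standard errors match the logit output)
glm y x1 x2, family(binomial) link(logit)
```

Because glm reports GLM-wide quantities such as the deviance and Pearson statistics regardless of family, the same diagnostic workflow carries over when the family is changed.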

This text thoroughly covers GLMs, both theoretically and computationally, with an emphasis on Stata. The theory consists of showing how the various GLMs are special cases of the exponential family, establishing general properties of this family of distributions, and deriving maximum likelihood (ML) estimators and standard errors. Hardin and Hilbe show how iteratively reweighted least squares, another method of parameter estimation, is a consequence of ML estimation using Fisher scoring. The authors also discuss different methods of estimating standard errors, including robust methods, robust methods with clustering, Newey–West, outer product of the gradient, bootstrap, and jackknife. The thorough coverage of model diagnostics includes measures of influence such as Cook’s distance, several forms of residuals, the Akaike and Bayesian information criteria, and various R2-type measures of explained variability.
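Several of the variance estimators discussed correspond directly to vce() options of Stata's glm command. A sketch, again using hypothetical variables y, x1, x2 and a hypothetical cluster identifier id:

```stata
* Poisson GLM with conventional (Hessian-based) standard errors
glm y x1 x2, family(poisson) link(log)

* Robust (sandwich) standard errors
glm y x1 x2, family(poisson) link(log) vce(robust)

* Robust standard errors allowing for within-cluster correlation
glm y x1 x2, family(poisson) link(log) vce(cluster id)

* Resampling-based variance estimates
glm y x1 x2, family(poisson) link(log) vce(bootstrap)
glm y x1 x2, family(poisson) link(log) vce(jackknife)
```

The point estimates are identical across these fits; only the estimated variance matrix, and hence the reported standard errors, changes.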

After presenting general theory, Hardin and Hilbe then break down each distribution. Each distribution has its own chapter that explains the computational details of applying the general theory to that particular distribution. Pseudocode plays a valuable role here, because it lets the authors describe computational algorithms relatively simply. Devoting an entire chapter to each distribution (or family, in GLM terms) also allows for the inclusion of real-data examples showing how Stata fits such models, as well as the presentation of certain diagnostics and analytical strategies that are unique to that family. The chapters on binary data and on count (Poisson) data are excellent in this regard. Hardin and Hilbe give ample attention to the problems of overdispersion and zero inflation in count-data models.
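A common first check for overdispersion in a count model is to compare the Pearson dispersion statistic with 1 and, if it is markedly larger, refit with a negative binomial family. A minimal sketch with hypothetical variables (the stored-result name e(dispers_p) and the family(nbinomial ml) syntax are as I recall them from Stata's glm; consult the glm documentation to confirm):

```stata
* Fit a Poisson GLM and inspect the Pearson dispersion statistic
glm y x1 x2, family(poisson) link(log)
display e(dispers_p)    // values well above 1 suggest overdispersion

* Refit as negative binomial, estimating the ancillary parameter by ML
glm y x1 x2, family(nbinomial ml) link(log)
```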

The final part of the text concerns extensions of GLMs, which come in three forms. First, the authors cover multinomial responses, both ordered and unordered. Although multinomial responses are not strictly a part of GLM, the theory is similar in that one can think of a multinomial response as an extension of a binary response. The examples presented in these chapters often use the authors’ own Stata programs, augmenting official Stata’s capabilities. Second, GLMs may be extended to clustered data through generalized estimating equations (GEEs), and one chapter covers GEE theory and examples. Finally, GLMs may be extended by programming one’s own family and link functions for use with Stata’s official glm command, and the authors detail this process.
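The GEE extension corresponds to Stata's xtgee command. A minimal sketch of a population-averaged logit model for clustered binary data, with a hypothetical cluster identifier id:

```stata
* Declare the panel/cluster structure
xtset id

* Logit GEE with an exchangeable within-cluster correlation structure
xtgee y x1 x2, family(binomial) link(logit) corr(exchangeable) vce(robust)
```

The family() and link() options mirror those of glm; corr() specifies the assumed working correlation within clusters.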

In addition to other enhancements—for example, a new section on marginal effects—the third edition contains several new extended GLMs, giving Stata users new ways to capture the complexity of count data. New count models include a three-parameter negative binomial known as NB-P, Poisson inverse Gaussian (PIG), zero-inflated generalized Poisson (ZIGP), a rewritten generalized Poisson, two- and three-component finite mixture models, and generalized censored Poisson and negative binomial models. This edition has a new chapter on simulation and data synthesis, and it also shows how to construct a wide variety of synthetic and Monte Carlo models throughout the book.

List of tables
List of figures
1 Introduction
1.1 Origins and motivation
1.2 Notational conventions
1.3 Applied or theoretical?
1.4 Road map
1.5 Installing the support materials
I Foundations of Generalized Linear Models
2 GLMs
2.1 Components
2.2 Assumptions
2.3 Exponential family
2.4 Example: Using an offset in a GLM
2.5 Summary
3 GLM estimation algorithms
3.1 Newton–Raphson (using the observed Hessian)
3.2 Starting values for Newton–Raphson
3.3 IRLS (using the expected Hessian)
3.4 Starting values for IRLS
3.5 Goodness of fit
3.6 Estimated variance matrices
3.6.1 Hessian
3.6.2 Outer product of the gradient
3.6.3 Sandwich
3.6.4 Modified sandwich
3.6.5 Unbiased sandwich
3.6.6 Modified unbiased sandwich
3.6.7 Weighted sandwich: Newey–West
3.6.8 Jackknife
  Usual jackknife
  One-step jackknife
  Weighted jackknife
  Variable jackknife
3.6.9 Bootstrap
  Usual bootstrap
  Grouped bootstrap
3.7 Estimation algorithms
3.8 Summary
4 Analysis of fit
4.1 Deviance
4.2 Diagnostics
4.2.1 Cook’s distance
4.2.2 Overdispersion
4.3 Assessing the link function
4.4 Residual analysis
4.4.1 Response residuals
4.4.2 Working residuals
4.4.3 Pearson residuals
4.4.4 Partial residuals
4.4.5 Anscombe residuals
4.4.6 Deviance residuals
4.4.7 Adjusted deviance residuals
4.4.8 Likelihood residuals
4.4.9 Score residuals
4.5 Checks for systematic departure from the model
4.6 Model statistics
4.6.1 Criterion measures
  AIC
  BIC
4.6.2 The interpretation of R2 in linear regression
  Percentage variance explained
  The ratio of variances
  A transformation of the likelihood ratio
  A transformation of the F test
  Squared correlation
4.6.3 Generalizations of linear regression R2 interpretations
  Efron’s pseudo-R2
  McFadden’s likelihood-ratio index
  Ben-Akiva and Lerman adjusted likelihood-ratio index
  McKelvey and Zavoina ratio of variances
  Transformation of likelihood ratio
  Cragg and Uhler normed measure
4.6.4 More R2 measures
  The count R2
  The adjusted count R2
  Veall and Zimmermann R2
  Cameron–Windmeijer R2
4.7 Marginal effects
4.7.1 Marginal effects for GLMs
4.7.2 Discrete change for GLMs
5 Data synthesis
5.1 Generating correlated data
5.2 Generating data from a specified population
5.2.1 Generating data for linear regression
5.2.2 Generating data for logistic regression
5.2.3 Generating data for probit regression
5.2.4 Generating data for cloglog regression
5.2.5 Generating data for Gaussian variance and log link
5.2.6 Generating underdispersed count data
5.3 Simulation
5.3.1 Heteroskedasticity in linear regression
5.3.2 Power analysis
5.3.3 Comparing fit of Poisson and negative binomial
5.3.4 Effect of omitted covariate on Efron’s R2 in Poisson regression
II Continuous Response Models
6 The Gaussian family
6.1 Derivation of the GLM Gaussian family
6.2 Derivation in terms of the mean
6.3 IRLS GLM algorithm (nonbinomial)
6.4 ML estimation
6.5 GLM log-normal models
6.6 Expected versus observed information matrix
6.7 Other Gaussian links
6.8 Example: Relation to OLS
6.9 Example: Beta-carotene
7 The gamma family
7.1 Derivation of the gamma model
7.2 Example: Reciprocal link
7.3 ML estimation
7.4 Log-gamma models
7.5 Identity-gamma models
7.6 Using the gamma model for survival analysis
8 The inverse Gaussian family
8.1 Derivation of the inverse Gaussian model
8.2 The inverse Gaussian algorithm
8.3 Maximum likelihood algorithm
8.4 Example: The canonical inverse Gaussian
8.5 Noncanonical links
9 The power family and link
9.1 Power links
9.2 Example: Power link
9.3 The power family
III Binomial Response Models
10 The binomial–logit family
10.1 Derivation of the binomial model
10.2 Derivation of the Bernoulli model
10.3 The binomial regression algorithm
10.4 Example: Logistic regression
10.4.1 Model producing logistic coefficients: The heart data
10.4.2 Model producing logistic odds ratios
10.5 GOF statistics
10.6 Proportional data
10.7 Interpretation of parameter estimates
11 The general binomial family
11.1 Noncanonical binomial models
11.2 Noncanonical binomial links (binary form)
11.3 The probit model
11.4 The clog-log and log-log models
11.5 Other links
11.6 Interpretation of coefficients
11.6.1 Identity link
11.6.2 Logit link
11.6.3 Log link
11.6.4 Log complement link
11.6.5 Summary
11.7 Generalized binomial regression
12 The problem of overdispersion
12.1 Overdispersion
12.2 Scaling of standard errors
12.3 Williams’ procedure
12.4 Robust standard errors
IV Count Response Models
13 The Poisson family
13.1 Count response regression models
13.2 Derivation of the Poisson algorithm
13.3 Poisson regression: Examples
13.4 Example: Testing overdispersion in the Poisson model
13.5 Using the Poisson model for survival analysis
13.6 Using offsets to compare models
13.7 Interpretation of coefficients
14 The negative binomial family
14.1 Constant overdispersion
14.2 Variable overdispersion
14.2.1 Derivation in terms of a Poisson–gamma mixture
14.2.2 Derivation in terms of the negative binomial probability function
14.2.3 The canonical link negative binomial parameterization
14.3 The log-negative binomial parameterization
14.4 Negative binomial examples
14.5 The geometric family
14.6 Interpretation of coefficients
15 Other count data models
15.1 Count response regression models
15.2 Zero-truncated models
15.3 Zero-inflated models
15.4 Hurdle models
15.5 Negative binomial(P) models
15.6 Heterogeneous negative binomial models
15.7 Generalized Poisson regression models
15.8 Poisson inverse Gaussian models
15.9 Censored count response models
15.10 Finite mixture models
V Multinomial Response Models
16 The ordered-response family
16.1 Interpretation of coefficients: Single binary predictor
16.2 Ordered outcomes for general link
16.3 Ordered outcomes for specific links
16.3.1 Ordered logit
16.3.2 Ordered probit
16.3.3 Ordered clog-log
16.3.4 Ordered log-log
16.3.5 Ordered cauchit
16.4 Generalized ordered outcome models
16.5 Example: Synthetic data
16.6 Example: Automobile data
16.7 Partial proportional-odds models
16.8 Continuation-ratio models
17 Unordered-response family
17.1 The multinomial logit model
17.1.1 Interpretation of coefficients: Single binary predictor
17.1.2 Example: Relation to logistic regression
17.1.3 Example: Relation to conditional logistic regression
17.1.4 Example: Extensions with conditional logistic regression
17.1.5 The independence of irrelevant alternatives
17.1.6 Example: Assessing the IIA
17.1.7 Interpreting coefficients
17.1.8 Example: Medical admissions—introduction
17.1.9 Example: Medical admissions—summary
17.2 The multinomial probit model
17.2.1 Example: A comparison of the models
17.2.2 Example: Comparing probit and multinomial probit
17.2.3 Example: Concluding remarks
VI Extensions to the GLM
18 Extending the likelihood
18.1 The quasilikelihood
18.2 Example: Wedderburn’s leaf blotch data
18.3 Generalized additive models
19 Clustered data
19.1 Generalization from individual to clustered data
19.2 Pooled estimators
19.3 Fixed effects
19.3.1 Unconditional fixed-effects estimators
19.3.2 Conditional fixed-effects estimators
19.4 Random effects
19.4.1 Maximum likelihood estimation
19.4.2 Gibbs sampling
19.5 GEEs
19.6 Other models
VII Stata Software
20 Programs for Stata
20.1 The glm command
20.1.1 Syntax
20.1.2 Description
20.1.3 Options
20.2 The predict command after glm
20.2.1 Syntax
20.2.2 Options
20.3 User-written programs
20.3.1 Global macros available for user-written programs
20.3.2 User-written variance functions
20.3.3 User-written programs for link functions
20.3.4 User-written programs for Newey–West weights
20.4 Remarks
20.4.1 Equivalent commands
20.4.2 Special comments on family(Gaussian) models
20.4.3 Special comments on family(binomial) models
20.4.4 Special comments on family(nbinomial) models
20.4.5 Special comment on family(gamma) link(log) models
A Tables
Author index
Subject index