Wednesday, March 22, 2017

Count Models with Offsets: Practical Applications Using R

Let's consider three count modeling scenarios and determine the appropriate modeling strategy for each.

Example 1: Suppose we are observing kids playing basketball during open gym. We have two groups of equal size, A and B. Both groups play for 60 minutes; kids in group A score about 2.5 goals each on average, while kids in group B average about 5.
In this case both groups play for the same amount of time, so there seems to be no need to include time as an offset. For whatever reason, the kids in group B are simply better at scoring and therefore score more goals.

If we simulate count data to mimic this scenario (see toy data below) we might get descriptive statistics that look like this:

Table 1:

It is clear that over the period of observation (60 minutes) group B outscored group A. Would we in practice discuss this in terms of rates? Total goals per 60-minute session? Or goals per minute? In the toy data, group A scores 0.0433 goals per minute vs. 0.09 for B. Again, we conclude based on rates that B is better at scoring goals. But most likely, whether the rate is implicit or explicit, we would discuss these outcomes in the more practical terms of total goals for A vs. B.
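The descriptives for this scenario can be reproduced from the toy data listed at the end of the post. What follows is a minimal sketch in R, assuming the data are rebuilt by hand as vectors; the object names mean_goals and rate_per_min are illustrative and do not come from the original analysis.

```r
# Toy data from the end of the post, rebuilt as vectors (15 kids per group).
counts <- data.frame(
  COUNT = c(3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4,
            5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6),
  GROUP = factor(rep(c("A", "B"), each = 15)),
  TIME2 = 60  # Example 1: every kid plays the full 60 minutes
)

# Mean goals per kid in each group.
mean_goals <- tapply(counts$COUNT, counts$GROUP, mean)

# Implied per-minute rate: total goals divided by total minutes played.
rate_per_min <- tapply(counts$COUNT, counts$GROUP, sum) /
  tapply(counts$TIME2, counts$GROUP, sum)

print(round(mean_goals, 2))    # A = 2.6, B = 5.4
print(round(rate_per_min, 4))  # A = 0.0433, B = 0.09
```

Because every kid has the same 60 minutes of exposure, the per-minute rates are just the mean counts divided by 60, so the ranking of the two groups cannot change.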

We could model this difference with a Poisson regression:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))


Table 2:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.9555     0.1601   5.967 2.41e-09 ***
GROUPB        0.7309     0.1949   3.750 0.000177 ***

We can see from this that group B completes significantly more goals than A, at a ‘rate’ exp(.7309) = 2.077 times that of A (roughly twice as many goals). This is what we get from a direct comparison of the average counts in the descriptives above (5.4 / 2.6 = 2.077 in the toy data).
But what if we wanted to be explicit about the interval of observation and include an offset? The way we incorporate rates into a Poisson model for counts is through the offset.

log(μ/t_x) = xβ   (here we are explicitly specifying a rate based on time t_x)

Re-arranging terms we get:
log(μ) - log(t_x) = xβ
log(μ) = xβ + log(t_x)
The term log(t_x) becomes our ‘offset.’
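For the two-group models used in this post, the same algebra shows why the group coefficient reads as a log rate ratio. The block below is a sketch under the assumption that xβ is written out as an intercept plus a single group indicator; the symbols β0, β1, and the indicator notation are introduced here for illustration and are not in the original derivation.

```latex
\begin{aligned}
\log\!\left(\frac{\mu}{t_x}\right) &= \beta_0 + \beta_1\,\mathbb{1}[\text{group} = B] \\
\text{Group A:}\quad \log\!\left(\frac{\mu_A}{t_A}\right) &= \beta_0 \\
\text{Group B:}\quad \log\!\left(\frac{\mu_B}{t_B}\right) &= \beta_0 + \beta_1 \\
\Rightarrow\quad \beta_1 &= \log\!\left(\frac{\mu_B / t_B}{\mu_A / t_A}\right),
\qquad e^{\beta_1} = \frac{\mu_B / t_B}{\mu_A / t_A}
\end{aligned}
```

When the exposure is the same for everyone (t_A = t_B), the exposure terms cancel and exp(β1) reduces to the ratio of mean counts, which is why the offset can only move the intercept in that case.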

So we would do this by including log(time) as an offset in our R code: 

summary(glm(COUNT ~ GROUP + offset(log(TIME2)),data = counts, family = poisson))

It turns out the estimate of .7309 for B vs. A is the same; only the intercept changes, shifting by log(60) = 4.0943 (from 0.9555 to -3.1388). Whether we directly compare the raw counts or run count models with or without offsets, we get the same result.

Table 3:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -3.1388     0.1601  -19.60  < 2e-16 ***
GROUPB        0.7309     0.1949    3.75 0.000177 ***
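To check the claim that a constant offset leaves the group effect untouched, the two fits can be compared side by side. This is a minimal sketch that rebuilds the toy data by hand; fit_counts and fit_rates are illustrative names, not objects from the original post.

```r
# Toy data from the end of the post; TIME2 is the common 60 minutes of play.
counts <- data.frame(
  COUNT = c(3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4,
            5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6),
  GROUP = factor(rep(c("A", "B"), each = 15)),
  TIME2 = 60
)

# Model of counts (no offset) and model of rates (offset = log exposure).
fit_counts <- glm(COUNT ~ GROUP, data = counts, family = poisson)
fit_rates  <- glm(COUNT ~ GROUP + offset(log(TIME2)),
                  data = counts, family = poisson)

# The group coefficient is the same in both fits (about 0.7309).
print(coef(fit_counts)[["GROUPB"]])
print(coef(fit_rates)[["GROUPB"]])

# The intercepts differ by log(60), about 4.0943.
print(coef(fit_counts)[["(Intercept)"]] - coef(fit_rates)[["(Intercept)"]])

# exp(coefficient) matches the ratio of mean counts, 5.4 / 2.6.
print(exp(coef(fit_counts)[["GROUPB"]]))
```

The offset enters the linear predictor with its coefficient fixed at 1, so a constant exposure is absorbed entirely by the intercept.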

Example 2: Suppose again we are observing kids playing basketball during open gym. Let’s still refer to them as groups A and B. Suppose after 30 minutes group A is forced to leave the court (maybe their section of the court is reserved for an art show). Before leaving they score an average of about 2.5 goals each. Group B is allowed to play for 60 minutes, scoring an average of about 5 goals each. This is an example where the two groups had different observation times or exposure times (i.e. playing time). It’s plausible that if group A had continued to play, they would have had more risk or opportunity to score more goals. It seems the only fair way to compare goal scoring for A vs. B is to consider court time, i.e. exposure, and compare rates of goal completion. If we use the same toy data as before (but assuming this different scenario) we would get the following descriptives:

Table 4:

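You can see that the difference in the rate of goals scored is very small: in the toy data group A scores 2.6 / 30 = 0.0867 goals per minute vs. 5.4 / 60 = 0.09 for group B. Both groups are put on an ‘even’ playing field when we consider rates of goal completion.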

If we fail to consider exposure or the period of observation, we would run the following model:

summary(glm(COUNT ~ GROUP,data = counts, family = poisson))

The results are the same as in Table 2 above. But what if we want to consider the differences in exposure or observation time? In this case we would include an offset in our model specification:

summary(glm(COUNT ~ GROUP + offset(log(TIME3)),data = counts, family = poisson))

Table 5:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -2.44569    0.16013 -15.273   <2e-16 ***
GROUPB       0.03774    0.19490   0.194    0.846 

We can see from the results that when we account for exposure (modeling with an offset) there is no significant difference between groups, although this could be an issue of low power and small sample size. Directionally, exp(.0377) = 1.038 indicates that group B completes 1.038 times as many goals per minute of exposure as A, or (1.038 - 1)*100 = 3.8% more. We can get all of this from the descriptives by comparing the average ‘rates’ of goal completion for B vs. A. Either way the conclusion is the same: if we fail to consider rates or exposure in this case, we get the wrong answer.
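The link between the offset model and the raw rates in this scenario can be verified directly. The sketch below rebuilds the toy data by hand with the TIME3 exposure column (30 minutes for A, 60 for B); the names fit_exposure and rate_ratio are illustrative assumptions.

```r
# Toy data from the end of the post; TIME3 is court time in Example 2.
counts <- data.frame(
  COUNT = c(3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4,
            5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6),
  GROUP = factor(rep(c("A", "B"), each = 15)),
  TIME3 = rep(c(30, 60), each = 15)
)

# Poisson model of goals per minute of court time.
fit_exposure <- glm(COUNT ~ GROUP + offset(log(TIME3)),
                    data = counts, family = poisson)

# Raw rates: total goals divided by total minutes, per group.
rates <- tapply(counts$COUNT, counts$GROUP, sum) /
  tapply(counts$TIME3, counts$GROUP, sum)
rate_ratio <- rates[["B"]] / rates[["A"]]

print(round(rates, 4))                                   # A = 0.0867, B = 0.09
print(round(rate_ratio, 3))                              # about 1.038
print(round(exp(coef(fit_exposure)[["GROUPB"]]), 3))     # about 1.038
```

With a single categorical predictor, the fitted rate ratio reproduces the ratio of observed rates, so the model adds standard errors and a test rather than a different point estimate.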

Example 3: Suppose we are again observing kids playing basketball during open gym with groups A and B. Except this time group A tires out after playing about 20 minutes or so and leaves the court after scoring 2.6 goals each on average. Group B perseveres another 30 minutes or so and scores a total of about 5 goals on average per kid. In this instance there seem to be important differences between groups A and B in terms of drive and ambition that should not be equated away by accounting for time played through an offset. Event success seems to drive time as much as time drives the event. If we want to think of a ‘rate’ here, it is total goals scored per open gym session, not per minute of activity. The relevant interval is a single open gym session.
In this case time actually seems endogenous: it is confounded with the outcome, or with other factors like effort and motivation that drive the outcome.

If we alter our simulated data from before to mimic this scenario we would generate the following descriptive statistics:

Table 6:

As discussed previously, this should be modeled without an offset, which implies equal exposure for every kid, with the unit of exposure being an entire open gym session. We can think of this as a model of counts, or an implied model of rates in terms of total goals per open gym session. In that case we get the same results as Table 2, indicating that group B scores more goals than A. It makes no sense in this case to include time as an offset or compare per-minute rates of goal completion between groups. But if we did model this with an offset (making this a model with an explicit specification of exposure as court time, the TIME column in the toy data), then we would get the following:

Table 7:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -2.1203     0.1601 -13.241   <2e-16 ***
GROUPB       -0.1054     0.1949  -0.541    0.589   

In this case we find that modeling this explicitly using playing time as exposure gives a result indicating that group B completes goals at a lower rate than group A (though not significantly so). This approach completely ignores the fact that group B persevered to play longer and ultimately completed more goals. Including an offset in this case most likely leads to the wrong conclusion.
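The sign flip in this scenario can be reproduced by fitting the model with and without court time as exposure. This is a minimal sketch that rebuilds the toy data by hand and assumes the TIME column is the playing time used for Table 7; fit_session and fit_minutes are illustrative names.

```r
# Toy data from the end of the post; TIME is minutes played in Example 3.
counts <- data.frame(
  COUNT = c(3, 4, 2, 2, 1, 6, 0, 0, 1, 5, 3, 2, 3, 3, 4,
            5, 4, 7, 8, 3, 7, 5, 4, 7, 8, 5, 3, 5, 4, 6),
  GROUP = factor(rep(c("A", "B"), each = 15)),
  TIME  = c(20, 25, 20, 20, 20, 30, 20, 20, 20, 25, 20, 20, 20, 25, 20,
            50, 45, 55, 50, 50, 45, 55, 50, 50, 45, 55, 50, 50, 45, 55)
)

# Goals per open gym session: no offset, every session counts as one unit.
fit_session <- glm(COUNT ~ GROUP, data = counts, family = poisson)

# Goals per minute played: court time enters as the offset.
fit_minutes <- glm(COUNT ~ GROUP + offset(log(TIME)),
                   data = counts, family = poisson)

print(round(coef(fit_session)[["GROUPB"]], 4))       # about 0.7309
print(round(coef(fit_minutes)[["GROUPB"]], 4))       # about -0.1054
print(round(exp(coef(fit_minutes)[["GROUPB"]]), 2))  # about 0.9
```

The data are identical in both fits; only the definition of exposure changes, which is why the choice of the relevant interval has to be argued on substantive grounds rather than read off the output.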

Summary: When modeling outcomes that are counts, a rate is always implied by the nature of the probability mass function for a Poisson process. However, in practical applications we may not always think of our outcome as an explicit rate based on an explicit interval or exposure time. In some cases this distinction can be critical. When we want to explicitly consider differences in exposure, this is done through specification of an offset in our count model. Three examples were given using toy data where (1) modeling rates or including an offset made no difference in outcome, (2) including an offset was required to obtain the correct conclusion, and (3) including an offset may lead to the wrong conclusion.

Conclusion: Counts always occur within some interval of time or space and therefore can always have an implicit ‘rate’ interpretation. If the relevant interval in time or space differs across observations, then those differences should be modeled through the specification of an offset. Whether to include an offset really depends on answering two questions: (1) What is the relevant interval in time or space upon which our counts are based? (2) Is this interval different across our observations of counts?

Notes: This ignores any discussion of overdispersion or inflated zeros, which relate to other possible model specifications including negative binomial, zero-inflated Poisson (ZIP), or zero-inflated negative binomial (ZINB) models.

Simulated Toy Count Data:

COUNT GROUP ID   TIME TIME2 TIME3
3    A    1    20   60   30
4    A    2    25   60   30
2    A    3    20   60   30
2    A    4    20   60   30
1    A    5    20   60   30
6    A    6    30   60   30
0    A    7    20   60   30
0    A    8    20   60   30
1    A    9    20   60   30
5    A    10   25   60   30
3    A    11   20   60   30
2    A    12   20   60   30
3    A    13   20   60   30
3    A    14   25   60   30
4    A    15   20   60   30
5    B    16   50   60   60
4    B    17   45   60   60
7    B    18   55   60   60
8    B    19   50   60   60
3    B    20   50   60   60
7    B    21   45   60   60
5    B    22   55   60   60
4    B    23   50   60   60
7    B    24   50   60   60
8    B    25   45   60   60
5    B    26   55   60   60
3    B    27   50   60   60
5    B    28   50   60   60
4    B    29   45   60   60
6    B    30   55   60   60
