A graphical explanation of fixed effects

Many applications in econometrics involve controlling for so-called fixed effects (FE), that is, including a dummy variable for each state, year, firm, etc. The reasons for including FE’s are many and vary with the context, but broadly what they do is control for things that are omitted. In the case of “entity fixed effects” (which I will focus on here), these are omitted variables that vary across entities but are constant over time within an entity. For example, say that you are trying to explain voting behavior across states with income. If you regress a dummy for Republican/Democratic governance on income (with the dummy equal to unity under Republican governance), one omitted variable might be culture. If culture varies by state and is roughly constant over time, you can control for it by including state fixed effects. That is the simple explanation with a highly stylized example. In most cases, however, the reason for including FE’s can be pretty vague, and it is difficult to grasp what they are actually doing in practice. Hence, I have decided to try to make an intuitive graphical explanation of FE’s.

Consider the case where you would like to know the (causal) relationship between wage and tenure. You have collected data from three firms and drawn a sample of 30 individuals in each. I have generated this example data such that mean wage is firm specific, so

\begin{aligned} Wage_{1i} = 10000 + \varepsilon_{i} \\ Wage_{2i} = 15000 + \varepsilon_{i} \\ Wage_{3i} = 20000 + \varepsilon_{i}  \end{aligned}

and tenure,

\begin{aligned} Tenure_{1i} = 10 + u_{i} \\ Tenure_{2i} = 15 + u_{i} \\ Tenure_{3i} = 20+ u_{i}  \end{aligned}

with \varepsilon \sim N(0,2000) and u \sim N(0,3). The graph below depicts the relationship between wage and tenure, with a highly significant OLS estimate of 640.47 from running the regression Wage_{ik} = \alpha + \beta Tenure_{ik} + \nu_{ik}, where i is the individual and k the firm. Hence there is a clear correlation(!) between wage and tenure.


But there is something fishy with this estimate. It is not a causal estimate of the effect of tenure on wage. Rather, if we look at the data above, it appears as if it is all driven by which firm you work at! This is an omitted variable that can be dealt with using firm FE’s.  In fact, I constructed the data so that there should be no causal effect of tenure on wage. Let’s look at the regressions within firms depicted in the figure below,


As you can see, there is no clear positive pattern, and each separate firm regression has \beta_{k} not significantly different from zero. But we don’t want to run three separate regressions. Rather, we want to control for the effects on wage caused by the firm itself. What we do is run the same regression as above but include firm FE’s (i.e. a dummy variable for each firm, with one firm left out as the reference category since the regression has an intercept),

Wage_{ik} = \alpha + \delta Tenure_{ik} + \sum_{j=2}^{3} \gamma_{j} 1[Firm= j] + \eta_{ik}

What this effectively does is estimate the relationship between wage and tenure within(!) each firm and weight the coefficients into one \delta, which controls for the effect on wage coming from the firm-specific component. \delta delivers the weighted average effect of tenure on wage, holding constant which firm an individual works in. The graph below depicts the residualized wage, and the linear fit has a slope of \delta=0.965. Note how residualizing wage on the firm FE’s demeans everything and “levels out” the playing field.


One interesting feature which I cannot explain is why the average of the estimates from the firm-specific regressions (\beta_{1} + \beta_{2} + \beta_{3})/3 \neq \delta . My prior was that the single FE estimate would be the plain average of these three coefficients, as there is an equal number of observations in each firm, but it turns out it isn’t. If somebody knows why this is, drop a comment or email me. Or I might make a new post out of it once I figure it out.

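Update: The reason (\beta_{1} + \beta_{2} + \beta_{3})/3 \neq \delta is that the FE-estimator is variance weighted: each firm’s slope is weighted by the within-firm variation in tenure, which differs across firms in a finite sample even when the number of observations is the same. Since tenure has the same variance in every firm by construction, the weights become equal as n \rightarrow \infty, so the FE-estimator should nevertheless converge to the plain average of the three separate estimates.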
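The variance-weighting point can be checked numerically. The following is a minimal sketch in Python with NumPy, not the code behind the figures in this post. It assumes that N(0,2000) and N(0,3) refer to standard deviations, and the seed and all variable names are illustrative choices. It computes the pooled OLS slope, the three within-firm slopes, and the FE slope (by demeaning within firm), and then rebuilds the FE slope as a weighted average of the within-firm slopes, with weights equal to each firm's within-firm sum of squares of tenure.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, an assumption for reproducibility
n_per_firm = 30
wage_means = [10000, 15000, 20000]
tenure_means = [10, 15, 20]

# Firm index (0, 1, 2) for each of the 90 individuals
firm = np.repeat(np.arange(3), n_per_firm)

# Wage and tenure are drawn independently within each firm,
# so there is no causal effect of tenure on wage by construction.
wage = np.concatenate([mu + rng.normal(0, 2000, n_per_firm) for mu in wage_means])
tenure = np.concatenate([mu + rng.normal(0, 3, n_per_firm) for mu in tenure_means])


def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    xd = x - x.mean()
    return float(xd @ (y - y.mean()) / (xd @ xd))


# Pooled OLS ignoring firms: picks up the between-firm differences
pooled_slope = ols_slope(tenure, wage)

# Separate regressions within each firm
within_slopes = np.array(
    [ols_slope(tenure[firm == k], wage[firm == k]) for k in range(3)]
)

# FE estimate: demean wage and tenure by firm, then regress
firm_mean_tenure = np.array([tenure[firm == k].mean() for k in range(3)])
firm_mean_wage = np.array([wage[firm == k].mean() for k in range(3)])
tenure_dm = tenure - firm_mean_tenure[firm]
wage_dm = wage - firm_mean_wage[firm]
fe_slope = float(tenure_dm @ wage_dm / (tenure_dm @ tenure_dm))

# Weights: within-firm sum of squared deviations of tenure
ssx = np.array([(tenure_dm[firm == k] ** 2).sum() for k in range(3)])
weighted_slope = float((ssx * within_slopes).sum() / ssx.sum())

print("pooled OLS slope:       ", pooled_slope)
print("within-firm slopes:     ", within_slopes)
print("plain average of slopes:", within_slopes.mean())
print("FE slope:               ", fe_slope)
print("variance-weighted avg:  ", weighted_slope)
```

The FE slope and the variance-weighted average agree up to floating-point error, while the plain average generally differs because the three within-firm sums of squares are not identical in a sample of 30.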

Dice and Binomial Probability

A few weeks ago some of my friends introduced me to a game called “DICE”. It turns out that I was terrible at this game the first time around. But I was still fascinated by its simplicity and by the fact that you could actually work out a pretty solid strategy just by calculating probabilities. It turns out that the game also has some more “human” strategy to it, with people either lying or telling the truth (i.e. bidding according to their own dice), but at least knowing your probabilities gives you an edge. I’ll divide this post into sections. First, I’ll describe the game and its rules. Second, I’ll get into how to go about calculating the probabilities more formally in order to get a solid strategy. Lastly, I’ll end by developing a rule of thumb for optimal decisions in the game. One note on notation: I use digits (1,2,…,6) exclusively to symbolize an outcome of a die. Numbers written in text (one,two,..,six) refer to other things (e.g. the number of dice), which should be clear from the context.

The Game of DICE


The rules are rather simple. Each player gets five dice (fair dice, naturally, as any self-respecting statistician would have to point out). All players throw their dice simultaneously and hide them under their hands. You look at your own dice, and then the first player “bids” how many of some number, e.g. 3’s, there are on the table. Player two then has to raise the bid, either in the number of dice or in the face value. Example: if player one said “there are four 3’s among all dice”, player two can say either “there are four 4’s” (or some higher number) or “there are five 2’s” (or any number). No matter what, you have to “raise the bid” in some sense. Eventually, a player will end up with a bid that is highly unlikely, e.g. twenty-five 3’s out of 30 dice. The next player can then choose to “call” the previous bidder’s bid. When a call is made, all dice are counted to see if the bid is correct or not. If it is correct, i.e. there are at least as many similar dice as the bidder said, the “caller” loses a die. If the bid is wrong, the “bidder” loses a die. The winner of the game is the one with the last die standing. There is one caveat: the 1’s are Jokers, i.e. they are free numbers. So if the count of 6’s is, say, eight and there are three 1’s, then the total number of similar dice is eleven 6’s.

Calculation of Binomial Probability

As dice games are a classic probability example, I thought I would freshen up the old knowledge hidden somewhere. Formally, the number of dice showing a given outcome follows a Binomial distribution, since the four main assumptions are met: i) the outcome can be considered binary (either it is e.g. a 6 or it is not); ii) every toss is independent of the others; iii) each outcome of a die has a known and constant probability; iv) every experiment (round) has a fixed sample size. Most people know that the probability of hitting any given number, e.g. a 3, on one (fair) die is  \displaystyle \frac{1}{6} . If you have two numbers to choose from, e.g. 3’s and 1’s, the probability of getting either of those is naturally  \displaystyle\frac{2}{6}=\frac{1}{3} . The probability mass function (PMF) for the Binomial distribution can be written as,

\begin{aligned} \Pr(X=x \mid n) =  \frac{n!}{(n-x)!x!} \; p^x(1-p)^{n-x} \end{aligned}

where p is the probability of success (i.e.  \frac{1}{3} in this case). X is the random variable which has realization x (i.e. how many dice have the “correct” number). n is the total number of dice in that particular round. Note that n is not a random variable. Finally,  \frac{n!}{(n-x)!x!} is the binomial coefficient often abbreviated by  \binom{n}{x}. It’s a formula for all possible combinations choosing x dice out of n. For those of you who have not seen factorials (“!”) it just means e.g.  4!=4\times3\times2\times1.
Let’s see if the formula works with our trivial one-die example above,

\begin{aligned}  \Pr(X=1 \mid n=1) = \frac{1!}{(1-1)!1!} \; \left(\frac{1}{3}\right)^1 \left(1 - \frac{1}{3}\right)^{1-1}  = 1 \; \left(\frac{1}{3}\right)^1 \left(\frac{2}{3}\right)^{0}  = \frac{1}{3} \end{aligned}

MY GOD! It works! Let’s take a slightly more sophisticated example. What’s the probability of hitting 4 similar dice (including 1’s) out of 10? Applying the Binomial distribution above and the given probabilities we get,

\begin{aligned} \Pr(X=4 \mid n=10) &= \frac{10!}{(10-4)!\,4!} \; \left(\frac{1}{3}\right)^4 \left(1 - \frac{1}{3}\right)^{10-4} \\ &= \frac{10\times9\times 8 \times \cdots \times 1 }{(6 \times 5 \times \cdots \times 1)(4\times3\times2\times1)} \; \left(\frac{1}{3}\right)^4 \left(\frac{2}{3}\right)^{6} \\ &\approx 210 \times 0.012345 \times 0.08779 \\ &\approx 0.227 \end{aligned}

So the probability getting exactly(!) (no more no less) four similar dice (including ones) out of ten is about 22.7 percent. However, what you really wanna know for this game is not the probability of getting exactly 4, 5 or 12 similar dice, but the probability of hitting that or more! So the probability you are after is not  \Pr(X=x) but rather \Pr(X\geq x). So in fancy statistical jargon you wanna find the complement of the cumulative distribution function (CDF). By the laws of probability we know that  \Pr(X\geq x) = 1 - \Pr(X < x). Getting to the last term is just summing up all the  \Pr(X=i) probabilities where  i < x \leq n. To exemplify, let’s reuse the example above. What is the probability of getting four or more dice out of ten that are similar? Well it’s just,

\begin{aligned}  \Pr(X \geq 4 \mid n=10) &= 1 - [ \Pr(X=3) + \Pr(X=2) + \Pr(X=1) + \Pr(X=0) ] \\ &\approx 1- [ 0.260 + 0.195 + 0.087 + 0.017 ] \\ &\approx 1- 0.56 \\ &\approx 0.44 \end{aligned}

So if you were to guess (not having seen your own dice!), the probability of there being four or more similar dice is about 44 percent.
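The two worked examples can be reproduced with a few lines of Python. This is only a sketch using the standard library; the function names `binom_pmf` and `prob_at_least` are illustrative and not from any particular package.

```python
from math import comb


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def prob_at_least(x, n, p):
    """Pr(X >= x) for X ~ Binomial(n, p), summing the upper tail directly."""
    return sum(binom_pmf(i, n, p) for i in range(max(x, 0), n + 1))


p = 1 / 3  # a die counts if it shows the bid number or a 1 (Joker)

exactly_four = binom_pmf(4, 10, p)
four_or_more = prob_at_least(4, 10, p)

print(f"Pr(X = 4  | n = 10) = {exactly_four:.3f}")
print(f"Pr(X >= 4 | n = 10) = {four_or_more:.3f}")
```

Summing the upper tail directly gives the same number as one minus the lower tail; it is used here only because it avoids a special case when the bid exceeds the number of dice.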

Calculation of “Bayesian” Binomial Probability

Note however, and this is important and what I got wrong the first time I played: you have information about your own dice, and you should use it, as it updates your beliefs and probabilities (to talk Bayesian statistics). If someone has bid that there are four 3’s, and you yourself hold 5 of the 10 dice and two of them are 3’s (and you have zero 1’s), then only 2 more 3’s or 1’s among the other five dice are needed for there to be four 3’s out of 10. The question you are asking, and the probability you are really interested in, is “how many 3’s, additional to my own, are there on the table?”. Your own dice are not random variables! They are already realized and known! Formally, let m be the number of dice you have left and let k be the number of your dice that show the number being bid (including your 1’s), so  k \leq m < n . The probability you have to calculate is then,

\begin{aligned}  \Pr(X \geq x-k \mid n-m) = 1 - \sum_{i=0}^{x-k-1}  \binom{n-m}{i} \, p^{i} (1 - p)^{n - m - i}  \end{aligned}

where again  x is the bid,  n is the total number of dice in play and p is the probability of success from before. Let’s call this, for lack of a better term, the Bayesian probability, and let’s refer to the former (the one that is not adjusted for your own dice) as the Naive probability.

As soon as the probability is below 50 percent you should think about “calling”. Of course, there might be information in the bids people have made earlier, and in whether or not they are pathological liars, but I’ll abstract from that in this analysis. So I ran some simulations to give me hints about when to “call” someone’s bid. The figure above illustrates the Naive probability. To get the Bayesian counterpart, you subtract the number of dice you hold, m, from n, and the number of them that show the number being bid, k (including your 1’s), from x. You then read off x-k on the x-axis and go to the line which corresponds to n-m dice in total in the round. You can also have a look at the table (here) that underpins the graph and do exactly the same thing: read off which x-k you are at in the row and then go to the column n-m for how many dice there are in the game (excluding your own). The intersection of the row and the column is the probability that there are at least x-k dice with the right number out of n-m.
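The adjustment for your own dice is easy to sketch in Python. The function names below are illustrative, and the default p = 1/3 is an assumption that the bid is on a face other than 1, so that both the bid face and the 1’s count as successes.

```python
from math import comb


def prob_at_least(r, n_dice, p):
    """Pr(X >= r) for X ~ Binomial(n_dice, p)."""
    return sum(
        comb(n_dice, i) * p**i * (1 - p) ** (n_dice - i)
        for i in range(max(r, 0), n_dice + 1)
    )


def naive_prob(x, n, p=1 / 3):
    """Probability of at least x matching dice among all n, ignoring your hand."""
    return prob_at_least(x, n, p)


def adjusted_prob(x, n, m, k, p=1 / 3):
    """Probability that the bid x holds given that k of your m dice match.

    Only the other n - m dice are random; they must supply at least x - k.
    """
    return prob_at_least(x - k, n - m, p)


# Example from the text: bid of four 3's, ten dice in play,
# you hold five dice and two of them match.
print(f"naive:    {naive_prob(4, 10):.3f}")
print(f"adjusted: {adjusted_prob(4, 10, m=5, k=2):.3f}")
```

In this example the adjusted probability is above one half while the naive one is below it, so using your own dice flips the decision from “call” to “don’t call”.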

Rule of Thumb in “DICE”

Actually, after having done the analysis I realized that you don’t really need the probability table to make an informed choice. As I said, you would probably wanna call when the probability is less than fifty percent. As a rule of thumb you should always “call” when

\begin{aligned} \mathbb{E}[X \mid m ,n ] < x-k - \frac{1}{2} \qquad \text{where} \quad \mathbb{E}[X \mid m ,n ] = (n-m)p \end{aligned}

As is obvious from the equation above, you should never call when x \leq k, i.e. when the bid is for no more dice than you yourself have on hand. If \mathbb{E}[X \mid m ,n ]=x - k - \frac{1}{2} it is roughly a coin flip, so a fifty-fifty chance of you getting it right. In conclusion, here is the decision rule,


which ensures that you are only making decisions that have a probability above .5 (50 percent). Nevertheless, as the Binomial distribution is discrete, you can end up in a catch-22 where, based on probability, you have to choose not to call the previous bidder, but when you bid yourself you will be bidding on something that has a lower than 50 percent probability of occurring. Confusing, I know. But I guess that’s also one of the reasons this game is quite fun: no matter how much you analyze it, you might not be able to make a correct decision.
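As a sketch, the rule of thumb can be written as a one-line check and compared with the exact binomial probability for a few bids. The function names are illustrative, and p = 1/3 again assumes a bid on a face other than 1. The comparison is only printed for a handful of cases and is not a proof that the shortcut always agrees with the exact 50 percent threshold.

```python
from math import comb


def should_call(x, n, m, k, p=1 / 3):
    """Rule of thumb: call when the expected number of matching dice
    among the other n - m dice is below x - k - 1/2."""
    expected_others = (n - m) * p
    return expected_others < (x - k) - 0.5


def exact_prob(x, n, m, k, p=1 / 3):
    """Exact probability that the other n - m dice supply at least x - k matches."""
    need, others = max(x - k, 0), n - m
    return sum(
        comb(others, i) * p**i * (1 - p) ** (others - i)
        for i in range(need, others + 1)
    )


# Ten dice in play, you hold five of them and two match the bid face.
n, m, k = 10, 5, 2
for x in range(2, 8):
    decision = "call" if should_call(x, n, m, k) else "don't call"
    print(f"bid x={x}: exact Pr = {exact_prob(x, n, m, k):.3f} -> {decision}")
```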


Proxy variable interactions and bias


I’ve recently taken on the daunting task of reading Wooldridge (2010) cover-to-cover. If I’m lucky I’ll remember 10-15% of it, but hey, at least I tried. Anyhow, I read a passage a couple of days ago which caught my interest, and since I wanted to make sense of it I started out by running Monte Carlo simulations of it. I also got kind of inspired by the blog of a new friend of mine, Rafael Ahlskog, so I thought I’d dust off the old blogging skills and start writing about some metrics for the few nerds out there. So here’s the story. Consider the structural model,

\displaystyle y= \beta_1 x_1 + \beta_2 x_2 + \hdots + \beta_K x_K + \gamma_1 q + \gamma_2 x_Kq

where \displaystyle q is some unobserved variable which happens to vary with \displaystyle x_K. For some reason, we don’t observe \displaystyle q but have a proxy variable for it, denoted by \displaystyle z, which fulfills (i) the redundancy condition \displaystyle \mathbb{E}[y \mid x_1, x_2, \hdots, x_K, q, z] = \mathbb{E}[y \mid x_1, x_2, \hdots, x_K, q] and (ii) \displaystyle Cov(q,x_j)=0 \; \forall j=1,2...K once we control for \displaystyle z. Let’s assume that \displaystyle q can be written as a linear function of \displaystyle z such that,

\displaystyle q= \theta_0 + \theta_1 z + r

If \displaystyle x_K and \displaystyle q weren’t interacted we’d be all happy and could consistently estimate \displaystyle \beta_j. However, now that \displaystyle x_K and \displaystyle q are interacted, the coefficient on \displaystyle x_K becomes biased. To see that, let’s simplify the structural model and consider only the three-variable case with \displaystyle x_1=1 (the intercept), \displaystyle x_K and \displaystyle q. We are interested in the conditional expectation of \displaystyle y given \displaystyle x_1 , x_K and \displaystyle q. Using our proxy variable we get,

\begin{aligned} \mathbb{E}[y \mid \mathbf{x}, q] &= \beta_1 + \beta_K x_K + \gamma_1 q + \gamma_2 x_K q \\ &= \beta_1 + \beta_K x_K + \gamma_1 (\theta_0 + \theta_1 z + r) + \gamma_2 x_K (\theta_0 + \theta_1 z + r) \\ &= \beta_1 + \gamma_1 \theta_0 + x_K( \beta_K + \gamma_2 \theta_0) + \gamma_1 \theta_1 z + \gamma_2 \theta_1 x_K z \\ \end{aligned}

where we have used the fact that \mathbb{E}[zr]=0, which follows by definition of the linear projection, and \mathbb{E}[x_K r]=0, which follows from our second assumption about the proxy variable. We see that the coefficient on x_K is not merely its structural parameter \beta_K but is biased by \gamma_2 \theta_0. In other words, \displaystyle \mathrm{plim} \; \hat{\beta}_K = \beta_K + \gamma_2 \theta_0. So when the variable of interest \displaystyle x_K is interacted with the unobserved variable (i.e. \displaystyle \gamma_2 \neq 0), we cannot consistently estimate \displaystyle \beta_K unless we assume that \displaystyle \theta_0=0, which I take as being the same as assuming that \displaystyle \mathbb{E}[q]=0, given that the mean value of the proxy is zero? (Wooldridge has a typo here, or as he referred to it when I emailed him, “a thinko”, as the book claims that what is needed is \displaystyle \mathbb{E}[z]=0.) This is kind of amazing! How often can we credibly assume that the mean of our unobserved variable is equal to zero? Take the classic example of ability, for instance. Wooldridge suggests demeaning the proxy variable as a solution to this problem, but that does not change the bias if \displaystyle \theta_0 \neq 0, which becomes obvious in the simulations I ran. I’ll post these in an upcoming segment.

So the take-home from all of this is that if you believe that your variable of interest is interacted with an unobserved variable, using a proxy for that variable won’t enable you to consistently estimate the parameter of interest unless you assume that the expected value of the unobserved variable is zero. Or is it sufficient that \displaystyle \mathbb{E}[q]=\mathbb{E}[z]?
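The bias term \gamma_2 \theta_0 can be illustrated with a small Monte Carlo sketch in Python with NumPy. This is not the simulation referred to in the post: all parameter values, the sample size, the seed, and the way x_K is made to depend on z are assumptions chosen for illustration. The proxy z is drawn with mean zero, so here \theta_0 equals \mathbb{E}[q]. The function regresses y on a constant, x_K, z and x_K z, and returns the coefficient on x_K.

```python
import numpy as np


def coef_on_xK(theta0, n=200_000, seed=0):
    """OLS coefficient on x_K when the proxy z replaces the unobserved q.

    All parameter values below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    beta1, betaK = 1.0, 2.0      # structural intercept and slope on x_K
    gamma1, gamma2 = 1.0, 0.5    # effect of q and of the interaction x_K * q
    theta1 = 1.0                 # slope in q = theta0 + theta1 * z + r

    z = rng.normal(0.0, 1.0, n)              # proxy with mean zero
    r = rng.normal(0.0, 1.0, n)              # projection error, independent of z and x_K
    q = theta0 + theta1 * z + r              # unobserved variable
    xK = 0.5 * z + rng.normal(0.0, 1.0, n)   # x_K related to q only through z
    e = rng.normal(0.0, 1.0, n)              # structural error

    y = beta1 + betaK * xK + gamma1 * q + gamma2 * xK * q + e

    # Regress y on (1, x_K, z, x_K * z), i.e. with the proxy in place of q
    X = np.column_stack([np.ones(n), xK, z, xK * z])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])


# With theta0 = 0 the estimate should be close to betaK = 2.0;
# with theta0 = 3 it should be close to betaK + gamma2 * theta0 = 3.5.
print("theta0 = 0:", coef_on_xK(0.0))
print("theta0 = 3:", coef_on_xK(3.0))
```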

PS. These blog posts are far from scientific in the sense that they lack peer review when published. I’d be happy to hear from anyone who finds errors or simply disagrees with the conclusions. DS.