**Introduction**

This blog was a continuous assessment requirement for my final module. The task was straightforward: write a blog of a certain word count on any topic relevant to data analytics. Picking a topic was hard, though. Where do you start?

It was the week of the referendum on the 8th amendment. At about 10pm on the day the country voted, both the Irish Times and RTE published exit polls.

The Irish Times poll was conducted by MRBI and had ‘approximately’ 4500 participants across 160 polling stations. It found 68% of voters in favour of repealing the 8th amendment and 32% against, with a margin of error ‘estimated as +/- 1.5%’.

The RTE poll was carried out by ‘Behaviour &amp; Attitudes’ with 3779 participants across 175 polling stations. It found 69.4% of voters in favour of repeal and 30.6% against. The margin of error is stated to be 1.6%. It also had a breakdown of the poll across gender, region, age and some other categorical variables.

I heard these numbers and was certain that the Irish people had voted to repeal the 8th amendment. My husband was not so sure. He asked me why I was so confident, and I quoted the margin of error. He asked me how the error rate could be so low and my response was something like ‘it’s just proportional to the sample size and the actual split of the result’.

Blank look from my husband, and then a string of questions…

It’s only 4000 people, how can that represent the whole country?

What if they have asked more yes than no voters?

What about people who refuse to answer?

What if people are not being truthful?

So I thought I’d write the required blogs to answer his questions and re-acquaint myself with the specifics underpinning my intuitive faith in the exit polls.

**Some Background and Terminology**

In Ireland, referenda are always a binary response (Yes/No) and it is the overall population proportion of Yes and No that is counted. Local and regional differences are irrelevant. There are no electoral college votes or other extra complications. One vote is one vote.

The physical location where voting occurs is referred to as a polling place. For example, a local school. Within the polling place there are a number of polling stations. This is a subset of the area covered by the polling place. In practice this comes down to which table you go to!

An exit poll is a poll of *actual voters* taken *immediately after* they have voted. It differs from an opinion poll in that it eliminates a lot of sources of uncertainty. The probability of each participant voting is 1 because every participant has actually cast a vote. There are no undecideds, a ballot can only be cast as a Yes or a No.

They offer a very rare opportunity to critically compare a sample to the actual population. Since every vote is counted, the real, actual, full unadulterated proportion of Yes/No voters is available very soon after the exit poll sample is taken. For data geeks like me, this is very exciting (it takes all sorts….).

**It’s only 4000 people, how can that represent the whole country?**

The sample size needed is related to three things:

- the margin of error required (i.e. +/- X%)
- the proportion of Yes/No voters
- the level of confidence required in the result (how sure you want to be that your estimated interval contains the true answer)

Fortunately, with count data like a referendum result, the maths of this is pretty straightforward:

$$n = \frac{z_{critical}^2 \, p(1-p)}{error^2}$$

where n is the sample size required, p is the proportion of voters voting one way, error is the margin of error required (expressed as a proportion, e.g. 0.015 for 1.5%) and z_critical is the critical z value for a stated level of confidence.

I haven’t seen the confidence level reported anywhere, so I am going to assume it is 95% in both cases, as this is conventional practice in statistics. The corresponding z_critical value can be found in standard statistical tables or using statistical software such as R. For 95% confidence it is 1.96.

So to achieve an error rate of 1.5% with 95% confidence that the resulting interval will contain the true result, the poll needs to have

$$n = \frac{1.96^2 \, p(1-p)}{0.015^2}$$

participants.
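As a quick sanity check, the formula is easy to code up. A minimal sketch in Python (the pollsters would more likely use R or tables; the function names here are mine), using the standard library's `NormalDist` to look up the critical z value:

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-sided critical z value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size(p: float, error: float, confidence: float = 0.95) -> int:
    """Sample size needed to estimate a proportion p to within +/- error."""
    z = z_critical(confidence)
    return round(z ** 2 * p * (1 - p) / error ** 2)

print(z_critical(0.95))         # about 1.96
print(sample_size(0.5, 0.015))  # 4268, the worst case of a dead heat
```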

But what about ‘p’, the proportion of voters voting a particular way?

The choice is user defined and from here on p will be the proportion of voters voting Yes. Since the total proportion of Yes and No voters = 1, the proportion of No voters is given by 1-p.

Keeping z_critical and the required error rate constant at 1.96 and 1.5% respectively, we can plot the effect of changing proportions on the required sample size, n. Refer to **Graph 1**.

**Graph 1** Sample size (n) required to achieve an error rate of +/- 1.5% with 95% confidence. When p = 0.5 the largest sample size of 4268 is needed, but as the proportion of voters voting a particular way moves away from 0.5 in either direction, the required sample size decreases.

From the graph we see that the largest sample is required when the proportions are equal at 0.5. The bigger the difference in the proportions, the smaller the required sample size, so in designing the study you would plan for the worst-case scenario of a dead heat.

When you plug p = 0.5 into the formula above, the sample size n works out to be 4268.

So if we were just picking votes out of one big giant hat containing all 2 million odd votes, we would need to pick out 4268 votes to get an error rate of 1.5% when p = 0.5. If there is a larger difference in the proportions on the day, it just makes the margin of error smaller. Which is never a bad thing!
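The giant-hat picture can be checked by simulation. The sketch below (all numbers are illustrative assumptions: a dead heat, 1,000 simulated polls, a fixed random seed) repeatedly draws 4268 votes at random and counts how often the poll lands within +/- 1.5% of the truth; it should be roughly 95% of the time.

```python
import random

random.seed(42)

P_TRUE = 0.5       # assume a dead heat, the worst case
N_SAMPLE = 4268    # sample size from the formula at p = 0.5
MARGIN = 0.015
SIMULATIONS = 1000

hits = 0
for _ in range(SIMULATIONS):
    # Each random draw is one vote pulled out of the giant hat
    yes = sum(random.random() < P_TRUE for _ in range(N_SAMPLE))
    p_hat = yes / N_SAMPLE
    if abs(p_hat - P_TRUE) <= MARGIN:
        hits += 1

coverage = hits / SIMULATIONS
print(f"{coverage:.1%} of simulated polls were within +/- 1.5% of the truth")
```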

**Big Giant Caveat**

Underpinning this straightforward calculation is a pretty big and critical assumption: that the 4268 participants in the poll represent a *simple random sample* of the overall population. A simple random sample is one in which every voter has an equal chance of being chosen to participate in the poll. In other words, there is no bias in the sample.

**How do you sample 4000 people from a few million without introducing bias?**

From a theoretical perspective, the best option would be to get every *n*th voter from every single polling place in the country to participate. Polling at every station means you don’t need to worry about different voting patterns in different areas and asking every *n*th voter is a very easy way to deal with differences in the size of the voting populations at different polling stations (a function of the number of registered voters and % turnout on the day). It is also a great way of ensuring the randomness of selection.

So long as every *n*th voter actually agreed to participate, you would pretty much have a perfect random sample of the *actual* voting population.

Unfortunately, reality tends to get in the way of perfect study design. Not just for exit polls, for everything really. There is always a trade-off. Part of designing a good study is to find the right balance between theory and reality. Sometimes the best theoretical approach is just too expensive, and the information gained from your study is not worth the cost of procuring it. Or sometimes it can be physically impossible or impractical to implement. It is important to find the right compromise and if that is not possible to reconsider whether or not the study should go ahead.

**How many polling places are there?**

After searching for and not finding a comprehensive list of all polling places in Ireland for this referendum, I figured out that some of the 40 constituencies post their polling places on a website named ‘https://**areaname**returningofficer.com’. I attempted to find the data for 20 of the 40 constituencies, representing different geographical locations and both urban and rural areas. I actually found it for 14. Taking the mean number of polling places of these 14 constituencies and multiplying by 40 (the total number of constituencies) gives a good ballpark estimate of the total number of polling places in Ireland for this referendum. It’s important to remember here that I am not looking to find the exact number of polling places in Ireland but rather to guesstimate the order of magnitude. Are there 100 or 1000 or 1,000,000?

My magic number is 2420.

To sample circa 4000 voters from 2000 polling places you would need to recruit, train, pay and coordinate 2000 surveyors. Then each would only sample two or three voters from their polling place over the course of 15 hours of voting to get the required 4000. You could try to reduce the number of surveyors required by getting them to sample at more than one polling place, but this would increase the complexity of coordination and introduce a possibility of bias, as they would necessarily be sampling at different times of the day at the different polling places on their list.

Some form of clustering needs to be considered. The usual approach to this is to select a small number of polling places and then apply simple random sampling to the population of voters at that polling place (approach every *n*th voter).
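To see why the choice of polling places matters, here is a small simulation with invented numbers (the 0.08 spread of local Yes shares, the 300 repetitions and the seed are all my assumptions for illustration). It compares a simple random sample of 4268 voters drawn nationwide against the same total drawn from 160 randomly chosen places; when places genuinely differ from each other, the clustered design comes out noticeably noisier.

```python
import random
import statistics

random.seed(1)

N_PLACES = 2420      # the ballpark estimate above
NATIONAL_P = 0.664
SAMPLE = 4268
N_CLUSTERS = 160
SIMS = 300

# Invented local Yes shares: places vary around the national figure
local_p = [min(max(random.gauss(NATIONAL_P, 0.08), 0.05), 0.95)
           for _ in range(N_PLACES)]

def srs_estimate():
    # Simple random sample: each voter comes from a random place
    yes = sum(random.random() < local_p[random.randrange(N_PLACES)]
              for _ in range(SAMPLE))
    return yes / SAMPLE

def cluster_estimate():
    # Cluster sample: pick 160 places, then an equal handful of voters in each
    places = random.sample(range(N_PLACES), N_CLUSTERS)
    per_place = SAMPLE // N_CLUSTERS
    yes = sum(random.random() < local_p[pl]
              for pl in places for _ in range(per_place))
    return yes / (per_place * N_CLUSTERS)

srs_sd = statistics.stdev(srs_estimate() for _ in range(SIMS))
cluster_sd = statistics.stdev(cluster_estimate() for _ in range(SIMS))
print(f"spread of SRS polls:     {srs_sd:.4f}")
print(f"spread of cluster polls: {cluster_sd:.4f}")
```

The extra spread in the clustered design is the price paid for surveying far fewer locations, which is why the *selection* of those locations is so important.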

**How do you select the sample of polling places?**

This is not a trivial task, and the accuracy of the poll hinges on selecting a subset of polling places that reflects the voting behaviour of the entire country. Do you pick a simple random sample of polling places, or do you select them based on demographics and likely voting behaviour? From what I have read (see here and here for a couple of examples), it seems the accepted best practice is to select polling stations based on available information on likely voting behaviour and known past voting behaviour. In this way only the *change* in voting behaviour of the cluster is needed to predict the actual vote. This is obviously a more difficult task when the vote relates to a single issue than for general elections.

**What about people who refuse to answer? What if people are not being truthful?**

An assumption of the model based on random sampling from selected polling stations is that voters who refuse to take part are not systematically different from those who do. So it is not the rate of refusal in itself that is an issue, but whether the rate of refusal is *different* between different voting groups.

Exit polls can be conducted in different ways. One option is to have an interviewer ask voters questions as they leave. Another is to give the participant a mock ballot paper and ask them to complete it as they did in the polling booth, and to give extra categorical information such as gender and age if required. In this way the participant’s vote remains private, so they are more likely both to agree to participate and to be truthful in their response.

**Comparison of the exit polls to the actual result**

The actual nationwide result was 66.4% Yes with a turnout of 64.13% (2,153,613 valid ballots).

Since this is the actual count of every vote cast, it is the true value. There is probably some very small error associated with this number due to human error during counting but for practical purposes I am going to assume the error is 0%.

The null hypothesis (H_0) is that the exit poll (Irish Times 68% and RTE 69.4%) accurately predicted the result (66.4%).

H_0: p_poll − p_actual = 0

The alternative hypothesis (H_{a}) is that the exit poll did not accurately predict the result.

H_a: p_poll − p_actual ≠ 0

Because there is no good reason not to, alpha is set equal to 0.05.

The **prop.test** function in R provides an easy way to calculate the outcome. Refer to **Figure 1** for the output from R.

**Figure 1** Output from the **prop.test** function in R to formally test if the result of the Irish Times poll of 68.0% Yes is statistically significantly different to the true result of 66.4%. Output is similar for the RTE poll.

With a p-value < 0.05 for both the Irish Times and RTE poll results, we reject the null hypothesis that the polls accurately predicted the outcome of the referendum. At alpha = 0.05, there is a statistically significant difference between the actual result and the exit poll estimate for both the Irish Times and RTE polls.

The **prop.test** function also gives the 95% confidence interval for the two poll results (**Figure 1**). Again, this means that if the polls were conducted 100 times and there really were no bias in the sampling, about 95 of the resulting intervals would contain the true value.

For the IT poll the interval is 66.6 to 69.3% and for the RTE poll it is 67.9 to 70.8%. Refer to **Graph 2** for a visual comparison.
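The post used R's `prop.test`, but the same test can be sketched by hand. The Python sketch below uses the plain normal approximation without the continuity correction that `prop.test` applies by default, so the numbers differ slightly from Figure 1; it also assumes the Irish Times sample was exactly 4500, which the paper only gave as ‘approximately’.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_test(p_hat, n, p0, conf=0.95):
    """Two-sided z test of a sample proportion against a known value p0,
    plus a Wald confidence interval for the sample proportion."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    zc = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half = zc * sqrt(p_hat * (1 - p_hat) / n)
    return p_value, (p_hat - half, p_hat + half)

actual = 0.664
for name, p_hat, n in [("Irish Times", 0.680, 4500), ("RTE", 0.694, 3779)]:
    p_value, (lo, hi) = one_prop_test(p_hat, n, actual)
    print(f"{name}: p-value {p_value:.4f}, 95% CI {lo:.1%} to {hi:.1%}")
```

Both p-values come out below 0.05, matching the conclusion from R, and the intervals land within a fraction of a percent of those reported above.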

**Graph 2** Comparison of the 95% confidence intervals for both the Irish Times and RTE polls to the true value of 66.4%

While the Irish Times poll interval came pretty close to containing the true value (a 1.6% difference between the estimate and the true result, and a 0.2% difference between the lower end of the interval and the true result), the RTE poll overestimated the Yes vote by 3%, leaving a gap of 1.5% between the true value and the lower end of its confidence interval. With a result this clear-cut that is not practically important, but in a closer contest it would be quite a significant overestimation.

There are many reasons that could account for the differences between the polls, but it is worth noting that the RTE poll was also a behaviour and attitude poll, where participants were asked to supply information about themselves (age, gender and political party support) and to answer questions about the reasons for their decision and the factors that influenced it.