Visualizing the Spread of Zika using Tableau

Introduction

As part of the Data Visualization module we had a group assignment to pick a topic of our choice and develop a visualization in Tableau.

We picked Zika because (a) it was something we wanted to know more about, (b) it had occurred recently but the outbreak was over, and (c) we found some data to get us started. We thought a tool to visualize the geographical spread of a disease over time would be useful for researchers and the general public alike.

In creating a data-based visualization there are two major considerations:

(1) what is the story?

(2) who am I telling it to?

The second question was easily answered. We wanted to create visualizations to allow the general public to explore and learn about Zika in their own way and at their own pace.

But first we needed to educate ourselves on the topic and figure out the story!

Finding the story and data collation

We googled, followed up on references within references within references, manually collated data from newspaper articles and peer-reviewed journals dating back to 1947, wrote a Python script to join circa 2000 CSV files into one, and joined all of this with other data sets we found online (WHO, PAHO and another GitHub repository) into one master CSV file.
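As an aside, the joining step itself is simple. Here is a minimal sketch of the idea (our actual script was written in Python; this illustrative version is in R, and the folder name is made up – it assumes the files sit in a single folder and share the same column layout):

    # Illustrative sketch only - our real script was Python and the folder
    # name here is invented. Assumes every CSV shares the same columns.
    files <- list.files("data/zika_reports", pattern = "\\.csv$", full.names = TRUE)
    master <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
    write.csv(master, "zika_master.csv", row.names = FALSE)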

The aim of the data collation exercise was to get as complete a picture as we could, within the allotted time frame, using what was freely available to us online. While the resulting data set does not necessarily contain every single last case of Zika ever reported, it is much more comprehensive than anything we came across while researching this project.

In this way we collated the data and gained deep subject-matter knowledge in one go.

It turns out there are many, many stories to tell about Zika and so from this point, we split up and each took on a story and set about creating visualizations.

I took on the History of the Spread of Zika by Mosquitoes: basically, when and where it has been established to occur and the number of cases reported.

Note: Zika can also be transmitted in other ways, but the mosquito is the primary vector, and when a case occurs in an area where the offending mosquito is present, it is assumed that this is the mode of transmission.

The Visualizations

To tell this story I created two dashboards:

(1) The History of the Spread of Zika – a high-level, mostly qualitative history of where and when Zika has been reported

(2) A quantitative dashboard focused on the outbreak in the Americas from 2015 to 2017.

1. The History of the Spread of Zika by Mosquitoes

You need access to Tableau Online to use the link above, so I have included a screenshot for anyone without access. Refer to Figure 1.

Figure 1 Screenshot of the History of the Spread of Zika dashboard. In the screenshot, all years up to and including 2018 are shown. The user can control which years to see, highlight the parent strain to see where the African and Asian strains have been found to occur, or filter by occurrence type (study, isolated case or outbreak). The #Cases and #Countries scorecards are linked to the year filter. There is a tooltip included (not shown) that shows the median % of the population affected for scientific studies and the # cases per country for isolated confirmed cases and outbreaks.

So that you can critically compare the visualization to the story I was trying to tell, here is a synopsis (it’s a bit long, but that’s the whole point of the visualization – to replace this text with an interactive learning tool):

  1. Zika was first detected in Uganda in 1947.
  2. It silently circulated through Africa and Asia until 2006, with only 14 reported cases in 58 years. It was mostly detected retrospectively as antibodies in patient serum samples.
  3. Over 40 different strains of Zika have been identified, but all are thought to belong to just two parent strains, referred to as the ‘African’ and ‘Asian’ strains. Between 1948 and 2006, the African strain was only detected in Africa and the Asian strain only in Asia.
  4. The first recorded outbreak occurred in 2007 on the island of Yap, in the Federated States of Micronesia in the western Pacific. There were 83,000 cases, equating to 73% of the population.
  5. Zika then made its way across the Pacific, hopping from island to island and causing similar outbreaks in French Polynesia and on Easter Island, before being reported in Brazil in early 2015, where it was first associated with microcephaly.
  6. There was an almost simultaneous outbreak on a small island group called Cabo Verde, off the west coast of Senegal.
  7. Although not reported at the time, studies on samples collected in Haiti in 2014 were subsequently found to be positive for Zika.
  8. Zika was then reported all over the low-lying areas of South and Central America and the Caribbean, and even reached North America (Texas and Florida), with almost 800,000 cases reported in 115 countries by the end of the outbreak (circa end 2017).
  9. There were 3 isolated cases reported in Guinea-Bissau on the west coast of Africa in 2016.
  10. Where the parent strain has been identified, it has been found that the Asian strain of Zika caused all cases outside mainland Africa, including the outbreak on Cabo Verde.

So in reaching Cabo Verde, Zika completed its first circumnavigation of the planet, almost 70 years after it was first detected.

A few points on the technical aspects of the dashboard

Showing the build-up over time

An important feature of the dashboard was to allow the user to see the build-up of the geographical spread of Zika over time. By dropping ‘Year’ onto the Pages shelf, Tableau Desktop creates a slide show of the view for each year, with playback functionality. There is a ‘Show History’ option, which is supposed to continue to show all previous observations as the slide show progresses. But with choropleths (filled maps), the previous values are not shown as filled countries but with a circle symbol instead. Refer to Figure 2 (left).

Figure 2 The History of Zika (1947-2006). When using the ‘Show History’ option with a choropleth in Tableau, previous observations are displayed as a filled circle symbol instead of a filled country (left). A workaround involving parameters and calculated fields enables the history to be shown as filled countries (right).

I did some googling and found a suggested workaround. A parameter called ‘Parameter Year’ was set up with integers spanning the range of years in question (1947-2017). A calculated field ‘Show Year’ was then created as [Year] <= [Parameter Year] and dropped on the filter shelf, set to only show values where this expression is TRUE (i.e. where the year is less than or equal to the parameter year). The parameter control was then turned on and can be used to slide between years, showing all countries with a recorded occurrence of Zika up to and including the year in question. Refer to Figure 2 (right).

You do lose the playback functionality though, so hopefully this is an issue Tableau will deal with sooner rather than later.

Dealing with Alaska and Hawaii

Zika made its way via mosquito as far as Texas and Florida in the United States. But using a choropleth at country level means Alaska and Hawaii are also filled in, which is misleading. To deal with this I found two options: (i) create my own custom geocoding for the United States and import it into Tableau, or (ii) be a bit creative with blank objects.

Given this was a once-off visualization, I decided not to invest time in the custom geocoding. Since the countries Zika reached are all clustered around the equator, I just created blank objects, coloured them white, gave them a white border and used them to hide the northern part of the world and Hawaii.

2. The Americas Outbreak 2015-2017

For this outbreak I had detailed weekly case counts at country level, which enabled me to put together a mini movie of the rise and fall of Zika in the Americas between 2015 and 2017 using Tableau Desktop. Unfortunately, again, the ‘play’ feature is only available in Tableau Desktop and not in Tableau Online or Server. I originally replaced the playback function with a slider on Tableau Online, but it is nowhere near as effective. Instead, I have made a homemade movie of the dashboard in action and posted it to YouTube.

As well as a map showing the geographical spread, with a qualitative representation of the number of cases in the form of the size of the filled blue circle (left), the dashboard includes graphs showing the weekly and cumulative case counts for (i) all countries (top right) and (ii) a particular country that the user selects to follow (bottom right). Tooltips are included to allow a user to explore the number of cases in more detail at any time point. In the video I’ve set the country to Brazil as it is of most interest.

You should get the following information from watching the mini movie a couple of times:

  1. There is a first wave of Zika reported only in Brazil. This wave peaks in July 2015 and there are approximately 30,000 cases.
  2. The second country to detect Zika is Cabo Verde, off the west coast of Senegal.
  3. The second wave is much larger and peaks in February 2016. This wave of Zika is reported to occur all over low-lying areas of South America, Central America and the Caribbean and by the end of this wave the cumulative count of cases is almost 800,000.
  4. A third wave occurs in the first half of 2017, but it is much smaller than the first and second waves.

Finally, I’d like to point out that the explosion of cases in January 2016 is likely a result of Zika becoming a reportable disease, as opposed to Zika expanding its territory to such a large extent overnight.

More Questions than Answers

Learning about the history of Zika left me with more questions than answers…

Why has the Asian and not the African strain made it around the world?

Why has the African strain not at least made it to Cabo Verde?

Why was Zika not associated with microcephaly before Brazil in 2015?

Is it genetics? Herd immunity in Africa? Difference in virulence? Or just that we never looked?

These questions are the subject of current scientific investigation. You can follow the links above if you are interested in finding out more.

Exit Polls

Introduction

This blog was a continuous assessment requirement for my final module. The task was straightforward – to write a blog of a certain word count on any topic relevant to data analytics. Picking a topic was hard though; where do you start?

It was the week of the referendum on the 8th amendment. At about 10pm on the day the country voted, both the Irish Times and RTE published exit polls.

The Irish Times poll was conducted by MRBI and had ‘approximately’ 4500 participants across 160 polling stations. It found 68% of voters in favour of repealing the 8th amendment and 32% against, with a margin of error ‘estimated as +/- 1.5%’.

The RTE poll was carried out by ‘Behaviour & Attitudes’ with 3779 participants across 175 polling stations. It found 69.4% of voters in favour of repeal and 30.6% against. The margin of error is stated to be 1.6%. It also had a breakdown of the poll across gender, region, age and some other categorical variables.

I heard these numbers and was certain that the Irish people had voted to repeal the 8th amendment. My husband was not so sure. He asked me why I was so confident, and I quoted the margin of error. He asked me how the error rate could be so low and my response was something like ‘it’s just proportional to the sample size and the actual split of the result’.

Blank look from my husband and then a string of questions…

It’s only 4000 people, how can that represent the whole country?

What if they asked more Yes than No voters?

What about people who refuse to answer?

What if people are not being truthful?

So I thought I’d write the required blog to answer his questions and re-acquaint myself with the specifics underpinning my intuitive faith in the exit polls.

Some Background and Terminology

In Ireland, referenda always have a binary response (Yes/No), and it is the overall proportion of Yes and No votes across the whole population that is counted. Local and regional differences are irrelevant. There are no electoral college votes or other extra complications. One vote is one vote.

The physical location where voting occurs is referred to as a polling place. For example, a local school. Within the polling place there are a number of polling stations. This is a subset of the area covered by the polling place. In practice this comes down to which table you go to!

An exit poll is a poll of actual voters taken immediately after they have voted. It differs from an opinion poll in that it eliminates a lot of sources of uncertainty. The probability of each participant having voted is 1, because every participant has actually cast a vote. There are no undecideds; a ballot can only be cast as a Yes or a No.

Exit polls offer a very rare opportunity to critically compare a sample to the actual population. Since every vote is counted, the real, actual, full, unadulterated proportion of Yes/No voters is available very soon after the exit poll sample is taken. For data geeks like me, this is very exciting (it takes all sorts…).

It’s only 4000 people, how can that represent the whole country?

The sample size needed is related to three things:

  1. the margin of error required (i.e. +/- X%)
  2. the proportion of Yes/No voters
  3. the level of confidence required in the result (how sure you want to be that your estimated interval contains the true answer)

Fortunately, with count data like a referendum result, the maths of this is pretty straightforward:

n = zcritical² × p(1 − p) / error²

where n is the sample size required, p is the proportion of voters voting one way, error is the margin of error required (expressed as a proportion, so 0.015 for +/- 1.5%) and zcritical is the critical z value for a stated level of confidence.

I haven’t seen the confidence level reported anywhere, so I am going to assume it is 95% in both cases, as this is conventional practice in statistics. The corresponding zcritical value can be found in standard statistical tables or using statistical software such as R. For 95% confidence it is 1.96.

So to achieve an error rate of 1.5% with 95% confidence that the resulting interval will contain the true result, the poll needs to have

n = 1.96² × p(1 − p) / 0.015²

participants.

But what about ‘p’, the proportion of voters voting a particular way?

The choice of which side to call p is up to us; from here on, p will be the proportion of voters voting Yes. Since the proportions of Yes and No voters sum to 1, the proportion of No voters is given by 1 − p.

Keeping zcritical and the required error rate constant at 1.96 and 1.5% respectively, we can plot the effect of changing proportions on the required sample size, n. Refer to Graph 1.

Graph 1 Sample size (n) required to achieve an error rate of +/- 1.5% with 95% confidence. When p = 0.5 the largest sample size of 4268 is needed, but as the proportion of voters voting a particular way moves away from 0.5 in either direction, the required sample size decreases.
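If you would like to reproduce the shape of Graph 1 yourself, a few lines of R will do it (the variable names here are my own):

    # Required sample size across possible values of p, holding the
    # margin of error at +/- 1.5% and the confidence level at 95%
    z <- qnorm(0.975)                 # critical z value for 95% confidence (~1.96)
    error <- 0.015                    # margin of error as a proportion
    p <- seq(0.05, 0.95, by = 0.01)   # candidate Yes proportions
    n <- z^2 * p * (1 - p) / error^2  # the sample size formula from above
    plot(p, n, type = "l",
         xlab = "Proportion voting Yes (p)",
         ylab = "Required sample size (n)")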

From the graph we see that the largest sample is required when the proportions are equal at 0.5. The bigger the difference in the proportions, the smaller the required sample size, and so in designing the study you would plan for the worst-case scenario of a dead heat.

When you plug p = 0.5 into the formula above, the sample size n works out to be 4268.
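As a quick sanity check, here is the same calculation in R, again assuming 95% confidence:

    # Worst-case sample size: dead heat (p = 0.5), +/- 1.5%, 95% confidence
    z <- qnorm(0.975)                        # ~1.96
    n <- z^2 * 0.5 * (1 - 0.5) / 0.015^2
    round(n)                                 # 4268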

So if we were just picking votes out of one big giant hat containing all 2-million-odd votes, we would need to pick out 4268 votes to get an error rate of 1.5% when p = 0.5. If there is a larger difference in the proportions on the day, it just makes the margin of error smaller. Which is never a bad thing!

Big Giant Caveat

Underpinning this straightforward calculation is a pretty big and critical assumption: that the 4268 participants in the poll represent a simple random sample of the overall population. A simple random sample is one in which every voter has an equal chance of being chosen to participate in the poll. In other words, there is no bias in the sample.

How do you sample 4000 people from a few million without introducing bias?

From a theoretical perspective, the best option would be to get every nth voter at every single polling place in the country to participate. Polling at every place means you don’t need to worry about different voting patterns in different areas, and asking every nth voter is a very easy way to deal with differences in the size of the voting population at different polling places (a function of the number of registered voters and the % turnout on the day). It is also a great way of ensuring the randomness of the selection.

So long as every nth voter actually agreed to participate, you would pretty much have a perfect random sample of the actual voting population.
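To convince yourself that this works, here is a little simulation in R. It is not how any real exit poll is run, just an illustration of ‘every nth voter’ sampling applied to a made-up electorate with the same size and Yes proportion as the actual result we will meet later:

    # Simulate 2,153,613 ballots with a true Yes proportion of 66.4%
    set.seed(8)
    votes <- sample(c("Yes", "No"), size = 2153613,
                    replace = TRUE, prob = c(0.664, 0.336))
    n_target <- 4268                            # sample size from the calculation above
    step <- floor(length(votes) / n_target)     # poll every "step-th" voter
    start <- sample(step, 1)                    # random starting point
    poll <- votes[seq(from = start, by = step, length.out = n_target)]
    mean(poll == "Yes")                         # lands within about +/- 1.5% of 0.664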

Unfortunately, reality tends to get in the way of perfect study design – not just for exit polls, but for everything really. There is always a trade-off. Part of designing a good study is to find the right balance between theory and reality. Sometimes the best theoretical approach is just too expensive, and the information gained from your study is not worth the cost of procuring it. Or sometimes it can be physically impossible or impractical to implement. It is important to find the right compromise and, if that is not possible, to reconsider whether or not the study should go ahead.

How many polling places are there?

After searching for and failing to find a comprehensive list of all polling places in Ireland for this referendum, I figured out that some of the 40 constituencies post their polling places on a website named ‘https://areanamereturningofficer.com’. I attempted to find the data for 20 of the 40 constituencies, representing different geographical locations and both urban and rural areas, and actually found it for 14. Taking the mean number of polling places across these 14 constituencies (about 60) and multiplying by 40 (the total number of constituencies) gives a good ballpark estimate of the total number of polling places in Ireland for this referendum. It’s important to remember that I am not looking for the exact number of polling places but rather to guesstimate the order of magnitude. Are there 100 or 1,000 or 1,000,000?

My magic number is 2420.

To sample circa 4000 voters from circa 2000 polling places, you would need to recruit, train, pay and coordinate around 2000 surveyors, and each would then only sample two or three voters at their polling place over the course of 15 hours of voting to reach the required 4000. You could try to reduce the number of surveyors by having each sample at more than one polling place, but this would increase the complexity of coordination and introduce a possibility of bias, as they would necessarily be sampling at different times of the day at the different polling places on their list.

Some form of clustering needs to be considered. The usual approach to this is to select a small number of polling places and then apply simple random sampling to the population of voters at that polling place (approach every nth voter).

How do you select the sample of polling places?

This is not a trivial task, and the accuracy of the poll hinges on selecting a subset of polling places that reflects the voting behaviour of the entire country. Do you pick a simple random sample of polling places, or do you select them based on demographics and likely voting behaviour? From what I have read (see here and here for a couple of examples), it seems to be accepted best practice to select polling places based on available information on likely voting behaviour and known past voting behaviour. In this way, only the change in the voting behaviour of the cluster is needed to predict the actual vote. This is obviously a more difficult task when the vote relates to a single issue than for a general election.

What about people who refuse to answer? What if people are not being truthful?

An assumption of the model based on random sampling from selected polling places is that voters who refuse to take part are not systematically different from those who do. So it is not the rate of refusal in itself that is an issue, but whether the rate of refusal differs between the different voting groups.

Exit polls can be conducted in different ways. One option is to have an interviewer ask voters questions as they leave. Another is to give the participant a mock ballot paper and ask them to complete it as they did in the polling booth, adding extra categorical information such as gender and age if required. In this way, the participant’s vote remains private, and so they are more likely both to agree to participate and to be truthful in their response.

Comparison of the exit polls to the actual result

The actual nationwide result was 66.4% Yes with a turnout of 64.13% (2,153,613 valid ballots).

Since this is the actual count of every vote cast, it is the true value. There is probably some very small error associated with this number due to human error during counting but for practical purposes I am going to assume the error is 0%.

The null hypothesis (H0) is that the exit poll (Irish Times 68% and RTE 69.4%) accurately predicted the result (66.4%).

H0: ppoll − pactual = 0

The alternative hypothesis (Ha) is that the exit poll did not accurately predict the result.

Ha: ppoll − pactual ≠ 0

Because there is no good reason not to, alpha is set equal to 0.05.

The prop.test function in R provides an easy way to calculate the outcome. Refer to Figure 1 for the output from R.

Figure 1 Output from the prop.test function in R to formally test if the result of the Irish Times poll of 68.0% Yes is statistically significantly different to the true result of 66.4%. Output is similar for the RTE poll.
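For anyone who wants to reproduce Figure 1, the call looks something like the following. Note that the Yes counts are back-calculated from the reported percentages and sample sizes, so they are approximations:

    # One-sample tests of each poll result against the true value of 66.4%
    n_it  <- 4500                               # approximate Irish Times sample size
    x_it  <- round(0.680 * n_it)                # implied number of Yes responses
    prop.test(x = x_it, n = n_it, p = 0.664)

    n_rte <- 3779                               # RTE sample size
    x_rte <- round(0.694 * n_rte)
    prop.test(x = x_rte, n = n_rte, p = 0.664)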

With a p-value < 0.05 for both the Irish Times and RTE poll results, we reject the null hypothesis that the polls accurately predicted the outcome of the referendum. At alpha = 0.05, there is a statistically significant difference between the actual result and each of the exit polls.

The prop.test function also gives the 95% confidence interval for the two poll results (Figure 1). Again, this means that if the poll were conducted 100 times and there really were no bias in the sampling, 95 of the resulting intervals would contain the true value.

For the IT poll the interval is 66.6 to 69.3% and for the RTE poll it is 67.9 to 70.8%. Refer to Graph 2 for a visual comparison.

Graph 2 Comparison of the 95% confidence intervals for both the Irish Times and RTE polls to the true value of 66.4%
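As a rough cross-check on those intervals, the familiar normal approximation gets you close (prop.test itself uses the slightly different Wilson score method, so the figures will not match exactly):

    # Approximate 95% confidence intervals via the normal approximation
    0.680 + c(-1, 1) * qnorm(0.975) * sqrt(0.680 * 0.320 / 4500)   # Irish Times
    0.694 + c(-1, 1) * qnorm(0.975) * sqrt(0.694 * 0.306 / 3779)   # RTE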

While the Irish Times poll interval came pretty close to containing the true value (a 1.6% difference between the estimated value and the true result, and a 0.2% difference between the lower end of the interval and the true result), the RTE poll overestimated the Yes vote by 3%, with a 1.5% difference between the true value and the lower end of its confidence interval. When the result is so clear-cut this is not practically important, but in a closer outcome it would be quite a serious overestimation.

There are many reasons that could account for the differences between the polls, but it is worth noting that the RTE poll was also a behaviour-and-attitudes poll, where participants were asked to supply information about themselves (age, gender and political party support) and to answer questions about the reasons for their decision and the factors that influenced it.