How to Create a Google Earth Choropleth Map: Chester County 2012 General Election

May 14th, 2013 | No Comments »

Summary: In this post I will show you how I created a Google Earth choropleth map  of the 2012 Presidential Election results of the precincts in Chester County, Pennsylvania.  Here is the final result:

Google Earth Chester County 2012 General Election Choropleth Map

 

Here are the links to some of the tools and resources we’ll be talking about (for easy later reference):

Background

For different scenarios it is useful to understand people’s political leanings on a granular level.  For me personally, this interest started because I am currently looking to buy a house in Chester County, but obviously there are many other reasons why such a map might be useful (e.g. for political campaigns that want to get the most out of their resources).

Most election result data and maps only go down to the level of individual counties, such as this Washington Post map:

2012 Washington Post Election Map Chester County

So in a toss-up state, Chester County is a toss-up county: 49.7% Romney and 49.2% Obama.  For purposes of looking at real estate, this is however still too coarse.  So I wondered how I could get at a more detailed breakdown.

Overview

Here’s what we’ll be doing in a nutshell: get election data and tie it to the map shapefiles of Chester County voting precincts, then convert those shapefiles to KML files so we can view the map in Google Earth.

Getting the Data

The lowest level of political data I could get my hands on were the Chester County (CC) precinct results, which are conveniently located here: Chester County Election Results Archive.  Inconveniently, the data for each precinct is contained in a separate PDF.  To use the data in our map, we’ll need it all together in one place.  So I went through all 200+ PDFs and transcribed the precinct data in Excel (yes it sucked).  Here’s the data in CSV format.  Note that I only picked out votes for Romney or Obama, not all the other information contained in the precinct reports.

So now we have the voting data, we’ll need the map data (in the form of shapefiles) of Chester County precincts.  We could create those ourselves, but who has time for that.  What I found was the 2010 Census TIGER/Line® Shapefiles (there’s also the 2008 TIGER/Line® Shapefiles for: Pennsylvania).  Click on “Download Shapefiles” on the left, then “Voting Districts” from the dropdown and then select the state and county.  You’ll get a package like this one for Chester County.

Now that we have the data and the map shapefiles we come to the hardest part of this exercise: marrying the data with the map.

Map Meet Data, Data Meet Map

So admittedly I am no expert on shapefiles; I have plenty more learning to do there.  For a better explanation of what a shapefile is, consult your encyclopedia.  Important for us: there are two files that interact, the actual shapefile (.shp extension), which contains the geometry of polygons, and a data file (.dbf extension), which contains data connected to the shapes – in our case representing voting precincts.

From my Google searches, the following tasks are best accomplished in ArcGIS, the industry-standard mapping software by a company called Esri.  But it costs money, which I don’t have to spend on this right now.  So we will use three other tools: GeoDa, Excel 2010, and Access 2010.  Theoretically we don’t really need GeoDa, but it helps to see what’s going on.

So when you open the shapefile (.shp extension) in GeoDa (which you can download for free here) you’ll see something like this:

Shapefile in GeoDa

 

You can also look at the data associated with the shapefile which is contained in the .dbf file.  Just click on that little table icon (third from left) and you’ll see something like this:

GeoDa Shapefile Table

There’s nothing there now that’s really interesting for us, other than the names of the actual voting precincts.  What we need is the voting data.  In the source data file (the .csv file), we have votes for both Romney and Obama.  To combine these in just one column (since the coloring will be based on one variable as contained in one column) I just used the percentage of votes for Obama, with Romney representing the flipside.  So a 34% vote Obama would simply mean a 66% vote for Romney (nevermind the 1% third party candidates).  This’ll make sense when you take a look at the Category Editor screenshot below and look at the breaks compared to the labels.
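As a minimal sketch of that single-column idea, here is how the share could be computed in Python.  The function name and the vote counts are made up for illustration (they are not from the precinct reports), and I'm assuming the share is taken over Obama + Romney votes only, which ignores third-party candidates:

```python
def obama_share(obama_votes, romney_votes):
    """Return Obama's share of the Obama + Romney vote as a percentage."""
    total = obama_votes + romney_votes
    if total == 0:
        return None  # precinct with no recorded votes for either candidate
    return 100.0 * obama_votes / total

# Hypothetical precinct: 340 Obama votes, 660 Romney votes
share = obama_share(340, 660)
print(share, 100 - share)  # Obama %, and Romney % as the flipside
```

Storing only this one number per precinct is what lets a single column drive the coloring.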

At this point we could just add a column and input the voting percentages for Obama for each precinct.  This is what I initially did, but editing this in Excel is much faster, so we’ll do that.  Here’s what you need to do:

  1. Make copies of all the shapefiles and work on those
  2. In Excel, open the .dbf file
  3. Add a column called “Vote” with the percentage of votes for Obama (or Romney, doesn’t matter)
  4. You cannot save back to .dbf format, so save as .xlsx
  5. Open MS Access
  6. Under External Data import Excel and choose the .xlsx file you just created; be sure to select the first row as column headers
  7. Then export it (under External Data > Export > More) as a dBase File. Pick a name and click Ok
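If you'd rather skip the Excel/Access round trip, the join itself can be sketched in a few lines of Python.  This is only a sketch under assumptions: the column names (`NAME`, `Precinct`, `Obama`, `Romney`) and the sample rows are hypothetical, and it assumes precinct names match between the two sources apart from case and stray spaces.  Reading and writing the actual .dbf file would need a third-party library, which I'm leaving out here.

```python
import csv
import io

# Stand-in for the CSV of transcribed results (column names are assumed)
votes_csv = io.StringIO(
    "Precinct,Obama,Romney\n"
    "Precinct A,340,660\n"
    "Precinct B,510,490\n"
)

# Stand-in for the rows of the shapefile's .dbf table
dbf_rows = [{"NAME": "Precinct A"}, {"NAME": "Precinct B"}, {"NAME": "Precinct C"}]

def add_vote_column(rows, csv_file):
    """Attach a 'Vote' value (Obama % of Obama + Romney) to each row by precinct name.

    Returns the names of rows that had no match in the CSV.
    """
    shares = {}
    for rec in csv.DictReader(csv_file):
        obama, romney = int(rec["Obama"]), int(rec["Romney"])
        key = rec["Precinct"].strip().lower()
        shares[key] = round(100.0 * obama / (obama + romney), 1)
    unmatched = []
    for row in rows:
        row["Vote"] = shares.get(row["NAME"].strip().lower())  # None if no match
        if row["Vote"] is None:
            unmatched.append(row["NAME"])
    return unmatched

missing = add_vote_column(dbf_rows, votes_csv)
print(dbf_rows, missing)
```

The list of unmatched names is the useful part: with 200+ hand-transcribed precincts, a few names are bound to be spelled differently in the two sources.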

Now when you reopen the shapefile in GeoDa, you’ll see the column you just added:

Vote Column

Color-coding the Map: Choroplethic!

At this point, you can already create a choropleth map in GeoDa.  Right-click on the map and select Change Current Map Type > Create New Custom.  Play around with the breaks and labels and such.  Here’s what I came up with:

GeoDa Category Editor

 

GeoDa Choropleth Map of 2012 Chester County Presidential Elections
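The category editor is really just assigning each precinct to a bin based on where its Vote value falls between the breaks.  Here is a rough sketch of that binning in Python.  The break points and labels are ones I made up for illustration (they are not necessarily those in the screenshot), and the labels assume Romney's share is simply 100 minus Obama's:

```python
import bisect

# Hypothetical break points (Obama %) and one label per resulting bin
BREAKS = [40, 45, 50, 55, 60]
LABELS = ["Romney 60%+", "Romney 55-60%", "Romney 50-55%",
          "Obama 50-55%", "Obama 55-60%", "Obama 60%+"]

def classify(obama_pct):
    """Return the label of the bin that an Obama percentage falls into."""
    return LABELS[bisect.bisect_right(BREAKS, obama_pct)]

for pct in (34, 47, 50, 72):
    print(pct, classify(pct))
```

Note that there is always one more label than there are breaks, which is why the breaks and labels look offset from each other in the editor.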

Already pretty cool!  But we’d like to see some of the underlying geographic data so we can do some more in-depth analysis of what’s going on. For example, why is there such a big Obama supporting precinct surrounded by Republican precincts in the lower left corner?

This last part is thankfully easy.

You’ll need to download a little stand-alone program called shp2kml.  This will convert your .shp file (including data) into a .kml file which can be viewed in Google Earth.  Here are the settings I used:

shp2kml Settings 1 shp2kml Settings 2 shp2kml Settings 3 shp2kml Settings 4

 

On that last screen, click “Create and Open”, pick a file name, and voila, Google Earth opens with your awesome choropleth map: sweet!
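shp2kml does all of this for you, but for the curious, here is a minimal sketch of what a colored polygon looks like in KML, generated from Python.  It is not what shp2kml produces internally (I don't know that); the helper names and the square's coordinates are invented.  The one gotcha worth knowing is that KML colors are written as alpha-blue-green-red hex, not RGB:

```python
from xml.sax.saxutils import escape

def kml_color(red, green, blue, alpha=255):
    """KML colors are hex in alpha-blue-green-red order, not RGB."""
    return "%02x%02x%02x%02x" % (alpha, blue, green, red)

def placemark(name, ring, color):
    """Build one KML Placemark for a polygon given (lon, lat) pairs."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # close the ring by repeating the first point
    coords = " ".join("%f,%f,0" % (lon, lat) for lon, lat in ring)
    return (
        "<Placemark><name>%s</name>"
        "<Style><PolyStyle><color>%s</color></PolyStyle></Style>"
        "<Polygon><outerBoundaryIs><LinearRing><coordinates>%s"
        "</coordinates></LinearRing></outerBoundaryIs></Polygon>"
        "</Placemark>" % (escape(name), color, coords)
    )

def kml_document(placemarks):
    """Wrap Placemark strings in a KML Document."""
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>%s'
            '</Document></kml>' % "".join(placemarks))

# Made-up square; not a real precinct boundary
square = [(-75.93, 39.80), (-75.92, 39.80), (-75.92, 39.81), (-75.93, 39.81)]
blue = kml_color(0, 0, 255, alpha=180)  # semi-transparent blue
doc = kml_document([placemark("Example precinct", square, blue)])
print(doc)
```

Saving `doc` to a file with a .kml extension should give you something Google Earth can open.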

Now we can see that in an otherwise sparsely populated district, we find Lincoln University: this likely explains why there is such a pro-Obama precinct in otherwise Republican territory:

Lincoln University

 

Next, I’ll have to figure how to make these maps using d3.  Enjoy! :)

 


Post-Election Model Evaluation Against Actual Results: a Victory for State Polls

November 7th, 2012 | 1 Comment »

So, here we are, November 7, and Barack Obama has been re-elected POTUS.  This came as a surprise to no-one who’s been watching the polls and following Nate Silver.  Well, maybe to Dean Chambers of Unskewedpolls.com.  Poor guy.

Nate did extremely well, correctly predicting all states and the overall election.  He had Florida at 50/50, and here it is, still undecided because it’s so close.

And how did my humble prediction model do? Equally well! (Read yesterday’s predictions here.)  Here are Nate’s and my predictions for competitive states (any state beyond 100% certainty):

Actual winners highlighted in blue or red.  The sole holdout is Florida, which we had at 50/50.

Taking a look at my projected likely outcomes for an Obama win:

The first two paths to victory are his actual paths to victory.  At the moment, CNN has Obama with 303 Electoral College votes, but if Obama wins Florida, he’ll get the 332. The outcomes above reflect the fact that Obama is slightly more likely to get Florida than not, and in actual fact this is what it’s looking like right now.

Another way to look at the accuracy of the model is to compare the computed state averages based on state polls to the actual election outcomes:

What we’re looking at here are the competitive states, sorted by how much the actual spread (based on election results) between D and R varied from the projected spread. One way to read this is that the undecideds broke more one way than the other.  In Arizona, for example, the polls had Obama at 43%, which is what he actually got. Romney, however, picked up 55% of the vote instead of the projected 50%.  One way to interpret this is that the additional 5% that Romney received were undecideds in the poll.  The average spread difference in these competitive states was only 2 points – not bad!  Looks like the state polls were pretty good indicators of how the election would pan out.
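To make the "spread difference" concrete, here is the Arizona example worked through in Python.  The function names are mine, and whether the average quoted above was taken over signed or absolute differences is an assumption I'm not making either way; this just shows the per-state arithmetic:

```python
def spread(dem, rep):
    """D minus R margin in percentage points (negative means R leads)."""
    return dem - rep

def spread_difference(poll_dem, poll_rep, actual_dem, actual_rep):
    """How far the actual D-R margin landed from the projected one."""
    return spread(actual_dem, actual_rep) - spread(poll_dem, poll_rep)

# Arizona: polls had Obama 43 / Romney 50, the result was Obama 43 / Romney 55
az = spread_difference(43, 50, 43, 55)
print(az)  # -5: the margin moved 5 points toward Romney
```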

Overall, my simple model did extremely well, actually surprisingly so, given how simple it is. The way it works is:

  1. Take the average D and R percentage and margin of error of the past X number of state polls for a given state (I’ve been using 10) for all states
  2. Simulate the election outcome by randomly picking a vote percentage value within the margin of error for each state, determining the winner, and allocating the appropriate number of Electoral College votes for the winner
  3. Do this 1,000,000 times and determine the percentage that each candidate won (or there was a tie); this is a Monte Carlo simulation to determine the probabilities of an outcome
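The three steps above can be sketched roughly like this.  To be clear, this is a reconstruction for illustration, not the original code: I'm assuming a uniform draw within the margin of error, the three "states" and the safe electoral vote totals are invented, and the default number of runs is much smaller than 1,000,000 so it finishes quickly.

```python
import random

def simulate_election(states, safe_dem=0, safe_rep=0, runs=10000, seed=None):
    """Monte Carlo sketch.

    states: list of (name, dem_pct, rep_pct, margin_of_error, electoral_votes).
    Returns the fraction of runs won by each side (or tied).
    """
    rng = random.Random(seed)
    total_ev = safe_dem + safe_rep + sum(ev for _, _, _, _, ev in states)
    wins = {"dem": 0, "rep": 0, "tie": 0}
    for _ in range(runs):
        dem_ev, rep_ev = safe_dem, safe_rep
        for _name, dem, rep, moe, ev in states:
            # Step 2: pick a vote share within the margin of error for each side
            dem_draw = rng.uniform(dem - moe, dem + moe)
            rep_draw = rng.uniform(rep - moe, rep + moe)
            if dem_draw > rep_draw:
                dem_ev += ev
            else:
                rep_ev += ev
        # A side wins with more than half of all electoral votes
        if 2 * dem_ev > total_ev:
            wins["dem"] += 1
        elif 2 * rep_ev > total_ev:
            wins["rep"] += 1
        else:
            wins["tie"] += 1
    # Step 3: turn counts into probabilities
    return {side: count / runs for side, count in wins.items()}

# Invented poll averages for three competitive states
toy_states = [("State A", 50, 47, 3, 18),
              ("State B", 48, 49, 4, 29),
              ("State C", 49, 48, 3, 13)]
odds = simulate_election(toy_states, safe_dem=237, safe_rep=241, seed=42)
print(odds)
```

Passing a `seed` makes a run repeatable, which is handy when comparing different look-back windows.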

That’s basically it.  How many polls to look back in this case was pretty arbitrary (10).  Using a lower number gives you possibly a bit fresher set of data (the polls were taken more recently) but leaves you a bit more exposed to potential outliers.  For example, if you look at my post from yesterday, you can see that using only the past 5 polls Florida had a probability of 37% Dem to 63% Rep.  Here are the past 5 polls:

We can see here that the InsiderAdvantage poll was an outlier and had Romney at +5.  Using the past 10 polls smoothed this outlier out.
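Here is a small sketch of how the look-back window changes the average.  The poll numbers below are entirely made up (they are not the Florida polls), with one deliberate Rep +5 outlier near the top of the list:

```python
def average_last(polls, n):
    """polls: newest-first list of (dem_pct, rep_pct, moe).

    Returns the plain average of each field over the newest n polls.
    """
    recent = polls[:n]
    count = len(recent)
    return tuple(sum(p[i] for p in recent) / count for i in range(3))

# Invented polls, newest first; the second one is the outlier (Rep +5)
polls = [(49, 49, 3), (47, 52, 4), (50, 48, 3), (48, 49, 3), (49, 48, 4),
         (50, 48, 3), (49, 47, 3), (50, 49, 4), (48, 48, 3), (50, 47, 3)]

dem5, rep5, moe5 = average_last(polls, 5)
dem10, rep10, moe10 = average_last(polls, 10)
print(dem5 - rep5, dem10 - rep10)
```

With these invented numbers, the 5-poll average has the Republican slightly ahead while the 10-poll average has the Democrat slightly ahead, which is the same kind of flip described above.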

What’s remarkable to me is the accuracy of this model given how dumb it is.  Part of Nate’s secret sauce is his weighting of polls. You can see the weights when you hover over the little bar chart. For example, the weight given to the InsiderAdvantage polls was 0.57976 vs. 1.411115 for the PPP poll:

We don’t know exactly how these numbers are determined, but we know that variables such as time, polling house effect, sample size, etc. go into this weight. My model did none of this weighting, which is probably the first area I would start improving it. Nonetheless, the results are quite similar, so the question is: how much bang for the buck do we get from additional complexity?  There’s of course also the possibility that this simple model just happened to work out, and that under different conditions a more nuanced approach would be more accurate.  This is probably the more likely case.  But the takeaway is that you don’t need an overly complex model to come up with some pretty decent predictions on your own.
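Just to illustrate what a weight does (this is emphatically not Nate's formula, which we don't know), here is a weighted average next to a plain one.  The two weights are the ones quoted above; the poll margins are invented:

```python
def weighted_average(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Invented D-minus-R margins: an outlier poll and a more typical one
margins = [-5.0, 1.0]
weights = [0.57976, 1.411115]  # weights quoted above for the two pollsters

plain = sum(margins) / len(margins)
weighted = weighted_average(margins, weights)
print(plain, weighted)
```

Because the outlier gets the smaller weight, the weighted average ends up much closer to the typical poll than the plain average does.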

Another area where the model fudges is in taking the average: it takes the poll averages (e.g. 50% Obama and 48% Obama averages out to 49% Obama), but it also averages the margins of error.  Well, the more data you have and the larger the sample of voters that were surveyed, the smaller the margin of error will become.  I did not take this fact into consideration and simply averaged the margins of error, yielding undue uncertainty for each average.
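To put a number on that, here is a rough sketch using the textbook margin-of-error formula for a simple random sample at 95% confidence.  It assumes the polls are independent samples of the same population that can just be pooled, which ignores house effects and timing, so treat it as a lower bound rather than the right answer:

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / sample_size)

def sample_size_for(moe_points, p=0.5, z=1.96):
    """Invert the formula: the sample size implied by a reported margin."""
    return p * (1 - p) * (z / (moe_points / 100)) ** 2

def pooled_margin(moes):
    """Margin of error if the polls were pooled into one big sample."""
    return margin_of_error(sum(sample_size_for(m) for m in moes))

ten_polls = [3.0] * 10            # ten hypothetical polls, each +/- 3 points
averaged = sum(ten_polls) / len(ten_polls)
pooled = pooled_margin(ten_polls)
print(averaged, round(pooled, 2))
```

For ten polls at +/- 3 points each, simply averaging keeps the margin at 3, while pooling shrinks it by a factor of the square root of ten, to a bit under 1 point.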

My model also did not take other factors into consideration in making the final prediction: no economic factors or national polls – just the state polls.

In all, I finish with the following observations:

  • In 2012, state polls were excellent predictors of the election outcome when considered in the aggregate.  It is finally time for the media to stop writing articles about single polls without putting them into the context of the larger picture.  If a candidate is up by 2 points in a state, we’ll see polls that show the candidate up 4 and polls that show a tie.  Putting out an article talking about a tied race is simply misleading.
  • Nate talked about the possibility of a systemic bias in the polls.  There appeared to be none when considered in the aggregate.
  • A model doesn’t have to be complex to be of value and reveal underlying trends in a set of data.
  • Data trumps gut feelings and intuition in making election predictions.  Be wary of pundits who don’t talk about their conclusions based on what’s happening in the polls or other relevant data.
  • Python rocks!

 


Competing with Nate Silver in Under 200 Lines of Python Code – Election 2012 Result Predictions

November 6th, 2012 | 1 Comment »

UPDATE: Post-election analysis here

It’s November 6, and over 18 months of grueling and never-ending campaigning is finally coming to an end.  I’m admittedly a bit of a political news junkie and check memeorandum religiously to get a pulse of what’s being talked about.  Together with the Chrome plugin visualizing political bias, it’s a great tool.

However, the past few years have been especially partisan and the rhetoric in the blogosphere is rancid.  So it was truly a breath of fresh air when I discovered Nate Silver in the beginning of the summer.  No bullshit – just the facts.  What a concept!  So Nate has become the latest staple of my info diet.

He’s been catching a ton of flak in the past few months for his statistical, evidence-based model that has consistently favored Obama in spite of media hype to the contrary.  None of the arguments against him hold much weight unless they actually address the model itself.

One article in particular caught my attention: “Is Nate Silver’s value at risk?” by Sean Davis over at The Daily Caller.  His argument basically boils down to the question of whether state polls are accurate in predicting election outcomes, and whether Nate Silver’s model has relied too heavily on this data.  After re-creating Nate’s model (in Excel?!), Sean writes:

After running the simulation every day for several weeks, I noticed something odd: the winning probabilities it produced for Obama and Romney were nearly identical to those reported by FiveThirtyEight. Day after day, night after night. For example, based on the polls included in RealClearPolitics’ various state averages as of Tuesday night, the Sean Davis model suggested that Obama had a 73.0% chance of winning the Electoral College. In contrast, Silver’s FiveThirtyEight model as of Tuesday night forecast that Obama had a 77.4% chance of winning the Electoral College.

So what gives? If it’s possible to recreate Silver’s model using just Microsoft Excel, a cheap Monte Carlo plug-in, and poll results that are widely available, then what real predictive value does Silver’s model have?

The answer is: not all that much beyond what is already contained in state polls. Why? Because the FiveThirtyEight model is a complete slave to state polls. When state polls are accurate, FiveThirtyEight looks amazing. But when state polls are incorrect, FiveThirtyEight does quite poorly. That’s why my very simple model and Silver’s very fancy model produce remarkably similar results — they rely on the same data. Garbage in, garbage out.

So what happens if state polls are incorrect?

It’s a good question, although Sean’s answer isn’t particularly satisfactory: he basically says we probably don’t have enough data.

However, this piqued my interest… was it really so easy to emulate his model?  I wanted to find out more… the Monte Carlo plugin is $129: screw that.  A bit of Googling later, it turns out Monte Carlo simulations are pretty easy to do in Python.

So after creating a few arrays to hold the latest polls for each state (via both RealClearPolitics and 538), I ran the numbers. I’ll perhaps go into the code in a separate post, but for right now, let me just post my predictions along with Nate’s.

Let’s start with state-by-state probabilities.  I’ve listed Nate’s state probabilities, and then two versions of my predictions.  The model I’m using is really super simple.  I take the average of the past X number of polls and then run the Monte Carlo simulation on those percentages and margins of error.  That’s how I come up with the state probabilities.  The “Diff” column then lists the difference between my predictions and Nate’s predictions.
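Since I haven't posted the code, here is a hedged sketch of what the per-state step might look like.  It assumes a uniform draw within the margin of error for each candidate; the function name, the poll averages and the reference probability are all made up for illustration:

```python
import random

def state_probability(dem, rep, moe, runs=10000, seed=None):
    """Fraction of simulated draws in which the Democrat out-polls the Republican."""
    rng = random.Random(seed)
    dem_wins = sum(
        rng.uniform(dem - moe, dem + moe) > rng.uniform(rep - moe, rep + moe)
        for _ in range(runs)
    )
    return dem_wins / runs

# Invented averages: Dem 49.5%, Rep 48%, margin of error 3 points
mine = state_probability(49.5, 48.0, 3.0, seed=7)
reference = 0.80                 # stand-in for someone else's probability
diff = abs(mine - reference)     # what a "Diff" column would hold
print(mine, diff)
```

A state where the two averages are further apart than twice the margin of error comes out at 100% under this scheme, which is why only the competitive states are interesting.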

Everything is in the general ballpark, which for under 200 lines of Python code isn’t bad I think!  The average difference using the past 10 polls is 5%, while the average difference using the past 5 polls is 4% (looking only at those states where our probabilities differed).  Like I said, not bad, and in no case do we predict different winners altogether (with the exception of the 5-poll projection calling for Romney to win Florida).

So, who’s gonna win it all?  Using the above percentages and simulating an election 1,000,000 times, I get the following, using first the past 10 and then the past 5 results:

Using the 10 last polls is somewhat close to Nate, but overall, again, we’re in the same ballpark and the story being told is the same.  Obama is the heavy favorite to win it tonight with roughly 10:1 odds in his favor.

Now let’s look at the most likely paths to victory for each candidate.  For Obama, the following 5 paths occurred most often in the simulation:

M2 here stands for Maine’s 2nd Congressional District.  Maine and Nebraska apportion their Electoral Votes one for each Congressional District (Maine has 2, Nebraska 3) plus 2 for the overall vote winner.
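Counting "paths" comes down to recording which competitive contests a candidate won in each simulated election that they carried, and then tallying the combinations.  Below is a sketch of that idea, not the original code: the win probabilities and the safe electoral vote total are invented, and a single district like M2 is simply treated as its own one-vote contest.

```python
import random
from collections import Counter

def likely_paths(swing, safe_dem, needed=270, runs=10000, top=5, seed=None):
    """swing: list of (name, dem_win_probability, electoral_votes), unique names.

    Returns the most common combinations of swing contests in Dem wins.
    """
    rng = random.Random(seed)
    paths = Counter()
    for _ in range(runs):
        won = tuple(name for name, p, _ev in swing if rng.random() < p)
        ev_total = safe_dem + sum(ev for name, _p, ev in swing if name in won)
        if ev_total >= needed:
            paths[won] += 1
    return paths.most_common(top)

# Invented probabilities; M2 is one district worth a single electoral vote
swing = [("OH", 0.80, 18), ("FL", 0.50, 29), ("VA", 0.70, 13), ("M2", 0.90, 1)]
for combo, count in likely_paths(swing, safe_dem=250, seed=3):
    print(count, combo)
```

Filtering the swing list (say, dropping Ohio) before running it gives the "paths without Ohio" variant.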

You can see why Ohio is so pivotal for Romney.  Here are the most likely paths without Ohio:

Certainly puts his late push into PA into some perspective.

Well if you do happen upon this post and are actually interested in the Python code, let me know, and I might do a follow-up post looking at the code specifically.

Cheers, and enjoy the election!  I’d put my money on Obama!

UPDATE: Post-election analysis here


Avinash Kaushik at Strata 2012 Conference

October 16th, 2012 | No Comments »

Love this guy: if this presentation doesn’t get you amped about big data, nothing will:


Visions of the Future

August 7th, 2012 | No Comments »

By historical standards, our world today is already pretty strange.  Some of the future visions I’ve come across recently suggest a future even stranger – and perhaps a lot scarier.

First up, a video called “Sight” takes the current interface evolution to its next logical conclusion:

Sight from Sight Systems on Vimeo.

Next, an equally scary and fascinating vision of a future with ubiquitous, tiny computers covering the urban landscape.  In “How low (power) can you go?” Charlie Stross posits:

So for the cost of removing chewing gum, a city in 2030 will be able to give every square metre of its streets the processing power of a 2012 tablet computer (or a 2002 workstation, or a 1982 supercomputer).

Now. What can we do with a city that has 1.5 billion networked ambient-light-powered processors, or roughly 200 cpus per resident?

What indeed…


Learn RapidMiner Quickly

July 30th, 2012 | No Comments »

So for work I’ve been taking a bit of a deep-dive into the world of text analysis. Specifically, our department has a help desk for our sales force and the incoming questions are captured in our internal CRM.  The challenge is to see if we can apply text analysis to gain insights about what kind of questions are being asked.

Doing some research on various text analysis tools, I stumbled upon RapidMiner, a fantastic tool of which you can download the community edition for free.

RapidMiner

You can get going pretty quickly, especially if you check out the series of videos by Neil McGuigan, which are awesome and will have you creating predictive models within a day (even if you don’t really understand what’s going on :P ):

Text Analytics with RapidMiner by Neil McGuigan:

  1. Loading Text
  2. Processing Text
  3. Association Rule Learning
  4. Document Similarity and Clustering
  5. Automatic Document Categorization
  6. Applying the Model to New Documents

I actually much prefer these over the videos on Rapid-I: the background music is pretty annoying and the tone is too “official” and corporate somehow.

Check out these resources if you want to get started with text analysis within a day!

UPDATE 8/7/12: Also check out these videos on web-crawling:

  1. Web scraping with Google Spreadsheets and XPath
  2. Web Crawling with RapidMiner
  3. Web Scraping with RapidMiner and Xpath
  4. Web Scraping AJAX Pages

Programming Python in Eclipse and Uploading to DreamHost: Mind the Newline Characters!

July 29th, 2012 | No Comments »

tl;dr: change your Eclipse settings in Window > Preferences > General > Workspace to encode files in UTF-8 by default and to use Unix-style new text file line delimiters.  Otherwise your Python code won’t run in a web browser; see no. 4 on DreamHost’s python wiki page.

So once again I am getting back into Python, must be the 3rd or 4th time now – maybe this time it will stick?  Better not bet on it.

Anyway, I use DreamHost as my cheap shared host of choice, and am in the midst of simply setting everything up so the workflow can move into the background and I can focus on the code.  Initially contemplated a custom Python 3.2 install instead of the 2.6.6 DreamHost version, but since most educational material is still for Python 2.X I decided to just stick with what DreamHost provides.

After installing PyDev for Eclipse and setting up a project and proper connection, I just wanted to conduct a sanity check of uploading a .py file and getting some web output.  2 hours later I figure out what’s happening and now write this post to save you some time.

The “rules” for getting Python files to run are listed on the DreamHost wiki:

  1. end in “.py” (NOTE: “.cgi, .fcgi” works as well)
  2. have #!/usr/bin/python in the very first line of the file (NOTE: #!/usr/bin/python2.x or #!/usr/bin/env python2.x will work as well)
  3. be marked as executable: chmod 755
  4. use UNIX style newlines, not Windows [1]
  5. If you want to view printed output from your Python code, you must print "Content-type: text/html\n\n" as the first line of output.
  6. If you don’t want .py files to be executed by Apache add “RemoveHandler .py” command to your .htaccess file

Ok, so I follow everything, in my case didn’t have an .htaccess file so didn’t have to worry about No. 6, and no. 4, I’m sure that’s ok, no need to worry… yea, so 2 hours later I return to No. 4.

I was writing files, uploading them, and then staring at this screen:

internal server error

The code I was trying to run was as simple as this:

#!/usr/bin/python
print "Content-type: text/html\n\n"
print "Hello World"

And yet I was getting nowhere.  So I started putting together a ticket for DreamHost, but one of their steps suggested checking the error logs and discussion forums first.  After chasing down some suexec error I was seeing in the error logs (~/logs/[domain]/http/error.log), I found someone mentioning that a file had DOS end-of-line characters, which wouldn’t work.  Yea, back to No. 4.  Turns out the default settings in Eclipse (on Windows) are to encode files in CP-1252 (Windows text encoding) with the default (Windows) newline characters:

eclipse text encoding

You can change this by going to Window > Preferences > General > Workspace.  Mine now looks like this: I changed it to UTF-8 and Other > Unix new text file line delimiters.

eclipse text encoding fixed
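If you already have files with Windows line endings lying around, changing the Eclipse setting won't fix them retroactively.  Here is a small sketch for converting an existing file; the function names are mine.  As far as I understand it, the problem is that with a trailing carriage return on the first line the server goes looking for an interpreter called `python\r`, which doesn't exist.

```python
def to_unix_newlines(data):
    """Convert Windows (CRLF) and old Mac (CR) line endings in bytes to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def fix_file(path):
    """Rewrite a file in place with Unix newlines; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    fixed = to_unix_newlines(original)
    if fixed != original:
        with open(path, "wb") as f:
            f.write(fixed)
    return fixed != original

sample = b'#!/usr/bin/python\r\nprint "Hello World"\r\n'
print(repr(to_unix_newlines(sample)))
```

Working on bytes rather than text sidesteps the encoding question entirely, since only the line-ending bytes are touched.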

Then I tried this out some more times while forgetting to set the permissions of the file… d’oh!  So yes, don’t forget to chmod your files to 755, which you can also do via Eclipse. Upload the file, then right-click on the uploaded file and click “Properties”. There you’ll want to set the permissions like so:

eclipse permissions
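For completeness, the same permission change can be made from Python, provided the script runs on the server itself (e.g. over SSH) rather than on your Windows machine.  A minimal sketch, using a throwaway temp file so it can be run anywhere Unix-like; the helper name is mine:

```python
import os
import stat
import tempfile

def make_executable(path):
    """Set permissions to 755 (rwxr-xr-x), same as `chmod 755 path`."""
    os.chmod(path, 0o755)
    return stat.S_IMODE(os.stat(path).st_mode)

# Throwaway file standing in for an uploaded script
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(b"#!/usr/bin/python\n")

mode = make_executable(tmp.name)
print(oct(mode))
os.remove(tmp.name)
```

755 means the owner can read, write and execute, while everyone else can read and execute.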

Hope that saved you some time!


Filip Dujardin

January 17th, 2012 | 1 Comment »

Via the ever-excellent Jörg Colberg (ein Deutscher wie ich soweit ich weiss :) ) I just found out about Filip Dujardin, who has a number of excellent photomontages.  The main thrust of these montages is architectural, and although we know these creations to be fake, they reflect the absurdity and banality of much of modern architecture.

The photoshopping is quite excellent.  The only real bummer is his website, which is inexplicably all flash-based (baaad idea) and consequently distorts the pictures depending on your resolution.

Fictions by Filip Dujardin 01

Fictions by Filip Dujardin 02

Fictions by Filip Dujardin 03

Fictions by Filip Dujardin 04


Experiments in Photomontage

January 16th, 2012 | No Comments »

I’ve recently become quite interested in the intersection of photomontage – cutting and joining a number of different photographs together – and magical realism, which is a kind of intense meditation on reality to the point of hyperreality, possibly with the inclusion of elements of the fantastic.

Last June, newly equipped with my Canon 60D, I decided to do a little experiment in photoshopping a scene together in my home office at the time.

Hand in Couch

I just used what was immediately around me.  I took several shots of my arm on the couch and of just the couch itself, and some photoshopping later we get this image, which looks real yet which we know cannot be real.  I think this juxtaposition is kind of creepy and unsettling.

Just recently I’ve read a lot of strobist.com – that site is awesome.  Armed with my brother’s Canon Speedlite 430 EX as well as a light stand + umbrella I’ve been wreaking some studio quality lighting havoc recently.  Catching up on the work of Dave Hill I got to thinking about how to get multi-light pictures having only one light source (my flash) and possibly outside ambient light.  I thought why not independently light areas of the scene and then stitch them together in PS.  Basically light painting with a flash and shoot-through umbrella.

So I took roughly 60-70 pictures and ultimately ended up taking 32 of them and putting them together in PS.

Lee X 3

All in all this picture was 61 layers (of which 32 were photos) totaling 3GB!  Consequently found out about the .psb file extension through this project, which is Adobe’s large file format as .psd’s only go up to 2 gigs.

Anyway, most of the work was actually in the background.  I basically used the light parts of each component picture and used the Lighten blend mode to blend the pics on top of one another.  Getting it just right requires a fair amount of fine-tuning, but that was the basic process.  I used the analogy of additive lighting whereby I started with a black background and added light selectively to build up the picture.
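For anyone wondering what Lighten actually does: for each pixel it keeps whichever of the two layers is brighter (Photoshop does this per color channel).  Here is a toy sketch of stacking layers that way, using tiny grayscale "images" as nested lists; the function names and pixel values are made up:

```python
def lighten(base, layer):
    """Lighten blend: keep the brighter value of each pixel pair."""
    return [[max(below, above) for below, above in zip(base_row, layer_row)]
            for base_row, layer_row in zip(base, layer)]

def stack(layers, width, height):
    """Start from black and blend every layer on top with Lighten."""
    result = [[0] * width for _ in range(height)]
    for layer in layers:
        result = lighten(result, layer)
    return result

# Two 2x2 exposures, each lighting a different part of the scene
left_lit = [[200, 10], [180, 5]]
right_lit = [[15, 220], [8, 190]]
print(stack([left_lit, right_lit], width=2, height=2))
```

Each exposure contributes only where it is the brightest one, which is why separately lit areas can be combined without the dark parts of one shot muddying another.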

The end result is quite interesting. The room is very unrealistically lit and there isn’t an obvious light source.  So just the room in and of itself is “off” somehow but it shows much greater detail – hyperreal so to speak.  Then of course three versions of me cannot be, yet it looks real… though in a fake way.  Some of the shadows could be improved and are to me giveaways (e.g. the right arm of me sitting in the recliner, or the right hand of me on the ground), but you don’t immediately notice it.

To me a quite fascinating space, the edge between reality and imagination.  Though all photography is an illusion (an image on paper or screen ain’t the real thing), there’s an expectation that a photo depicts some kind of reality objectively.  Of course there’s a huge amount of editing (cropping, lighting, etc.) but I guess the expectation is that I could go out and see a depicted scene for myself.  Playing with this expectation through photomontage opens up endless possibilities to surprise and challenge the viewer.


Some new videos by my bro Ian

November 2nd, 2009 | No Comments »

Some seriously heady shit:

superfLOW – The Unthinkable from Ian Clemmer on Vimeo.

superfLOW – Galactic Flow from Ian Clemmer on Vimeo.

SUPER FLOW from Ian Clemmer on Vimeo.