For many different scenarios it is useful to understand people’s political leanings on a granular level. For me personally, this interest started because I am currently looking to buy a house in Chester County, but obviously there are plenty of other reasons why such a map might be useful (e.g. for political campaigns that want to make the most of their resources).
So in a toss-up state, Chester County is a toss-up county: 49.7% Romney and 49.2% Obama. For purposes of looking at real estate, however, this is still too coarse, so I wondered how I could get a more detailed breakdown.
Here’s what we’ll be doing in a nutshell: get the election data and tie it to the map shapefiles of Chester County voting precincts, then convert those shapefiles to KML files so we can view the map in Google Earth.
Getting the Data
The lowest level of political data I could get my hands on was the Chester County (CC) precinct results, which are conveniently located here: Chester County Election Results Archive. Inconveniently, the data for each precinct is contained in a separate PDF. To use the data in our map, we’ll need it all together in one place. So I went through all 200+ PDFs and transcribed the precinct data into Excel (yes, it sucked). Here’s the data in CSV format. Note that I only picked out the votes for Romney and Obama, not all the other information contained in the precinct reports.
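If you want a quick sanity check that the transcription adds up, a few lines of pandas will do it. The filename and column names below are assumptions, not the CSV’s actual headers, so adjust them accordingly:

import pandas as pd

# Quick sanity check on the transcribed data; filename and column names are assumed
results = pd.read_csv("chester_county_2012_precincts.csv")

obama = results["Obama"].sum()
romney = results["Romney"].sum()
two_party = obama + romney

# Two-party shares; these should land close to the countywide 49.2/49.7 split
print("Obama:  %.1f%%" % (100.0 * obama / two_party))
print("Romney: %.1f%%" % (100.0 * romney / two_party))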
Now that we have the data and the map shapefiles, we come to the hardest part of this exercise: marrying the data with the map.
Map Meet Data, Data Meet Map
Admittedly I am no expert on shapefiles; I have plenty more learning to do there. For a better explanation of what a shapefile is, consult your encyclopedia. What’s important for us is that two files interact: the actual shapefile (.shp extension), which contains the geometry of the polygons, and a data file (.dbf extension), which contains data connected to the shapes – in our case representing voting precincts.
From my Google searches, the following tasks are best accomplished in ArcGIS, the industry-standard mapping software by a company called esri. But it costs money, which I don’t have to spend on this right now. So we will use three other tools: GeoDa, Excel 2010, and Access 2010. Theoretically we don’t really need GeoDa, but it helps to see what’s going on.
You can also look at the data associated with the shapefile which is contained in the .dbf file. Just click on that little table icon (third from left) and you’ll see something like this:
There’s nothing there that’s really interesting for us right now, other than the names of the actual voting precincts. What we need is the voting data. In the source data file (the .csv file), we have votes for both Romney and Obama. To combine these into just one column (since the coloring will be based on one variable contained in one column), I just used the percentage of votes for Obama, with Romney representing the flip side. So a 34% vote for Obama would simply mean a 66% vote for Romney (never mind the ~1% for third-party candidates). This’ll make sense when you look at the Category Editor screenshot below and compare the breaks to the labels.
At this point we could just add a column and input the voting percentages for Obama for each precinct. This is what I initially did, but editing this in Excel is much faster, so we’ll do that. Here’s what you need to do (a scripted Python alternative follows the list):
Make copies of all the shapefiles and work on those
In Excel, open the .dbf file
Add a column called “Vote” with the percentage of votes for Obama (or Romney, doesn’t matter)
You cannot save back to .dbf format, so save as .xlsx
Open MS Access
Under External Data, import Excel and choose the .xlsx file you just created; be sure to mark the first row as column headers
Then export it (under External Data > Export > More) as a dBase File. Pick a name and click Ok
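If you’d rather script the whole join than round-trip through Excel and Access, here’s a minimal sketch using geopandas and pandas. This is not what I did above, and the shapefile field name ("PRECINCT") and the CSV column names are assumptions, so check the .dbf table in GeoDa for the real ones:

import geopandas as gpd
import pandas as pd

precincts = gpd.read_file("precincts_copy.shp")   # work on a copy!
votes = pd.read_csv("chester_county_2012_precincts.csv")

# One number per precinct: Obama's share of the two-party vote, in percent
votes["Vote"] = 100.0 * votes["Obama"] / (votes["Obama"] + votes["Romney"])

# Join on the precinct name and write the shapefile (and its companion .dbf) back out
merged = precincts.merge(votes[["Precinct", "Vote"]],
                         left_on="PRECINCT", right_on="Precinct", how="left")
merged.to_file("precincts_with_votes.shp")

Either way, the end result is the same: a new "Vote" column sitting in the .dbf next to the precinct names.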
Now when you reopen the shapefile in GeoDa, you’ll see the column you just added:
Color-coding the Map: Choroplethic!
At this point, you can already create a choropleth map in GeoDa. Right-click on the map and select Change Current Map Type > Create New Custom. Play around with the breaks and labels and such. Here’s what I came up with:
Already pretty cool! But we’d like to see some of the underlying geographic data so we can do some more in-depth analysis of what’s going on. For example, why is there such a big Obama supporting precinct surrounded by Republican precincts in the lower left corner?
So, here we are, November 7, and Barack Obama has been re-elected POTUS. This came as a surprise to no-one who’s been watching the polls and following Nate Silver. Well, maybe to Dean Chambers of Unskewedpolls.com. Poor guy.
Nate did extremely well, correctly predicting all states and the overall election. He had Florida at 50/50, and here it is, still undecided because it’s so close.
And how did my humble prediction model do? Equally well! (Read yesterday’s predictions here.) Here are Nate’s and my predictions for competitive states (any state short of 100% certainty):
Actual winners highlighted in blue or red. The sole holdout is Florida, which we had at 50/50.
Taking a look at my projected likely outcomes for an Obama win:
The first two paths to victory are his actual paths to victory. At the moment, CNN has Obama with 303 Electoral College votes, but if Obama wins Florida, he’ll end up with 332. The outcomes above reflect the fact that Obama is slightly more likely to get Florida than not, and in actual fact this is what it’s looking like right now.
Another way to look at the accuracy of the model is to compare the computed state averages based on state polls to the actual election outcomes:
What we’re looking at here is the competitive states, sorted by how much the actual spread between D and R (based on election results) varied from the projected spread. One way to read this is that the undecideds broke more one way than the other. In Arizona, for example, the polls had Obama at 43%, which is what he actually got. Romney, however, picked up 55% of the vote instead of the projected 50%. One interpretation is that the additional 5% Romney received were undecideds in the polls. The average spread difference in these competitive states was only 2 points – not bad! Looks like the state polls were pretty good indicators of how the election would pan out.
Overall, my simple model did extremely well, surprisingly so given how simple it is. The way it works is (a rough Python sketch follows the list):
Take the average D and R percentages and the average margin of error of the past X state polls (I’ve been using 10) for each state
Simulate the election outcome by randomly picking a vote percentage value within the margin of error for each state, determining the winner, and allocating the appropriate number of Electoral College votes for the winner
Do this 1,000,000 times and determine the percentage of simulations that each candidate won (or that ended in a tie); this is a Monte Carlo simulation to determine the probabilities of each outcome
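Here’s a rough sketch of those three steps. The poll numbers are purely illustrative, and the normal draw on the D minus R margin is a simplification of picking vote percentages within the margin of error, but the shape of the simulation is the same:

import random

# avg Obama %, avg Romney %, avg margin of error, Electoral College votes
# (illustrative numbers only; the real run covers all states)
states = {
    "OH": (50.0, 48.0, 3.5, 18),
    "FL": (49.5, 49.5, 3.0, 29),
    "VA": (50.5, 48.5, 3.8, 13),
}

def simulate_once():
    # One simulated election: draw a margin for each state and award its EVs
    dem_ev = rep_ev = 0
    for dem, rep, moe, ev in states.values():
        # Use the averaged MOE as the spread of the draw (a simplification)
        if random.gauss(dem - rep, moe) > 0:
            dem_ev += ev
        else:
            rep_ev += ev
    return dem_ev, rep_ev

trials = 100000   # the actual runs used 1,000,000
dem_wins = 0
for _ in range(trials):
    dem_ev, rep_ev = simulate_once()
    if dem_ev > rep_ev:   # with all states you'd check for 270+ (and count ties)
        dem_wins += 1

print("Obama win probability: %.1f%%" % (100.0 * dem_wins / trials))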
That’s basically it. How many polls to look back was pretty arbitrary in this case (10). Using a lower number gives you a possibly fresher set of data (the polls were taken more recently) but leaves you a bit more exposed to potential outliers. For example, if you look at my post from yesterday, you can see that using only the past 5 polls, Florida had a probability of 37% Dem to 63% Rep. Here are the past 5 polls:
We can see here that the InsiderAdvantage poll was an outlier, with Romney at +5. Using the past 10 polls smoothed this outlier out.
What’s remarkable to me is the accuracy of this model given how dumb it is. Part of Nate’s secret sauce is his weighting of polls. You can see the weights when you hover over the little bar chart. For example, the weight given to the InsiderAdvantage poll was 0.57976 vs. 1.411115 for the PPP poll:
We don’t know exactly how these numbers are determined, but we know that variables such as time, polling house effect, sample size, etc. go into this weight. My model did none of this weighting, which is probably the first area where I would start improving it. Nonetheless, the results are quite similar, so the question is: how much bang for the buck do we get from the additional complexity? There’s of course also the possibility that this simple model just happened to work out, and that under different conditions a more nuanced approach would be more accurate. This is probably the more likely case. But the takeaway is that you don’t need an overly complex model to come up with some pretty decent predictions on your own.
Another area where the model fudges is in taking the averages: it takes the poll averages (e.g. 50% Obama and 48% Obama average out to 49% Obama), but it also averages the margins of error. The more data you have and the larger the sample of voters surveyed, the smaller the margin of error becomes. I did not take this into consideration and simply averaged the margins of error, yielding undue uncertainty for each average.
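To put a rough number on that, using the usual worst-case formula for a proportion’s margin of error: one 1,000-person poll carries about ±3.1%, and the average of two such polls is really only uncertain by about ±2.2% (the same as one pooled 2,000-person poll), but naively averaging the two MOEs leaves you at ±3.1%:

import math

def moe(n, p=0.5, z=1.96):
    # Worst-case 95% margin of error for a proportion with sample size n
    return z * math.sqrt(p * (1 - p) / n)

print("One 1,000-person poll:      +/- %.1f%%" % (100 * moe(1000)))                    # ~3.1
print("Two polls pooled (n=2,000): +/- %.1f%%" % (100 * moe(2000)))                    # ~2.2
print("Naive average of two MOEs:  +/- %.1f%%" % (100 * (moe(1000) + moe(1000)) / 2))  # ~3.1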
My model also did not take other factors into consideration in making the final prediction: no economic factors or national polls – just the state polls.
In all, I finish with the following observations:
In 2012, state polls were excellent predictors of the election outcome when considered in the aggregate. It is finally time for the media to stop writing articles about single polls without putting them into the context of the larger picture. If a candidate is up by 2 points in a state, we’ll see individual polls showing that candidate up 4 and others showing a tie. Putting out an article talking about a tied race is simply misleading.
Nate talked about the possibility of a systemic bias in the polls. There appeared to be none when considered in the aggregate.
A model doesn’t have to be complex to be of value and reveal underlying trends in a set of data.
Data trumps gut feelings and intuition in making election predictions. Be wary of pundits who don’t base their conclusions on what’s happening in the polls or other relevant data.
It’s November 6, and over 18 months of grueling and never-ending campaigning is finally coming to an end. I’m admittedly a bit of a political news junkie and check memeorandum religiously to get a pulse of what’s being talked about. Together with the Chrome plugin visualizing political bias, it’s a great tool.
However, the past few years have been especially partisan and the rhetoric in the blogosphere is rancid. So it was truly a breath of fresh air when I discovered Nate Silver at the beginning of the summer. No bullshit – just the facts. What a concept! So Nate has become the latest staple of my info diet.
He’s been catching a ton of flak in the past few months for his statistical, evidence-based model that has consistently favored Obama in spite of media hype to the contrary. None of the arguments against him hold much weight unless they actually address the model itself.
One article in particular caught my attention: “Is Nate Silver’s value at risk?” by Sean Davis over at The Daily Caller. His argument basically boils down to the question of whether state polls are accurate in predicting election outcomes, and whether Nate Silver’s model has relied too heavily on this data. After re-creating Nate’s model (in Excel?!), Sean writes:
After running the simulation every day for several weeks, I noticed something odd: the winning probabilities it produced for Obama and Romney were nearly identical to those reported by FiveThirtyEight. Day after day, night after night. For example, based on the polls included in RealClearPolitics’ various state averages as of Tuesday night, the Sean Davis model suggested that Obama had a 73.0% chance of winning the Electoral College. In contrast, Silver’s FiveThirtyEight model as of Tuesday night forecast that Obama had a 77.4% chance of winning the Electoral College.
So what gives? If it’s possible to recreate Silver’s model using just Microsoft Excel, a cheap Monte Carlo plug-in, and poll results that are widely available, then what real predictive value does Silver’s model have?
The answer is: not all that much beyond what is already contained in state polls. Why? Because the FiveThirtyEight model is a complete slave to state polls. When state polls are accurate, FiveThirtyEight looks amazing. But when state polls are incorrect, FiveThirtyEight does quite poorly. That’s why my very simple model and Silver’s very fancy model produce remarkably similar results — they rely on the same data. Garbage in, garbage out.
So what happens if state polls are incorrect?
It’s a good question, although Sean’s answer isn’t particularly satisfactory: he basically says we probably don’t have enough data.
However, this piqued my interest… was it really so easy to emulate his model? I wanted to find out more… the Monte Carlo plugin is $129: screw that. A bit of Googling later, it turns out Monte Carlo simulations are pretty easy to do in Python.
So after creating a few arrays to hold the latest polls for each state (via both RealClearPolitics and 538), I ran the numbers. I’ll perhaps go into the code in a separate post, but for right now, let me just post my predictions along with Nate’s.
Let’s start with state-by-state probabilities. I’ve listed Nate’s state probabilities, and then two versions of my predictions. The model I’m using is really super simple: I take the average of the past X number of polls and then run the Monte Carlo simulation on those percentages and margins of error. That’s how I come up with the state probabilities. The “Diff” column then lists the difference between my predictions and Nate’s.
Everything is in the general ballpark, which for under 200 lines of Python code isn’t bad, I think! The average difference using the past 10 polls is 5%, while the average difference using the past 5 polls is 4% (looking only at those states where our probabilities differed). Like I said, not bad, and in no case do we predict different winners altogether (with the exception of the 5-poll projection calling for Romney to win Florida).
So, who’s gonna win it all? Using the above percentages and simulating an election 1,000,000 times, I get the following, using first the past 10 and then the past 5 results:
Using the last 10 polls comes somewhat closer to Nate, but overall, again, we’re in the same ballpark and the story being told is the same. Obama is the heavy favorite to win it tonight with roughly 10:1 odds in his favor.
Now let’s look at the most likely paths to victory for each candidate. For Obama, the following 5 paths occurred most often in the simulation:
M2 here stands for Maine’s 2nd Congressional District. Maine and Nebraska apportion their Electoral Votes one for each Congressional District (Maine has 2, Nebraska 3) plus 2 for the overall vote winner.
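For what it’s worth, tallying these paths is straightforward once you have the per-state simulation: for every simulated election, record which competitive states the candidate carried and count the combinations. Here’s a sketch with made-up margins (not the actual inputs); a real run would also only count simulations the candidate actually won:

import random
from collections import Counter

competitive = {
    "OH": (2.0, 3.5),   # state: (avg D-R margin in points, avg margin of error)
    "VA": (1.5, 3.8),
    "FL": (0.0, 3.0),
    "CO": (1.8, 3.4),
    "M2": (4.0, 4.5),   # Maine's 2nd Congressional District
}

trials = 100000
paths = Counter()
for _ in range(trials):
    won = frozenset(s for s, (margin, moe) in competitive.items()
                    if random.gauss(margin, moe) > 0)
    paths[won] += 1

# The five combinations that showed up most often
for combo, count in paths.most_common(5):
    print("%s: %.1f%%" % (sorted(combo), 100.0 * count / trials))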
You can see why Ohio is so pivotal for Romney. Here are the most likely paths without Ohio:
Certainly puts his late push into PA into some perspective.
Well if you do happen upon this post and are actually interested in the Python code, let me know, and I might do a follow-up post looking at the code specifically.
Cheers, and enjoy the election! I’d put my money on Obama!
Next, an equally scary and fascinating vision of a future with ubiquitous, tiny computers covering the urban landscape. In “How low (power) can you go?” Charlie Stross posits:
So for the cost of removing chewing gum, a city in 2030 will be able to give every square metre of its streets the processing power of a 2012 tablet computer (or a 2002 workstation, or a 1982 supercomputer).
Now. What can we do with a city that has 1.5 billion networked ambient-light-powered processors, or roughly 200 cpus per resident?
So for work I’ve been taking a bit of a deep-dive into the world of text analysis. Specifically, our department has a help desk for our sales force and the incoming questions are captured in our internal CRM. The challenge is to see if we can apply text analysis to gain insights about what kind of questions are being asked.
Doing some research on various text analysis tools, I stumbled upon RapidMiner, a fantastic tool whose community edition you can download for free.
You can get going pretty quickly, especially if you check out the series of videos by Neil McGuigan, which are awesome and will have you creating predictive models within a day (even if you don’t really understand what’s going on):
tl;dr: change your Eclipse settings in Window > Preferences > General > Workspace to encode files in UTF-8 by default and to use Unix-style new text file line delimiters. Otherwise your Python code won’t run in a web browser; see no. 4 on DreamHost’s python wiki page.
So once again I am getting back into Python, must be the 3rd or 4th time now – maybe this time it will stick? Better not bet on it.
Anyway, I use DreamHost as my cheap shared host of choice, and I am in the midst of setting everything up so the workflow can move into the background and I can focus on the code. I initially contemplated a custom Python 3.2 install instead of the 2.6.6 DreamHost version, but since most educational material is still for Python 2.X I decided to just stick with what DreamHost provides.
After installing PyDev for Eclipse and setting up a project and a proper connection, I just wanted to do a sanity check: upload a .py file and get some web output. Two hours later I figured out what was happening, and now I’m writing this post to save you some time.
The “rules” for getting Python files to run are listed on the DreamHost wiki:
end in “.py” (NOTE: “.cgi, .fcgi” works as well)
have #!/usr/bin/python in the very first line of the file (NOTE: #!/usr/bin/python2.x or #!/usr/bin/env python2.x will work as well)
If you want to view printed output from your Python code, you must print "Content-type: text/html\n\n" as the first line of output.
If you don’t want .py files to be executed by Apache add “RemoveHandler .py” command to your .htaccess file
Ok, so I follow everything. In my case I didn’t have an .htaccess file, so I didn’t have to worry about no. 6, and as for no. 4, I’m sure that’s ok, no need to worry… yeah, so two hours later I return to no. 4.
I was writing files, uploading them, and then staring at this screen:
The code I was trying to run was as simple as this:
print "Content-type: text/html\n\n"
print "Hello World"
And yet I was getting nowhere. So I started putting together a ticket for DreamHost, but one of their steps suggested checking out the error logs and discussion forums. After chasing down a suexec error I was seeing in the error logs (~/logs/[domain]/http/error.log), I found someone mentioning that a file with DOS end-of-line characters wouldn’t work. Yeah, back to no. 4. Turns out the default settings in Eclipse (on Windows) are to encode files in CP-1252 (Windows text encoding) with Windows newline characters:
You can change this by going to Window > Preferences > General > Workspace. Mine now looks like this: I changed it to UTF-8 and Other > Unix new text file line delimiters.
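If you want to double-check a file before uploading, a couple of lines of Python will tell you whether DOS line endings snuck in ("hello.py" is just a hypothetical filename):

# Quick local check for DOS (CRLF) line endings before uploading
with open("hello.py", "rb") as f:
    data = f.read()

crlf = data.count(b"\r\n")
print("CRLF (DOS) line endings: %d" % crlf)
print("LF-only (Unix) line endings: %d" % (data.count(b"\n") - crlf))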
Then I tried this out some more times while forgetting to set the permissions of the file… d’oh! So yes, don’t forget to chmod your files to 755, which you can also do via Eclipse. Upload the file, then right-click on the uploaded file and click “Properties”. There you’ll want to set the permissions like so:
Via the ever-excellent Jörg Colberg (a German like me, as far as I know) I just found out about Filip Dujardin, who has made a number of excellent photomontages. The main thrust of these montages is architectural, and although we know these creations to be fake, they reflect the absurdity and banality of much of modern architecture.
The photoshopping is quite excellent. The only real bummer is his website, which is inexplicably all flash-based (baaad idea) and consequently distorts the pictures depending on your resolution.
I’ve recently become quite interested in the intersection of photomontage – cutting and joining a number of different photographs together – and magical realism, which is a kind of intense meditation on reality to the point of hyperreality, possibly with the inclusion of elements of the fantastic.
Last June, newly equipped with my Canon 60D, I decided to do a little experiment in photoshopping a scene together in my home office at the time.
Hand in Couch
I just used what was immediately around me. I took several shots of my arm on the couch and of just the couch itself, and some photoshopping later we get this image, which looks real but which we know cannot be real. I think this juxtaposition is kind of creepy and unsettling.
Just recently I’ve been reading a lot of strobist.com – that site is awesome. Armed with my brother’s Canon Speedlite 430 EX as well as a light stand + umbrella, I’ve been wreaking some studio-quality lighting havoc recently. Catching up on the work of Dave Hill, I got to thinking about how to get multi-light pictures with only one light source (my flash) and possibly outside ambient light. I thought: why not independently light areas of the scene and then stitch them together in PS? Basically light painting with a flash and a shoot-through umbrella.
So I took roughly 60-70 pictures and ultimately ended up taking 32 of them and putting them together in PS.
Lee X 3
All in all this picture was 61 layers (of which 32 were photos) totaling 3GB! Through this project I consequently found out about the .psb file extension, which is Adobe’s large-document format, as .psd files only go up to 2 gigs.
Anyway, most of the work was actually in the background. I basically used the light parts of each component picture and used the Lighten blend mode to blend the pics on top of one another. Getting it just right requires a fair amount of fine-tuning, but that was the basic process. I used the analogy of additive lighting, whereby I started with a black background and added light selectively to build up the picture.
The end result is quite interesting. The room is very unrealistically lit and there isn’t an obvious light source. So just the room in and of itself is “off” somehow but it shows much greater detail – hyperreal so to speak. Then of course three versions of me cannot be, yet it looks real… though in a fake way. Some of the shadows could be improved and are to me giveaways (e.g. the right arm of me sitting in the recliner, or the right hand of me on the ground), but you don’t immediately notice it.
To me this is a quite fascinating space, the edge between reality and imagination. Though all photography is an illusion (an image on paper or screen ain’t the real thing), there’s an expectation that a photo depicts some kind of reality objectively. Of course there’s a huge amount of editing (cropping, lighting, etc.), but I guess the expectation is that I could go out and see a depicted scene for myself. Playing with this expectation through photomontage opens up endless possibilities to surprise and challenge the viewer.