Category: Coding

  • Toronto Elections data with Neo4j and Python part 1 of 3

    As promised, I am pushing the envelope on the 2006 election contributions dataset. This time I am going to do some analysis using Neo4j, but since the data needs to be loaded using the right syntax, I have a little preparation to do first. Currently, my data sits in a CSV file and looks like the following:

    1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;
    2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;

    The first column (“1” and “2” in the rows above) is a unique ID I created for each donation. It will be useful for identifying unique contributions, as we’ll see later. Here is my sketch of what I need the Neo4j graph to look like:
    [Image: neo4j_elections_sketch]
    So I basically need to create nodes and relationships for each donation and put them into a text file. That text file can then be copied and pasted, or otherwise imported, into Neo4j. There might be a better way to import, but there are some tricky conditional statements you have to write around contributors, because someone can contribute to a candidate multiple times and/or contribute to multiple candidates. As a result, a straight CREATE statement on each line will result in duplicate entries. (EDIT: I still ended up having duplicate nodes, which I had to delete, but you get the idea.)
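    To make the duplicate problem concrete, here is a minimal sketch in plain Python (no Neo4j involved) that groups donations by contributor before anything gets created. It assumes that the same name plus the same postal code means the same person, which is my working rule and not a guarantee. The third sample row is hypothetical: I made it up to show a repeat donor.

    ```python
    import csv
    import io

    # The two sample rows from above, plus one HYPOTHETICAL repeat donation (ID 3)
    # that is not in the real data; it only illustrates a repeat contributor.
    SAMPLE = """\
    1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;
    2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;
    3;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);50.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;
    """.replace("    ", "")


    def group_by_contributor(lines):
        """Map (name, postal code) -> list of (donation ID, candidate, amount)."""
        contributors = {}
        for row in csv.reader(lines, delimiter=';'):
            key = (row[1], row[3])  # assumption: same name + same postal code = same person
            contributors.setdefault(key, []).append((row[0], row[11], float(row[6])))
        return contributors


    contributors = group_by_contributor(io.StringIO(SAMPLE))
    for (name, pcode), donations in sorted(contributors.items()):
        print(name, pcode, donations)
    ```

    Each key would become one contributor node, and each entry in its list one contribution, so a repeat donor never gets a second node.
    
    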
    EDIT: I tried out py2neo and liked it! The code below is updated to use the latest version of this Python package.
    import csv
    import py2neo
    from py2neo import neo4j, node, rel #this is a handy add-in for working with Neo4j
    source = r'C:\Users\jlalonde\Documents\personal\elections\tblMayor.csv' #raw string so the backslashes are taken literally; adjust the path to suit what you need
    graph_db = neo4j.GraphDatabaseService()
    graph_db.clear()
    d = [] #complete dataset, that can then be sorted
    p = [] #this will be a unique list of postal codes
    c = [] #this will be a unique list of candidates
    cont = [] #this will help with checking duplicate contributors
    m=1
    with open(source, 'rb') as csvfile:
        s = csv.reader(csvfile, delimiter=';')
        #this section reads the CSV and keeps only the rows for one candidate
        for row in s:
            if row[11] == 'Gold, Mitch':
                d.append([row[0], row[1], row[3], row[5], row[6], row[8], row[10], row[11]]) # this only collects the data you need for this demo
                #from the original data....
                # row[0] is the unique ID
                #row[1] is the contributor name
                #row[2] is the contributor address
                #row[3] is the contributor postal code
                #row[4] is the contributor postal code FSA
                #row[5] is the contributor location or neighborhood
                #row[6] is the amount
                #row[7] is the contribution code
                #row[8] is the type (usually cash)
                #row[9] is some kind of contributor code
                #row[10] is the contributor type (individual vs. corporation)
                #row[11] is the candidate name
            #the new rows in d (for reference)
            #row[0] is the unique ID
            #row[1] is the contributor name
            #row[2] is the contributor postal code
            #row[3] is the contributor location or neighbourhood
            #row[4] is the amount
            #row[5] is the type
            #row[6] is the contributor type
            #row[7] is the candidate name
    from operator import itemgetter
    d.sort(key = itemgetter(2, 1)) ##this sorts by postal code and by name
    for row in d:#in this instance you want to create a unique list of nodes for candidates and postal codes
        #you'll treat people differently, later.
        p.append(row[2])
        c_nospace = str(row[7]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','')##yeah yeah yeah I probably could have used REGEX here
        c.append(c_nospace)
        contribution_create, = graph_db.create(node(contribution_id = 'ID' + str(row[0]), amount=row[4]))
        contribution_create.add_labels("contribution")
    p2 = list(set(p)) #create a list of unique values of postal codes
    c2 = list(set(c)) #create a list of unique candidates
    for row in p2:
        postcode, = graph_db.create(node(p_name = str(row)))
        postcode.add_labels("PostalCode")
    for row in c2:
        #Adam Sit was a candidate and also someone named Adam Sit made a contribution.
        #In the earlier text-file version I added a 'C_' prefix to avoid a clash; here candidates get their own candidate_name property instead.
        candidate, = graph_db.create(node(candidate_name = str(row)))
        candidate.add_labels("Candidate")
    #now you can go through each line of the dataset, creating nodes if they are unique

    Now that I have created the array, I can finish off by creating the relationships and the remaining nodes.


    #the reason why you have the next part is that someone could be donating to more than one candidate or to the same candidate twice.
    contributor1 = ''
    pcode1 = ''
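    for row in d:
        pcode2 = row[2] #set the postal code first, so the check below compares the current row against the previous one
        contributor2 = str(row[1]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','')
        if contributor2 in cont and pcode1 != pcode2:
            contributor2 = contributor2 + str(row[0]) #this ensures there is no duplicate contributor names who are not the same person. Trust me.
        cont.append(contributor2)
        candidate = str(row[7]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','') #same clean-up as the candidate nodes, so the MATCH below finds them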
        if contributor1 == contributor2 and pcode1 == pcode2:
            ###if they are the same then you do not have to create a new node, just a new contribution
            ###if they are NOT the same then a new contributor node gets created
            ## you do this because you could have two people with the SAME name making a contribution. You figure this out by throwing the postal code into the mix
            string1 = 'MATCH (a {contributor_name: "' + contributor1 + '"}), (b {contribution_id: "ID' + str(row[0]) +'"})'
            string1 = string1 + ' CREATE UNIQUE a-[:CONTRIBUTED]->b'
            query1 = neo4j.CypherQuery(graph_db, string1)
            go1 = query1.execute()
            string2 = 'MATCH (c {contribution_id: "ID' + str(row[0]) +'"}), (d {candidate_name: "' + candidate + '"})'
            string2 = string2 + ' CREATE UNIQUE c-[:RECEIVED]->d'
            query2 = neo4j.CypherQuery(graph_db, string2)
            go2 = query2.execute()
        else: #here means the contributor is new. (1) Set up the contributor. (2) Set up their relationship with their postal code and their donation
            contributor, = graph_db.create(node(contributor_name = contributor2, type = row[6]))
            contributor.add_labels("contributor")
            string1 = 'MATCH (a {contributor_name: "' + contributor2 + '"}), (b {contribution_id: "ID' + str(row[0]) +'"})'
            string1 = string1 + ' CREATE UNIQUE a-[:CONTRIBUTED]->b'
            query1 = neo4j.CypherQuery(graph_db, string1)
            go1 = query1.execute()
            string2 = 'MATCH (c {contribution_id: "ID' + str(row[0]) +'"}), (d {candidate_name: "' + candidate + '"})'
            string2 = string2 + ' CREATE UNIQUE c-[:RECEIVED]->d'
            query2 = neo4j.CypherQuery(graph_db, string2)
            go2 = query2.execute()
            string3 = 'MATCH (e {contributor_name: "' + contributor2 + '"}), (f {p_name: "' + pcode2 + '"})'
            string3 = string3 + ' CREATE UNIQUE e-[:LIVES]->f'
            query3 = neo4j.CypherQuery(graph_db, string3)
            go3 = query3.execute()
        contributor1 = contributor2
        pcode1 = pcode2

    EDIT: Post the nodes first (obviously) and make sure the nodes and relationships are in the same box when entering them. Don’t get caught like I did!
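
    About those long chains of .replace() calls in the code above: as I admitted in the comment, a regular expression would be tidier. Here is a small sketch of what that could look like. The function name clean_name is my own invention for this example, and it is meant to give the same result as the chain: commas become underscores, while spaces, hyphens, periods, ampersands and parentheses are dropped.

    ```python
    import re


    def clean_name(name):
        """Turn 'Gold, Mitch' into 'Gold_Mitch'.

        Commas become underscores; spaces, hyphens, periods,
        ampersands and parentheses are removed.
        """
        return re.sub(r"[ \-.&()]", "", name.replace(",", "_"))


    print(clean_name("Gold, Mitch"))
    print(clean_name("LeDrew, Stephen"))
    ```

    Using one helper for both contributors and candidates also means the two clean-ups can never drift apart.
    
    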

    Here is a screen capture of Mitch Gold’s network.
    [Image: neo4j_elections_output]

  • Sankey Diagram using D3.js Part 2 of 2

    The chart below shows the flows of money to Toronto mayoral candidates in 2006. What follows is a quick explanation and a few observations. Then I follow up with a few short tips on how I got the visualization up and running.

    2006 Toronto Election Contributions
    By Region, Dollar Amounts and Candidate

    [iframe width="600" height="520" src="https://zenbot.ca/elections.html"]
    Source: City of Toronto

    Note that I started coding anything ‘Outside Toronto’ more specifically, but only got part-way (you can see Kingston and Ottawa as some locations). Basically, ‘Outside Toronto’ extends to Mississauga, Oakville and the Golden Horseshoe. It was possible to get more specific, but I didn’t for this visualization. ‘Central Toronto’ seems to be not downtown, but includes Yonge/Eglinton, etc.

    You can see that, proportionally, Stephen LeDrew received a relatively large amount of corporate donations (the orange links), while David Miller received none. David Miller also received money not only from downtown but from everywhere. And the vast majority of the money comes from individuals (blue) rather than corporations (orange).

    If I were going to push the analysis further, I could get the number of donors per candidate. I would also have loved to get 2009, as it is more recent, but, as I mentioned in part 1, that wasn’t available through the City of Toronto’s website. I am sure comparing 2006 to 2009 would have been very interesting, even if the candidates are completely different.
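
    For what it’s worth, counting donors per candidate would not take much. Below is a rough sketch in Python, assuming the semicolon-delimited layout from the Neo4j post (column 1 is the contributor name, column 3 the postal code, column 11 the candidate) and assuming that name plus postal code identifies a donor. The third row is hypothetical, made up to show that a repeat donor only counts once.

    ```python
    import csv
    import io
    from collections import defaultdict

    # Two real sample rows plus one HYPOTHETICAL repeat donation (ID 3).
    SAMPLE = "\n".join([
        "1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;",
        "2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;",
        "3;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);50.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;",
    ])


    def donors_per_candidate(lines):
        """Count distinct (name, postal code) pairs for each candidate."""
        donors = defaultdict(set)
        for row in csv.reader(lines, delimiter=';'):
            donors[row[11]].add((row[1], row[3]))  # a set ignores repeat donations
        return {candidate: len(people) for candidate, people in donors.items()}


    counts = donors_per_candidate(io.StringIO(SAMPLE))
    print(counts)
    ```
    
    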

    To get the visualization working, you need not only the latest D3.js library but also the sankey.js plugin, both of which should be included in your header:

    <script type="text/javascript" src="js/d3.v3.min.js" charset="utf-8"></script>
    <script type="text/javascript" src="js/sankey.js" charset="utf-8"></script>

    Next, I added some in-line styling:


    .link {
        fill: none;
        stroke-opacity: 0.4;
    }
    .link:hover {
        stroke-opacity: 0.6;
    }
    svg {
        font: 12px sans-serif;
    }

    In the main body, you need something to attach the svg chart to. In this case I picked the following:

    <h1 id="chart"></h1>

    And finally the main bulk of the code. If you run into problems, please feel free to comment, below.

    // A big thank you to Mike Bostock. Most of this code is originally his,
    // modified for the purposes of this demonstration.
    var margin = {top: 10, right: 1, bottom: 6, left: 1},
        width = 600 - margin.left - margin.right,
        height = 500 - margin.top - margin.bottom;
    var formatNumber = d3.format(",.0f"),
        format = function(d) { return "$" + formatNumber(d); },
        color = d3.scale.category20();
    var svg = d3.select("#chart").append("svg")
        .attr("width", width + margin.left + margin.right)
        .attr("height", height + margin.top + margin.bottom)
        .append("g")
        .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
    var sankey = d3.sankey()
        .nodeWidth(15)
        .nodePadding(10)
        .size([width, height]);
    // this colarray is to avoid going into the JSON document to change the colors of the links
    var colarray = {
        'Individual': '17,203,235',
        'Corporation': '252,189,53'
    };
    var path = sankey.link();
    // here is where the sankey should kick in....
    d3.json("js/electionJSON.json", function(election) {
        sankey
            .nodes(election.nodes)
            .links(election.links)
            .layout(32);
        var link = svg.append("g").selectAll(".link")
            .data(election.links)
            .enter().append("path")
            .attr("class", "link")
            .attr("d", path)
            .style("stroke-width", function(d) { return Math.max(1, d.dy); })
            // .style("stroke-width", "100")
            .sort(function(a, b) { return b.dy - a.dy; })
            .style("stroke", function(d) { return "rgb(" + colarray[d.contribution_type] + ")"; });
        link.append("title")
            .text(function(d) { return d.source.name + " → " + d.target.name + "\n" + format(d.value); });
        var node = svg.append("g").selectAll(".node")
            .data(election.nodes)
            .enter().append("g")
            .attr("class", "node")
            .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; })
            .call(d3.behavior.drag()
                .origin(function(d) { return d; })
                .on("dragstart", function() { this.parentNode.appendChild(this); })
                .on("drag", dragmove));
        node.append("rect")
            .attr("height", function(d) { return d.dy; })
            .attr("width", sankey.nodeWidth())
            .style("fill", function(d) { return d.color = color(d.name.replace(/ .*/, "")); })
            .style("stroke", function(d) { return d3.rgb(d.color).darker(2); })
            .append("title")
            .text(function(d) { return d.name + "\n" + format(d.value); });
        node.append("text")
            .attr("x", -6)
            .attr("y", function(d) { return d.dy / 2; })
            .attr("dy", ".35em")
            .attr("text-anchor", "end")
            .attr("transform", null)
            .text(function(d) { return d.name; })
            .filter(function(d) { return d.x < width / 2; })
            .attr("x", 6 + sankey.nodeWidth())
            .attr("text-anchor", "start");
        // keep a dragged node inside the chart, then redraw the links
        function dragmove(d) {
            d3.select(this).attr("transform", "translate(" + d.x + "," + (d.y = Math.max(0, Math.min(height - d.dy, d3.event.y))) + ")");
            sankey.relayout();
            link.attr("d", path);
        }
    });

  • Sankey Diagram using D3.js Part 1 of 2

    Among other things, I’ve been itching to master some D3.js tricks, mainly because the library lets you do some pretty gorgeous stuff, and there’s a wide variety of highly customizable visualizations. Recently, I finally had a few minutes to try something out. Since my work entails working with Statistics Canada data, or anything to do with startups in Ontario, I figured I would go for something that has nothing to do directly with that world.

    This led me to track down some election donation data from the City of Toronto’s open data repository: the donor list from the 2006 mayoral election. The title said it included 2009 as well, which sucked me in because that’s what I really wanted to use. I was disappointed when I found out it was only 2006, but figured it was OK because either way I was just playing around.

    The first part of this two-part series describes how to lay out the data to get it ready for a Sankey diagram. The second part talks about how I actually got the visualization going in D3.js. The data as presented shows each donor, their postal code, whether they are a corporation or not and (of course) the candidate who received the donation. Sankey diagrams don’t need that level of detail, and I just wanted to show the movement of money from different parts of the GTA (and beyond) and how that money flowed to each candidate. So the first thing you do is summarize by FSA (the forward sortation area, which is the first three characters of the postal code), while keeping the dollar amount, candidate name and type of donor (corporation vs. individual). I don’t really care how you do it: run a pivot table or something, just get that dollar amount by FSA.
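
    If a pivot table isn’t your thing, the same summary can be sketched in a few lines of Python. This assumes the semicolon-delimited layout shown in the Neo4j post on this page (column 4 is the FSA, column 6 the amount, column 10 the donor type, column 11 the candidate). The third row is hypothetical: I added it only so there is something to add up.

    ```python
    import csv
    import io
    from collections import defaultdict

    # Two real sample rows plus one HYPOTHETICAL donation (ID 3) from the same FSA.
    SAMPLE = "\n".join([
        "1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;",
        "2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;",
        "3;Hypothetical, Donor; ;H3R2A1;H3R;Mont-Royal (Quebec);300.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;",
    ])


    def summarize_by_fsa(lines):
        """Total the dollar amounts for each (FSA, donor type, candidate)."""
        totals = defaultdict(float)
        for row in csv.reader(lines, delimiter=';'):
            fsa, amount, donor_type, candidate = row[4], float(row[6]), row[10], row[11]
            totals[(fsa, donor_type, candidate)] += amount
        return dict(totals)


    totals = summarize_by_fsa(io.StringIO(SAMPLE))
    for key in sorted(totals):
        print(key, totals[key])
    ```
    
    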

    Next, get the area names by FSA region from Wikipedia so that you can distinguish different areas in a readable manner. Now you want to present the data in well-formed JSON like this:
    {"nodes":[
    {"name":"Brockville"},
    {"name":"Central Toronto"},
    .....
    ],
    "links":[
    {"source":0, "type": "Individual", "target":26, "value":500},
    {"source":5, "type": "Individual", "target":16, "value":200},
    ....
    ]}

    A quick note here: sankey.js (the plugin to include with D3.js) is kind of picky, and the “value” key (as above) has to be named exactly that.
    Don’t get caught like I did by using “amount” instead of “value”.
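
    A quick way to catch that mistake before the chart silently misbehaves is to scan the links yourself. This is only a sketch: the function name is mine, and the sample data is made up (including the deliberate “amount” slip in the second link).

    ```python
    def links_missing_value(data):
        """Return the positions of links that lack a numeric "value" key."""
        bad = []
        for i, link in enumerate(data["links"]):
            if not isinstance(link.get("value"), (int, float)):
                bad.append(i)
        return bad


    # HYPOTHETICAL sample: the second link uses "amount" by mistake.
    sample = {
        "nodes": [{"name": "Brockville"}, {"name": "Some Candidate"}],
        "links": [
            {"source": 0, "type": "Individual", "target": 1, "value": 500},
            {"source": 0, "type": "Individual", "target": 1, "amount": 200},
        ],
    }
    print(links_missing_value(sample))
    ```
    
    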

    Finally, every single node (both region and mayoral candidate) gets included in the list of “nodes”. The “source” and “target” of each link are then positions in that list, counted in order and starting at zero. In the example above, Brockville = 0. Now that the JSON is explained, just create the JSON file and you are ready to go to part 2.
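
    If you would rather not number the nodes by hand, here is a minimal Python sketch that assigns the positions for you. The function name and the three flows are hypothetical (the dollar amounts are made up); only the JSON shape follows the example above.

    ```python
    import json


    def build_sankey(flows):
        """Build {"nodes": [...], "links": [...]} from
        (region, donor type, candidate, amount) tuples."""
        names = []   # node names in the order they are first seen
        index = {}   # node name -> position in names

        def node_id(name):
            if name not in index:
                index[name] = len(names)
                names.append(name)
            return index[name]

        links = []
        for region, donor_type, candidate, amount in flows:
            source = node_id(region)
            target = node_id(candidate)
            links.append({"source": source, "type": donor_type,
                          "target": target, "value": amount})
        return {"nodes": [{"name": n} for n in names], "links": links}


    # HYPOTHETICAL flows, for illustration only.
    flows = [
        ("Brockville", "Individual", "LeDrew, Stephen", 500),
        ("Central Toronto", "Individual", "Miller, David", 200),
        ("Central Toronto", "Corporation", "LeDrew, Stephen", 300),
    ]
    election = build_sankey(flows)
    print(json.dumps(election, indent=1))
    ```

    One thing to watch: the part 2 code colours the links by reading d.contribution_type, so whichever key name you pick for the donor type has to match on both sides. The sketch also assumes no region shares a name with a candidate.
    
    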