Toronto Elections data with Neo4j and Python part 1 of 3

As promised, I am pushing the envelope on the 2006 Elections contributions datasets. This time I am going to do some analysis using Neo4j, but since the data needs to be loaded using the right syntax, I have a little preparation to do first. Currently, my data sits in a semicolon-delimited CSV file and looks like the following:

1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;CR0001;Individual;LeDrew, Stephen;1.00;Mayor;
2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;CR0001;Individual;Pitfield, Jane;1.00;Mayor;
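
To make the column layout concrete, here is a minimal sketch that reads the two sample rows with Python's csv module and keeps only the fields used later on. It is written for Python 3 and reads from an in-memory string rather than the real file; the names SAMPLE, FIELDS and parse_rows are made up for this illustration.

```python
import csv
import io

# The two sample rows shown above, held in memory instead of a file.
SAMPLE = (
    "1;Robichon, Georges; ;H3R1R3;H3R;Mont-Royal (Quebec);200.00;CT0001;Cash;"
    "CR0001;Individual;LeDrew, Stephen;1.00;Mayor;\n"
    "2;Rousseau, Remy; ;J4M2B3;J4M;Longueil (Quebec);1000.00;CT0001;Cash;"
    "CR0001;Individual;Pitfield, Jane;1.00;Mayor;\n"
)

# Column positions of the fields this demo needs (0-based).
FIELDS = {
    "id": 0,
    "contributor": 1,
    "postal_code": 3,
    "location": 5,
    "amount": 6,
    "type": 8,
    "contributor_type": 10,
    "candidate": 11,
}


def parse_rows(text):
    """Return one dict per semicolon-delimited row, keeping only FIELDS."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [{name: row[idx] for name, idx in FIELDS.items()} for row in reader]


rows = parse_rows(SAMPLE)
print(rows[0]["contributor"], rows[0]["amount"])
```

Because the delimiter is a semicolon, the comma inside each name stays part of the field.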

The first column of each sample row (“1” and “2”) is a unique ID I created for each donation. It will be useful for identifying unique contributions, as we’ll see later. Here is my sketch of what I need the Neo4j graph to look like:
[Image: neo4j_elections_sketch]
So I basically need to create nodes and relationships for each donation and put them into a text file. That text file can then be copied and pasted or otherwise imported into Neo4j. There might be a better way to import, but there are some tricky conditional statements you have to write around contributors, because someone can contribute to a candidate multiple times and/or contribute to multiple candidates. As a result, a ‘straight CREATE’ statement on each line will result in duplicate entries. (EDIT: I still ended up with duplicate nodes, which I had to delete, but you get the idea.)
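
The conditional logic boils down to treating a contributor as the pair (name, postal code), so that repeat donations reuse one node while two people who share a name stay separate. Here is a minimal sketch of that idea in plain Python 3, with no database involved. The function name group_contributions and the sample rows are hypothetical, and it assumes that the same name at the same postal code is always the same person.

```python
from collections import defaultdict


def group_contributions(donations):
    """Group donations under a (name, postal code) key.

    donations is an iterable of (donation_id, name, postal_code, candidate).
    """
    by_contributor = defaultdict(list)
    for donation_id, name, postal_code, candidate in donations:
        by_contributor[(name, postal_code)].append((donation_id, candidate))
    return dict(by_contributor)


# Made-up rows: one person donating twice, plus a namesake elsewhere.
sample = [
    ("1", "Doe, Jane", "M4K1A1", "Candidate A"),
    ("2", "Doe, Jane", "M4K1A1", "Candidate B"),
    ("3", "Doe, Jane", "H3R1R3", "Candidate A"),
]
grouped = group_contributions(sample)
print(len(grouped))  # 2 contributors for 3 contributions
```

Each key would become one contributor node, and each entry in its list one contribution.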
EDIT: I tried out py2neo and liked it! The code below is updated to use the latest version of this Python library.
import csv
import py2neo
from py2neo import neo4j, node, rel #this is a handy add-in for working with Neo4j
source = r'C:\Users\jlalonde\Documents\personal\elections\tblMayor.csv' #raw string so the backslashes are taken literally; obviously you want to adjust your path to suit what you need
graph_db = neo4j.GraphDatabaseService()
graph_db.clear()
d = [] #complete dataset, that can then be sorted
p = [] #this will be a unique list of postal codes
c = [] #this will be a unique list of candidates
cont = [] #this will help with checking duplicate contributors
m=1
with open(source, 'rb') as csvfile:
    s = csv.reader(csvfile, delimiter=';')
    #this next section collects the rows that the nodes will be built from
    for row in s:
        if row[11] == 'Gold, Mitch':
            d.append([row[0], row[1], row[3], row[5], row[6], row[8], row[10], row[11]]) # this only collects the data you need for this demo
            #from the original data....
            # row[0] is the unique ID
            #row[1] is the contributor name
            #row[2] is the contributor address
            #row[3] is the contributor postal code
            #row[4] is the contributor postal code FSA
            #row[5] is the contributor location or neighborhood
            #row[6] is the amount
            #row[7] is the contribution code
            #row[8] is the type (usually cash)
            #row[9] is some kind of contributor code
            #row[10] is the contributor type (individual vs. corporation)
            #row[11] is the candidate name
        #the new rows in d (for reference)
        #row[0] is the unique ID
        #row[1] is the contributor name
        #row[2] is the contributor postal code
        #row[3] is the contributor location or neighbourhood
        #row[4] is the amount
        #row[5] is the type
        #row[6] is the contributor type
        #row[7] is the candidate name
from operator import itemgetter
d.sort(key = itemgetter(2, 1)) ##this sorts by postal code and by name
for row in d:#in this instance you want to create a unique list of nodes for candidates and postal codes
    #you'll treat people differently, later.
    p.append(row[2])
    c_nospace = str(row[7]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','')##yeah yeah yeah I probably could have used REGEX here
    c.append(c_nospace)
    contribution_create, = graph_db.create(node(contribution_id = 'ID' + str(row[0]), amount=row[4]))
    contribution_create.add_labels("contribution")
p2 = list(set(p)) #create a list of unique values of postal codes
c2 = list(set(c)) #create a list of unique candidates
for row in p2:
    postcode, = graph_db.create(node(p_name = str(row)))
    postcode.add_labels("PostalCode")
for row in c2:
    #Adam Sit was a candidate and also someone named Adam Sit made a contribution. So I added the 'C_' to make sure there was no error.
    candidate, = graph_db.create(node(candidate_name = str(row)))
    candidate.add_labels("Candidate")
#now you can go through each line of the dataset, creating nodes if they are unique
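
As the comment in the code admits, the chained .replace() calls could probably be collapsed into a regular expression. A minimal sketch in Python 3 follows. The helper name to_identifier is hypothetical, and it is meant to give the same result as the chain used above for candidate names (commas become underscores; spaces, hyphens, periods, ampersands and parentheses are dropped).

```python
import re

# Characters the original chain removes: space, hyphen, period, ampersand, parentheses.
_STRIP = re.compile(r"[ \-.&()]")


def to_identifier(name):
    """Replace commas with underscores, then drop the stripped characters."""
    return _STRIP.sub("", name.replace(",", "_"))


print(to_identifier("Gold, Mitch"))  # Gold_Mitch
```

Using one helper for both candidates and contributors would also keep the two names normalized the same way, which matters when a later MATCH has to find them again.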

Now that I have created the array, I can finish off by creating the relationships and the remaining nodes.


#the reason why you have the next part is that someone could be donating to more than one candidate or to the same candidate twice.
contributor1 = ''
pcode1 = ''
for row in d:
    contributor2 = str(row[1]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','')
    pcode2 = row[2] #read the current postal code first, so the check below compares it with the previous row's postal code
    if contributor2 in cont and pcode1 != pcode2:
        contributor2 = contributor2 + str(row[0]) #this ensures there is no duplicate contributor names who are not the same person. Trust me.
    cont.append(contributor2)
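    candidate = str(row[7]).replace(' ','').replace(',','_').replace('-','').replace('.','').replace('&','').replace('(','').replace(')','') #same clean-up as when the candidate nodes were created, so the MATCH below can find them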
    if contributor1 == contributor2 and pcode1 == pcode2:
        ###if they are the same then you do not have to create a new node, just a new contribution
        ###if they are NOT the same then a new contributor node gets created
        ## you do this because you could have two people with the SAME name making a contribution. You figure this out by throwing the postal code into the mix
        string1 = 'MATCH (a {contributor_name: "' + contributor1 + '"}), (b {contribution_id: "ID' + str(row[0]) +'"})'
        string1 = string1 + ' CREATE UNIQUE a-[:CONTRIBUTED]->b'
        query1 = neo4j.CypherQuery(graph_db, string1)
        go1 = query1.execute()
        string2 = 'MATCH (c {contribution_id: "ID' + str(row[0]) +'"}), (d {candidate_name: "' + candidate + '"})'
        string2 = string2 + ' CREATE UNIQUE c-[:RECEIVED]->d'
        query2 = neo4j.CypherQuery(graph_db, string2)
        go2 = query2.execute()
    else: #here means the contributor is new. (1) Set up the contributor. (2) Set up their relationship with their postal code and their donation
        contributor, = graph_db.create(node(contributor_name = contributor2, type = row[6])) #type is the contributor type (individual vs. corporation)
        contributor.add_labels("contributor")
        string1 = 'MATCH (a {contributor_name: "' + contributor2 + '"}), (b {contribution_id: "ID' + str(row[0]) +'"})'
        string1 = string1 + ' CREATE UNIQUE a-[:CONTRIBUTED]->b'
        query1 = neo4j.CypherQuery(graph_db, string1)
        go1 = query1.execute()
        string2 = 'MATCH (c {contribution_id: "ID' + str(row[0]) +'"}), (d {candidate_name: "' + candidate + '"})'
        string2 = string2 + ' CREATE UNIQUE c-[:RECEIVED]->d'
        query2 = neo4j.CypherQuery(graph_db, string2)
        go2 = query2.execute()
        string3 = 'MATCH (e {contributor_name: "' + contributor2 + '"}), (f {p_name: "' + pcode2 + '"})'
        string3 = string3 + ' CREATE UNIQUE e-[:LIVES]->f'
        query3 = neo4j.CypherQuery(graph_db, string3)
        go3 = query3.execute()
    contributor1 = contributor2
    pcode1 = pcode2
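
The queries above are assembled by string concatenation, and duplicate nodes still crept in. One possible alternative, sketched here under assumptions this post does not confirm, is Cypher's MERGE with query parameters, which creates a node or relationship only when no match exists. The sketch only builds the query text and the parameter dictionary. It assumes a Neo4j version that accepts the $name parameter syntax, the function name link_contribution is hypothetical, and actually running the query would need a driver session that is not shown.

```python
def link_contribution(contributor_name, postal_code, contribution_id, candidate_name):
    """Return (cypher, params) tying one contribution to its contributor and candidate.

    Labels and property names follow the script above. contributor_name is
    expected to be already disambiguated, as in the loop above.
    """
    cypher = (
        "MERGE (a:contributor {contributor_name: $contributor_name}) "
        "MERGE (p:PostalCode {p_name: $postal_code}) "
        "MERGE (b:contribution {contribution_id: $contribution_id}) "
        "MERGE (d:Candidate {candidate_name: $candidate_name}) "
        "MERGE (a)-[:LIVES]->(p) "
        "MERGE (a)-[:CONTRIBUTED]->(b) "
        "MERGE (b)-[:RECEIVED]->(d)"
    )
    params = {
        "contributor_name": contributor_name,
        "postal_code": postal_code,
        "contribution_id": contribution_id,
        "candidate_name": candidate_name,
    }
    return cypher, params


# Hypothetical values, only to show the shape of the result.
cypher, params = link_contribution("Doe_Jane", "M4K1A1", "ID1", "Gold_Mitch")
print(params["contribution_id"])
```

Passing values as parameters also avoids having to escape quotes in names by hand.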

EDIT: Post the nodes first (obviously) and make sure the nodes and relationships are in the same box when entering them. Don’t get caught like I did!

Here is a screen capture of Mitch Gold’s network.
[Image: neo4j_elections_output]
