We're going to wrangle the Boston data from OpenStreetMap. The data we have is in XML format. We will clean it, export it to CSV, and store it in an SQL database.
In order to clean the data, we will audit its quality along five characteristics: validity, accuracy, completeness, consistency, and uniformity. More precisely, we might have to perform tasks such as correcting abbreviations, fixing typos, and moving misplaced information into the appropriate tags.
Most of the time, we will define helper functions so that they can be reused, here or in another project, and save us some time.
Once all of this is done, we will export our cleaned data into csv files, and import them into an SQL database to perform further analytics.
Let's get started.
This has already been done. The exact dataset can be downloaded here (although the dataset will certainly evolve in the future). It has been unzipped and stored in the data folder of this notebook.
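For reproducibility, here is a minimal sketch of that download-and-extract step. Both the URL and the archive format are assumptions for illustration, not taken from the original:

import io
import urllib.request
import zipfile

# Hypothetical URL: substitute the actual export link used above
URL = "https://example.com/boston_massachusetts.osm.zip"

with urllib.request.urlopen(URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall("data")  # yields data/boston_massachusetts.osm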
We now need to extract the data so we can start working with it.
The first thing to do is to look at the boston_massachusetts.osm file and get a sense of its structure.
It follows an XML structure and is human-readable. The way it works is simple: there are several tags, and we're interested in three in particular, nodes, ways, and relations.
Nodes are represented with the tag <node></node>. They can have the following attributes:

- id - integer, the unique ID of the node
- lat - float, the latitude
- lon - float, the longitude
- version - integer, the number of edits
- timestamp - W3C datetime format, time of the last modification
- changeset - integer, the changeset in which the object was created or updated. A changeset consists of a group of edits made by a single user over a short period of time
- uid - integer, the unique ID of the user who last edited
- user - string, the pseudonym of the user who last edited

Ways are represented with the tag <way></way>. They can have the following attributes:

- id - integer, the unique ID of the way
- version - integer, the number of edits
- timestamp - W3C datetime format, time of the last modification
- changeset - integer, the changeset in which the object was created or updated
- uid - integer, the unique ID of the user who last edited
- user - string, the pseudonym of the user who last edited

Relations are represented with the tag <relation></relation>. They can have the same attributes as ways.

Nodes that make up the ways are represented with the tag <nd />. They can only have a ref attribute, which references the id of the node as an integer.

Ways and nodes in relations are represented with the tag <member />. They can have three attributes:

- type - string, the type of the object (node, way)
- ref - integer, the reference to the unique ID of the object
- role - string, the role it plays in the relation (via, from, to)

Tags are represented with the tag <tag />. They are used as child elements of nodes, ways and relations to bring additional information. They can have two attributes:

- k - string, the key of the tag
- v - string, the value of the tag

First, we're going to get to know our dataset: 436.2 MB of data and 5,920,723 lines.
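To make this structure concrete, here is a minimal sketch with hand-written sample values (hypothetical, not taken from the actual file), showing how these elements nest and how ElementTree exposes their attributes:

import xml.etree.cElementTree as ET

# Hand-written sample fragment (hypothetical values, for illustration only)
sample = '''
<osm>
  <node id="1" lat="42.3601" lon="-71.0589" version="2"
        timestamp="2017-01-01T00:00:00Z" changeset="100" uid="42" user="example">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="2" version="1" timestamp="2017-01-01T00:00:00Z"
       changeset="101" uid="42" user="example">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>'''

root = ET.fromstring(sample)
for elem in root:
    print(elem.tag, elem.attrib)              # node / way and their attributes
    for child in elem:
        print('  ', child.tag, child.attrib)  # nd / tag children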
Let's find out exactly what tags we have, and how many of each.
import xml.etree.cElementTree as ET
import pprint

dataset = "data/boston_massachusetts.osm"

def count_tags(filename):
    """
    Takes in a dataset in XML format.
    Returns a dictionary of the tags and the number of each tag.
    """
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        else:
            tag_dict[elem.tag] += 1
    return tag_dict

print(count_tags(dataset))
As we can see, there aren't that many different tags. We now know exactly which tags this dataset contains, and how many of each.
Great! Now let's tackle the users question right away:
how many unique users do we have to thank for creating and updating this dataset?
def get_users(filename):
    """Returns the set of unique user IDs ('uid') found in the dataset."""
    users = set()
    for _, element in ET.iterparse(filename):
        if 'uid' in element.attrib:
            users.add(element.attrib['uid'])
    return users

print(len(get_users(dataset)))
We have to thank 1,402 people for making this dataset possible.
Just from looking at the data, we already know we're going to run into some potential problems. The <tag /> elements don't all share the same format: some are lowercase, others are uppercase; some are plain text, others contain one or more colons. This is going to be problematic when trying to expand them into a reusable dictionary. Let's see how many of each we have, using regular expressions.
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    """
    Takes in an XML element (<node />, <way />) and a dictionary of counts.
    If the element is a <tag />, its 'k' attribute is analyzed and the
    counts are incremented accordingly:
    - lower: valid tags that only contain lowercase letters
    - lower_colon: valid tags that contain lowercase letters with one or more colons in their name
    - problemchars: tags with problematic characters
    - other: tags that don't fall in the previous three categories
    """
    if element.tag == 'tag':
        if lower.search(element.attrib['k']):
            keys['lower'] += 1
        elif lower_colon.search(element.attrib['k']):
            keys['lower_colon'] += 1
        elif problemchars.search(element.attrib['k']):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_tags(filename):
    """
    Takes in a dataset in XML format, parses it, and executes key_type() for each element.
    Returns a dictionary with the count of each type of key.
    """
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

tag_type_dict = process_tags(dataset)
print(tag_type_dict)
We apparently have one tag with a problematic key. Let's find out which one it is. We know there's only one in this dataset, but for convenience, we'll have the function return a list in case other datasets have more than one problematic key.
def get_problemkeys(filename):
    """
    Takes in a dataset in XML format, parses it, and returns a list of
    the keys of tags containing problematic characters.
    """
    problemchars_list = []
    for _, element in ET.iterparse(filename):
        if element.tag == 'tag':
            if problemchars.search(element.attrib['k']):
                problemchars_list.append(element.attrib['k'])
    return problemchars_list

print(get_problemkeys(dataset))
It's just one key containing a space, which shouldn't be an issue.
Now let's get to cleaning the data.
We're going to audit the data to get all the different street types it contains. Then we will be able to build a dictionary upon which to base our correction.
from collections import defaultdict

street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
street_types = defaultdict(int)

def audit_street_type(street_types, street_name):
    """Extracts the street type (last word) from a street name and counts it."""
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        street_types[street_type] += 1

def print_sorted_dict(d, expression):
    """Prints a dictionary sorted by key, using the given format expression."""
    keys = sorted(d.keys(), key=lambda s: s.lower())
    for k in keys:
        v = d[k]
        print(expression % (k, v))

def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")

def audit(filename):
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            audit_street_type(street_types, elem.attrib['v'])
    print_sorted_dict(street_types, "%s: %d")
    return street_types

all_types = audit(dataset)
all_types
We can see a number of issues here: abbreviated street types (St, Ave, Rd), inconsistent capitalization, typos, and street names carrying information that belongs in other tags, such as addr:housenumber. Let's tackle these issues one at a time. First, we will correct the abbreviations.
Let's define a list of the street types we deem valid, and a dictionary of the corrections we want to implement.
expected = ['Artery', 'Avenue', 'Boulevard', 'Broadway', 'Commons', 'Court', 'Drive', 'Lane', 'Park', 'Parkway',
'Place', 'Road', 'Square', 'Street', 'Terrace', 'Trail', 'Turnpike', 'Wharf',
'Yard']
abbr_mapping = { 'Ave': 'Avenue',
'Ave.': 'Avenue',
'Ct': 'Court',
'Dr': 'Drive',
'HIghway': 'Highway',
'Hwy': 'Highway',
'Pl': 'Place',
'place': 'Place',
'Pkwy': 'Parkway',
'Rd': 'Road',
'rd.': 'Road',
'Sq.': 'Square',
'St': 'Street',
'st': 'Street',
'ST': 'Street',
'St,': 'Street',
'St.': 'Street',
'street': 'Street',
'Street.': 'Street'
}
First of all, let's print the problematic names. We don't need to print them all (no need to list every Ave or St), so we will only print the full names whose street type occurs fewer than 20 times and appears neither in expected nor in abbr_mapping.
typo_full_names = {}

def audit_street_name(street_types, street_name):
    """Collects full street names whose type is rare and not yet mapped."""
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if (all_types[street_type] < 20) and (street_type not in expected) and (street_type not in abbr_mapping):
            if street_type in typo_full_names:
                typo_full_names[street_type].append(street_name)
            else:
                typo_full_names.update({street_type: [street_name]})

def audit_name(filename):
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            audit_street_name(street_types, elem.attrib['v'])
    # print_sorted_dict(street_types)
    return typo_full_names

audit_name(dataset)
Now we can define other mapping dictionaries and apply new, individual corrections based on the information we have:

- Some names carry house or suite numbers; that information should appear in the addr:housenumber and addr:suitenumber tags
- The PO Box 846028 information appears in the addr:housenumber tag already
- Albany, Boylston, Cambridge (written Cambrdige here, with a typo), Dartmouth, Elm, Hampshire, Holland, Newbury are Boston street names
- Lafayette, Longwood and Winsor are Boston avenue names
- Pasteur corresponds to Avenue Louis Pasteur
- South Market Building is located on 4 South Market Street
- The Centers are correct; we will add Center to our expected list
- The Circles are correct; we will add Circle to our expected list
- Coolidge Corner should not appear here
- Museum of Science Driveway is correct; we will add Driveway to our expected list
- B Street Ext corresponds to B Street
- Fellsway is a parkway
- Fenway is of course Fenway Park
- Boylston Street, 5th Floor should only be Boylston Street; the 5th Floor information should appear in the tag addr:floornumber
- Stillings Street Garage is a garage located on 11 Stillings Street
- East Boston Greenway is correct
- H is actually 605 Hancock Street
- Faneuil Hall is located at 4 South Market Street
- Jamaicaway is correct
- LOMASNEY WAY, ROOF LEVEL should only be Lomasney Way
- Cummington Mall is correct, we will add Mall to our expected list
- Faneuil Hall Market is 1 Faneuil Hall Square
- Two Center Plaza should be 2 Center Plaza
- Park Plaza is correct
- Charles Street South should simply be Charles Street
- Sidney Street, 2nd floor should only be Sidney Street; the 2nd Floor information should appear in the tag addr:floornumber
- First Street, 18 floor should only be First Street; the 18th Floor information should appear in the tag addr:floornumber
- Windsor corresponds to Windsor Place in Somerville

expected.extend(['Center', 'Circle', 'Driveway', 'Mall'])
typo_mapping = { 'Albany': 'Albany Street',
'Boylston': 'Boylston Street',
'Cambrdige': 'Cambridge Street',
'Dartmouth': 'Dartmouth Street',
'Elm': 'Elm Street',
'Hampshire': 'Hampshire Street',
'Holland': 'Holland Street',
'Newbury': 'Newbury Street',
'Lafayette': 'Lafayette Avenue',
'Longwood': 'Longwood Avenue',
'Winsor': 'Winsor Avenue',
'Pasteur': 'Avenue Louis Pasteur',
'Corner': 'Webster Street',
'Building': {'South Market Street': 4},
'B Street Ext': 'B Street',
'Fellsway': 'Fellsway Parkway',
'Fenway': 'Fenway Park',
'Floor': {'Boylston Street': '5th Floor'},
'Garage': 'Stillings Street',
'H': 'Hancock Street',
'Hall': {'Faneuil Hall Square': 1},
'LEVEL': {'Lomasney Way': 'Roof Level'},
'Market': 'Faneuil Hall Square',
'Plaza': { 'Two Center Plaza': {'Center Plaza': 2}},
'South': 'Charles Street South',
'floor': { 'Sidney Street': '2nd Floor',
'First Street': '18th Floor'},
'Windsor': 'Windsor Place'
}
numbers_mapping = { '#12': {'Harvard Street': 12},
'#1302': {'Cambridge Street': 1302},
'#501': {'Bromfield Street': 501},
'104': {'Mill Street': 'Suite 104'},
'1100': {'First Street': 'Suite 1100'},
'1702': { 'Franklin Street': 'Suite 1702'},
'3': {'Kendall Square': 'Suite B3201'},
'303': {'First Street': 'Suite 303'},
'6': {'Atlantic Avenue': 700 }
}
po_box_mapping = { '846028': 'Albany Street'}
expected = sorted(expected)
expected
Finally, before running updates, let's see if any street name contains incorrect characters.
name_problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\t\r\n]')

def get_problem_names(filename):
    """
    Takes in a dataset in XML format, parses it, and returns a list with
    the values of street name tags containing problematic characters.
    """
    problemchars_list = []
    for _, element in ET.iterparse(filename):
        if is_street_name(element):
            if name_problem_chars.search(element.attrib['v']):
                problemchars_list.append(element.attrib['v'])
    return problemchars_list

print(get_problem_names(dataset))
Almost all of these entries are correct. We have two entries we need to investigate:

- Church Street, Harvard Square corresponds to a parish located on 1446 Massachusetts Avenue
- Massachusetts Ave; Mass Ave should only be Massachusetts Avenue
We will define one last dictionary to correct them, and then we will finally correct the values.
char_mapping = { 'Church Street, Harvard Square': {'Massachusetts Avenue': 1446},
'Massachusetts Ave; Mass Ave': 'Massachusetts Avenue'
}
We have identified all the incorrect names. We created dictionaries to update incorrect values. We can now clean the data and save it all in a final dictionary.
def typo_correct(street_name, street_type):
    """Returns the corrected name for a street matched in typo_mapping."""
    if isinstance(typo_mapping[street_type], str):
        name = typo_mapping[street_type]
    elif isinstance(typo_mapping[street_type], dict):
        if '2nd' in street_name:
            name = 'Sidney Street'
            # add attribute addr:floor '2nd Floor' typo_mapping[street_type]['Sidney Street']
        elif '18' in street_name:
            name = 'First Street'
            # add attribute addr:floor '18th Floor' typo_mapping[street_type]['First Street']
        elif '5th' in street_name:
            name = 'Boylston Street'
            # add attribute addr:floor '5th Floor' typo_mapping[street_type]['Boylston Street']
        elif street_type == 'LEVEL':
            name = 'Lomasney Way'
            # add attribute addr:floor 'Roof Level' typo_mapping[street_type]['Lomasney Way']
        elif 'Two Center' in street_name:
            name = 'Center Plaza'
            # add attribute addr:housenumber '2' typo_mapping[street_type]['Two Center Plaza']['Center Plaza']
        else:
            for key in typo_mapping[street_type]:
                name = key
                # add attribute addr:housenumber = value
    return name

def numbers_correct(street_name, street_type):
    """Returns the corrected name for a street matched in numbers_mapping."""
    if 'Suite' in street_name:
        for key in numbers_mapping[street_type]:
            name = key
            # add attribute addr:suitenumber = value
    else:
        for key in numbers_mapping[street_type]:
            name = key
            # add attribute addr:housenumber = value
    return name

def char_correct(street_name, street_type):
    """Returns the corrected name for a street matched in char_mapping."""
    print(street_name, street_type)
    if street_name + ' ' + street_type == 'Church Street, Harvard Square':
        for key in char_mapping[street_name + ' ' + street_type]:
            name = key
    elif street_name + ' ' + street_type == 'Massachusetts Ave; Mass Ave':
        name = char_mapping[street_name + ' ' + street_type]
    return name

def audit_abbreviations(filename):
    """Collects all street names whose type is not in the expected list."""
    problem_street_types = defaultdict(set)
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            expected_street_type(problem_street_types, elem.attrib['v'])
    return problem_street_types

def expected_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def update_name(name):
    """Returns the corrected version of a street name, using the mapping dictionaries."""
    street_type = name.split(' ')[-1]
    street_name = name.rsplit(' ', 1)[0]
    if (street_name + ' ' + street_type) in char_mapping:
        name = char_correct(street_name, street_type)
    elif street_type in abbr_mapping:
        name = street_name + ' ' + abbr_mapping[street_type]
    elif street_type in typo_mapping:
        name = typo_correct(street_name, street_type)
    elif street_type in numbers_mapping:
        name = numbers_correct(street_name, street_type)
    elif street_type in po_box_mapping:
        name = po_box_mapping[street_type]
    return name

def run_updates(filename):
    """Audits the dataset and builds a dictionary mapping incorrect names to corrections."""
    st_types = audit_abbreviations(filename)
    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name)
            if better_name != name:
                corrected_names[name] = better_name
    return corrected_names

corrected_names = {}
corrected_names = run_updates(dataset)
print_sorted_dict(corrected_names, "%s: %s")
Good! We now have a dictionary of all the incorrect values and their corrections. We're ready to prepare the data for insertion into an SQL database.
SCHEMA = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}
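As a quick sanity check, here is a minimal sketch (with hypothetical sample values, not taken from the dataset) of how cerberus validates a record against this schema; the validation helper defined further down relies on the same mechanism:

import cerberus

validator = cerberus.Validator()

# A hand-written record shaped like the dicts shape_element() will return
sample = {
    'node': {'id': 1, 'lat': 42.3601, 'lon': -71.0589, 'user': 'example',
             'uid': 42, 'version': '2', 'changeset': 100,
             'timestamp': '2017-01-01T00:00:00Z'},
    'node_tags': [{'id': 1, 'key': 'amenity', 'value': 'cafe', 'type': 'regular'}]
}

print(validator.validate(sample, SCHEMA))  # True: the record matches the schema
print(validator.errors)                    # {} when validation succeeds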
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET
import cerberus
# import schema  # not needed: SCHEMA is defined inline above

OSM_PATH = "example.osm"  # template default; we pass `dataset` to process_map() below
NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
def correct_element(v):
    """Returns the corrected value for v if we identified one, v itself otherwise."""
    if v in corrected_names:
        correct_value = corrected_names[v]
    else:
        correct_value = v
    return correct_value
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""
    # Note: problem_chars and the *_fields parameters come from the project
    # template and are not used in the body below.
    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements
    if element.tag == 'node':
        node_attribs['id'] = element.attrib['id']
        node_attribs['user'] = element.attrib['user']
        node_attribs['uid'] = element.attrib['uid']
        node_attribs['version'] = element.attrib['version']
        node_attribs['lat'] = element.attrib['lat']
        node_attribs['lon'] = element.attrib['lon']
        node_attribs['timestamp'] = element.attrib['timestamp']
        node_attribs['changeset'] = element.attrib['changeset']
        for node in element:
            tag_dict = {}
            tag_dict['id'] = element.attrib['id']
            if ':' in node.attrib['k']:
                tag_dict['type'] = node.attrib['k'].split(':', 1)[0]
                tag_dict['key'] = node.attrib['k'].split(':', 1)[-1]
                tag_dict['value'] = correct_element(node.attrib['v'])
            else:
                tag_dict['type'] = 'regular'
                tag_dict['key'] = node.attrib['k']
                tag_dict['value'] = correct_element(node.attrib['v'])
            tags.append(tag_dict)
    elif element.tag == 'way':
        way_attribs['id'] = element.attrib['id']
        way_attribs['user'] = element.attrib['user']
        way_attribs['uid'] = element.attrib['uid']
        way_attribs['version'] = element.attrib['version']
        way_attribs['timestamp'] = element.attrib['timestamp']
        way_attribs['changeset'] = element.attrib['changeset']
        n = 0
        for node in element:
            if node.tag == 'nd':
                way_dict = {}
                way_dict['id'] = element.attrib['id']
                way_dict['node_id'] = node.attrib['ref']
                way_dict['position'] = n
                n += 1
                way_nodes.append(way_dict)
            if node.tag == 'tag':
                tag_dict = {}
                tag_dict['id'] = element.attrib['id']
                if ':' in node.attrib['k']:
                    tag_dict['type'] = node.attrib['k'].split(':', 1)[0]
                    tag_dict['key'] = node.attrib['k'].split(':', 1)[-1]
                    tag_dict['value'] = correct_element(node.attrib['v'])
                else:
                    tag_dict['type'] = 'regular'
                    tag_dict['key'] = node.attrib['k']
                    tag_dict['value'] = correct_element(node.attrib['v'])
                tags.append(tag_dict)
    if element.tag == 'node':
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
# ================================================== #
# Helper Functions #
# ================================================== #
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(iter(validator.errors.items()))
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        raise Exception(message_string.format(field, error_string))
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input (a Python 2 helper; unused below)"""
    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, str) else v) for k, v in row.items()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
# ================================================== #
# Main Function #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""
    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = csv.DictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = csv.DictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = csv.DictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = csv.DictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = csv.DictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)
                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])
process_map(dataset, validate=True)
import sqlite3
# Creating database on disk
sqlite_file = 'boston.db'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS nodes''')
c.execute('''DROP TABLE IF EXISTS nodes_tags''')
c.execute('''DROP TABLE IF EXISTS ways''')
c.execute('''DROP TABLE IF EXISTS ways_tags''')
c.execute('''DROP TABLE IF EXISTS ways_nodes''')
conn.commit()
QUERY_NODES = """
CREATE TABLE nodes (
id INTEGER NOT NULL,
lat REAL,
lon REAL,
user TEXT,
uid INTEGER,
version INTEGER,
changeset INTEGER,
timestamp TEXT
);
"""
QUERY_NODES_TAGS = """
CREATE TABLE nodes_tags (
id INTEGER,
key TEXT,
value TEXT,
type TEXT,
FOREIGN KEY (id) REFERENCES nodes(id)
);
"""
QUERY_WAYS = """
CREATE TABLE ways (
id INTEGER NOT NULL,
user TEXT,
uid INTEGER,
version INTEGER,
changeset INTEGER,
timestamp TEXT
);
"""
QUERY_WAYS_TAGS = """
CREATE TABLE ways_tags (
id INTEGER NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
type TEXT,
FOREIGN KEY (id) REFERENCES ways(id)
);
"""
QUERY_WAYS_NODES = """
CREATE TABLE ways_nodes (
id INTEGER NOT NULL,
node_id INTEGER NOT NULL,
position INTEGER NOT NULL,
FOREIGN KEY (id) REFERENCES ways(id),
FOREIGN KEY (node_id) REFERENCES nodes(id)
);
"""
c.execute(QUERY_NODES)
c.execute(QUERY_NODES_TAGS)
c.execute(QUERY_WAYS)
c.execute(QUERY_WAYS_TAGS)
c.execute(QUERY_WAYS_NODES)
conn.commit()
Our tables are created. We now need to import our csv files into them.
with open('nodes.csv', 'rt') as fin:
    dr = csv.DictReader(fin)  # comma is the default delimiter
    to_db1 = [(i['id'], i['lat'], i['lon'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr]

with open('nodes_tags.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db2 = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

with open('ways.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db3 = [(i['id'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr]

with open('ways_tags.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db4 = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

with open('ways_nodes.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db5 = [(i['id'], i['node_id'], i['position']) for i in dr]
c.executemany("INSERT INTO nodes(id, lat, lon, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db1)
c.executemany("INSERT INTO nodes_tags(id, key, value, type) VALUES (?, ?, ?, ?);", to_db2)
c.executemany("INSERT INTO ways(id, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?);", to_db3)
c.executemany("INSERT INTO ways_tags(id, key, value, type) VALUES (?, ?, ?, ?);", to_db4)
c.executemany("INSERT INTO ways_nodes(id, node_id, position) VALUES (?, ?, ?);", to_db5)
conn.commit()
We already know how many nodes and ways the original data contains, from the tag counts above. Let's check that our database matches.
c.execute('SELECT COUNT(*) FROM nodes')
all_rows = c.fetchall()
print(all_rows)
c.execute('SELECT COUNT(*) FROM ways')
all_rows = c.fetchall()
print(all_rows)
We're good! Let's query away!
Who are the top 3 users who made the most modifications?
QUERY = '''
SELECT DISTINCT nodes.user, COUNT(*)
FROM nodes
GROUP BY nodes.uid
ORDER BY COUNT(*) DESC
LIMIT 3;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What proportion of the database did the top 10 users contribute?
QUERY = '''
SELECT DISTINCT nodes.user, COUNT(*) * 100.0 / (SELECT COUNT(*) FROM nodes)
FROM nodes
GROUP BY nodes.uid
ORDER BY (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM nodes)) DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What are the top 10 streets that contain the most nodes?
QUERY = '''
SELECT ways_tags.value, COUNT(*)
FROM ways_tags
WHERE ways_tags.key = 'name'
AND ways_tags.type = 'regular'
GROUP BY ways_tags.value
ORDER BY COUNT(*) DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
On average, how many nodes does a way contain?
QUERY = '''
SELECT AVG(Count)
FROM
(SELECT COUNT(*) as Count
FROM ways
JOIN ways_nodes
ON ways.id = ways_nodes.id
GROUP BY ways.id);
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What are the top 10 amenities in Boston?
QUERY = '''
SELECT value, COUNT(*) as Count
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY Count DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
There's definitely room to sit down in Boston.
What's the religious landscape like in Boston?
QUERY = '''
SELECT nodes_tags.value, COUNT(*) as Count
FROM nodes_tags
JOIN
(SELECT DISTINCT(id)
FROM nodes_tags
WHERE value='place_of_worship') as Sub
ON nodes_tags.id=Sub.id
WHERE nodes_tags.key='religion'
GROUP BY nodes_tags.value
ORDER BY Count DESC;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
Given the strong presence of Irish descendants in Boston, it would have been interesting to have a distinction between Catholics and Protestants.
I focused this analysis on identifying irregular street names, cleaning them, and reformatting them so that one street type corresponds to one term (Avenue, Street), getting rid of abbreviations.
To improve the cleaning result, the process should also include the insertion of new tags. Some street names contained valuable information, about floors or suite numbers for example, that deserves to be preserved. Inserting tags would let us keep each piece of information in its proper place, preserving both the street name and the floor or suite number; a minimal sketch follows.
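Here is one way that could look, as a minimal sketch: a helper that splits floor information out of a street name and returns it as an extra tag dict ready for the node_tags/way_tags CSVs. The helper name and the regex are mine, and the patterns are illustrative, not exhaustive:

import re

# Hypothetical helper: split floor information out of a street name value
FLOOR_RE = re.compile(r'^(?P<street>.+?),?\s+(?P<floor>\d+(st|nd|rd|th)?\s+[Ff]loor)$')

def split_floor_info(street_value):
    """Returns (clean_street_name, extra_tags) for values like 'Sidney Street, 2nd floor'."""
    m = FLOOR_RE.match(street_value)
    if m:
        extra = {'key': 'floornumber', 'type': 'addr', 'value': m.group('floor')}
        return m.group('street'), [extra]
    return street_value, []

# Usage: inside shape_element(), the extra tags would be appended alongside the
# corrected addr:street tag, each carrying the element's id.
name, extra_tags = split_floor_info('Sidney Street, 2nd floor')
print(name)        # Sidney Street
print(extra_tags)  # [{'key': 'floornumber', 'type': 'addr', 'value': '2nd floor'}]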
This data wrangling process should probably be pushed a step further to verify and clean zip codes as well, before the data is sent back to OpenStreetMap. Zip codes could present issues similar to the ones encountered with street names; a first audit pass might look like the sketch below.
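For instance, a minimal audit of addr:postcode values, assuming a valid code is five digits with an optional ZIP+4 extension (the helper name and the validity rule are my assumptions):

import re
from collections import defaultdict
import xml.etree.cElementTree as ET

# Assumed valid form: 5 digits, with an optional four-digit extension
ZIP_RE = re.compile(r'^\d{5}(-\d{4})?$')

def audit_zipcodes(filename):
    """Counts addr:postcode values that do not look like plain ZIP or ZIP+4 codes."""
    bad_zips = defaultdict(int)
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag' and elem.attrib['k'] == 'addr:postcode':
            if not ZIP_RE.match(elem.attrib['v']):
                bad_zips[elem.attrib['v']] += 1
    return bad_zips

print(audit_zipcodes(dataset))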