We're going to wrangle the Boston data from OpenStreetMap. The data we have is in XML format. We will clean it, export it to CSV, and store it in an SQL database.
In order to clean the data, we will audit its quality along five characteristics: validity, accuracy, completeness, consistency, and uniformity. More precisely, we might have to perform tasks such as correcting abbreviations, fixing typos, and moving misplaced information into the appropriate tags.
Most of the time, we will define helper functions so that they can be reused, here or in another project, and save us some time.
Once all of this is done, we will export our cleaned data into csv files, and import them into an SQL database to perform further analytics.
Let's get started.
This has already been done. The exact dataset can be downloaded here (although the dataset will certainly evolve in the future). It has been unzipped and stored in the data folder of this notebook.
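For reproducibility, here is a minimal sketch of that download-and-extract step. Both the URL and the archive format are assumptions for illustration, not taken from the original:

import io
import urllib.request
import zipfile

# Hypothetical URL: substitute the actual export link used above
URL = "https://example.com/boston_massachusetts.osm.zip"

with urllib.request.urlopen(URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall("data")  # yields data/boston_massachusetts.osm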
We now need to extract the data so we can start working with it.
The first thing to do is to look at the boston_massachusetts.osm file and get a sense of its structure.
It follows an XML structure and is human-readable. The way it works is simple: there are several tags, and we're interested in three in particular, nodes, ways, and relations.
Nodes are represented with the tag <node></node>. They can have the following attributes:

- id - integer, the unique ID of the node
- lat - float, the latitude
- lon - float, the longitude
- version - integer, the number of edits
- timestamp - W3C datetime format, time of the last modification
- changeset - integer, the changeset in which the object was created or updated. A changeset consists of a group of edits made by a single user over a short period of time
- uid - integer, the unique ID of the user who last edited
- user - string, the pseudonym of the user who last edited

Ways are represented with the tag <way></way>. They can have the following attributes:

- id - integer, the unique ID of the way
- version - integer, the number of edits
- timestamp - W3C datetime format, time of the last modification
- changeset - integer, the changeset in which the object was created or updated
- uid - integer, the unique ID of the user who last edited
- user - string, the pseudonym of the user who last edited

Relations are represented with the tag <relation></relation>. They can have the same attributes as ways.

Nodes that make up the ways are represented with the tag <nd />. They can only have a ref attribute, which references the id of the node as an integer.

Ways and nodes in relations are represented with the tag <member />. They can have three attributes:

- type - string, the type of the object (node, way)
- ref - integer, the reference to the unique ID of the object
- role - string, the role it plays in the relation (via, from, to)

Tags are represented with the tag <tag />. They are used as child elements of nodes, ways and relations to bring additional information. They can have two attributes:

- k - string, the key of the tag
- v - string, the value of the tag

First, we're going to get to know our dataset: 436.2 MB of data and 5,920,723 lines.
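To make this structure concrete, here is a minimal sketch with hand-written sample values (hypothetical, not taken from the actual file), showing how these elements nest and how ElementTree exposes their attributes:

import xml.etree.cElementTree as ET

# Hand-written sample fragment (hypothetical values, for illustration only)
sample = '''
<osm>
  <node id="1" lat="42.3601" lon="-71.0589" version="2"
        timestamp="2017-01-01T00:00:00Z" changeset="100" uid="42" user="example">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="2" version="1" timestamp="2017-01-01T00:00:00Z"
       changeset="101" uid="42" user="example">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>'''

root = ET.fromstring(sample)
for elem in root:
    print(elem.tag, elem.attrib)              # node / way and their attributes
    for child in elem:
        print('  ', child.tag, child.attrib)  # nd / tag children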
Let's find out exactly what tags we have, and how many of each.
import xml.etree.cElementTree as ET
import pprint

dataset = "data/boston_massachusetts.osm"

def count_tags(filename):
    """
    Takes in a dataset in XML format.
    Returns a dictionary of the tags and the number of each tag.
    """
    tag_dict = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tag_dict:
            tag_dict[elem.tag] = 1
        else:
            tag_dict[elem.tag] += 1
    return tag_dict

print(count_tags(dataset))
As we can see, there aren't that many different tags. We now know exactly which tags this dataset contains, and how many of each.
Great! Now let's tackle the users question right away:
how many unique users do we have to thank for creating and updating this dataset?
def get_users(filename):
    """Returns the set of unique user IDs ('uid') found in the dataset."""
    users = set()
    for _, element in ET.iterparse(filename):
        if 'uid' in element.attrib:
            users.add(element.attrib['uid'])
    return users

print(len(get_users(dataset)))
We have to thank 1,402 people for making this dataset possible.
Just from looking at the data, we already know we're going to run into some potential problems. The <tag /> elements don't all share the same format: some are lowercase, others are uppercase; some are plain text, others contain one or more colons. This is going to be problematic when trying to expand them into a reusable dictionary. Let's see how many of each we have, using regular expressions.
import re

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(element, keys):
    """
    Takes in an XML element (<node />, <way />) and a dictionary of counts.
    If the element is a <tag />, its 'k' attribute is analyzed and the
    counts are incremented accordingly:
    - lower: valid tags that only contain lowercase letters
    - lower_colon: valid tags that contain lowercase letters with one or more colons in their name
    - problemchars: tags with problematic characters
    - other: tags that don't fall in the previous three categories
    """
    if element.tag == 'tag':
        if lower.search(element.attrib['k']):
            keys['lower'] += 1
        elif lower_colon.search(element.attrib['k']):
            keys['lower_colon'] += 1
        elif problemchars.search(element.attrib['k']):
            keys['problemchars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_tags(filename):
    """
    Takes in a dataset in XML format, parses it, and executes key_type() for each element.
    Returns a dictionary with the count of each type of key.
    """
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

tag_type_dict = process_tags(dataset)
print(tag_type_dict)
We apparently have one tag with a problematic key. Let's find out which one it is. We know there's only one in this dataset, but for convenience, we'll have the function return a list in case other datasets have more than one problematic key.
def get_problemkeys(filename):
    """
    Takes in a dataset in XML format, parses it, and returns a list of
    the keys of tags containing problematic characters.
    """
    problemchars_list = []
    for _, element in ET.iterparse(filename):
        if element.tag == 'tag':
            if problemchars.search(element.attrib['k']):
                problemchars_list.append(element.attrib['k'])
    return problemchars_list

print(get_problemkeys(dataset))
It's just one key containing a space, which shouldn't be an issue.
Now let's get to cleaning the data.
We're going to audit the data to get all the different street types it contains. Then we will be able to build a dictionary upon which to base our correction.
from collections import defaultdict

street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
street_types = defaultdict(int)

def audit_street_type(street_types, street_name):
    """Extracts the street type (last word) from a street name and counts it."""
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        street_types[street_type] += 1

def print_sorted_dict(d, expression):
    """Prints a dictionary sorted by key, using the given format expression."""
    keys = sorted(d.keys(), key=lambda s: s.lower())
    for k in keys:
        v = d[k]
        print(expression % (k, v))

def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")

def audit(filename):
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            audit_street_type(street_types, elem.attrib['v'])
    print_sorted_dict(street_types, "%s: %d")
    return street_types

all_types = audit(dataset)
all_types
We can see a number of issues here: abbreviated street types (St, Ave, Rd), inconsistent capitalization, typos, and street names carrying information that belongs in other tags, such as addr:housenumber. Let's tackle these issues one at a time. First, we will correct the abbreviations.
Let's define a list of the street types we deem valid, and a dictionary of the corrections we want to implement.
expected = ['Artery', 'Avenue', 'Boulevard', 'Broadway', 'Commons', 'Court', 'Drive', 'Lane', 'Park', 'Parkway',
'Place', 'Road', 'Square', 'Street', 'Terrace', 'Trail', 'Turnpike', 'Wharf',
'Yard']
abbr_mapping = { 'Ave': 'Avenue',
'Ave.': 'Avenue',
'Ct': 'Court',
'Dr': 'Drive',
'HIghway': 'Highway',
'Hwy': 'Highway',
'Pl': 'Place',
'place': 'Place',
'Pkwy': 'Parkway',
'Rd': 'Road',
'rd.': 'Road',
'Sq.': 'Square',
'St': 'Street',
'st': 'Street',
'ST': 'Street',
'St,': 'Street',
'St.': 'Street',
'street': 'Street',
'Street.': 'Street'
}
First of all, let's print the problematic names. We don't need to print them all (no need to list every Ave or St), so we will only print the full names whose street type occurs fewer than 20 times and appears neither in expected nor in abbr_mapping.
typo_full_names = {}

def audit_street_name(street_types, street_name):
    """Collects full street names whose type is rare and not yet mapped."""
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if (all_types[street_type] < 20) and (street_type not in expected) and (street_type not in abbr_mapping):
            if street_type in typo_full_names:
                typo_full_names[street_type].append(street_name)
            else:
                typo_full_names.update({street_type: [street_name]})

def audit_name(filename):
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            audit_street_name(street_types, elem.attrib['v'])
    # print_sorted_dict(street_types)
    return typo_full_names

audit_name(dataset)
Now we can define other mapping dictionaries and apply new, individual corrections based on the information we have:

- Some names carry house or suite numbers; that information should appear in the addr:housenumber and addr:suitenumber tags
- The PO Box 846028 information appears in the addr:housenumber tag already
- Albany, Boylston, Cambridge (written Cambrdige here, with a typo), Dartmouth, Elm, Hampshire, Holland, Newbury are Boston street names
- Lafayette, Longwood and Winsor are Boston avenue names
- Pasteur corresponds to Avenue Louis Pasteur
- South Market Building is located on 4 South Market Street
- The Centers are correct; we will add Center to our expected list
- The Circles are correct; we will add Circle to our expected list
- Coolidge Corner should not appear here
- Museum of Science Driveway is correct; we will add Driveway to our expected list
- B Street Ext corresponds to B Street
- Fellsway is a parkway
- Fenway is of course Fenway Park
- Boylston Street, 5th Floor should only be Boylston Street; the 5th Floor information should appear in the tag addr:floornumber
- Stillings Street Garage is a garage located on 11 Stillings Street
- East Boston Greenway is correct
- H is actually 605 Hancock Street
- Faneuil Hall is located at 4 South Market Street
- Jamaicaway is correct
- LOMASNEY WAY, ROOF LEVEL should only be Lomasney Way
- Cummington Mall is correct, we will add Mall to our expected list
- Faneuil Hall Market is 1 Faneuil Hall Square
- Two Center Plaza should be 2 Center Plaza
- Park Plaza is correct
- Charles Street South should simply be Charles Street
- Sidney Street, 2nd floor should only be Sidney Street; the 2nd Floor information should appear in the tag addr:floornumber
- First Street, 18 floor should only be First Street; the 18th Floor information should appear in the tag addr:floornumber
- Windsor corresponds to Windsor Place in Somerville

expected.extend(['Center', 'Circle', 'Driveway', 'Mall'])
typo_mapping = { 'Albany': 'Albany Street',
'Boylston': 'Boylston Street',
'Cambrdige': 'Cambridge Street',
'Dartmouth': 'Dartmouth Street',
'Elm': 'Elm Street',
'Hampshire': 'Hampshire Street',
'Holland': 'Holland Street',
'Newbury': 'Newbury Street',
'Lafayette': 'Lafayette Avenue',
'Longwood': 'Longwood Avenue',
'Winsor': 'Winsor Avenue',
'Pasteur': 'Avenue Louis Pasteur',
'Corner': 'Webster Street',
'Building': {'South Market Street': 4},
'B Street Ext': 'B Street',
'Fellsway': 'Fellsway Parkway',
'Fenway': 'Fenway Park',
'Floor': {'Boylston Street': '5th Floor'},
'Garage': 'Stillings Street',
'H': 'Hancock Street',
'Hall': {'Faneuil Hall Square': 1},
'LEVEL': {'Lomasney Way': 'Roof Level'},
'Market': 'Faneuil Hall Square',
'Plaza': { 'Two Center Plaza': {'Center Plaza': 2}},
'South': 'Charles Street South',
'floor': { 'Sidney Street': '2nd Floor',
'First Street': '18th Floor'},
'Windsor': 'Windsor Place'
}
numbers_mapping = { '#12': {'Harvard Street': 12},
'#1302': {'Cambridge Street': 1302},
'#501': {'Bromfield Street': 501},
'104': {'Mill Street': 'Suite 104'},
'1100': {'First Street': 'Suite 1100'},
'1702': { 'Franklin Street': 'Suite 1702'},
'3': {'Kendall Square': 'Suite B3201'},
'303': {'First Street': 'Suite 303'},
'6': {'Atlantic Avenue': 700 }
}
po_box_mapping = { '846028': 'Albany Street'}
expected = sorted(expected)
expected
Finally, before running updates, let's see if any street name contains incorrect characters.
name_problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\t\r\n]')

def get_problem_names(filename):
    """
    Takes in a dataset in XML format, parses it, and returns a list with
    the values of street name tags containing problematic characters.
    """
    problemchars_list = []
    for _, element in ET.iterparse(filename):
        if is_street_name(element):
            if name_problem_chars.search(element.attrib['v']):
                problemchars_list.append(element.attrib['v'])
    return problemchars_list

print(get_problem_names(dataset))
Almost all of these entries are correct. We have two entries we need to investigate:

- Church Street, Harvard Square corresponds to a parish located on 1446 Massachusetts Avenue
- Massachusetts Ave; Mass Ave should only be Massachusetts Avenue
We will define one last dictionary to correct them, and then we will finally correct the values.
char_mapping = { 'Church Street, Harvard Square': {'Massachusetts Avenue': 1446},
'Massachusetts Ave; Mass Ave': 'Massachusetts Avenue'
}
We have identified all the incorrect names. We created dictionaries to update incorrect values. We can now clean the data and save it all in a final dictionary.
def typo_correct(street_name, street_type):
    """Returns the corrected name for a street matched in typo_mapping."""
    if isinstance(typo_mapping[street_type], str):
        name = typo_mapping[street_type]
    elif isinstance(typo_mapping[street_type], dict):
        if '2nd' in street_name:
            name = 'Sidney Street'
            # add attribute addr:floor '2nd Floor' typo_mapping[street_type]['Sidney Street']
        elif '18' in street_name:
            name = 'First Street'
            # add attribute addr:floor '18th Floor' typo_mapping[street_type]['First Street']
        elif '5th' in street_name:
            name = 'Boylston Street'
            # add attribute addr:floor '5th Floor' typo_mapping[street_type]['Boylston Street']
        elif street_type == 'LEVEL':
            name = 'Lomasney Way'
            # add attribute addr:floor 'Roof Level' typo_mapping[street_type]['Lomasney Way']
        elif 'Two Center' in street_name:
            name = 'Center Plaza'
            # add attribute addr:housenumber '2' typo_mapping[street_type]['Two Center Plaza']['Center Plaza']
        else:
            for key in typo_mapping[street_type]:
                name = key
                # add attribute addr:housenumber = value
    return name

def numbers_correct(street_name, street_type):
    """Returns the corrected name for a street matched in numbers_mapping."""
    if 'Suite' in street_name:
        for key in numbers_mapping[street_type]:
            name = key
            # add attribute addr:suitenumber = value
    else:
        for key in numbers_mapping[street_type]:
            name = key
            # add attribute addr:housenumber = value
    return name

def char_correct(street_name, street_type):
    """Returns the corrected name for a street matched in char_mapping."""
    print(street_name, street_type)
    if street_name + ' ' + street_type == 'Church Street, Harvard Square':
        for key in char_mapping[street_name + ' ' + street_type]:
            name = key
    elif street_name + ' ' + street_type == 'Massachusetts Ave; Mass Ave':
        name = char_mapping[street_name + ' ' + street_type]
    return name

def audit_abbreviations(filename):
    """Collects all street names whose type is not in the expected list."""
    problem_street_types = defaultdict(set)
    for event, elem in ET.iterparse(filename):
        if is_street_name(elem):
            expected_street_type(problem_street_types, elem.attrib['v'])
    return problem_street_types

def expected_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def update_name(name):
    """Returns the corrected version of a street name, using the mapping dictionaries."""
    street_type = name.split(' ')[-1]
    street_name = name.rsplit(' ', 1)[0]
    if (street_name + ' ' + street_type) in char_mapping:
        name = char_correct(street_name, street_type)
    elif street_type in abbr_mapping:
        name = street_name + ' ' + abbr_mapping[street_type]
    elif street_type in typo_mapping:
        name = typo_correct(street_name, street_type)
    elif street_type in numbers_mapping:
        name = numbers_correct(street_name, street_type)
    elif street_type in po_box_mapping:
        name = po_box_mapping[street_type]
    return name

def run_updates(filename):
    """Audits the dataset and builds a dictionary mapping incorrect names to corrections."""
    st_types = audit_abbreviations(filename)
    for st_type, ways in st_types.items():
        for name in ways:
            better_name = update_name(name)
            if better_name != name:
                corrected_names[name] = better_name
    return corrected_names

corrected_names = {}
corrected_names = run_updates(dataset)
print_sorted_dict(corrected_names, "%s: %s")
Good! We now have a dictionary of all the incorrect values and their corrections. We're ready to prepare the data for insertion into an SQL database.
SCHEMA = {
    'node': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'lat': {'required': True, 'type': 'float', 'coerce': float},
            'lon': {'required': True, 'type': 'float', 'coerce': float},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'node_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    },
    'way': {
        'type': 'dict',
        'schema': {
            'id': {'required': True, 'type': 'integer', 'coerce': int},
            'user': {'required': True, 'type': 'string'},
            'uid': {'required': True, 'type': 'integer', 'coerce': int},
            'version': {'required': True, 'type': 'string'},
            'changeset': {'required': True, 'type': 'integer', 'coerce': int},
            'timestamp': {'required': True, 'type': 'string'}
        }
    },
    'way_nodes': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'node_id': {'required': True, 'type': 'integer', 'coerce': int},
                'position': {'required': True, 'type': 'integer', 'coerce': int}
            }
        }
    },
    'way_tags': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'id': {'required': True, 'type': 'integer', 'coerce': int},
                'key': {'required': True, 'type': 'string'},
                'value': {'required': True, 'type': 'string'},
                'type': {'required': True, 'type': 'string'}
            }
        }
    }
}
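As a quick sanity check, here is a minimal sketch (with hypothetical sample values, not taken from the dataset) of how cerberus validates a record against this schema; the validation helper defined further down relies on the same mechanism:

import cerberus

validator = cerberus.Validator()

# A hand-written record shaped like the dicts shape_element() will return
sample = {
    'node': {'id': 1, 'lat': 42.3601, 'lon': -71.0589, 'user': 'example',
             'uid': 42, 'version': '2', 'changeset': 100,
             'timestamp': '2017-01-01T00:00:00Z'},
    'node_tags': [{'id': 1, 'key': 'amenity', 'value': 'cafe', 'type': 'regular'}]
}

print(validator.validate(sample, SCHEMA))  # True: the record matches the schema
print(validator.errors)                    # {} when validation succeeds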
import csv
import codecs
import pprint
import re
import xml.etree.cElementTree as ET
import cerberus
# import schema  # not needed: SCHEMA is defined inline above

OSM_PATH = "example.osm"  # template default; we pass `dataset` to process_map() below
NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
# Make sure the fields order in the csvs matches the column order in the sql table schema
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
def correct_element(v):
    """Returns the corrected value for v if we identified one, v itself otherwise."""
    if v in corrected_names:
        correct_value = corrected_names[v]
    else:
        correct_value = v
    return correct_value
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""
    # Note: problem_chars and the *_fields parameters come from the project
    # template and are not used in the body below.
    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []  # Handle secondary tags the same way for both node and way elements
    if element.tag == 'node':
        node_attribs['id'] = element.attrib['id']
        node_attribs['user'] = element.attrib['user']
        node_attribs['uid'] = element.attrib['uid']
        node_attribs['version'] = element.attrib['version']
        node_attribs['lat'] = element.attrib['lat']
        node_attribs['lon'] = element.attrib['lon']
        node_attribs['timestamp'] = element.attrib['timestamp']
        node_attribs['changeset'] = element.attrib['changeset']
        for node in element:
            tag_dict = {}
            tag_dict['id'] = element.attrib['id']
            if ':' in node.attrib['k']:
                tag_dict['type'] = node.attrib['k'].split(':', 1)[0]
                tag_dict['key'] = node.attrib['k'].split(':', 1)[-1]
                tag_dict['value'] = correct_element(node.attrib['v'])
            else:
                tag_dict['type'] = 'regular'
                tag_dict['key'] = node.attrib['k']
                tag_dict['value'] = correct_element(node.attrib['v'])
            tags.append(tag_dict)
    elif element.tag == 'way':
        way_attribs['id'] = element.attrib['id']
        way_attribs['user'] = element.attrib['user']
        way_attribs['uid'] = element.attrib['uid']
        way_attribs['version'] = element.attrib['version']
        way_attribs['timestamp'] = element.attrib['timestamp']
        way_attribs['changeset'] = element.attrib['changeset']
        n = 0
        for node in element:
            if node.tag == 'nd':
                way_dict = {}
                way_dict['id'] = element.attrib['id']
                way_dict['node_id'] = node.attrib['ref']
                way_dict['position'] = n
                n += 1
                way_nodes.append(way_dict)
            if node.tag == 'tag':
                tag_dict = {}
                tag_dict['id'] = element.attrib['id']
                if ':' in node.attrib['k']:
                    tag_dict['type'] = node.attrib['k'].split(':', 1)[0]
                    tag_dict['key'] = node.attrib['k'].split(':', 1)[-1]
                    tag_dict['value'] = correct_element(node.attrib['v'])
                else:
                    tag_dict['type'] = 'regular'
                    tag_dict['key'] = node.attrib['k']
                    tag_dict['value'] = correct_element(node.attrib['v'])
                tags.append(tag_dict)
    if element.tag == 'node':
        return {'node': node_attribs, 'node_tags': tags}
    elif element.tag == 'way':
        return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
# ================================================== #
# Helper Functions #
# ================================================== #
def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(iter(validator.errors.items()))
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_string = pprint.pformat(errors)
        raise Exception(message_string.format(field, error_string))
class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input (a Python 2 helper; unused below)"""
    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, str) else v) for k, v in row.items()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
# ================================================== #
# Main Function #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""
    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = csv.DictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = csv.DictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = csv.DictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = csv.DictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = csv.DictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)
                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])
process_map(dataset, validate=True)
import sqlite3
# Creating database on disk
sqlite_file = 'boston.db'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
c.execute('''DROP TABLE IF EXISTS nodes''')
c.execute('''DROP TABLE IF EXISTS nodes_tags''')
c.execute('''DROP TABLE IF EXISTS ways''')
c.execute('''DROP TABLE IF EXISTS ways_tags''')
c.execute('''DROP TABLE IF EXISTS ways_nodes''')
conn.commit()
QUERY_NODES = """
CREATE TABLE nodes (
id INTEGER NOT NULL,
lat REAL,
lon REAL,
user TEXT,
uid INTEGER,
version INTEGER,
changeset INTEGER,
timestamp TEXT
);
"""
QUERY_NODES_TAGS = """
CREATE TABLE nodes_tags (
id INTEGER,
key TEXT,
value TEXT,
type TEXT,
FOREIGN KEY (id) REFERENCES nodes(id)
);
"""
QUERY_WAYS = """
CREATE TABLE ways (
id INTEGER NOT NULL,
user TEXT,
uid INTEGER,
version INTEGER,
changeset INTEGER,
timestamp TEXT
);
"""
QUERY_WAYS_TAGS = """
CREATE TABLE ways_tags (
id INTEGER NOT NULL,
key TEXT NOT NULL,
value TEXT NOT NULL,
type TEXT,
FOREIGN KEY (id) REFERENCES ways(id)
);
"""
QUERY_WAYS_NODES = """
CREATE TABLE ways_nodes (
id INTEGER NOT NULL,
node_id INTEGER NOT NULL,
position INTEGER NOT NULL,
FOREIGN KEY (id) REFERENCES ways(id),
FOREIGN KEY (node_id) REFERENCES nodes(id)
);
"""
c.execute(QUERY_NODES)
c.execute(QUERY_NODES_TAGS)
c.execute(QUERY_WAYS)
c.execute(QUERY_WAYS_TAGS)
c.execute(QUERY_WAYS_NODES)
conn.commit()
Our tables are created. We now need to import our csv files into them.
with open('nodes.csv', 'rt') as fin:
    dr = csv.DictReader(fin)  # comma is the default delimiter
    to_db1 = [(i['id'], i['lat'], i['lon'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr]

with open('nodes_tags.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db2 = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

with open('ways.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db3 = [(i['id'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr]

with open('ways_tags.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db4 = [(i['id'], i['key'], i['value'], i['type']) for i in dr]

with open('ways_nodes.csv', 'rt') as fin:
    dr = csv.DictReader(fin)
    to_db5 = [(i['id'], i['node_id'], i['position']) for i in dr]
c.executemany("INSERT INTO nodes(id, lat, lon, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db1)
c.executemany("INSERT INTO nodes_tags(id, key, value, type) VALUES (?, ?, ?, ?);", to_db2)
c.executemany("INSERT INTO ways(id, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?);", to_db3)
c.executemany("INSERT INTO ways_tags(id, key, value, type) VALUES (?, ?, ?, ?);", to_db4)
c.executemany("INSERT INTO ways_nodes(id, node_id, position) VALUES (?, ?, ?);", to_db5)
conn.commit()
We already know how many nodes and ways the original data contains, from the tag counts above. Let's check that our database matches.
c.execute('SELECT COUNT(*) FROM nodes')
all_rows = c.fetchall()
print(all_rows)
c.execute('SELECT COUNT(*) FROM ways')
all_rows = c.fetchall()
print(all_rows)
We're good! Let's query away!
Who are the top 3 users who made the most modifications?
QUERY = '''
SELECT DISTINCT nodes.user, COUNT(*)
FROM nodes
GROUP BY nodes.uid
ORDER BY COUNT(*) DESC
LIMIT 3;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What proportion of the database did the top 10 users contribute?
QUERY = '''
SELECT DISTINCT nodes.user, COUNT(*) * 100.0 / (SELECT COUNT(*) FROM nodes)
FROM nodes
GROUP BY nodes.uid
ORDER BY (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM nodes)) DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What are the top 10 streets that contain the most nodes?
QUERY = '''
SELECT ways_tags.value, COUNT(*)
FROM ways_tags
WHERE ways_tags.key = 'name'
AND ways_tags.type = 'regular'
GROUP BY ways_tags.value
ORDER BY COUNT(*) DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
On average, how many nodes does a way contain?
QUERY = '''
SELECT AVG(Count)
FROM
(SELECT COUNT(*) as Count
FROM ways
JOIN ways_nodes
ON ways.id = ways_nodes.id
GROUP BY ways.id);
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
What are the top 10 amenities in Boston?
QUERY = '''
SELECT value, COUNT(*) as Count
FROM nodes_tags
WHERE key='amenity'
GROUP BY value
ORDER BY Count DESC
LIMIT 10;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
There's definitely room to sit down in Boston.
What's the religious landscape like in Boston?
QUERY = '''
SELECT nodes_tags.value, COUNT(*) as Count
FROM nodes_tags
JOIN
(SELECT DISTINCT(id)
FROM nodes_tags
WHERE value='place_of_worship') as Sub
ON nodes_tags.id=Sub.id
WHERE nodes_tags.key='religion'
GROUP BY nodes_tags.value
ORDER BY Count DESC;
'''
c.execute(QUERY)
all_rows = c.fetchall()
print(all_rows)
Given the strong presence of Irish descendants in Boston, it would have been interesting to have a distinction between Catholics and Protestants.
I focused this analysis on identifying irregular street names, cleaning them, and reformatting them so that one street type corresponds to one term (Avenue, Street), getting rid of abbreviations.
To improve the cleaning result, the process should also include the insertion of new tags. Some street names contained valuable information, about floors or suite numbers for example, that deserves to be preserved. Inserting tags would let us keep each piece of information in its proper place, preserving both the street name and the floor or suite number; a minimal sketch follows.
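Here is one way that could look, as a minimal sketch: a helper that splits floor information out of a street name and returns it as an extra tag dict ready for the node_tags/way_tags CSVs. The helper name and the regex are mine, and the patterns are illustrative, not exhaustive:

import re

# Hypothetical helper: split floor information out of a street name value
FLOOR_RE = re.compile(r'^(?P<street>.+?),?\s+(?P<floor>\d+(st|nd|rd|th)?\s+[Ff]loor)$')

def split_floor_info(street_value):
    """Returns (clean_street_name, extra_tags) for values like 'Sidney Street, 2nd floor'."""
    m = FLOOR_RE.match(street_value)
    if m:
        extra = {'key': 'floornumber', 'type': 'addr', 'value': m.group('floor')}
        return m.group('street'), [extra]
    return street_value, []

# Usage: inside shape_element(), the extra tags would be appended alongside the
# corrected addr:street tag, each carrying the element's id.
name, extra_tags = split_floor_info('Sidney Street, 2nd floor')
print(name)        # Sidney Street
print(extra_tags)  # [{'key': 'floornumber', 'type': 'addr', 'value': '2nd floor'}]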
This data wrangling process should probably be pushed a step further to verify and clean zip codes as well, before the data is sent back to OpenStreetMap. Zip codes could present issues similar to the ones encountered with street names; a first audit pass might look like the sketch below.
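For instance, a minimal audit of addr:postcode values, assuming a valid code is five digits with an optional ZIP+4 extension (the helper name and the validity rule are my assumptions):

import re
from collections import defaultdict
import xml.etree.cElementTree as ET

# Assumed valid form: 5 digits, with an optional four-digit extension
ZIP_RE = re.compile(r'^\d{5}(-\d{4})?$')

def audit_zipcodes(filename):
    """Counts addr:postcode values that do not look like plain ZIP or ZIP+4 codes."""
    bad_zips = defaultdict(int)
    for _, elem in ET.iterparse(filename):
        if elem.tag == 'tag' and elem.attrib['k'] == 'addr:postcode':
            if not ZIP_RE.match(elem.attrib['v']):
                bad_zips[elem.attrib['v']] += 1
    return bad_zips

print(audit_zipcodes(dataset))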