2 Geocoding Resource Locations

2.1 Overview

A common goal in opioid environment research is to calculate and compare access metrics to different providers of Medications for Opioid Overuse Disorder (MOUDs). Before we can run any analytics on the resource location data, we need to convert resource addresses to spatial data points, which can be then used to calculate access metrics.

Geocoding is the process of converting addresses (like a street address) into geographic coordinates using a known coordinate reference system. We can then use these coordinates (latitude, longitude) to spatially enable our data. This means we convert to a spatial data frame (sf) within R for spatial analysis within our R session, and then save as a shapefile (a spatial data format) for future use. In this tutorial we demonstrate how to geocode resource location addresses and convert to spatial data points that can be used for future mapping and geospatial analysis.

Our objectives are thus to:

  • Geocode addresses to get geographic coordinates
  • Visualize the resource locations as points on a map in R
  • Transform a flat file (.CSV) to a spatially enabled shapefile (.SHP)

2.2 Environment Setup

To replicate the code & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.

2.2.1 Input/Output

Our input will be a CSV file that include addresses of our resources. This files can be found here.

  • Chicago Methadone Clinics, chicago_methadone.csv

We will convert these addresses to geographic coordinates using an appropriate coordinate reference system (CRS), and then spatially enable the data for mapping. We will then export the spatial dataframe as a shapefile.

2.2.2 Load Libraries

We will use the following packages in this tutorial:

  • sf: to manipulate spatial data
  • tmap: to visualize and create maps
  • tidygeocoder: to convert addresses to geographic coordinates

Then load the libraries for use. Note: The messages you see about GEOS, GDAL, and PROJ refer to software libraries that allow you to work with spatial data.

library(sf)
library(tidygeocoder)
library(tmap)

2.2.3 Load Data

We will use a CSV that includes methadone clinic addresses in Chicago as an example. We start with a small dataset to test our geocoding workflow, as best practice.

Let’s take a look at the first few rows of the dataset. Our data includes addresses but not geographic coordinates.

methadoneClinics <- read.csv("data/chicago_methadone_nogeometry.csv")
head(methadoneClinics)
##   X                                                         Name
## 1 1                Chicago Treatment and Counseling Center, Inc.
## 2 2                      Sundace Methadone Treatment Center, LLC
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview
## 4 4                                        PDSSC - Chicago, Inc.
## 5 5                          Center for Addictive Problems, Inc.
## 6 6                                Family Guidance Centers, Inc.
##                   Address    City State   Zip
## 1 4453 North Broadway st. Chicago    IL 60640
## 2 4545 North Broadway St. Chicago    IL 60640
## 3    3934 N. Lincoln Ave. Chicago    IL 60613
## 4     2260 N. Elston Ave. Chicago    IL 60614
## 5        609 N. Wells St. Chicago    IL 60654
## 6     310 W. Chicago Ave. Chicago    IL 60654

2.3 Geocode addresses

2.3.1 Quality Control

Before geocoding, perform an initial quality check on the data. Note that the address, city, state, and zip code are all separated as different columns. This will make it easier to stitch together for a coherent, standard address for geocoding. Furthermore, there do not appear to be any major errors. The city name “Chicago” is spelled consistently, without missing addresses or zip codes. This will not always be the case, unfortunately. Data must be cleaned prior to loading into a geocoding service.

2.3.2 Selecting Geocoding Service

To get a geographic coordinate for each site, we’ll need to geocode. There are a number of geocoding options in R; here we use we the tidygeocoder package. It uses mutliple geocoding services, providing the user with an option to choose. It also provides the option to use a cascade method which queries other geocoding services incase the default method fails to provide coordinates.

When considering which geocoding service to use, consider scale and potential geocoding errors. Some geocoding services are more accurate than others, so if your coordinates were not coded precisely, try a different service. If you have thousands of addresses to geocode, you may require more complex data pipelines. The default method used here is via US Census geocoder, which allows around 10,000 addresses to be geocoded at once. Others have varying daily limits. The Google Maps API and ESRI Geocoding service are additional high-quality geocoding services with varying cost associated.

2.3.3 Test Geocoding Service

Before geocoding your entire dataset, first review the documentation for the geocoding service you’ll be using. In our example we use tidygeocoder, with documentation found here. Let’s test the service by starting with one address:

sample <- geo("4545 North Broadway St. Chicago, IL", lat = latitude, long = longitude, method = 'cascade')
sample
## # A tibble: 1 x 4
##   address                            latitude longitude geo_method
##   <chr>                                 <dbl>     <dbl> <chr>     
## 1 4545 North Broadway St. Chicago, …     42.0     -87.7 census

What did the output look like? Get familiar with the input parameters, expected output, and review the documentation further if needed.

2.3.4 Prepare input parameter

To apply the function to multiple addresses, we first we need ensure that we have a character vector of full addresses.

str(methadoneClinics)
## 'data.frame':    27 obs. of  6 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name   : chr  "Chicago Treatment and Counseling Center, Inc." "Sundace Methadone Treatment Center, LLC" "Soft Landing Interventions/DBA Symetria Recovery of Lakeview" "PDSSC - Chicago, Inc." ...
##  $ Address: chr  "4453 North Broadway st." "4545 North Broadway St." "3934 N. Lincoln Ave." "2260 N. Elston Ave." ...
##  $ City   : chr  "Chicago" "Chicago" "Chicago" "Chicago" ...
##  $ State  : chr  "IL" "IL" "IL" "IL" ...
##  $ Zip    : int  60640 60640 60613 60614 60654 60654 60651 60607 60607 60616 ...

Next we convert all fields to character first to avoid issues with factors (a common peril of R!).

methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address), 
                                  as.character(methadoneClinics$City),
                                  as.character(methadoneClinics$State), 
                                  as.character(methadoneClinics$Zip))
head(methadoneClinics)
##   X                                                         Name
## 1 1                Chicago Treatment and Counseling Center, Inc.
## 2 2                      Sundace Methadone Treatment Center, LLC
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview
## 4 4                                        PDSSC - Chicago, Inc.
## 5 5                          Center for Addictive Problems, Inc.
## 6 6                                Family Guidance Centers, Inc.
##                   Address    City State   Zip
## 1 4453 North Broadway st. Chicago    IL 60640
## 2 4545 North Broadway St. Chicago    IL 60640
## 3    3934 N. Lincoln Ave. Chicago    IL 60613
## 4     2260 N. Elston Ave. Chicago    IL 60614
## 5        609 N. Wells St. Chicago    IL 60654
## 6     310 W. Chicago Ave. Chicago    IL 60654
##                                    fullAdd
## 1 4453 North Broadway st. Chicago IL 60640
## 2 4545 North Broadway St. Chicago IL 60640
## 3    3934 N. Lincoln Ave. Chicago IL 60613
## 4     2260 N. Elston Ave. Chicago IL 60614
## 5        609 N. Wells St. Chicago IL 60654
## 6     310 W. Chicago Ave. Chicago IL 60654

2.3.5 Batch Geocoding

Now we are ready to geocode the addresses. The “tibble” data structure below shows us the address, latitude, longitude and also the geocoding service used to get the coordinates. Note that geocoding takes a bit of time, so patience is required.

geoCodedClinics <- methadoneClinics %>% 
                            geocode(methadoneClinics, address = 'fullAdd', 
                                    lat = latitude, long = longitude, method = 'cascade')
geoCodedClinics
## # A tibble: 27 x 10
##        X Name  Address City  State   Zip fullAdd latitude
##    <int> <chr> <chr>   <chr> <chr> <int> <chr>      <dbl>
##  1     1 "Chi… 4453 N… Chic… IL    60640 4453 N…     NA  
##  2     2 "Sun… 4545 N… Chic… IL    60640 4545 N…     NA  
##  3     3 "Sof… 3934 N… Chic… IL    60613 3934 N…     42.0
##  4     4 "PDS… 2260 N… Chic… IL    60614 2260 N…     41.9
##  5     5 "Cen… 609 N.… Chic… IL    60654 609 N.…     41.9
##  6     6 "Fam… 310 W.… Chic… IL    60654 310 W.…     41.9
##  7     7 "A R… 3809 W… Chic… IL    60651 3809 W…     41.9
##  8     8 "*"   140 N.… Chic… IL    60607 140 N.…     41.9
##  9     9 "Hea… 210 N.… Chic… IL    60607 210 N.…     41.9
## 10    10 "Spe… 2630 S… Chic… IL    60616 2630 S…     41.8
## # … with 17 more rows, and 2 more variables: longitude <dbl>,
## #   geo_method <chr>

The code worked for all addresses except the first two. We already resolved the 4545 North Broadway St.address above but here in the dataframe we get NAs. It is pointing to some issue with the string input. These were missed in the previous quality check, but give us a clue to the types of errors we could see if geocoding more addresses. Unfortunately, such quirks are common across geocoding services in R and we just have to handle them. We manually update the full address strings to get apprpriate coordinates.

methadoneClinics[1,'fullAdd'] <- '4453 North Broadway St.,Chicago IL 60640'
methadoneClinics[2,'fullAdd'] <- '4545 North Broadway St.,Chicago IL 60640'

Now we can geocode the full suite of addresses with success:

geoCodedClinics <- methadoneClinics %>% 
                            geocode(methadoneClinics, address = 'fullAdd', 
                                    lat = latitude, long = longitude, method = 'cascade')
geoCodedClinics
## # A tibble: 27 x 10
##        X Name  Address City  State   Zip fullAdd latitude
##    <int> <chr> <chr>   <chr> <chr> <int> <chr>      <dbl>
##  1     1 "Chi… 4453 N… Chic… IL    60640 4453 N…     42.0
##  2     2 "Sun… 4545 N… Chic… IL    60640 4545 N…     42.0
##  3     3 "Sof… 3934 N… Chic… IL    60613 3934 N…     42.0
##  4     4 "PDS… 2260 N… Chic… IL    60614 2260 N…     41.9
##  5     5 "Cen… 609 N.… Chic… IL    60654 609 N.…     41.9
##  6     6 "Fam… 310 W.… Chic… IL    60654 310 W.…     41.9
##  7     7 "A R… 3809 W… Chic… IL    60651 3809 W…     41.9
##  8     8 "*"   140 N.… Chic… IL    60607 140 N.…     41.9
##  9     9 "Hea… 210 N.… Chic… IL    60607 210 N.…     41.9
## 10    10 "Spe… 2630 S… Chic… IL    60616 2630 S…     41.8
## # … with 17 more rows, and 2 more variables: longitude <dbl>,
## #   geo_method <chr>

2.4 Convert to Spatial Data

While we have geographic coordinates loaded in our data, it is still not spatially enabled. To convert to a spatial data format, we have to enable to coordinate reference system that connects the latitude and longitude recorded to actual points on Earth.

2.4.1 Spatial Reference Systems

There are thousands of ways to model the Earth, and each requires a different spatial reference system. This is a very complicated domain of spatial applications (for a primer see here), but for our purposes, we simplify by using a geodetic CRS that uses coordinates longitude and latitude. Not all coordinates will appear as a latitude/longitude, however, so it’s important to at least check for the CRS used when working with geographic data. The lat/long coordinates provided by the geocoding service we used report data using the CRS coded as 4326, a World Geodetic System (WGS84) model also used by Google Earth and many other applications. In this system, distance is measured as degrees and distorted. So while useful for visualizing points, we will need to convert to another CRS for other types of spatial analysis.

2.4.2 Enable Points

Next we convert our dataframe to a spatial data frame using the st_as_sf() function. The coords argument specifies which two columns are the X and Y for your data. We set the crs argument equal to 4326.

Please note longitude is entered as first column rather than the latitude. It is a very common mistake. The X, Y field actually refers to longitude, latitude.

methadoneSf <- st_as_sf(geoCodedClinics, 
                        coords = c("longitude", "latitude"),
                        crs = 4326)
head(data.frame(methadoneSf))
##   X                                                         Name
## 1 1                Chicago Treatment and Counseling Center, Inc.
## 2 2                      Sundace Methadone Treatment Center, LLC
## 3 3 Soft Landing Interventions/DBA Symetria Recovery of Lakeview
## 4 4                                        PDSSC - Chicago, Inc.
## 5 5                          Center for Addictive Problems, Inc.
## 6 6                                Family Guidance Centers, Inc.
##                   Address    City State   Zip
## 1 4453 North Broadway st. Chicago    IL 60640
## 2 4545 North Broadway St. Chicago    IL 60640
## 3    3934 N. Lincoln Ave. Chicago    IL 60613
## 4     2260 N. Elston Ave. Chicago    IL 60614
## 5        609 N. Wells St. Chicago    IL 60654
## 6     310 W. Chicago Ave. Chicago    IL 60654
##                                    fullAdd geo_method
## 1 4453 North Broadway St.,Chicago IL 60640        osm
## 2 4545 North Broadway St.,Chicago IL 60640        osm
## 3    3934 N. Lincoln Ave. Chicago IL 60613     census
## 4     2260 N. Elston Ave. Chicago IL 60614     census
## 5        609 N. Wells St. Chicago IL 60654     census
## 6     310 W. Chicago Ave. Chicago IL 60654     census
##                     geometry
## 1 POINT (-87.65566 41.96321)
## 2 POINT (-87.65694 41.96475)
## 3 POINT (-87.67818 41.95331)
## 4 POINT (-87.67407 41.92269)
## 5 POINT (-87.63409 41.89268)
## 6 POINT (-87.63636 41.89657)

Note that this is a data frame, but that it has a final column called “geometry” that stores the spatial information.

2.4.3 Visualize Points

We can now plot the location of the methadone clinics with base R. This is a recommended step to confirm that you translated your coordinates correctly. A common mistake is switching the lat/long values, so your points could plot across the globe. If that happens, repeat the step above with flipped long/lat values.

First we switch the tmap mode to view so we can look at the points with a live basemap layer.

tmap_mode("view")
## tmap mode set to interactive viewing

Next, we plot our points as dots, and add the basemap.

tm_shape(methadoneSf) + tm_dots() + tm_basemap("OpenStreetMap")

2.4.4 Save Data

Finally, we save this spatial dataframe as a shapefile which can be used for further spatial analysis.

write_sf(methadoneSf, "data/methadone_clinics.shp")