Manipulating Geospatial Data

import math
from pathlib import Path

import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from folium import Marker, GeoJson
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim

pd.set_option("display.max_columns", None)

plt.rcParams["figure.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0

PROJECT_ROOT = Path.cwd().parent.parent

Geocoding

Geocoding is the process of converting the name of a place or an address to a location on a map.

In the imports above, Nominatim refers to the geocoding software that will be used to generate locations.

We begin by instantiating the geocoder. Then, we need only apply the name or address as a Python string. (In this case, we supply “Pyramid of Khufu”, also known as the Great Pyramid of Giza.)

If the geocoding is successful, it returns a geopy.location.Location object with two important attributes:

  • the “point” attribute contains the (latitude, longitude) location, and
  • the “address” attribute contains the full address.
geolocator = Nominatim(user_agent="kaggle_learn")
location = geolocator.geocode("Pyramid of Khufu")

print(location.point)
print(location.address)
29 58m 44.976s N, 31 8m 3.17625s E
هرم خوفو, شارع ابو الهول السياحي, نزلة البطران, الجيزة, 12125, مصر

The value for the “point” attribute is a geopy.point.Point object. We can get the latitude and longitude from the latitude and longitude attributes, respectively.

point = location.point
print("Latitude:", point.latitude)
print("Longitude:", point.longitude)
Latitude: 29.97916
Longitude: 31.134215625236113

It’s often the case that we’ll need to geocode many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.

universities = pd.read_csv(f"{PROJECT_ROOT}/data/kaggle_geospatial/top_universities.csv")
universities.head()
Name
0 University of Oxford
1 University of Cambridge
2 Imperial College London
3 ETH Zurich
4 UCL

Then we can use a lambda function to apply the geocoder to every row in the DataFrame. (We use a try/except statement to account for the case that the geocoding is unsuccessful.)

def my_geocoder(row):
    try:
        point = geolocator.geocode(row).point
        return pd.Series({"Latitude": point.latitude, "Longitude": point.longitude})
    except:
        return None


universities[["Latitude", "Longitude"]] = universities.apply(lambda x: my_geocoder(x["Name"]), axis=1)

print(f"{(1 - np.isnan(universities['Latitude']).mean()) * 100}% of addresses were geocoded")
95.0% of addresses were geocoded!
universities = universities.loc[~np.isnan(universities["Latitude"])]
universities = gpd.GeoDataFrame(
    universities, geometry=gpd.points_from_xy(universities.Longitude, universities.Latitude), crs="EPSG:4326"
)
universities.head()
Name Latitude Longitude geometry
0 University of Oxford 51.759037 -1.252430 POINT (-1.25243 51.75904)
1 University of Cambridge 52.210946 0.092005 POINT (0.09200 52.21095)
2 Imperial College London 51.498959 -0.175641 POINT (-0.17564 51.49896)
3 ETH Zurich 47.413218 8.537491 POINT (8.53749 47.41322)
4 UCL 51.521785 -0.135151 POINT (-0.13515 51.52179)

Next, we visualize all of the locations that were returned by the geocoder. Notice that a few of the locations are certainly inaccurate, as they’re not in Europe!

m = folium.Map(location=[54, 15], tiles="openstreetmap", zoom_start=2)

for idx, row in universities.iterrows():
    Marker([row["Latitude"], row["Longitude"]], popup=row["Name"]).add_to(m)

m

Table joins

Attribute join

You already know how to use pd.DataFrame.join() to combine information from multiple DataFrames with a shared index. We refer to this way of joining data (by sampling matching values in the index) as an attribute join.

When performing an attribute join with a GeoDataFrame, it’s best to use gpd.GeoDataFrame.merge(). To illustrate this, we’ll work with a GeoDataFrame europe_boundaries containing the boundaries for every country in Europe. The first five rows of this GeoDataFrame are printed below.

world = gpd.read_file(f"{PROJECT_ROOT}/data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
europe = world.loc[world["CONTINENT"] == "Europe"].reset_index(drop=True)

europe_stats = europe[["NAME", "POP_EST", "GDP_MD"]]
europe_boundaries = europe[["NAME", "geometry"]]

europe_boundaries.head()
NAME geometry
0 Russia MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...
1 Norway MULTIPOLYGON (((15.14282 79.67431, 15.52255 80...
2 France MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3...
3 Sweden POLYGON ((11.02737 58.85615, 11.46827 59.43239...
4 Belarus POLYGON ((28.17671 56.16913, 29.22951 55.91834...

We’ll join it with a DataFrame europe_stats containing the estimated population and gross domestic product (GDP) for each country.

europe_stats.head()
NAME POP_EST GDP_MD
0 Russia 144373535.0 1699876
1 Norway 5347896.0 403336
2 France 67059887.0 2715518
3 Sweden 10285453.0 530883
4 Belarus 9466856.0 63080

We do the attribute join in the code cell below. The on argument is set to the column NAME that is used to match rows in europe_boundaries to rows in europe_stats.

europe = europe_boundaries.merge(europe_stats, on="NAME")
europe.head()
NAME geometry POP_EST GDP_MD
0 Russia MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... 144373535.0 1699876
1 Norway MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... 5347896.0 403336
2 France MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... 67059887.0 2715518
3 Sweden POLYGON ((11.02737 58.85615, 11.46827 59.43239... 10285453.0 530883
4 Belarus POLYGON ((28.17671 56.16913, 29.22951 55.91834... 9466856.0 63080

Spatial join

Another type of join is a spatial join. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in the geometry column. For instance, we already have a GeoDataFrame universities containing geocoded addresses of European universities.

Then we can use a spatial join to match each university to its corresponding country. We do this with gpd.sjoin().

european_universities = gpd.sjoin(universities, europe)

print(f"We located {len(universities)} universities.")
print(
    f"Only {len(european_universities)} of the universities were located in Europe (in {len(european_universities['NAME'].unique())} different countries)."
)
european_universities.head()
We located 95 universities.
Only 90 of the universities were located in Europe (in 15 different countries).
Name Latitude Longitude geometry index_right NAME POP_EST GDP_MD
0 University of Oxford 51.759037 -1.252430 POINT (-1.25243 51.75904) 28 United Kingdom 66834405.0 2829108
1 University of Cambridge 52.210946 0.092005 POINT (0.09200 52.21095) 28 United Kingdom 66834405.0 2829108
2 Imperial College London 51.498959 -0.175641 POINT (-0.17564 51.49896) 28 United Kingdom 66834405.0 2829108
4 UCL 51.521785 -0.135151 POINT (-0.13515 51.52179) 28 United Kingdom 66834405.0 2829108
5 London School of Economics and Political Science 51.514261 -0.116734 POINT (-0.11673 51.51426) 28 United Kingdom 66834405.0 2829108

The spatial join above looks at the “geometry” columns in both GeoDataFrames. If a Point object from the universities GeoDataFrame intersects a Polygon object from the europe DataFrame, the corresponding rows are combined and added as a single row of the european_universities DataFrame. Otherwise, countries without a matching university (and universities without a matching country) are omitted from the results.

The gpd.sjoin() method is customizable for different types of joins, through the how and op arguments. For instance, you can do the equivalent of a SQL left (or right) join by setting how='left' or how='right'.

Proximity Analysis

We have a dataset from the US Environmental Protection Agency (EPA) that tracks releases of toxic chemicals in Philadelphia.

releases = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/toxic_release_pennsylvania/toxic_release_pennsylvania/toxic_release_pennsylvania.shp"
)
releases.head()
YEAR CITY COUNTY ST LATITUDE LONGITUDE CHEMICAL UNIT_OF_ME TOTAL_RELE geometry
0 2016 PHILADELPHIA PHILADELPHIA PA 40.005901 -75.072103 FORMIC ACID Pounds 0.160 POINT (2718560.227 256380.179)
1 2016 PHILADELPHIA PHILADELPHIA PA 39.920120 -75.146410 ETHYLENE GLYCOL Pounds 13353.480 POINT (2698674.606 224522.905)
2 2016 PHILADELPHIA PHILADELPHIA PA 40.023880 -75.220450 CERTAIN GLYCOL ETHERS Pounds 104.135 POINT (2676833.394 261701.856)
3 2016 PHILADELPHIA PHILADELPHIA PA 39.913540 -75.198890 LEAD COMPOUNDS Pounds 1730.280 POINT (2684030.004 221697.388)
4 2016 PHILADELPHIA PHILADELPHIA PA 39.913540 -75.198890 BENZENE Pounds 39863.290 POINT (2684030.004 221697.388)

We also have a dataset that contains readings from air quality monitoring stations in Philadelphia.

stations = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations.shp"
)
stations.head()
SITE_NAME ADDRESS BLACK_CARB ULTRAFINE_ CO SO2 OZONE NO2 NOY_NO PM10 PM2_5 SPECIATED_ PM_COURSE CARBONYLS PAMS_VOC TSP_11101 TSP_METALS TSP_LEAD TOXICS_TO1 MET COMMUNITY_ LATITUDE LONGITUDE geometry
0 LAB 1501 East Lycoming Avenue N N Y N Y Y Y N Y N N Y Y N Y N y N N 40.008606 -75.097624 POINT (2711384.641 257149.310)
1 ROX Eva and Dearnley Streets N N N N N N N N N N N Y N N Y N Y N N 40.050461 -75.236966 POINT (2671934.290 271248.900)
2 NEA Grant Avenue and Ashton Street N N N N Y N N N N N N N N N N N N Y N 40.072073 -75.013128 POINT (2734326.638 280980.247)
3 CHS 500 South Broad Street N N N N N N N N N N N Y N N Y N Y N N 39.944510 -75.165442 POINT (2693078.580 233247.101)
4 NEW 2861 Lewis Street N N Y Y Y N Y Y Y Y Y N N Y N Y N Y N 39.991688 -75.080378 POINT (2716399.773 251134.976)

Measuring distance

To measure distances between points from two different GeoDataFrames, we first have to make sure that they use the same coordinate reference system (CRS). Thankfully, this is the case here, where both use EPSG 2272.

print(stations.crs)
print(releases.crs)
EPSG:2272
EPSG:2272

We also check the CRS to see which units it uses (meters, feet, or something else). In this case, EPSG 2272 has units of feet. (If you like, you can check this here.)

It’s relatively straightforward to compute distances in GeoPandas. The code cell below calculates the distance (in feet) between a relatively recent release incident in recent_release and every station in the stations GeoDataFrame.

recent_release = releases.iloc[360]

# Measure distance from release to each station
distances = stations.geometry.distance(recent_release.geometry)
distances
0     44778.509761
1     51006.456589
2     77744.509207
3     14672.170878
4     43753.554393
5      4711.658655
6     23197.430858
7     12072.823097
8     79081.825506
9      3780.623591
10    27577.474903
11    19818.381002
dtype: float64

Using the calculated distances, we can obtain statistics like the mean distance to each station.

print(f"Mean distance to monitoring stations: {round(distances.mean(), 2)} feet")
Mean distance to monitoring stations: 33516.28 feet

Or, we can get the closest monitoring station.

print(f"Closest monitoring station ({distances.min()} feet):")
print(stations.iloc[distances.idxmin()][["ADDRESS", "LATITUDE", "LONGITUDE"]])
Closest monitoring station (3780.623590556444 feet):
ADDRESS      3100 Penrose Ferry Road
LATITUDE                    39.91279
LONGITUDE                 -75.185448
Name: 9, dtype: object

Creating a buffer

If we want to understand all points on a map that are some radius away from a point, the simplest way is to create a buffer.

The code cell below creates a GeoSeries two_mile_buffer containing 12 different Polygon objects. Each polygon is a buffer of 2 miles (or, 2*5280 feet) around a different air monitoring station.

two_mile_buffer = stations.geometry.buffer(2 * 5280)
two_mile_buffer.head()
0    POLYGON ((2721944.641 257149.310, 2721893.792 ...
1    POLYGON ((2682494.290 271248.900, 2682443.441 ...
2    POLYGON ((2744886.638 280980.247, 2744835.789 ...
3    POLYGON ((2703638.580 233247.101, 2703587.731 ...
4    POLYGON ((2726959.773 251134.976, 2726908.924 ...
dtype: geometry

We use folium.GeoJson() to plot each polygon on a map. Note that since folium requires coordinates in latitude and longitude, we have to convert the CRS to EPSG 4326 before plotting.

m = folium.Map(location=[39.9526, -75.1652], zoom_start=11)

HeatMap(data=releases[["LATITUDE", "LONGITUDE"]], radius=15).add_to(m)

for idx, row in stations.iterrows():
    Marker([row["LATITUDE"], row["LONGITUDE"]]).add_to(m)

GeoJson(two_mile_buffer.to_crs(epsg=4326)).add_to(m)

m
Make this Notebook Trusted to load map: File -> Trust Notebook

Now, to test if a toxic release occurred within 2 miles of any monitoring station, we could run 12 different tests for each polygon (to check individually if it contains the point).

But a more efficient way is to first collapse all of the polygons into a MultiPolygon object. We do this with the unary_union attribute.

my_union = two_mile_buffer.geometry.unary_union

my_union

We use the contains() method to check if the multipolygon contains a point. We’ll use the release incident from earlier, which we know is roughly 3781 feet to the closest monitoring station.

my_union.contains(releases.iloc[360].geometry)
True

Not all releases occured within two miles of an air monitoring station

my_union.contains(releases.iloc[358].geometry)
False