Manipulating Geospatial Data

import math
from pathlib import Path

import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from folium import Marker, GeoJson
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim

pd.set_option("display.max_columns", None)

plt.rcParams["figure.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0

PROJECT_ROOT = Path.cwd().parent.parent

Geocoding

Geocoding is the process of converting the name of a place or an address to a location on a map.

In the imports above, Nominatim refers to the geocoding software that will be used to generate locations.

We begin by instantiating the geocoder. Then, we need only apply the name or address as a Python string. (In this case, we supply “Pyramid of Khufu”, also known as the Great Pyramid of Giza.)

If the geocoding is successful, it returns a geopy.location.Location object with two important attributes:

the “point” attribute contains the (latitude, longitude) location, and
the “address” attribute contains the full address.

geolocator = Nominatim(user_agent="kaggle_learn")
location = geolocator.geocode("Pyramid of Khufu")

print(location.point)
print(location.address)

29 58m 44.976s N, 31 8m 3.17625s E
هرم خوفو, شارع ابو الهول السياحي, نزلة البطران, الجيزة, 12125, مصر

The value for the “point” attribute is a geopy.point.Point object. We can get the latitude and longitude from the latitude and longitude attributes, respectively.

point = location.point
print("Latitude:", point.latitude)
print("Longitude:", point.longitude)

Latitude: 29.97916
Longitude: 31.134215625236113

It’s often the case that we’ll need to geocode many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.

universities = pd.read_csv(f"{PROJECT_ROOT}/data/kaggle_geospatial/top_universities.csv")
universities.head()

	Name
0	University of Oxford
1	University of Cambridge
2	Imperial College London
3	ETH Zurich
4	UCL

Then we can use a lambda function to apply the geocoder to every row in the DataFrame. (We use a try/except statement to account for the case that the geocoding is unsuccessful.)

def my_geocoder(row):
    try:
        point = geolocator.geocode(row).point
        return pd.Series({"Latitude": point.latitude, "Longitude": point.longitude})
    except:
        return None


universities[["Latitude", "Longitude"]] = universities.apply(lambda x: my_geocoder(x["Name"]), axis=1)

print(f"{(1 - np.isnan(universities['Latitude']).mean()) * 100}% of addresses were geocoded")

95.0% of addresses were geocoded!

universities = universities.loc[~np.isnan(universities["Latitude"])]
universities = gpd.GeoDataFrame(
    universities, geometry=gpd.points_from_xy(universities.Longitude, universities.Latitude), crs="EPSG:4326"
)
universities.head()

	Name	Latitude	Longitude	geometry
0	University of Oxford	51.759037	-1.252430	POINT (-1.25243 51.75904)
1	University of Cambridge	52.210946	0.092005	POINT (0.09200 52.21095)
2	Imperial College London	51.498959	-0.175641	POINT (-0.17564 51.49896)
3	ETH Zurich	47.413218	8.537491	POINT (8.53749 47.41322)
4	UCL	51.521785	-0.135151	POINT (-0.13515 51.52179)

Next, we visualize all of the locations that were returned by the geocoder. Notice that a few of the locations are certainly inaccurate, as they’re not in Europe!

m = folium.Map(location=[54, 15], tiles="openstreetmap", zoom_start=2)

for idx, row in universities.iterrows():
    Marker([row["Latitude"], row["Longitude"]], popup=row["Name"]).add_to(m)

m

Table joins

Attribute join

You already know how to use pd.DataFrame.join() to combine information from multiple DataFrames with a shared index. We refer to this way of joining data (by sampling matching values in the index) as an attribute join.

When performing an attribute join with a GeoDataFrame, it’s best to use gpd.GeoDataFrame.merge(). To illustrate this, we’ll work with a GeoDataFrame europe_boundaries containing the boundaries for every country in Europe. The first five rows of this GeoDataFrame are printed below.

world = gpd.read_file(f"{PROJECT_ROOT}/data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
europe = world.loc[world["CONTINENT"] == "Europe"].reset_index(drop=True)

europe_stats = europe[["NAME", "POP_EST", "GDP_MD"]]
europe_boundaries = europe[["NAME", "geometry"]]

europe_boundaries.head()

	NAME	geometry
0	Russia	MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...
1	Norway	MULTIPOLYGON (((15.14282 79.67431, 15.52255 80...
2	France	MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3...
3	Sweden	POLYGON ((11.02737 58.85615, 11.46827 59.43239...
4	Belarus	POLYGON ((28.17671 56.16913, 29.22951 55.91834...

We’ll join it with a DataFrame europe_stats containing the estimated population and gross domestic product (GDP) for each country.

europe_stats.head()

	NAME	POP_EST	GDP_MD
0	Russia	144373535.0	1699876
1	Norway	5347896.0	403336
2	France	67059887.0	2715518
3	Sweden	10285453.0	530883
4	Belarus	9466856.0	63080

We do the attribute join in the code cell below. The on argument is set to the column NAME that is used to match rows in europe_boundaries to rows in europe_stats.

europe = europe_boundaries.merge(europe_stats, on="NAME")
europe.head()

	NAME	geometry	POP_EST	GDP_MD
0	Russia	MULTIPOLYGON (((178.72530 71.09880, 180.00000 ...	144373535.0	1699876
1	Norway	MULTIPOLYGON (((15.14282 79.67431, 15.52255 80...	5347896.0	403336
2	France	MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3...	67059887.0	2715518
3	Sweden	POLYGON ((11.02737 58.85615, 11.46827 59.43239...	10285453.0	530883
4	Belarus	POLYGON ((28.17671 56.16913, 29.22951 55.91834...	9466856.0	63080

Spatial join

Another type of join is a spatial join. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in the geometry column. For instance, we already have a GeoDataFrame universities containing geocoded addresses of European universities.

Then we can use a spatial join to match each university to its corresponding country. We do this with gpd.sjoin().

european_universities = gpd.sjoin(universities, europe)

print(f"We located {len(universities)} universities.")
print(
    f"Only {len(european_universities)} of the universities were located in Europe (in {len(european_universities['NAME'].unique())} different countries)."
)
european_universities.head()

We located 95 universities.
Only 90 of the universities were located in Europe (in 15 different countries).

	Name	Latitude	Longitude	geometry	index_right	NAME	POP_EST	GDP_MD
0	University of Oxford	51.759037	-1.252430	POINT (-1.25243 51.75904)	28	United Kingdom	66834405.0	2829108
1	University of Cambridge	52.210946	0.092005	POINT (0.09200 52.21095)	28	United Kingdom	66834405.0	2829108
2	Imperial College London	51.498959	-0.175641	POINT (-0.17564 51.49896)	28	United Kingdom	66834405.0	2829108
4	UCL	51.521785	-0.135151	POINT (-0.13515 51.52179)	28	United Kingdom	66834405.0	2829108
5	London School of Economics and Political Science	51.514261	-0.116734	POINT (-0.11673 51.51426)	28	United Kingdom	66834405.0	2829108

The spatial join above looks at the “geometry” columns in both GeoDataFrames. If a Point object from the universities GeoDataFrame intersects a Polygon object from the europe DataFrame, the corresponding rows are combined and added as a single row of the european_universities DataFrame. Otherwise, countries without a matching university (and universities without a matching country) are omitted from the results.

The gpd.sjoin() method is customizable for different types of joins, through the how and op arguments. For instance, you can do the equivalent of a SQL left (or right) join by setting how='left' or how='right'.

Proximity Analysis

We have a dataset from the US Environmental Protection Agency (EPA) that tracks releases of toxic chemicals in Philadelphia.

releases = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/toxic_release_pennsylvania/toxic_release_pennsylvania/toxic_release_pennsylvania.shp"
)
releases.head()

	YEAR	CITY	COUNTY	ST	LATITUDE	LONGITUDE	CHEMICAL	UNIT_OF_ME	TOTAL_RELE	geometry
0	2016	PHILADELPHIA	PHILADELPHIA	PA	40.005901	-75.072103	FORMIC ACID	Pounds	0.160	POINT (2718560.227 256380.179)
1	2016	PHILADELPHIA	PHILADELPHIA	PA	39.920120	-75.146410	ETHYLENE GLYCOL	Pounds	13353.480	POINT (2698674.606 224522.905)
2	2016	PHILADELPHIA	PHILADELPHIA	PA	40.023880	-75.220450	CERTAIN GLYCOL ETHERS	Pounds	104.135	POINT (2676833.394 261701.856)
3	2016	PHILADELPHIA	PHILADELPHIA	PA	39.913540	-75.198890	LEAD COMPOUNDS	Pounds	1730.280	POINT (2684030.004 221697.388)
4	2016	PHILADELPHIA	PHILADELPHIA	PA	39.913540	-75.198890	BENZENE	Pounds	39863.290	POINT (2684030.004 221697.388)

We also have a dataset that contains readings from air quality monitoring stations in Philadelphia.

stations = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations.shp"
)
stations.head()

	SITE_NAME	ADDRESS	BLACK_CARB	ULTRAFINE_	CO	SO2	OZONE	NO2	NOY_NO	PM10	PM2_5	SPECIATED_	PM_COURSE	CARBONYLS	PAMS_VOC	TSP_11101	TSP_METALS	TSP_LEAD	TOXICS_TO1	MET	COMMUNITY_	LATITUDE	LONGITUDE	geometry
0	LAB	1501 East Lycoming Avenue	N	N	Y	N	Y	Y	Y	N	Y	N	N	Y	Y	N	Y	N	y	N	N	40.008606	-75.097624	POINT (2711384.641 257149.310)
1	ROX	Eva and Dearnley Streets	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	Y	N	Y	N	N	40.050461	-75.236966	POINT (2671934.290 271248.900)
2	NEA	Grant Avenue and Ashton Street	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	40.072073	-75.013128	POINT (2734326.638 280980.247)
3	CHS	500 South Broad Street	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	Y	N	Y	N	N	39.944510	-75.165442	POINT (2693078.580 233247.101)
4	NEW	2861 Lewis Street	N	N	Y	Y	Y	N	Y	Y	Y	Y	Y	N	N	Y	N	Y	N	Y	N	39.991688	-75.080378	POINT (2716399.773 251134.976)

Measuring distance

To measure distances between points from two different GeoDataFrames, we first have to make sure that they use the same coordinate reference system (CRS). Thankfully, this is the case here, where both use EPSG 2272.

print(stations.crs)
print(releases.crs)

EPSG:2272
EPSG:2272

We also check the CRS to see which units it uses (meters, feet, or something else). In this case, EPSG 2272 has units of feet. (If you like, you can check this here.)

It’s relatively straightforward to compute distances in GeoPandas. The code cell below calculates the distance (in feet) between a relatively recent release incident in recent_release and every station in the stations GeoDataFrame.

recent_release = releases.iloc[360]

# Measure distance from release to each station
distances = stations.geometry.distance(recent_release.geometry)
distances

0     44778.509761
1     51006.456589
2     77744.509207
3     14672.170878
4     43753.554393
5      4711.658655
6     23197.430858
7     12072.823097
8     79081.825506
9      3780.623591
10    27577.474903
11    19818.381002
dtype: float64

Using the calculated distances, we can obtain statistics like the mean distance to each station.

print(f"Mean distance to monitoring stations: {round(distances.mean(), 2)} feet")

Mean distance to monitoring stations: 33516.28 feet

Or, we can get the closest monitoring station.

print(f"Closest monitoring station ({distances.min()} feet):")
print(stations.iloc[distances.idxmin()][["ADDRESS", "LATITUDE", "LONGITUDE"]])

Closest monitoring station (3780.623590556444 feet):
ADDRESS      3100 Penrose Ferry Road
LATITUDE                    39.91279
LONGITUDE                 -75.185448
Name: 9, dtype: object

Creating a buffer

If we want to understand all points on a map that are some radius away from a point, the simplest way is to create a buffer.

The code cell below creates a GeoSeries two_mile_buffer containing 12 different Polygon objects. Each polygon is a buffer of 2 miles (or, 2*5280 feet) around a different air monitoring station.

two_mile_buffer = stations.geometry.buffer(2 * 5280)
two_mile_buffer.head()

0    POLYGON ((2721944.641 257149.310, 2721893.792 ...
1    POLYGON ((2682494.290 271248.900, 2682443.441 ...
2    POLYGON ((2744886.638 280980.247, 2744835.789 ...
3    POLYGON ((2703638.580 233247.101, 2703587.731 ...
4    POLYGON ((2726959.773 251134.976, 2726908.924 ...
dtype: geometry

We use folium.GeoJson() to plot each polygon on a map. Note that since folium requires coordinates in latitude and longitude, we have to convert the CRS to EPSG 4326 before plotting.

m = folium.Map(location=[39.9526, -75.1652], zoom_start=11)

HeatMap(data=releases[["LATITUDE", "LONGITUDE"]], radius=15).add_to(m)

for idx, row in stations.iterrows():
    Marker([row["LATITUDE"], row["LONGITUDE"]]).add_to(m)

GeoJson(two_mile_buffer.to_crs(epsg=4326)).add_to(m)

m

Make this Notebook Trusted to load map: File -> Trust Notebook

Now, to test if a toxic release occurred within 2 miles of any monitoring station, we could run 12 different tests for each polygon (to check individually if it contains the point).

But a more efficient way is to first collapse all of the polygons into a MultiPolygon object. We do this with the unary_union attribute.

my_union = two_mile_buffer.geometry.unary_union

my_union

We use the contains() method to check if the multipolygon contains a point. We’ll use the release incident from earlier, which we know is roughly 3781 feet to the closest monitoring station.

my_union.contains(releases.iloc[360].geometry)

True

Not all releases occured within two miles of an air monitoring station

my_union.contains(releases.iloc[358].geometry)

False