import math
from pathlib import Path
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from folium import Marker, GeoJson
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim
"display.max_columns", None)
pd.set_option(
"figure.facecolor"] = (1, 1, 1, 0) # RGBA tuple with alpha=0
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0) # RGBA tuple with alpha=0
plt.rcParams[
= Path.cwd().parent.parent PROJECT_ROOT
Manipulating Geospatial Data
Geocoding
Geocoding is the process of converting the name of a place or an address to a location on a map.
In the imports above, Nominatim
refers to the geocoding software that will be used to generate locations.
We begin by instantiating the geocoder. Then, we need only apply the name or address as a Python string. (In this case, we supply “Pyramid of Khufu”, also known as the Great Pyramid of Giza.)
If the geocoding is successful, it returns a geopy.location.Location
object with two important attributes:
- the “point” attribute contains the (latitude, longitude) location, and
- the “address” attribute contains the full address.
= Nominatim(user_agent="kaggle_learn")
geolocator = geolocator.geocode("Pyramid of Khufu")
location
print(location.point)
print(location.address)
29 58m 44.976s N, 31 8m 3.17625s E
هرم خوفو, شارع ابو الهول السياحي, نزلة البطران, الجيزة, 12125, مصر
The value for the “point” attribute is a geopy.point.Point
object. We can get the latitude and longitude from the latitude and longitude attributes, respectively.
= location.point
point print("Latitude:", point.latitude)
print("Longitude:", point.longitude)
Latitude: 29.97916
Longitude: 31.134215625236113
It’s often the case that we’ll need to geocode many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.
= pd.read_csv(f"{PROJECT_ROOT}/data/kaggle_geospatial/top_universities.csv")
universities universities.head()
Name | |
---|---|
0 | University of Oxford |
1 | University of Cambridge |
2 | Imperial College London |
3 | ETH Zurich |
4 | UCL |
Then we can use a lambda function to apply the geocoder to every row in the DataFrame. (We use a try/except statement to account for the case that the geocoding is unsuccessful.)
def my_geocoder(row):
try:
= geolocator.geocode(row).point
point return pd.Series({"Latitude": point.latitude, "Longitude": point.longitude})
except:
return None
"Latitude", "Longitude"]] = universities.apply(lambda x: my_geocoder(x["Name"]), axis=1)
universities[[
print(f"{(1 - np.isnan(universities['Latitude']).mean()) * 100}% of addresses were geocoded")
95.0% of addresses were geocoded!
= universities.loc[~np.isnan(universities["Latitude"])]
universities = gpd.GeoDataFrame(
universities =gpd.points_from_xy(universities.Longitude, universities.Latitude), crs="EPSG:4326"
universities, geometry
) universities.head()
Name | Latitude | Longitude | geometry | |
---|---|---|---|---|
0 | University of Oxford | 51.759037 | -1.252430 | POINT (-1.25243 51.75904) |
1 | University of Cambridge | 52.210946 | 0.092005 | POINT (0.09200 52.21095) |
2 | Imperial College London | 51.498959 | -0.175641 | POINT (-0.17564 51.49896) |
3 | ETH Zurich | 47.413218 | 8.537491 | POINT (8.53749 47.41322) |
4 | UCL | 51.521785 | -0.135151 | POINT (-0.13515 51.52179) |
Next, we visualize all of the locations that were returned by the geocoder. Notice that a few of the locations are certainly inaccurate, as they’re not in Europe!
= folium.Map(location=[54, 15], tiles="openstreetmap", zoom_start=2)
m
for idx, row in universities.iterrows():
"Latitude"], row["Longitude"]], popup=row["Name"]).add_to(m)
Marker([row[
m
Table joins
Attribute join
You already know how to use pd.DataFrame.join()
to combine information from multiple DataFrames with a shared index. We refer to this way of joining data (by sampling matching values in the index) as an attribute join.
When performing an attribute join with a GeoDataFrame, it’s best to use gpd.GeoDataFrame.merge()
. To illustrate this, we’ll work with a GeoDataFrame europe_boundaries
containing the boundaries for every country in Europe. The first five rows of this GeoDataFrame are printed below.
= gpd.read_file(f"{PROJECT_ROOT}/data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
world = world.loc[world["CONTINENT"] == "Europe"].reset_index(drop=True)
europe
= europe[["NAME", "POP_EST", "GDP_MD"]]
europe_stats = europe[["NAME", "geometry"]]
europe_boundaries
europe_boundaries.head()
NAME | geometry | |
---|---|---|
0 | Russia | MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... |
1 | Norway | MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... |
2 | France | MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... |
3 | Sweden | POLYGON ((11.02737 58.85615, 11.46827 59.43239... |
4 | Belarus | POLYGON ((28.17671 56.16913, 29.22951 55.91834... |
We’ll join it with a DataFrame europe_stats
containing the estimated population and gross domestic product (GDP) for each country.
europe_stats.head()
NAME | POP_EST | GDP_MD | |
---|---|---|---|
0 | Russia | 144373535.0 | 1699876 |
1 | Norway | 5347896.0 | 403336 |
2 | France | 67059887.0 | 2715518 |
3 | Sweden | 10285453.0 | 530883 |
4 | Belarus | 9466856.0 | 63080 |
We do the attribute join in the code cell below. The on argument is set to the column NAME
that is used to match rows in europe_boundaries to rows in europe_stats.
= europe_boundaries.merge(europe_stats, on="NAME")
europe europe.head()
NAME | geometry | POP_EST | GDP_MD | |
---|---|---|---|---|
0 | Russia | MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... | 144373535.0 | 1699876 |
1 | Norway | MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... | 5347896.0 | 403336 |
2 | France | MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... | 67059887.0 | 2715518 |
3 | Sweden | POLYGON ((11.02737 58.85615, 11.46827 59.43239... | 10285453.0 | 530883 |
4 | Belarus | POLYGON ((28.17671 56.16913, 29.22951 55.91834... | 9466856.0 | 63080 |
Spatial join
Another type of join is a spatial join. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in the geometry
column. For instance, we already have a GeoDataFrame universities containing geocoded addresses of European universities.
Then we can use a spatial join to match each university to its corresponding country. We do this with gpd.sjoin()
.
= gpd.sjoin(universities, europe)
european_universities
print(f"We located {len(universities)} universities.")
print(
f"Only {len(european_universities)} of the universities were located in Europe (in {len(european_universities['NAME'].unique())} different countries)."
) european_universities.head()
We located 95 universities.
Only 90 of the universities were located in Europe (in 15 different countries).
Name | Latitude | Longitude | geometry | index_right | NAME | POP_EST | GDP_MD | |
---|---|---|---|---|---|---|---|---|
0 | University of Oxford | 51.759037 | -1.252430 | POINT (-1.25243 51.75904) | 28 | United Kingdom | 66834405.0 | 2829108 |
1 | University of Cambridge | 52.210946 | 0.092005 | POINT (0.09200 52.21095) | 28 | United Kingdom | 66834405.0 | 2829108 |
2 | Imperial College London | 51.498959 | -0.175641 | POINT (-0.17564 51.49896) | 28 | United Kingdom | 66834405.0 | 2829108 |
4 | UCL | 51.521785 | -0.135151 | POINT (-0.13515 51.52179) | 28 | United Kingdom | 66834405.0 | 2829108 |
5 | London School of Economics and Political Science | 51.514261 | -0.116734 | POINT (-0.11673 51.51426) | 28 | United Kingdom | 66834405.0 | 2829108 |
The spatial join above looks at the “geometry” columns in both GeoDataFrames. If a Point object from the universities GeoDataFrame intersects a Polygon object from the europe DataFrame, the corresponding rows are combined and added as a single row of the european_universities DataFrame. Otherwise, countries without a matching university (and universities without a matching country) are omitted from the results.
The gpd.sjoin()
method is customizable for different types of joins, through the how
and op
arguments. For instance, you can do the equivalent of a SQL left (or right) join by setting how='left'
or how='right'
.
Proximity Analysis
We have a dataset from the US Environmental Protection Agency (EPA) that tracks releases of toxic chemicals in Philadelphia.
= gpd.read_file(
releases f"{PROJECT_ROOT}/data/kaggle_geospatial/toxic_release_pennsylvania/toxic_release_pennsylvania/toxic_release_pennsylvania.shp"
) releases.head()
YEAR | CITY | COUNTY | ST | LATITUDE | LONGITUDE | CHEMICAL | UNIT_OF_ME | TOTAL_RELE | geometry | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 40.005901 | -75.072103 | FORMIC ACID | Pounds | 0.160 | POINT (2718560.227 256380.179) |
1 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.920120 | -75.146410 | ETHYLENE GLYCOL | Pounds | 13353.480 | POINT (2698674.606 224522.905) |
2 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 40.023880 | -75.220450 | CERTAIN GLYCOL ETHERS | Pounds | 104.135 | POINT (2676833.394 261701.856) |
3 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.913540 | -75.198890 | LEAD COMPOUNDS | Pounds | 1730.280 | POINT (2684030.004 221697.388) |
4 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.913540 | -75.198890 | BENZENE | Pounds | 39863.290 | POINT (2684030.004 221697.388) |
We also have a dataset that contains readings from air quality monitoring stations in Philadelphia.
= gpd.read_file(
stations f"{PROJECT_ROOT}/data/kaggle_geospatial/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations.shp"
) stations.head()
SITE_NAME | ADDRESS | BLACK_CARB | ULTRAFINE_ | CO | SO2 | OZONE | NO2 | NOY_NO | PM10 | PM2_5 | SPECIATED_ | PM_COURSE | CARBONYLS | PAMS_VOC | TSP_11101 | TSP_METALS | TSP_LEAD | TOXICS_TO1 | MET | COMMUNITY_ | LATITUDE | LONGITUDE | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LAB | 1501 East Lycoming Avenue | N | N | Y | N | Y | Y | Y | N | Y | N | N | Y | Y | N | Y | N | y | N | N | 40.008606 | -75.097624 | POINT (2711384.641 257149.310) |
1 | ROX | Eva and Dearnley Streets | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | Y | N | Y | N | N | 40.050461 | -75.236966 | POINT (2671934.290 271248.900) |
2 | NEA | Grant Avenue and Ashton Street | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | 40.072073 | -75.013128 | POINT (2734326.638 280980.247) |
3 | CHS | 500 South Broad Street | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | Y | N | Y | N | N | 39.944510 | -75.165442 | POINT (2693078.580 233247.101) |
4 | NEW | 2861 Lewis Street | N | N | Y | Y | Y | N | Y | Y | Y | Y | Y | N | N | Y | N | Y | N | Y | N | 39.991688 | -75.080378 | POINT (2716399.773 251134.976) |
Measuring distance
To measure distances between points from two different GeoDataFrames, we first have to make sure that they use the same coordinate reference system (CRS). Thankfully, this is the case here, where both use EPSG 2272.
print(stations.crs)
print(releases.crs)
EPSG:2272
EPSG:2272
We also check the CRS to see which units it uses (meters, feet, or something else). In this case, EPSG 2272 has units of feet. (If you like, you can check this here.)
It’s relatively straightforward to compute distances in GeoPandas. The code cell below calculates the distance (in feet) between a relatively recent release incident in recent_release
and every station in the stations
GeoDataFrame.
= releases.iloc[360]
recent_release
# Measure distance from release to each station
= stations.geometry.distance(recent_release.geometry)
distances distances
0 44778.509761
1 51006.456589
2 77744.509207
3 14672.170878
4 43753.554393
5 4711.658655
6 23197.430858
7 12072.823097
8 79081.825506
9 3780.623591
10 27577.474903
11 19818.381002
dtype: float64
Using the calculated distances, we can obtain statistics like the mean distance to each station.
print(f"Mean distance to monitoring stations: {round(distances.mean(), 2)} feet")
Mean distance to monitoring stations: 33516.28 feet
Or, we can get the closest monitoring station.
print(f"Closest monitoring station ({distances.min()} feet):")
print(stations.iloc[distances.idxmin()][["ADDRESS", "LATITUDE", "LONGITUDE"]])
Closest monitoring station (3780.623590556444 feet):
ADDRESS 3100 Penrose Ferry Road
LATITUDE 39.91279
LONGITUDE -75.185448
Name: 9, dtype: object
Creating a buffer
If we want to understand all points on a map that are some radius away from a point, the simplest way is to create a buffer.
The code cell below creates a GeoSeries two_mile_buffer
containing 12 different Polygon objects. Each polygon is a buffer of 2 miles (or, 2*5280 feet) around a different air monitoring station.
= stations.geometry.buffer(2 * 5280)
two_mile_buffer two_mile_buffer.head()
0 POLYGON ((2721944.641 257149.310, 2721893.792 ...
1 POLYGON ((2682494.290 271248.900, 2682443.441 ...
2 POLYGON ((2744886.638 280980.247, 2744835.789 ...
3 POLYGON ((2703638.580 233247.101, 2703587.731 ...
4 POLYGON ((2726959.773 251134.976, 2726908.924 ...
dtype: geometry
We use folium.GeoJson()
to plot each polygon on a map. Note that since folium requires coordinates in latitude and longitude, we have to convert the CRS to EPSG 4326 before plotting.
= folium.Map(location=[39.9526, -75.1652], zoom_start=11)
m
=releases[["LATITUDE", "LONGITUDE"]], radius=15).add_to(m)
HeatMap(data
for idx, row in stations.iterrows():
"LATITUDE"], row["LONGITUDE"]]).add_to(m)
Marker([row[
=4326)).add_to(m)
GeoJson(two_mile_buffer.to_crs(epsg
m
Now, to test if a toxic release occurred within 2 miles of any monitoring station, we could run 12 different tests for each polygon (to check individually if it contains the point).
But a more efficient way is to first collapse all of the polygons into a MultiPolygon object. We do this with the unary_union
attribute.
= two_mile_buffer.geometry.unary_union
my_union
my_union
We use the contains()
method to check if the multipolygon contains a point. We’ll use the release incident from earlier, which we know is roughly 3781 feet to the closest monitoring station.
360].geometry) my_union.contains(releases.iloc[
True
Not all releases occured within two miles of an air monitoring station
358].geometry) my_union.contains(releases.iloc[
False