import math
from pathlib import Path
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from folium import Marker, GeoJson
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim
pd.set_option("display.max_columns", None)
plt.rcParams["figure.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0
plt.rcParams["axes.facecolor"] = (1, 1, 1, 0)  # RGBA tuple with alpha=0
PROJECT_ROOT = Path.cwd().parent.parentManipulating Geospatial Data
Geocoding
Geocoding is the process of converting the name of a place or an address to a location on a map.
In the imports above, Nominatim refers to the geocoding software that will be used to generate locations.
We begin by instantiating the geocoder. Then, we need only apply the name or address as a Python string. (In this case, we supply “Pyramid of Khufu”, also known as the Great Pyramid of Giza.)
If the geocoding is successful, it returns a geopy.location.Location object with two important attributes:
- the “point” attribute contains the (latitude, longitude) location, and
- the “address” attribute contains the full address.
geolocator = Nominatim(user_agent="kaggle_learn")
location = geolocator.geocode("Pyramid of Khufu")
print(location.point)
print(location.address)29 58m 44.976s N, 31 8m 3.17625s E
هرم خوفو, شارع ابو الهول السياحي, نزلة البطران, الجيزة, 12125, مصرThe value for the “point” attribute is a geopy.point.Point object. We can get the latitude and longitude from the latitude and longitude attributes, respectively.
point = location.point
print("Latitude:", point.latitude)
print("Longitude:", point.longitude)Latitude: 29.97916
Longitude: 31.134215625236113It’s often the case that we’ll need to geocode many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.
universities = pd.read_csv(f"{PROJECT_ROOT}/data/kaggle_geospatial/top_universities.csv")
universities.head()| Name | |
|---|---|
| 0 | University of Oxford | 
| 1 | University of Cambridge | 
| 2 | Imperial College London | 
| 3 | ETH Zurich | 
| 4 | UCL | 
Then we can use a lambda function to apply the geocoder to every row in the DataFrame. (We use a try/except statement to account for the case that the geocoding is unsuccessful.)
def my_geocoder(row):
    try:
        point = geolocator.geocode(row).point
        return pd.Series({"Latitude": point.latitude, "Longitude": point.longitude})
    except:
        return None
universities[["Latitude", "Longitude"]] = universities.apply(lambda x: my_geocoder(x["Name"]), axis=1)
print(f"{(1 - np.isnan(universities['Latitude']).mean()) * 100}% of addresses were geocoded")95.0% of addresses were geocoded!universities = universities.loc[~np.isnan(universities["Latitude"])]
universities = gpd.GeoDataFrame(
    universities, geometry=gpd.points_from_xy(universities.Longitude, universities.Latitude), crs="EPSG:4326"
)
universities.head()| Name | Latitude | Longitude | geometry | |
|---|---|---|---|---|
| 0 | University of Oxford | 51.759037 | -1.252430 | POINT (-1.25243 51.75904) | 
| 1 | University of Cambridge | 52.210946 | 0.092005 | POINT (0.09200 52.21095) | 
| 2 | Imperial College London | 51.498959 | -0.175641 | POINT (-0.17564 51.49896) | 
| 3 | ETH Zurich | 47.413218 | 8.537491 | POINT (8.53749 47.41322) | 
| 4 | UCL | 51.521785 | -0.135151 | POINT (-0.13515 51.52179) | 
Next, we visualize all of the locations that were returned by the geocoder. Notice that a few of the locations are certainly inaccurate, as they’re not in Europe!
m = folium.Map(location=[54, 15], tiles="openstreetmap", zoom_start=2)
for idx, row in universities.iterrows():
    Marker([row["Latitude"], row["Longitude"]], popup=row["Name"]).add_to(m)
mTable joins
Attribute join
You already know how to use pd.DataFrame.join() to combine information from multiple DataFrames with a shared index. We refer to this way of joining data (by sampling matching values in the index) as an attribute join.
When performing an attribute join with a GeoDataFrame, it’s best to use gpd.GeoDataFrame.merge(). To illustrate this, we’ll work with a GeoDataFrame europe_boundaries containing the boundaries for every country in Europe. The first five rows of this GeoDataFrame are printed below.
world = gpd.read_file(f"{PROJECT_ROOT}/data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
europe = world.loc[world["CONTINENT"] == "Europe"].reset_index(drop=True)
europe_stats = europe[["NAME", "POP_EST", "GDP_MD"]]
europe_boundaries = europe[["NAME", "geometry"]]
europe_boundaries.head()| NAME | geometry | |
|---|---|---|
| 0 | Russia | MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... | 
| 1 | Norway | MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... | 
| 2 | France | MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... | 
| 3 | Sweden | POLYGON ((11.02737 58.85615, 11.46827 59.43239... | 
| 4 | Belarus | POLYGON ((28.17671 56.16913, 29.22951 55.91834... | 
We’ll join it with a DataFrame europe_stats containing the estimated population and gross domestic product (GDP) for each country.
europe_stats.head()| NAME | POP_EST | GDP_MD | |
|---|---|---|---|
| 0 | Russia | 144373535.0 | 1699876 | 
| 1 | Norway | 5347896.0 | 403336 | 
| 2 | France | 67059887.0 | 2715518 | 
| 3 | Sweden | 10285453.0 | 530883 | 
| 4 | Belarus | 9466856.0 | 63080 | 
We do the attribute join in the code cell below. The on argument is set to the column NAME that is used to match rows in europe_boundaries to rows in europe_stats.
europe = europe_boundaries.merge(europe_stats, on="NAME")
europe.head()| NAME | geometry | POP_EST | GDP_MD | |
|---|---|---|---|---|
| 0 | Russia | MULTIPOLYGON (((178.72530 71.09880, 180.00000 ... | 144373535.0 | 1699876 | 
| 1 | Norway | MULTIPOLYGON (((15.14282 79.67431, 15.52255 80... | 5347896.0 | 403336 | 
| 2 | France | MULTIPOLYGON (((-51.65780 4.15623, -52.24934 3... | 67059887.0 | 2715518 | 
| 3 | Sweden | POLYGON ((11.02737 58.85615, 11.46827 59.43239... | 10285453.0 | 530883 | 
| 4 | Belarus | POLYGON ((28.17671 56.16913, 29.22951 55.91834... | 9466856.0 | 63080 | 
Spatial join
Another type of join is a spatial join. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in the geometry column. For instance, we already have a GeoDataFrame universities containing geocoded addresses of European universities.
Then we can use a spatial join to match each university to its corresponding country. We do this with gpd.sjoin().
european_universities = gpd.sjoin(universities, europe)
print(f"We located {len(universities)} universities.")
print(
    f"Only {len(european_universities)} of the universities were located in Europe (in {len(european_universities['NAME'].unique())} different countries)."
)
european_universities.head()We located 95 universities.
Only 90 of the universities were located in Europe (in 15 different countries).| Name | Latitude | Longitude | geometry | index_right | NAME | POP_EST | GDP_MD | |
|---|---|---|---|---|---|---|---|---|
| 0 | University of Oxford | 51.759037 | -1.252430 | POINT (-1.25243 51.75904) | 28 | United Kingdom | 66834405.0 | 2829108 | 
| 1 | University of Cambridge | 52.210946 | 0.092005 | POINT (0.09200 52.21095) | 28 | United Kingdom | 66834405.0 | 2829108 | 
| 2 | Imperial College London | 51.498959 | -0.175641 | POINT (-0.17564 51.49896) | 28 | United Kingdom | 66834405.0 | 2829108 | 
| 4 | UCL | 51.521785 | -0.135151 | POINT (-0.13515 51.52179) | 28 | United Kingdom | 66834405.0 | 2829108 | 
| 5 | London School of Economics and Political Science | 51.514261 | -0.116734 | POINT (-0.11673 51.51426) | 28 | United Kingdom | 66834405.0 | 2829108 | 
The spatial join above looks at the “geometry” columns in both GeoDataFrames. If a Point object from the universities GeoDataFrame intersects a Polygon object from the europe DataFrame, the corresponding rows are combined and added as a single row of the european_universities DataFrame. Otherwise, countries without a matching university (and universities without a matching country) are omitted from the results.
The gpd.sjoin() method is customizable for different types of joins, through the how and op arguments. For instance, you can do the equivalent of a SQL left (or right) join by setting how='left' or how='right'.
Proximity Analysis
We have a dataset from the US Environmental Protection Agency (EPA) that tracks releases of toxic chemicals in Philadelphia.
releases = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/toxic_release_pennsylvania/toxic_release_pennsylvania/toxic_release_pennsylvania.shp"
)
releases.head()| YEAR | CITY | COUNTY | ST | LATITUDE | LONGITUDE | CHEMICAL | UNIT_OF_ME | TOTAL_RELE | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 40.005901 | -75.072103 | FORMIC ACID | Pounds | 0.160 | POINT (2718560.227 256380.179) | 
| 1 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.920120 | -75.146410 | ETHYLENE GLYCOL | Pounds | 13353.480 | POINT (2698674.606 224522.905) | 
| 2 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 40.023880 | -75.220450 | CERTAIN GLYCOL ETHERS | Pounds | 104.135 | POINT (2676833.394 261701.856) | 
| 3 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.913540 | -75.198890 | LEAD COMPOUNDS | Pounds | 1730.280 | POINT (2684030.004 221697.388) | 
| 4 | 2016 | PHILADELPHIA | PHILADELPHIA | PA | 39.913540 | -75.198890 | BENZENE | Pounds | 39863.290 | POINT (2684030.004 221697.388) | 
We also have a dataset that contains readings from air quality monitoring stations in Philadelphia.
stations = gpd.read_file(
    f"{PROJECT_ROOT}/data/kaggle_geospatial/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations/PhillyHealth_Air_Monitoring_Stations.shp"
)
stations.head()| SITE_NAME | ADDRESS | BLACK_CARB | ULTRAFINE_ | CO | SO2 | OZONE | NO2 | NOY_NO | PM10 | PM2_5 | SPECIATED_ | PM_COURSE | CARBONYLS | PAMS_VOC | TSP_11101 | TSP_METALS | TSP_LEAD | TOXICS_TO1 | MET | COMMUNITY_ | LATITUDE | LONGITUDE | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LAB | 1501 East Lycoming Avenue | N | N | Y | N | Y | Y | Y | N | Y | N | N | Y | Y | N | Y | N | y | N | N | 40.008606 | -75.097624 | POINT (2711384.641 257149.310) | 
| 1 | ROX | Eva and Dearnley Streets | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | Y | N | Y | N | N | 40.050461 | -75.236966 | POINT (2671934.290 271248.900) | 
| 2 | NEA | Grant Avenue and Ashton Street | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | 40.072073 | -75.013128 | POINT (2734326.638 280980.247) | 
| 3 | CHS | 500 South Broad Street | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | Y | N | Y | N | N | 39.944510 | -75.165442 | POINT (2693078.580 233247.101) | 
| 4 | NEW | 2861 Lewis Street | N | N | Y | Y | Y | N | Y | Y | Y | Y | Y | N | N | Y | N | Y | N | Y | N | 39.991688 | -75.080378 | POINT (2716399.773 251134.976) | 
Measuring distance
To measure distances between points from two different GeoDataFrames, we first have to make sure that they use the same coordinate reference system (CRS). Thankfully, this is the case here, where both use EPSG 2272.
print(stations.crs)
print(releases.crs)EPSG:2272
EPSG:2272We also check the CRS to see which units it uses (meters, feet, or something else). In this case, EPSG 2272 has units of feet. (If you like, you can check this here.)
It’s relatively straightforward to compute distances in GeoPandas. The code cell below calculates the distance (in feet) between a relatively recent release incident in recent_release and every station in the stations GeoDataFrame.
recent_release = releases.iloc[360]
# Measure distance from release to each station
distances = stations.geometry.distance(recent_release.geometry)
distances0     44778.509761
1     51006.456589
2     77744.509207
3     14672.170878
4     43753.554393
5      4711.658655
6     23197.430858
7     12072.823097
8     79081.825506
9      3780.623591
10    27577.474903
11    19818.381002
dtype: float64Using the calculated distances, we can obtain statistics like the mean distance to each station.
print(f"Mean distance to monitoring stations: {round(distances.mean(), 2)} feet")Mean distance to monitoring stations: 33516.28 feetOr, we can get the closest monitoring station.
print(f"Closest monitoring station ({distances.min()} feet):")
print(stations.iloc[distances.idxmin()][["ADDRESS", "LATITUDE", "LONGITUDE"]])Closest monitoring station (3780.623590556444 feet):
ADDRESS      3100 Penrose Ferry Road
LATITUDE                    39.91279
LONGITUDE                 -75.185448
Name: 9, dtype: objectCreating a buffer
If we want to understand all points on a map that are some radius away from a point, the simplest way is to create a buffer.
The code cell below creates a GeoSeries two_mile_buffer containing 12 different Polygon objects. Each polygon is a buffer of 2 miles (or, 2*5280 feet) around a different air monitoring station.
two_mile_buffer = stations.geometry.buffer(2 * 5280)
two_mile_buffer.head()0    POLYGON ((2721944.641 257149.310, 2721893.792 ...
1    POLYGON ((2682494.290 271248.900, 2682443.441 ...
2    POLYGON ((2744886.638 280980.247, 2744835.789 ...
3    POLYGON ((2703638.580 233247.101, 2703587.731 ...
4    POLYGON ((2726959.773 251134.976, 2726908.924 ...
dtype: geometryWe use folium.GeoJson() to plot each polygon on a map. Note that since folium requires coordinates in latitude and longitude, we have to convert the CRS to EPSG 4326 before plotting.
m = folium.Map(location=[39.9526, -75.1652], zoom_start=11)
HeatMap(data=releases[["LATITUDE", "LONGITUDE"]], radius=15).add_to(m)
for idx, row in stations.iterrows():
    Marker([row["LATITUDE"], row["LONGITUDE"]]).add_to(m)
GeoJson(two_mile_buffer.to_crs(epsg=4326)).add_to(m)
mNow, to test if a toxic release occurred within 2 miles of any monitoring station, we could run 12 different tests for each polygon (to check individually if it contains the point).
But a more efficient way is to first collapse all of the polygons into a MultiPolygon object. We do this with the unary_union attribute.
my_union = two_mile_buffer.geometry.unary_union
my_unionWe use the contains() method to check if the multipolygon contains a point. We’ll use the release incident from earlier, which we know is roughly 3781 feet to the closest monitoring station.
my_union.contains(releases.iloc[360].geometry)TrueNot all releases occured within two miles of an air monitoring station
my_union.contains(releases.iloc[358].geometry)False