By Avipsa Roy , PhD Student, Arizona State University, School of Geographical Sciences and Urban Planning
If you are interested to use BikeMaps.org data, there is an easy and completely new way to do it with Python. Python has an extended library for analysing and visualising urban crowdsourced spatial data in an interactive manner. In this blog, we take a look at some of the effective ways to use cycling safety data from BikeMaps.org and visualising it with Jupyter notebooks. For most part of the work we demostrate how to use Geopandas Python library to perform spatial analysis interactively in Python without have to install an additional enterprise GIS tool like ArcGIS or QGIS! We will take a look at the different kinds of bike safety incidents that are most commonly occuring, try to find the locaitons within Maricopa county where the most number of incidents have occurred and get an overview of how additional factors such as infrastructure change and kind of terrain or topography of the land affect the occurrence of such incidents. We will also try to find a way to plot a timeseries of the incidents to find a trend in the different kinds of bike related incidents over time.
We use the Python spatial analysis library called Geopandas to analyse bike safety data from BikeMaps.org. Let's import the necessary packages for analaysis and visualisation. This is going to be our first step!
!pip install geopandas
!pip install seaborn
!pip install mplleaflet
Once we have all the necessary packages installed, we need to import them into our Jupyter notebook
import geopandas as gpd # Python spatial analysis library
import pandas as pd # Python data analysis library
import mplleaflet as mpl # Used to visualise an interactive LEaflet map within Python
import numpy as np # Numerical analysis library
import matplotlib.pyplot as plt # Data visualisation libraries
import seaborn as sns
%matplotlib inline`
plt.rcParams['figure.figsize'] = (15.0, 10.0) #Setting the size of the figures
BikeMaps.org provides a REST API which we have used to collect the GeoJSON files for each kinds of incidents recorded in the app. These GeoJSON files can be easily read into our Jupyter notebook with the Geopandas library. The Geopandas package provides a seamless interface to directly read GeoJSON or Shapefiles into dataframes. The location of each incident is stored in a special attribute called "geometry" which is automatically created by geopandas after reading the raw GeoJSON files.
path = "./Data/Rest_api/"`
collisions = gpd.read_file(path + 'collisions.geojson')`
nearmiss = gpd.read_file(path + 'nearmiss.geojson')`
hazards = gpd.read_file(path + 'hazards.geojson')`
thefts = gpd.read_file(path + 'thefts.geojson')`
Now that we have loaded the raw data into memory using Geopandas, we will now examine the Geodataframes created from the raw data. Each of the above Geodataframes will have a geometry column which stores the location where the incident occurred. The rest of the attributes are also available in columnar format. Let us take a quick look at the data and see the number of records etc.
collisions[["i_type","date","injury","trip_purpose","sex","age","geometry"]].head()
i_type | date | injury | trip_purpose | sex | age | geometry | |
---|---|---|---|---|---|---|---|
0 | Collision with moving object or vehicle | 2015-05-13T17:30:00 | Injury, saw family doctor | Commute | M | 1959 | POINT (-114.0949058532715 51.08303753521079) |
1 | Collision with moving object or vehicle | 2015-04-18T12:30:00 | Injury, hospital emergency visit | Exercise or recreation | None | None | POINT (-123.1313860416412 49.32364651138661) |
2 | Collision with moving object or vehicle | 2015-05-11T07:15:00 | Injury, no treatment | Commute | F | 1976 | POINT (-122.99212038517 49.22070503231325) |
3 | Collision with moving object or vehicle | 2015-06-10T08:15:00 | No injury | Commute | None | None | POINT (-80.27263641357422 43.544691011049) |
4 | Collision with moving object or vehicle | 2015-05-24T16:00:00 | Injury, no treatment | Personal business | M | 1989 | POINT (11.04057312011719 59.9539654475704) |
There are 4 major kinds of bike incidents captured by BikeMaps.org namely - 'collisions', 'nearmiss', 'hazards' and 'thefts'. We have created 4 different dataframes for each kind of incident. We wanted to see how many incidents of each kind occurred in Tempe. So we created a map showing the most common locations of such incidents in Maricopa county.
The package mplleaflet provides an easy interface to visualise geometries on OpenStreetMap using the Leaflet library. So, we try to locate the exact locations on an interactive map where collisions occurred in Tempe.
collisions_maricopa = gpd.read_file("./Data/Shapefiles/collisions_maricopa.shp")
nearmiss_maricopa = gpd.read_file("./Data/Shapefiles/near_maricopa.shp")
hazards_maricopa = gpd.read_file("./Data/Shapefiles/hazards_maricopa.shp")
thefts_maricopa = gpd.read_file("./Data/Shapefiles/thefts_maricopa.shp")
fig = collisions_maricopa.plot(color="red")
fig1 = nearmiss_maricopa.plot(ax=fig,color="green")
fig2 = hazards_maricopa.plot(ax=fig1,color="blue")
fig3 = thefts_maricopa.plot(ax=fig2,color="yellow")
mpl.display()
We can get a spatial distribution of the various incidents from the map. As we can see, there are most concentration of incidents in and around Tempe, which is pretty much due to the fact that most number of bike riders in Maricopa county are in this region.
Now to understand the nature of the BikeMaps data , we do some basic statistical analysis and plot the data using the "seaborn" library. The very first step is to get a frequenct distribution of the various kinds of bike incidents that have occurred over the year.
#Plot the no of bike incidents by category in 2017
x = ['collisions','hazards','thefts','nearmiss']
y = np.array([collisions.shape[0],hazards.shape[0],thefts.shape[0],nearmiss.shape[0]])
sns.set_style("whitegrid")
counts = pd.DataFrame({'type':x,'count':y})
ax = sns.barplot(x="type", y="count", data=counts)
It is clearly visible that "nearmiss" incidents are the most frequent ones. In bike safety, nearmises often go unreported, but BikeMaps.org is a pretty great tool, which actually captures such incidents. Otherwise, most of the nearmiss incidents would never be captured . It would be difficult for urban planners to visibly understand the risk factors associated with such incidents and take necesssary actions to prevent those.
import random
collisions["id"] = np.random.randint(200,2000,size=collisions.shape[0])
nearmiss["id"] = np.random.randint(100,1000,size=nearmiss.shape[0])
df1 = collisions[["id","sex","i_type"]].query("sex=='F' | sex=='M'").groupby(["sex","i_type"],as_index=False).count()
sns.barplot(x="i_type",y="id",hue="sex",data=df1)
df2 = collisions[["i_type","id","age","sex"]].groupby(["i_type","age","sex"],as_index=False).count()
#df2 = df2.dropna()
df2["age_in_years"] = df2["age"].apply(lambda x:2017-int(x))
def age_groups(x):
if (x<15):
y = "<15"
elif (x in (15,25)):
y = "15-25"
elif (x in (26,35)) :
y= "26-35"
elif (x in (36,45)):
y = "36-45"
elif (x in (46,55)):
y = "46-55"
else:
y= ">55"
return y
df2["age_groups"] = df2["age_in_years"].apply(lambda x: age_groups(int(x)))
sns.barplot(x="age_groups",y="id",hue="sex",data=df2)
From the above plots we can see a strong bias towards Male riders in the age groups 25-45 who meet with most collisions. In order to reconfirm this statement we need to do a similar analysis for the other kinds of incidents as well. Let us compare this with nearmiss incidents.
df3 = nearmiss[["i_type","id","age","sex"]].groupby(["i_type","age","sex"],as_index=False).count()
df3["age_in_years"] = df3["age"].apply(lambda x:2017-int(x))
df3["age_groups"] = df3["age_in_years"].apply(lambda x: age_groups(int(x)))
sns.barplot(x="age_groups",y="id",hue="sex",data=df3)
Therefore both incidents have a definite bias towards 26-35 males. This is understood since most bikers are males between 26-35 years of age and those are also a large chunk of riders who use smartphones to access the BikeMaps.org app.
df_coll_helmet = collisions[["id","helmet","i_type"]].groupby(["helmet","i_type"],as_index = False).count()
N = 5
helmet = df_coll_helmet.query("helmet=='Y'").id
no_helmet = df_coll_helmet.query("helmet=='N'").id
ind = np.arange(3) # the x locations for the groups
width = 0.25 # the width of the bars: can also be len(x) sequence
p1 = plt.bar(ind, helmet, width, color ='c' )
p2 = plt.bar(ind, no_helmet, width,bottom=ind, color = 'm')
plt.ylabel('Counts')
plt.title("Variation of collision incidents with use of helmets")
plt.xticks(ind, ('Collision with moving object or vehicle', 'Collision with stationary object or vehicle', 'Fall'))
plt.yticks(np.arange(0, 450, 50))
plt.legend((p1[0], p2[0]), ('Helmet', 'No Helmet'))
plt.show()
Now that we have understood the variation of the different type of collisions and nearmiss that are most common, we would be more interested to find out the contributing factors behind such incidents. Topography of a region is especially indicative of the nature of the terrain that bikers are riding on. Let us take a look at the nearmiss incidents and how they are associated with the topography of a region.
sns.boxplot(x="i_type",y="id",hue="terrain",data = nearmiss)
The above boxplot shows the measures of the central tendencies for each nearmiss incident with varying terrain. For example, for "Downhill" terrain is the most common reason for nearmisses with moving objects but "Uphill" terrain contributes more towards a nearmiss incident with a stationary object or vehicle. The number of nearmisses due to downhill terrain or a steep slope varies between 350-780 which is nearly 70% of the total incidents recorded.
df3 = pd.crosstab(index=nearmiss["i_type"], columns="count")
df3 = pd.crosstab(index=nearmiss["i_type"],columns=nearmiss["infrastructure_changed"])
df3.plot(kind="barh",stacked=True,figsize=(10,8))
We also try to see if the change in infrastructure necessarily contributed to the nearmiss. However, from the data at hand, we cannot see much of an influence from any change in infrastructure. Therfore, this is not a good attribute which causes those incidents or the riders avoid routes where new infrastructure have been introduced consciously enough, so we don't have the data to make solid assumptions!
Now that we have understoodhow to check for differnet attributes and find the ones that contribute the most to the incidents, we might be curious to know if there is any trend which can be observed over time in the collisions and nearmiss incidents recorded by BikeMaps.org users. We will try to build a timeseries from the raw data by extracting the date and time and merginfg the "year", "month" and "day" columns to create a timeseries in pnadas. We will then plot the timeeries for 2 different kinds of incidents - collisions and nearmiss for the year 2016.
# Convert date column to a timestamcollisions.date
import warnings
warnings.filterwarnings('ignore')
collisions["year"] = collisions["date"].apply(lambda x: x[0:4])
collisions["month"] = collisions["date"].apply(lambda x: x[5:7])
collisions["day"] = collisions["date"].apply(lambda x: x[8:10])
collisions["hour"] = collisions["date"].apply(lambda x: x[11:13])
collisions["minute"] = collisions["date"].apply(lambda x: x[14:16])
collisions["second"] = collisions["date"].apply(lambda x: x[17:19])
collisions["timestamp"] = pd.to_datetime(collisions[['year', 'month','day']])
df1 = collisions[["id","timestamp"]].groupby("timestamp").count()
nearmiss["year"] = nearmiss["date"].apply(lambda x: x[0:4])
nearmiss["month"] = nearmiss["date"].apply(lambda x: x[5:7])
nearmiss["day"] = nearmiss["date"].apply(lambda x: x[8:10])
nearmiss["hour"] = nearmiss["date"].apply(lambda x: x[11:13])
nearmiss["minute"] = nearmiss["date"].apply(lambda x: x[14:16])
nearmiss["second"] = nearmiss["date"].apply(lambda x: x[17:19])
nearmiss["timestamp"] = pd.to_datetime(nearmiss[['year', 'month','day']])
df2 = nearmiss[["id","timestamp"]].groupby("timestamp").count()
fig1 = df1['2016'].resample('D').interpolate(method='cubic').plot(color='blue')
fig2 = df2['2016'].resample('D').interpolate(method='cubic').plot(color='green',ax=fig1)
plt.legend( ["collisions","nearmiss"])
plt.xlabel("Month")
plt.title("Comparison of Monthly variation of nearmiss and collision incidents between 2015 and 2016")
There is a sharp spike in May and June for the nearmiss incidents. The collisions vary more or less in a standard manner. The peaks in August mid- September and mid-November stand out though. The month of June is when a lot of countries experience summer weather. So in general, the number of bikers increase altogether. Therefore, the increase in incidents in BikeMaps.org is clearly an indicator of increase in ridership during this month as well.
This is a short example of how we can use BikeMaps.org data to analyse bike safety more quantitatively in Python. The analyses can always be combined with additional data sources to gain more insights. For example, we can use weather data to analyse seasonal variation in bike incidents and find what impacts severer weather conditions such as heavy rainfall or very high temperatures have on bike safety. Hope more people will contribute to this effort and ply around with BikeMaps.org data in the future!
References:
- Jestico, Ben, Trisalyn Nelson, and Meghan Winters. "Mapping ridership using crowdsourced cycling data." Journal of transport geography 52 (2016): 90-97.
- Ferster, Colin Jay, et al. "Geographic age and gender representation in volunteered cycling safety data: a case study of BikeMaps. org." Applied Geography 88 (2017): 144-150.
- http://geopandas.org/mapping.html
- McKinney, Wes. "Data structures for statistical computing in python." Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX: SciPy, 2010.