👨🏫 Guide
Important
italy-geopop is designed primarily to be used as a pandas.Series accessor.
Although some of its functions are also useful outside the pandas integration, this guide focuses on the italy_geopop pandas accessor.
Tutorial #1 - Hospital beds in Italy
How many hospital beds are there in Italy?
Suppose you want to answer this question. To do so, you need data on hospital beds. Luckily, the Italian Health Ministry publishes such data under the Italian Open Data License, so everyone can access it here.
Note
A simplified version of that dataset is available for download and is used in this guide. It contains only observations from 2019; only the columns used in the guide were kept, and the labels were translated to English.
Let’s read the CSV file and see what it looks like.
import pandas as pd
df = pd.read_csv('open_data_italy_hospitals_beds_2019.csv', dtype={'beds': float, 'province_short': str}, keep_default_na=False)
df.head(10)
|   | beds | province_short |
|---|---|---|
| 0 | 179.0 | TO |
| 1 | 18.0 | TO |
| 2 | 17.0 | TO |
| 3 | 277.0 | TO |
| 4 | 15.0 | TO |
| 5 | 80.0 | TO |
| 6 | 25.0 | TO |
| 7 | 20.0 | TO |
| 8 | 120.0 | TO |
| 9 | 145.0 | TO |
Warning
Remember to pass keep_default_na=False when reading the dataset: the province abbreviation for Naples in province_short is ‘NA’,
which pandas would otherwise treat as a null value by default.
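The effect of keep_default_na can be checked on a tiny in-memory CSV. This is a minimal sketch: the two rows below are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Two made-up rows; the first one uses the abbreviation for Naples.
csv_text = "beds,province_short\n100.0,NA\n50.0,TO\n"
dtypes = {'beds': float, 'province_short': str}

# Default behaviour: the string 'NA' is parsed as a missing value.
default = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# With keep_default_na=False the string 'NA' is kept as-is.
kept = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, keep_default_na=False)

print(default.province_short.isna().sum())  # number of values lost to NaN
print(kept.province_short.tolist())
```

With the default settings one abbreviation is lost, while the second call keeps both.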
The dataset contains only two columns: beds and province_short. As you can see, values in province_short are repeated many times.
That is because every row of the original dataset is a hospital, so we need to group the dataset by province_short and sum the beds column.
This gives us a dataset with the number of hospital beds for every province.
df = df.groupby(['province_short'], as_index=False).sum()
df.head()
|   | province_short | beds |
|---|---|---|
| 0 | AG | 751.0 |
| 1 | AL | 1460.0 |
| 2 | AN | 1969.0 |
| 3 | AO | 444.0 |
| 4 | AP | 471.0 |
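To see what the grouping step does, here is a small sketch on toy data. The first three rows reuse the TO values shown earlier; the 'XX' row is a made-up placeholder, not a real province.

```python
import pandas as pd

# One row per hospital, as in the original dataset.
hospitals = pd.DataFrame({
    'beds': [179.0, 18.0, 17.0, 40.0],
    'province_short': ['TO', 'TO', 'TO', 'XX'],  # 'XX' is hypothetical
})

# One row per province, with the beds of all its hospitals summed.
per_province = hospitals.groupby(['province_short'], as_index=False).sum()
print(per_province)
```

Passing as_index=False keeps province_short as a regular column, which is convenient because the accessor is later called on that column.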
At this point we may want to plot the geospatial distribution of hospital beds, and that is where italy-geopop can help us.
First of all, let’s activate the accessor. This registers the accessor with pandas and lets us take advantage of italy-geopop's functionality.
from italy_geopop.pandas_extension import pandas_activate
pandas_activate(include_geometry=True, data_year=2022)
Hint
You may want the accessor to live only in a specific context. You can do so with pandas_activate_context, which has the same syntax as pandas_activate.
This is useful if you want to register the accessor with different initialization options more than once in your code, or if you want to free up memory right
after you get the needed data (the trade-off is that italy-geopop needs to be reinitialized every time you register and use the accessor).
Here you can find the complete API reference documentation for both pandas_activate and pandas_activate_context.
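The idea of an accessor that exists only inside a given context can be illustrated with plain pandas. The sketch below is not italy-geopop's implementation: toy_accessor_context, the 'toy' accessor and its tag method are all hypothetical names invented for this example.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def toy_accessor_context(suffix):
    """Register a throwaway Series accessor named 'toy'; remove it on exit."""

    @pd.api.extensions.register_series_accessor('toy')
    class ToyAccessor:
        def __init__(self, series):
            self._series = series

        def tag(self):
            # Append the suffix chosen when the context was entered.
            return self._series.astype(str) + suffix

    try:
        yield
    finally:
        # Drop the accessor so it no longer exists outside the context.
        delattr(pd.Series, 'toy')


with toy_accessor_context('-IT'):
    inside = pd.Series(['TO', 'NA']).toy.tag().tolist()

available_after = hasattr(pd.Series, 'toy')
print(inside, available_after)
```

Each time the context is entered the accessor is built again with the given options, which mirrors the trade-off described in the hint.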
Now we can get the geospatial data we need for the plot using the italy_geopop.from_province accessor.
df[['geometry']] = df.province_short.italy_geopop.from_province(return_cols=['geometry'])
print('df type:', type(df))
df.head()
This should output
df type: <class 'pandas.core.frame.DataFrame'>
|   | province_short | beds | geometry |
|---|---|---|---|
| 0 | AG | 751.0 | MULTIPOLYGON (((13.663438752935148 37.19338230… |
| 1 | AL | 1460.0 | POLYGON ((8.408748511874723 44.70686568809358,… |
| 2 | AN | 1969.0 | POLYGON ((13.422716165173247 43.62050399042972… |
| 3 | AO | 444.0 | POLYGON ((7.734554907653768 45.92365152529369,… |
| 4 | AP | 471.0 | POLYGON ((13.50521153822172 42.775569771884555… |
Now we have geospatial data for every province but df is a pandas.DataFrame instance and we need a geopandas.GeoDataFrame instance in order to generate the plot.
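The conversion can be tried in isolation on toy shapes. In this sketch the province codes and the two unit squares are made up; the only assumption about geopandas is that a column named geometry holding shapely objects is picked up as the active geometry.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# A plain DataFrame whose 'geometry' column holds shapely shapes.
plain = pd.DataFrame({
    'province_short': ['AA', 'BB'],  # hypothetical codes
    'beds': [100.0, 250.0],
    'geometry': [box(0, 0, 1, 1), box(1, 0, 2, 1)],
})

# Wrapping it in a GeoDataFrame enables geospatial methods such as .plot().
geo = gpd.GeoDataFrame(plain)

print(type(plain).__name__, '->', type(geo).__name__)
print(geo.geometry.area.tolist())
```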
Note
Note that we created the geometry column with double square brackets. That is because the italy_geopop accessor
actually returns a subset of another dataframe, so passing return_cols=['geometry'] makes the accessor
return a 2-dimensional pandas.DataFrame, while passing return_cols='geometry' makes it return
a 1-dimensional pandas.Series instance.
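The same distinction exists in plain pandas column selection, which makes it easy to try out. This sketch uses a hand-made lookup table in place of the accessor; the geometry strings are placeholders, and the population figures are the AG and AL values that appear later in this guide.

```python
import pandas as pd

# Stand-in for the data the accessor would return.
lookup = pd.DataFrame({
    'geometry': ['shape_AG', 'shape_AL'],  # placeholder values
    'population': [415887.0, 407264.0],
})

two_dim = lookup[['geometry']]  # list of labels -> DataFrame
one_dim = lookup['geometry']    # single label   -> Series

df = pd.DataFrame({'province_short': ['AG', 'AL']})
df[['geometry']] = two_dim               # 2-dimensional value, double brackets
df['population'] = lookup['population']  # 1-dimensional value, single brackets

print(type(two_dim).__name__, type(one_dim).__name__)
print(df.columns.tolist())
```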
import geopandas as gpd
import matplotlib.pyplot as plt

df = gpd.GeoDataFrame(df)

df.plot(
    'beds',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds per province - Italy - 2019')
plt.tight_layout()
We can see that a few provinces have a very high number of hospital beds, while the others seem to have very few.
There must be some kind of bias: more populous provinces are likely to have more beds.
At the very least, we need to adjust the number of beds for each province’s population, and italy-geopop can help with this task too.
df['population'] = df.province_short.italy_geopop.from_province(return_cols='population', population_limits='total')
df['beds_per_capita'] = df.beds / df.population
df.head()
|   | province_short | beds | geometry | population | beds_per_capita |
|---|---|---|---|---|---|
| 0 | AG | 751.0 | MULTIPOLYGON | 415887.0 | 0.0018057789736154292 |
| 1 | AL | 1460.0 | POLYGON | 407264.0 | 0.003584898247819596 |
| 2 | AN | 1969.0 | POLYGON | 461687.0 | 0.004264794113761055 |
| 3 | AO | 444.0 | POLYGON | 123360.0 | 0.0035992217898832683 |
| 4 | AP | 471.0 | POLYGON | 202365.0 | 0.0023274775776443556 |
Here we created the population column. Note that we assigned it using single square brackets because the output
of the italy_geopop accessor was 1-dimensional. Then we created the beds_per_capita column by dividing beds by population,
obtaining the number of hospital beds per person.
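The division can be reproduced by hand for the first two provinces in the table. The beds_per_1000 column is an extra added here only to make the small ratios easier to read; it is not part of the tutorial's dataset.

```python
import pandas as pd

# Beds and population for AG and AL, copied from the table above.
sample = pd.DataFrame({
    'province_short': ['AG', 'AL'],
    'beds': [751.0, 1460.0],
    'population': [415887.0, 407264.0],
})

sample['beds_per_capita'] = sample.beds / sample.population
# Same ratio expressed per 1,000 inhabitants (illustrative extra column).
sample['beds_per_1000'] = sample.beds_per_capita * 1000

print(sample.round({'beds_per_1000': 2}))
```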
Now we can create the same plot as before but adjusted for province population.
df.plot(
    'beds_per_capita',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds per capita per province - Italy - 2019')
plt.tight_layout()
This representation is certainly more accurate than the previous one, but the distribution still shows high variability despite the adjustment we made.
Healthcare in Italy is largely administered at the regional level, so it may be more accurate to plot the distribution of hospital beds by region instead. Let’s do so.
First, we reload our dataset and get the region_code using the italy_geopop.from_province accessor.
df = pd.read_csv('open_data_italy_hospitals_beds_2019.csv', dtype={'beds': float, 'province_short': str}, keep_default_na=False)
df['region_code'] = df.province_short.italy_geopop.from_province(return_cols='region_code')
df.head()
The expected output is
|   | beds | province_short | region_code |
|---|---|---|---|
| 0 | 179.0 | TO | 1 |
| 1 | 18.0 | TO | 1 |
| 2 | 17.0 | TO | 1 |
| 3 | 277.0 | TO | 1 |
| 4 | 15.0 | TO | 1 |
Then we use pandas.DataFrame.groupby to group the dataset by region_code and sum the beds.
df = df.groupby(['region_code'], as_index=False)[['beds']].sum()
df.head()
|   | region_code | beds |
|---|---|---|
| 0 | 1 | 14572.0 |
| 1 | 2 | 444.0 |
| 2 | 3 | 34812.0 |
| 3 | 4 | 3597.0 |
| 4 | 5 | 15997.0 |
Next we can get geospatial and population data for regions using the italy_geopop.from_region accessor.
We then recalculate the beds_per_capita column by dividing each region’s number of hospital beds by its population.
df[['geometry', 'population']] = df.region_code.italy_geopop.from_region(return_cols=['geometry', 'population'], population_limits='total')
df['beds_per_capita'] = df.beds / df.population
df = gpd.GeoDataFrame(df)
Note
At this point you may have noticed that we passed region_code to italy_geopop.from_region. That is possible
because the accessor recognizes the kind of data you pass to it, whether it is the full region name or the region code,
e.g. ‘Piemonte’ == 1. The same holds for italy_geopop.from_municipality, which accepts a municipality name
or a municipality code, e.g. ‘Torino’ == 1272, and for italy_geopop.from_province, which accepts not only a province name and a province code
but also a province abbreviation, which is what this tutorial uses, e.g. ‘TO’ == ‘Torino’ == 1.
Moreover, you can pass mixed data types to the accessor.
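One way to picture this kind of input recognition is a lookup that accepts any of the three identifiers and resolves them to the same code. The sketch below is only an illustration and says nothing about how italy-geopop does it internally: PROVINCES and to_province_code are hypothetical names, and the single row comes from the ‘TO’ == ‘Torino’ == 1 example above.

```python
import pandas as pd

# Toy lookup table with one province, taken from the example in the note.
PROVINCES = pd.DataFrame({
    'province_code': [1],
    'province': ['Torino'],
    'province_short': ['TO'],
})


def to_province_code(value):
    """Map a province code, full name or abbreviation to its code (None if unknown)."""
    if isinstance(value, int):
        match = PROVINCES[PROVINCES.province_code == value]
    else:
        text = str(value).strip().lower()
        match = PROVINCES[
            (PROVINCES.province.str.lower() == text)
            | (PROVINCES.province_short.str.lower() == text)
        ]
    return int(match.province_code.iloc[0]) if not match.empty else None


# Mixed identifiers for the same province, plus one unknown value.
mixed = ['TO', 'Torino', 1, 'XX']
codes = [to_province_code(v) for v in mixed]
print(codes)
```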
df.plot(
    'beds_per_capita',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds per capita per region - Italy - 2019')
plt.tight_layout()
This plot shows some differences across regions in the number of hospital beds per person, but the variability is smaller. Also, every region has its own health policy, so the number of hospital beds can be lower while still providing adequate healthcare quality.