👨‍🏫 Guide

Important

Italy-geopop was designed primarily to be used as a pandas.Series accessor. Although some of its functions are useful outside the pandas integration, this guide focuses on the italy_geopop pandas accessor.

Tutorial #1 - Hospital beds in Italy

How many hospital beds are there in Italy?

Let’s suppose you want to answer this question. To do so, you need hospital bed data. Luckily, the Italian Health Ministry has published such data under the Italian Open Data License, so anyone can access it here.

Note

A simplified version of that dataset is available for download and is used throughout this guide. It contains only observations from 2019; only the columns used in the guide were kept, and the labels were translated to English.

👉 open_data_italy_hospitals_beds_2019.csv

Let’s read the CSV file and see what it looks like.

import pandas as pd
df = pd.read_csv('open_data_italy_hospitals_beds_2019.csv', dtype={'beds':float, 'province_short':str}, keep_default_na=False)
df.head(10)

	beds	province_short
0	179.0	TO
1	18.0	TO
2	17.0	TO
3	277.0	TO
4	15.0	TO
5	80.0	TO
6	25.0	TO
7	20.0	TO
8	120.0	TO
9	145.0	TO

Warning

Remember to pass keep_default_na=False when reading the dataset. The province_short column holds province abbreviations, and the abbreviation for Naples is ‘NA’, which pandas treats as a null value by default.
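As a minimal, self-contained sketch of this pitfall, the snippet below parses a tiny in-memory CSV twice, with and without keep_default_na=False. The two rows are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Two illustrative rows (not from the real dataset); the second one uses 'NA'.
csv_text = "beds,province_short\n179.0,TO\n18.0,NA\n"

# Default behaviour: the string 'NA' is parsed as a missing value.
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the abbreviation is kept as the string 'NA'.
kept = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"beds": float, "province_short": str},
    keep_default_na=False,
)

print(default.province_short.isna().tolist())  # [False, True]
print(kept.province_short.tolist())  # ['TO', 'NA']
```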

The dataset contains only two columns: beds and province_short. As you can see, the values in province_short are repeated many times. That’s because every row of the table is a hospital in the original dataset, so we need to group the dataset by province_short and sum the beds column. This gives us a dataset with the number of hospital beds for every province.

df = df.groupby(['province_short'], as_index=False).sum()
df.head()

	province_short	beds
0	AG	751.0
1	AL	1460.0
2	AN	1969.0
3	AO	444.0
4	AP	471.0

At this point we may want to plot the geospatial distribution of hospital beds, and that’s where italy-geopop can help us.

First of all, let’s activate the accessor. This registers the accessor with pandas and lets us take advantage of italy-geopop’s functionality.

from italy_geopop.pandas_extension import pandas_activate
pandas_activate(include_geometry=True, data_year=2022)
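For readers unfamiliar with pandas accessors, here is a rough sketch of how a Series accessor can be registered with pandas’ public register_series_accessor decorator. The accessor name toy_geo and the class below are hypothetical and are not part of italy-geopop, whose internals may differ. The mapping uses the ISTAT region codes for the provinces of Torino (1) and Aosta (2).

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("toy_geo")  # hypothetical accessor name
class ToyGeoAccessor:
    """Toy accessor, for illustration only (not italy-geopop's implementation)."""

    # Province abbreviation -> region code.
    _REGION_CODE = {"TO": 1, "AO": 2}

    def __init__(self, series):
        # pandas passes in the Series the accessor is called on.
        self._series = series

    def region_code(self):
        # Look up every value of the Series in the small mapping above.
        return self._series.map(self._REGION_CODE)


s = pd.Series(["TO", "AO", "TO"])
region_codes = s.toy_geo.region_code().tolist()
print(region_codes)  # [1, 2, 1]
```

Once the class is registered, every pandas.Series exposes the toy_geo attribute, which is the same mechanism that makes the italy_geopop attribute available after activation.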

Hint

You may want the accessor to live only within a specific context. You can do so using pandas_activate_context, which has the same syntax as pandas_activate. This is useful if you want to register the accessor more than once in your code with different initialization options, or if you want to free up memory right after you get the data you need. The trade-off is that italy-geopop must be reinitialized every time you register and use the accessor.

Here you can find the complete API reference documentation for both pandas_activate and pandas_activate_context.
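To illustrate the general idea of an accessor that exists only inside a context, here is a toy sketch built on contextlib and pandas’ register_series_accessor. The names toy_accessor_context and toy_scoped are hypothetical, and this is not how pandas_activate_context is necessarily implemented; refer to the API reference for its actual behaviour.

```python
import contextlib

import pandas as pd


@contextlib.contextmanager
def toy_accessor_context(name="toy_scoped"):
    """Register a toy Series accessor and remove it again on exit (illustration only)."""

    class _Accessor:
        def __init__(self, series):
            self._series = series

        def upper(self):
            return self._series.str.upper()

    # Registration attaches the accessor to the pandas.Series class.
    pd.api.extensions.register_series_accessor(name)(_Accessor)
    try:
        yield
    finally:
        # Removing the class attribute makes the accessor unavailable again.
        delattr(pd.Series, name)


s = pd.Series(["to", "ao"])
with toy_accessor_context():
    inside = s.toy_scoped.upper().tolist()

still_registered = hasattr(pd.Series, "toy_scoped")
print(inside)  # ['TO', 'AO']
print(still_registered)  # False
```

The accessor is rebuilt on every entry into the context, which mirrors the trade-off described above between freeing resources and paying the initialization cost again.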

Now we can get the geospatial data we need to plot the distribution using the italy_geopop.from_province accessor.

df[['geometry']] = df.province_short.italy_geopop.from_province(return_cols=['geometry'])
print('df type:', type(df))
df.head()

This should output

df type: <class 'pandas.core.frame.DataFrame'>

	province_short	beds	geometry
0	AG	751.0	MULTIPOLYGON (((13.663438752935148 37.19338230…
1	AL	1460.0	POLYGON ((8.408748511874723 44.70686568809358,…
2	AN	1969.0	POLYGON ((13.422716165173247 43.62050399042972…
3	AO	444.0	POLYGON ((7.734554907653768 45.92365152529369,…
4	AP	471.0	POLYGON ((13.50521153822172 42.775569771884555…

Now we have geospatial data for every province, but df is still a pandas.DataFrame instance, and we need a geopandas.GeoDataFrame instance to generate the plot.

Note

Note that we assigned the geometry column using double square brackets. That’s because the italy_geopop accessor returns a subset of another dataframe: passing return_cols=['geometry'] (a list) makes it return a 2-dimensional pandas.DataFrame, whereas passing return_cols='geometry' (a string) makes it return a 1-dimensional pandas.Series.
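The difference between the two assignment styles can be reproduced with plain pandas. In the sketch below, lookup is a hypothetical stand-in for the accessor that only selects columns from a small reference table; it is not italy-geopop’s implementation. The population figures are the ones shown in this tutorial’s output tables.

```python
import pandas as pd

# Small reference table indexed by province abbreviation.
reference = pd.DataFrame(
    {"population": [415887.0, 407264.0, 461687.0]},
    index=["AG", "AL", "AN"],
)


def lookup(codes, return_cols):
    """Hypothetical stand-in for the accessor: select columns for the given codes."""
    result = reference.loc[codes, return_cols]
    result.index = codes.index  # align the result with the caller's index
    return result


df = pd.DataFrame({"province_short": ["AG", "AL", "AN"]})

two_d = lookup(df.province_short, ["population"])  # list -> DataFrame
one_d = lookup(df.province_short, "population")  # string -> Series

df[["population_2d"]] = two_d  # double brackets for the 2-dimensional result
df["population_1d"] = one_d  # single brackets for the 1-dimensional result

print(type(two_d).__name__, two_d.shape)  # DataFrame (3, 1)
print(type(one_d).__name__, one_d.shape)  # Series (3,)
```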

import geopandas as gpd
import matplotlib.pyplot as plt

df = gpd.GeoDataFrame(df)

df.plot(
    'beds',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds for province - Italy - 2019')
plt.tight_layout()

_images/hospital_beds_per_province_2019_italy.png

We can see that a few provinces have a very high number of hospital beds, while the others seem to have very few.

There must be some kind of bias: more populous provinces naturally need more beds.

At the very least, we need to adjust the number of beds for each province’s population, and italy-geopop can help with this too.

df['population'] = df.province_short.italy_geopop.from_province(return_cols='population', population_limits='total')
df['beds_per_capita'] = df.beds / df.population
df.head()

	province_short	beds	geometry	population	beds_per_capita
0	AG	751.0	MULTIPOLYGON	415887.0	0.0018057789736154292
1	AL	1460.0	POLYGON	407264.0	0.003584898247819596
2	AN	1969.0	POLYGON	461687.0	0.004264794113761055
3	AO	444.0	POLYGON	123360.0	0.0035992217898832683
4	AP	471.0	POLYGON	202365.0	0.0023274775776443556

Here we created the population column. Note that we assigned it using single square brackets because the output of the italy_geopop accessor was 1-dimensional. Then we created the beds_per_capita column by dividing beds by population, which gives the number of hospital beds per person.
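As a quick check of the arithmetic, the sketch below recomputes beds_per_capita for the first three provinces using the beds and population values shown in the table above. The extra beds_per_1000 column is an assumption added only for readability and is not part of the tutorial’s dataset.

```python
import pandas as pd

# Beds and population copied from the first three rows of the table above.
df = pd.DataFrame(
    {
        "province_short": ["AG", "AL", "AN"],
        "beds": [751.0, 1460.0, 1969.0],
        "population": [415887.0, 407264.0, 461687.0],
    }
)

# Same division as in the tutorial: beds per person.
df["beds_per_capita"] = df.beds / df.population

# Assumption: beds per 1,000 residents, used here only as a more readable unit.
df["beds_per_1000"] = (df.beds_per_capita * 1000).round(2)

print(df[["province_short", "beds_per_1000"]])
```

For Agrigento (AG), 751 / 415887 is about 0.0018 beds per person, or roughly 1.81 beds per 1,000 residents.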

Now we can create the same plot as before but adjusted for province population.

df.plot(
    'beds_per_capita',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds per capita per province - Italy - 2019')
plt.tight_layout()

_images/hospital_beds_per_capita_per_province_2019_italy.png

This representation is certainly more accurate than the previous one, but the distribution still shows high variability despite the adjustment we made.

Healthcare in Italy is largely administered at the regional level, so it may be more accurate to plot the distribution of hospital beds by region instead. Let’s do so.

First, we reload the dataset and get the region_code using the italy_geopop.from_province accessor.

df = pd.read_csv('open_data_italy_hospitals_beds_2019.csv', dtype={'beds':float, 'province_short':str}, keep_default_na=False)
df['region_code'] = df.province_short.italy_geopop.from_province(return_cols='region_code')
df.head()

The expected output is

	beds	province_short	region_code
0	179.0	TO	1
1	18.0	TO	1
2	17.0	TO	1
3	277.0	TO	1
4	15.0	TO	1

Then we use pandas.DataFrame.groupby to group the dataset by region_code and sum the beds.

df = df.groupby(['region_code'], as_index=False)[['beds']].sum()
df.head()

	region_code	beds
0	1	14572.0
1	2	444.0
2	3	34812.0
3	4	3597.0
4	5	15997.0

Next, we get geospatial and population data for the regions using the italy_geopop.from_region accessor, and recalculate the beds_per_capita column by dividing each region’s hospital beds by its population.

df[['geometry', 'population']] = df.region_code.italy_geopop.from_region(return_cols=['geometry', 'population'], population_limits='total')
df['beds_per_capita'] = df.beds / df.population
df = gpd.GeoDataFrame(df)

Note

At this point you may have noticed that we fed region_code to italy_geopop.from_region. This works because the accessor recognizes the kind of data you pass to it, whether it is a region’s full name or its region code, e.g. ‘Piemonte’ == 1. The same holds for italy_geopop.from_municipality, which accepts a municipality name or a municipality code, e.g. ‘Torino’ == 1272, and for italy_geopop.from_province, which accepts not only a province name or province code but also a province abbreviation (which is what this tutorial uses), e.g. ‘TO’ == ‘Torino’ == 1. You can even pass mixed data types to the accessor.
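One way to picture this behaviour is a lookup that maps every known identifier of a province to the same code. The sketch below is purely illustrative: the lookup structure and the to_province_code helper are hypothetical and do not reflect italy-geopop’s actual implementation. Only the Torino entry is taken from this guide.

```python
import pandas as pd

# Only the Torino entry comes from the tutorial; the structure is hypothetical.
PROVINCES = [(1, "Torino", "TO")]  # (province_code, province, province_short)

# Every identifier (code, full name, abbreviation) points to the same code.
LOOKUP = {}
for code, name, short in PROVINCES:
    for key in (code, name, short):
        LOOKUP[key] = code


def to_province_code(value):
    """Map a code, full name, or abbreviation to the province code (None if unknown)."""
    return LOOKUP.get(value)


# Mixed identifier types in a single Series.
mixed = pd.Series(["TO", "Torino", 1], dtype=object)
codes = mixed.map(to_province_code).tolist()
print(codes)  # [1, 1, 1]
```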

df.plot(
    'beds_per_capita',
    cmap='OrRd',
    legend=True
)
plt.title('Hospital beds per capita per region - Italy - 2019')
plt.tight_layout()

_images/hospital_beds_per_capita_per_region_2019_italy.png

This plot shows some differences across regions in the number of hospital beds per person, but the variability is smaller. In addition, every region has its own health policy, so the number of hospital beds can be lower while still providing adequate healthcare quality.