Noise pollution data cleanup
Integrating open GIS data using Python
The Austrian government provides a great source of open data on noise pollution. However, it can only easily be explored through their map-based visualization. If you want to download the data and perform custom analytics, some data cleaning is required, as you are presented with a hierarchy of files.
Getting started
The first step is to download the data from Lärminfo.
After unzipping, multiple categories are available, represented as a hierarchy of folders per state:
├── flug_2017_noise
│   ├── INSPIRE_2017_FLUGHAEFEN_24H_ZONEN_KT
│   ...
│   └── INSPIRE_2017_FLUGHAEFEN_NACHT_ZONEN_WI
├── industrie_2017_noise
│   ├── INSPIRE_2017_IPPC_24H_ZONEN_NO
│   ...
│   └── INSPIRE_2017_IPPC_NACHT_ZONEN_WI
├── schiene_2017_noise
│   ├── INSPIRE_2017_SCHIENE_24H_ZONEN_BG
│   ...
│   └── INSPIRE_2017_SCHIENE_NACHT_ZONEN_WI
└── strasse_2017_noise
    ├── INSPIRE_2017_STRASSE_24H_ZONEN_BG
    ...
    └── INSPIRE_2017_STRASSE_NACHT_ZONEN_WI
Using Python, the data cleaning can easily be accomplished.
The folder hierarchy needs to be parsed recursively. Note that a generator is constructed, not a list. This is more memory-efficient, as only the objects actually required during processing need to be held in memory.
from pathlib import Path

def iter_dirs(directory_in_str, glob):
    # recursively yield all paths below the directory that match the glob pattern
    pathlist = Path(directory_in_str).glob(glob)
    for path in pathlist:
        yield str(path)
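As a quick illustration of the lazy behaviour (the directory name here is simply one of the folders from the tree above):

# nothing is listed from disk until the generator is actually consumed
paths = iter_dirs('strasse_2017_noise', '**/*.shp')
first_shapefile = next(paths)  # resolves only the first matching path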
Each path looks like 2017_STRASSE_NACHT_ZONEN_TI.shp and contains some attribute values. These are required later on to differentiate between the different layers and need to be retained. parse_attributes_from_path will extract them.
def parse_attributes_from_path(path):
    # the file name encodes year, kind, timing and state, e.g. 2017_STRASSE_NACHT_ZONEN_TI.shp
    file_name = path.split('/')[-1]
    elements = file_name.split('_')
    result = {}
    result['year'] = elements[0]
    result['kind'] = elements[1]
    result['timing'] = elements[2]
    result['state'] = elements[-1].split('.')[0]
    return result
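The loop below also calls a helper add_columns_to_df that is not shown in the original snippets. A minimal sketch, assuming it simply attaches each parsed attribute as a constant column:

def add_columns_to_df(df, attributes):
    # add every attribute parsed from the file name as a constant column
    for key, value in attributes.items():
        df[key] = value
    return df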
Finally, you can start to use the functions defined above and loop over all shapefiles:
import geopandas as gp
from tqdm import tqdm

# c is a small config module holding BASE_PATH (not shown here)
paths = iter_dirs(c.BASE_PATH, '**/*.shp')
tmp_appended_data = []
for path in tqdm(paths):
    print(path)
    attributes_from_filename = parse_attributes_from_path(path)
    df = gp.read_file(path)
    df = add_columns_to_df(df, attributes_from_filename)
    tmp_appended_data.append(df)
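The snippet above only collects the per-file frames; the concatenation itself is not shown in the original. Presumably it boils down to something like this (the variable name and ignore_index are assumptions):

import pandas as pd

# combine all per-shapefile frames into a single GeoDataFrame
df = gp.GeoDataFrame(pd.concat(tmp_appended_data, ignore_index=True))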
This will extract the attributes from all file paths and concatenate the results. You should end up with something similar to:
DB_LO ZUST geometry kind state timing year
0 45 ALLE POLYGON ((252412.115130411 374722.80843502, 25... STRASSE TI NACHT 2017
1 45 ALLE POLYGON ((250225.579655268 374848.450664676, 2... STRASSE TI NACHT 2017
2 45 ALLE POLYGON ((257144.687224785 375790.285411876, 2... STRASSE TI NACHT 2017
3 45 ALLE POLYGON ((252474.722654981 374521.47906019, 25... STRASSE TI NACHT 2017
4 45 ALLE POLYGON ((252519.897499734 376489.588762502, 2... STRASSE TI NACHT 2017
...
Summary
With a few snippets of Python, obtaining a neatly cleaned dataset feels almost too easy. I was impressed by how quickly the data is read, processed and concatenated, but found the last step of writing the result to disk rather slow.
NOTE: I decided to output gzip-compressed CSV files. This is not ideal, but it is easy to generate and flexible, i.e. it allows multiple geometry types in the same column (POLYGON and MULTIPOLYGON). GeoPackage files would be better suited, though: they can contain, for example, coordinate reference system information or a spatial index, but they do not allow multiple geometry types in the same column.
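A minimal sketch of that export, assuming the combined frame is called df and using an arbitrary file name; pandas writes the geometry column as WKT text:

# write the concatenated frame as a gzip-compressed CSV
df.to_csv('noise_2017.csv.gz', compression='gzip', index=False)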
EDIT: In the meantime, the transformation to a common geometry type has been implemented:
import shapely
from shapely.geometry import MultiPolygon

def convert_polygon_to_multipolygon(raw_geometry):
    # wrap plain polygons so that all rows share the MULTIPOLYGON geometry type
    if isinstance(raw_geometry, shapely.geometry.polygon.Polygon):
        return MultiPolygon([raw_geometry])
    else:
        # we currently only have MULTIPOLYGON and POLYGON, so a plain else is good enough
        return raw_geometry

df.geometry = df.geometry.apply(convert_polygon_to_multipolygon)
So, GeoPackages are now supported as well. They are about 2.7 GB in size, though.
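For reference, the GeoPackage export itself is a one-liner in GeoPandas (the file name is an assumption):

# write the frame with unified MULTIPOLYGON geometries to a GeoPackage
df.to_file('noise_2017.gpkg', driver='GPKG')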