Kedro hands-on: Build your own demographics atlas. Pt. 2: building footprints classification | by Trung Nguyen | Jan, 2021


1 Completeness and Correctness
- 1.1 Completeness
- 1.2 Correctness & Tobler’s law
2 Building footprints classification pipeline
- 2.1 Generate building features
- 2.2 Building blocks segmentation with HDBSCAN
- 2.3 Building types classification with XGBoost
3 Wrapup

1 Completeness and Correctness

Since OSM (OpenStreetMap) is a VGI (volunteered geographic information) platform, buildings are traced and tagged voluntarily by the community. This results in 2 main challenges:

  1. Completeness: the area’s set of building footprints is fully covered in the data
  2. Correctness: building types are tagged accurately

In this series, the focus is on “Correctness”. Thus, we will NOT go into “Completeness” too much after the introduction.

The 2 challenges. Image by Author

1.1 Completeness

In 2017, a McGill study compared the completeness of OSM with official government data and concluded that the OSM data is globally 83% complete (with the street networks of most Western countries fully mapped).

However, assessing the completeness of building footprints is another story. There are several lines of research in this field: from comparing OSM with official government geodata [1,2] to extracting footprints from satellite/UAV images with deep learning [3] (such as the SpaceNet 1 & 2 building detection competitions). But as mentioned above, this article won’t discuss the solutions to this challenge for now.

The nice thing about a kedro pipeline (or any machine learning workflow) is that you can revisit this challenge at a later time simply by tinkering with nodes and pipelines.

On Medium, there are many tutorials and discussions if you put “building footprints” in the search box. Some notable articles are by Adam Van Etten and Todd Stavish on the SpaceNet challenge, and by Romain Candy and Lucas Suryana on building an efficient “footprints detector”.

1.2 Correctness and Tobler’s law

Building footprints from OSM are tagged voluntarily by communities. In some areas, the tagging rate can reach up to 100%, but in others, many buildings are left untagged. After our “naive classification” in the second pipeline (data_preparation), there are still large chunks of footprints set as “to_be_classified”.

Results from the second pipeline: there are still many UNTAGGED buildings. Image by Author

According to Tobler’s first law of geography [7], “everything is related to everything else, but near things are more related than distant things.” This is the foundation of the fundamental concepts of spatial dependence and spatial autocorrelation.

“Buildings close to each other usually belong to the same type”

Translating that into something related to our challenge, we can say that buildings that are close to each other tend to belong to the same type. Thus, building footprints in a block can safely be assumed to have high similarities [8].

2 Building footprints classification pipeline

Building classification pipeline in a nutshell. Image by Author

The process of this pipeline can be summarized into three steps:

  1. Generate additional features for clustering (total area, rectangularity, polygon turning functions, proximity distance-matrix)
  2. Group footprints by proximity into building blocks with the HDBSCAN algorithm
  3. Classify footprints into building types using 3 features: shape, size and building block. Compare results from multiple machine learning algorithms.

Same as in the previous pipelines, each step above will be converted into a node and then plugged into the main pipe with inputs. Since we will iterate through each area in Germany (~10k municipalities), there’s no need to define “outputs”. Each node is designed to scan for buildings data in the input folder, perform the operations directly on the file and save the output to the “output path” independently.
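To make the “scan the input folder, write to the output path” pattern more concrete, here is a minimal sketch. The function name, the JSON file layout and the placeholder operation are all assumptions for illustration, not the code used in the project.

```python
import json
import tempfile
from pathlib import Path


def process_buildings_folder(input_dir: str, output_dir: str) -> None:
    """Scan input_dir for per-area files, transform each one and write the
    result to output_dir. Nothing is returned, so a kedro node wrapping this
    function would not need to declare any outputs."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.json")):
        footprints = json.loads(path.read_text())
        # Placeholder operation (assumption): count the footprints per area.
        summary = {"area": path.stem, "n_footprints": len(footprints)}
        (out / path.name).write_text(json.dumps(summary))


# Demo on a temporary folder holding two made-up municipality files.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    (src / "01001000.json").write_text(json.dumps([{"id": 1}, {"id": 2}]))
    (src / "01002000.json").write_text(json.dumps([{"id": 3}]))
    process_buildings_folder(str(src), str(dst))
    written = sorted(p.name for p in dst.iterdir())
    first_summary = json.loads((dst / "01001000.json").read_text())
```

In kedro, a function like this could be registered roughly as node(process_buildings_folder, inputs=["params:input_dir", "params:output_dir"], outputs=None), where the two parameter names are hypothetical.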

One learning from the data_preparation phase is that many municipalities have zero residential building footprints (nearly 2000 municipalities with 100% differences between official and OSM data).

To compensate for areas that have too few “classified” footprints, I decided to aggregate the dataset from municipality-level (~10k groups) to district-level (400 groups).

Meaning of an AGS key (municipality-key). Image by Author
Function to aggregate and get district-level building footprints data. Image by Author
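Since the aggregation function only survives as an image, here is a small sketch of the idea, not the original code. It relies on the structure of the 8-digit AGS key (2 digits state, 1 digit government region, 2 digits district, 3 digits municipality), so the first 5 digits identify the district. The function names and the example keys are illustrative.

```python
from collections import defaultdict


def district_key(ags: str) -> str:
    """Return the 5-digit district part of an 8-digit AGS municipality key."""
    if len(ags) != 8 or not ags.isdigit():
        raise ValueError(f"expected an 8-digit AGS key, got {ags!r}")
    return ags[:5]


def group_by_district(municipality_keys):
    """Group municipality keys by the district they belong to."""
    districts = defaultdict(list)
    for ags in municipality_keys:
        districts[district_key(ags)].append(ags)
    return dict(districts)


# Illustrative keys: two municipalities in one district, one in another.
groups = group_by_district(["01051001", "01051002", "09162000"])
```

The same grouping key can then be used to concatenate the municipality-level footprint files into one dataset per district.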

2.1 Generate building footprints features

Features of a building footprint. Image by Author

To prepare for the building classification task, we need to generate additional features from the existing elements:

2.1.1 Size

The surface area is basically the area of the footprint polygon. In the previous (second) pipeline, data_preparation, the consolidated dataset from GeoFabrik & Overpass API gave us a full view of what we could harvest from OSM. One of the elements there is Geometry, a collection of points that construct the polygon shape of the building footprint. Using that, we can compute the size of all available building footprints.

2.1.2 Usable Size

The other element we can utilize is “building_levels”. We can calculate the total usable size by:

Usable size = building_levels * size (surface area)
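As a minimal sketch of the two size features, the area of a footprint can be computed from its vertices with the shoelace formula. This assumes the coordinates are already in a projected, metric CRS (raw latitude/longitude would need reprojecting first). The function names and the default of 1 level are assumptions of this example.

```python
def polygon_area(coords):
    """Area of a simple polygon from its (x, y) vertices (shoelace formula).
    With coordinates in meters the result is in square meters."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def usable_size(coords, building_levels=1):
    """Usable size = building_levels * size (surface area).
    Treating a missing level count as 1 is an assumption of this sketch."""
    return building_levels * polygon_area(coords)


footprint = [(0, 0), (10, 0), (10, 8), (0, 8)]  # a 10 m x 8 m rectangle
size = polygon_area(footprint)
usable = usable_size(footprint, building_levels=3)
```

For the rectangle above, the size is 80 square meters and the usable size with 3 levels is 240 square meters.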

2.1.3 Shape

In morphological analysis, a standard way to measure the shape of a building is the rectangularity of its footprint polygon. The standard approach to calculate rectangularity is to take the ratio of the size (footprint area) to the area of its minimum bounding box (MBB).

Rectangularity = Size / MBB area
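The ratio can be sketched in plain Python as follows. This example assumes the MBB is the smallest rotated rectangle around the footprint (an axis-aligned box would be a simpler variant) and that coordinates are in a metric CRS. It uses the fact that one side of the smallest enclosing rectangle lies along an edge of the convex hull. All function names are illustrative.

```python
import math


def polygon_area(coords):
    """Shoelace formula for the area of a simple polygon."""
    n = len(coords)
    s = sum(coords[i][0] * coords[(i + 1) % n][1]
            - coords[(i + 1) % n][0] * coords[i][1] for i in range(n))
    return abs(s) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_bounding_box_area(coords):
    """Area of the smallest rotated rectangle enclosing the polygon."""
    hull = convex_hull(coords)
    best = math.inf
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        angle = math.atan2(y2 - y1, x2 - x1)
        # Rotate the hull so that the current edge is parallel to the x-axis.
        cos_a, sin_a = math.cos(-angle), math.sin(-angle)
        xs = [x * cos_a - y * sin_a for x, y in hull]
        ys = [x * sin_a + y * cos_a for x, y in hull]
        best = min(best, (max(xs) - min(xs)) * (max(ys) - min(ys)))
    return best


def rectangularity(coords):
    return polygon_area(coords) / min_bounding_box_area(coords)


rect = rectangularity([(0, 0), (10, 0), (10, 8), (0, 8)])
l_shape = rectangularity([(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)])
```

A perfect rectangle scores 1, while the L-shaped footprint fills only 64 of the 100 square meters of its bounding box and scores about 0.64.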

2.2 Building blocks segmentation with (H)DBSCAN

In the paper “Proximity-based grouping of buildings in urban blocks” [4], the authors used 2 different approaches to evaluate four algorithms for clustering buildings into urban blocks. It concludes that DBSCAN (density-based spatial clustering of applications with noise) and ASCDT (an adaptive spatial clustering algorithm based on Delaunay triangulation) performed best, and that they are not hard to implement. Thus, in this project, I used HDBSCAN, an extended version of DBSCAN, to cluster our OSM building footprints.

2.2.1 Reason not to use K-means

When thinking of clustering, k-means usually pops right up as the first answer. As an algorithm, k-means is designed to minimize variance.

K-means is considered a standard algorithm for clustering; however, that does not mean every clustering problem can use it well.

Since the data is in latitude/longitude format (==> NON-LINEAR), the worst case is that k-means will never converge, even with Haversine distance: the centroid update is an arithmetic mean, which minimizes squared Euclidean distance rather than geodetic distance. To circumvent this problem, we should use an algorithm that can handle arbitrary distance functions, in particular geodetic distance functions, such as hierarchical clustering, PAM, CLARA, OPTICS and DBSCAN [5].

K-means vs DBSCAN by NHSipster


HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander [6]. It extends DBSCAN by converting it into a hierarchical clustering algorithm and then using a technique to extract a flat clustering based on the stability of clusters. Since the clusters formed are not radius-based, they can be non-circular. Hence, it is more suitable for geo-analytics, considering that footprint boundaries are not circular. This is the algorithm we will use to group building footprints.

Another good read for understanding HDBSCAN better is the article by Pepe Berba.

There are 3 main parameters for HDBSCAN we should focus on:

min_cluster_size = the minimum number of footprints that can be assigned to a "block"

cluster_selection_epsilon = the same as the epsilon parameter of DBSCAN. Two points are considered neighbors if the distance between them is below the threshold epsilon.

min_samples = the minimum number of neighbors a given point should have in order to be classified as a core point. It’s important to note that the point itself is included in the minimum number of samples.
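The roles of epsilon and min_samples can be illustrated with a DBSCAN-style core point test. This is a simplified sketch on made-up footprint centroids in metric coordinates, not the internals of the hdbscan library. The function names are illustrative, and a distance equal to epsilon is treated as “within” here.

```python
import math


def neighbors_within(points, index, eps):
    """Indices of all points within eps of points[index], itself included."""
    px, py = points[index]
    return [j for j, (x, y) in enumerate(points)
            if math.hypot(x - px, y - py) <= eps]


def is_core_point(points, index, eps, min_samples):
    """A point is a core point if its eps-neighborhood (counting the point
    itself) holds at least min_samples points."""
    return len(neighbors_within(points, index, eps)) >= min_samples


# Three centroids spaced 2 m apart and one isolated centroid.
centroids = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (50.0, 50.0)]
core_flags = [is_core_point(centroids, i, eps=3.0, min_samples=2)
              for i in range(len(centroids))]
```

With eps = 3 and min_samples = 2, the three nearby centroids are core points, while the isolated one only has itself in its neighborhood and is not.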

Besides that, we also need to set the metric for the proximity matrix. Since we’re dealing with geospatial data, instead of Euclidean distance it’s better to use Haversine distance, which measures the distance between 2 points on a sphere given their latitude and longitude:

metric = 'haversine',  # haversine distance on the Earth's surface

We will follow the literature [4] with ε = 3 (3 meters distance between footprints) and MinPts = 2.
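One practical detail worth sketching: haversine metrics as implemented in scikit-learn work in radians, so a 3 meter epsilon has to be divided by the Earth’s radius before it is passed on. The snippet below is an illustration under that assumption. The parameter dictionary is hypothetical, and how MinPts = 2 maps onto min_cluster_size and min_samples is my guess, since the article does not say.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


epsilon_m = 3                                # 3 meters between footprints
epsilon_rad = epsilon_m / EARTH_RADIUS_M     # same threshold in radians

# Hypothetical parameter set (assumption, see the note above).
hdbscan_params = {
    "min_cluster_size": 2,
    "min_samples": 2,
    "cluster_selection_epsilon": epsilon_rad,
    "metric": "haversine",
}

# Sanity checks: one degree of latitude, and two points about 2.2 m apart.
one_degree = haversine_m(52.0, 13.0, 53.0, 13.0)
close_enough = haversine_m(52.0, 13.0, 52.00002, 13.0) <= epsilon_m
```

One degree of latitude comes out at roughly 111 km, which is a quick way to check that the distance function behaves as expected.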

Building blocks for a municipality. Image by Author

When the node is run in kedro, it assembles data from municipality-level, performs HDBSCAN for each district and saves the outputs in the 04_feature folder.

2.3 Building types classification with XGBoost

Now that we have created sufficient building footprint features, let’s build a classifier for building types. Same as before, this node will iterate through the building data folder, train a machine learning model on district-level data and finally apply the model to unclassified footprints. The 3 features we’re going to use are shape (rectangularity), size (surface area) and building block (from HDBSCAN).

To choose the algorithm for the machine learning model, I took out 10 districts with more than 50k footprints per district and trained different models on them.

Since we only care about the “residential” type, the problem was converted from a multi-class classification (multiple types of footprints) into non-residential vs residential.
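The conversion to a binary problem and the split between “train on tagged footprints” and “predict the untagged ones” can be sketched as below. The rows are made up, and the type names other than “residential” and “to_be_classified” are assumptions. Any classifier could then be fitted on train_X / train_y and applied to predict_X.

```python
def to_binary_label(building_type):
    """Map a building type to 1 (residential) or 0 (non-residential).
    Returns None for footprints that still need a prediction."""
    if building_type == "to_be_classified":
        return None
    return 1 if building_type == "residential" else 0


# Made-up rows: (rectangularity, size in m^2, building block id, type)
rows = [
    (0.95, 120.0, 7, "residential"),
    (0.60, 900.0, 3, "industrial"),
    (0.91, 110.0, 7, "to_be_classified"),
    (0.88, 450.0, 3, "commercial"),
]

train_X, train_y, predict_X = [], [], []
for shape, size, block, building_type in rows:
    label = to_binary_label(building_type)
    if label is None:
        # Untagged footprint: keep its features for the prediction step.
        predict_X.append((shape, size, block))
    else:
        train_X.append((shape, size, block))
        train_y.append(label)
```

Here three tagged footprints end up in the training set with labels [1, 0, 0], and the single untagged footprint is kept aside for prediction.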

The comparison points out that using XGBoost would yield the best results:

Comparison of classification models on 10 districts with more than 50,000 footprints. Image by Author

The simulation and the chart were produced in a notebook; this part was not put into the final production kedro pipeline.

For the parameters, you can try to optimize them for each district, but since we’re training and applying models on more than 400 districts, I decided to use the base parameters.

Vishal Morde has written an excellent post on XGBoost.

The XGBoost classifier with and without the “building block” feature was also tested to show the difference in accuracy.

XGBoost without “building_block” as a feature
XGBoost with “building_block” as a feature

This step allows us to classify more “to_be_classified” footprints as residential with high confidence. The comparison can be seen in these 2 diagrams. The area with the “blue” legend (to_be_classified) dropped considerably after applying XGBoost.

Before XGBoost classifier. Image by Author
After XGBoost classifier. Image by Author

3 Wrapup

In this article, I’ve guided you through the third pipeline, building_classification, which groups buildings into blocks and classifies footprints into residential types. Now that we have finished running the first to third pipelines, the dataset is ready to be integrated with demographics data. In the next article, the final pipeline, residents_allocation, will “assign” people to residential buildings and visualize the result so we can have the complete demographics atlas.

[1] Helbich, M. and Amelunxen, C. (2012). Comparative Spatial Analysis of Positional Accuracy of OpenStreetMap and Proprietary Geodata. [online] ResearchGate. [Accessed 29 Dec. 2020].

[2] Brovelli, M.A., Minghini, M., Molinari, M.E. and Zamboni, G. (2016). Positional accuracy assessment of the OpenStreetMap buildings layer through automatic homologous pairs detection: the method and a case study. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, [online] XLI-B2, pp.615–620. [Accessed 29 Dec. 2020].

[3] Liu, P., Liu, X., Liu, M., Shi, Q., Yang, J., Xu, X. and Zhang, Y. (2019). Building Footprint Extraction from High-Resolution Images via Spatial Residual Inception Convolutional Neural Network. Remote Sensing, [online] 11(7), p.830. [Accessed 29 Dec. 2020].

[4] Cetinkaya, S., Basaraner, M. and Burghardt, D. (2015). Proximity-based grouping of buildings in urban blocks: A comparison of four algorithms. [online] ResearchGate. [Accessed 2 Jan. 2021].

[5] Ester, M. (2019). A density-based algorithm for discovering clusters in large spatial databases with noise. [online] [Accessed 2 Jan. 2021].

[6] McInnes, L., Healy, J. and Astels, S. (2017). hdbscan: Hierarchical density based clustering. [online] ResearchGate. [Accessed 6 Jan. 2021].

[7] Tobler, W.R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, [online] 46, pp.234–240. [Accessed 14 Jan. 2021].

[8] Fan, H., Zipf, A. and Fu, Q. (2014). Estimation of Building Types on OpenStreetMap Based on Urban Morphology Analysis. [online] ResearchGate. [Accessed 14 Jan. 2021].

