Change Detection of Structures in Panchromatic Imagery

Shairoz Sohail
Published in GeoAI
17 min read · Feb 17, 2021


Fine-tuning a pre-trained building footprint deep learning model to automatically count new buildings in temporal panchromatic imagery

Change detection is an extremely important area of remote sensing. Given imagery or radar scans of a location over time, it helps us understand what meaningful semantic changes occurred in categories of interest. Often those categories include buildings and structures: they are the primary indicators of urban expansion and the size of population centers. However, classical change detection methods like statistical subtraction often fail to accurately detect and isolate new structures, because optical data (such as panchromatic imagery) is noisy as a result of changing cloud cover, change outside the categories of interest (such as vegetation change), different sensors, or registration errors. One way around this (when you have sufficient examples for the category of interest) is to build a deep learning model for detecting the specific feature you want to track change in, which in our case is buildings. Since buildings are such a ubiquitous object in remote sensing, we can start with a powerful pre-trained model and simply fine-tune it for our specific flavor of data: panchromatic imagery. Once we have such a feature extractor, change detection comes down to looking at the difference in detected features between time periods.

For this workflow we will be utilizing some of the powerful new deep learning tools present in the ArcGIS ecosystem (specifically in ArcGIS Pro 2.6+), which will help us perform rapid inferencing, labeling, and training for deep learning. Because of these tools this guide can be used by both data scientists and GIS professionals.

Below are the areas we'll be going over:

  • Panchromatic Imagery
  • Data Pre-processing — Pansharpening vs. Replication
  • Setting Up a Test Area
  • Testing the Pre-trained Building Footprints Model
  • Fine-tuning Our Model with New Data
  • Assessing Results and Post-processing
  • Calculating Change
  • Conclusion

Panchromatic Imagery

Panchromatic imagery is not only widely popular, it's extremely practical for most purposes. Folks outside remote sensing circles might have trouble understanding why it can be cheaper or easier to collect black-and-white imagery at a high spatial resolution; after all, most people's experience with imagery that looks like this is as a post-processing filter applied to RGB imagery.

The simple explanation for why panchromatic imagery can be collected at a higher resolution than multispectral is that when you don't have to split the signal into multiple wavelength bands (spectral resolution), the sensor can collect more visible light from a given ground point. Panchromatic imagery is sampled across a wide range of the visible spectrum, so the sensor has plenty of energy to work with over a smaller ground area. In general, there is a trade-off between spectral and spatial resolution.

The panchromatic imagery we will be using is collected by DigitalGlobe at a resolution of 0.5 meters. This is enough to accurately identify the footprints of structures, as well as roads and large vegetation.

Data Pre-processing: Pansharpening vs. Replication

The deep learning model we will fine-tune was trained against 3-band RGB imagery at the same resolution (0.5 m); unfortunately, our panchromatic imagery is only a single band. We have two main routes from here that don't require modifying the architecture of the model:

  1. Use pan-sharpening with some accompanying multispectral data to add additional spectral bands to the panchromatic imagery
  2. Replicate the panchromatic band 3 times

Let's try the first approach (if you don't have paired multispectral imagery, you can skip it and proceed to the second). We won't get into pansharpening in depth here; in general, it is the process of generating additional spectral bands for high-resolution panchromatic imagery using a lower-resolution multispectral image. It relies on aligning the two images and using interpolation to transfer the multispectral pixel values onto the finer panchromatic grid. We have companion multispectral imagery from DigitalGlobe at a much coarser resolution of 2 m.
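To make the idea concrete, here is a minimal NumPy sketch of a weighted (Brovey-style) pansharpen. It assumes the multispectral bands have already been resampled onto the panchromatic grid; the function name and arrays are illustrative, and the Pro tools implement several more sophisticated variants (IHS, Gram-Schmidt, and others).

```python
import numpy as np

def brovey_pansharpen(ms, pan, weights):
    """Weighted (Brovey-style) pansharpening sketch.

    ms      : (bands, H, W) multispectral array already resampled to the
              panchromatic grid (e.g. with bilinear interpolation).
    pan     : (H, W) panchromatic array at full resolution.
    weights : per-band weights (ideally summing to 1), such as those
              returned by the Compute Pansharpen Weights tool.
    """
    w = np.asarray(weights, dtype=np.float64).reshape(-1, 1, 1)
    # Synthetic low-resolution intensity built from the multispectral bands.
    intensity = (ms * w).sum(axis=0)
    # Rescale every band so the combined intensity matches the pan detail.
    ratio = pan / np.maximum(intensity, 1e-6)
    return ms * ratio
```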

Bringing both sets of data into ArcGIS Pro (by importing the folder containing them and adding them to the current map) we can first use the flicker tool to see the resolution difference visually:

Panchromatic @ 0.5m vs. MSI @ 2m — DigitalGlobe

Clearly, the multispectral imagery is too low resolution to accurately discern building footprints on its own. However, we can still use it to pansharpen. To do this, we'll need two separate geoprocessing tools. The first, Compute Pansharpen Weights, returns an optimal weight for each spectral band to use in the pansharpening process.

After running the tool (which should only take a few seconds), the optimal weights are printed to the output window. Record these for use with the second tool.

The second tool, Create Pansharpened Raster Dataset, performs the actual pansharpening. It accepts our input (multispectral) raster, the panchromatic image, a color channel mapping, and the set of pansharpening weights we calculated earlier. Make sure your color-band mapping is correct before entering the pansharpening weights!

After running this tool we should have our pansharpened image, at 0.5m resolution (same as the panchromatic). Here’s what it looks like:

Panchromatic vs. Pansharpened image

This gives us enough spectral resolution (3 bands) and spatial resolution (0.5 m) to run our pretrained building footprint model. We will set this aside and also create the replicated panchromatic imagery so we can run a comparison test of detection performance. We can easily do this via the Composite Bands raster function.

We simply add the panchromatic raster 3 times to indicate we want to replicate the same panchromatic band for the output image. The output will not look any different than the input, but should contain 3 bands (this can be checked by right-clicking on the layer and going to properties).
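For scripted runs, the equivalent geoprocessing call from Python is a one-liner; the paths below are placeholders:

```python
import arcpy

# Stack the single panchromatic band three times so the output raster matches
# the 3-band input the pretrained model expects. Paths are hypothetical.
pan = r"C:\data\pan_05m.tif"
out = r"C:\data\pan_05m_3band.tif"
arcpy.management.CompositeBands([pan, pan, pan], out)
```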

The last thing we're going to check before moving on to testing is the bit depth of the images. This can be found under the properties of each image; if an image is 16-bit, we can use the Copy Raster tool and set the "Pixel Type" property to 8 Bit Unsigned (make sure to check the "Scale Pixel Values" option).
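For intuition, the rescale that the "Scale Pixel Values" option performs is a simple linear stretch; here is an illustrative NumPy sketch (the actual tool also handles raster statistics and NoData for you):

```python
import numpy as np

def to_8bit(band, lo=None, hi=None):
    """Linearly stretch a 16-bit band to 8-bit unsigned (0-255).

    Roughly what "Scale Pixel Values" does in Copy Raster, as opposed to
    simply truncating every value above 255.
    """
    band = band.astype(np.float64)
    lo = band.min() if lo is None else lo
    hi = band.max() if hi is None else hi
    scaled = (band - lo) / max(hi - lo, 1.0) * 255.0
    return np.clip(scaled, 0, 255).astype(np.uint8)
```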

We now have two separate 3 channel, 0.5m spatial resolution rasters (with 8 bit unsigned data types) ready to test the pretrained building footprints model against.

Setting Up a Test Area

Before we download and run the building footprints model we need to be good scientists and think about our study design. Generally, we want our training and test data to be spatially (and sometimes temporally) distinct, and we'd like to select a test area where we have ground truth so we can compute metrics. Testing deep learning models comes down to two fundamental properties: accuracy (with respect to a metric that captures the value we care about) and generalization (whether the model extends to all relevant use cases it will be run on). Since the main business value from this model will come from accurate building counts, we will use simple detection accuracy as the metric. Most uses of this model, however, will be related to extracting good building footprints, in which case we'd need to compute IoU (intersection over union). From a generalization standpoint, the model was trained on completely different imagery from what we have (Esri World Imagery), so any imagery it is run against is a test of generalization capability. The last things we need to do are to (1) designate a testing area and (2) generate ground truth labels for that area.
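For reference, IoU between a predicted and a ground-truth footprint takes only a few lines with shapely; this is a hypothetical helper outside the ArcGIS workflow:

```python
from shapely.geometry import Polygon

def footprint_iou(pred: Polygon, truth: Polygon) -> float:
    """Intersection-over-union between a predicted and a ground-truth footprint."""
    union = pred.union(truth).area
    return pred.intersection(truth).area / union if union > 0 else 0.0

# Example: a detection shifted one meter off a 10 m x 10 m footprint.
truth = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
pred = Polygon([(1, 0), (11, 0), (11, 10), (1, 10)])
print(round(footprint_iou(pred, truth), 3))  # ~0.818
```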

We're going to create a feature class to mask off the testing area and a point feature class to mark the unique structures within it, making sure the points are accurate to both sets of imagery we will be testing (the replicated panchromatic and the pansharpened).

Now that we have a designated study area for testing let’s download the model and run it!

Testing the Pre-trained Building Footprints Model

Esri provides a wonderful pre-trained model for detecting building footprints as part of Living Atlas. This model was trained on 3 band imagery at a spatial resolution of 0.5m using the Mask-RCNN model architecture.

It was trained against building footprints in the USA. Unlike architectures such as YOLO, which output bounding boxes, Mask R-CNN can learn to directly produce polygons delineating the boundaries of objects of interest. In our case, it will output a polygon feature layer with a unique polygon for each detected building footprint. Let's test its out-of-the-box performance on the two 3-band rasters we created.

First, we download the model and add the folder containing it as a folder connection within our project. Then we navigate to the Detect Objects Using Deep Learning tool (available in Pro 2.5+), where we can plug in the .dlpk file from the downloaded model. Once we've done that (and there are no environment issues), the following options for padding, batch_size, threshold, and return_bboxes should pop up:

These parameters can be confusing if you're not familiar with deep learning terminology; here's what they mean in plain English:

Padding: Since the model is run against image chips of a pre-defined size, this controls how much additional context is generated around each chip (in pixels). Increasing this might help with getting rid of edge detection artifacts (like banding) or disconnected/segmented detections.

Batch_size: How many chips to hold in memory and run detections against at a time. The minimum is 1, and in general it should be as high as your CPU or GPU setup will allow, since it directly affects how quickly the inference job finishes. Note that if you receive errors about RAM or CUDA memory, you need to restart the tool with a smaller batch size. If you have multiple GPUs you can set their IDs in the neighboring "Environments" tab.

Threshold: Each detection from the model comes with a confidence value indicating how likely it is to be the object of interest; this parameter is the minimum confidence used to filter detections. It ranges from 0 to 1: higher values help filter out false detections, while lower values typically produce more detections.

Return_bboxes: Whether to return the associated minimum bounding box around each detection. This is either True or False.

Non-Maximum Suppression: A technique that removes multiple overlapping detections of the same object, keeping only the highest-confidence one. Checking this will help with accurate feature counting, but may cost extra computation time.
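For intuition, here is a minimal sketch of what greedy non-maximum suppression does under the hood; this is illustrative NumPy code, not the tool's actual implementation:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes  : (N, 4) array of [xmin, ymin, xmax, ymax]
    scores : (N,) confidence values
    Returns the indices of the detections to keep.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the top-scoring box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too strongly.
        order = rest[iou <= iou_threshold]
    return keep
```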

The default values are fine; however, we lower the threshold a bit to 0.5, which lets us pick up more low-confidence detections and learn more about model performance. We first select our pansharpened raster to test against, name the output layer appropriately, and then move to the "Environments" tab and set our processing extent to the testing AOI we created above. After running it, we get the following detections:

We could count these manually since our test AOI is small, but to keep this workflow scalable we will select the ground truth layer by spatial intersection with the detections:

Pulling up the attribute table shows that 6 of the 46 ground truth points are intersected by detection polygons. However, two of those points are intersected by the same polygon. To account for this, we also run a selection on the detection polygons by intersection with the ground truth points, which shows 5 polygons selected. Taking the minimum of the two counts gives us the right measure, since we want both the points that fall within detected polygons and the number of distinct polygons that actually cover points. So our overall accuracy is 5/46, roughly 11%. Not great. Let's try the replicated panchromatic imagery.

We run into a distinctly different problem here. The detections are gigantic compared to the actual size of the structures! Using our method from above, we get 15 of the 46 ground truth points captured by polygons but only 3 polygons that intersect points; taking the minimum gives an accuracy of 3/46, or roughly 6.5%. Terrible.
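Both accuracy checks can be scripted so they are repeatable across test runs. Here is a sketch that assumes the detection and ground truth layers are already in the active map; the layer names are hypothetical:

```python
import arcpy

detections = "building_detections"      # output of Detect Objects Using Deep Learning
truth_points = "ground_truth_points"    # digitized structure points in the test AOI

total = int(arcpy.management.GetCount(truth_points).getOutput(0))

# Ground truth points that fall inside a detected polygon.
arcpy.management.SelectLayerByLocation(truth_points, "INTERSECT", detections)
hit_points = int(arcpy.management.GetCount(truth_points).getOutput(0))
arcpy.management.SelectLayerByAttribute(truth_points, "CLEAR_SELECTION")

# Distinct detected polygons that touch at least one ground truth point.
arcpy.management.SelectLayerByLocation(detections, "INTERSECT", truth_points)
hit_polys = int(arcpy.management.GetCount(detections).getOutput(0))

correct = min(hit_points, hit_polys)
print(f"{correct}/{total} structures detected ({correct / total:.1%})")
```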

Between these two results it's obvious that we cannot use the model off the shelf. That's no problem, as we can always fine-tune it for our use case.

Fine-tuning Our Model with New Data

The process of fine-tuning a deep learning model involves creating a small set of labels against a new dataset and training a portion (often called the “head”) of the deep learning model. For example, below is an illustration of the famous VGG-16 network architecture, often used for image classification:

Since the network above was trained on ImageNet, which contains 1,000 image classes, the last layer (orange) is 1,000 units wide with a softmax, so it outputs the probability of the image belonging to each of the 1,000 classes. The "backbone" (or feature extractor) of the network is everything before that last layer (or everything before the fully connected layers, depending on your school of thought). Training only the last layer or the last few layers lets us keep the network's power to extract meaningful information (features) from the input, while the retrained layers take a combination of those features and make a judgement, such as which class the image belongs to or where particular objects are located within it. This process is called "fine-tuning" the model, or more generally "transfer learning".

For our use case, the pre-trained model is a Mask-RCNN, which has a slightly more complicated architecture:

Diagram source: https://ronjian.github.io/blog/2018/05/16/Understand-Mask-RCNN

You can read more about the architecture in the original paper. For now, understand that we’ll be freezing the convolutional backbone so that the feature extractor (which has already been trained to extract important details from the image to make building footprint judgements) is maintained.
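To make "freezing the backbone" concrete, here is roughly what it looks like with torchvision's off-the-shelf Mask R-CNN. This is conceptual only; the Esri tools manage the pre-trained weights for you when you supply the .dlpk, and the pretrained flag may be spelled differently (weights=...) in newer torchvision versions:

```python
import torch
import torchvision

# Load an off-the-shelf Mask R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Freeze the convolutional feature extractor (the backbone) ...
for param in model.backbone.parameters():
    param.requires_grad = False

# ... and optimize only the remaining head parameters during fine-tuning.
head_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(head_params, lr=0.0005, momentum=0.9)
```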

The first thing we’ll need for this process is to designate a training area of interest, where we will label the data needed for retraining the model. Once we’ve designated such an area, we can use a feature class to start labeling structures within the area. This time we’ll need to label entire building footprints though, not just a point indicating the presence of a building. This is because we need to mimic the training data the model was initially trained with, which was polygons delineating building footprints. This takes a little longer, but hopefully the payoff will be worth it.

The number of training examples needed depends highly on the data, specifically on how different the original training data is from the data we will retrain with. In our case, there's not much difference in the geometric aspects of our features (the structures are the same relative size, since we are also using 0.5 m resolution imagery, and their appearance is not significantly altered); the main thing that has changed is the spectral signature. We will start by labeling a little over 100 structures as a test run.

After labeling, we will use the Export Training Data for Deep Learning tool to export image chips with mask labels for each structure. We will keep the default values but change the Metadata Format to "RCNN Masks", so the chips line up with the format expected for training our Mask R-CNN model.
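If you prefer to script this step, the same tool can be called from Python. This is a sketch with hypothetical paths and commonly used settings; check the tool's documentation for the exact parameter names in your version of Pro:

```python
import arcpy

# Export mask-labeled chips for Mask R-CNN training (paths are hypothetical).
arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster=r"C:\data\pan_05m_3band.tif",
    out_folder=r"C:\data\building_chips",
    in_class_data=r"C:\data\training.gdb\labeled_footprints",
    image_chip_format="TIFF",
    tile_size_x=256, tile_size_y=256,
    stride_x=128, stride_y=128,
    output_nofeature_tiles="ONLY_TILES_WITH_FEATURES",
    metadata_format="RCNN_Masks")
```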

This will produce a folder of image chips; we can take a quick look to make sure they are what we expect:

Now, we can move onto the retraining portion. This can be done in one of two ways:

(1) Use the “Train Deep Learning Model” tool found in ArcGIS Pro 2.6+

(2) Use ArcGIS Notebooks (or Jupyter Notebooks after installing the Python API).

The first option is very simple — we navigate to the Train Deep Learning Model tool and pass it the Input Training Data folder we created during the export above. This should automatically populate some of the values, including Model Type under Model Parameters.

The second important value we need to populate here is the "Pre-trained Model" parameter, where we point to the downloaded .dlpk file for the original model. This allows us to fine-tune the weights of that model instead of starting from scratch. The Mask R-CNN model is quite large, so a GPU is strongly suggested to keep training times feasible (minutes instead of hours for our small example). While training, it's important to make sure that the loss is decreasing fairly steadily; otherwise it may be prudent to lower the "Learning Rate" parameter to a smaller value such as 0.0005, or to create more training data.

After training, the tool will output another .dlpk containing the fine-tuned model. We will test this new and improved model against our data.
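If you go the notebook route instead (option 2), the equivalent fine-tune looks roughly like this with the arcgis.learn module. Paths are placeholders and the exact arguments may vary by API version (some versions expect the .emd file inside the model package):

```python
from arcgis.learn import prepare_data, MaskRCNN

# Chips exported by Export Training Data for Deep Learning (hypothetical path).
data = prepare_data(r"C:\data\building_chips", batch_size=8)

# Initialize Mask R-CNN from the downloaded pre-trained footprints model so we
# fine-tune its weights rather than train from scratch (hypothetical path).
model = MaskRCNN.from_model(r"C:\models\building_footprints_usa.dlpk", data)

# Fine-tune for a few epochs at a small learning rate; model.lr_find() can
# suggest one if the loss isn't decreasing steadily.
model.fit(10, lr=0.0005)
model.save("building_footprints_pan_finetuned")
```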

Assessing Results and Post-processing

Detected structures, before retraining (red) vs. after (blue)

46 of the detected polygons intersect our ground truth points, while 36 of the ground truth points lie within a polygon. Taking the minimum, we get 36 correctly detected structures out of 46 total: 78.2% accuracy, up from 6.5% earlier, an improvement of over 70 percentage points from labeling under 150 structures and retraining!

Next, there are two main things we’re going to handle that will drastically improve the quality of our results and go a long way towards turning our blobby polygons into realistic and accurate building footprints. Firstly, here’s a zoomed image of a few of the detections directly from the model:

We can now immediately see two of the major issues:

  1. Overlap: there are multiple detections over the same structure
  2. Edges: the rounded corners are not tightly aligned with the true footprints

Luckily, we have two tools within ArcGIS Pro designed to handle these exact issues. Firstly, we’ll be using the Dissolve geoprocessing tool to merge multiple overlapping polygons into a single polygon.

Then, we'll utilize the very useful Regularize Building Footprint tool to help sharpen up the edges of our detected structures:
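Both post-processing steps can also be scripted. Here is a sketch with hypothetical layer and output names; note that Regularize Building Footprint requires the 3D Analyst extension:

```python
import arcpy

arcpy.CheckOutExtension("3D")  # Regularize Building Footprint is a 3D Analyst tool

raw = "building_detections_finetuned"   # hypothetical detection layer

# Merge overlapping detections while keeping non-touching footprints separate.
arcpy.management.Dissolve(raw, "footprints_dissolved", multi_part="SINGLE_PART")

# Square up the rounded, blobby outlines; tolerance is in the dataset's linear units.
arcpy.ddd.RegularizeBuildingFootprint(
    "footprints_dissolved", "footprints_regularized",
    method="RIGHT_ANGLES", tolerance=1)
```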

The result after using these tools looks much improved over the original detections:

Calculating Change

Now that we know how to generate a feature class of detected building footprints, we can repeat the process for imagery from multiple time periods. While commercially purchased imagery usually offers some degree of consistency between collects, it is useful to do a quick scan of the panchromatic imagery from the two time periods and see if anything jumps out at us.

We can immediately see that the registration is inconsistent between time periods. This is going to be a huge problem when we move on to change detection, as the inconsistency will translate into false positives for new building additions. Luckily, this is easy to fix using the georeferencing tools in Pro. The best way to do this is to make the post imagery layer partially transparent and align the georeferencing anchor point to the same point on the underlying pre imagery (this can be done before or after the pre-processing steps mentioned above). Another option is to use the auto-georeference tool, which does a great job creating and aligning anchor points between the images.

From here the change detection is straightforward. First, we run the model over the two successive temporal sets of imagery. Once we have a set of building footprints for each period, we can do a simple filtering operation to find which new building footprints appeared each year. We do this with the Select By Location tool, selecting the structures in each time period that do not intersect structures in the previous period's detections (by inverting the intersection relationship; we can also use a small search distance if we still suspect slight registration errors in the imagery).
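Scripted, the filtering step looks roughly like this; the layer names are hypothetical and the search distance absorbs small residual registration errors:

```python
import arcpy

pre = "footprints_2019"    # hypothetical footprint layers for two time periods
post = "footprints_2020"

# Select post-period footprints that do NOT intersect any pre-period footprint.
arcpy.management.SelectLayerByLocation(
    post, "INTERSECT", pre,
    search_distance="2 Meters",
    invert_spatial_relationship="INVERT")

# Persist the selection as the layer of newly appeared structures.
arcpy.management.CopyFeatures(post, "new_structures_2020")
```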

We can then create a new feature class based on the selection which represents the new structures that popped up each year. Once uploaded to Portal, these feature classes can be placed in an ArcGIS Dashboard that summarizes all of the information in a readable and dynamic way:

If model performance is good and the georeferencing error is very small, we can also use this approach (with no buffer distance) to determine where additions have been made to existing structures. We can also bring in parcel data to flag parcels containing new construction, or to measure new construction as a percentage of parcel area.

From here there are a number of directions we can take, such as wrapping the entire workflow in a ModelBuilder model, setting it up to run automatically when new imagery is uploaded to Portal (or to cloud storage such as Azure Blob Storage), adding filters against known new building locations (for example, using a permit database), or running it over historical time periods to map the urbanization of new areas.

Conclusion

In this post we've seen how to pre-process a set of imagery to be compatible with a deep learning model trained on a different set of imagery, generate additional training data for that model, fine-tune it against our newly created data, run inference with the fine-tuned model to get significantly improved results, and finally use GIS techniques to accurately map and visualize newly erected structures and changes to existing ones. There are a lot of steps in this workflow, but the value is that it generalizes to any feature whose change we'd like to map over temporal imagery, such as roads, wells, utility poles, or water reservoirs. More broadly, imagery-based change detection is one of the cornerstones of understanding changing communities, so we can better serve them and balance urban development with environmental protection. Thanks for reading!


AI scientist and researcher developing methods for automated visual recognition/understanding and detecting patterns in geospatial phenomena.