Monday 20 September 2021

Adapting the QC to account for the June 2021 North American Heatwave (part 3)

We have seen in parts 1 and 2 that the Climatological and Distribution checks have been adjusted because some values at some stations were likely being erroneously flagged during the heatwave over North America in June 2021.

In order to see how much of an effect these updated checks have had on the flagging rates, we look at a spatial distribution of the stations during the last days of June 2021 showing the temperatures at each station, with observations flagged with our unmodified QC appearing in green (Figure 1). As is clear, a large number of stations are flagged during this time. 

Figure 1.  Temperatures at HadISD stations on 28-June-2021 at 00:00UTC (17:00PDT on 27th June).  Flagged stations are shown in green, and non-reporting as transparent.


We produced the same map after the modifications in the two QC tests, and also the appropriate adjustment to the buddy check to allow any tentative flags to be unset (Figure 2).  Many fewer stations are flagged with these updated tests.  However, a number are still flagged.  These are flagged throughout the period of interest rather than during the hottest part of the day, and so were likely the result of a test which flags an entire month. After some spot checks on these stations, these flags are from the excessive variance test.

Figure 2.  As for Figure 1 but after updated QC tests.

The excessive variance test looks at the distribution of the within-month variance, and identifies months with exceptionally low or exceptionally high variance.  The scaling used for this test is the interquartile range, and months which have a variance more than 8 IQR from the average are flagged. As can be seen for the example of Osoyoos (712150-99999,  latitude 49.033, longitude -119.433), the variance for June 2021 is much larger than any previously seen June (Figure 3).

Figure 3.  The Variance Check for Osoyoos (BC, 49.033, -119.433) showing the all of June 2021 flagged.

The variance check uses a fixed threshold of 8 IQR rather than thresholds generated from the properties of the distribution (as in the climatological and distribution checks).  To update this check to determine thresholds from the distribution itself (as in the e.g. climatological and distributional checks) would be a larger change than the relatively small ones we have done so far. Also, in the example in Figure 3, a reasonable threshold determined from the distribution may still have excluded June 2021 (note, the y-axis is a log-scale) and we might struggle to be objective in this change if tailoring to this specific event, perhaps causing issues in other regions. In contrast, the changes in the climatological and distributional checks were easily motivated (and perhaps should have been spotted during development of the monthly updates).

As we noted in the HadISD papers, the automated QC is a balance between removing erroneous/dubious observations but retaining true extremes, and what we do not want to do is make changes with inadvertently large impacts elsewhere.  Our plan at this point in time is to note this as an issue for this test (and event) to look at in the future in any next major update to HadISD.  Any flagged data in HadISD is removed from the netCDF data fields, but remains available within the netCDF files should users wish to access it.  If you have thoughts on this, please do get in touch or comment below.

The next update to HadISD (in October 2021) will show a version number increment to reflect these changes in the QC tests (3.2.0.202109p).

[Animations of three days of the heatwave showing the flagged stations before and after the QC test updates are available on the HadISD homepage].

Tuesday 14 September 2021

Adapting the QC to account for the June 2021 North American heatwave (part 2)

Continuing this took longer than planned, so there has been another version release of HadISD (v3.1.2.202107p) in the meantime.

In the last post, we went through the changes that were made to the Climatological Outlier check as a result of the temperatures experienced in North America in June 2021.  Since then, there have continued to be heatwave events across the world, with temperatures and impacts around the Mediterranean being the current focus (at time of writing).  We will continue to use the North American heatwave for these changes for consistency, but note that of course changes to our QC will affect all stations and variables, and hence events.

Distributional Gap Check

In this check there are in fact two. The first uses monthly aggregated data, to look for asymmetries in the distribution, and we haven't changed that one.  The second is what we delve into here, which uses all observations within a calendar month, and identifies gaps in this distribution to decide where to flag.  We'll use the same station for our plots as in the previous post, 711130-99999 (Agazziz, BC, Canada).

As we use a very similar approach in this test, we also had the same issue where our diagnostic plots initially were not showing data from the incomplete calendar year.  But that was an easy fix, see Figure 1.

Fig. 1  the distribution of scaled anomalies for June from Agassiz (711130-99999), with the flagged ones highlighted in red.  Distribution from all years before 2021 is in black, and from all years including 2021 in grey. Blue is the fit to the data including skew and kurtosis using Gauss-Hermite polynomials. Note the logarithmic y-axis. 

Here again, a handful of observations were being flagged because they fall beyond the bulk, but only when ignoring others from the incomplete calendar year.  What we also noted was the single observation at the low end being flagged.  This test should look for gaps at least two bin-widths wide (so two consecutive empty bins), and this doesn't seem to have been the case.  We fixed that at the same time as the other changes.

As for the climatological check, we treated the complete and incomplete years separately, which meant that these observations were now tentatively flagged, which can be unset by the neighbour check (Figure 2).

Fig 2. Same as Fig. 1, but with the data from June 2021 now being correctly marked as tentatively flagged.  Orange vertical lines are derived from the fit of the distribution (blue) to complete years only (black), and red ones are where a gap has been found.  The Purple and Pink are derived when including the final month.

The final thing that we wanted to change was the nature of the curve being used to fit the distribution.  When putting this code together, we wanted to include skew and kurtosis, as the distributions were clearly non-gaussian.  At the time, we used Gauss-Hermite polynomials to obtain the fit with these higher moments of the distribution.  However, we have since found that these sometimes have artefacts which result in some "wiggles" in the distributions (see Figure 1).  Although this approach is still useful for gauging where to start looking for gaps in the distribution, but we thought that this was an opportunity to see what else could be done.  We tried using the same skewed distribution (no kurtosis) as for the climatological outlier check.

Fig. 3: Same as Fig. 2, but now using a skewed-Gaussian ditribution for the fit rather than the Gauss-Hermite polynomials.

For this month, it is a more sensible fit, and also has a co-benefit of moving the value from which this test starts searching for a gap to the right, and so includes all of the hot temperatures in June.