Wednesday, 11 December 2019

Another minor version change

The HadISD code for versions 2 and 3 is written in Python (having migrated from the IDL code in version 1).  However, it has been using Python2.7, support for which is ending on January 1st 2020.  Therefore, I have updated the codebase to ensure that it is Python3 compatible.

During this process, I may have missed some of the changes needed to ensure continuity across the versions (e.g. the change of default from integer to float division).  However, I have found one bug which seems to have been present since the creation of the Python version in 2014.
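To illustrate the kind of change that is easy to miss, here is a minimal sketch of the division difference.  It is not taken from the HadISD code, and the variable names are hypothetical.

```python
# Illustration only: the names below are hypothetical, not from the HadISD code.
n_obs = 7
n_bins = 2

# Under Python 2, "/" between two integers performed floor division (7 / 2 == 3).
# Under Python 3, "/" is always true division.
true_quotient = n_obs / n_bins    # 3.5 under Python 3

# "//" is floor division under both Python 2 and Python 3, so code that relied
# on the old integer behaviour needs to use it explicitly.
floor_quotient = n_obs // n_bins  # 3
```

Any expression that used "/" on integers to compute an index or a count would silently change its result unless rewritten with "//".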

The world records check compares observation values to the world records for each continent (WMO region) as held by the WMO.  Unfortunately, in prior versions only the global values were being used, and so this check was not as powerful as it could have been.  In many cases, the observation values that exceeded regional records but not the global ones will have been picked up by other checks (e.g. climatological or distribution).
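The difference between the two behaviours can be sketched as follows.  This is only an illustration of the logic: the function name, the region keys and the threshold values are all assumptions, and the numbers are placeholders rather than the actual WMO records.

```python
# Placeholder thresholds for illustration only; these are NOT the WMO records.
GLOBAL_MAX_T = 60.0
REGIONAL_MAX_T = {"region_a": 50.0, "region_b": 55.0}


def exceeds_record(value, region, use_regional=True):
    """Return True if a temperature exceeds the applicable maximum record.

    With use_regional=False only the global value is used, which mimics the
    behaviour described for the earlier versions.
    """
    if use_regional:
        # Fall back to the global value if the region is not known.
        limit = REGIONAL_MAX_T.get(region, GLOBAL_MAX_T)
    else:
        limit = GLOBAL_MAX_T
    return value > limit


obs = 52.0
flagged_global_only = exceeds_record(obs, "region_a", use_regional=False)
flagged_regional = exceeds_record(obs, "region_a", use_regional=True)
```

Here the observation sits between the regional and the global placeholder values, so it is flagged only when the regional record is used.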

Below I show the images for the old and new versions of the test.  These are on very slightly different runs (one from v3.0.1.201910p and one from a test run of the new code, which also updated the station counts slightly).  As a fraction of the number of observations in each station, the increase in flagging is less than 0.1% (in the stations I have checked, the numbers range from a single observation to a few tens).

Fig 1: World record checks for temperature in v3.0.1.201910p (Python 2.7 live version)

Fig 2: World record checks for temperature in v3.1.0.201910p (test version of Python 3)

As a result of (a) the update to Python3, and (b) the bug-fix of the world record check, the version released in December (which includes data up to the end of November 2019) is 3.1.0.201911p (as opposed to 3.0.1.201911p).  No other changes have been made, and we expect to release the final version for 2019 updates in early January (3.1.0.2019f).

Tuesday, 5 February 2019

Monthly updates, extra variables, new version

Since the launch of HadISD, we have run an annual update cycle in two stages.  In January we update to allow a preliminary look at the most recent calendar year, and then in roughly April, we run a final version.  This was adopted to balance the need of users to access data from the most recent complete calendar year in a timely fashion with the issue that data do not always arrive into the NOAA/NCEI archives within days of their observation.

For the last few years I have been working on adapting the Python code that compiles HadISD2 so that it can run monthly updates.  To do this, I've adopted the following conventions and outlines for the processing.

Quality Control Tests

In some of the quality control tests, the entire period of record is used, for example to determine the parameters of a distribution or to set threshold values.  With monthly appends to the data, these parameters would change from month to month, resulting in changing threshold values with each release.  The flags on individual observations could then flutter on and off between releases, which I decided would be undesirable to users, and so I developed the tests as follows.

Where thresholds were set from the parameters of the observations themselves, these are only ever calculated from the data occurring up to the last complete year (31st December at 2300).  Therefore, adding extra months has no impact until a run in January of a following year.
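A minimal sketch of this idea, assuming a hypothetical test whose threshold is the mean plus a number of standard deviations (the function names and the form of the threshold are illustrative, not those of the HadISD code):

```python
from datetime import datetime
import statistics


def last_complete_year_end(run_time):
    """Final hourly timestamp (31 December, 2300) of the last complete year."""
    return datetime(run_time.year - 1, 12, 31, 23)


def threshold_from_complete_years(times, values, run_time, n_sigma=5.0):
    """Set a threshold using only data up to the last complete calendar year.

    The mean + n_sigma * standard deviation form is an assumption made for
    this example only.
    """
    cutoff = last_complete_year_end(run_time)
    subset = [v for t, v in zip(times, values) if t <= cutoff]
    return statistics.mean(subset) + n_sigma * statistics.pstdev(subset)


times = [
    datetime(2018, 6, 1, 0),
    datetime(2018, 12, 31, 23),
    datetime(2019, 1, 1, 0),
    datetime(2019, 5, 1, 0),
]
values = [10.0, 12.0, 30.0, 40.0]

# A run in February 2019 sees only the first three observations ...
feb = threshold_from_complete_years(times[:3], values[:3], datetime(2019, 2, 5))
# ... and a run in June 2019 sees all four, but the threshold is unchanged.
jun = threshold_from_complete_years(times, values, datetime(2019, 6, 5))
# Only a run in the following year picks up the 2019 data.
jan = threshold_from_complete_years(times, values, datetime(2020, 1, 5))
```

The two runs during 2019 give the same threshold because the 2019 observations fall after the cutoff; the January 2020 run gives a different one.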

Some tests do not have thresholds set in this way, and so these are not impacted by the monthly append of new data.

Update cycle and versioning schema

To retain a stable set of stations that make up HadISD, we have decided to only recreate the station list on an annual cycle.  At the same time, all data in the "deep past" can be updated (i.e. prior to the most recent complete calendar year).  On monthly updates, only the current year will have any data updated, but due to the way that the ISD files are stored at NCEI, an entire year is downloaded so the update in November could also have changes in February.

To allow users to clearly identify which version of HadISD they are using in any output, we are going to stick with the versioning scheme used for HadISD and HadISD2 (x.y.z.datelabel).  However, the date label will be more important for the monthly updating dataset.  As there will be significant changes to the code, variables and processing, we have decided to increment the overall version number by one - forming HadISD version 3.  This is documented in a Met Office Hadley Centre Technical Note.

We have been running monthly updates for testing and internal purposes during the last few months of 2018.  The first release using this new code will be that in January 2019.  This includes the addition of 2018 data over the previous release (v2.0.2.2017f), but no changes in the deep past, and no reselection of the stations.  This update could be called v3.0.0.201812p - the preliminary update including data to the end of December.  But as this will be the final monthly update for 2018's data, it will be released under v3.0.0.2018f.

In February, the update which will include January's data will also check for any updates in previous years (1931-2018).  With a new station selection in this update, it will be released as v3.0.1.201901p.  The March update (v3.0.1.201902p) and each subsequent monthly update, through to v3.0.1.2019f in January 2020, will overwrite all data from 2019.  Then in February 2020, there is another update to the deep past and to the station list, which also includes January 2020 - resulting in v3.0.2.202001p (note the change in the date label as well as in the "z" label).
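For users who want to read these labels programmatically, the scheme can be sketched with a small parser.  The regular expression and the function name are my own assumptions based on the labels quoted in this post, not part of the HadISD code.

```python
import re

# x.y.z.datelabel, where the date label is YYYY or YYYYMM followed by
# "p" (preliminary) or "f" (final).  A leading "v" is optional.
LABEL_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)\.(\d{4})(\d{2})?([pf])$")


def parse_version(label):
    """Split a version label such as 'v3.0.1.201901p' into its parts."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError("unrecognised version label: " + label)
    x, y, z, year, month, status = match.groups()
    return {
        "x": int(x),
        "y": int(y),
        "z": int(z),
        "year": int(year),
        "month": int(month) if month is not None else None,
        "final": status == "f",
    }


monthly = parse_version("v3.0.1.201901p")
final = parse_version("v3.0.0.2018f")
```

The monthly label carries a month and is marked preliminary, whereas the final label for a year carries only the year.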


Other bits and bobs - precipitation and station level pressure

While implementing the changes to the HadISD code base, I decided that this was an opportunity to address the issue with the precipitation fields, outlined in the previous post.  There are 4 precipitation fields in the ISD, each with an accumulation period, an accumulation amount and quality code.  I've split these out into new fields in the netCDF files, one for each accumulation period.  This should make it easier for users who wish to use the precipitation amounts.  However it is important to note that at this point, these data have NOT been quality controlled.
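The splitting can be sketched as below.  This is a simplified illustration and not the HadISD implementation: the input layout (a list of (period in hours, amount) pairs per timestamp), the function name and the example amounts are all assumptions, and the quality codes are left out.

```python
def split_by_period(observations):
    """Rearrange precipitation reports into one series per accumulation period.

    observations: one entry per timestamp, each a list of
    (period_hours, amount) pairs.  Returns {period_hours: [amount or None]},
    with None where no report exists for that period at that timestamp.
    """
    periods = sorted({period for obs in observations for period, _ in obs})
    series = {period: [None] * len(observations) for period in periods}
    for i, obs in enumerate(observations):
        for period, amount in obs:
            series[period][i] = amount
    return series


# Made-up example values for three timestamps.
reports = [
    [(1, 0.2), (6, 1.5)],
    [(1, 0.0)],
    [(24, 3.0)],
]
by_period = split_by_period(reports)
```

Each accumulation period ends up as its own series aligned with the timestamps, which is the shape a separate netCDF field per period would need.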

A user requested that station level pressure (as distinct from the sea-level pressure that is currently in HadISD) be included.  We have done this, and added another QC test to compare the station and sea-level pressures.  If the difference between them departs from the median difference by more than 4.5 median absolute deviations, in either direction, then the station level pressure is flagged.
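A minimal sketch of this comparison, using only the standard library.  The function name and the example pressures are hypothetical, and the handling of missing data that a real implementation would need is omitted.

```python
import statistics


def flag_station_pressure(station_p, sea_level_p, n_mad=4.5):
    """Flag station level pressures whose difference from sea-level pressure
    is more than n_mad median absolute deviations from the median difference.

    Returns a list of booleans, True where the station value is flagged.
    """
    diffs = [stn - slp for stn, slp in zip(station_p, sea_level_p)]
    median_diff = statistics.median(diffs)
    mad = statistics.median([abs(d - median_diff) for d in diffs])
    # Note: if mad is zero, every difference not equal to the median is flagged.
    return [abs(d - median_diff) > n_mad * mad for d in diffs]


# Made-up values in hPa; the last station level pressure is clearly wrong.
station_p = [1000.0, 1001.0, 999.0, 1000.5, 950.0]
sea_level_p = [1010.0, 1011.2, 1008.9, 1010.4, 1010.0]
flags = flag_station_pressure(station_p, sea_level_p)
```

Using the median and the median absolute deviation, rather than the mean and standard deviation, keeps the threshold from being dragged around by the very outliers the test is trying to find.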

A Met Office Hadley Centre Technical Note has been drafted and will shortly be available (also on the HadISD website).  We encourage users to provide feedback on the monthly updates during 2019.

[US Government Shutdown]

As a result of the prolonged US Government Shutdown, the update to HadISD is a bit later this year.