Tuesday, 18 February 2020

v3.1.1.202001p

As I'm sure you've been noticing, I'm managing to update HadISD on a roughly monthly basis (the exact release date depends on other things in my schedule, but it tends to be around the second week of the month).

In January version 3.1.0.2019f was released, the final update to 2019 data (hence the "f"). I also ran the Pairwise Homogenisation Algorithm on the data to produce the homogeneity assessment information for this version.  

Then, earlier this month (February), work on the new version 3.1.1.202001p began.  As this is a new set of monthly updates for 2020, the station selection code was re-run, resulting in the addition of around 300 stations and making a total of just over 8400 in HadISD.  All other processing completed as normal, and this version is available at www.metoffice.gov.uk/hadobs/hadisd as usual.  All data before 2020 are now frozen in this version; the monthly appends during the year will change only the 2020 data.

As always, please let us know if you spot anything which doesn't look right or if you are having issues in obtaining the data.

Wednesday, 11 December 2019

Another minor version change

The HadISD code for versions 2 and 3 is written in Python (having migrated from the IDL code in version 1).  However, it has been using Python2.7, support for which is ending on January 1st 2020.  Therefore, I have updated the codebase to ensure that it is Python3 compatible.

During this process, it is possible that I will have missed some of the changes needed to ensure continuity across the versions (e.g. integer versus float division as default).  However I have found one bug which seems to have been present since the creation of the Python version in 2014.
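As an illustration of the division change mentioned above: in Python 2, `/` between two integers truncates, whereas in Python 3 it returns a float and `//` is needed for floor division.  A minimal sketch (the variable names are purely illustrative and are not taken from the HadISD code):

```python
# Behaviour shown is for Python 3; under Python 2.7 the first
# expression would evaluate to 3 rather than 3.5.
n_obs = 7
n_bins = 2

true_division = n_obs / n_bins    # float result in Python 3
floor_division = n_obs // n_bins  # integer result in both Python 2 and 3

print(true_division, floor_division)
```

Any place in a codebase where the truncating behaviour was relied upon has to be switched to `//` by hand, which is why this kind of change is easy to miss.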

The world records check compares observation values to the world records for each continent (WMO region) as held by the WMO.  Unfortunately in prior versions only the global values were being used, and so this check was not as powerful as it could have been.  For many cases, the observation values that exceeded regional records but not the global ones will have been picked up by other checks (e.g. climatological or distribution).
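The difference between the old and new behaviour can be sketched as follows.  This is only an outline under stated assumptions: the function name, the dictionary layout and the fallback to global limits are hypothetical, and the limits in the usage example are placeholders rather than actual WMO records.

```python
def world_record_flags(values, region, regional_limits, global_limits):
    """Flag values outside the record range for the station's region.

    regional_limits maps a region name to a (low, high) tuple.  If the
    region is not present, the global limits are used instead, which is
    effectively what happened for every station before the bug-fix.
    """
    low, high = regional_limits.get(region, global_limits)
    return [(v < low) or (v > high) for v in values]


# Placeholder limits for illustration only (NOT real WMO records).
example_regional = {"region_a": (-60.0, 50.0)}
example_global = (-90.0, 57.0)

temperatures = [20.0, 55.0, -70.0]
new_flags = world_record_flags(temperatures, "region_a", example_regional, example_global)
old_flags = world_record_flags(temperatures, "unknown", example_regional, example_global)
print(new_flags, old_flags)
```

With the placeholder limits, the second and third values are flagged only when the regional limits are applied.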

Below I show the images for the old and new versions of the test.  These are on very slightly different runs (one from v3.0.1.201910p and one from a test run of the new code, which also updated the station counts slightly).  As a fraction of the number of observations in each station, the increase in flagging is less than 0.1% (in the stations I've checked, the numbers range from a single observation to a few tens).
Fig 1: World record checks for temperature in v3.0.1.201910p (Python 2.7 live version)
Fig 2: World record checks for temperature in v3.1.0.201910p (test version of Python 3)
As a result of (a) the update to Python3, and (b) the bug-fix of the world record check, the version released in December (which includes data up to the end of November 2019) is 3.1.0.201911p (as opposed to 3.0.1.201911p).  No other changes have been made, and we expect to release the final version for 2019 updates in early January (3.1.0.2019f).

Tuesday, 5 February 2019

Monthly updates, extra variables, new version

Since the launch of HadISD, we have run an annual update cycle in two stages.  In January we update to allow a preliminary look at the most recent calendar year, and then in roughly April, we run a final version.  This was adopted to balance the need of users to access data from the most recent complete calendar year in a timely fashion with the issue that data do not always arrive into the NOAA/NCEI archives within days of their observation.

For the last few years I have been working on adapting the Python code that compiles HadISD2 so that it can run monthly updates.  To do this, I've adopted the following conventions and outlines for the processing.

Quality Control Tests

In some of the quality control tests, the entire period of record is used to e.g. determine parameters of a distribution or set threshold values.  With monthly appends to the data, these parameters would change from month to month, resulting in changing threshold values with each release.  I decided that the flutter that this would cause in whether observations were flagged or not would be undesirable to users and so developed the tests as follows.

Where thresholds were set from the parameters of the observations themselves, these are only ever calculated from the data occurring up to the last complete year (31st December at 2300).  Therefore, adding extra months has no impact until a run in January of a following year.
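A minimal sketch of this idea, assuming a hypothetical threshold of the mean plus a number of standard deviations (the actual HadISD tests set their thresholds in test-specific ways; the function and parameter names here are illustrative only):

```python
import datetime as dt

import numpy as np


def frozen_threshold(times, values, last_complete_year, n_sigma=5.0):
    """Compute a threshold using only data up to the end of last_complete_year.

    The mean + n_sigma * std form is a stand-in for whatever a given QC
    test actually uses; the point is the cutoff, not the statistic.
    """
    cutoff = dt.datetime(last_complete_year + 1, 1, 1)
    frozen = np.array([v for t, v in zip(times, values) if t < cutoff], dtype=float)
    return frozen.mean() + n_sigma * frozen.std()


times = [dt.datetime(2019, 6, 1, 12), dt.datetime(2019, 12, 31, 23)]
values = [10.0, 12.0]
before_append = frozen_threshold(times, values, last_complete_year=2019)

# A monthly append adds 2020 data, but the threshold is unchanged.
times.append(dt.datetime(2020, 1, 15, 0))
values.append(100.0)
after_append = frozen_threshold(times, values, last_complete_year=2019)
print(before_append, after_append)
```

Only when `last_complete_year` is advanced, at the start of the following year, do the appended observations influence the threshold.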

Some tests do not have thresholds set in this way, and so these are not impacted by the monthly append of new data.

Update cycle and versioning schema

To retain a stable set of stations that make up HadISD, we have decided to only recreate the station list on an annual cycle.  At the same time, all data in the "deep past" can be updated (i.e. prior to the most recent complete calendar year).  On monthly updates, only the current year will have any data updated, but due to the way that the ISD files are stored at NCEI, an entire year is downloaded so the update in November could also have changes in February.

To allow users to clearly identify which version of HadISD they are using in any output, we are going to stick with the versioning scheme used for HadISD and HadISD2 (x.y.z.datelabel).  However, the date stamp will be more important for the monthly updating dataset.  As there will be significant changes to the code, variables and processing, we have decided to increment the overall version number by one, forming HadISD version 3.  This is documented in a Met Office Hadley Centre Technical Note.

We have been running monthly updates for testing and internal purposes during the last few months of 2018.  The first release using this new code will be that in January 2019.  This includes the addition of 2018 data over the previous release (v2.0.2.2017f), but no changes in the deep past, and no reselection of the stations.  This update could be called v3.0.0.201812p, the preliminary update including data to the end of December.  But as this will be the final monthly update for 2018's data, it will be released under v3.0.0.2018f.

In February, the update which will include January's data will also check for any updates in previous years (1931-2018).  With a new station selection in this update, it will be released as v3.0.1.201901p.  From March (v3.0.1.201902p) all the way to January 2020 (v3.0.1.2019f), there will be updates in which all data from 2019 are overwritten.  Then in February 2020, there is another update to the deep past, to the station list and also to include January 2020, resulting in v3.0.2.202001p (note the change in the date label as well as in the "z" label).
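For users who want to record or compare versions programmatically, the x.y.z.datelabel scheme can be split into its parts.  The sketch below is an assumption about how one might do this; the regular expression and the returned field names are not part of any HadISD tooling.

```python
import re

# x.y.z, then a four-digit year, an optional two-digit month,
# and "p" (preliminary) or "f" (final).  A leading "v" is optional.
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)\.(\d{4})(\d{2})?([pf])$")


def parse_hadisd_version(label):
    """Split a label such as 'v3.0.1.201901p' into its components."""
    match = VERSION_PATTERN.match(label)
    if match is None:
        raise ValueError("Unrecognised version label: {}".format(label))
    x, y, z, year, month, status = match.groups()
    return {
        "x": int(x),
        "y": int(y),
        "z": int(z),
        "year": int(year),
        "month": int(month) if month else None,
        "final": status == "f",
    }


monthly = parse_hadisd_version("v3.0.1.201901p")
annual = parse_hadisd_version("v3.0.0.2018f")
print(monthly)
print(annual)
```

A label with a month in the date stamp is a monthly preliminary update, while one with only a year marks the last update for that year's data.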


Other bits and bobs - precipitation and station level pressure

While implementing the changes to the HadISD code base, I decided that this was an opportunity to address the issue with the precipitation fields, outlined in the previous post.  There are 4 precipitation fields in the ISD, each with an accumulation period, an accumulation amount and quality code.  I've split these out into new fields in the netCDF files, one for each accumulation period.  This should make it easier for users who wish to use the precipitation amounts.  However it is important to note that at this point, these data have NOT been quality controlled.

A user requested that station level pressure (as distinct from the sea-level pressure that is currently in HadISD) be included.  We have done this, and added another QC test to compare the station and sea-level pressures.  If the difference between them is more than 4.5 median absolute deviations away from the median difference (in either direction), then the station level pressure is flagged.
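A rough sketch of the pressure comparison, assuming the difference is taken as station minus sea-level pressure and that missing values are held as NaN (both are assumptions for illustration; the function name is hypothetical and this is not the HadISD implementation):

```python
import numpy as np


def flag_station_pressure(station_p, sea_level_p, n_mad=4.5):
    """Return a boolean array, True where station level pressure would be flagged."""
    diff = np.asarray(station_p, dtype=float) - np.asarray(sea_level_p, dtype=float)
    median = np.nanmedian(diff)
    # Median absolute deviation of the differences about their median.
    mad = np.nanmedian(np.abs(diff - median))
    return np.abs(diff - median) > n_mad * mad


station = [990.0, 991.0, 989.0, 990.0, 950.0]
sea_level = [1000.0, 1001.0, 999.5, 1000.2, 1000.0]
flags = flag_station_pressure(station, sea_level)
print(flags)
```

In this example only the last observation, where the two pressures differ by 50 hPa rather than about 10 hPa, is flagged.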

A Met Office Hadley Centre Technical Note has been drafted and will shortly be available (also on the HadISD website).  We encourage users to provide feedback on the monthly updates during 2019.

[US Government Shutdown]

As a result of the prolonged US Government Shutdown, the update to HadISD is a bit later this year.

Thursday, 22 March 2018

Precipitation in HadISD

We have included precipitation accumulations in HadISD since its launch (netCDF field “precip1_depth”), as this information is used as part of the quality control suite to check for high humidity periods in the dewpoint depression check.  We wanted to make the HadISD fully traceable, so that users could check our quality control decisions for themselves, should they wish to.  These precipitation accumulations are not quality controlled and so we have urged users to take care when using these data in their analyses.  

In the ISD data format there are four possible entries for the precipitation data.  These are indicated by character code “AA1” to “AA4”.  Each of these has period, depth, condition and quality entries.  To assist in the quality control of the dewpoint temperature fields, we extracted the first of these four precipitation fields.  The netCDF names we assigned these variables are “precip1_depth” and “precip1_period”, as these were from the first ISD precipitation field.   

Recently, Kimberly Channell of the Great Lakes Integrated Sciences and Assessments at the University of Michigan highlighted a confusion with the description of the precip1_depth field in the netCDF files.  The metadata for versions up to and including v2.0.2.2017p states “Depth of Precipitation Reported over time period”.  This, combined with the hourly time stamps, could easily result in an assumption that the precip1_depth field only contains hourly accumulations.  Furthermore, our naming of the netCDF variable inadvertently supports this interpretation, as the “1” in “precip1_depth” suggests hourly accumulation values.  Unfortunately, neither of these is the case.


The accumulation period for the precip1_depth is given by the precip1_period.  Even if there are timestamps every hour, the accumulation period may be a mix of time periods (from hourly to daily). We now appreciate that the metadata for these two variables could have been clearer, and that our chosen naming could be confusing, especially without knowledge of the ISD naming conventions.  We apologise to users if these issues have caused problems with their analyses.  To properly use the precipitation information, the depth information should be combined with the period.
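As a sketch of what combining the two fields might look like, the function below groups depths by their accumulation period.  It assumes the precip1_depth and precip1_period values have already been read from the netCDF file into arrays of equal length with missing entries removed; the function name is hypothetical.

```python
import numpy as np


def split_depth_by_period(depth, period):
    """Group precipitation depths by accumulation period (in hours)."""
    depth = np.asarray(depth, dtype=float)
    period = np.asarray(period)
    grouped = {}
    for hours in np.unique(period):
        # Keep only the depths reported for this accumulation period.
        grouped[int(hours)] = depth[period == hours]
    return grouped


# Made-up values purely to show the mechanics.
depths = [0.5, 2.0, 0.0, 12.0]
periods = [1, 6, 1, 24]
by_period = split_depth_by_period(depths, periods)
print(by_period)
```

Here only the entries stored under key `1` are hourly accumulations; the others are 6 and 24 hourly totals that happen to sit at hourly timestamps.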

Within one of the ISD precipitation fields, it is possible to have a number of accumulation periods, rather than just a single one across the entire length of the station record.  The ISD is itself made up of a number of underlying databases, drawing observations from across a variety of observation networks (e.g. SYNOP, METAR, GTS).  Each of these may have a different accumulation period, and also conventions as to the time of observation of e.g. 24 hour accumulations (and to which day these are assigned).  These have been combined together to form threaded records for single station locations where possible during both the ISD and HadISD development.

It may be that a station report type (e.g. GTS) was the primary source in the early period (e.g. filling AA1), but that for a later period, a different source with a different standard accumulation period has a higher priority in a merging process, and so supersedes this.  Therefore, at observation times where both sources have data, this could move the hourly accumulations down into the later precipitation fields (AA2-4), resulting in the interleaving of the different accumulation periods in the first entry.  The example station in Figure 1 exhibits behaviour consistent with this.  It is possible that further hourly accumulation values are present in AA2-4 of the ISD file for this station, but as we have not extracted those, they are not available to users of the HadISD at this time.
 

Figure 1 (top) precip1_depth, and (bottom) precip1_period for 724380-93819 (Indianapolis Airport).  This station has been merged from two ISD stations (99999-93819 and 724380-93819).  As the precipitation information is not quality controlled, likely erroneous observations like the ~160mm in the late 1970s are still present in the data files (using HadISD v2.0.2.2017p).


Therefore, for any given station in the HadISD, it is very likely that the period over which the precipitation depth has been accumulated is not constant over the entire record.  There may still be valid precip1_depth measurements present at each hourly timestamp, but these may be a combination of 1, 3, 6, 12 and 24 hourly measurements.  We advise users wishing to take advantage of the precipitation information to make plots like those in Figure 1 to check for themselves what data have been included.

In light of this possible confusion from our netCDF variable names, we have taken, or will take, a number of actions:

1)    Added notes to the HadISD webpages to clarify our naming scheme and inform users about the need to use both the “precip1_depth” and “precip1_period” fields.  We’ve also improved the metadata for these two variables on the webpages.
2)    Improve the metadata of the “precip1_depth” and “precip1_period” fields in the netCDF files in the next update (v2.0.2.2017f).
3)    In the longer term, extract all four ISD precipitation fields where available, and attempt to disaggregate into 1, 3, 6, 12 and 24 hourly accumulation fields within the netCDF files.  However, it is unlikely that we will be doing any quality control on these data and so we will still advise caution when using these.


We note again that the precipitation information in HadISD is not quality controlled at the moment. 


Please do get in touch if you would like more information.

Tuesday, 23 January 2018

HadISD v2.0.2.2017p

We have just released version 2.0.2.2017p of HadISD on the Hadobs website.  The data now cover 1931/1/1 to 2017/12/31.

Downloading the data from the ISD finished on Monday 15th January and the quality control and other processes ran over the following days.

There are 8103 stations in this version of HadISD, a full 2000 more than in HadISD version 1.0.x.  However, there have been no changes to the quality control tests since v2.0.1.2016f.

As always, if you notice anything untoward in the dataset please do get in touch.  We intend to run a final version in a few months' time if there have been changes to the ISD data in 2017 or earlier years in the intervening time.

We hope to move to monthly updates during 2018, which will entail some minor changes to the QC code, but which should not impact the annual update methods.  We will post on here in due course when this project is nearing completion.

Tuesday, 22 August 2017

Digitisation and reporting resolution

A couple of years ago James Goldie (UNSW) contacted me about an issue he found in HadISD relating to the reporting resolution of temperature and humidity information for stations in Australia.

In the HadISD, the data vary between single-degree, half-degree and 1/10th degree resolution.  However, variations between these can cause some interesting striations in derived quantities.

James has written up his work, with some cool animated plots, on his blog.

Wednesday, 7 June 2017

High windspeed values

Thanks to Phil Jones (UEA) and colleagues for pointing out this issue.

A number of stations have wind speed values of 88 m/s, which stands out as a repeating value (see Figure 1).

Fig 1. Station 151080-99999 (Ceahlau Toaca, 46.983N, 25.950E, 1898.0m) showing the wind speeds and inhomogeneities (vertical lines).  The cluster of high values between 1991 and 2001 is clear (v2.0.1.2016f).
These may be the result of a mistyped missing data code in the original data.  It is also clear that this station may have rounding or conversion problems - we have not had the chance to investigate in detail so far.

The maximum wind speed used for the record check is 113.3 m/s (derived from a maximum gust speed - https://wmo.asu.edu/content/world-maximum-surface-wind-gust), so this would not exclude these values.  The wind speeds are not passed through the distributional or frequent value checks as the shape of the distribution is not Gaussian and, to this point, these tests have been written assuming this shape.  Nor is the spike check applied.  Therefore, unfortunately, our QC suite is not (yet) clever enough to identify these erroneous values.
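For users who want to screen their own stations in the meantime, a simple count of repeated high values that still pass the record check might look like the sketch below.  This is not part of the HadISD QC suite; the function name, the 50 m/s "high" cut-off and the minimum repeat count are arbitrary assumptions chosen for illustration.

```python
import numpy as np

WIND_RECORD = 113.3  # m/s, the limit used by the record check


def repeated_high_values(speeds, high=50.0, min_repeats=3):
    """Count high wind speeds that pass the record check but repeat often.

    Returns a dict mapping each repeated value to its number of occurrences.
    """
    speeds = np.asarray(speeds, dtype=float)
    candidates = speeds[(speeds >= high) & (speeds <= WIND_RECORD)]
    values, counts = np.unique(candidates, return_counts=True)
    return {float(v): int(c) for v, c in zip(values, counts) if c >= min_repeats}


# Made-up series: 88 m/s passes the record check but occurs three times.
example = [3.0, 88.0, 5.0, 88.0, 88.0, 60.0, 120.0]
suspects = repeated_high_values(example)
print(suspects)
```

A value picked out this way is only a candidate for closer inspection, since genuine high winds can also repeat at coarse reporting resolutions.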

At the current time we do not have a solution to these issues - we would rather make folks aware than try and implement a "quick fix" which causes issues elsewhere.  We will look into this during the course of this year and hope to roll out improvements to the wind QC in the next update.

The stations which have been noted as affected by repeated high values are:
151080-99999
156150-99999
156270-99999
228370-99999

Though others are noted to have one or a few high values.


Please do not hesitate to get in touch if you do spot any issues or would like more information on these.