Monday, 20 October 2014

Further thoughts on the Merging Problem

To follow on from the previous post, we've done some more thinking about the issue of merging stations correctly.  The three options from last time were:
  1. We can choose not to merge at all, keeping all the ISD station IDs unique, and so be confident that by creating HadISD we have not degraded any of the data.
  2. We can merge only in cases where we have specific information (from a national met service, for example) as to the identity of stations.  This could also be applied in cases where we have information indicating that a split of a station record would be appropriate.
  3. We can merge (and split) when we have specific information, but also run an automated procedure to identify candidate stations to merge together.  An example algorithm has been produced by the International Surface Temperature Initiative databank v1.0 (see description in the paper).  This approach is very likely to introduce some spurious mergers, however careful we are with the algorithm.
Our preference at the time was leaning towards number 2.

When we were thinking of option 3, what we had in mind was something similar to the merging process carried out for HadISD v1.0.0.  In that process the merging of short records was carried out before stations were selected, so that lots of short records, once merged, would be long enough to pass the selection criteria.  Cross-matching all 29,000 stations in ISD in this way, to obtain a parent set from which HadISD is drawn, would result in some final stations being composed of many short segments.  Given the automated nature of the build, the likelihood that some of these stations would be erroneously merged together is quite high.

A subtly different alternative came to mind.  We could instead select stations on the raw ISD record lengths and reporting intervals and then, using this master list, see which other stations in the ISD could be merged in to supplement these primary stations.  This will not increase the final station list (in fact it will decrease it, as some of the selected stations will be merged together), but should improve the data coverage over time for the final set of merged stations.
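
As a rough illustration of the ordering described here (select first, then merge in supplementary records), the sketch below builds a merge plan from a list of station IDs.  The function name build_merge_plan, the two callables is_selected and is_match, and the greedy first-come assignment are all assumptions made for this example; they are not the HadISD code.

```python
def build_merge_plan(all_ids, is_selected, is_match):
    """Return {primary_id: [ids merged into it]}.

    is_selected(sid) -> bool : hypothetical record-length / interval test
    is_match(a, b)   -> bool : hypothetical, symmetric merge-candidate test
    """
    # Step 1: pick primary stations on their own raw records.
    primaries = [sid for sid in all_ids if is_selected(sid)]

    plan = {}
    absorbed = set()  # stations already merged into some primary
    for sid in primaries:
        if sid in absorbed:
            continue  # this primary was merged into an earlier one
        plan[sid] = []
        # Step 2: look through the whole database for supplements.
        for other in all_ids:
            if other == sid or other in absorbed or other in plan:
                continue
            if is_match(sid, other):
                plan[sid].append(other)
                absorbed.add(other)
    return plan
```

With three selected stations of which two match each other, the plan has two keys: the final list is shorter than the primary list, as noted above, while short unselected records can still contribute data.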

Fig. 1: Flowchart showing envisaged station selection procedure with merging

Most of the merging process takes place after stations have been selected on the basis of their length of record and their reporting interval.  For specific countries, however, it occurs before selection, as we have extra and definitive information as to which stations should be merged or split.  At the moment there are also stations in the master list which are selected to merge with other stations in the master list, hence the reduction to 8207.

Selecting stations to Merge

To select stations which are possible merging candidates we are so far using a very simple algorithm.  We test the horizontal and vertical separation of the stations and also the similarity of the station names.  Each separation is mapped onto an exponential decay curve, which returns a value between 0 and 1 that we use as a probability.  For the horizontal separation this curve falls to 1/e at 25 km, and for the vertical separation at 100 m.  To calculate the similarity of the station names, we use the Jaccard Index (also used by ISTI), which also returns a value between 0 and 1.  These three probabilities are multiplied together, and station pairs where the final value is >0.5 are selected.
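
The scoring described above can be sketched roughly as follows.  The 25 km and 100 m scales and the 0.5 threshold come from the text; everything else is an assumption for illustration: the Station type, the function names, the use of great-circle distance for the horizontal separation, and the choice to compute the Jaccard Index on sets of character pairs (the post does not say which sets are compared).

```python
import math
from typing import NamedTuple


class Station(NamedTuple):  # hypothetical container for this sketch
    name: str
    lat: float     # degrees north
    lon: float     # degrees east
    elev_m: float  # metres


def decay(separation: float, scale: float) -> float:
    """Exponential decay: 1 at zero separation, 1/e at `scale`."""
    return math.exp(-abs(separation) / scale)


def haversine_km(a: Station, b: Station, radius_km: float = 6371.0) -> float:
    """Great-circle distance (assumed measure of horizontal separation)."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def bigrams(name: str) -> set:
    """Set of adjacent character pairs (assumed tokenisation)."""
    s = name.strip().upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(name_a: str, name_b: str) -> float:
    """Jaccard Index: size of intersection over size of union."""
    x, y = bigrams(name_a), bigrams(name_b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def merge_score(a: Station, b: Station) -> float:
    """Product of the horizontal, vertical and name probabilities."""
    return (decay(haversine_km(a, b), 25.0)
            * decay(a.elev_m - b.elev_m, 100.0)
            * jaccard(a.name, b.name))


def is_candidate(a: Station, b: Station, threshold: float = 0.5) -> bool:
    return merge_score(a, b) > threshold
```

Because the three values are multiplied, a pair with identical names and positions but a 100 m height difference already scores 1/e (about 0.37) and is not selected.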

For an automated system there is no perfect result - there will always be false positives (distinct stations that we select to merge but should not) and false negatives (stations that we should merge but do not select).  An inspection of the resulting candidates for the UK (where we have a better idea of the suitability of merging these candidates) suggests that the values used above are reasonable, with no obvious false positives.  The algorithm and thresholds are not yet set in stone, so changes can still occur.

At present we find that within the primary station list, 478 stations are similar to others also in the list, reducing the number of stations from 8685 to 8207 (including the changes from the German stations outlined below).  By cross-matching these 8207 stations to the complete ISD database, we find that 2101 of them will contain data from other station IDs.

Fig. 2: The effect of the current merging system on the number of stations that report over time.  Improvements at the beginning and end of the record are clearly visible.
Fig. 2 shows the effect of the merging selection as it currently stands on the stations available in each year.  There are clear improvements in the number of merged stations available in the early part of the record (1935-1970) and also the last 10 years or so. 
 

Specific Countries - Germany & Canada

For some countries we have specific information about which stations to merge or split (and we hope to obtain more of these lists as time goes on).  Currently we have information about German stations, whose IDs start with 09 and 10 in the isd-history.txt file.  Here, the last 4 digits of the WMO-ID are important, and some stations have had their records split so that 09abcd and 10abcd are the same station.  So we use the selection algorithm to check these station pairs specifically, and allow them to merge if they pass the same criteria as outlined above.
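
Finding the 09abcd/10abcd pairs to check could look something like the sketch below.  It assumes the IDs are available as strings laid out exactly as in the 09abcd notation above (a two-character prefix followed by the shared digits); the function name german_pairs is made up for this example.

```python
def german_pairs(station_ids):
    """Return sorted (09xxxx, 10xxxx) ID pairs that share the same suffix."""
    by_suffix = {}
    for sid in station_ids:
        prefix, suffix = sid[:2], sid[2:]
        if prefix in ("09", "10"):
            by_suffix.setdefault(suffix, {})[prefix] = sid
    # Keep only suffixes seen under both prefixes.
    return sorted(
        (ids["09"], ids["10"])
        for ids in by_suffix.values()
        if len(ids) == 2
    )
```

Each returned pair would then still be passed through the same distance and name checks before being merged.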

Including this information prior to the station selection criteria results in 8685 stations being selected compared to 8667 before.

For Canada, things are a little more complicated.  There are only 1000 WMO-IDs available for Canada, and as a result, stations with different locations have ended up with the same IDs.  Thanks to Environment Canada, we have a list of the station moves.  In this case we want to split up records so that apparent false mergers are not included in HadISD.  We are still working on including this information in the station selection code.

Note

These criteria and this procedure have not yet been finalised.  We may still revert to only merging/splitting stations where we have specific information.  If you have further suggestions or comments, please let us know.

Tuesday, 7 October 2014

Extending HadISD: Station Selection

I have started to re-assess the station selection part of HadISD.  During the early stages of creating HadISD (in around 2008), the ISD database was interrogated to find stations which would be suitable for HadISD.  This process, outlined in the paper, resulted in the 6103 stations which form HadISDv1.0.x.

However, this station list has not been updated since that point.  This means that we have not benefited from any new stations that have been added to the ISD database in recent years.  The static station list may also be partly to blame for the jump in 2005 in the total number of stations with available data (see also HadISDH) and for the fall-off in the number of stations since 1990.
Fig. 1 - Number of stations which have data in any given year in HadISDv1.0.x.  These are all the individual input stations (including those merged to form composites), hence the peak is more than 6103.  The dip at 2005 is visible, as well as the drop before 1973 (which set the start period of HadISD v1.0.x).

At the same time as expanding the station selection we also intend to extend HadISD so that data is available and quality controlled prior to 1973.  It is clear from Fig. 1 why the start year of HadISDv1.0.x was chosen as 1973, but this does result in a relatively short period of record.

Back to the beginning

So, we have gone back to the ISD database to dynamically return a station listing which could be run with each major update of HadISD.  Using the isd-history.txt we extracted those stations which have valid latitudes, longitudes and elevations, and also those which had at least 15 years between their start and end dates.  There are 29525 unique station IDs in the ISD database, and 14947 satisfy these criteria (these numbers will change fractionally as the ISD database is continually updated).
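
A minimal sketch of this first filter is given below.  It works on records that have already been parsed from isd-history.txt; the dictionary keys (lat, lon, elev, begin, end), the use of None for a missing value, and the definition of 15 years as 15 x 365.25 days are all assumptions for illustration.

```python
from datetime import date

MIN_SPAN_DAYS = 15 * 365.25  # "at least 15 years", approximated in days


def has_valid_location(rec):
    """True if latitude, longitude and elevation are present and in range."""
    lat, lon, elev = rec.get("lat"), rec.get("lon"), rec.get("elev")
    if lat is None or lon is None or elev is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def passes_history_check(rec):
    """Valid location and at least 15 years between start and end dates."""
    span_days = (rec["end"] - rec["begin"]).days
    return has_valid_location(rec) and span_days >= MIN_SPAN_DAYS
```

Note that this only tests the span between the first and last dates; a station with long gaps can still pass, which is why the monthly inventory is checked as well.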

The isd-inventory.txt lists the number of observations in each month for each station.  We have used this to find those stations which report on average every 6 hours and which have observations in at least 15 years' worth of months (180 months) to account for stations with many gaps.  This returns 8694 stations worldwide.
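
The inventory test could be sketched as below.  The post does not say exactly how "on average every 6 hours" is computed, so this example assumes it means that, over the months that contain any observations, the total number of hours divided by the total number of observations is at most 6.  The input layout (a dictionary from (year, month) to observation count) and the function name are also assumptions.

```python
import calendar


def passes_inventory_check(monthly_counts, max_interval_h=6, min_months=180):
    """monthly_counts: {(year, month): number of observations}."""
    # Only months that actually contain data count towards the 180.
    reporting = {ym: n for ym, n in monthly_counts.items() if n > 0}
    if len(reporting) < min_months:
        return False
    total_obs = sum(reporting.values())
    total_hours = sum(
        24 * calendar.monthrange(year, month)[1] for (year, month) in reporting
    )
    # Mean interval <= max_interval_h, written without division.
    return total_obs * max_interval_h >= total_hours
```

Under this reading a station reporting four times a day in every one of 180 months just passes, while one reporting twice a day does not.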

Fig. 2. The number of stations which have data using the initial version of the updated station selection code.  The drops in 1972 and 2005 are still visible, but the gentle drop off from 1990 is less pronounced when compared to Fig. 1
We have dropped the reporting interval to every 6 hours rather than every 3 to try and select more stations in those regions where currently HadISD does not have many (e.g. central South America, Africa) but still maintain a subdaily resolution.

As can be seen in Fig. 2, there is still a large dip in 1972, and the drop in 2005 has not entirely disappeared.  However, the drop-off after 1990 is less prominent, and there are more stations overall.  Some of the remaining dips may be ameliorated by merging stations together.

To merge or not to merge?

As the ISD had many stations with short records, when creating HadISDv1.0.x stations were merged to create ones with longer records.  This was done using a hierarchical table (see Table 1 in the paper) to identify potential candidates and then an in-depth and time-consuming manual process to reduce this to the mergers used.  If the station selection is to be run on each major update, then selecting these merger candidates would have to be automated.

We could go back to the raw ISD listings and find stations which are merging candidates for the ~8500 initially selected.  Merging these in could fill some of the gaps in Fig. 2, although this is not guaranteed.

As we have found over time, not all of these mergers are correct, and therefore a number of options present themselves:

  1. We can choose not to merge at all, keeping all the ISD station IDs unique, and so be confident that by creating HadISD we have not degraded any of the data.
  2. We can merge only in cases where we have specific information (from a national met service, for example) as to the identity of stations.  This could also be applied in cases where we have information indicating that a split of a station record would be appropriate.
  3. We can merge (and split) when we have specific information, but also run an automated procedure to identify candidate stations to merge together.  An example algorithm has been produced by the International Surface Temperature Initiative databank v1.0 (see description in the paper).  This approach is very likely to introduce spurious mergers, however careful we are with the algorithm.
We have not yet decided which route to follow, but are leaning towards the second in the first instance.

If you have any further suggestions or preferences, please leave a comment or get in touch.