Thursday 19 December 2013

Spurious Stations - bad mergers

While homogenising HadISD (v.1.0.1.2012p) we have come across a number of stations which are bad mergers or station moves.  These are the stations where the PHA algorithm of Menne & Williams (2009) found breaks larger than 5 degrees in temperature.

026720-99999 KALMAR               +56.733 +016.300   +16m Sweden
157250-99999 SNEJANKA /TOP/SOMME  +41.667 +024.683  +193m Bulgaria
700638-99999 FALSE PASS           +54.850 -163.417    +6m US/Alaska
710620-99999 VIOLET GROVE         +53.000 -115.117  +903m Canada
710730-99999 QUEENTOWN            +50.600 -112.800  +941m Canada
715500-99999 DAUPHIN CS           +51.100 -100.050  +305m Canada
718260-99999 PANGNIRTUNG          +66.150 -065.717   +23m Canada
718360-99999 MOOSONEE             +51.283 -080.600    +9m Canada
719040-99999 WIMBORBE             +51.933 -113.583  +940m Canada
726626-99999 LANGLADE             +45.150 -089.117  +464m US/Wisconsin
729595-99999 TUKTOYAKTUK          +69.433 -133.033    +1m Canada


For station 719040, we suspect a typographical error in the ISD listing file, as Wimborne is the name in the Environment Canada listing.


In some cases we can determine the likely cause of the inhomogeneity.

Kalmar, False Pass, Dauphin, Moosonee and Tuktoyaktuk are all likely to be the result of erroneous merges carried out when creating HadISD.


The station number of Violet Grove (Alberta) has been reused by Environment Canada.  The station number used to belong to Bernard Harbour (Nunavut, 68.8, -114.8).  

The station number of Queenstown (Alberta) has also been reused by Environment Canada.  The station number used to belong to Cluff Lake (Saskatchewan) between 1999 and 2005 and to Fort Reliance (Northwest Territories) until 1994.

The station number of Pangnirtung (Nunavut) has also been reused by Environment Canada.  The station number used to belong to Nitchequon (Quebec) until 1985.

The station number of Wimborne (Alberta) has also been reused by Environment Canada.  The station number used to belong to Quaqtaq (Quebec) until 1989.

Snejanka and Langlade were not merged when creating HadISD, and therefore we are not sure why the large inhomogeneities have occurred.

Please be careful when using these stations.

The merging process of HadISD will be addressed and updated in the near future, to include all the information on station moves that we have available.


026720-99999, Kalmar
157250-99999, Snejanka
700638-99999, False Pass
710620-99999, Violet Grove
710730-99999, Queenstown
715500-99999, Dauphin
718260-99999, Pangnirtung
718360-99999, Moosonee
719040-99999, Wimborne
726626-99999, Langlade
729595-99999, Tuktoyaktuk


Thursday 3 October 2013

Heat-waves, time-series and Voronoi tiling

I'm currently working on using HadISD to study some heat-waves (mainly ones which have been studied in detail before).  A number of options present themselves when studying heat-waves, and Sarah Perkins (UNSW) has done some assessments on the best types of indices to use for heat-wave studies (2012, J. Climate, 26, 4500–451).  For the moment, however, I've stuck with something I've used before when assessing the performance of HadISD, and also something new to show spatial extents.

Time Series


To study the effect at an individual station, we can use the HadISD data to show the time-series from a particular year against the range expected from a climatology period.  As HadISD covers 1973-2012 (for v1.0.1.2012p), we have used the 30 year period 1975-2004.
Fig. 1. The daily temperatures from 2010 (green) shown on the 5th - 95th percentile range derived from the 30 year climatology over 1975-2004 (yellow band) for Moscow Botanical Gardens.

Fig. 1 shows the daily temperature for 2010 for the station in the Moscow Botanical Gardens (276120-99999, 55.833N, 37.617E).  To create the daily temperature we have required that there are at least 4 observations in a day (24hrs) and that these are spread over at least 12 hours.  For a climatology to be calculated, we require that valid days be present over at least 20 years in the 30 year period.  We also show the 5th - 95th percentile range in the yellow band, and have highlighted the days where the daily average temperatures are above the 95th percentile in red, and below the 5th percentile in blue.
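The validity rules above can be sketched in a few lines of Python.  This is only an illustration and not the code used for the plots: it assumes that "spread over at least 12 hours" means the time between the first and last observation of the day, and that the percentiles are taken separately for each calendar day.  The function names are made up for this example.

```python
from datetime import datetime


def daily_mean(obs, min_obs=4, min_span_hours=12):
    """Mean of one day's observations, or None if the day is not valid.

    obs is a list of (datetime, temperature) pairs, all from the same day.
    Assumption: the 12 hour spread is measured from first to last observation.
    """
    if len(obs) < min_obs:
        return None
    times = [t for t, _ in obs]
    span_hours = (max(times) - min(times)).total_seconds() / 3600.0
    if span_hours < min_span_hours:
        return None
    return sum(temp for _, temp in obs) / len(obs)


def percentile(values, q):
    """Linearly interpolated percentile (q from 0 to 100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def climatology_range(values_by_year, min_years=20):
    """5th and 95th percentiles for one calendar day, or None if too few years.

    values_by_year maps year -> valid daily mean for that calendar day,
    restricted to the climatology period (1975-2004 here).
    """
    if len(values_by_year) < min_years:
        return None
    values = list(values_by_year.values())
    return percentile(values, 5), percentile(values, 95)
```

A day with four observations at 00, 06, 12 and 18 UTC passes both tests, whereas four observations between 00 and 03 UTC would be rejected because they span only 3 hours.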

The extreme warm period in late July and early August is clearly visible, and gives some impression as to the intensity and duration of the heat wave at this one station.  The magnitude of this event becomes clearer if we show the same plot for Paris-Montsouris (071560-99999, 48.817N, 2.333E) in 2003, Fig. 2.
Fig. 2. The daily temperatures from 2003 (green) shown on the 5th - 95th percentile range derived from the 30 year climatology over 1975-2004 (yellow band) for Paris-Montsouris

Spatial Extent & Voronoi Tiling

However, what about showing the spatial extent of a heat-wave?  With station data, we can show the value for each station as a coloured dot.  This isn't the clearest way of presenting the data, as is hopefully obvious in Fig. 3.

Fig. 3 The 2010 Moscow heat-wave in HadISD for July.  Each station has been coloured by the number of degree days over climatology (see text for details)
What is plotted in Fig. 3 is basically the integral of the area highlighted in red in Fig. 1 & 2.  It is the sum over one month within a given year (July 2010 in this case) of the number of degrees the daily average is above the 95th percentile of the climatology (1975-2004 as above).  We are not counting the periods where the daily average is below the 5th percentile in the sum.  This measure gives an indication of the combined duration and intensity of an event.  A long event of only a few degrees above the 95th percentile would give the same signal as a short event which is many degrees above.  A few "bulls-eye" stations do stand out, which have high values but are not close to the centre of the heat-wave.
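As a rough sketch of this index (the function name and the handling of missing days are my own choices for illustration), the monthly sum can be written as:

```python
def heatwave_index(daily_means, p95):
    """Sum of degrees by which daily means exceed the 95th percentile.

    daily_means and p95 are equal-length lists covering one month; an entry
    of None marks a day with no valid mean or no valid climatology, and such
    days are skipped (an assumption made for this sketch).
    Days at or below the threshold contribute nothing to the sum.
    """
    total = 0.0
    for temp, threshold in zip(daily_means, p95):
        if temp is None or threshold is None:
            continue
        excess = temp - threshold
        if excess > 0:
            total += excess
    return total


# Ten days at 2 degrees above the threshold...
long_mild = heatwave_index([27.0] * 10, [25.0] * 10)
# ...give the same value as two days at 10 degrees above it.
short_intense = heatwave_index([35.0] * 2, [25.0] * 2)
```

Both examples give 20 degree days, which is the trade-off between duration and intensity described above.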

To try and improve the presentation of this heat map I played around with something called Voronoi tessellation (also known as Thiessen polygons).  This technique divides up an area containing a number of fixed points into one polygon per point, such that each edge of a polygon perpendicularly bisects the line joining two neighbouring points.  This is hopefully clear in the example below, which just colours each polygon at random, but also shows the lines which are bisected in red.
Fig. 4 Voronoi tiling.  The red lines show the connections between all the points, forming a set of Delaunay Triangles.  The Voronoi polygons are formed by joining all the bisectors of the edges of the triangles.
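An equivalent way of describing a Voronoi polygon is as the set of all locations that are closer to its point than to any other point.  The sketch below uses that definition to label a coarse raster by nearest point.  It is not how the figures here were made, it uses flat-plane distances (real station maps would need great-circle distances or a projection), and the function names are invented for this example.

```python
def nearest_site(x, y, sites):
    """Index of the site closest to (x, y) in plane (Euclidean) distance."""
    return min(range(len(sites)),
               key=lambda i: (sites[i][0] - x) ** 2 + (sites[i][1] - y) ** 2)


def voronoi_raster(sites, nx, ny, xmin, xmax, ymin, ymax):
    """Label each cell centre of an nx-by-ny raster with its nearest site.

    Returns a list of ny rows, each a list of nx site indices.  Cells with
    the same label together approximate one Voronoi polygon.
    """
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    return [[nearest_site(xmin + (i + 0.5) * dx, ymin + (j + 0.5) * dy, sites)
             for i in range(nx)]
            for j in range(ny)]


# Two sites at x=1 and x=3: the boundary falls on x=2, the perpendicular
# bisector of the line joining them.
labels = voronoi_raster([(1.0, 1.0), (3.0, 1.0)], 4, 2, 0.0, 4.0, 0.0, 2.0)
```

Colouring each raster cell by the value of its labelled station gives a blocky version of the tiled map; a proper Voronoi routine returns the exact polygon edges instead.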
Combining the Voronoi method with the HadISD station distribution and the heat-wave index outlined above, results in the following map, also for July 2010.
Fig. 5 The heat map for Moscow in July 2010 using the Voronoi tiling method.  The location of each station is shown by a grey dot, usually close to the middle of the polygon, but not always so.
Using this method, the intensity of the heat-wave is much clearer than in Fig. 3.  The few stations which for some reason have high values but are not in the heat-wave region (e.g. south Ukraine and central Turkey) stand out just as much as in Fig. 3.  By the nature of the tiling method, it is assumed that a station is representative of the area surrounding it.  Many stations are on the coast (UK, Norway etc.) and these are not clearly visible in Fig. 3; however, the areas they represent are very clear in this representation.


An alternative way of presenting this kind of data would have been to grid up the individual stations into grid boxes.  Although this would have shown a very similar pattern, it would not be immediately clear from the resulting map how many stations were contributing to a grid box.  Some gridding methods do not require any stations within the grid box, but use a weighted average of those stations within a search radius.  The gridding process would also act as a smoothing function on the data, reducing the intensity of the maxima and minima.

Personally I think this is a good way of presenting the station data of HadISD in a space filling way without resorting to gridding.

Thursday 20 June 2013

Years versus file size


During the review of the HadISD paper (see documents on Climate of the Past Discussions) we were asked to quantify how many stations report for how long in the ISD (Integrated Surface Dataset).  Our comment that "many stations report only rarely" could have been misleading.  We therefore did a quick analysis of the stations in the ISD in July 2012.  I've just re-run the code to update to the current status of the ISD, and thought the results might be of wide enough interest to not remain buried on the discussion paper.

Of the 29,678 unique station IDs present in the database (on 20 June 2013), 14,159 report in fewer than 10 years (almost half), 18,921 in fewer than 20 and 21,962 in fewer than 30 years.  One station reports in 81 years.  The mean length is 18.2 years, but the median is only 11.  The distribution is shown in Fig. 1.
Fig. 1 Number of stations against the number of years they report for.  The spike at 40 years is the result of a sudden increase in the number of stations in 1973.
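For anyone wanting to repeat this kind of count, a minimal sketch is given below.  It assumes you already have a list of (station ID, year) pairs, for example built from a listing of the yearly ISD files; how that list is obtained is not shown, and the function names are made up for this example.

```python
from collections import defaultdict
from statistics import mean, median


def years_reported(records):
    """Map each station ID to the number of distinct years it reports in.

    records is an iterable of (station_id, year) pairs; repeated pairs
    are counted once.
    """
    years = defaultdict(set)
    for station_id, year in records:
        years[station_id].add(year)
    return {sid: len(yrs) for sid, yrs in years.items()}


def count_shorter_than(n_years, limit):
    """Number of stations reporting in fewer than `limit` years."""
    return sum(1 for n in n_years.values() if n < limit)


# Toy example with three invented station IDs.
records = [("A", 1973), ("A", 1974), ("A", 1974),
           ("B", 2000),
           ("C", 1990), ("C", 1991), ("C", 1992)]
n_years = years_reported(records)
summary = (mean(n_years.values()), median(n_years.values()))
```

A skewed distribution with many short records and a few long ones produces a mean well above the median, as seen in the real numbers above.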
We used the file size in bytes as an indicator of the number of records, as a station that reports only once a year for many years is not much use for climatological studies.  Most stations had sizes between 10^5 and 10^7 bytes.  The figure below shows the distribution of the number of years a station reports for against the file size.  The colour scale is logarithmic.
Fig. 2 Years with records in the ISD against total file size over all years for each station ID.  Created on 20/6/2013
This shows that there are some stations which have lots of data in them but only for a small number of years (bottom right of the figure).  The apparent diagonal cut-off, from bottom left to top right, shows the link between file size (in bytes) and years which have data, assuming a fairly constant set of reported variables.

The file size is not a perfect proxy for the completeness of a record, but if combined with the number of years in which a station reports, many stations which only report for a few years or contain very little data can easily be excluded from any station selection made.
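A simple version of such a selection might look like the sketch below.  The thresholds in the example are purely illustrative and are not the values used to select the HadISD stations.

```python
def select_stations(stations, min_years, min_bytes):
    """Return the IDs of stations with enough years and enough data.

    stations maps station_id -> (years_with_data, total_file_size_in_bytes).
    min_years and min_bytes are hypothetical thresholds chosen by the user.
    """
    return [sid for sid, (n_years, n_bytes) in sorted(stations.items())
            if n_years >= min_years and n_bytes >= min_bytes]


# Toy example with invented station IDs.
stations = {
    "long-sparse": (40, 5 * 10**4),   # many years, very little data
    "short-dense": (3, 8 * 10**6),    # lots of data, few years
    "long-dense": (40, 6 * 10**6),    # many years and plenty of data
}
kept = select_stations(stations, min_years=20, min_bytes=10**5)
```

Only the station with both a long record and a reasonable file size survives; either test alone would let one of the other two through.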

Friday 10 May 2013

Station Reporting Interval

While working on a related project, the issue of station reporting interval came up.  The HadISD stations were selected to try and choose those in the ISD which reported every hour or every 3 hours.  To investigate how many of each type occurred in any year, I made the following plots.

Number of stations at each reporting interval
Proportion of stations at each reporting interval

The first plot shows the number of stations at each reporting interval.  I've included the option of 6 hourly as well.  Firstly, there is a drop in the number of active stations after around 1990.  This plot shows the reporting intervals for the ~4200 "accepted" stations, i.e. those which are thought to be suitable for climatological studies.  The second plot shows how the proportion of the different reporting intervals changes.  The number of three hourly stations falls off with time, and the number of hourly stations increases.
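One possible way of assigning a reporting interval to a station in a given year is to take the most common gap between consecutive observations.  This is an assumption made for illustration and not necessarily the method behind the plots above; the function name is also made up.

```python
from collections import Counter


def reporting_interval(hours):
    """Most common gap, in whole hours, between consecutive observations.

    hours is a list of observation times as integer hours since some
    reference time.  Duplicate times are ignored.  Returns None if there
    are fewer than two distinct times.
    """
    ordered = sorted(set(hours))
    gaps = Counter(b - a for a, b in zip(ordered, ordered[1:]))
    if not gaps:
        return None
    return gaps.most_common(1)[0][0]


# A three hourly station with one missing report is still seen as 3 hourly.
three_hourly = reporting_interval([0, 3, 6, 9, 12, 18, 21])
hourly = reporting_interval(list(range(24)))
```

Using the most common gap rather than the mean gap makes the result less sensitive to occasional missing reports.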

If your application could depend on the reporting interval (or changes in it) then this may be of use.








Thursday 25 April 2013

Quality Control Code Released

We have released the IDL code which performs the detailed quality control (QC) on the HadISD data.  This is the code as it was used for the v1.0.1.2012p release.  We are not supporting the code, but there is a README file included in the zip archive, which can be found here.  Most of the necessary files should be available in the online material.  However, if there are any input files or IDL program files that you feel are missing, please do let us know so that we can include them or explain why they have not been included.

The code that has been released is scientific code.  During its development we have tried to ensure its readability.  However it is a complex piece of code, and we welcome all suggestions on how to improve it for future versions of HadISD.  The dataset is still under development, and so we envisage there being changes in the code (both language and logic) in the future.


Tuesday 12 February 2013

Message from NCDC ISD team

There is a message from the ISD team at NCDC on the ISD FTP server about a problem with their merging process:

NCDC has identified a problem with merged ISD data. Under certain circumstances station merging is not taking place when it should, resulting in unsorted, non-merged output. This problem may be found in files on this FTP server from 2005-present. 

We are working to correct this issue and we apologize for any inconvenience this may cause.

This may have affected v1.0.1.2012p and we are investigating.  We will wait until this issue is resolved before finalising this update and creating v1.0.1.2012f.

Please let us know if you find anything untoward in the data.
 
 


Monday 11 February 2013

2 Stations with the Wrong Latitudes

Thanks to Tim McVicar (CSIRO) we've found two stations that have been listed with the wrong latitude in our station listing files:

917650 61705 PAGO PAGO       +14332 -170711 +00030  (should be -14.3, -170.7)
    (American Samoa)
619670 70701 DIEGO GARCIA +07300 +072400 +00027  (should be -7.3, 72.4)

Both of these stations have the wrong latitudes in both v1.0.0.2011f and the current v1.0.1.2012p.  The listing files will be updated and synchronised with the ISD listing files before running the final v1.0.1.2012f quality control run.

Please do let us know if you find any other quirks in the data.
See also comment here regarding the effect on HadISDH.

Wednesday 6 February 2013

Version 1.0.1.2012p

#!/usr/bin/python
# Parenthesised form runs under both Python 2 and Python 3
print("hello world")


This blog will hopefully act as a repository for ongoing developments with the HadISD dataset.  As this is a new dataset, all kinds of quirks or short interesting results are likely to come up, which might be of interest to users, and we can try and highlight these here. 

The current stable version of HadISD is 1.0.0.2011f, however, a preliminary version has just been released (1.0.1.2012p).  Although these versions are just a mass of numbers, hopefully the following will make them clear.

  • The final letter indicates whether the dataset is f-final or p-preliminary.  A final dataset is stable, and will not change, whereas this preliminary dataset will be updated and overwritten at some point in the future.  This is because the source data for HadISD, the Integrated Surface Database (ISD), is still being updated for the 2012 data.  Once ISD is stable, we'll re-run our scripts and update the data.
  • The year stamp indicates the final complete year in the dataset (the first year is 1973, and for the moment is unlikely to move to earlier years)
  • The remaining three numbers indicate major-moderate-minor changes.  Major changes (e.g. a complete re-write of our scripts) would require an accompanying publication in a peer-reviewed journal.  Moderate changes (e.g. an update to one of the QC tests) would have a technical note to explain the change and the resulting differences available on the website.  Minor changes include updates to past years (not just the most recent) and will be noted on the download page.
Now it should be nice and clear - the dataset is a preliminary version of the one which includes data up to the end of 2012.  There has been a minor change in one of the QC tests (the global and African record high temperatures have changed, see paper and WMO press release) and the ISD maintainers updated data in the years 2004-9 and 2011.
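If you need to handle these version strings in your own scripts, the scheme above can be unpacked with a short regular expression.  This is just a sketch based on the description in this post; the names `Version` and `parse_version` are made up for the example and are not part of any HadISD software.

```python
import re
from collections import namedtuple

Version = namedtuple("Version", "major moderate minor end_year status")

# Three change numbers, a four digit year stamp and a final f or p.
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)\.(\d{4})([fp])$")


def parse_version(text):
    """Split a version string such as '1.0.1.2012p' into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError("not a HadISD-style version string: %r" % text)
    major, moderate, minor, year, flag = match.groups()
    status = "final" if flag == "f" else "preliminary"
    return Version(int(major), int(moderate), int(minor), int(year), status)


current = parse_version("1.0.1.2012p")
stable = parse_version("1.0.0.2011f")
```

Because the result is a tuple ordered as major, moderate, minor, year, two versions can be compared directly to see which is the more recent.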

We have checked through some of the summary and diagnostic plots (all available on the website too) and found no major unexpected changes in the flagging rates, however it is still a preliminary dataset.  If you find anything untoward or have problems please contact the maintainers through the HadISD website.