Thursday, 20 June 2013

Years versus file size

During the review of the HadISD paper (see documents on Climate of the Past Discussions) we were asked to quantify how many stations report for how long in the ISD (Integrated Surface Database).  Our comment that "many stations report only rarely" could have been misleading, so we did a quick analysis of the stations in the ISD in July 2012.  I've just re-run the code to bring it up to date with the current state of the ISD, and thought the results might be of wide enough interest not to remain buried in the discussion paper.

Of the 29,678 unique station IDs present in the database (on 20 June 2013), 14,159 (almost half) report in fewer than 10 years, 18,921 in fewer than 20 and 21,962 in fewer than 30.  One station reports in 81 years.  The mean length is 18.2 years, but the median is only 11.  The distribution is shown in Fig. 1.
Fig. 1 Number of stations against the number of years they report for.  The spike at 40 years is the result of a sudden increase in the number of stations in 1973.
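For anyone wanting to reproduce this kind of summary, here is a minimal sketch in Python. It is not the code used for the numbers above. It assumes you have already built a dictionary mapping each station ID to the number of calendar years in which it reports; the function name summarise_years and the example values are made up for illustration.

```python
from statistics import mean, median


def summarise_years(years_per_station, thresholds=(10, 20, 30)):
    """Summarise record lengths.

    years_per_station: dict mapping station ID -> number of calendar
    years in which that station has at least one report (assumed input).
    """
    counts = list(years_per_station.values())
    # Number of stations reporting in strictly fewer than each threshold
    below = {t: sum(1 for n in counts if n < t) for t in thresholds}
    return {
        "n_stations": len(counts),
        "below": below,
        "mean": mean(counts),
        "median": median(counts),
        "max": max(counts),
    }


# Made-up example values, not real ISD data
example_years = {"A": 3, "B": 8, "C": 11, "D": 25, "E": 40}
summary = summarise_years(example_years)
```

A long tail of stations with long records pulls the mean above the median, which is the pattern seen in the real figures.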
We used the file size in bytes as an indicator of the number of records, as a station that reports only once a year for many years is not much use for climatological studies.  Most stations had sizes between 10^5 and 10^7 bytes.  The figure below shows the distribution of the number of years a station reports for against the file size.  The colour scale is logarithmic.
Fig. 2 Years with records in the ISD against total file size over all years for each station ID.  Created on 20/7/2013
This shows that there are some stations which have lots of data in them but only for a small number of years (bottom right of the figure).  The apparent diagonal cut-off, from bottom left to top right, shows the link between file size (in bytes) and years which have data, assuming a fairly constant set of reported variables.
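A rough sketch of how the counts behind a figure like Fig. 2 could be built is given below. It assumes a list of (years with data, total bytes) pairs, one per station ID, and bins the file size logarithmically. The function name bin_years_vs_size, the choice of two bins per decade and the example values are all hypothetical rather than what was used for the actual figure.

```python
import math
from collections import Counter


def bin_years_vs_size(stations, bins_per_decade=2):
    """Count stations in (file-size bin, years) cells.

    stations: iterable of (years_with_data, total_bytes) pairs (assumed input).
    The size bin is floor(log10(bytes) * bins_per_decade), so each power
    of ten in file size is split into bins_per_decade equal log-width bins.
    """
    grid = Counter()
    for years, nbytes in stations:
        if nbytes <= 0:
            continue  # log10 is undefined for empty files, so skip them
        size_bin = math.floor(math.log10(nbytes) * bins_per_decade)
        grid[(size_bin, years)] += 1
    return grid


# Made-up example values, not real ISD data
example_stations = [(2, 1_500_000), (2, 2_000_000), (30, 5_000_000), (1, 0)]
grid = bin_years_vs_size(example_stations)
```

Because the counts per cell span several orders of magnitude, a logarithmic colour scale is the natural choice when plotting such a grid.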

The file size is not a perfect proxy for the completeness of a record, but combined with the number of years in which a station reports, it allows the many stations which report for only a few years, or which contain very little data, to be easily excluded from any station selection.
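As an illustration of such a selection, the sketch below keeps only stations that pass both a minimum number of years and a minimum total file size. The thresholds of 10 years and 10^5 bytes are placeholders chosen for the example, not recommended values, and the function name select_stations and the station IDs are invented.

```python
def select_stations(stations, min_years=10, min_bytes=10**5):
    """Return the sorted IDs of stations passing both thresholds.

    stations: dict mapping station ID -> (years_with_data, total_bytes)
    (assumed input). min_years and min_bytes are illustrative defaults only.
    """
    return sorted(
        station_id
        for station_id, (years, nbytes) in stations.items()
        if years >= min_years and nbytes >= min_bytes
    )


# Made-up example values, not real ISD data
example_inventory = {
    "short-but-dense": (3, 4_000_000),   # lots of data, few years
    "long-but-sparse": (45, 20_000),     # many years, very little data
    "long-and-dense": (40, 6_000_000),   # passes both tests
}
selected = select_stations(example_inventory)
```

Neither test is enough on its own: the first example station would pass a size-only filter and the second a years-only filter, but only the third survives both.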