The paper describing HadISD version 2.0.0 has just appeared in the Discussions section of Climate of the Past:
http://www.clim-past-discuss.net/11/4569/2015/cpd-11-4569-2015.html
There now follows an 8-week review process. Two anonymous referees will be asked to make comments, which will appear online, and anyone can also make attributed comments (i.e. under their own name), which will also appear. After that we will have the chance to respond (our responses will also be published), and the final paper will appear thereafter. Once all that is done, we can release the dataset and the quality control code. Hopefully this will all go through before the end of the year so that I can also run an update in January to v2.0.1.2015p.
For those of you who follow this blog, a number of the sections in the paper will be familiar. The gist of the paper, however, is the expansion of the time coverage of HadISD from 1973 back to 1931. At the same time we've readdressed the way stations are selected and merged, and so v2.0.0.2014f has 8113 stations, with around 2000 of these being composite.
As part of the creation of HadISD.2.0.0, we have also rewritten all the code in Python for ease of use, which let us check and, in some cases, alter some of the QC tests so that they work a bit better. We have also added new checks for wind speed and direction.
We believe that the result of these changes is that HadISD.2.0.0 is a more useful dataset for the study of extreme events, for model validation, for ingestion into reanalyses and for many other applications.
Update: January 2016
After some useful review comments from the referees and a discussion with the editorial team, it was suggested that we re-submit this paper to Geoscientific Instrumentation, Methods and Data Systems, a partner journal of Climate of the Past. As we are currently updating HadISD.1.0.4, we will do this once all the annual dataset updates are complete, and at the same time update HadISD.2.0.0 to include data from 2015. We aim to resubmit in early spring.
News, updates and interesting features of the HadISD dataset
Wednesday, 30 September 2015
Wednesday, 29 April 2015
Neighbour (buddy) Check for v2.0.0
In HadISD v1.0.x, the neighbour check selects stations within 500m height and 300km distance of the target station. The bearing is also used to assign stations to quadrants (90-degree bins), and the closest 10 are chosen, ensuring that each quadrant contains at least two stations. When fewer neighbours are available the distribution of stations across the quadrants can be lop-sided.
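As a rough illustration of the distance, height and quadrant criteria, here is a minimal sketch. The function names, the haversine formula and the Earth radius value are assumptions made for this example; they are not taken from the HadISD code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value for this sketch)


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) and initial bearing (degrees, 0-360) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Haversine formula for the distance
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    dist = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    # Initial bearing, measured clockwise from north
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    bearing = math.degrees(math.atan2(y, x)) % 360.0
    return dist, bearing


def quadrant(bearing):
    """Map a bearing to a 90-degree bin: 0 = NE, 1 = SE, 2 = SW, 3 = NW."""
    return int(bearing % 360.0 // 90)


def is_candidate(dist_km, target_elev_m, neighbour_elev_m,
                 max_dist_km=300.0, max_height_m=500.0):
    """Apply the 300km distance and 500m height limits quoted in the text."""
    return (dist_km <= max_dist_km
            and abs(target_elev_m - neighbour_elev_m) <= max_height_m)
```

The quadrant numbering (0 for north-east round to 3 for north-west) is an arbitrary choice for the sketch.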
For HadISD v2.0.0, I wanted to improve the station selection, because a neighbour that is close is not necessarily useful when running the buddy checks. So the new neighbour selection uses the correlation coefficient of the target and neighbour time series as well as the data overlap (very important in early years). The details of both at this point are as follows.
Initially stations are selected on the basis of distance to ensure that the neighbours experience similar weather to the target. Then, the correlation of the two timeseries is obtained. However, so that the correlation is not dominated by the annual or diurnal cycle, the timeseries are processed to remove these. Firstly daily means are calculated for all days which have more than 6 observations, which are used to create the climate anomalies for each observation. To further remove the diurnal cycle, hourly means are calculated and used to create "anomalised climate anomalies" for each observation. These time series are used to calculate the correlation coefficients.
The reason for using the data overlap as another criterion results from the lengthened data coverage of HadISD v2.0.0. Few stations will have coverage over the entire 1931-2014 period, and so it would be highly likely that neighbours selected in terms of distance alone have no concurrent data. I use the fraction of observations that are also present in the neighbour as the overlap value.
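One possible reading of this two-step anomaly calculation is sketched below with NumPy. It assumes each series sits on a regular hourly grid (24 values per day, NaN where missing), that each day's own mean is subtracted first, and that the mean anomaly for each hour of the day is subtracted second. The function names and these details are assumptions for illustration, not the HadISD implementation.

```python
import numpy as np

MIN_OBS_PER_DAY = 6  # a day needs more than this many observations (from the text)


def remove_cycles(hourly):
    """Return doubly anomalised values for a 1-D hourly series (NaN = missing)."""
    days = np.asarray(hourly, dtype=float).reshape(-1, 24)

    # Step 1: daily means, only for days with enough observations
    valid = ~np.isnan(days)
    n_obs = valid.sum(axis=1)
    day_sum = np.where(valid, days, 0.0).sum(axis=1)
    daily_mean = np.full(days.shape[0], np.nan)
    enough = n_obs > MIN_OBS_PER_DAY
    daily_mean[enough] = day_sum[enough] / n_obs[enough]
    anoms = days - daily_mean[:, None]

    # Step 2: mean anomaly for each hour of the day, to remove the diurnal cycle
    a_valid = ~np.isnan(anoms)
    counts = a_valid.sum(axis=0)
    sums = np.where(a_valid, anoms, 0.0).sum(axis=0)
    hourly_mean = np.full(24, np.nan)
    has_data = counts > 0
    hourly_mean[has_data] = sums[has_data] / counts[has_data]

    return (anoms - hourly_mean[None, :]).ravel()


def correlation(a, b):
    """Pearson correlation using only times where both series have data."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return np.nan
    return np.corrcoef(a[both], b[both])[0, 1]
```

With this reading, a series made up only of a day-to-day offset plus a fixed diurnal cycle reduces to zeros, so whatever correlation remains comes from the weather signal shared by the two stations.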
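A minimal sketch of the overlap value, assuming both series share the same time axis and use NaN for missing observations (the function name is hypothetical):

```python
import numpy as np


def overlap_fraction(target, neighbour):
    """Fraction of the target's observations that also exist in the neighbour."""
    t_valid = ~np.isnan(np.asarray(target, dtype=float))
    n_valid = ~np.isnan(np.asarray(neighbour, dtype=float))
    n_target = t_valid.sum()
    if n_target == 0:
        return 0.0  # no target data, so nothing can overlap
    return float((t_valid & n_valid).sum() / n_target)
```

For example, a target with three observations, two of which coincide with neighbour observations, gives an overlap of 2/3.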
The neighbours are then sorted by the linear combination of the correlation coefficient and the overlap fraction, and the top 10 are selected, again ensuring that there are at least two in each quadrant if possible.
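The ranking and quadrant constraint could be sketched as follows. The equal weights in the linear combination and the "reserve the best two per quadrant first, then fill by score" strategy are assumptions for this example; the post does not specify either.

```python
from collections import namedtuple

# Hypothetical record for one candidate neighbour
Neighbour = namedtuple("Neighbour", "station_id correlation overlap quadrant")


def select_neighbours(candidates, n_select=10, min_per_quadrant=2,
                      w_corr=0.5, w_overlap=0.5):
    """Pick the best-scoring neighbours, keeping two per quadrant where possible."""

    def score(c):
        # Linear combination of correlation and overlap (weights are assumed)
        return w_corr * c.correlation + w_overlap * c.overlap

    ranked = sorted(candidates, key=score, reverse=True)

    # Reserve the best-scoring stations in each of the four quadrants
    chosen = []
    for q in range(4):
        in_quadrant = [c for c in ranked if c.quadrant == q]
        chosen.extend(in_quadrant[:min_per_quadrant])

    # Fill the remaining places with the best of the rest
    chosen_ids = {c.station_id for c in chosen}
    for c in ranked:
        if len(chosen) >= n_select:
            break
        if c.station_id not in chosen_ids:
            chosen.append(c)
            chosen_ids.add(c.station_id)

    return sorted(chosen, key=score, reverse=True)
```

If a quadrant has fewer than two candidates, the sketch simply takes what is there, which matches the "if possible" caveat.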
In a perfect world (or at least one with infinite computing resources), I would select all stations within the 500m height--300km distance criteria and calculate the correlations and overlaps for all. However, this takes a while (it probably could be faster, but at some level, lots of file-read operations have to occur) and it is important that this dataset can be quality controlled within a reasonable time frame. Therefore, at the moment, only the nearest 20 stations are assessed for their correlation and overlap with the target.
The process appears to take around 5 minutes per station on a ~2GHz processor, so about 28 days of processing for 8000 stations if done on just one processor. I'm hoping to use many more than just one to do my bidding!
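The arithmetic behind that estimate, and how it might shrink with more processors, can be checked quickly. The sketch assumes the stations split evenly across processors with no overhead, which is optimistic given the file-read load mentioned above.

```python
def wall_clock_days(n_stations, minutes_per_station, n_processors=1):
    """Elapsed days, assuming perfect scaling across processors."""
    total_minutes = n_stations * minutes_per_station
    return total_minutes / n_processors / (60 * 24)


print(round(wall_clock_days(8000, 5), 1))       # 27.8
print(round(wall_clock_days(8000, 5, 28), 1))   # 1.0
```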
Tuesday, 28 April 2015
HadISD v1.0.3.2014f released
The raw data were downloaded on 7th April 2015, and processed over the subsequent days. Despite the updates to the ISD, as the previous version was preliminary, we retain the version number (v1.0.3.2014) and only increment the descriptor to "final". This version still contains 6103 stations, with 4060 passing the final filtering checks, as for the preliminary version.
As always, if you find anything untoward in the data, please contact the dataset maintainers.
Monday, 19 January 2015
v1.0.3.2014p Released
HadISD version 1.0.3.2014p has just been released. All plots and files should be on the website. This update extends the coverage of the dataset to the end of 2014 (31 December at 2300 inclusive). It remains a preliminary dataset as there could still be further updates to the ISD dataset in the next few months. We hope to do a processing run for the final version some time around Easter (to create 1.0.3.2014f).
The raw data were downloaded on 5th January 2015, and processed over the subsequent days. There have been changes to all of the raw files only in 2013 as part of the normal ISD update process. We have made no substantial changes to the codes which do the conversion to NetCDF files or the Quality Control suite. Hence the version number has only incremented by 0.0.1 and the year.
This version still contains 6103 stations, with 4060 passing the final filtering checks, down slightly from the 4071 in v1.0.2.2013p (see the HadISD paper Section 6). The patterns of flagging are very similar to v1.0.2.2013p (see figures here). However, if you find something strange, do let us know using the contact details on the HadISD website. Please note that the stations which are known to have issues are documented on this blog and on the website.
Homogeneity information for this version, derived using the same procedure (PHA) as outlined in Dunn et al., 2014, is also available on the website.
Fig. 1 The fraction of temperature records flagged for each station.
Fig. 2 The fraction of all dewpoint temperature records flagged for each station.
Fig. 3 The fraction of all sea-level pressure records flagged for each station.
As always, if you see anything untoward in the data or are having problems using it, please do not hesitate to get in touch.
Wednesday, 7 January 2015
Attempting to fix undocumented merges
As mentioned in earlier posts, we had found some issues in the Canadian stations which looked like undocumented station moves. In discussions with Environment Canada, we were given a list of the Canadian WMO stations along with the dates of their changes. There were 994 stations present in their list.
We separated the stations out into different categories (the number of stations in each is given in parentheses):
- Single - stations which appeared in the list only once (529)
- On/Off - stations which had an "active" and "inactive" status indicating the start and end dates of operation (47)
- Good Station Moves - stations which showed a change in location, with dates showing the end of reporting at the previous location, and the start in the new location (216)
- Overlap Moves - similarly to good station moves, but the start of reporting in the new location occurs before the end of reporting at the old (15)
- Possible Homogeneity issues - multiple dates at a single location indicating perhaps changes in instrumentation (92)
- Questionable Moves - location changes with no dates given showing the end at one or the beginning at another location (33)
- Dates - cases where "active" and "inactive" statuses occurred at the same time, so the final status could not be determined (49)
- Other - more complex sets of start and end dates that could not be categorised easily (13)
Stations which appeared in the Single, On/Off and Homogeneity issues categories were retained in the candidate station list. Those from the Questionable Moves, Dates, Overlap moves and Other were rejected from the station list.
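The categories and what happened to each can be written down as a small lookup table, which also gives a quick check that the counts add up to the 994 stations in the Environment Canada list. The dictionary layout and the action labels are illustrative choices, not part of the HadISD code.

```python
# (number of stations, action) for each category described above
CATEGORIES = {
    "Single": (529, "retain"),
    "On/Off": (47, "retain"),
    "Good Station Moves": (216, "process further"),
    "Overlap Moves": (15, "reject"),
    "Possible Homogeneity issues": (92, "retain"),
    "Questionable Moves": (33, "reject"),
    "Dates": (49, "reject"),
    "Other": (13, "reject"),
}


def action_for(category):
    """Look up what was done with stations in the given category."""
    return CATEGORIES[category][1]


total = sum(count for count, _ in CATEGORIES.values())
print(total)  # 994
```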
The 216 stations in the Good Moves list were processed further. Using the station details in the ISD list, the period of time when the station was in this location, as determined from the Environment Canada list, was extracted. Usually this was the most recent location. The start and end times of the station were adjusted as appropriate to ensure that only the period in the location as given in the full ISD station list was used when further selecting stations. In many cases this will result in the station not being selected for inclusion in HadISD.
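Adjusting the start and end times amounts to intersecting two date ranges. A minimal sketch, assuming both the ISD record and the Environment Canada location period are available as start and end dates (the function name and the example dates are hypothetical):

```python
from datetime import date


def clip_to_location(record_start, record_end, location_start, location_end):
    """Restrict a station record to the period spent at one location.

    Returns (start, end), or None if the two periods do not overlap.
    """
    start = max(record_start, location_start)
    end = min(record_end, location_end)
    if start > end:
        return None
    return start, end


# Illustrative dates only, not a real station
print(clip_to_location(date(1950, 1, 1), date(2014, 12, 31),
                       date(1992, 6, 1), date(2014, 12, 31)))
```

A shortened record like this may then fail the later time-span requirements, which is why many of these stations end up not being selected.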
Of the 934 Canadian stations we were able to assess, 797 were kept for processing by further selection criteria, 33 could not be tested and 104 were rejected.
There are other stations which are located in Canada (which do not match the WMO IDs) which we could not process. These, along with the 33 which were not in the Environment Canada list, were retained in the stations selection procedure as we have no information indicating that there are problems with them.
These changes result in 14762 stations being selected using the restrictions on latitude, longitude and time-spans, 8561 in the master-list and 8104 in the final merged list (of which 2045 have other stations merged into them). This is a reduction from the 8207 stations which were in the previous selection, but hopefully fewer of these have serious inhomogeneities resulting from undocumented station moves.