In HadISD v1.0.x, the neighbour check selects stations within 500 m in height and 300 km in distance of the target station. The bearing to each station is also used to assign it to a quadrant (90-degree bins), and the closest 10 are chosen, ensuring where possible that each quadrant contains at least two stations. When fewer neighbours are available, the distribution of stations across the quadrants can be lopsided.
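As a rough illustration of the quadrant idea (this is not the HadISD code itself), the sketch below bins candidates by bearing and keeps the closest 10 while reserving up to two slots per quadrant. The function names, the `(station_id, distance_km, bearing_deg)` tuple layout, and the choice to fill the per-quadrant slots before topping up by distance are all assumptions made for the example.

```python
from collections import defaultdict


def quadrant(bearing_deg):
    """Map a bearing in degrees to a quadrant index 0..3 (90-degree bins)."""
    return int((bearing_deg % 360.0) // 90.0)


def select_neighbours(candidates, n_total=10, min_per_quadrant=2):
    """Pick up to n_total neighbours, closest first, spread across quadrants.

    candidates: list of (station_id, distance_km, bearing_deg) tuples that
    already satisfy the height and distance limits (hypothetical layout).
    """
    ranked = sorted(candidates, key=lambda c: c[1])

    # Group the distance-ranked candidates by quadrant.
    by_quadrant = defaultdict(list)
    for cand in ranked:
        by_quadrant[quadrant(cand[2])].append(cand)

    # Assumption: reserve the closest stations in each quadrant first...
    chosen = []
    for q in range(4):
        chosen.extend(by_quadrant[q][:min_per_quadrant])

    # ...then top up with the closest remaining stations from any quadrant.
    for cand in ranked:
        if len(chosen) >= n_total:
            break
        if cand not in chosen:
            chosen.append(cand)

    return sorted(chosen, key=lambda c: c[1])[:n_total]
```

If a quadrant holds fewer than two candidates, the sketch simply takes what is there, which is how a lopsided selection can arise.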
For HadISD v2.0.0, I wanted to improve the station selection, because a neighbour that is merely close may not be very useful when running the buddy checks. So the new neighbour selection also uses the correlation coefficient between the target and neighbour time series, as well as the data overlap (very important in the early years). The details of both, as they stand at this point, are as follows.
Initially, stations are selected on the basis of distance to ensure that the neighbours experience similar weather to the target. Then the correlation of the two time series is obtained. However, so that the correlation is not dominated by the annual or diurnal cycle, the time series are first processed to remove these. Firstly, daily means are calculated for all days which have more than 6 observations, and these are used to create the climate anomalies for each observation. To further remove the diurnal cycle, hourly means are calculated and used to create "anomalised climate anomalies" for each observation. These time series are used to calculate the correlation coefficients.
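To make the two-step anomalising more concrete, here is a minimal sketch using NumPy. It rests on several assumptions that the text does not pin down: observations are held as a `(n_days, 24)` array with NaN for missing values, the "climate" is the mean of the qualifying daily means for each calendar day, and the hourly means are taken over the whole record. The names `anomalise` and `correlation` are made up for the example.

```python
import numpy as np


def anomalise(obs, doy, min_obs=6):
    """Remove the annual and diurnal cycles from hourly observations.

    obs: array (n_days, 24), NaN where missing (assumed layout).
    doy: array (n_days,), calendar-day index of each row.
    """
    obs = np.asarray(obs, dtype=float)
    doy = np.asarray(doy)

    # Daily means, only for days with more than min_obs observations.
    n_valid = np.sum(~np.isnan(obs), axis=1)
    daily_mean = np.full(obs.shape[0], np.nan)
    ok = n_valid > min_obs
    daily_mean[ok] = np.nansum(obs, axis=1)[ok] / n_valid[ok]

    # Assumption: climatology = mean of the daily means per calendar day.
    anoms = np.full(obs.shape, np.nan)
    for d in np.unique(doy):
        rows = doy == d
        vals = daily_mean[rows]
        if np.any(~np.isnan(vals)):
            anoms[rows] = obs[rows] - np.nanmean(vals)

    # Hourly means of the anomalies, removed to suppress the diurnal cycle.
    n_hour = np.sum(~np.isnan(anoms), axis=0)
    hourly_mean = np.full(obs.shape[1], np.nan)
    have = n_hour > 0
    hourly_mean[have] = np.nansum(anoms, axis=0)[have] / n_hour[have]
    return anoms - hourly_mean


def correlation(a, b):
    """Pearson correlation over the times when both series have data."""
    a = np.ravel(np.asarray(a, dtype=float))
    b = np.ravel(np.asarray(b, dtype=float))
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return np.nan
    return float(np.corrcoef(a[both], b[both])[0, 1])
```

With two synthetic stations that share only an annual and a diurnal cycle, the raw correlation comes out high while the correlation of the anomalised series is close to zero, which is the effect the processing is after.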
The reason for using the data overlap as another criterion is the lengthened data coverage of HadISD v2.0.0. Few stations will have coverage over the entire 1931-2014 period, and so it would be highly likely that neighbours selected in terms of distance alone have no concurrent data. I use the fraction of the target's observations that are also present in the neighbour as the overlap value.
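A small sketch of how such an overlap value could be computed, assuming both series sit on the same time axis with NaN marking missing observations, and reading the overlap as a fraction of the target's observations (the function name and the array layout are assumptions for the example):

```python
import numpy as np


def overlap_fraction(target, neighbour):
    """Fraction of the target's observations that the neighbour also has."""
    t_present = ~np.isnan(np.ravel(np.asarray(target, dtype=float)))
    n_present = ~np.isnan(np.ravel(np.asarray(neighbour, dtype=float)))
    n_target = t_present.sum()
    if n_target == 0:
        return 0.0  # no target data, so nothing to overlap with
    return float((t_present & n_present).sum() / n_target)
```

For a target with three observations, two of which coincide with neighbour observations, this gives 2/3. A neighbour whose record does not intersect the target's at all scores 0.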
The neighbours are then sorted by a linear combination of the correlation coefficient and the overlap fraction, and the top 10 are selected, again ensuring that there are at least two in each quadrant if possible.
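The ranking step might look something like the sketch below. The weights of the linear combination are not given here, so equal weights of 0.5 are purely an assumption, as are the function name and the dictionary layout. The quadrant constraint is left out to keep the example short.

```python
def rank_neighbours(stats, w_corr=0.5, w_overlap=0.5, n_keep=10):
    """Rank candidate neighbours by a weighted sum of correlation and overlap.

    stats: dict mapping station_id -> (correlation, overlap_fraction).
    The equal default weights are an assumption, not the HadISD values.
    """
    def score(station_id):
        corr, overlap = stats[station_id]
        return w_corr * corr + w_overlap * overlap

    ranked = sorted(stats, key=score, reverse=True)
    return ranked[:n_keep]
```

With these weights, a well-correlated neighbour with almost no concurrent data (say correlation 0.9, overlap 0.1) ranks below a slightly less correlated one with good overlap (0.7 and 0.9).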
In a perfect world (or at least one with infinite computing resources), I would select all stations within the 500m height--300km distance criteria and calculate the correlations and overlaps for all. However, this takes a while (it probably could be faster, but at some level, lots of file-read operations have to occur) and it is important that this dataset can be quality controlled within a reasonable time frame. Therefore, at the moment, only the nearest 20 stations are assessed for their correlation and overlap with the target.
The process appears to take around 5 minutes per station on a ~2GHz processor, so about 28 days of processing for 8000 stations if done on just one. I'm hoping to use many more than just one to do my bidding!
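The arithmetic behind that estimate, as a tiny sketch. The scaling with the number of processors assumes the stations can be handled independently with no contention for file reads, which is optimistic:

```python
MINUTES_PER_STATION = 5   # rough timing quoted above
N_STATIONS = 8000


def wall_clock_days(n_workers=1):
    """Estimated elapsed days, assuming perfect scaling across workers."""
    total_minutes = MINUTES_PER_STATION * N_STATIONS
    return total_minutes / n_workers / (60 * 24)
```

One processor gives 40000 minutes, or just under 28 days. Under the same idealised assumption, 28 processors would bring it down to about a day.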