L'Ombre de l'Olivier

The Shadow of the Olive Tree

being the maunderings of an Englishman on the Côte d'Azur

25 November 2009

More CRU Code thoughts

Actually this is more CRU data thoughts, but since the data (and data structures) are key to the code used to create the results I think it counts as code.

I am hampered here in that we don't actually have (as far as I can tell) any of the relevant intermediate data files available, thus it is quite possible that I'm going to go off the rails here. Also, some of the actual programs are missing or have been totally renamed: I'm referring to the various versions of mergedb mentioned in HARRY but not, as far as I can tell, in the source directory. Hence this post is speculative (and potentially completely wrong), but I don't think it is.

The code (and data) suffers from its designers coming from a fixed-record Fortran background, which doesn't help. Right now intermediate results get stored in "databases" which tend to be one-record-per-line affairs. These do have the benefit of being directly human readable, which an SQL database wouldn't be, but SQL browse queries aren't _THAT_ complicated, and a real database would mean you avoid problems where you need to calculate how many months have elapsed since 1926 to get at your 1991-1995 data. The current "databases" tend to be fixed-field-width ones that use special numbers like -9999 to indicate that data is missing. This is mostly fine, but care needs to be taken because sometimes -9999 (possibly divided by 100) can be a genuine value (e.g. longitude -99.99 is quite valid for locations in Mexico, the US and Canada), which doesn't help.
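
To make that ambiguity concrete, here's a little Python sketch. It's entirely my own illustration, not anything from the CRU sources: the field layout and the divide-by-100 scaling are assumptions made purely for the example.

    # Illustration only (not CRU code): a fixed-width field where -9999 is
    # both the missing-value marker and, potentially, a real scaled value.
    MISSING = -9999

    def parse_value(field, scale=100.0):
        """Return None for missing data, otherwise the scaled value."""
        raw = int(field)
        if raw == MISSING:
            # Ambiguity: a genuine longitude of -99.99 stored in hundredths
            # would also arrive here as -9999, so context matters.
            return None
        return raw / scale

    print(parse_value(" -9999"))   # None -- missing, or -99.99 degrees West?
    print(parse_value("  2351"))   # 23.51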

Brief aside: in the input this is even worse, and fields which are usually numeric sometimes aren't (see part 35a):

Then tried to merge that into wet.0311061611.dtb, and immediately hit formatting issues - that pesky last field has been badly abused here, taking values including:

 -999.00

    0.00

  nocode     (yes, really!)

Had a quick review of mergedb; it won't be trivial to update it to treat that field as a8. So reluctantly,
changed all the 'nocode' entries to '0':

crua6[/cru/cruts/version_3_0/db/rd0] perl -pi -e 's/nocode/     0/g' wet.0311061611.dt*

Unfortunately, that didn't solve the problems.. as there are alphanumerics in that field later on:

-712356  5492 -11782  665 SPRING CRK WOLVERINE CANADA        1969 1988   -999  307F0P9

and occasions where no data is simply indicated by spaces.
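
Just to show what coping with that field involves, here's a hypothetical Python version of the sort of defensive parsing needed. It handles the values quoted above, but it's my guess at a sensible approach, not what mergedb actually does.

    # Illustration only: tolerate -999.00, 0.00, 'nocode', alphanumeric codes
    # like 307F0P9, and plain spaces in the abused last field.
    def classify_last_field(field):
        text = field.strip()
        if text == "" or text.lower() == "nocode":
            return ("missing", None)
        try:
            value = float(text)
        except ValueError:
            return ("code", text)            # e.g. '307F0P9'
        if value == -999.0:
            return ("missing", None)
        return ("numeric", value)

    for raw in (" -999.00", "    0.00", "  nocode", " 307F0P9", "        "):
        print(repr(raw), "->", classify_last_field(raw))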

Aside to the aside: various people have commented on the existence of fromexcel.f90. Actually this isn't as bad as it sounds, since what is being imported is CSV (comma-separated values), which is not a proprietary Microsoft format. Given that the point here is to gather data from as many sources as possible, it's quite reasonable to be gathering data produced in CSV format by one researcher.
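
For the avoidance of doubt, reading CSV needs nothing from Microsoft at all; any language's standard library will do it. A trivial Python example (the column names are invented for the illustration):

    import csv, io

    sample = "station_id,year,month,value\n0003772,1993,7,23.1\n"
    for row in csv.DictReader(io.StringIO(sample)):
        print(row["station_id"], row["year"], row["value"])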

The fixed-length problem manifests itself in various ways. In many cases the program has to multiply (or divide) by 10 or 100 to convert the data as stored into the data as it should be used in the calculations for the next output - see part 35j for a good example.
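
For instance (my own toy example, not CRU code), values held as integer tenths or hundredths of a unit have to be rescaled on the way in and sometimes again on the way out, and which factor applies to which file is exactly the sort of thing that has to be tracked by hand:

    def stored_to_real(raw, factor):
        return raw / factor                 # e.g. 123 stored in tenths -> 12.3

    def real_to_stored(value, factor):
        return int(round(value * factor))

    print(stored_to_real(123, 10))          # 12.3
    print(real_to_stored(12.3, 100))        # 1230, if the next file wants hundredths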

Another way the fixed-length problem shows up is in the pseudo-numeric WMO station IDs. These are 5-digit (with a CRU extension to 7 digits internally to allow for inconsistent duplicates etc.), with the first two digits being the country code and the other three the station ID within the country [Q: what happens if we end up with >100 countries or >999 stations per country?]. Handling them in various ways leads to versions with 5, 6 or 7 digits, with leading (or trailing?) zeros being lost, and so on - see part 35v's discussion of LILLE.
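
The underlying trap is easy to demonstrate: read a station ID as an integer and any leading zero vanishes, which is how the 5/6/7-digit confusion gets started; keeping the IDs as zero-padded strings avoids it. The helper below is my own illustration, not anything from the CRU sources, and the 7-character width is just the CRU-style extension mentioned above.

    def normalise_wmo_id(raw, width=7):
        """Return a fixed-width, zero-padded string form of a station ID."""
        digits = str(raw).strip()
        if not digits.isdigit():
            raise ValueError(f"not a numeric station ID: {raw!r}")
        return digits.zfill(width)

    print(normalise_wmo_id(3772))       # '0003772'
    print(normalise_wmo_id("03772"))    # '0003772'
    print(int("03772"))                 # 3772 -- the leading zero is gone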

Station IDs lead us to another problem that crops up in part 35r - the gridding problem. The CRU temperature series divides the globe into grid squares (and yes, that is a problem in the polar regions) but, for what are presumably good scientific reasons (I think it's because the ocean observations are gathered from moving ships whereas the land ones come from static stations), does some calculations differently in ocean grid cells compared to land ones. This leads to a problem when the grid includes coastline, as such a cell may have both land and ocean stations to process:

ERROR. Station in sea:
File: cldupdate6190/cldupdate6190.1996.01.txt
Offending line: 18.54 72.49 11.0 -6.600004305700
Resulting indices ilat,ilon: 218 505
crua6[/cru/cruts/version_3_0/secondaries/cld]

This is a station on the West coast of India; probably Mumbai. Unfortunately, as a coastal
station it runs the risk of missing the nearest land cell. The simple movenorms program is
about to become less simple.. but was do-able. The log file was empty at the end, indicating
that all 'damp' stations had found dry land
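
As an aside, the "Resulting indices ilat,ilon: 218 505" line is consistent with the 0.5 degree grid CRU TS uses, counting 1-based cells from 90S and 180W. The sketch below reproduces the numbers, though the exact indexing convention is my reconstruction from the excerpt rather than something read out of the code.

    import math

    CELL = 0.5   # degrees per grid cell (CRU TS is a 0.5 degree product)

    def grid_indices(lat, lon):
        ilat = math.floor((lat + 90.0) / CELL) + 1
        ilon = math.floor((lon + 180.0) / CELL) + 1
        return ilat, ilon

    print(grid_indices(18.54, 72.49))   # (218, 505) -- matches the log above

The station-in-sea check then fails whenever that cell is marked as ocean in the land/sea mask, which is why movenorms has to shunt coastal stations into the nearest land cell.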

Then there is the problem of the inclusion of "synthetics".

This is not necessarily evil fakery (despite some of the more hysterical claims that I've read); it's a way to interpolate data that is missing for one reason or another. There are two problems here: the first is that the generation of the "synthetic" data is a black art, one that happened once many years ago and whose details are now (maybe) lost; the second is the decision about when to use the synthetic data.

The latter problem is simple to explain: there are numerous cases where there is partial data (Tmin, say, but not Tmax for a station for a time period) or data that looks questionable (is it massive rainfall or a transcription error?) where the choice of whether to use a synthetic or not is not so clear cut.
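
A deliberately naive sketch of that decision, purely to show how much judgement is buried in it (this is not CRU's rule, just an illustration):

    def choose_value(observed, synthetic, plausible_max):
        """Pick the observed value unless it is missing or looks questionable."""
        if observed is None:
            return synthetic, "synthetic (no observation)"
        if plausible_max is not None and observed > plausible_max:
            # Massive rainfall or a transcription error?  Someone has to decide,
            # and the threshold itself is a judgement call.
            return synthetic, "synthetic (observation looks questionable)"
        return observed, "observed"

    print(choose_value(None, 4.2, 500))     # no observation -> synthetic
    print(choose_value(1200.0, 4.2, 500))   # questionable -> synthetic
    print(choose_value(37.5, 4.2, 500))     # observation used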

The first problem is potentially more worrying in that the synthetics seem to be based on historical climate models - and old ones at that. Since the models tend to have been written and tested against older versions of the HADCRU (and other) historical series, there is a feedback loop which appears to lead to a potential problem of confirmation bias. I'm not sure that it has done so - and I don't think we have the data to see one way or another - but if there is an "evil" bit of HADCRU then this positive feedback loop is probably where it is. Any other errors introduced are, it seems to me, not ascribable to malice but rather very definitely caused by some combination of ignorance, incompetence, carelessness and so on.

I could go on but I do actually have real work to do so I'm going to stop again here for now with the following.

In their defense the CRU note that their temperature series mostly agrees with the ones from NCDC and GISS, as if this excuses the abysmal quality of the processing. This is indeed true, but given that there is significant shared source data and apparent cooperation between the teams, it is not precisely reassuring. I haven't paid any attention in this analysis to the mathematical calculations being done, since that's not something I can readily comprehend, but it seems quite likely that each system applies similar data "cleansing" and "homogenizing", because the synthetics (see above) are going to be similar. Hence the confirmation bias problem: results that don't match up with the other two will tend to be scrutinized (and possibly adjusted) while the ones that do are left alone.

Since the adjustment mechanism, indeed the entire process, is opaque, the claim that black box 1 produces a result similar to black box 2 is not convincing.

In my opinion the entire project needs to be rewritten, and part of the rewrite should include using an actual database (SQL of some variety) to store the intermediate data. In fact the more I read through the HARRY_READ_ME doc (and glance at the code) the more I realize that open sourcing this, as suggested by Eric Raymond, would bring huge benefits. By open sourcing it, the calculation and workflow would have to become clearer and we would be able to identify questionable parts. Open sourcing would also lead to the imposition of basic project management practices like archiving and version control, which seem to have been sorely lacking, and it would allow a large group of people to think about the data structures and the process flow so that they can be made less subject to error.
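
To be clear about what I mean by "an actual database", here is a rough sketch using SQLite (via Python, only because it's to hand). The table layout is a strawman rather than a proposed schema; the point is simply that "give me 1991-1995 for this station" becomes a query instead of a months-since-1926 offset calculation.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE observation (
            station_id TEXT NOT NULL,     -- zero-padded WMO-style ID
            year       INTEGER NOT NULL,
            month      INTEGER NOT NULL,
            variable   TEXT NOT NULL,     -- e.g. 'tmin', 'tmax', 'wet'
            value      REAL,              -- NULL rather than -9999 for missing
            PRIMARY KEY (station_id, year, month, variable)
        )
    """)
    con.execute("INSERT INTO observation VALUES ('0003772', 1993, 7, 'tmax', 23.1)")

    rows = con.execute("""
        SELECT year, month, value FROM observation
        WHERE station_id = '0003772' AND variable = 'tmax'
          AND year BETWEEN 1991 AND 1995
        ORDER BY year, month
    """).fetchall()
    print(rows)                             # [(1993, 7, 23.1)]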

Furthermore there are enormous problems in terms of station IDs and identifying whether stations have moved, been renamed and so on. And there are problems with data quality for stations (records missing, mistranscribed etc. etc.) where a distributed team would almost certainly do better in terms of verifying the data (and potentially adding more raw data when it is found).

Even if the end result of the open source version were identical to the current one, opening this stuff up to public scrutiny and allowing people to contribute patches would go a long way towards improving the quality of the output, and even further towards improving its credibility, which right now is low amongst anyone who's actually taken a proper look at the process.