L'Ombre de l'Olivier

The Shadow of the Olive Tree

being the maunderings of an Englishman on the Côte d'Azur

01 December 2009

Science, Computer Science and Climate Science

I (and the visitors and commenters at this blog and elsewhere) have had another week or so to look through the leaked CRU code, so it's time for an update. I should note that I am making a VERY ARTIFICIAL (ha ha) decision to concentrate purely on the CRU temperature series code and to ignore all the tree-ring stuff. That said, quite a few of the more generic comments will certainly apply to that code too.

Firstly, it is obvious that there is a serious disconnect between the worlds of academic science and (commercial) software development. No one from a software development background who has looked at this code has anything polite to say about it. One of my commenters wrote:

As for the code review, I've seen a ton of awful FORTRAN when I was designing death for bux at the missile factory. There is always a tension between The Scientists, who tend to be pathetic coders, and the programmers, who don't understand the theory. And in many cases - i.e., probably this one - academic environments don't have professional programmers.

In well-done scientific programming, you have developers write a framework and APIs and let the scientists implement The Magic Algorithm in an environment where they don't have to do things like file I/O, sorting, etc.

This is indeed the problem here. The CRU folks have, for various reasons, not outsourced much, if any, of the development to actual professional programmers, and the result is therefore nasty.

In some respects the code itself is actually the lesser problem. The greater problem is the lack of process-management tools - version control, archiving, etc. - which means that we in fact have no idea whether the code in the leak is the version used for the current HADCRU temperature series (v3) or for one of the earlier v2.x editions. This matters because it would help us understand whether this is the code in which certain bugs identified by "Harry" have been fixed. Take the notorious overflow bug in chapter 17 of the HARRY_READ_ME:

Inserted debug statements into anomdtb.f90, discovered that a sum-of-squared variable is becoming very, very negative! [...]

DataA val = 49920, OpTotSq=-1799984256.00
[...]

..so the data value is unbfeasibly large, but why does the sum-of-squares parameter OpTotSq go negative?!!

Probable answer: the high value is pushing beyond the single-precision default for Fortran reals?

Now the code for this is fairly easy to find* and when one looks at it one discovers that the bug has not been fixed. The code still says:

integer, pointer, dimension (:,:,:)        :: Data,DataA,DataB,DataC
[...]
          if (DataA(XAYear,XMonth,XAStn).NE.DataMissVal) then
            OpEn=OpEn+1
            OpTot=OpTot+DataA(XAYear,XMonth,XAStn)
            OpTotSq=OpTotSq+(DataA(XAYear,XMonth,XAStn)**2)
          end if

As we can see from the declaration line, DataA is a plain default integer (not a wider 64-bit one). On the compilers involved that is a signed 32-bit value with a maximum of 2,147,483,647, and 49920 squared is 2,492,006,400, so the square wraps around to a large negative number which is then added into OpTotSq. Hence the overflow problem when the value read in turns out to be 49920. I have done a quick check for **2 in other parts of the code and haven't found any other integers that get squared, but there may well be some.
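For what it's worth, here is a minimal sketch of my own (not the CRU code) showing the wrap-around and two obvious ways to avoid it - widen the value before squaring, or do the squaring in double precision:

program overflow_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! an integer kind good for at least 18 digits
  integer, parameter :: r8 = selected_real_kind(15)  ! a double-precision real kind
  integer     :: val       ! default integer, as DataA is declared in anomdtb.f90
  integer(i8) :: sq_wide
  real(r8)    :: sumsq

  val = 49920
  ! The buggy pattern squares in default-integer arithmetic: 49920**2 is
  ! 2,492,006,400, which exceeds the 32-bit maximum of 2,147,483,647 and
  ! wraps to a large negative number before it ever reaches the running total.
  sq_wide = int(val, i8)**2     ! widen first: 2492006400, no overflow
  sumsq   = real(val, r8)**2    ! or square in double precision directly
  print *, sq_wide, sumsq
end program overflow_demo

Either variant keeps the sum of squares positive; which one is appropriate in anomdtb.f90 depends on how OpTotSq itself is declared, which is exactly the sort of thing a proper review (or a test suite) would pin down.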

Talking of bugs, I mentioned the "silently continue on error" 'feature' in my first post, but my former boss John Graham-Cumming points out that the code actually has a bug in it:
catch,error_value
; avoids a bug in IDL that throws out an occasional
; plot error in virtual window
if error_value ne 0 then begin
  error_value=0
  i=i+1
  goto,skip_poly
endif

[...] The first bug appears to be in IDL itself. Sometimes the polyfill function will throw an error. This error is caught by the catch part and enters the little if there.

Inside the if there's a bug, it's the line i=i+1. This is adding 1 to the loop counter i whenever there's an error. This means that when an error occurs one set of data is not plotted (because the polyfill failed) and then another one is skipped because of the i=i+1.
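For readers without IDL to hand, the same off-by-one pattern is easy to mimic in a few lines of Fortran (a sketch of mine, not the CRU code): when an element 'errors', the handler advances the loop counter, the loop's own increment then advances it again, and the following element is silently lost.

program skip_demo
  implicit none
  integer, parameter :: n = 5
  integer, dimension(n) :: dataset = (/ 100, 101, 102, 103, 104 /)
  integer :: i

  i = 1
  do while (i <= n)
    if (dataset(i) == 101) then   ! stand-in for polyfill raising an error on this element
      i = i + 1                   ! the buggy handler: bump the counter and fall through
    else
      print *, 'plotted ', dataset(i)
    end if
    i = i + 1                     ! the loop's own increment
  end do
  ! Prints 100, 103 and 104: element 101 errored (so not plotting it is expected),
  ! but element 102 was also skipped because of the extra i = i + 1 in the handler.
end program skip_demo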

Given the presence of two bugs in that code (one which was known about and ignored), I wonder how much other crud there is in the code.

To test that I was right about the bug, I wrote a simple IDL program in IDL Workbench. Here's a screen shot of the (overly commented!) code and output. It should have output 100, 102, 103 but the bug caused it to skip 102.

Also, and this is a really small thing, the error_value=0 is not necessary because the catch resets the error_value.

In fact the whole error-handling block could be reduced to the "goto" line. John ends his post with the question "BTW Does anyone know if these guys use source code management tools?" and I'm 99.99% sure the answer is that they do not.

This is a problem because, as another commenter of mine wrote:

Code this bad is the equivalent of witchcraft. There is essentially no empirical test to distinguish its output from nonsense. Sad to say, I've seen things like this before. Multi-author, non-software engineer-written codebases tend to have these sorts of hair-raising bêtises liberally sprinkled throughout (although this is an extreme example - I wouldn't want to go into that code without a pump-action shotgun and a torch). Ian Harris certainly deserves our sympathy. Trying to hack your way through this utter balderdash must still have him sitting bolt upright in the middle of the night with a look of horror on his face.

The problem all this runs into is the difference between "Science" and "Computer Science".

When scientists submit papers to proper academic journals they are supposed to write up enough of their methodology that someone else can replicate their results. In theory at least. In practice much of modern science does not adhere to this, but in theory - as Feynman explained in "Cargo Cult Science" - what you should do is the following:

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it. There is also a more subtle problem. When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition.

In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another.

The easiest way to explain this idea is to contrast it, for example, with advertising. Last night I heard that Wesson oil doesn't soak through food. Well, that's true. It's not dishonest; but the thing I'm talking about is not just a matter of not being dishonest, it's a matter of scientific integrity, which is another level. The fact that should be added to that advertising statement is that no oils soak through food, if operated at a certain temperature. If operated at another temperature, they all will-- including Wesson oil. So it's the implication which has been conveyed, not the fact, which is true, and the difference is what we have to deal with.

We've learned from experience that the truth will come out. Other experimenters will repeat your experiment and find out whether you were wrong or right. Nature's phenomena will agree or they'll disagree with your theory. And, although you may gain some temporary fame and excitement, you will not gain a good reputation as a scientist if you haven't tried to be very careful in this kind of work. And it's this type of integrity, this kind of care not to fool yourself, that is missing to a large extent in much of the research in cargo cult science.

For science that involves computer programs it seems blindingly obvious that the code must be included in the supplemental information. Even more, in cases like this where the data sources are mixed, it is vital that the actual raw data be available as well. And the excuses proffered by the CRU and its apologists (e.g. that the CRU record largely agrees with GISS, or a desire not to confuse non-scientists with too much detail) are neatly skewered in subsequent passages of the essay/speech.

Part of the problem, as Feynman went on to note further down, is that there has been a trend in science not to actually repeat experiments:

One of the students told me she wanted to do an experiment that went something like this--it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A.

I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person--to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control.

She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happens.

Nowadays there's a certain danger of the same thing happening, even in the famous (?) field of physics. I was shocked to hear of an experiment done at the big accelerator at the National Accelerator Laboratory, where a person used deuterium. In order to compare his heavy hydrogen results to what might happen with light hydrogen he had to use data from someone else's experiment on light hydrogen, which was done on different apparatus.

This, I think, explains why the climate science "community" has such a hard time with people like Steve McIntyre. They are simply not used to people actually trying to replicate their results rather than taking them on trust. As a result they have never really thought about data and code archiving policies and all the other techniques that are required if someone is to replicate a complex piece of data analysis. They aren't helped by the rise of the internet and the power of modern computers. These days you can spend about €400 and get a netbook (and a terabyte external hard drive) with more processing power and more data storage than a mainframe had 25 years ago.

[Photo: Sharp PC-Z1(B) Netwalker]

The photo above shows my latest toy - it is a Sharp Netwalker PC-Z1(B) which fits in a pocket and yet has 512 MB of RAM and a 4 GB flash drive. It cost me ¥39,000 (about €300) in Japan. It also has a USB socket into which you can plug an external hard disk at USB 2 speeds. A 1 TB hard drive cost me US$120 a few months back and it can be used (as illustrated) as storage for the Netwalker. That $120 disk could store all the raw files, the intermediate files and the final output of the HADCRU temperature series, and the CPU - despite being merely an ARM, not even an Intel Atom - is almost certainly more powerful than the workstations that versions 1 and 2 of the temperature series were developed on.

Furthermore, the availability of broadband internet access (at c. 10 Mbit/s - the speed that Ethernet LANs promised 20 years ago) means that it is easy to transfer data to anyone, not just to those in academic research institutes. A current desktop PC with two or maybe four 64-bit cores, gigabytes of main memory and a high-speed internet connection is likely to be able to process a century of climate data from thousands of stations in a few hours at most, and quite possibly the longest part of the process (once the code has been got running) would be downloading the raw data to start it off.

This challenges climate scientists because it lets the amateur dilettante try to reproduce scientific results that he or she is interested in. Climate science interests a lot of the engineering geeky sorts because it is a politically important topic and one where it seems we should be able to easily verify the model results that predict the imminent end of the world as we know it. A lot of us geeks are also involved in open source development, or work in corporate IT departments, and hence have a view of software development and a knowledge of how change management and other related tools help to reduce the inevitable bugs in code.

And this leads us to the expectations of computer scientists and programmers, and is why we get so upset when we are finally able to look at the CRU code. If the code were for a one-off analysis then what we see would be excusable; however, we are now at version 3 of this code and it is rerun every month. This means the code should have moved from its hacked-together prototype form to one with clear data structures, use of SQL databases, version control and so on. A product will have a clear list of dependencies, a list of the files and directories required (and ideally a config file where these things are specified so they aren't hard-coded), test data and use cases to show how to get it working, and so on.
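To make the config-file point concrete, here is a minimal sketch of my own (not from the CRU code; the file name crutem.cfg and the paths are purely illustrative) of pulling directory locations out of a small Fortran namelist file instead of hard-coding them in the source:

program config_demo
  implicit none
  character(len=256) :: raw_dir = '.', work_dir = '.', output_dir = '.'
  integer :: ios
  namelist /paths/ raw_dir, work_dir, output_dir

  ! Read the three directory locations from crutem.cfg rather than
  ! baking them into the source of every program in the chain.
  open (unit=10, file='crutem.cfg', status='old', action='read', iostat=ios)
  if (ios /= 0) stop 'cannot open crutem.cfg'
  read (unit=10, nml=paths, iostat=ios)
  if (ios /= 0) stop 'cannot read the &paths namelist from crutem.cfg'
  close (10)

  print *, 'raw station files  : ', trim(raw_dir)
  print *, 'intermediate files : ', trim(work_dir)
  print *, 'gridded output     : ', trim(output_dir)
end program config_demo

where crutem.cfg would contain something like:

&paths
  raw_dir    = '/some/path/to/raw',
  work_dir   = '/some/path/to/work',
  output_dir = '/some/path/to/output'
/

A new machine, a new user or a new directory layout then means editing one small text file rather than hunting through the source.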

*the link is to anomdtb.f90, which is mentioned quite a few times in the READ_ME.