L'Ombre de l'Olivier

The Shadow of the Olive Tree

being the maunderings of an Englishman on the Côte d'Azur

23 November 2009

The HADCRU code as from the CRU leak

(and other related observations)

When I was a developer, in addition to the concepts of version control and frequent archiving, one thing my evil commercially oriented supervisors insisted on was "code reviews". This is the hated point where your manager and/or some other experienced developer goes through your code and critiques it in terms of clarity and quality.

As a general rule, code reviews teach you a lot. And in places where you have a choice of language, one of the big questions in a code review is often "why develop this in X?"

So I've been perusing the (soon to be infamous) HARRY_READ_ME.txt file, as helpfully split up into chunks here, and some of the files in the /cru-code/ directory, bearing these thoughts in mind. I want to point out here that the author of the readme (Ian "Harry" Harris, apparently) has my sympathy. He's the poor sod who seems to have been landed with the responsibility of reworking the code after (I think) the departure of Tim Mitchell and/or Mark New, who apparently wrote much of it.

The first thing I note is that a lot of the stuff is written in Fortran (of different vintages), and much of the rest in IDL. When you look at the code you see that a lot of it is not doing complex calculations but rather is doing text processing. Neither Fortran nor IDL is the tool I would use for text processing - perl, awk and sed (all traditional Unix tools available on the platforms the code runs on, as far as I can tell) are all better at this. Indeed, awk is used in a few spots, making one wonder why it is not used elsewhere. The use of Fortran means you get weird (and potentially non-portable) things like this in loadmikeh.f90 (a program written to load Mike Hulme RefGrids):
call system ('wc -l ' // LoadFile // ' > trashme-loadcts.txt')		! get number of lines
open (3,file='trashme-loadcts.txt',status="old",access="sequential",form="formatted",action="read")
read (3,"(i8)"), NLine
close (3)
call system ('rm trashme-loadcts.txt')
For those who don't do programming, this is counting the number of lines in the file. It does so by getting an external (Unix) utility to do the counting, storing the result in a temporary file (trashme-loadcts.txt), reading that file back to learn how many lines there are, and then deleting it. The count is then used as the test for whether the file is finished in the subsequent line-reading loop.
XLine=0
do
read (2,"(i7,49x,2i4)"), FileBox,Beg,End
XExe=mod(FileBox,NExe) ; XWye=nint(real(FileBox-XExe)/NExe)+1
XBox=RefGrid(XExe,XWye)

do XEntry=Beg,End
XYear=XEntry-YearAD(1)+1
read (2,"(4x,12i5)"), (LineData(XMonth),XMonth=1,NMonth)
Data(XYear,1:12,XBox) = LineData(1:12)
end do

XLine=XLine+End-Beg+2
if (XLine.GE.NLine) exit
end do
There are a whole bunch of related no-nos here to do with bounds checking and trusting of input - including the interesting point that this temporary file is the same for all users, meaning that if, by some mischance, two people were running loadmikeh at the same time on different files in the same place, there is the possibility that one gets the wrong file length (or crashes because the file has been deleted before it could be read). Now I'm fairly sure that this process is basically single-user and that the input files are going to be correct when this is run, but, as Harry discovers in chapter 8, sometimes it isn't clear what the right input file is:

Had a hunt and found an identically-named temperature database file which
did include normals lines at the start of every station. How handy - naming
two different files with exactly the same name and relying on their location
to differentiate! Aaarrgghh!! Re-ran anomdtb:

It is worth noting that the "wc -l" shell trick is repeated in the IDL files as well - e.g. in cru-code/idl/pro/quick_interp_tdm2.pro, where it's used even less efficiently in a pipe:
 "cat " + ptsfile + " | wc -l"
This would almost merit an entry at thedailywtf.com but for the fact that it is far from the worst 'feature' of this particular program. This file has some other features that get mentioned in chapter 18, and again in chapter 20, where our intrepid hero encounters a whole bunch of poorly documented required files squirrelled away in Mark New's old directory (good thing it wasn't his new one with the news of nude photos in it :) )

[Chapter 20 really is a total gem for mystery readers, though not, one feels, for our hero, including comments such as:

AGREED APPROACH for cloud (5 Oct 06).

For 1901 to 1995 - stay with published data. No clear way to replicate
process as undocumented.

For 1996 to 2002:
1. convert sun database to pseudo-cloud using the f77 programs;
2. anomalise wrt 96-00 with anomdtb.f;
3. grid using quick_interp_tdm.pro (which will use 6190 norms);
4. calculate (mean9600 - mean6190) for monthly grids, using the
published cru_ts_2.0 cloud data;
5. add to gridded data from step 3.

This should approximate the correction needed.

On we go.. firstly, examined the spc database.. seems to be in % x10.
Looked at published data.. cloud is in % x10, too.
First problem: there is no program to convert sun percentage to
cloud percentage. I can do sun percentage to cloud oktas or sun hours
to cloud percentage! So what the hell did Tim do?!! As I keep asking.

and


Then - comparing the two candidate spc databases:

spc.0312221624.dtb
spc.94-00.0312221624.dtb

I find that they are broadly similar, except the normals lines (which
both start with '6190') are very different. I was expecting that maybe
the latter contained 94-00 normals, what I wasn't expecting was that
thet are in % x10 not %! Unbelievable - even here the conventions have
not been followed. It's botch after botch after botch. Modified the
conversion program to process either kind of normals line.

but I digress]

The same file also gets a mention in Chapter 31 (at least I think it's the same one - either that or there are two programs that do the same 'trick'):
  ; map all points with influence radii - that is with decay distance [IDL variable is decay]
; this is done in the virtual Z device, and essentially finds all points on a 2.5 deg grid that
; fall outside station influence

dummymax=max(dummygrid(*,*,(im-1)))

while dummymax gt -9999 do begin
if keyword_set(test) eq 0 then begin
set_plot,'Z' ; set plot window to "virtual"
erase,255 ; clear current plot buffer, set backgroudn to white
device,set_res=[144,72]
endif else begin
initx
set_plot,'x'
window,0,xsize=144,ysize=72
endelse

lim=glimit(/all)
nel=n_elements(pts1(*,0))-1
map_set,limit=lim,/noborder,/isotro,/advance,xmargin=[0,0],ymargin=[0,0],pos=[0,0,1,1]
a=findgen(33)*!pi*2/32
[etc.]
What this is doing is working out which grid points a station may influence (i.e. how close the grid points are to the station). It does this not by means of basic trig functions but by creating a virtual graphic, drawing circles of influence on it, and then checking what got painted! There are a couple of problems here. Firstly, it seems that sometimes, as the next few lines report, IDL doesn't like drawing virtual circles and throws an error:
    for i=0.0,nel do begin
x=cos(a)*(xkm/110.0)*(1.0/cos(!pi*pts1(i,0)/180.0))+pts1(i,1)
x=x<179.9 & x=x>(-179.9)
y=sin(a)*(xkm/110.0)+pts1(i,0)
y=y>(-89.9) & y=y<89.9
catch,error_value ; avoids a bug in IDL that throws out an occasional
; plot error in virtual window
if error_value ne 0 then begin
error_value=0
i=i+1
goto,skip_poly
endif

polyfill,x,y,color=160
skip_poly:
endfor
In that case we just skate over the point and (presumably) therefore assume it has no influence - which is, um, very reassuring for people trying to reproduce the algorithm in a more rational manner.

However, that's not the only problem, because our hero also reports some inconsistencies:

..well that was, erhhh.. 'interesting'. The IDL gridding program calculates whether or not a
station contributes to a cell, using.. graphics. Yes, it plots the station sphere of influence then
checks for the colour white in the output. So there is no guarantee that the station number files,
which are produced *independently* by anomdtb, will reflect what actually happened!!
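For contrast, the non-graphical version of that test is just a distance comparison between each grid cell and each station. A minimal sketch, staying in Fortran for consistency with the rest of the code base (the station coordinates, the decay_km value and the variable names are all mine, and the 110 km-per-degree shortcut mirrors the xkm/110.0 factor in the IDL above):

! Sketch only: mark every 2.5-degree grid cell that lies within decay_km of at
! least one station, using a flat-earth approximation comparable to the one the
! IDL circle-drawing uses (110 km per degree of latitude, longitude shrunk by
! cos(latitude)). The 144 x 72 grid matches the virtual device size above.
program influence_mask
  implicit none
  integer, parameter :: nlon = 144, nlat = 72, nstat = 3
  real, parameter :: decay_km = 450.0, pi = 3.14159265
  real :: slat(nstat), slon(nstat)      ! made-up station positions
  real :: glat, glon, dx, dy
  logical :: influenced(nlon, nlat)
  integer :: i, j, k

  slat = (/ 51.5, 48.9, 40.4 /)
  slon = (/ -0.1,  2.3, -3.7 /)

  influenced = .false.
  do j = 1, nlat
    glat = -90.0 + 2.5 * (real(j) - 0.5)
    do i = 1, nlon
      glon = -180.0 + 2.5 * (real(i) - 0.5)
      do k = 1, nstat
        dy = (glat - slat(k)) * 110.0
        dx = (glon - slon(k)) * 110.0 * cos(pi * glat / 180.0)
        if (sqrt(dx*dx + dy*dy) <= decay_km) then
          influenced(i, j) = .true.
          exit                          ! one station is enough
        end if
      end do
    end do
  end do

  write (*,*) count(influenced), 'of', nlon*nlat, 'grid cells influenced'
end program influence_mask

A proper great-circle distance (sketched a little further down) could be dropped in place of the flat-earth approximation without changing the structure, and nothing in it depends on a graphics device behaving itself.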

Fortunately our hero is able to replace the graphics-based check (although it isn't at all clear to me how the new process is integrated with the old one):

Well I've just spent 24 hours trying to get Great Circle Distance calculations working in Fortran,
with precisely no success. I've tried the simple method (as used in Tim O's geodist.pro, and the
more complex and accurate method found elsewhere (wiki and other places). Neither give me results
that are anything near reality. FFS.

Worked out an algorithm from scratch. It seems to give better answers than the others, so we'll go
with that.

And really at this point I think I am going to nominate this program to thedailywtf.com and move on.

Another detail that becomes blindingly obvious (see the chapter 20 digression above) is that this is nothing like a turnkey process. Data comes in in various forms and formats and is munged into a common one, with lots of operator intervention along the way. Files are retrieved from certain places and stored in others; sometimes the operator knows (and can control) where they go, and sometimes the paths are hard-coded, so the users have to hope no one has done something silly like delete the source files or remove the directories where the output should go.
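Even staying within Fortran, the hard-coded locations could have been lifted into a small namelist file that travels with each run. A minimal sketch (mine, with made-up file names, not anything from the CRU code):

! Sketch only: read input/output locations from a namelist file instead of
! hard-coding them, so a run can be repeated without editing the source.
program run_config
  implicit none
  character(len=256) :: dtb_file, out_dir
  namelist /paths/ dtb_file, out_dir

  dtb_file = 'example.dtb'     ! defaults, overridden by the namelist file
  out_dir  = '.'

  open (10, file='run.nml', status='old', action='read')
  read (10, nml=paths)
  close (10)

  write (*,*) 'database : ', trim(dtb_file)
  write (*,*) 'output to: ', trim(out_dir)
end program run_config

A run.nml of the form "&paths dtb_file='...', out_dir='...' /" then records where a given run read from and wrote to - which is exactly the sort of information Harry keeps having to reconstruct.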

Actually I'm going to stop here. I've examined two files in some depth and found (OK, so Harry found some of this):
AAAAAAAAAARGGGGGHHHHHHHH!!!

PS IMO someone should buy the CRU a few hundred of these, because they really, really need them.
[Bad Code Offset - see http://thedailywtf.com/Articles/Introducing-Bad-Code-Offsets.aspx]