L'Ombre de l'Olivier

The Shadow of the Olive Tree

being the maunderings of an Englishman on the Côte d'Azur

21 November 2009

20091120 - Friday Olive Tree Blogging

Time to harvest our olives. Looks like a decent crop this year.
[photo: 20091120 - Friday Olive Tree Blogging]

As always click on the image to see it enlarged and don't forget to visit the olive tree blogging archives for further reminders of how nice olive trees are.


21 November 2009

Searching the CRU leak emails

I've decided to save the casually curious the need to download 61MB of stuff, unzip it and so on, by sticking the emails (with addresses futzed and some phone numbers ditto) on my webserver along with a fairly basic search engine.

Now anyone can search for "M&M" or "FOI" and see everything that shows up - no need to rely on journalists or bloggers potentially quoting emails selectively. Also, if you see a quote on a page with a somewhat cryptic reference such as "1103647149" or "1103647149.txt", you can paste the number into the "Open" box and get the file displayed for you.

The tool is here


21 November 2009

CRU leak thoughts

Using my CRU search tool, I've been reading a selection of the emails by coming up with a variety of search terms and seeing what shows up - a search on "your eyes only" is interesting for example - and I'm coming to some conclusions.

Firstly, the scientists come across very differently from one another, and in these internal emails they frequently disagree with each other in ways that they don't in public. Something else that comes across quite clearly is how tolerant (or rather, how intolerant) some well-known names are of criticism or contrasting views. Michael Mann and Ben Santer in particular come across as deeply unpleasant individuals, ones who tend to fly off the handle when facing even mild criticism from their peers, let alone from someone like Steve McIntyre. For example, here are a couple of extracts from an early email about the MM03 paper - the first written by Mann, the second by someone else:

Who knows what trickery has been pulled or selective use of data made. Its clear that "Energy and Environment" is being run by the baddies [...] The important thing is to deny that this has any intellectual credibility whatsoever and, if contacted by any media, to dismiss this for the stunt that it is..

[...]

Anyway, there's going to be a lot of noise on this one, and knowing Mann's very thin skin I am afraid he will react strongly, unless he has learned (as I hope he has) from the past...."

Others - e.g. Kaufmann - seem much more reasonable:

Regarding the "upside down man", as Nick's plot shows, when flipped, the Korttajarvi series has little impact on the overall reconstructions. Also, the series was not included in the calibration. Nonetheless, it's unfortunate that I flipped the Korttajarvi data. We used the density data as the temperature proxy, as recommended to me by Antii Ojala (co-author of the original work). It's weakly inversely related to organic matter content. I should have used the inverse of density as the temperature proxy. I probably got confused by the fact that the 20th century shows very high density values and I inadvertently equated that directly with temperature.

This is new territory for me, but not acknowledging an error might come back to bite us.

[...]

(5) McIntyre wrote to me to request the annual data series that we used to calculate the 10-year mean values (10-year means were up on the NOAA site the same AM as the paper was published). The only "non-published" data are the annual series from the ice cores (Agassiz, Dye-3, NGRIP, and Renland). We stated this in the footnote, but it does stretch our assertion that all of the data are available publicly. Bo: How do you want to proceed?

Should I forward the annual data to McIntyre?

Please let me -- better yet, the entire group -- know whether you think we should post a revision on RealScience, and whether we should include a reply to other criticism (1 through 5 above). I'm also thinking that I should write to Ojala and Tiljander directly to apologize for inadvertently reversing their data.

Not that "reasonable" means being willing to accept that the "Anthropogenic Global Warming" hypothesis is anything other than fact, or ceasing to consider anyone who shows even the slightest doubt about it an idiot. As William Briggs observes, there is no conspiracy here, just a bunch of fervent believers. Even when looking at the unravelling of Briffa/Yamal, the questions asked admit no doubt as to AGW itself, just doubt about whether the right data and techniques for showing it are being used, and about how to spin a recovery:

It is distressing to read that American Stinker item. But Keith does seem to have got himself into a mess. As I pointed out in emails, Yamal is insignificant. And you say that (contrary to what M&M say) Yamal is *not* used in MBH, etc. So these facts alone are enough to shoot down M&M is a few sentences (which surely is the only way to go -- complex and wordy responses will be counter productive).

But, more generally, (even if it *is* irrelevant) how does Keith explain the McIntyre plot that compares Yamal-12 with Yamal-all? And how does he explain the apparent "selection" of the less well-replicated chronology rather that the later (better replicated) chronology? Of course, I don't know how often Yamal-12 has really been used in recent, post-1995, work. I suspect from what you say it is much less often that M&M say -- but where did they get their information? I presume they went thru papers to see if Yamal was cited, a pretty foolproof method if you ask me. Perhaps these things can be explained clearly and concisely -- but I am not sure Keith is able to do this as he is too close to the issue and probably quite pissed of.

And the issue of with-holding data is still a hot potato, one that affects both you and Keith (and Mann). Yes, there are reasons -- but many *good* scientists appear to be unsympathetic to these. The trouble here is that with-holding data looks like hiding something, and hiding means (in some eyes) that it is bogus science that is being hidden.

This email leads us on to the other big issue: secrecy.

These guys (see the quoted paragraph above) realize that withholding data is scientifically a no-no - many *good* scientists appear to be unsympathetic to [climate scientists' "reasons" for withholding data] - but they still do it, even though it "looks like hiding something, and hiding means (in some eyes) that it is bogus science that is being hidden."

Yet they remain secretive. Indeed they seem to be proud that they hide their data/methods and prefer to delete things rather than respond to requests for data:

Yes, we've learned our lesson about FTP. We're going to be very careful in the future what gets put there. Scott really screwed up big time when he established that directory so that Tim could access the data. Yeah, there is a freedom of information act in the U.S., and the contrarians are going to try to use it for all its worth. But there are also intellectual property rights issues, so it isn't clear how these sorts of things will play out ultimately in the U.S.

[...]

Just sent loads of station data to Scott. Make sure he documents everything better this time ! And don't leave stuff lying around on ftp sites - you never know who is trawling them. The two MMs have been after the CRU station data for years. If they ever hear there is a Freedom of Information Act now in the UK, I think I'll delete the file rather than send to anyone. Does your similar act in the US force you to respond to enquiries within 20 days? - our does ! The UK works on precedents, so the first request will test it. We also have a data protection act, which I will hide behind. Tom Wigley has sent me a worried email when he heard about it - thought people could ask him for his model code. He has retired officially from UEA so he can hide behind that. IPR should be relevant here, but I can see me getting into an argument with someone at UEA who'll say we must adhere to it !

I think this email thread is probably one of the most damaging. Yes, the "trick" thread is bad, but while I think the trick is pretty despicable and makes Mann a liar, it isn't in itself evil so long as people can figure out what has happened.

Unfortunately what we see here is that they deliberately make it as hard as possible for people to reverse engineer what they are doing. If this were to do with attempts to model the sex life of nematodes (to pick an example at semi-random) then it probably wouldn't matter, but it isn't. Governments and industrial leaders are making multi-billion dollar (euro, pound etc.) decisions about taxes, investments and so on based on this research. If it is wrong then said politicians and business leaders are likely to make the wrong decisions, and that could result in needless poverty, in global economic collapse and in all sorts of other things (including, I guess, the extinction of humanity if it turns out AGW is rather more serious than it currently appears). I note that these emails (and the related code) do indeed show that Steve McIntyre was absolutely correct when he said that the various financial regulators (the SEC etc.) would bar any stock tout who tried to raise money with a similar level of documentation for his claims. It would be ironic that the economies of most of the nations of the world are being shaped by such poor-quality science if it weren't so dangerous.



23 November 2009

The Pierrehumbert Strawman

Over at DotEarth there is an "interesting" guest post by Raymond Pierrehumbert which is, in my opinion, guilty of the classic PR-spin tricks of suppressio veri and suggestio falsi. There's no harm, per se, in such tricks so long as others can point out the problems. I'm not sure if my comment (or Lucia's) will be published there, so I've decided to put it here too. In the process of doing so I realised that I might as well give the thing a proper fisking instead of the rather shorter response I gave in my comment at dotEarth.

So here goes:

I just read your blog and article about the CRU attack. I do entirely understand that in your role as a reporter you can’t editorialize and pass judgment about what happens in the world, but you do edge into value judgments in some of your blog pieces and so I found the general lack of indignation in your piece rather disconcerting. After all, this is a criminal act of vandalism and of harassment of a group of scientists that are only going about their business doing science. It represents a whole new escalation in the war on climate scientists who are only trying to get at the truth. Think — this was a very concerted and sophisticated hacker attack.

If you look at the emails and documents (see below) it seems quite clear that the selection is neither random nor a grand sweeping-up of everything in a scientist's private directory - indeed there is circumstantial evidence that this is data gathered in connection with an FOIA request to the CRU that was rejected a day after the last email. Hence it seems unlikely that "this was a very concerted and sophisticated hacker attack"; rather, it seems far more likely that this was an internal leak by a whistleblower upset at the continued hiding of information. Given the ubiquity of flash drives, I would estimate that copying this data probably took about five minutes at the whistleblower's desk.

This paragraph is in fact full of inflammatory and inaccurate rhetoric. Calling the act "vandalism" is hard to sustain since, so far as we know, the leaker has made no changes to any document on the CRU computers. You can, if you like, call it theft, but this "theft" is the sort of act performed by a Raffles-level burglar who takes the diamond without disturbing the other, less valuable, objects nearby.

Lastly for this section, the claim that it is "a whole new escalation in the war on climate scientists who are only trying to get at the truth" is richly ironic since the truth seekers would appear to be Messrs McIntyre & co who want the code and data behind the CRU's published work. It is hard to interpret statements like "[i]f they ever hear there is a Freedom of Information Act now in the UK, I think I'll delete the file rather than send to anyone" as anything but attempts to hide the truth. And that is very far from the only email where FOI requests are treated with derision and evasion.

It was a far harder system to crack into than Sarah Palin’s Yahoo account that was compromised during the election campaign. Scientists don’t have much distinction between their personal life and work and it is pretty typical to have all sorts of personal emails (maybe even financially related ones, confidential medical matters, family affairs, Amazon order confirmations*, etc.) as well as frank discussions that are part of the general working out of science and not meant to be done with somebody looking over your shoulder. I don’t think Jones’ emails had any personally compromising data in them, but that was just luck; this illegal act of cyber-terrorism against a climate scientist (and I don’t think that’s too strong a word) is ominous and frightening. What next? Deliberate monkeying with data on servers? Insertion of bugs into climate models? Or at the next level, since the forces of darkness have moved to illegal operations, will we all have to get bodyguards to do climate science?

If the leaker was, as suggested above, an internal whistleblower, then the comparison with the Palin hack is totally irrelevant. Actually, as Lucia points out, the leaked emails are very definitely sanitized:

A cursory examination of the emails reveals no announcements for group meetings, no invitations to the lunch room to celebrate a coworkers birthday, no email exchanges between husbands and wives discussing their shared love of Lassie DVDs, no letter from the safety training people, nothing related to performance reviews, and no pesky nag notes to update ones cyber security training. (Maybe CRU doesn’t have cyber security training?) Whoever assembled this edited, and it appears that all emails containing very highly personal information were removed from the collection.

Had the emails contained embarrassing revelations about the purchase of Lassie DVD’s, the blogosphere might be abuzz with indignation over the posting of truly personal information. In reality, no such information seems to be contained in what has come to be called the CRU Hack.

As with the "get at the truth" statement, the suggestion about "deliberate monkeying with data on servers" and so on would seem better aimed at Messrs Jones, Santer, Mann and co than at whoever leaked their correspondence and code.

Cynically, one may also point out that, err, it is the climate scientists who appear to have already done this. At the very least, the code in the directory /documents/osborn-tree6/mann/oldprog looks highly suspicious when it contains comments like this:

; Plots 24 yearly maps of calibrated (PCR-infilled or not) MXD reconstructions
; of growing season temperatures. Uses “corrected” MXD – but shouldn’t usually
; plot past 1960 because these will be artificially adjusted to look closer to
; the real temperatures.

(from maps24.pro; similar comments appear in maps15, maps12, maps1, etc.)

The same goes for the bodyguards bit - "Next time I see Pat Michaels at a scientific meeting, I'll be tempted to beat the crap out of him. Very tempted." from Ben Santer sounds distinctly threatening to me.

Maybe reporters just like information to be out, even if it is illegally obtained. Certainly, I thought it was right to publish the Pentagon Papers. But when the attack is on an individual scientist rather than a government entity, and when the perpetrator is unknown and part of some shadowy anonymous network, it raises a lot of new concerns.

"Shadowy anonymous network" is rather rich. Unlike journal referees, bloggers are distinctly lacking in anonymity. True we (or at least I) don't know the identity of all the bloggers and blog commenters but I know do who many of the main ones are, and I'm sure I could figure out the others if I spent a few minutes checking things.

The "attack on an individual scientist" thesis only holds true if Santer, Mann, Jones & Briffa etc. are in fact a single person. True the majority of the emails appear to originate as to/from Jones but there are plenty from Osborn and Briffa as well as other members of the CRU so its only an "attack" on one person if you consider Jones, as head of the CRU, to be the only important person there. On the other hand the emails do show that the Climate Science Establishment has made repeated attempts to block the researches of Steve McIntyre. Now Steve is not a "scientist" according to many and he does tend to work with others, but overall it looks like he's been attacked repeatedly behind both in public and behind closed "anonymous" doors

The real story here, though, is that the tactics the inactivists have been using in the run-up to Copenhagen have been all outside the sphere of legitimate scientific discourse. Bogus petitions, sham attempts to gut the A.P.S. climate statement, and now cowardly illegal outings of private emails from an individual scientist. If this is what they have to stoop to, then it is clear that they must really not think they have a leg to stand on scientifically.

Well, actually, one can interpret things rather differently if one considers that the evidence suggests that Messrs Jones, Santer, Mann and co have indeed been deliberately obfuscating and blocking research results that don't conform to their view of the world. If this release of data does derail Copenhagen then that can only be a good thing, since it seems very clear that the politicians have not had the impartial scientific advice they should have had. One also can't help but note that the "inactivists" have not been the only people trying to get their message across "outside the sphere of legitimate scientific discourse". At least that's how I interpret the attempts to nobble/blackmail journals and editors.

Cheers,

Ray

[*For example, anybody who hacked into my email would find the highly embarrassing fact that I once ordered a compilation of Lassie Christmas Stories on DVD :)]

And I agree that comments like "Mr I'm not right in the head" or "Pielke is a prat" are embarrassing but not, in themselves, malign. However, the main thrust of the leak concerns the "Hockey stick" papers and the CRU temperature code. The fact that there are comments like the following in the (soon to be infamous) HARRY_READ_ME.txt file is much, much more important because, going back to the Copenhagen point above, it shows just what a shoddy base the climate warming papers seem to have:
20. Secondary Variables - Eeeeeek!! Yes the time has come to attack what even
Tim seems to have been unhappy about (reading between the lines). To assist
me I have 12 lines in the gridding ReadMe file.. so par for the course.
Almost immediately I hit that familiar feeling of ambiguity: the text
suggests using the following three IDL programs:
frs_gts_tdm.pro
rd0_gts_tdm.pro
vap_gts_anom.pro

So.. when I look in the code/idl/pro/ folder, what do I find? Well:
3447 Jan 22 2004 fromdpe1a/code/idl/pro/frs_gts_anom.pro
2774 Jun 12 2002 fromdpe1a/code/idl/pro/frs_gts_tdm.pro
2917 Jan 8 2004 fromdpe1a/code/idl/pro/rd0_gts_anom.pro
2355 Jun 12 2002 fromdpe1a/code/idl/pro/rd0_gts_tdm.pro
5880 Jan 8 2004 fromdpe1a/code/idl/pro/vap_gts_anom.pro

In other words, the *anom.pro scripts are much more recent than the *tdm
scripts. There is no way of knowing which Tim used to produce the current
public files. The scripts differ internally but - you guessed it! - the
descriptions at the start are identical. WHAT IS GOING ON? Given that the
'README_GRIDDING.txt' file is dated 'Mar 30 2004' we will have to assume
that the originally-stated scripts must be used.


23 November 2009

The HADCRU code as from the CRU leak

(and other related observations)

When I was a developer, in addition to the concepts of version control and frequent archiving, one thing my evil, commercially oriented supervisors insisted on was "code reviews": the hated point where your manager and/or some other experienced developer goes through your code and critiques it in terms of clarity and quality.

As a general rule code reviews teach you a lot. And in places where you have a choice of language, one of the big questions in a code review is often "why develop this in X?"

So I've been perusing the (soon to be infamous) HARRY_READ_ME.txt file, as helpfully split up into chunks here, and some of the files in the /cru-code/ directory, bearing these thoughts in mind. I want to point out here that the author of the readme (Ian "Harry" Harris, apparently) has my sympathy. He's the poor sod who seems to have been landed with the responsibility of reworking the code after (I think) the departure of Tim Mitchell and/or Mark New, who apparently wrote much of it.

The first thing I note is that a lot of the stuff is written in Fortran (of various vintages), and much of the rest in IDL. When you look at the code you see that a lot of it is not doing complex calculations but rather text processing. Neither Fortran nor IDL is the tool I would use for text processing - perl, awk and sed (all traditional Unix tools available, as far as I can tell, on the platforms the code runs on) are all better at it. Indeed awk is used in a few spots, which makes one wonder why it is not used elsewhere. The use of Fortran means you get weird (and potentially non-portable) things like this in loadmikeh.f90 (a program written to load Mike Hulme RefGrids):
call system ('wc -l ' // LoadFile // ' > trashme-loadcts.txt')		! get number of lines
open (3,file='trashme-loadcts.txt',status="old",access="sequential",form="formatted",action="read")
read (3,"(i8)"), NLine
close (3)
call system ('rm trashme-loadcts.txt')
For those who don't do programming: this is counting the number of lines in the file. It does so by getting an external (Unix) utility to do the counting, making it store the result in a temporary file (trashme-loadcts.txt), and then reading that file to learn how many lines there are before deleting it. The count is then used as the test for whether the file is finished in the subsequent line-reading loop.
XLine=0
do
read (2,"(i7,49x,2i4)"), FileBox,Beg,End
XExe=mod(FileBox,NExe) ; XWye=nint(real(FileBox-XExe)/NExe)+1
XBox=RefGrid(XExe,XWye)

do XEntry=Beg,End
XYear=XEntry-YearAD(1)+1
read (2,"(4x,12i5)"), (LineData(XMonth),XMonth=1,NMonth)
Data(XYear,1:12,XBox) = LineData(1:12)
end do

XLine=XLine+End-Beg+2
if (XLine.GE.NLine) exit
end do
There are a whole bunch of related no-nos here to do with bounds checking and trusting the input - including the interesting point that the temporary file name is the same for all users, meaning that if, by some mischance, two people ran loadmikeh at the same time on different files in the same place, one of them could get the wrong file length (or crash because the file had been deleted before it could be read). Now I'm fairly sure that this process is basically single-user and that the input files are going to be correct when it is run but, as Harry discovers in chapter 8, sometimes it isn't even clear what the right input file is (a sketch of a shell-free alternative to the line counting follows the quote):

Had a hunt and found an identically-named temperature database file which
did include normals lines at the start of every station. How handy - naming
two different files with exactly the same name and relying on their location
to differentiate! Aaarrgghh!! Re-ran anomdtb:
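For comparison, here is a minimal sketch (in Python rather than Fortran, and with a hypothetical file name) of the shell-free alternative: count the lines by reading the file directly, or better still don't count them at all and just read until end-of-file.

# Minimal sketch, not CRU code: count lines without shelling out to
# 'wc -l' or writing a shared temporary file.
def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

# Better still, a reader that simply stops at end-of-file never needs
# to know the line count in advance.
def read_records(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

print(count_lines("mikehulme_refgrid.txt"))   # hypothetical input file

Either way there is no temporary file to collide over and no dependence on an external utility being present.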

It is worth noting that the "wc -l" shell trick is repeated in the IDL files as well - e.g. in cru-code/idl/pro/quick_interp_tdm2.pro, where it's used even less efficiently in a pipe:
 "cat " + ptsfile + " | wc -l"
This would almost merit an entry at thedailywtf.com but for the fact that it is far from the worst 'feature' of this particular program. The file has some other features that get mentioned in chapter 18, and subsequently in chapter 20, where our intrepid hero encounters a whole bunch of poorly documented required files squirrelled away in Mark New's old directory (good thing it wasn't his new one with the news of nude photos in it :) ).

[Chapter 20 really is a total gem for mystery readers, though not, one feels, for our hero. It includes comments such as:

AGREED APPROACH for cloud (5 Oct 06).

For 1901 to 1995 - stay with published data. No clear way to replicate
process as undocumented.

For 1996 to 2002:
1. convert sun database to pseudo-cloud using the f77 programs;
2. anomalise wrt 96-00 with anomdtb.f;
3. grid using quick_interp_tdm.pro (which will use 6190 norms);
4. calculate (mean9600 - mean6190) for monthly grids, using the
published cru_ts_2.0 cloud data;
5. add to gridded data from step 3.

This should approximate the correction needed.

On we go.. firstly, examined the spc database.. seems to be in % x10.
Looked at published data.. cloud is in % x10, too.
First problem: there is no program to convert sun percentage to
cloud percentage. I can do sun percentage to cloud oktas or sun hours
to cloud percentage! So what the hell did Tim do?!! As I keep asking.

and


Then - comparing the two candidate spc databases:

spc.0312221624.dtb
spc.94-00.0312221624.dtb

I find that they are broadly similar, except the normals lines (which
both start with '6190') are very different. I was expecting that maybe
the latter contained 94-00 normals, what I wasn't expecting was that
thet are in % x10 not %! Unbelievable - even here the conventions have
not been followed. It's botch after botch after botch. Modified the
conversion program to process either kind of normals line.

but I digress]

The same file also gets a mention in chapter 31 (at least I think it's the same one - either that or there are two programs that do the same 'trick'):
  ; map all points with influence radii - that is with decay distance [IDL variable is decay]
; this is done in the virtual Z device, and essentially finds all points on a 2.5 deg grid that
; fall outside station influence

dummymax=max(dummygrid(*,*,(im-1)))

while dummymax gt -9999 do begin
if keyword_set(test) eq 0 then begin
set_plot,'Z' ; set plot window to "virtual"
erase,255 ; clear current plot buffer, set backgroudn to white
device,set_res=[144,72]
endif else begin
initx
set_plot,'x'
window,0,xsize=144,ysize=72
endelse

lim=glimit(/all)
nel=n_elements(pts1(*,0))-1
map_set,limit=lim,/noborder,/isotro,/advance,xmargin=[0,0],ymargin=[0,0],pos=[0,0,1,1]
a=findgen(33)*!pi*2/32
[etc.]
What this is doing is working out which points a station can influence (i.e. which ones are close enough to it). It does this not by means of basic trig functions but by creating a virtual graphic, drawing circles on it and seeing what they cover! There are a couple of problems here. Firstly it seems that sometimes, as the next few lines report, IDL doesn't like drawing virtual circles and throws an error:
    for i=0.0,nel do begin
x=cos(a)*(xkm/110.0)*(1.0/cos(!pi*pts1(i,0)/180.0))+pts1(i,1)
x=x<179.9 & x=x>(-179.9)
y=sin(a)*(xkm/110.0)+pts1(i,0)
y=y>(-89.9) & y=y<89.9
catch,error_value ; avoids a bug in IDL that throws out an occasional
; plot error in virtual window
if error_value ne 0 then begin
error_value=0
i=i+1
goto,skip_poly
endif

polyfill,x,y,color=160
skip_poly:
endfor
In that case we just skate over the point and (presumably) therefore assume it has no influence - which is, um, very reassuring for people trying to reproduce the algorithm in a more rational manner.

However, that's not the only problem, because our hero also reports some inconsistencies:

..well that was, erhhh.. 'interesting'. The IDL gridding program calculates whether or not a
station contributes to a cell, using.. graphics. Yes, it plots the station sphere of influence then
checks for the colour white in the output. So there is no guarantee that the station number files,
which are produced *independently* by anomdtb, will reflect what actually happened!!

Fortunately our hero is able to fix this (although it isn't at all clear to me how the new process is integrated with the old one) - a sketch of the standard great-circle calculation follows his note:

Well I've just spent 24 hours trying to get Great Circle Distance calculations working in Fortran,
with precisely no success. I've tried the simple method (as used in Tim O's geodist.pro, and the
more complex and accurate method found elsewhere (wiki and other places). Neither give me results
that are anything near reality. FFS.

Worked out an algorithm from scratch. It seems to give better answers than the others, so we'll go
with that.
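For reference, the "simple method" Harry mentions is presumably the standard spherical great-circle distance; here is a minimal sketch of it (Python, haversine form, assuming a spherical Earth of radius 6371 km - not Tim O's geodist.pro, whose contents I haven't checked):

import math

EARTH_RADIUS_KM = 6371.0   # mean spherical radius; an assumption for this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance between two points given in degrees.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def station_influences_cell(st_lat, st_lon, cell_lat, cell_lon, decay_km):
    # True if the grid-cell centre lies within the station's decay distance.
    return great_circle_km(st_lat, st_lon, cell_lat, cell_lon) <= decay_km

Nothing here depends on a graphics device, so there is no "occasional plot error" to catch and skip over.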

And really at this point I think I am going to nominate this program for thedailywtf.com and move on.

Another detail that becomes blindingly obvious (see the chapter 20 digression above) is that this is nothing like a turnkey process. Data comes in in various forms and formats and is munged into a common one, with lots of operator intervention to decide what to do. Files are retrieved from certain places and stored in others; sometimes the operator knows (and can control) where they go, and sometimes it is hard-coded, so the users have to hope that no one has done something silly like deleting the source files or the directories where the output should go.

Actually I'm going to stop here. I've examined two files in some depth and found (OK, so Harry found some of this):
AAAAAAAAAARGGGGGHHHHHHHH!!!

PS IMO someone should buy the CRU a few hundred Bad Code Offsets, because they really, really need them - see http://thedailywtf.com/Articles/Introducing-Bad-Code-Offsets.aspx


25 November 2009

More CRU Code thoughts

Actually these are more CRU data thoughts but, since the data (and data structures) are key to the code used to create the results, I think it counts as code.

I am hampered here in that we don't actually have (as far as I can tell) any of the relevant intermediate data files available, so it is quite possible that I'm going to go off the rails. Also, some of the actual programs are missing or have been renamed - I'm referring to the various versions of mergedb mentioned in HARRY but not, as far as I can tell, present in the source directory. Hence this post is speculative (and potentially completely wrong), but I don't think it is.

The code (and data) suffers from its designers coming from a fixed-record Fortran background, which doesn't help. Right now intermediate results get stored in "databases" which tend to be one-record-per-line files. These do have the benefit of being directly human readable, which an SQL database wouldn't be, but SQL browse queries aren't _THAT_ complicated and would mean you avoid problems where you need to calculate how many months have elapsed since 1926 to get at your 1991-1995 data. The current "databases" tend to be fixed-field-width ones, using special numbers like -9999 to indicate that data is missing. This is mostly fine, but care is needed because sometimes -9999 (possibly divided by 100) can be a genuine value (e.g. longitude -99.99 is quite valid for locations in Mexico, the US and Canada), which doesn't help.
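To illustrate what that means for anything consuming these files, here is a minimal sketch (Python; the column width, the divide-by-ten scaling and the twelve-months-per-line layout are illustrative assumptions, since the real files mix several conventions):

MISSING = -9999   # sentinel used for missing values in these records

def parse_monthly_line(line, width=5, scale=10.0):
    # Split one fixed-width line of 12 monthly values; None marks missing data.
    # The width and scale factor are assumptions for this sketch - the real
    # files use a mixture of layouts and of x10 / x100 scalings.
    values = []
    for month in range(12):
        raw = int(line[month * width:(month + 1) * width])
        values.append(None if raw == MISSING else raw / scale)
    return values

Every reader and writer of every intermediate file has to agree on the widths, the scaling and the sentinel, which is precisely where Harry keeps getting burned.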

Brief aside: in the input files this is even worse, and fields which are usually numeric sometimes aren't (see part 35a):

Then tried to merge that into wet.0311061611.dtb, and immediately hit formatting issues - that pesky last field has been badly abused here, taking values including:

 -999.00

    0.00

  nocode     (yes, really!)

Had a quick review of mergedb; it won't be trivial to update it to treat that field as a8. So reluctantly,
changed all the 'nocode' entries to '0':

crua6[/cru/cruts/version_3_0/db/rd0] perl -pi -e 's/nocode/     0/g' wet.0311061611.dt*

Unfortunately, that didn't solve the problems.. as there are alphanumerics in that field later on:

-712356  5492 -11782  665 SPRING CRK WOLVERINE CANADA        1969 1988   -999  307F0P9

and occasions where no data is simply indicated by spaces.
Aside to the aside: various people have commented on the existence of fromexcel.f90. Actually this isn't as bad as it sounds, since what is being imported is CSV (comma-separated values), which is not a proprietary Microsoft format. Given that the point here is to gather data from as many sources as possible, it's quite reasonable to be gathering data produced in CSV format by one researcher.
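For what it's worth, reading such a file takes a handful of lines in any scripting language - a minimal sketch (Python, with hypothetical file and column names):

import csv

# Minimal sketch: read contributed station data from a CSV export.
with open("contributed_stations.csv", newline="") as f:
    for row in csv.DictReader(f):
        station_id = row["wmo_id"]
        values = [float(row["m%02d" % m]) for m in range(1, 13)]

So the Excel provenance of the data is a non-issue; the interesting questions are about what happens to it afterwards.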

The fixed-length problem manifests itself in various ways. In many cases the program needs to multiply (or divide) by 10 or 100 to convert the data as stored into the data as it should be used in the calculation of the next output - see part 35j for a good example.

Another way the fixed-length problem shows up is in the pseudonumeric WMO station IDs. These are 5-digit (with a CRU extension to 7 digits internally to allow for inconsistent duplicates etc.), with the first two digits being the country code and the other three the station ID within the country [Q: what happens if we end up with >100 countries or >999 stations per country?], but handling them in various ways leads to versions with 5, 6 or 7 digits, with leading (or trailing?) zeros being lost, and so on - see part 35v's discussion of LILLE.
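A minimal sketch of the sort of clean-up this forces (Python; it only restores the five-digit WMO form and deliberately does not model CRU's internal seven-digit extension, whose exact convention I don't know):

def restore_wmo_id(raw):
    # Pad a mangled WMO id back to five digits: 1001 -> '01001'.
    # If an id has been mangled more creatively (a trailing zero lost or
    # gained, say) the padding cannot recover it - which is exactly the
    # sort of ambiguity discussed above.
    return str(int(raw)).zfill(5)

def split_wmo_id(wmo_id):
    # First two digits: country code; last three: station within the country.
    return wmo_id[:2], wmo_id[2:]

The point is that once an id has been through a numeric field somewhere, the padding is guesswork.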

Station IDs lead us to another problem, one that crops up in part 35r - the gridding problem. The CRU temperature series divides the globe into grid squares (and yes, that is a problem in the polar regions) but, for what are presumably good scientific reasons (I think it's because ocean observations are gathered from moving ships whereas land stations are static), does some calculations differently in ocean grid cells compared to land ones. That leads to a problem when a cell includes coastline, since it may have both land and ocean stations to process (a sketch of the lat/lon-to-cell arithmetic follows the quote):

ERROR. Station in sea:
File: cldupdate6190/cldupdate6190.1996.01.txt
Offending line: 18.54 72.49 11.0 -6.600004305700
Resulting indices ilat,ilon: 218 505
crua6[/cru/cruts/version_3_0/secondaries/cld]

This is a station on the West coast of India; probably Mumbai. Unfortunately, as a coastal
station it runs the risk of missing the nearest land cell. The simple movenorms program is
about to become less simple.. but was do-able. The log file was empty at the end, indicating
that all 'damp' stations had found dry land
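Incidentally, the arithmetic behind those ilat/ilon values looks like plain half-degree binning; here is a minimal sketch (Python, assuming a 0.5 degree grid indexed from 1 starting at 90S and 180W - an assumption, but one that reproduces the 218/505 in the quoted error for a station at 18.54N, 72.49E):

import math

CELL = 0.5   # assumed grid resolution in degrees

def grid_indices(lat, lon, cell=CELL):
    # 1-based (ilat, ilon) of the cell containing a point, counting from the
    # south-west corner of the grid. The indexing convention is an assumption.
    ilat = math.floor((lat + 90.0) / cell) + 1
    ilon = math.floor((lon + 180.0) / cell) + 1
    return ilat, ilon

print(grid_indices(18.54, 72.49))   # -> (218, 505)

The arithmetic itself is trivial; the trouble is that whether a given cell is treated as land or sea is decided separately, which is exactly where the coastal stations fall through the cracks.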

Then there is the problem of the inclusion of "synthetics".

This is not necessarily evil fakery (despite some of the more hysterical claims I've read); it's a way to interpolate data that is missing for one reason or another. There are two problems here: the first is that the generation of the "synthetic" data is a black art, one performed once many years ago and now (maybe) lost; the second is the decision about when to use the synthetic data.

The latter problem is simple to explain: there are numerous cases where there is partial data (Tmin, say, but not Tmax for a station over some period) or data that looks questionable (is it massive rainfall or a transcription error?) where the choice of whether or not to use a synthetic is not so clear-cut.

The first problem is potentially more worrying, in that the synthetics seem to be based on historical climate models - and old ones at that. Since those models tend to have been written and tested against older versions of the HADCRU (and other) historical series, there is a feedback loop which looks like a potential source of confirmation bias. I'm not sure it has actually produced one - and I don't think we have the data to tell either way - but if there is an "evil" bit of HADCRU then this positive feedback loop is probably where it is. Any other errors introduced are, it seems to me, not ascribable to malice but very definitely to some combination of ignorance, incompetence, carelessness, etc.

I could go on, but I do actually have real work to do, so I'm going to stop again here for now with the following.

In their defense the CRU note that their temperature series mostly agrees with the ones from NCDC and GISS, as if this excused the abysmal quality of the processing. This is indeed true, but given that there is significant shared source data and apparent cooperation between the teams, it is not precisely reassuring. I haven't paid any attention in this analysis to the mathematical calculations being done, since that's not something I can readily comprehend, but it seems quite likely that each system applies similar data "cleansing" and "homogenizing" because the synthetics (see above) are going to be similar. Hence the confirmation bias problem: results that don't match up with the other two will tend to be scrutinized (and possibly adjusted) while the ones that do are left alone.

Since the adjustment mechanism, indeed the entire process, is opaque, the claim that black box 1 produces a result similar to black box 2 is not convincing.

In my opinion the entire project needs to be rewritten, and part of the rewrite should include using an actual database (SQL of some variety) to store the intermediate data. In fact the more I read through the HARRY_READ_ME doc (and glance at the code) the more I realize that open sourcing this, as suggested by Eric Raymond, would bring huge benefits. Open sourcing it would force the calculation and work flow to become clearer, and would let us identify the questionable parts. It would also lead to the imposition of basic project management practices like archiving and version control, which seem to have been sorely lacking, and it would allow a large group of people to think about the data structures and the process flow so that they can be made less subject to error.
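To make the database suggestion concrete, here is a minimal sketch (Python's built-in sqlite3, with an entirely hypothetical table layout - the real schema would need to capture sources, versions and flags too):

import sqlite3

conn = sqlite3.connect("cru_intermediate.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS station (
        station_id TEXT PRIMARY KEY,    -- normalised id
        name       TEXT,
        country    TEXT,
        lat        REAL,
        lon        REAL
    );
    CREATE TABLE IF NOT EXISTS monthly_value (
        station_id TEXT REFERENCES station(station_id),
        year       INTEGER,
        month      INTEGER,             -- 1..12
        variable   TEXT,                -- 'tmp', 'pre', 'cld', ...
        value      REAL,                -- NULL instead of -9999 for missing
        PRIMARY KEY (station_id, year, month, variable)
    );
""")

# Getting 1991-1995 becomes a query rather than a months-since-1926 offset:
rows = conn.execute(
    "SELECT year, month, value FROM monthly_value "
    "WHERE station_id = ? AND variable = 'tmp' "
    "AND year BETWEEN 1991 AND 1995",
    ("0100100",),            # hypothetical station id
).fetchall()

Even this toy layout removes the fixed-width parsing, the -9999 sentinel and the column-offset arithmetic in one go.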

Furthermore there are enormous problems with station IDs and with identifying whether stations have moved, been renamed and so on. And there are data-quality problems for stations (records missing, mistranscribed, etc.) where a distributed team would almost certainly do better at verifying the data (and potentially adding more raw data when it is found).

Even if the end result of the open source version were identical to the current one, opening this stuff up to public scrutiny and allowing people to contribute patches would go a long way towards improving the quality of the output, and even further towards improving its credibility, which right now is low amongst anyone who has actually taken a proper look at the process.