Of course, now we have to do a proper run with the latest db..
..that very nearly worked :(
Mostly the same, but one noticeable exception is the hot 2003 JJA in Europe - it's much less
extreme in the automated version. So I ran with the original database again. Thought I'd
see if there were different station counts:
For the original anomdtb and original June 2006 db, tmp.0705101334.dtb):
1259 tmp.2003.06.txt
1216 tmp.2003.07.txt
1223 tmp.2003.08.txt
For 0909041051 (fixed anomauto and original June 2006 db, tmp.0705101334.dtb):
1210 tmp.2003.06.txt (-49)
1201 tmp.2003.07.txt (-15)
1178 tmp.2003.08.txt (-45)
For 0909021348 (the 'fixed' anomauto and the latest db, tmp.0904151410.dtb):
1246 tmp.2003.06.txt (-13)
1250 tmp.2003.07.txt (+34)
1228 tmp.2003.08.txt (+ 5)
Well the differences certainly show up! And it looks like a database change. So.. I
guess I need to look at changes in French stations. Argh. And that 'argh' was prescient,
since, when I ran getcountry to extract the French stations from each database, I found:
crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] ./getcountry
Enter the database to search: ../update_top/db/tmp/tmp.0705101334.dtb
Enter the country name to extract: FRANCE
33 stations written to ../update_top/db/tmp/tmp.0705101334.dtb.FRANCE
crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] ./getcountry
Enter the database to search: ../update_top/db/tmp/tmp.0904151410.dtb
Enter the country name to extract: FRANCE
104 stations written to ../update_top/db/tmp/tmp.0904151410.dtb.FRANCE
Somehow, I've added 71 new French stations?! Surely I'd remember that. Especially
as they'd have had to have arrived with the MCDW/CLIMAT bulletins. Sizes:
That's not so bad. Well the ratio's improved. Could be a lot of unmatched incoming
stations?
Oh, ****. It's the bloody WMO codes again. **** these bloody non-standard, ambiguous,
illogical systems. Amateur hour again.
First example, the beautiful city of Lille. Here are the appropriate headers:
tmp.0705101334.dtb.FRANCE:
70150 506 31 47 LILLE FRANCE 1851 2006 101851 -999.00
tmp.0904151410.dtb.FRANCE:
701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
So.. just what I was secretly hoping for (not!) - a drains-up on the CLIMAT and MCDW
programs, otherwise known as climat2cruauto.for and mcdw2cruauto.for, as well as the
merging program, mergedbauto.for.
Some random, manual traceability that may be useful, or not:
Considering the MCDW update run numbered 0904151410:
cat /cru/cruts/version_3_0/update_top/runs/runs.0904151410/merg.mcdw.0904151410.dat
db/cld/cld.0904021239.dtb
updates/MCDW/db/db.0904151410/mcdw.cld.0904151410.dtb
updates/MCDW/db/db.0904151410/int1.cld.0904151410.dtb
blind
U
..is for cld, but indicates that the input database was tmp.0904021239.dtb.
The MCDW database was mcdw.tmp.0904151410.dtb.
The output database was, of course, tmp.0904151410.dtb.
I wonder what they all have for LILLE?
tmp.0904021239.dtb:
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
mcdw.tmp.0904151410.dtb:
0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0
tmp..dtb:
701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
I'll bet this just updated the 'false' LILLE with another month or something. In fact:
Interestingly, we only seem to have the last three tmp databases, at least in terms of
having the short LILLE station.
This is so hard because I cannot remember the process. Have to dig some more..
OK, I think the key update is 0903091631. Here's the CLIMAT runfile:
conv.climat.0903091631.dat:
1 2000 12 2008
MCDW was even further-ranging: conv.mcdw.0903091631.dat:
9 1994 11 2008
Still not there. One issue is that for some reason I didn't give the merg runfiles
individual names for each parameter! So I might mod the update program to do that.
Then re-run all updates.
The problem with re-running all updates, of course, is that I also fixed WMO codes. And,
(though my memory is extremely flaky), probably corrected some extreme values detected
by Tim. Oh bugger.
Well, WMO code fixing is identifiable because you get a log file, ie, here's the tmp dir:
Well it's worth a try. Actually, let's compare those eight databases - assuming we can find
at least some common stations!
Oh, boy:
cld.0902101409.dtb: 7015000 5056 310 52 LILLE/LESQUIN FRANCE 1971 1996 -999 -999
cld.0902101409.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2008 -999 0
dtr.0708081052.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
pre.0903051740.dtb: 0701500 5060 310 47 LILLE FRANCE 1784 2006 -999 -999.00
tmn.0708071548.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
tmp.0903081416.dtb: 0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00
tmx.0708071548.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
vap.0804231150.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 2003 2007 -999 0
vap.0804231150.dtb: 1378000 6108 1048 271 LILLEHAMMER SAETHER NORWAY 1972 1994 -999 -999
vap.0804231150.dtb: 7015000 5056 310 52 LILLE/LESQUIN FRANCE 1971 2003 -999 -999
wet.0710161148.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 1996 2007 -999 -999
This whole project is SUCH A MESS. No wonder I needed therapy!!
So, cld already has the problem, and it's the earliest version in the archive. Also vap.
Well, looking back (er, up ^) we know what happened to cld - it was updated with newmergedb
before it went 'auto':
'So we now have cld.0902101409.dtb, a database consisting of cld.0312181428.dtb,
updated first with derived-cloud data from MCDW (1994-2008), then with
derived-cloud data from CLIMAT (2000-2008).'
And, finding cld.0312181428.dtb, does it have the 'Lille problem'? No!
70150 5056 310 52 LILLE/LESQUIN FRANCE 1971 1996 -999 -999
So cld.0312181428.dtb is our new starting point I think.
VAP - oh, dear. Again, from above:
'Discovered that WMO codes are still a pain in the arse. And that I'd forgotten to match Australian
updates by BOM code (last field in header) instead of WMO code - so I had to modify newmergedbauto.
Also found that running fixwmos.for was less than successful on VAP, because it's already screwed:
I am seriously close to giving up, again. The history of this is so complex that I can't get far enough
into it before by head hurts and I have to stop. Each parameter has a tortuous history of manual and
semi-automated interventions that I simply cannot just go back to early versions and run the update prog.
I could be throwing away all kinds of corrections - to lat/lons, to WMOs (yes!), and more.
So what the hell can I do about all these duplicate stations? Well, how about fixdupes.for? That would
be perfect - except that I never finished it, I was diverted off to fight some other fire. Aarrgghhh.
I - need - a - database - cleaner.
What about the ones I used for the CRUTEM3 work with Phil Brohan? Can't find the bugger!! Looked everywhere,
Matlab scripts aplenty but not the one that produced the plots I used in my CRU presentation in 2005. Oh,
FUCK IT. Sorry. I will have to WRITE a program to find potential duplicates. It can show me pairs of headers,
and correlations between the data, and I can say 'yay' or 'nay'. There is the finddupes.for program, though
I think the comment for *this* program sums it up nicely:
' program postprocdupes2
c Further post-processing of the duplicates file - just to show how crap the
c program that produced it was! Well - not so much that but that once it was
c running, it took 2 days to finish so I couldn't really reset it to improve
c things. Anyway, *this* version does the following useful stuff:
c (1) Removes and squirrels away all segments where dates don't match;
c (2) Marks segments >5 where dates don't match;
c (3) Groups segments from the same pair of stations;
c (4) Sorts based on total segment length for each station pair'
You see how messy it gets when you actually examine the problem?
This time around, (dedupedb.for), I took as simple an approach as possible - and almost immediately hit a
problem that's generic but which doesn't seem to get much attention: what's the minimum n for a reliable
standard deviation?
I wrote a quick Matlab proglet, stdevtest2.m, which takes a 12-column matrix of values and, for each month,
calculates standard deviations using sliding windows of increasing size - finishing with the whole vector
and what's taken to be *the* standard deviation.
The results are depressing. For Paris, with 237 years, +/- 20% of the real value was possible with even 40
values. Windter months were more variable than Summer ones of course. What we really need, and I don't think
it'll happen of course, is a set of metrics (by latitude band perhaps) so that we have a broad measure of
the acceptable minimum value count for a given month and location. Even better, a confidence figure that
allowed the actual standard deviation comparison to be made with a looseness proportional to the sample size.
All that's beyond me - statistically and in terms of time. I'm going to have to say '30'.. it's pretty good
apart from DJF. For the one station I've looked at.
Back to the actual database issues - I need a day or two to think about the duplicate finder.
Let's just look at the year 2003, for all the French stations in each database! Duh.
In the original db, I've x'd those lines missing in the new one. Just missing vals.
In the new db, I've asterisked all the lines matching the old one, with duplicate
matches labeled 'Do'. Any other duplicates are marked Da, DB, DC. We can see that
all the original 2003 lines are included, *and replicated*. Three new lines are also
replicated. A further 25 lines are apparently new (though could well have parents
in the original db). This implies that very little matching is being performed!!
Had a look at the .act (Action) files. Interesting..
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904021106.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904021239.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904151410.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
Update: 0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
So it worked fine until the 0904151410 run, when it went crazee.
So.. what happened? Why did it behave differently? No idea. It was the same for pre though!
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.pre.0904021239.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1784 2008 -999 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1784 2008 -999 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.pre.0904151410.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
Update: 0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
There was something very fishy about that run. Of course it was a single month - I wonder if that made a difference?
The CLIMAT update did it!! It's that bloody no-metadata problem!! So I should be looking at the
CLIMAT process for 0904021239, not the MCDW one. Duhh. So, the merge run:
crua6[/cru/cruts/version_3_0/update_top] cat runs/runs.0904021239/merg.climat.0904021239.dat
db/tmx/tmx.0708071548.dtb
updates/CLIMAT/db/db.0904021239/climat.tmx.0904021239.dtb
updates/CLIMAT/db/db.0904021239/int2.tmx.0904021239.dtb
blind
M
crua6[/cru/cruts/version_3_0/update_top]
The merged bulletins in tmp.0904021239.dtb (er, presumably)
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/CLIMAT/db/db.0904021239/int2.tmp.0904021239.dtb.gz |grep -i 'lille'
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
crua6[/cru/cruts/version_3_0/update_top] grep -i 'lille' db/tmp/tmp.0*dtb
db/tmp/tmp.0705101334.dtb: 70150 506 31 47 LILLE FRANCE 1851 2006 101851 -999.00
db/tmp/tmp.0903081416.dtb:0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00
db/tmp/tmp.0904021106.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
db/tmp/tmp.0904021106.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
db/tmp/tmp.0904021239.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
db/tmp/tmp.0904021239.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
db/tmp/tmp.0904151410.dtb: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
db/tmp/tmp.0904151410.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
OK.. let's be absolutely clear about the process.
STEP 1 - convert MCDW bulletins (09/94 - 12/08) to produce mcdw.tmp.0904021106.dtb:
0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
STEP 2 - Merge mcdw.tmp.0904021106.dtb into tmp.0903081416.dtb to produce int1.tmp.0904021106.dtb:
701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
STEP 4 - Merge climat.tmp.0904021239.dtb into int1.tmp.0904021106.dtb to produce int2.tmp.0904021106.dtb:
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
(there's then the BOM section but it's all over by now)
So, the merging of climat.tmp.0904021106.dtb into int1.tmp.0904021106.dtb FAILED. WHY?
Well, the WMO codes are the same as for MCDW: 0701500. So it can't be that. The lat and lon
are ~slightly~ different, though. Remember, the DATABASE entry was originally (tmp.0903081416.dtb):
0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00
After the MCDW (09/94 - 12/08) merge, it became (int1.tmp.0904021106.dtb):
701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
After the CLIMAT (01/00 - 02/09) merge (int2.tmp.0904021106.dtb):
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
Now, the 'LILLE/LESQUIN' station header comes from the CLIMAT bulletins, ie, from the WMO
reference file wmo.0710151633.dat. But it should have matched with the existing LILLE -
the problem looks like the latitude shift (from 50.60 to 50.34) introduced by MCDW did
the damage. Obviously, if we are going to trust MCDW metadata as being valid corrections,
then the WMO reference file needs to be updated at the same time!! So, we'll need:
1. A file called 'wmoref.latest.dat' that contains the name of the latest WMO reference file.
- DONE
2. A hook in newmergedbauto that flags when a header is being changed by an MCDW/BOM bulletin.
- EXTREMELY COMPLICATED
3. A routine to write a new WMO reference and to archive the old one.
- EXTREMELY COMPLICATED
4. A record of every iteration of the db/latest.versions.dat file (in db/previous.latest/).
- DONE
As part of the investigations, I found that I wasn't close()-ing off channel 10 when I used
it in update.for. Now I'm pretty sure that F77 follows the convention that an OPEN on an
open channel initiates an initial CLOSE automatically, but who wants to take that chance with
the variety of compilers we're subject to? So I went through and inserted an indecent number
of close(10)s.
Point 2 - flagging changes to the metadata.
Well, the merged database is written principally from dbm*, with dbu* chipping in 'new' stations.
I guess that new stations should be added to the wmo reference file? They are pan-parameter (well
the MCDW ones are) but I have an eerie feeling that I won't experience joy when headers are
compared between parameters :/
Wrote metacmp.for. It accepts a list of parameter databases (by default, latest.versions.dat) and
compares headers when WMO codes match. If all WMO matches amongst the databases share common
metadata (lat, lon, alt, name, country) then the successful header is written to a file. If,
however, any one of the WMO matches fails on any metadata - even slightly! - the gaggle of
disjointed headers is written to a second file. I know that leeway should be given, particularly
with lats & lons, but as a first stab I just need to know how bad things are. Well, I got that:
Interesting, but not astounding. Roughly half are unpaired stations, with an impressive
23% showing a perfect match across all eight databases.
Analysis of the 4000+ bad matches will be more complicated unfortunately. An initial
re-run looking for lat/lon within half a degree, and/or station partial, will be useful.
No, hang on. Easier to analyse the output from metacmp! And so.. postmetacmp.for:
Stats report for: report.0909181759.metacmp.bad
Overall distribution of group sizes:
2 in group: 642
3 in group: 71
4 in group: 188
5 in group: 625
6 in group: 183
7 in group: 411
8 in group: 1957
LAT:
Number of diffs within a group:
1. 0
2. 3059
3. 276
4. 15
5. 0
6. 0
7. 0
8. 0
Maximum differences:
<0.1: 1233
<0.2: 726
<0.5: 1225
<1.0: 15
1.0+: 151
LON:
Number of diffs within a group:
1. 0
2. 2996
3. 339
4. 30
5. 1
6. 0
7. 0
8. 0
Maximum differences:
<0.1: 1195
<0.2: 722
<0.5: 1242
<1.0: 30
1.0+: 177
ALT:
Number of diffs within a group:
1. 0
2. 2035
3. 237
4. 17
5. 0
6. 0
7. 0
8. 0
Maximum differences:
<50m : 1767
<100m: 75
<500m: 121
<1km : 36
1km+ : 290
STATION NAME:
Number of diffs within a group:
1. 0
2. 2167
3. 365
4. 43
5. 0
6. 0
7. 0
8. 0
Worst percentage matches:
<25% : 281
<50% : 385
<75% : 770
<100%: 276
100% : 863
COUNTRY NAME:
Total groups with country mismatches: 1698
Number of diffs within a group:
1. 0
2. 1475
3. 182
4. 41
5. 0
6. 0
7. 0
8. 0
Hmmm.. lots of groups that could be eliminated if we incorporated the WMO reference
list, because then we could allow an element of 'drift' from a reference point.
Decided to make it a bit quicker and easier as well, by removing tmn/tmx and letting
dtr take the strain - they should all be identical anyway.