The HARRY_READ_ME.txt file

Part 35v

Of course, now we have to do a proper run with the latest db..

..that very nearly worked :(

Mostly the same, but one noticeable exception is the hot 2003 JJA in Europe - it's much less
extreme in the automated version. So I ran with the original database again. Thought I'd
see if there were different station counts:

For the original anomdtb and original June 2006 db, tmp.0705101334.dtb):
1259 tmp.2003.06.txt
1216 tmp.2003.07.txt
1223 tmp.2003.08.txt

For 0909041051 (fixed anomauto and original June 2006 db, tmp.0705101334.dtb):
1210 tmp.2003.06.txt (-49)
1201 tmp.2003.07.txt (-15)
1178 tmp.2003.08.txt (-45)

For 0909021348 (the 'fixed' anomauto and the latest db, tmp.0904151410.dtb):
1246 tmp.2003.06.txt (-13)
1250 tmp.2003.07.txt (+34)
1228 tmp.2003.08.txt (+ 5)

Ran retrace (because I might as well use it!).

For 0909041051, original db:
2003 6 16.13 6.90 273 375 20.60 266 380
2003 6 16.13 6.90 273 375 20.60 266 380
2003 6 16.13 6.90 273 375 20.60 266 380
2003 6 0.00 0.00 273 375 0.00 266 380
2003 6 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999
2003 6 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999
crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] cat retrace.France.0909041051.tmp.2003.07.stat
2003 7 18.58 9.40 273 375 23.80 265 380
2003 7 18.58 9.40 273 375 23.80 265 380
2003 7 18.58 9.40 273 375 23.80 265 380
2003 7 0.00 0.00 273 375 0.00 265 380
2003 7 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999
2003 7 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999
crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] cat retrace.France.0909041051.tmp.2003.08.stat
2003 8 18.25 8.90 273 375 23.80 265 380
2003 8 18.25 8.90 273 375 23.80 265 380
2003 8 18.25 8.90 273 375 23.80 265 380
2003 8 0.00 0.00 273 375 0.00 265 380
2003 8 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999
2003 8 -999.00 -999.00 -999 -999 -999 -999.00 -999 -999 -999

For 0909021348, latest db:
2003 6 19.12 8.50 273 375 23.60 267 367
2003 6 19.12 8.50 273 375 23.60 267 367
2003 6 16.13 6.90 273 375 20.10 267 367
2003 6 2.99 1.63 273 375 3.45 267 367
2003 6 3.40 1.90 277 352 66 5.10 270 360 68
2003 6 20.94 10.60 272 375 6717000 26.10 267 371 7650000
2003 7 20.61 12.20 273 375 26.70 266 380
2003 7 20.61 12.20 273 375 26.70 266 380
2003 7 18.58 9.40 273 375 23.80 266 380
2003 7 2.03 2.80 273 375 2.87 266 380
2003 7 2.18 1.10 275 358 69 3.20 273 373 65
2003 7 20.70 9.20 272 375 6717000 26.40 266 379 7790000
2003 8 21.04 11.90 273 375 26.60 266 380
2003 8 21.04 11.90 273 375 26.60 266 380
2003 8 18.25 8.90 273 375 23.80 266 380
2003 8 2.79 2.97 273 375 2.83 266 380
2003 8 2.90 2.80 277 352 62 3.00 281 369 61
2003 8 23.15 11.80 272 375 6717000 28.20 266 379 7790000

Well the differences certainly show up! And it looks like a database change. So.. I
guess I need to look at changes in French stations. Argh. And that 'argh' was prescient,
since, when I ran getcountry to extract the French stations from each database, I found:

crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] ./getcountry
Enter the database to search: ../update_top/db/tmp/tmp.0705101334.dtb

Enter the country name to extract: FRANCE
33 stations written to ../update_top/db/tmp/tmp.0705101334.dtb.FRANCE

crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] ./getcountry
Enter the database to search: ../update_top/db/tmp/tmp.0904151410.dtb

Enter the country name to extract: FRANCE
104 stations written to ../update_top/db/tmp/tmp.0904151410.dtb.FRANCE

Somehow, I've added 71 new French stations?! Surely I'd remember that. Especially
as they'd have had to have arrived with the MCDW/CLIMAT bulletins. Sizes:

crua6[/cru/cruts/version_3_0/fixing_tmp_and_pre] wc -l *FRANCE
2725 tmp.0705101334.dtb.FRANCE
3700 tmp.0904151410.dtb.FRANCE

That's not so bad. Well the ratio's improved. Could be a lot of unmatched incoming
stations?

Oh, ****. It's the bloody WMO codes again. **** these bloody non-standard, ambiguous,
illogical systems. Amateur hour again.

First example, the beautiful city of Lille. Here are the appropriate headers:

tmp.0705101334.dtb.FRANCE:
70150 506 31 47 LILLE FRANCE 1851 2006 101851 -999.00

tmp.0904151410.dtb.FRANCE:
701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

So.. just what I was secretly hoping for (not!) - a drains-up on the CLIMAT and MCDW
programs, otherwise known as climat2cruauto.for and mcdw2cruauto.for, as well as the
merging program, mergedbauto.for.

Some random, manual traceability that may be useful, or not:

Considering the MCDW update run numbered 0904151410:

cat /cru/cruts/version_3_0/update_top/runs/runs.0904151410/merg.mcdw.0904151410.dat
db/cld/cld.0904021239.dtb
updates/MCDW/db/db.0904151410/mcdw.cld.0904151410.dtb
updates/MCDW/db/db.0904151410/int1.cld.0904151410.dtb
blind
U

..is for cld, but indicates that the input database was tmp.0904021239.dtb.

The MCDW database was mcdw.tmp.0904151410.dtb.

The output database was, of course, tmp.0904151410.dtb.

I wonder what they all have for LILLE?

tmp.0904021239.dtb:
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

mcdw.tmp.0904151410.dtb:
0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0

tmp..dtb:
701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

I'll bet this just updated the 'false' LILLE with another month or something. In fact:

conv.mcdw.0904151410.dat:
1 2009 1 2009

Before (tmp.0904021239.dtb):
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
6190-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
2000 42 63 76 99 147 166 159 182 164 116 78 57
2001 37-9999-9999-9999 144 158 187 193 135 145 69 30
2002 45 77 80 103 133 167 175 186 148 109 87 51
2003 30 29 87 106 137 186 195 208 158 85 81 45
2004 39 53 65 106 127 166 176 192 165 120 71 34
2005 54 30 71 105 129 179 182 170 165 144 65 37
2006 20 28 49 93 137 170 226 166 185 142 90 57
2007 74 67 79 136 142 168 173 173 145 107 69 40
2008 65 59 67 93 166 159 182 176 140 102 72 30
2009 7 37-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999

After (tmp.0904151410.dtb):
701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
6190-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
2000 42 63 76 99 147 166 159 182 164 116 78 57
2001 37-9999-9999-9999 144 158 187 193 135 145 69 30
2002 45 77 80 103 133 167 175 186 148 109 87 51
2003 30 29 87 106 137 186 195 208 158 85 81 45
2004 39 53 65 106 127 166 176 192 165 120 71 34
2005 54 30 71 105 129 179 182 170 165 144 65 37
2006 20 28 49 93 137 170 226 166 185 142 90 57
2007 74 67 79 136 142 168 173 173 145 107 69 40
2008 65 59 67 93 166 159 182 176 140 102 72 30
2009 7 37-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999

The data is i-dentical. But ooh, lookit the WMO codes! Oh, Lordy! The update has changed
the code from 7-digit to 6-digit!

Let's look at all available headers for this LILLE station (ie the short modern one).

tmp.0904021106.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
tmp.0904021239.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
tmp.0904151410.dtb: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0

Aha - it's not the same station! ;0) An input has updated the details, istr that can happen.

Let's look at the MCDW update for Jan 2009, ssm0901.fin:
07015 LILLE 5034N 00306E 52 1007.8 1013.7 .7 5.8 7 63 75 145

And a previous update, ssm0812.fin:
07015 LILLE 5034N 00306E 52 1012.7 1018.6 3.0 6.9 7 32 1 65 140

Well that's no help. Why did it change? CLIMAT? But CLIMAT doesn't have station names!
Let's try and work out how the updates happened:

MCDW is first:
drwx------ 2 f098 cru 4096 Mar 5 2009 db.0903051342
drwx------ 2 f098 cru 4096 Mar 5 2009 db.0903051442
drwx------ 2 f098 cru 4096 Mar 5 2009 db.0903051448
drwx------ 2 f098 cru 4096 Mar 9 2009 db.0903091631
drwx------ 2 f098 cru 4096 Apr 2 11:25 db.0904021106
drwx------ 2 f098 cru 4096 Apr 2 12:57 db.0904021239
drwx------ 2 f098 cru 4096 Apr 15 14:16 db.0904151410

Then CLIMAT:
drwx------ 2 f098 cru 4096 Mar 5 2009 db.0903051342
drwx------ 2 f098 cru 4096 Mar 5 2009 db.0903051448
drwx------ 2 f098 cru 4096 Mar 9 2009 db.0903091631
drwx------ 2 f098 cru 4096 Apr 2 11:26 db.0904021106
drwx------ 2 f098 cru 4096 Apr 2 12:59 db.0904021239

We won't bother with BOM, I don't think they stretch to Lille..

I'm hoping I didn't do any on the escluster! *checks*.. nope. Just copied.

So this looks like the sequence:

0903051342 MCDW CLIMAT
0903051442 MCDW only
0903051448 MCDW CLIMAT
0903091631 MCDW CLIMAT
0904021106 MCDW CLIMAT
0904021239 MCDW CLIMAT
0904151410 MCDW only

Interestingly, we only seem to have the last three tmp databases, at least in terms of
having the short LILLE station.

This is so hard because I cannot remember the process. Have to dig some more..

OK, I think the key update is 0903091631. Here's the CLIMAT runfile:
conv.climat.0903091631.dat:
1 2000 12 2008

MCDW was even further-ranging: conv.mcdw.0903091631.dat:
9 1994 11 2008

Still not there. One issue is that for some reason I didn't give the merg runfiles
individual names for each parameter! So I might mod the update program to do that.
Then re-run all updates.

The problem with re-running all updates, of course, is that I also fixed WMO codes. And,
(though my memory is extremely flaky), probably corrected some extreme values detected
by Tim. Oh bugger.

Well, WMO code fixing is identifiable because you get a log file, ie, here's the tmp dir:

-r-------- 1 f098 cru 25936385 Aug 18 10:48 tmp.0705101334.dtb
-r--r--r-- 1 f098 cru 7278548 Feb 17 2009 tmp.0705101334.dtb.gz
-rw-r--r-- 1 f098 cru 25936385 Mar 8 2009 tmp.0903081416.dtb
-rw-r--r-- 1 f098 cru 408 Mar 8 2009 tmp.0903081416.log
-rw-r--r-- 1 f098 cru 27131064 Apr 2 11:15 tmp.0904021106.dtb
-rw-r--r-- 1 f098 cru 27131064 Apr 2 12:48 tmp.0904021239.dtb
-rw-r--r-- 1 f098 cru 27151999 Apr 15 14:11 tmp.0904151410.dtb

tmp.0705101334.dtb had its WMO codes fixed and became tmp.0903081416.dtb, which has the
accompanying log file (tmp.0903081416.log) to prove it:

Program fixwmos was run: 0903081416
Reference WMO List: /cru/cruts/version_3_0/WMO/from_dave_lister_wmo_list
Country Codes (unused): /cru/cruts/version_3_0/WMO/wmo_country_codes.dat
Input database: tmp.0705101334.dtb
Total stations: 5065
WMOs matched: 2263
False WMOs (<0): 0
Bad WMOs (>=0 & <1000): 0
Output Database: tmp.0903081416.dtb

Unfortunately, only tmp and pre have such log files. Here's the one for pre:

Program fixwmos was run: 0903051740
Reference WMO List: /cru/cruts/version_3_0/WMO/from_dave_lister_wmo_list
Country Codes (unused): /cru/cruts/version_3_0/WMO/wmo_country_codes.dat
Input database: pre.0803271802.dtb
Total stations: 15937
WMOs matched: 4196
False WMOs (<0): 88
Bad WMOs (>=0 & <1000): 1540
Bad WMO heads written to: pre.0803271802.bad
Output Database: pre.0903051740.dtb

So.. I guess I will use tmp.0903081416.dtb, pre.0903051740.dtb, and the earliest available
from the other parameters. In other words:

cld.0902101409.dtb
dtr.0708081052.dtb
pre.0903051740.dtb
tmn.0708071548.dtb
tmx.0708071548.dtb
tmp.0903081416.dtb
vap.0804231150.dtb
wet.0710161148.dtb

Well it's worth a try. Actually, let's compare those eight databases - assuming we can find
at least some common stations!

Oh, boy:

cld.0902101409.dtb: 7015000 5056 310 52 LILLE/LESQUIN FRANCE 1971 1996 -999 -999
cld.0902101409.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2008 -999 0
dtr.0708081052.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
pre.0903051740.dtb: 0701500 5060 310 47 LILLE FRANCE 1784 2006 -999 -999.00
tmn.0708071548.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
tmp.0903081416.dtb: 0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00
tmx.0708071548.dtb: 701500 5057 310 52 LILLE FRANCE 1973 2006 -999 0
vap.0804231150.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 2003 2007 -999 0
vap.0804231150.dtb: 1378000 6108 1048 271 LILLEHAMMER SAETHER NORWAY 1972 1994 -999 -999
vap.0804231150.dtb: 7015000 5056 310 52 LILLE/LESQUIN FRANCE 1971 2003 -999 -999
wet.0710161148.dtb: 0701500 5057 310 52 LILLE/LESQUIN FRANCE 1996 2007 -999 -999

This whole project is SUCH A MESS. No wonder I needed therapy!!

So, cld already has the problem, and it's the earliest version in the archive. Also vap.

Well, looking back (er, up ^) we know what happened to cld - it was updated with newmergedb
before it went 'auto':

'So we now have cld.0902101409.dtb, a database consisting of cld.0312181428.dtb,
updated first with derived-cloud data from MCDW (1994-2008), then with
derived-cloud data from CLIMAT (2000-2008).'

And, finding cld.0312181428.dtb, does it have the 'Lille problem'? No!

70150 5056 310 52 LILLE/LESQUIN FRANCE 1971 1996 -999 -999

So cld.0312181428.dtb is our new starting point I think.

VAP - oh, dear. Again, from above:

'Discovered that WMO codes are still a pain in the arse. And that I'd forgotten to match Australian
updates by BOM code (last field in header) instead of WMO code - so I had to modify newmergedbauto.
Also found that running fixwmos.for was less than successful on VAP, because it's already screwed:

uealogin1[/cru/cruts/version_3_0/update_top/db/vap] grep -i 'jan mayen' vap.0804231150.dtb
0100100 7093 -867 9 JAN MAYEN(NOR-NAVY) NORWAY 2003 2007 -999 0
1001000 7093 -866 9 JAN MAYEN(NOR NAVY) NORWAY 1971 2003 -999 -999'

Ulp!

I am seriously close to giving up, again. The history of this is so complex that I can't get far enough
into it before by head hurts and I have to stop. Each parameter has a tortuous history of manual and
semi-automated interventions that I simply cannot just go back to early versions and run the update prog.
I could be throwing away all kinds of corrections - to lat/lons, to WMOs (yes!), and more.

So what the hell can I do about all these duplicate stations? Well, how about fixdupes.for? That would
be perfect - except that I never finished it, I was diverted off to fight some other fire. Aarrgghhh.

I - need - a - database - cleaner.

What about the ones I used for the CRUTEM3 work with Phil Brohan? Can't find the bugger!! Looked everywhere,
Matlab scripts aplenty but not the one that produced the plots I used in my CRU presentation in 2005. Oh,
FUCK IT. Sorry. I will have to WRITE a program to find potential duplicates. It can show me pairs of headers,
and correlations between the data, and I can say 'yay' or 'nay'. There is the finddupes.for program, though
I think the comment for *this* program sums it up nicely:

' program postprocdupes2
c Further post-processing of the duplicates file - just to show how crap the
c program that produced it was! Well - not so much that but that once it was
c running, it took 2 days to finish so I couldn't really reset it to improve
c things. Anyway, *this* version does the following useful stuff:
c (1) Removes and squirrels away all segments where dates don't match;
c (2) Marks segments >5 where dates don't match;
c (3) Groups segments from the same pair of stations;
c (4) Sorts based on total segment length for each station pair'

You see how messy it gets when you actually examine the problem?

This time around, (dedupedb.for), I took as simple an approach as possible - and almost immediately hit a
problem that's generic but which doesn't seem to get much attention: what's the minimum n for a reliable
standard deviation?

I wrote a quick Matlab proglet, stdevtest2.m, which takes a 12-column matrix of values and, for each month,
calculates standard deviations using sliding windows of increasing size - finishing with the whole vector
and what's taken to be *the* standard deviation.

The results are depressing. For Paris, with 237 years, +/- 20% of the real value was possible with even 40
values. Windter months were more variable than Summer ones of course. What we really need, and I don't think
it'll happen of course, is a set of metrics (by latitude band perhaps) so that we have a broad measure of
the acceptable minimum value count for a given month and location. Even better, a confidence figure that
allowed the actual standard deviation comparison to be made with a looseness proportional to the sample size.

All that's beyond me - statistically and in terms of time. I'm going to have to say '30'.. it's pretty good
apart from DJF. For the one station I've looked at.

Back to the actual database issues - I need a day or two to think about the duplicate finder.

Let's just look at the year 2003, for all the French stations in each database! Duh.

Original db extraction: tmp.0705101334.dtb.FRANCE:
2003 30 29 87 106 137 186 195 208 158 85 81 45
2003 59 64 96 108 123 159 174 188 155 110 103 75
2003-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 x
2003-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 x
2003 10 2 83 110 162 226 210 239 159 81 67 26
2003 42 54 108 122 144 199 200 232 173 109 101 65
2003 24 34 102 120 149 221 213 252 168 99 82 45
2003 12 15 86 109 154 232 220 251 161 88 67 31
2003 20 33 102 114 137 212 200 240 162 101 90 49
2003-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 x
2003 51 58 123 142 160 227 219 252 190 129 109 77
2003 45 56 114 133 165 242 243 267 197 133 109 72
2003-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 x
2003 54 63 111 141 194 261 263 279 201 153 121 85
2003 85 79 115 137 191 248 253 274 209 156 129 96
2003 73 73 118 140 181 247 262 275 205 155 125 94
2003 86 73 105 131 184 241 251 265 207 166 136 93

New db extraction: tmp.0904151410.dtb.FRANCE:
2003 30 29 87 106 137 186 195 208 158 85 81 45 *
2003 59 64 96 108 123 159 174 188 155 110 103 75 *
2003 10 2 83 110 162 226 210 239 159 81 67 26 *
2003 42 54 108 122 144 199 200 232 173 109 101 65 *
2003 24 34 102 120 149 221 213 252 168 99 82 45 *
2003 12 15 86 109 154 232 220 251 161 88 67 31 *
2003 20 33 102 114 137 212 200 240 162 101 90 49 *
2003 51 58 123 142 160 227 219 252 190 129 109 77 *
2003 45 56 114 133 165 242 243 267 197 133 109 72 *
2003 85 79 115 137 191 248 253 274 209 156 129 96 *
2003 73 73 118 140 181 247 262 275 205 155 125 94 *
2003 86 73 105 131 184 241 251 265 207 166 136 93 *
2003 30 29 87 106 137 186 195 208 158 85 81 45 Do
2003 27 32 88 101 129 181 186 209 153 87 79 45 DA
2003 59 64 96 108 123 159 174 188 155 110 103 75 Do
2003 10 2 83 110 162 226 210 239 159 81 67 26 Do
2003 42 54 108 122 144 199 200 232 173 109 101 65 Do
2003 24 34 102 120 149 221 213 252 168 99 82 45 Do
2003 12 15 86 109 154 232 220 251 161 88 67 31 Do
2003 20 33 102 114 137 212 200 240 162 101 90 49 Do
2003 18 32 93 116 157 232 217 242 160 104 83 43 DB
2003 51 58 123 142 160 227 219 252 190 129 109 77 Do
2003 45 56 114 133 165 242 243 267 197 133 109 72 Do
2003 57 60 110 133 184 245 256 267 196 147 113 81 DC
2003 54 63 111 141 194 261 263 279 201 153 121 85 *
2003 85 79 115 137 191 248 253 274 209 156 129 96 Do
2003 73 73 118 140 181 247 262 275 205 155 125 94 Do
2003 86 73 105 131 184 241 251 265 207 166 136 93 Do
2003-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 -
2003 27 32 88 101 129 181 186 209 153 87 79 45 DA
2003 18 32 93 116 157 232 217 242 160 104 83 43 DB
2003 57 60 110 133 184 245 256 267 196 147 113 81 DC
2003 35 35 85 101 128 170 183 200 152 86 83 46
2003 73 69 88 102 122 156 174 185 169 129 116 84
2003 43 49 90 105 130 178 186 204 154 99 91 55
2003 17 18 83 100 143 195 199 218 148 83 77 41
2003 69 70 95 107 124 160 177 184 164 123 109 82
2003 45 53 106 120 142 195 198 225 174 112 101 64
2003 30 39 93 107 134 191 191 221 155 91 83 47
2003 67 75 105 116 136 179 189 210 173 125 120 90
2003 28 39 102 115 144 209 205 243 167 97 85 51
2003 10 -3 78 102 156 230 210 235 150 74 62 22
2003 61 67 113 128 151 197 207 229 188 136 120 89
2003 30 43 101 122 144 214 206 243 168 104 89 54
2003 -7 -7 52 77 125 202 188 206 127 73 54 13
2003 20 28 100 128 171 254 233 261 179 107 86 42
2003 33 40 112 129 154 233 220 255 176 111 98 60
2003 -29 -40 25 32 79 158 151 189 96 39 42 -4
2003 39 48 107 131 183 254 251 268 182 126 100 61
2003 4 6 79 99 148 215 223 229 158 90 68 31
2003 48 52 113 135 159 229 224 255 184 128 99 65
2003 43 42 105 120 146 214 209 241 178 116 92 60
2003 41 44 101 117 145 216 212 237 173 116 93 55
2003 81 71 111 133 183 244 246 272 200 149 128 101
2003 93 69 111 137 191 254 264 282 212 168 137 103
2003 17 24 88 105 146 208 207 234 158 91 81 42
2003 1 3 84 99 143 208 202 233 153 77 70 29

In the original db, I've x'd those lines missing in the new one. Just missing vals.
In the new db, I've asterisked all the lines matching the old one, with duplicate
matches labeled 'Do'. Any other duplicates are marked Da, DB, DC. We can see that
all the original 2003 lines are included, *and replicated*. Three new lines are also
replicated. A further 25 lines are apparently new (though could well have parents
in the original db). This implies that very little matching is being performed!!

Had a look at the .act (Action) files. Interesting..

crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904021106.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904021239.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.tmp.0904151410.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
Update: 0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0

So it worked fine until the 0904151410 run, when it went crazee.
So.. what happened? Why did it behave differently? No idea. It was the same for pre though!

crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.pre.0904021239.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 1784 2008 -999 -999.00
Update: 0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 1784 2008 -999 -999.00
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/MCDW/mergefiles/merg.mcdw.pre.0904151410.act.gz | grep 'LILLE'
Master: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
Update: 0701500 5034 306 52 LILLE FRANCE 2009 2009 -999 0
NewH: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0

There was something very fishy about that run. Of course it was a single month - I wonder if that made a difference?

crua6[/cru/cruts/version_3_0/update_top] cat runs/runs.0904021239/conv.mcdw.0904021239.dat
9 1994 12 2008
crua6[/cru/cruts/version_3_0/update_top] cat runs/runs.0904151410/conv.mcdw.0904151410.dat
1 2009 1 2009

Also of interest - how did the program find a 2000-2009 station when the previous update was to 2008?

Aha:

crua6[/cru/cruts/version_3_0/update_top] cat runs/runs.0904021239/conv.climat.0904021239.dat
1 2000 2 2009

The CLIMAT update did it!! It's that bloody no-metadata problem!! So I should be looking at the
CLIMAT process for 0904021239, not the MCDW one. Duhh. So, the merge run:

crua6[/cru/cruts/version_3_0/update_top] cat runs/runs.0904021239/merg.climat.0904021239.dat
db/tmx/tmx.0708071548.dtb
updates/CLIMAT/db/db.0904021239/climat.tmx.0904021239.dtb
updates/CLIMAT/db/db.0904021239/int2.tmx.0904021239.dtb
blind
M
crua6[/cru/cruts/version_3_0/update_top]

Looking at the CLIMAT dtb conversion...

The converted CLIMAT bulletins:
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/CLIMAT/db/db.0904021239/climat.tmp.0904021239.dtb.gz |grep -i 'lille'
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0

The merged bulletins in tmp.0904021239.dtb (er, presumably)
crua6[/cru/cruts/version_3_0/update_top] gunzip -c updates/CLIMAT/db/db.0904021239/int2.tmp.0904021239.dtb.gz |grep -i 'lille'
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

crua6[/cru/cruts/version_3_0/update_top] grep -i 'lille' db/tmp/tmp.0*dtb
db/tmp/tmp.0705101334.dtb: 70150 506 31 47 LILLE FRANCE 1851 2006 101851 -999.00

db/tmp/tmp.0903081416.dtb:0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00

db/tmp/tmp.0904021106.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
db/tmp/tmp.0904021106.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

db/tmp/tmp.0904021239.dtb:0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
db/tmp/tmp.0904021239.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

db/tmp/tmp.0904151410.dtb: 701500 5034 306 52 LILLE FRANCE 2000 2009 -999 0
db/tmp/tmp.0904151410.dtb:7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

OK.. let's be absolutely clear about the process.

STEP 1 - convert MCDW bulletins (09/94 - 12/08) to produce mcdw.tmp.0904021106.dtb:
0701500 5034 306 52 LILLE FRANCE 2001 2008 -999 0

STEP 2 - Merge mcdw.tmp.0904021106.dtb into tmp.0903081416.dtb to produce int1.tmp.0904021106.dtb:
701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

STEP 3 - convert CLIMAT bulletins (01/00 - 02/09) to climat.tmp.0904021106.dtb:
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0

STEP 4 - Merge climat.tmp.0904021239.dtb into int1.tmp.0904021106.dtb to produce int2.tmp.0904021106.dtb:
0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

(there's then the BOM section but it's all over by now)

So, the merging of climat.tmp.0904021106.dtb into int1.tmp.0904021106.dtb FAILED. WHY?

Well, the WMO codes are the same as for MCDW: 0701500. So it can't be that. The lat and lon
are ~slightly~ different, though. Remember, the DATABASE entry was originally (tmp.0903081416.dtb):

0701500 5060 310 47 LILLE FRANCE 1851 2006 101851 -999.00

After the MCDW (09/94 - 12/08) merge, it became (int1.tmp.0904021106.dtb):

701500 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

After the CLIMAT (01/00 - 02/09) merge (int2.tmp.0904021106.dtb):

0701500 5057 310 52 LILLE/LESQUIN FRANCE 2000 2009 -999 0
7015000 5034 306 52 LILLE FRANCE 1851 2008 101851 -999.00

Now, the 'LILLE/LESQUIN' station header comes from the CLIMAT bulletins, ie, from the WMO
reference file wmo.0710151633.dat. But it should have matched with the existing LILLE -
the problem looks like the latitude shift (from 50.60 to 50.34) introduced by MCDW did
the damage. Obviously, if we are going to trust MCDW metadata as being valid corrections,
then the WMO reference file needs to be updated at the same time!! So, we'll need:

1. A file called 'wmoref.latest.dat' that contains the name of the latest WMO reference file.
- DONE

2. A hook in newmergedbauto that flags when a header is being changed by an MCDW/BOM bulletin.
- EXTREMELY COMPLICATED

3. A routine to write a new WMO reference and to archive the old one.
- EXTREMELY COMPLICATED

4. A record of every iteration of the db/latest.versions.dat file (in db/previous.latest/).
- DONE

As part of the investigations, I found that I wasn't close()-ing off channel 10 when I used
it in update.for. Now I'm pretty sure that F77 follows the convention that an OPEN on an
open channel initiates an initial CLOSE automatically, but who wants to take that chance with
the variety of compilers we're subject to? So I went through and inserted an indecent number
of close(10)s.

Point 2 - flagging changes to the metadata.

Well, the merged database is written principally from dbm*, with dbu* chipping in 'new' stations.
I guess that new stations should be added to the wmo reference file? They are pan-parameter (well
the MCDW ones are) but I have an eerie feeling that I won't experience joy when headers are
compared between parameters :/

Wrote metacmp.for. It accepts a list of parameter databases (by default, latest.versions.dat) and
compares headers when WMO codes match. If all WMO matches amongst the databases share common
metadata (lat, lon, alt, name, country) then the successful header is written to a file. If,
however, any one of the WMO matches fails on any metadata - even slightly! - the gaggle of
disjointed headers is written to a second file. I know that leeway should be given, particularly
with lats & lons, but as a first stab I just need to know how bad things are. Well, I got that:

crua6[/cru/cruts/version_3_0/update_top] ./metacmp
METACMP - compare parameter database metadata
RESULTS:

Matched/unopposed: 2435
Clashed horribly: 4077

Ouch! Though actually, far, far better than expected. As for the disport of those 2435:

crua6[/cru/cruts/version_3_0/update_top] grep '^1' report.0909181759.metacmp.wmo | wc -l
1250
crua6[/cru/cruts/version_3_0/update_top] grep '^2' report.0909181759.metacmp.wmo | wc -l
279
crua6[/cru/cruts/version_3_0/update_top] grep '^3' report.0909181759.metacmp.wmo | wc -l
41
crua6[/cru/cruts/version_3_0/update_top] grep '^4' report.0909181759.metacmp.wmo | wc -l
92
crua6[/cru/cruts/version_3_0/update_top] grep '^5' report.0909181759.metacmp.wmo | wc -l
83
crua6[/cru/cruts/version_3_0/update_top] grep '^6' report.0909181759.metacmp.wmo | wc -l
9
crua6[/cru/cruts/version_3_0/update_top] grep '^7' report.0909181759.metacmp.wmo | wc -l
129
crua6[/cru/cruts/version_3_0/update_top] grep '^8' report.0909181759.metacmp.wmo | wc -l
552

Interesting, but not astounding. Roughly half are unpaired stations, with an impressive
23% showing a perfect match across all eight databases.

Analysis of the 4000+ bad matches will be more complicated unfortunately. An initial
re-run looking for lat/lon within half a degree, and/or station partial, will be useful.

No, hang on. Easier to analyse the output from metacmp! And so.. postmetacmp.for:

Stats report for: report.0909181759.metacmp.bad

Overall distribution of group sizes:
2 in group: 642
3 in group: 71
4 in group: 188
5 in group: 625
6 in group: 183
7 in group: 411
8 in group: 1957

LAT:
Number of diffs within a group:
1. 0
2. 3059
3. 276
4. 15
5. 0
6. 0
7. 0
8. 0
Maximum differences:
<0.1: 1233
<0.2: 726
<0.5: 1225
<1.0: 15
1.0+: 151

LON:
Number of diffs within a group:
1. 0
2. 2996
3. 339
4. 30
5. 1
6. 0
7. 0
8. 0
Maximum differences:
<0.1: 1195
<0.2: 722
<0.5: 1242
<1.0: 30
1.0+: 177

ALT:
Number of diffs within a group:
1. 0
2. 2035
3. 237
4. 17
5. 0
6. 0
7. 0
8. 0
Maximum differences:
<50m : 1767
<100m: 75
<500m: 121
<1km : 36
1km+ : 290

STATION NAME:
Number of diffs within a group:
1. 0
2. 2167
3. 365
4. 43
5. 0
6. 0
7. 0
8. 0
Worst percentage matches:
<25% : 281
<50% : 385
<75% : 770
<100%: 276
100% : 863

COUNTRY NAME:
Total groups with country mismatches: 1698
Number of diffs within a group:
1. 0
2. 1475
3. 182
4. 41
5. 0
6. 0
7. 0
8. 0

Hmmm.. lots of groups that could be eliminated if we incorporated the WMO reference
list, because then we could allow an element of 'drift' from a reference point.

Decided to make it a bit quicker and easier as well, by removing tmn/tmx and letting
dtr take the strain - they should all be identical anyway.








Go back to index or Email search