The HARRY_READ_ME.txt file

Part 35b


..okay, another week, another razorblade to slide down. Modified mcdw2cru to include rain days:

<BEGIN QUOTE>
uealogin1[/cru/cruts/version_3_0/incoming/MCDW] ./mcdw2cru

MCDW2CRU: Convert MCDW Bulletins to CRU Format

Enter the earliest MCDW file: ssm0301.fin
Enter the latest MCDW file (or for single files): ssm0706.fin

All Files Processed
tmp.0709111032.dtb: 2407 stations written
vap.0709111032.dtb: 2398 stations written
rdy.0709111032.dtb: 2407 stations written
pre.0709111032.dtb: 2407 stations written
sun.0709111032.dtb: 1693 stations written

Thanks for playing! Byeee!
<END QUOTE>

Checked, and the four preexisting databases match perfectly with their counterparts, so I didn't break
anything in the adjustments. And the rdy file looks good too (actually the above is the *final* run;
there were numerous bugs as per).

<BEGIN QUOTE>
uealogin1[/cru/cruts/version_3_0/incoming/CLIMAT] ./climat2cru

CLIMAT2CRU: Convert MCDW Bulletins to CRU Format

Enter the earliest CLIMAT file: climat_data_200301.txt
Enter the latest CLIMAT file (or for single file): climat_data_200707.txt

All Files Processed
tmp.0709101706.dtb: 2881 stations written
vap.0709101706.dtb: 2870 stations written
rdy.0709101706.dtb: 2876 stations written
pre.0709101706.dtb: 2878 stations written
sun.0709101706.dtb: 2020 stations written
tmn.0709101706.dtb: 2800 stations written
tmx.0709101706.dtb: 2800 stations written

Thanks for playing! Byeee!
<END QUOTE>

Again, existing outputs are unchanged and the new rdy file looks OK (though see bracketed note above for MCDW).

So.. to the incorporation of these updates into the secondary databases. Oh, my.

Beginning with Rain Days, known variously as rd0, rdy, pdy.. this allowed me to modify newmergedb.for to cope
with various 'freedoms' enjoyed by the existing databases (such as six-digit WMO codes). And then, when run,
an unexpected side-effect of my flash correlation display thingy: it shows up existing problems with the data!

Here is the first 'issue' encountered by newmergedb, taken from the top and with my comments in <angle brackets>:

<BEGIN QUOTE>
uealogin1[/cru/cruts/version_3_0/db/rd0] ./newmergedb

WELCOME TO THE DATABASE UPDATER

Before we get started, an important question:
Should the incoming 'update' header info and data take precedence over the existing database?
Or even vice-versa? This will significantly reduce user decisions later, but is a big step!

Enter 'U' to give Updates precedence, 'M' to give Masters precedence, 'X' for equality: U
Please enter the Master Database name: wet.0311061611.dtb
Please enter the Update Database name: rdy.0709111032.dtb

Reading in both databases..
Master database stations: 4988
Update database stations: 2407

Looking for WMO code matches..

***** OPERATOR ADJUDICATION REQUIRED *****

In attempting to pair two stations, possible data incompatibilities have been found.

MASTER: 221130 6896 3305 51 MURMANSK EX USSR 1936 2003 -999 -999
UPDATE: 2211300 6858 3303 51 MURMANSK RUSSIAN FEDER 2003 2007 -999 0

CORRELATION STATISTICS (enter 'C' for more information):
> -0.60 is minimum correlation coeff.
> 0.65 is maximum correlation coeff.
> -0.01 is mean correlation coeff.

Enter 'Y' to allow, 'N' to deny, or an information code letter: C

<OKAY - SO I'VE REQUESTED A DISPLAY OF THE LAGGED CORRELATIONS>

Master Data: Correlation with Update first year aligned to this year -v
1936 900 600 1000 800 1000 900 1300 1700 2100 1800 900 1000 0.27
1937 300 1400 1300 800 1400 1800 500 1200 1600 1000 1100 1500 0.15
1938 900 1000 1500 1800 1200 1500 1200 1700 500 700 1600 700 -0.13
1939 1500 1300 1100 1400 1200 1200 1000 1300 1800 1600 1100 1300 0.24
1940 1000 1500 1000 1200 1100 1700 2600 1500 1500 1400 1700 1100 0.15
1941 1800 1200 1000 1200 900 1100 900 1200 1900 1500 1000 1400 0.48
1942 900 900 1700 900 1600 1000 600 1100 1400 1300 700 700 0.51
1943 800 1000 1000 1300 900 800 1500 1600 1400 1500 1300 1200 0.44
1944 1000 400 900 800 1200 600 900 2000 900 1100 1000 900 0.32
1945 500 400 700 700 800 1800 900 1100 1200 1100 1300 700 0.19
1946 1200 1200 100 700 900 1200 400 900 800 1900 1300 1400 0.16
1947 900 1300 1300 1100 1600 1000 800 1400 1400 1700 2100 1900 0.09
1948 1100 1400 1400 1200 1300 1800 1200 1700 1500 2200 2100 1900 0.10
1949 1100 1100 500 1500 1600 1100 1500 1200 2200 2500 900 1600 0.04
1950 1300 800 1000 1100 1700 1200 1500 800 1100 1300 1500 1400 -0.04
1951 1100 600 1400 1400 1500 1600 2100 1300 1500 1700 2000 1700 -0.13
1952 2100 800 1100 1800 1300 1200 2400 2200 1600 1000 1000 2300 -0.23
1953 2100 1400 2100 1500 900 300 1300 1700 1500 800 1200 800 -0.24
1954 2100 600 1300 1000 1300 1700 1600 2000 1800 1300 1400 1200 -0.40
1955 2200 1300 900 1000 1600 2000 1100 1400 1000 2100 2300 1600 -0.20
1956 1300 1100 1300 400 1600 1300 900 1500 2000 1300 2000 1400 -0.30
1957 1700 1600 1100 1100 1900 1900 1400 1600 1400 1700 2300 2600 -0.27
1958 1300 2200 1900 700 1500 1200 2100 1000 1900 1700 1600 1000 -0.21
1959 2500 1800 1300 900 900 1600 1600 1500 2200 1700 1000 900 -0.33
1960 1800 1700 1500 400 1300 1500 400 1000 1300 1500 1000 1400 -0.21
1961 2100 1800 2200 1500 800 1400 1600 1100 1900 1200 1200 2100 -0.59
1962 2100 1100 1000 1500 1300 1100 1300 1700 1200 2000 1600 2300 -0.37
1963 2100 2100 2000 1000 700 2000 1400 1800 1400 1600 2000 2400 -0.56
1964 2400 1100 1000 1700 1100 1400 1400 1400 2000 1200 2100 1800 -0.42
1965 1400 2100 1300 1000 1700 1700 1400 2400 1300 2100 1900 2100 -0.41
1966 1600 1600 2000 2000 1700 1200 2000 2500 2500 2700 1600 600 -0.34
1967 2200 1700 1600 1200 1000 1400 1600 1300 1700 1500 1200 2100 -0.21
1968 1600 1800 1800 1800 1500 1800 1400 2100 1000 2000 2100 2000 -0.28
1969 1100 300 1900 1200 1000 1300 1500 1200 1200 2000 1700 800 -0.25
1970 1900 1400 1200 900 600 1200 1500 700 2300 1700 1700 2100 -0.23
1971 2000 1300 1600 1600 1200 1100 1400 1800 2000 1600 1700 1500 -0.39
1972 1300 1200 1300 1200 1700 800 1400 1800 1900 2000 1700 1600 -0.26
1973 1800 1100 1700 900 1200 1500 500 1800 1200 2000 2100 2100 -0.36
1974 1100 2400 700 1600 1300 1300 1800 2000 1900 1200 1400 2400 -0.29
1975 1500 2200 1400 1700 2500 2200 2300 1600 1700 2300 1800 2600 -0.47
1976 1900 800 1100 1500 1000 900 1300 1800 2200 1600 1400 1600 -0.33
1977 1800 1400 2200 1200 1600 1900 1300 1500 1500 1900 1500 2000 -0.40
1978 1500 1800 1400 2100 700 1000 1100 1900 1700 2300 1500 2200 -0.24
1979 1700 1700 1700 1200 1500 1800 900 1200 1800 1600 1500 2300 -0.39
1980 1900 1300 1300 1000 1400 900 700 1100 1300 1600 2200 1700 -0.36
1981 2600 500 1900 2000 800 1900 1500 2000 1400 1500 1800 1600 -0.46
1982 2200 1800 1100 1600 1500 2200 1800 1400 1700 1700 1900 1400 -0.60
1983 2400 1900 1700 1200 800 1500 1200 2000 1400 2100 2000 2500 -0.23
1984 1900 800 1500 2000 1100 1600 2000 1700 1100 1400 1000 1200
1985-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
1986-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
1987-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
1988-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
1989-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 0.65
1990-9999-9999-9999-9999-9999 500 1300 900 700 900 1300 700 0.62
1991-9999 900 500 300 700 1000 1500 700 1700 1000 1300 1300 0.54
1992 800 1000 600 500 700 900-9999 1300-9999 700 900 1200 0.60
1993 600 900 400 500 900 1500 1000 800 800 1000 400 1000 0.55
1994 1300 1000 300 600 700 1000 900 600 1200 0 1400 600 0.43
1995 900 900 600 700 700 900 1100 1300 600 1800 1300 500 0.61
1996 500 1100 400 700 700 1200 1200 1100 1100 900 1000 1400 0.54
1997 1200 800 1300 600 600 100 500 1100 900-9999 1000 900 0.61
1998 1200 1300 800 1100 1100 1100 800 600 1200 1100 600 1200 0.52
1999 600 400 600 1000 700 700 1800 1400 700 1600 800 1200 0.62
2000 1100 600 1500 1700 900 1500 800 800 1000 1000 600 600 0.40
2001 600 500 700 700 600 500 1200 1200 700 1300 900 1000 0.63
2002 1000 800 1300 200 900 1100 1400 1200 1400 1800 1100 700
2003 1100-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999
Update Data:
2003 1100 700 700 500 1000 400 700 1100 1200 2100 800 1900
2004 900 700 600 600 1300 1200 1000 1200 1400 900 1000 1000
2005 1000 400 800 1100 900 600 1200 1000 1600 1000 1300 1200
2006 700 500 1300 400 600 1200 1600 700 1000-9999 600 1500
2007 1400 400 400 1300 1200 1200-9999-9999-9999-9999-9999-9999

<DO YOU SEE? THERE'S THAT OH-SO FAMILIAR BLOCK OF MISSING CODES IN THE LATE 80S,
THEN THE DATA PICKS UP AGAIN. BUT LOOK AT THE CORRELATIONS ON THE RIGHT, ALL
GOOD AFTER THE BREAK, DECIDEDLY DODGY BEFORE IT. THESE ARE TWO DIFFERENT
STATIONS, AREN'T THEY? AAAARRRGGGHHHHHHH!!!!!>

MASTER: 221130 6896 3305 51 MURMANSK EX USSR 1936 2003 -999 -999
UPDATE: 2211300 6858 3303 51 MURMANSK RUSSIAN FEDER 2003 2007 -999 0

CORRELATION STATISTICS (enter 'C' for more information):
> -0.60 is minimum correlation coeff.
> 0.65 is maximum correlation coeff.
> -0.01 is mean correlation coeff.

Enter 'Y' to allow, 'N' to deny, or an information code letter:
<END QUOTE>

So.. should I really go to town (again) and allow the Master database to be 'fixed' by this
program? Quite honestly I don't have time - but it just shows the state our data holdings
have drifted into. Who added those two series together? When? Why? Untraceable, except
anecdotally.

It's the same story for many other Russian stations, unfortunately - meaning that (probably)
there was a full Russian update that did no data integrity checking at all. I just hope it's
restricted to Russia!!
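
A minimal sketch (in Python, not the Fortran of newmergedb.for) of how a lagged correlation column like
the one quoted above could be produced. The names pearson, lagged_correlations and min_pairs are
hypothetical, and the min_pairs threshold is an assumption: the listing leaves the last master years
blank, which suggests some minimum overlap is required, but the actual rule isn't stated here.

```python
from math import sqrt

MISSING = -9999  # missing-value code used in the listings above


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def lagged_correlations(master, update, min_pairs=12):
    """master and update map year -> list of 12 monthly values.

    For each master year, line the first update year up against it and
    correlate the overlapping, non-missing months. min_pairs is a
    hypothetical threshold, not taken from newmergedb.
    """
    first_update = min(update)
    result = {}
    for start in sorted(master):
        xs, ys = [], []
        for uyear in sorted(update):
            myear = start + (uyear - first_update)
            if myear not in master:
                continue
            for m, u in zip(master[myear], update[uyear]):
                if m != MISSING and u != MISSING:
                    xs.append(m)
                    ys.append(u)
        result[start] = pearson(xs, ys) if len(xs) >= min_pairs else None
    return result
```

A run of consistently poor values followed by consistently good ones, as in the MURMANSK listing, is the
sort of pattern that would point to two different series having been joined.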

There are, of course, metadata issues too. Take:

<BEGIN QUOTE>
MASTER: 206740 7353 8040 47 DIKSON ISLAND EX USSR 1936 2003 -999 -999
UPDATE: 2067400 7330 8024 47 OSTROV DIKSON RUSSIAN FEDER 2003 2007 -999 0

CORRELATION STATISTICS (enter 'C' for more information):
> -0.70 is minimum correlation coeff.
> 0.81 is maximum correlation coeff.
> -0.01 is mean correlation coeff.
<END QUOTE>
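
For a rough sense of how far apart the two sets of coordinates in the pair quoted above are, here is a
small great-circle sketch in Python. It assumes the header fields hold latitude and longitude as
degrees times 100 (so 7353 means 73.53), which is an assumption about the format rather than something
stated in this section; header_to_degrees and great_circle_km are hypothetical names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def header_to_degrees(value):
    """Assumes header coordinates are stored as degrees * 100 (7353 -> 73.53)."""
    return value / 100.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# MASTER 7353 8040 against UPDATE 7330 8024 (DIKSON ISLAND / OSTROV DIKSON)
dikson_km = great_circle_km(
    header_to_degrees(7353), header_to_degrees(8040),
    header_to_degrees(7330), header_to_degrees(8024),
)
```

Under those assumptions the two positions come out a few tens of kilometres apart, most of it from the
latitude difference.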

This is pretty obviously the same station (well OK.. apart from the duff early period, but I've
got used to that now). But look at the longitude! That's probably 20km! Luckily I selected
'Update wins' and so the metadata aren't compared. This is still going to take ages, because although
I can match WMO codes (or should be able to), I must check that the data correlate adequately - and
for all these stations there will be questions. I don't think it would be a good idea to take the
usual approach of coding to avoid the situation, because (a) it will be non-trivial to code for, and
(b) not all of the situations are the same. But I am beginning to wish I could just blindly merge
based on WMO code.. the trouble is that then I'm continuing the approach that created these broken
databases. Look at this one:

<BEGIN QUOTE>
***** OPERATOR ADJUDICATION REQUIRED *****

In attempting to pair two stations, possible data incompatibilities have been found.

MASTER: 239330 6096 6906 40 HANTY MANSIJSK EX USSR 1936 1984 -999 -999
UPDATE: 2393300 6101 6902 46 HANTY-MANSIJSK RUSSIAN FEDER 2003 2007 -999 0

CORRELATION STATISTICS (enter 'C' for more information):
> -0.42 is minimum correlation coeff.
> 0.39 is maximum correlation coeff.
> -0.02 is mean correlation coeff.

Enter 'Y' to allow, 'N' to deny, or an information code letter: C
Master Data: Correlation with Update first year aligned to this year -v
1936 1400 800 1700 900 1200 800 700 800 1800-9999-9999-9999 0.33
1937 1400 800 500 1700 1500 800 1200 1000 1700 1300 700 1200 0.32
1938 1000 1700 1200 1100 1100 800 800 1300 1400 1900 1800 1300 0.04
1939 1100 1700 1600 1800 1500 800 1500 1900 1700 1800 1300 1300 0.09
1940 1300 700 900 900 1800 1200 900 1300 1200 2200 1900 1800 0.08
1941 1400 1100 1800 1000 1400 1900 1400 700 1300 1200 1900 2000 0.02
1942 1700 900 1600 900 1200 1500 1300 1500 1200 1900 1500 1500 -0.06
1943 1400 1300 1300 800 1400 1600 1300 1500 1900 2000 700 1900 -0.17
1944 1900 1500 2000 1100 1200 1300 1500 1700 1800 1200 1500 1900 -0.32
1945 1300 1000 1400 2100 2000 1100 1700 700 1600 1800 2300 1700 -0.42
1946 2300 1900 1500 1100 1100 2000 1800 1000 1200 2100 2000 1800 -0.35
1947 1900 1400 1600 1000 2100 1900 2100 1000 1200 2000 2100 1500 -0.35
1948 1700 1500 1800 800 1300 1800 1700 1300 1800 2200 2000 2100 -0.15
1949 2300 2100 1000 700 1600 1400 1200 800 2100 2000 1100 1400 -0.07
1950 2100 2300 1000 1100 1500 1600 1600 2300 1900 1200 1100 1500 0.00
1951 1600 1000 1500 800 1500 1400 1200 600 1800 1800 1400 2400 -0.07
1952 1600 400 1100 1300 1100 1400 800 2000 1500 2300 1300 1600 -0.04
1953 2000 1200 1500 500 1300 1500 1100 1200 2300 2200 1600 2100 -0.02
1954 1700 1800 700 700 1000 1300 1200 1600 2000 1800 1800 600 0.01
1955 2400 1400 1000 1100 1700 1200 1000 1300 1500 1300 2300 1600 -0.08
1956 1300 800 1000 1100 1000 1000 1400 1800 1900 1900 2600 2000 -0.29
1957 1900 1200 1700 1000 1100 1100 1100 700 800 2300 1900 2200 -0.18
1958 1300 1600 1500 400 1500 1100 1300 1400 1900 2400 2000 1600 -0.28
1959 1700 1600 700 1300 1700 1100 1100 1600 2000 2100 1900 1600 -0.04
1960 1800 1600-9999-9999-9999-9999-9999-9999-9999-9999-9999-9999 0.24
1961-9999-9999-9999-9999-9999-9999-9999 1600 1600 1700 1900 1600 0.33
1962 1700 800 1200 600 400 1100 900 2000 1100 1900 1700 1500 0.25
1963 1200 1300 1700 700 1100 1600 900 1000 1100 1400 1800 2000 -0.04
1964 1900 500 1300 1300 1200 1200 1100 1100 1700 1500 2000 1800 0.13
1965 1200 1400 700 900 1200 1100 1300 1400 1800 2500 1000 1700 0.23
1966 1800 1600 2100 1300 1500 2100 900 1800 1500 2400 1900 800 0.11
1967 1600 1200 1100 600 800 1100 1100 700 1300 1200 1300 1900 0.39
1968 1600 1400 1600 1200 900 1300 1400 1000 1700 1300 1400 1200 0.24
1969 900 1000 1100 1500 1700 1700 1000 1800 1200 1400 1900 1300 0.04
1970 1500 1200 1600 1400 700 1600 700 1600 1000 1500 1900 1600 -0.02
1971 1700 400 1100 1700 1300 1700 700 2000 900 2100 2000 1900 -0.11
1972 1200 1500 1400 800 1700 1300 1700 2000 2100 1700 2500 1900 -0.08
1973 1200 1100 1100 700 800 1300 2100 1000 2400 1900 1800 2300 -0.11
1974 700 1200 1800 1800 1400 1200 1000 1300 1100 1600 1900 700 -0.14
1975 2200 1800 1400 1300 1500 1500 1400 1500 1400 2300 1900 2100 -0.15
1976 2000 1500 600 700 1100 1600 1300 1100 1500 1800 1600 1200 -0.11
1977 1900 1700 1800 1400 1000 1100 1000 1300 1500 1800 1700 2100 -0.15
1978 1600 1000 800 1400 1400 800 1600 1600 2300 2200 2200 1800 0.03
1979 1600 1600 1600 900 900 1900 1200 1700 1200 2100 1600 2000 0.00
1980 1600 1200 500 800 1500 1100 800 1700 1200 600 2200 2200 -0.05
1981 2000 1000 1700 1300 1500 1100 800 400 1500 800 1500 1900 0.06
1982 2400 1800 1100 1200 1200 1100 1000 1700 1200 2100 1800 2000 0.03
1983 2500 2100 1800 1300 1400 1200 1200 1300 1300 1900 2300 1900 0.10
1984 1200 700 500 1300 900 800 1100 1000 1700 1600 1600 1300
Update Data:
2003 1500 900 600 400 900 1200 500 700 1100 600 700 1500
2004 700 600 700 400 600 1100 500 900 900 1400 1500 600
2005 700 400 800 1400 300 900 800 800 900 500 1200 600
2006 800 700 900 1000 800 500 1000 500 1300 1100 700 1600
2007 1100 1100 900 700 1300 1500-9999-9999-9999-9999-9999-9999
<END QUOTE>

Here, the expected 1990-2003 period is MISSING - so the correlations aren't so hot! Yet
the WMO codes and station names/locations are identical (or close). What the hell is
supposed to happen here? Oh yeah - there is no 'supposed', I can make it up. So I have :-)

If an update station matches a 'master' station by WMO code, but the data is unpalatably
inconsistent, the operator is given three choices:

<BEGIN QUOTE>
You have failed a match despite the WMO codes matching.
This must be resolved!! Please choose one:

1. Match them after all.
2. Leave the existing station alone, and discard the update.
3. Give existing station a false code, and make the update the new WMO station.

Enter 1,2 or 3:
<END QUOTE>

You can't imagine what this has cost me - to actually allow the operator to assign false
WMO codes!! But what else is there in such situations? Especially when dealing with a 'Master'
database of dubious provenance (which, er, they all are and always will be).

False codes will be obtained by multiplying the legitimate code (5 digits) by 100, then adding
1 at a time until a number is found with no matches in the database. THIS IS NOT PERFECT but as
there is no central repository for WMO codes - especially made-up ones - we'll have to chance
duplicating one that's present in one of the other databases. In any case, anyone comparing WMO
codes between databases - something I've studiously avoided doing except for tmin/tmax where I
had to - will be treating the false codes with suspicion anyway. Hopefully.
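
The false-code rule described above can be sketched as follows, in Python rather than the Fortran of
newmergedb.for. make_false_code and used_codes are hypothetical names. Two details are assumptions: the
search starts at code*100 + 1 (so the false code never equals the legitimate 7-digit form), and it gives
up after +99 so as not to run into the next station's range.

```python
def make_false_code(wmo5, used_codes):
    """Return an unused 7-digit 'false' code derived from a 5-digit WMO code.

    used_codes is any container of the 7-digit codes already present in
    the database being updated. Other databases are NOT consulted, so a
    clash with a code used elsewhere remains possible.
    """
    base = wmo5 * 100
    # Assumption: start one above the legitimate code and stop at +99.
    for candidate in range(base + 1, base + 100):
        if candidate not in used_codes:
            return candidate
    raise ValueError("no free false code left for WMO %05d" % wmo5)
```

For example, with 2393300 already in the database, make_false_code(23933, {2393300}) gives 2393301.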

Of course, option 3 cannot be offered for CLIMAT bulletins, there being no metadata with which
to form a new station.

This still meant an awful lot of encounters with naughty Master stations, which I suspect nobody
else really gives a hoot about. So with a somewhat cynical shrug, I added the nuclear option -
to match every WMO possible, and turn the rest into new stations (er, CLIMAT excepted). In other
words, what CRU usually do. It will allow bad databases to pass unnoticed, and good databases to
become bad, but I really don't think people care enough to fix 'em, and it's the main reason the
project is nearly a year late.
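
A minimal sketch of what such a 'match everything' pass amounts to, in Python and with hypothetical
names (blind_merge, allow_new_stations). It assumes each database is a mapping from WMO code to
{year: twelve monthly values}, and that with Updates given precedence a whole update year replaces the
master year; neither detail is taken from newmergedb.for.

```python
def blind_merge(master, update, allow_new_stations=True):
    """Merge update into master purely on WMO code, with no data checks.

    Returns (merged, matched, added, dropped). Set allow_new_stations
    to False for sources such as CLIMAT that carry no metadata with
    which to form a new station.
    """
    merged = {code: dict(years) for code, years in master.items()}
    matched, added, dropped = [], [], []
    for code, years in update.items():
        if code in merged:
            merged[code].update(years)  # assumption: update years win outright
            matched.append(code)
        elif allow_new_stations:
            merged[code] = dict(years)
            added.append(code)
        else:
            dropped.append(code)
    return merged, matched, added, dropped
```

Nothing in this pass looks at the data itself, so a mismatched pair such as the ones quoted earlier
would be joined without comment.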

And there are STILL WMO code problems!!! Let's try again with the issue. Let's look at the first
station in most of the databases, JAN MAYEN. Here it is in various recent databases:


dtr.0705152339.dtb: 100100 7093 -867 9 JAN MAYEN NORWAY 1998 2006 -999 -999.00
pre.0709111032.dtb:0100100 7056 -840 9 JAN MAYEN NORWAY 2003 2007 -999 0
sun.0709111032.dtb:0100100 7056 -840 9 JAN MAYEN NORWAY 2003 2007 -999 0
tmn.0702091139.dtb: 100100 7093 -867 9 JAN MAYEN NORWAY 1998 2006 -999 -999.00
tmn.0705152339.dtb: 100100 7093 -867 9 JAN MAYEN NORWAY 1998 2006 -999 -999.00
tmp.0709111032.dtb:0100100 7056 -840 9 JAN MAYEN NORWAY 2003 2007 -999 0
tmx.0702091313.dtb: 100100 7093 -867 9 JAN MAYEN NORWAY 1998 2006 -999 -999.00
tmx.0705152339.dtb: 100100 7093 -867 9 JAN MAYEN NORWAY 1998 2006 -999 -999.00
vap.0709111032.dtb:0100100 7056 -840 9 JAN MAYEN NORWAY 2003 2007 -999 0

As we can see, even I'm cocking it up! Though recoverably. DTR, TMN and TMX need to be written as (i7.7).
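
To illustrate the difference, here are Python equivalents of the two Fortran edit descriptors for a
non-negative code of at most seven digits: i7 right-justifies and pads with blanks, while i7.7 forces
at least seven digits and so pads with zeros. The function names are hypothetical.

```python
def fortran_i7(code):
    """Mimics Fortran (i7): right-justified in 7 columns, blank-padded."""
    return f"{code:7d}"


def fortran_i7_7(code):
    """Mimics Fortran (i7.7): 7 columns, zero-padded to 7 digits."""
    return f"{code:07d}"


jan_mayen = 1001 * 100  # WMO 01001 held as a 7-digit code
blank_padded = fortran_i7(jan_mayen)    # ' 100100'
zero_padded = fortran_i7_7(jan_mayen)   # '0100100'
```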

Anyway, here it is in the problem database:

wet.0311061611.dtb: 10010 7093 -866 9 JAN MAYEN(NOR NAVY) NORWAY 1990 2003 -999 -999

You see? The leading zero's been lost (presumably through writing as i7) and then a zero has been added at
the trailing end. So it's a 5-digit WMO code BUT NOT THE RIGHT ONE. Aaaarrrgghhhhhh!!!!!!

I think this can only be fixed in one of two ways:

1. By hand.

2. By automatic comparison with other (more reliable) databases.

As usual - I'm going with 2. Hold onto your hats.
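
Purely as an illustration of what an automatic comparison might look like (this is NOT the method
actually used; all names here are hypothetical): undo the suspected shift, then accept the repaired
code only if a more reliable database holds it under a compatible station name. The name test, a simple
prefix match, is an assumption.

```python
def undo_shift(code5):
    """Reverse 'leading zero lost, trailing zero added': 10010 -> 1001 (i.e. 01001)."""
    digits = f"{code5:05d}"
    if len(digits) != 5 or not digits.endswith("0"):
        return None
    return int(digits[:-1])


def repair_code(code5, name, reference):
    """Return the 7-digit code from the reference database, or None.

    reference maps 7-digit codes to station names and stands in for a
    more reliable database. The repair is accepted only if the master
    name starts with the reference name (assumed matching rule).
    """
    candidate = undo_shift(code5)
    if candidate is None:
        return None
    code7 = candidate * 100
    ref_name = reference.get(code7)
    if ref_name and name.upper().startswith(ref_name.upper()):
        return code7
    return None
```

With a reference of {100100: "JAN MAYEN"}, the wet entry 10010 "JAN MAYEN(NOR NAVY)" would be repaired
to 100100, while a genuine station 10010 with some other name would be left alone.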
