The HARRY_READ_ME.txt file

Part 35q

OK, got cloud working, have to generate it now.. but distracted by starting on the
mythical 'Update' program. As usual, it's much more complicated than it seems. So,
let's work out the order of events.

Well first, some ground rules: should this be 'dumb'? Should the operator say what
they want to happen, and walk away, coming back later to check it worked? Or should
it be interactive, to the extent of the operator deciding on station matches and so
forth? At the moment, the introduction of new data (MCDW, CLIMAT, BOM) is highly
interactive, and, though BOM should be fully automatic in the future, the same
cannot be said for MCDW and CLIMAT. Hmmm. Well, I guess there are two possibilities:

1. Operator selects 'interactive' additions. Script proceeds, calling merge programs
as necessary, some of which may ask the operator to decide on matches. This could
take hours, or even days, depending on the quality of the incoming metadata.

2. Operator selects 'automatic' additions. Script proceeds, calling special versions
of the merge programs. These have a fixed confidence threshold for adding new data to
existing databases. When confidence falls below the threshold, the data is not added
but stored in a new database, which might of course be merged in later under option 1.
Note that this threshold would be higher than the one in option 1 that triggers
operator involvement (rough sketch of the decision logic below).
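
Something like this, roughly (plain sh; the confidence score, the thresholds and the
messages are all invented for illustration, not taken from any existing program):

  #!/bin/sh
  # Sketch only: MODE is 'interactive' or 'automatic'; CONF is a 0-100
  # match-confidence score assumed to come from some earlier matching step.
  MODE=$1
  CONF=$2
  AUTO_THRESH=90   # option 2: below this, park the data rather than add it
  ASK_THRESH=70    # option 1: below this, ask the operator to decide
  if [ "$MODE" = "automatic" ]; then
      if [ "$CONF" -ge "$AUTO_THRESH" ]; then
          echo "add to existing database"
      else
          echo "store in holding database for later interactive merge"
      fi
  else
      if [ "$CONF" -ge "$ASK_THRESH" ]; then
          echo "add to existing database"
      else
          echo "prompt operator to decide on the match"
      fi
  fi

The only real content there is that AUTO_THRESH sits higher than ASK_THRESH.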

Is this sufficient? It certainly means more coding, but not a huge amount. In a
worst-case scenario (where the operator always chooses option 2), we still have the
unused data updates, which can be interactively merged in at any time (even years in
the future).

This all avoids the big questions, of course. When do updates happen, and how far back
do they go? For instance, suppose there are six-monthly published updates: the full
1901-present files are published yearly, with six-month update files as interims.
What happens in any of the following circumstances?

A. Updates for, say, 1965, are available.

B. The data used in the January-to-June update is further updated after publication and
is present in the next 'full' release (so that the early Jan-Jun grids differ from
those in the 1901-present publication).

(In both A. and B., it would usually be MCDW updates that carried the retrospective
data; such data is marked as 'OVERDUE'.)


Luckily, this isn't really up to me. Or.. is it? If the operator specifies a time period
to update, the Update program ought to warn if it finds earlier data in the update
files. So further mods to mcdw2cruauto are required: its results file must list any
extras. Or - ooh! How about a SECOND output database for the MCDW updates, containing
just the OVERDUE stuff?

Back.. think.. even more complicated. My head hurts. No, it actually does. And I ought
to be on my way home. But look, we create a new master database (for each parameter)
every time we update, don't we? What we ought to do is provide a log file for each
new database, identifying which data have been added. Oh, God. OK, let's go..

NEW DATA PROCESS

1. Ops runs 'Update', and chooses 'New Data'.

2. Ops selects MCDW, CLIMAT, and/or BOM data and gives update dates.

3. Ops selects 'interactive' or 'automatic' database merging.

4. Update checks source files are present and initiates conversion to CRU format.

5. Update runs the merging program to join the new data to the existing databases,
creating new databases. Any data for earlier periods found in the update files is
merged in as well.

5a. If Ops selected 'interactive', the merging program asks for decisions on 'difficult'
matches. These are all logged, of course.

6. Merge program creates a log of the changes between the old databases and the new
ones, inc. the source of the data. (A sketch of a driver script for this whole
sequence follows.)
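
Roughly, the driver for all that might look like the sketch below. The *2cruauto
programs are real, but every path, flag and the mergedb interface are guesswork:

  #!/bin/sh
  # Sketch of the 'New Data' branch of Update. Paths and options invented.
  SOURCES="mcdw climat bom"   # operator's selection (step 2)
  MODE=interactive            # or 'automatic' (step 3)
  for src in $SOURCES; do
      # step 4: check the raw source files are present
      [ -d "incoming/$src" ] || { echo "missing incoming/$src" >&2; exit 1; }
      # step 4 cont.: convert to CRU format (real program names, invented args)
      ${src}2cruauto "incoming/$src" "cruform/$src.dat" || exit 1
      # steps 5-6: merge into the master database, writing a new database
      # plus a change log (mergedb exists; this interface is guesswork)
      mergedb --mode "$MODE" --old db/current.dtb --new "cruform/$src.dat" \
              --out "db/updated.$src.dtb" --log "logs/merge.$src.log" || exit 1
  done

The log entries could be as simple as 'ADDED <station id> <name> from <source>
<period>' - whatever the format, the point is that each new database can be diffed
against its parent.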


UPDATE PROCESS

1. Ops runs 'Update', and chooses 'Update'. Yes, I know.

2. Ops gives parameter(s) and time period to update.

3. Ops specifies six-month interim or full update.

4. Update presents candidate databases for the update, and Ops chooses one.

5. Update runs the anomaly and gridding programs for the specified period (sketch below).
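
Equally roughly, the 'Update' branch (the anomaly and gridding program names below
are placeholders, not the real programs, and all the arguments are invented):

  #!/bin/sh
  # Sketch of the 'Update' branch. Everything below is illustrative.
  PARAM=$1            # e.g. tmp, pre (step 2)
  START=$2; END=$3    # period to update (step 2)
  KIND=${4:-interim}  # 'interim' (six-month) or 'full' (step 3)
  # step 4: offer the newest matching database as the default candidate
  DB=$(ls -t db/*."$PARAM".dtb | head -1)
  echo "Using $DB for $PARAM $START-$END ($KIND update)"
  # step 5: anomalise, then grid, the chosen period (placeholder programs)
  make_anomalies "$DB" "anom/$PARAM.$START-$END.txt" || exit 1
  grid_anomalies "anom/$PARAM.$START-$END.txt" "grids/$PARAM.$START-$END" || exit 1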



Note. The following shell command counts the stations reporting (i.e. with at least
one non-missing month) in a given year in a given database.

grep '^2006 ' tmp/tmp.0710011359.dtb | grep -v '\-9999\-9999\-9999\-9999\-9999\-9999\-9999\-9999\-9999\-9999\-9999\-9999' | wc -l
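
Or, wrapped up so the year and database aren't hard-coded (the function is mine, not
an existing tool; the twelve -9999s are the twelve monthly missing-value codes, so the
second grep drops stations with no data at all for that year):

  # count stations with at least one non-missing month in a given year
  count_reporting() {
      year=$1; dtb=$2
      missing=''
      for m in 1 2 3 4 5 6 7 8 9 10 11 12; do
          missing="${missing}-9999"
      done
      grep "^$year " "$dtb" | grep -vc -e "$missing"
  }
  # e.g.  count_reporting 2006 tmp/tmp.0710011359.dtb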


Discovered ('remembered' would be better; sadly I didn't) that I never got round to
writing a BOM-to-CRU converter. It got overtaken by the drastic need to get the tmin
and tmax databases synchronised (see above, somewhere). There was a barely-started
thing, so I cannibalised it for bom2cruauto.for, which eventually worked. In fact, it
was a good entry into the fraught world of automatic, script-fed programs.

Got bom2cruauto.for working, then climat2cruauto.for and mcdw2cruauto.for in quick
succession (the latter two having their output databases compared successfully with
those generated in Nov 2007).

Next, I suppose it's the next in the sequence: mergedb. This is where I'm anxious: I
want it all to be plain sailing and automatic, but I don't think there's any practical
way to relieve the operator of the need to make judgements on the possible mapping
of stations.
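
To illustrate the kind of judgement I mean, here's a toy scorer (nothing like the
real mergedb logic; fields, weights and thresholds all invented). Each input line
holds a candidate pair: lat1 lon1 name1 lat2 lon2 name2, names as single tokens:

  awk '
  function abs(x) { return x < 0 ? -x : x }
  {
      score = 0
      if (abs($1 - $4) < 0.1 && abs($2 - $5) < 0.1) score += 60  # co-located
      if (toupper($3) == toupper($6))               score += 40  # same name
      verdict = score >= 90 ? "MATCH" : (score >= 50 ? "ASK" : "NEW")
      print $3, "vs", $6, ": score", score, "->", verdict
  }' candidates.txt

The grey area in the middle ('ASK', 50-89 here) is exactly where a human has to look
at the metadata, and no choice of weights makes it disappear.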
