
Date:  Tue, 25 Aug 2009 02:30:52 +0200
From:  Michael Stauber <bq (at mark) solarspeed.net>
Subject:  [coba-e:15920] Re: Unable to create new records for DNS, users, etc.
To:  coba-e (at mark) bluequartz.org
Message-Id:  <200908250230.52507.bq (at mark) solarspeed.net>
In-Reply-To:  <op.uy6w64jn3mvmur (at mark) presto.technologynorth.net>
References:  <op.uy6w64jn3mvmur (at mark) presto.technologynorth.net>
X-Mail-Count: 15920

Hi J.D.,

> Recently our webserver had a problem with a full disk on the root
> partition, where BlueQuartz stores the codb.

Ouch. That's pretty bad. Whenever a Linux box runs into a 100% full / partition
"bad things" happen. For example: A program tries to write to a file. It opens
the file for writing, which erases the content, and then tries to write the new
data. Then it realizes: "Dang, not enough space!" and the net result is that
the file (or at least all data in it) got destroyed in the process.

So it can be assumed that a number of files that were written to while the
disk was 100% full may have been corrupted.
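
To illustrate the failure mode above, here is a minimal Python sketch of the
usual defence: write the new data to a temporary file in the same directory
and only then rename it over the original. If the disk fills up during the
write, the old file is still intact. This is a general technique, not a
description of how CCE writes its files, and the function name atomic_write
is made up for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` without truncating it first."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # a full disk surfaces here, not later
        os.replace(tmp_path, path)  # old content stays until this succeeds
    except BaseException:
        # Something went wrong (e.g. no space left): drop the partial
        # temporary file and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Opening the target directly with mode "w" would truncate it before a single
byte of new data is written, which is exactly the scenario described above.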

> We currently have around 3800 directories in /usr/sausalito/codb/objects,
> so there should be room for more, but whenever we try to create anything
> new, we get an error like: UNKNOWN ERROR DURING CREATE.
>
> In /var/log/messages we get errors like this as well:
>
>   ww1 cced(smd)[28872]: client 7:[48:25081]: CREATE  "DnsRecord"
> "mail_server_name" "=" "domain.ca" "type" "=" "MX" "domainname" "="
> "domain.ca" "mail_server_priority" "=" "low" "hostname" "=" ""
> Aug 24 15:08:34 ww1 cced(smd)[28872]: client 7:[48:25081]: CREATE
> DnsRecord failed (-7)

The error message "UNKNOWN ERROR DURING CREATE" and the "-7" status message in 
/var/log/messages indicate a problem that I know pretty well. 

It's a little complicated to explain, but I'll do my best:

CODB has "Classes" and "Objects". 

A "Class" defines a database Object: what kind of storage fields (keys) it
has inside and which kind of data (values) they take. For example a
database field of type "ipaddr" will only take values that are valid IP
addresses and nothing else.
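
As a rough illustration of the Class idea, the following Python sketch checks
an Object's fields against a Class definition. It is not CODB code: the Class
EXAMPLE_CLASS, its field names, the "scalar" type and the IPv4-only check are
all assumptions made for this example. Only the "ipaddr" type name comes from
the explanation above.

```python
import ipaddress


def is_ipaddr(value):
    """Return True if `value` is a valid IPv4 address string (assumption)."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


# Hypothetical type table: type name -> validator function.
VALIDATORS = {
    "ipaddr": is_ipaddr,
    "scalar": lambda value: True,  # made-up "anything goes" type
}

# Hypothetical Class: field name (key) -> type of the value it accepts.
EXAMPLE_CLASS = {"address": "ipaddr", "comment": "scalar"}


def check_object(schema, values):
    """List fields that are unknown to the Class or hold an invalid value."""
    problems = []
    for field, value in values.items():
        if field not in schema:
            problems.append(f"{field}: not defined in the Class")
        elif not VALIDATORS[schema[field]](value):
            problems.append(f"{field}: invalid {schema[field]} value {value!r}")
    return problems
```

For example, check_object(EXAMPLE_CLASS, {"address": "domain.ca"}) reports the
field as invalid, because "domain.ca" is not an IP address.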

When an Object is created, the create action specifies which Class the new
Object belongs to. Even if you write no data at all into the new Object, all
the default storage fields as defined for this Class are created in the new
Object. These can later be populated with whatever data you want to write into
the Object.

Now on to your problem: 

CODB also has an Index. That's basically a text file which keeps track of which
Object IDs (numbers) are already taken and which ones are free for use. Of
course every Object must have its own unique Object ID. No two Objects may have
the same ID.

Whenever a new Object is created, CCE consults the Index to find the lowest
Object ID that is still free. It then creates the new Object with that ID.
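
The lookup described above can be sketched in a few lines of Python. This is
only an illustration, not the actual CCE implementation. The function name
lowest_free_id and the assumption that Object IDs start at 1 are mine.

```python
def lowest_free_id(used_ranges):
    """Return the lowest Object ID not covered by any (start, end) range.

    `used_ranges` is a list of inclusive (start, end) tuples of taken IDs.
    Assumption for this sketch: the first valid Object ID is 1.
    """
    candidate = 1
    for start, end in sorted(used_ranges):
        if candidate < start:
            break  # found a gap before this range
        candidate = max(candidate, end + 1)
    return candidate
```

With the ranges (1, 554) and (570, 595) taken, this returns 555. With an empty
list it returns 1, no matter how many Objects really exist on disk.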

You probably see the problem already:

When your / partition was 100% full, the file that contains the list of used
Object IDs (the Index) apparently got messed up. When your GUI then tried to
create new Objects, it found an empty Index (because it got destroyed) and
therefore started to re-use Object IDs which were already in use.

As a result, the Object directories got populated with database fields from
more than one Class. That in turn essentially corrupted those Objects to a
point where CODB can no longer use them, and this causes the
"UNKNOWN ERROR DURING CREATE".

The "-7" status message appears whenever CODB tries to update an existing
Object with information and suddenly finds database fields in it which,
according to the Schema for this Class, shouldn't be there.

I hope you could follow me so far.

Now how to fix it?

In short: This is a trainwreck and a usually non-recoverable situation.

If you have a backup copy of /usr/sausalito/codb/ which was taken like a day 
before the crash, then you might want to try to use that one instead. But even 
then you could run into major inconsistencies like missing users, missing 
sites, changed settings and what not. This depends on how many changes were 
made through the GUI. Not only by you, but also by siteAdmins and regular 
users.

Even a CMUexport / CMUimport may not work, as CMU might be unable to export
the data correctly if CODB is so highly inconsistent and messed up. In that
case you might have to fall back to a CMUexport that was taken before the
accident.

Is a manual repair of CODB possible? Yes. But is it practical? Probably not. 
You said yourself: You've got 3800 database Objects. One would need a good 
familiarity with the different CODB Classes and would need to examine all 3800 
Objects to make sure each only contains database fields which are expected to 
be there according to the Schema fields for that Class. There are no automated 
tools available for that and doing it manually could be a herculean task.

Additionally one would have to re-create the CODB Index file from scratch -
which (of all things) is the most trivial part.

It's the file /usr/sausalito/codb/codb.oids which contains the Index and in an 
example box of mine it looks like this:

1-554,570-595

Which means: Object IDs 1-554 and 570-595 are taken. All others are free.
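
For anyone who wants to inspect or rebuild such a line, here is a small Python
sketch that converts between the range notation and a set of Object IDs. It is
based only on the example line above. How a single isolated ID is written is
not shown there, so writing it as a bare number is an assumption, and the
function names are made up.

```python
def parse_oids(text):
    """Turn a line like '1-554,570-595' into a set of taken Object IDs."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        start, _, end = part.partition("-")
        # Assumption: a part without '-' is a single taken ID.
        ids.update(range(int(start), int(end or start) + 1))
    return ids


def format_oids(ids):
    """Turn a set of taken Object IDs back into the compact range notation."""
    ranges = []
    for oid in sorted(ids):
        if ranges and oid == ranges[-1][1] + 1:
            ranges[-1][1] = oid  # extend the current run
        else:
            ranges.append([oid, oid])  # start a new run
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in ranges)
```

If the directories under /usr/sausalito/codb/objects are named after their
Object IDs (an assumption worth checking on the box first), passing those
numbers to format_oids would yield a candidate Index line.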

Over the years I've run into this issue a few times (as recently as last 
Saturday) and usually the quickest way to recover from it is to restore from 
the backups.

I'm sorry to report this bad news, but a 100% full / partition can be pretty
destructive. :o(

-- 
With best regards,

Michael Stauber