
Date:  Mon, 24 Aug 2009 19:02:25 -0600
From:  "J.D. Lien" <jd (at mark) fullspec.ca>
Subject:  [coba-e:15921] Re: Unable to create new records for DNS, users, etc.
To:  coba-e (at mark) bluequartz.org
Message-Id:  <op.uy606bp03mvmur@behemoth>
In-Reply-To:  <200908250230.52507.bq (at mark) solarspeed.net>
References:  <op.uy6w64jn3mvmur (at mark) presto.technologynorth.net> <200908250230.52507.bq (at mark) solarspeed.net>
X-Mail-Count: 15921

Hi Michael:

Many thanks for your thorough and quick reply.

You're right - that's certainly not the news we wanted to hear!
Unfortunately our backup might not be in the best of shape. It should
have run early this morning, but for some reason the codb.oids and
db.classes files are not present in it. I think that is because of the
permissions on them (600 root:root).

My codb.oids looks like this:
1-2650,2653-2732,2742-2819,2825-2829,2831-2837,2839-2850,2855-2870,2876-2897,2903-3116,3122-3129,3131-3162,3164-3215,3221-3377,3380-3414,3420-3500,3502-3581,3583-3800,3815-3820
Which seems a little longer than is typical. I could try restoring from
the backup and see what happens; if I keep the files I currently have,
I can always go back to them. A lot of changes were made today and it's
unfortunate that we're going to lose those, but typically this stuff only
changes when we add a new site, because this machine doesn't deal with
mail, just websites and DNS.

I'll have a go at restoring from backups and see how bad it ends up :(


Thanks again
-J.D.

On Mon, 24 Aug 2009 18:30:52 -0600, Michael Stauber <bq (at mark) solarspeed.net>  
wrote:

> Hi J.D.,
>
>> Recently our webserver had a problem with a full disk on the root
>> partition, where BlueQuartz stores the codb.
>
> Ouch. That's pretty bad. Whenever a Linux system runs into a 100% full /
> partition, "bad things" happen. For example: Linux tries to write to a
> file. It opens the file for writing, erases the content and then tries to
> write the new data. Then it realizes: "Dang, not enough space!" and the net
> result is that the file (or at least all data in it) gets destroyed in the
> process.
>
> So it can be assumed that a number of files that were written to during the
> 100% disk utilization situation may have been corrupted.
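
The truncate-then-write failure described above is commonly avoided by writing to a temporary file and renaming it over the original only once the write has succeeded. The following is a minimal sketch of that general technique, not how CCE actually writes its files; the helper name atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data without truncating it in place.

    Hypothetical helper for illustration; not part of BlueQuartz or CCE.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # a full disk surfaces as an error here
        os.replace(tmp_path, path)  # swap in the new file in one step
    except BaseException:
        # The write or rename failed: the old file is still intact, so
        # only the temporary file needs cleaning up.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

If the disk fills up during the write, the exception propagates and the previous contents of the target file remain untouched.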
>
>> We currently have around 3800 directories in  
>> /usr/sausalito/codb/objects,
>> so there should be room for more, but whenever we try to create anything
>> new, we get an error like: UNKNOWN ERROR DURING CREATE.
>>
>> In /var/log/messages we get errors like this as well:
>>
>>   ww1 cced(smd)[28872]: client 7:[48:25081]: CREATE  "DnsRecord"
>> "mail_server_name" "=" "domain.ca" "type" "=" "MX" "domainname" "="
>> "domain.ca" "mail_server_priority" "=" "low" "hostname" "=" ""
>> Aug 24 15:08:34 ww1 cced(smd)[28872]: client 7:[48:25081]: CREATE
>> DnsRecord failed (-7)
>
> The error message "UNKNOWN ERROR DURING CREATE" and the "-7" status message
> in /var/log/messages indicate a problem that I know pretty well.
>
> It's a little complicated to explain, but I'll do my best:
>
> CODB has "Classes" and "Objects".
>
> A "Class" defines a database Object: what kind of storage fields (keys) it
> has inside and which kind of data (values) they take. For example, a
> database field of type "ipaddr" will only take values that are valid IP
> addresses and nothing else.
>
> When an Object is created, the create action specifies which Class the new
> Object belongs to. Even if you write no data at all into the new Object, all
> the default storage fields as defined for this Class are created in the new
> Object. These can later be populated with whatever data you want to write
> into the Object.
>
> Now on to your problem:
>
> CODB also has an Index. That's basically a text file which keeps track of
> which Object IDs (numbers) are already taken and which ones are free for
> use. Of course every Object must have its own unique Object ID. No two
> Objects may have the same ID.
>
> Whenever a new Object is created, CCE refers to the Index to check which
> is the lowest Object ID still free for use. It then creates the new Object
> with the lowest free Object ID as reported by the Index.
>
> You probably see the problem already:
>
> When your / partition was 100% full, the file that contains the list of
> used Object IDs (the Index) apparently got messed up. When your GUI then
> tried to create new Objects, it found an empty Index (because it got
> destroyed) and therefore started to re-use Object IDs which were already in
> use.
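
The allocation rule and the failure mode described above can be sketched in a few lines. This is only an illustration, not CCE's actual code: the function name lowest_free_id is hypothetical, and it assumes Object IDs start at 1, as the index examples in this thread suggest.

```python
def lowest_free_id(used_ids):
    """Return the smallest positive integer that is not in used_ids.

    Hypothetical illustration of "pick the lowest free Object ID";
    assumes IDs start at 1.
    """
    used = set(used_ids)
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


# With an intact index, the first gap is handed out.
print(lowest_free_id({1, 2, 3, 5}))  # prints 4

# With an empty (destroyed) index, ID 1 is handed out again,
# even though an Object with that ID may already exist on disk.
print(lowest_free_id(set()))  # prints 1
```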
>
> This in turn caused the Object directories to be populated with database
> fields from more than one Class. That essentially corrupted those Objects to
> a point where CODB can no longer use them, and this causes the
> "UNKNOWN ERROR DURING CREATE".
>
> The "-7" status message appears whenever CODB tries to update an existing
> Object with information and suddenly finds database fields in it which,
> according to the Schema for this Class, shouldn't be there.
>
> I hope you could follow me so far.
>
> Now how to fix it?
>
> In short: This is a trainwreck and usually a non-recoverable situation.
>
> If you have a backup copy of /usr/sausalito/codb/ which was taken like a
> day before the crash, then you might want to try to use that one instead.
> But even then you could run into major inconsistencies like missing users,
> missing sites, changed settings and what not. This depends on how many
> changes were made through the GUI. Not only by you, but also by siteAdmins
> and regular users.
>
> Even a CMUexport / CMUimport may not work, as CMU might be unable to export
> the data correctly if CODB is so highly inconsistent and messed up. In that
> case you might have to fall back to a CMUexport that was taken before the
> accident.
>
> Is a manual repair of CODB possible? Yes. But is it practical? Probably not.
> You said yourself: You've got 3800 database Objects. One would need a good
> familiarity with the different CODB Classes and would need to examine all
> 3800 Objects to make sure each only contains database fields which are
> expected to be there according to the Schema fields for that Class. There
> are no automated tools available for that and doing it manually could be a
> herculean task.
>
> Additionally one would have to re-create the CODB Index file from scratch -
> which (of all things) is the most trivial.
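
As a rough sketch of what re-creating the Index could involve: the snippet below collapses a list of Object IDs into the comma-separated range notation that codb.oids uses in this thread. It assumes, without confirmation from the thread, that the directory names under /usr/sausalito/codb/objects are the numeric Object IDs and that a lone ID is written as a bare number. Both function names are hypothetical.

```python
import os


def compress_ids(ids):
    """Collapse integers into a string such as "1-3,7-8,10".

    Assumption: a single ID is written as a bare number; the thread
    only shows ranges, so this may not match the real file format.
    """
    ranges = []
    for oid in sorted(set(ids)):
        if ranges and oid == ranges[-1][1] + 1:
            ranges[-1][1] = oid  # extend the current run
        else:
            ranges.append([oid, oid])  # start a new run
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


def ids_from_object_dirs(objects_dir):
    """Collect IDs from entries whose names are purely numeric.

    Assumption: each Object lives in a directory named after its ID.
    """
    return [int(n) for n in os.listdir(objects_dir) if n.isdecimal()]


print(compress_ids([1, 2, 3, 7, 8, 10]))  # prints 1-3,7-8,10
```

Anything generated this way would need to be checked against a known-good codb.oids before being put in place, and only on a copy of the database.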
>
> It's the file /usr/sausalito/codb/codb.oids which contains the Index and in
> an example box of mine it looks like this:
>
> 1-554,570-595
>
> Which means: Object IDs 1-554 and 570-595 are taken. All others are free.
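
To make the notation concrete, here is a small sketch that expands such a line into the set of taken Object IDs. The function name parse_oid_ranges is hypothetical, and accepting a bare number for a single ID is an assumption; the examples in this thread only show ranges.

```python
def parse_oid_ranges(text):
    """Expand a string such as "1-554,570-595" into a set of integers.

    Hypothetical helper; bare single numbers are accepted as an
    assumption about the format.
    """
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            start, end = part.split("-", 1)
            ids.update(range(int(start), int(end) + 1))  # inclusive range
        else:
            ids.add(int(part))
    return ids


used = parse_oid_ranges("1-554,570-595")
print(len(used))    # prints 580
print(560 in used)  # prints False
```

An empty input yields an empty set, which is the situation in which already-used IDs would be handed out again.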
>
> Over the years I've run into this issue a few times (as recently as last
> Saturday) and usually the quickest way to recover from it is to restore from
> the backups.
>
> I'm sorry to report this bad news, but a 100% full / partition can be pretty
> destructive. :o(
>


-- 
J.D. Lien
Ph: 780-702-3114