« What is Professional Services Automation (PSA)? | Main | solidDB beta for MySQL »

August 17, 2006

Sendmail Internal Error - Digging in Sendmail Source Code

Yesterday we got a notification from a customer that a few email messages had been returned undeliverable and after some discussion with their mail folks they thought we might be able to troubleshoot the issue.

The message stuck into the returned email was something like:

554 5.3.5 deliver: mci=947ee74 rcode=0 errno=9 state=0
sig=[]: Bad file descriptor
554 5.3.0 Internal error

I've been hacking away at the problem for almost a day now and am getting deeper and deeper into the depths of SMTP and sendmail. Normally a few minutes of trying different Google search terms will turn up other folks who's tackled the same problem, but it appears either nobody's ever dealt with this (unlikely) or that it happens infrequently and nobody has gotten to writing about it.

For awhile I was sure the message was being returned with the error from the destination, but on looking through our logs I changed to thinking it's a problem in Sendmail's attempts to deliver the message when it's finally time to send it off (these errors were preceeded by a period of very heavy greylisting).

Our mail logs have entries similar to this:

Aug 16 13:14:39 webserver sendmail[18093]: k7GH0uLW017508: SYSERR(root): deliver: mci=6a9408
rcode=0 errno=9 state=0 sig=mxone.remoteserver.com.:mxtwo.remoteserver.com.: Bad file descriptor
Aug 16 13:14:39 webserver sendmail[18093]: k7GH0uLW017508: to=<user1@clientdomain.com>,
<user2@clientdomain.com>, ctladdr= (500/100),
delay=00:13:43, xdelay=00:00:00, mailer=esmtp, pri=240521,relay=mxtwo.remoteserver.com.,
dsn=5.3.0, stat=Internal error

I'm definitely not a Sendmail guru, although I'm capable of getting it running and fiddling with the configuration files to make small changes. But since I can't seem to find any information anywhere I don't know where to go except for the Sendmail source code to see where this message is being generated. A little digging and I discover that the only system error that is generated in the above format is found in deliver.c, line 3208. This snip is the surrounding logic:

if (mci->mci_state != MCIS_OPEN)
                /* couldn't open the mailer */
                rcode = mci->mci_exitstat;
                errno = mci->mci_errno;
                if (rcode == EX_OK)
                        /* shouldn't happen */
                        syserr("554 5.3.5 deliver: mci=%lx rcode=%d errno=%d state=%d sig=%s",
                               (unsigned long) mci, rcode, errno,
                               mci->mci_state, firstsig);
                        rcode = EX_SOFTWARE;
                else if (nummxhosts > hostnum)
                        /* try next MX site */
                        goto tryhost;

mci is a reference to the Mail Connection Information Caching Module, a module that caches open connections as well as the status for all hosts (whether or not there is an open connection to that host). What seems to be happening is the mci structure is no longer usable as it's gotten into an error state. What's not clear is why a new mci isn't instantiated if the existing one is gone bad. I'm missing something here.

From the comments in mci.c, it appears there are a few things that affect how the mci caches the host connections:

**      There should never be too many connections open (since this
**      could flood the socket table), nor should a connection be
**      allowed to sit idly for too long.
**      MaxMciCache is the maximum number of open connections that
**      will be supported.
**      MciCacheTimeout is the time (in seconds) that a connection
**      is permitted to survive without activity.
**      We actually try any cached connections by sending a NOOP
**      before we use them; if the NOOP fails we close down the
**      connection and reopen it.  Note that this means that a
**      server SMTP that doesn't support NOOP will hose the
**      algorithm -- but that doesn't seem too likely.

So maybe it's a problem with the NOOP, or maybe it's the cache size or timeout. Looking back to the mail logs I stumble into this log entry from mci:

Aug 16 13:14:39 webserver sendmail[18093]: k7GH0uLW017508: MCI@0x6a9408: flags=6006c, errno=9, herrno=1, exitstat=0, state
=0, pid=0, maxsize=67000000, phase=client DATA 354, mailer=esmtp, status=(null), rstatus=(null), 
host=mxone.remoteserver.com., lastuse=Wed Aug 16 13:14:37 2006

If I do a grep for that specific mci (6a9408) it is tied to mxone.remoteserver.com. and is responsible for all messages that were aborted with an internal error on that webserver. Interestingly enough, there are places in the log where messages run into this mci in the process and cause information to spew into the logs, but when they are delivered they end up using the mci for mxtwo and get through just fine.

Upon further inspection the only difference between the mxone and mxtwo mci is the status and rstatus being (null). Both mxone and mxtwo have an errno of 9 during this time.

So it seems like the mci gets into a state that really warrants dropping it from the cache, but it hangs around and causes internal errors until it does get dropped. The other alternative is that the mail servers at the destination agree to accept the mail but aren't really ready causing the mci to fail.

For now I've put all of this information (in summary fashion) out to the support folks with a question about if I should keep digging or if we might wait and see if it happens again (it hasn't happened before to my knowledge).

Posted by mike at August 17, 2006 10:03 AM