SCSI Timeouts when Symantec Core LC service starts


The problem is that the first time the Symantec Core LC service starts (usually 5-7 minutes after startup), the system becomes unresponsive for ~2.5 minutes. During that period I get System Error events (usually 6) logged by the SCSI controller (a320raid) as timeouts (event 9). These are often accompanied by System Warning events from the SCSI disk, logged as an error on the device during a paging operation (event 51).

 

Dell Precision 670 workstation

2 @ 3.4GHz XEON HT (HT enabled in BIOS)

2GB RAM

 

WinXP-Pro 5.01.2600 SP3

 

Clean install of Windows from Dell Reinstallation disk at SP1a.

All updates were applied before the NSW install. The only additional software installed before NSW was Dell drivers and utilities.

NSW 2008 Premier: installed NSW, NAV, and NS&R. Did not install GoBack.

Problem began immediately after the NSW install, although I cannot be certain whether it began before all Symantec updates were applied. I would guess that it did not, as those would have occurred during requested LiveUpdate runs.

 

Problem only occurs once per Windows session.

 

The problem can be triggered by issuing a Start to the Symantec Core LC service from the Services Control Panel, provided that the service was not previously started (a Restart will not trigger the problem).

 

I thought I had a workaround going when I found that issuing a Start to the Telephony service before I Started Symantec Core LC prevented the problem. This has not worked well in practice. Setting the Telephony service to Automatic causes it to show as Started, but has never prevented the problem. Manually Starting or Restarting the Telephony service after startup has (I think) sometimes prevented the problem. It did not work during the current Windows session, in which it was issued ~4.5 minutes before the Symantec Core LC service started on its own.

 

A requested LiveUpdate run does not appear to start the Symantec Core LC service, and it has the effect of postponing the problem until about 4 hours into the Windows session.

 

I'm posting this here because I spent yesterday lurking and found that many problems receive useful replies, and the chat session I had with an analyst this morning was not promising. I think he had some trouble getting his head around the oddities of the problem and perhaps did not believe my contention that it only occurs immediately after the first start of the Core LC service.

 

The analyst suggested that I stop and disable all Symantec services to see if the problem was still there.  When I asked how long I should wait for that which I was sure would not occur, he declined to be specific (presumably because he was confident it would).  He accepted a compromise wherein I have turned off Automatic LiveUpdate (rather than kill all my protection), restarted Windows, and am letting that machine sit for 8 hours.  If the problem was going to occur, it would have long since done so.

 

I spent days troubleshooting this as a drive or controller or cable problem. Dell offered to send me a new cable and drive, and if that didn't work, a new motherboard (the SCSI controller is embedded). Then I noticed that the last event before the errors was always the Core LC start.

 

I'm perfectly willing to start the clean Windows install all over again, if there's even a hint that I'll get a different result.

 

If someone here knows something about what's going on when the Core LC service starts, I'd be grateful for any information.  I might be able to come up with my own workaround.

 

TIA

Message Edited by Norwegian on 07-21-2008 11:40 AM

Progress, of sorts.  I've managed to trigger the system-hang and associated controller time-outs, by issuing a Start to the Core LC service.

 

I don't think Support would be very interested in this form of the problem.  But I'm kind of relieved that it's still present.

 

I'm going to spend some quality time digging through the logs to try to learn something about the Core LC service's behavior.  I hope that will help me create the conditions under which the problem will occur on its own.

Factoid:  If the Symantec Core LC service does not start within the first 21 minutes of a Windows session, it will not start less than 3 hours and 46 minutes into the session.

 

As a statement of fact, that’s probably not true, but it is what I found when I analyzed my System Event log. Since my objective on Wednesday was to get the controller timeout messages that indicated a system-hang had occurred, and to get them without any direct intervention on my part, I used that information to determine when to reboot.

 

I never did anything on the machine during a Windows session except check the System Event log. If 21 minutes passed without a logged error, or the Core LC service started without any apparent error, I issued a Restart.
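For anyone who wants to repeat this kind of log analysis, the reboot rule above can be sketched in Python. This is only a minimal sketch under assumptions that do not come from my own notes: the System Event log has been exported into simple records (the `Event` class and its fields are hypothetical), a new Windows session is marked by the Event Log service start (source `EventLog`, event 6005), and the Core LC start shows up as a Service Control Manager event 7036 whose text contains "Symantec Core LC service entered the running state". The controller timeout is the a320raid event 9 described earlier. Check all of these markers against your own log before trusting the output.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    """One exported System Event log entry (hypothetical export format)."""
    time: datetime
    source: str
    event_id: int
    message: str


# Assumed markers; check them against what your own log actually shows.
BOOT = ("EventLog", 6005)                 # Event Log service started -> new session
TIMEOUT = ("a320raid", 9)                 # SCSI controller timeout
SCM = ("Service Control Manager", 7036)   # service state change
CORE_LC_TEXT = "Symantec Core LC service entered the running state"


def split_sessions(events: List[Event]) -> List[List[Event]]:
    """Split time-ordered events into Windows sessions at each boot marker."""
    sessions: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: e.time):
        if not sessions or (ev.source, ev.event_id) == BOOT:
            sessions.append([])
        sessions[-1].append(ev)
    return sessions


def core_lc_minutes(session: List[Event]) -> Optional[float]:
    """Minutes from session start to the logged Core LC start, or None."""
    start = session[0].time
    for ev in session:
        if (ev.source, ev.event_id) == SCM and CORE_LC_TEXT in ev.message:
            return (ev.time - start).total_seconds() / 60
    return None


def timeout_count(session: List[Event]) -> int:
    """Number of controller timeouts logged in this session."""
    return sum(1 for ev in session if (ev.source, ev.event_id) == TIMEOUT)


def should_restart(session: List[Event], now: datetime,
                   limit_minutes: float = 21.0) -> bool:
    """Keep a session that logged timeouts; otherwise restart once Core LC
    has started cleanly or limit_minutes have passed without an error."""
    if timeout_count(session) > 0:
        return False                      # caught the error; keep this session
    if core_lc_minutes(session) is not None:
        return True                       # service started without error
    return now - session[0].time >= timedelta(minutes=limit_minutes)
```

Running `core_lc_minutes()` and `timeout_count()` over every session in an exported log would give the kind of per-session summary (minutes to Core LC start, number of timeouts) behind the 21-minute factoid.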

 

Eventually I got my error.

 

An aside:  I only have one network port available, so I’m swapping it between the new machine (the one with the errors) and the old.  One of the things I looked for in my log analysis was to see if connection to the network affected whether or when the Core LC service started, and the likelihood of the error condition presenting itself.  I found no correlation.

 

A few more statistics: I ran 21 Windows sessions while I was trying to catch the error. I commanded the Core LC service to start during session 6, without apparent error, and during session 7, which triggered the error. The Core LC service started without my intervention during sessions 3, 5, 9, 10, and 16. The “natural” start of the service during session 16 triggered the timeout errors. The service did not start during the final 5 sessions.

 

I will insert the caveat here that I did not check to see if the service was started; I only looked to see if its start was logged. If it started before the Event Log service, there would be no log entry. I think that is unlikely to have occurred. The only time I’ve caught it started without a log entry was when Analyst#1 left it set to Automatic.

 

Some restatement and clarification of the problem:  I consider the system-hang to be the problem, and the logged controller timeouts to be consequential.  I have no way of knowing if the hang is directly caused by Symantec software, or if the Symantec software is triggering a bug in MS or Dell software.  But the association with the start of the Core LC service is completely invariable.  Determining that the bug is in MS or Dell software will be very difficult without Symantec’s assistance.

 

There are two errors in my initial problem description.  The first is that I strongly imply that the system-hang always occurs when the service starts.  I may have believed that when I wrote it, and it might even be true; but it doesn’t always present itself, and it certainly does not always trigger logged controller timeouts.

 

The problem began on 7/10 after I installed NSW.  I don’t remember how quickly I noticed the errors, but I pursued them as hardware or firmware or BIOS errors until I discovered the association with the Core LC service on 7/17.  Some statistics from the period:
7/10 after NSW install, 3 service starts, all with logged controller errors.
7/11, 3 starts, 2 with errors.
7/12, 3 starts, 2 with errors.
7/13, 3 starts, 1 with errors.
7/14, 2 starts, 1 with errors.
7/15, 3 starts, 1 with errors.
7/16, 2 starts, 2 with errors.

 

The other error in my problem description is that it implies that the system-hangs are all ~2.5 minutes long.  I knew that wasn’t true.  That was the most common duration, but sometimes it’s shorter.

 

I get 6 logged controller timeouts during the ~2.5 minute version of the hang.  They are in 3 groups of 2.  The log entries within a group are time-stamped 2 seconds apart.  The groups are time-stamped about 1 minute and 15 seconds apart.
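A short Python sketch of that grouping, for anyone who wants to count hangs from their own log. It assumes the a320raid timeout timestamps have already been pulled out of the System Event log; the function name `group_bursts` and the 10-second threshold are my own illustrative choices, picked because entries within a group are ~2 seconds apart while groups are ~75 seconds apart. The timestamps in the example are made up to match that spacing, not copied from my log.

```python
from datetime import datetime, timedelta
from typing import List


def group_bursts(times: List[datetime],
                 max_gap: timedelta = timedelta(seconds=10)) -> List[List[datetime]]:
    """Group timestamps into bursts: a new burst starts whenever the gap
    to the previous timestamp is larger than max_gap."""
    bursts: List[List[datetime]] = []
    for t in sorted(times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)      # close to the previous entry: same burst
        else:
            bursts.append([t])        # long gap: start a new burst
    return bursts


# Illustrative timestamps: pairs 2 s apart, pairs ~75 s apart.
t0 = datetime(2008, 7, 16, 9, 6, 0)
times = [t0 + timedelta(seconds=s) for s in (0, 2, 75, 77, 150, 152)]
bursts = group_bursts(times)
print(len(bursts), [len(b) for b in bursts])   # prints: 3 [2, 2, 2]
print((times[-1] - times[0]).total_seconds())  # prints: 152.0
```

With this grouping, 6 messages come out as 3 bursts, 4 as 2, and 2 as 1, so the number of bursts is one rough count of separate hangs, and the first-to-last span (152 seconds here) lines up with the ~2.5 minute figure.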

 

I believe this represents at least 3 separate “hangs”. The corresponding desktop behavior is that I might click on something and the system does not promptly respond. Eventually it does, but when I then try to do something else, it’s unresponsive again.

 

So I very commonly get 6 error messages, but sometimes only 4, and sometimes only 2. And, of course, sometimes I get none, which can only be construed as a lack of evidence.

 

Yesterday I got to meet Analyst#3, a silver-tongued fellow. When he first suggested establishing a Remote Desktop session, I declined, saying that I wanted to chat some more first to ensure a common understanding of the issue.

 

Analyst#3 seemed to know a little more about the Core LC service, but he didn’t seem very confident about his knowledge.

 

He pushed me, gently, to remove and reinstall NSW.  I resisted, saying that I was willing but wanted some reason to believe I’d get a different result.

 

Finally he said, “try reinstall of Norton System Works and I am sure that you will find it more optimized.”

 

How could I resist?

 

I ran the Removal Tool and reinstalled from a fresh download. After the installation was complete, on the first restart where the installation program did not trigger an immediate LiveUpdate, I ran one from the Protection Center console. Got the full 2.5-minute event.

 

I’ve encountered numerous complaints by people that the Removal Tool does not, in fact, remove everything.  I’d like to say that I was pleased and grateful to see that it restored all of my settings for the product.

 

I’m going to take a break, and then go meet Analyst#4.

 

But I’d like to report the end of the chat session with Analyst#3.

 

I concluded with, “Thank you, sir, and Good Day.”

 

When I got back his “You are welcome,” I closed the session and began to archive the chat log. Then I noticed that he had sent another message.

 

“I believe we have fixed the issue. Can you please confirm if this issue has been resolved to your satisfaction?”

 

You’ve got to admire guts.

 

I’ve now had sessions with Analysts #4 and #5. I finally got escalated!

 

Analyst#5 ran a log extraction tool, and has submitted the data for analysis.

 

While I was still chatting with Analyst#5, I think I got a call from Symantec Support.

 

The Caller ID was ISEVA, from a 201 area code. I could not understand anything the caller said.

 

That makes me feel like an idiot.  I think I’m below average in my ability to understand English spoken with a strong accent.  I wish that weren’t the case.

 

Especially since, if Symantec Support wants to talk to me, I really want to talk to them.

 

It took three tries for me to understand that the first sentence contained my first name.  The second sentence sounded like he was offering me a credit card or credit card number.  It may not actually have contained the word “credit” or “card,” but perhaps he was trying to offer me a prepaid debit card if I would agree to quit bothering them (just kidding).

 

When I apologized for not being able to understand him, he said something that included (I think) the word “wait.”  Shortly I began to get a fast busy-signal, and eventually I hung up.

 

I’m left on the proverbial horns.  I thought I’d spend three or four days setting up the new machine, and then pull the drives from the old one and install them in the new one.  And then I’d be, more or less, in business.

 

But I stopped the set-up on the new machine when the problem showed up, initially because I thought I might have to replace the hard-drive.  That was two full weeks ago.  Now I’m left with not wanting to complicate things by installing more software, and the feeling that each day increases the chances that I will either have to or want to start over with a clean install of Windows.

I have no idea where I stand with Symantec Support.

 

Last Friday (25 July) I got escalated, log data was extracted and uploaded, and submitted for analysis; and I may have gotten a phone call.

 

But nothing since.

 

Grrrrrrrrrrrrr.


Norwegian wrote:

I have no idea where I stand with Symantec Support.

 

Last Friday (25 July) I got escalated, log data was extracted and uploaded, and submitted for analysis; and I may have gotten a phone call.

 

But nothing since.

 

Grrrrrrrrrrrrr.


 

From this I assume you have a case number or other reference. If you post it here someone may be able to follow up and find out what is happening.

I'm experiencing the same problem. I have a case open with Symantec {removed}. I've had this case open for many months. Symantec staff have called me several times, and they discuss the same things every time. They don't seem interested in fixing the problem. I have a Dell 380 workstation. I suspect that part of the problem might be a conflict between Norton (in my case Norton Internet Security) and either the Dell or the Adaptec BIOS. I have an A320 RAID adapter too. From looking at the Windows System Event log and watching what my system does when the timeout occurs, I think that the Norton software is making a bad BIOS call, meaning that the software is doing something that it shouldn't do with a SCSI drive.

Pete

 

 

 

[edit: removed case# from public post; Symantec employees needing this information can obtain it from an admin/mod.]

Message Edited by Allen_K on 08-31-2008 01:48 PM

Hi, pesaxe.

 

I've sent you a PM to offer you what I now know about this.

 

I can tell you who I'm communicating with at Symantec, and at Dell.  In each case I'd have to ask their permission, and I'll guess that some identification would be required.  I wouldn't want to get in the middle of that, but I'll predict that one of the mods here would take care of it.

Hello,

      What's a "PM"?  I would like to send you a Word doc that I put together for the Symantec staff describing the problem, as a double check on what we're experiencing.  This could help assure that you and I really are experiencing the same thing.  

           Pete


pesaxe wrote:

 

      What's a "PM"?


PM = Private Message - look for the in the upper right hand corner of any community (forum) page.

 

I was skeptical, but Pete has sent me his information and it’s the exact same problem.

 

Unfortunately the systems are similar enough that it could be the exact same problem in the hardware or software or firmware or BIOS.

 

The increase in statistical significance from one to two is huge, although admittedly not yet quite what you’d call large.

 

One way or another, what fixes one of us, will fix us both.

I don’t know if anyone is reading this, but I’m going to log my experiences, if only to clarify my thinking.

 

I’m going to criticize the analysts I’ve dealt with, but there’s no heat in my criticism.  They seem to have been nice people who were trying to help me, within the limits of their training and the resources available to them.

 

I have had two chat sessions so far, one on Monday and the other on Tuesday. Each included a Remote Desktop session (I can’t manage to be very comfortable with that, but I didn’t wish to impede the analysts).

 

The problem may be gone.  There seemed to be a reasonable chance that it would occur on Startup this morning, and it did not.  I’m going to issue periodic Restarts to the system and see if it shows up.  If it does not, I don’t have much of an idea why.

 

The only changes to the system, of which I’m aware, are as follows:

1. I changed the Startup of the Telephony service from Manual to Automatic. I did this as part of my testing before I went to Symantec for help. In my testing I found that starting the Telephony service before I started the Symantec Core LC service prevented the problem (2.5 min. system hang). The Startup change did not eliminate the problem but seemed (caution, small sample) to reduce the probability.

2. Analyst#1 changed a number of Symantec services from Manual to Automatic.  I reversed his change for the Core LC service, as I’ll explain shortly.

3. At Analyst#1’s suggestion, I turned off Automatic LiveUpdate from the NAV console.

 

I believe that Analyst#1 became convinced that I had a hardware problem and was seeking to remove any basis for my contention that there was a congruence with the Core LC service.  He (gender presumption) assured me that the Core LC service would not start with ALU turned off.  However, he had left it set to Automatic so I got the system hang quite promptly when I restarted the machine (much earlier than it had previously occurred).

 

So I changed the Core LC service startup back to Manual, and restarted the machine again.  I do not appear to have gotten the system-hang since this point.  I certainly have not gotten any errors logged by the SCSI controller.  The Core LC service did start, somewhere around six hours into the Windows session.

 

After Startup on Tuesday had passed without incident, I initiated another Chat to report that the system-hang appeared to be gone with ALU off.

 

Analyst#2 did not, to my knowledge, make any permanent changes to my system.  I made two changes at his (another gender presumption) request (numbering continued from above):

4. I added a password to my User Account when this proved to be required in order to run a test of the Task Scheduler function.

5. I turned ALU back on from the NAV console.

 

So the problem may be gone, but why?  I’ve restarted Windows several times, and even issued a Start to the Core LC service during one session.  Nothing.

 

I’m going to try to determine which services Analyst#1 changed from Manual to Automatic, and see if any of them are suggestive.

 

I’m going to compare Application and System Event Logs from the current sessions, to those from past sessions when the system-hang occurred.

 

At some point, I will remove the password from my User Account.

 

I may try changing the Telephony service back to Manual.

 

I will almost certainly post some excerpts and comments about Chat Session #2.  Identifying information will be redacted for both the analyst and me, but Analyst#2 has earned a little heat.