On IIS6: A process serving application pool ‘MyAppPMool’ terminated unexpectedly. The process id was '1234'. The process exit code was '0xffffffff'

January 13, 2010 00:18 by Marc

Suddenly some apparently well performing IIS servers recently started reporting this error regularly. some of them were also running SharePoint or OWA. all of them configured to use Integrated Windows Authentication (IWA) as authentication mechanism.

The problem with IIS worker process is that it can have so many explanation depending on the code executed that you can easily waste a week until you find a reasonable explanation. In this case, all servers were affected, regardless of the application they run. So my first idea was “they might be under attack”. But that was not the case: performance counters related to the worker process did not give any sign of that, this was confirmed by the IIS logs. Next”usual suspect”, a patch recently installed: bingo, that was it. Here are the details:

  • The Security Update implementing “Extended Protection” for authentication in IIS (KB973917) was just deployed on all servers
  • All impacted servers are running Windows Server 2003 Service Pack 2
  • One or multiple application served by that application pool/worker process have IWA enabled
  • After intensive file version analysis, it appeared that numerous IIS-related files (EXE, DLL’s…) were with a version prior SP2

Due to the inconsistency of IIS files in combination with that extra hot fix, the worker process keeps crashing –> root cause found!

Now how to fix it:

  1. Perform an inventory of currently installed post-SP2 fixes. I personally do it in a very straightforward way using psinfo but I am sure you’ll find plenty of methods to do it the way you like
  2. Reinstall the Service Pack 2
  3. Redeploy post-SP2 hot fixes, see step 1
  4. Check installed IIS File versions
  5. If file versions are OK, Install KB973917

Additional information’s:

Note: Make sure you pay attention to the process exit code which is always 0xffffffff. If you see another code, it might of course have another cause.

Marc


Enterprise File Services using Windows Server 2008 R2 - Tools you can’t afford to miss Part I: FSCT (Overview)

October 7, 2009 09:51 by Marc

Since I do not like one-liner posts, I recently started writing an (ambitious?) serie of posts covering enterprise file services based on Windows Server 2008 R2 based on a presentation I gave to certain customers of mine . The first “shot” is dedicated to a new swiss-army knife-like tool from Microsoft: FSCT, standing for File Server Capacity Tool.

Until now, validating a Windows file server setup has always been a difficult task since very few tools were available on the market to adequately simulate a realistic user load. In past, tools like NetBench were considered as references. Nowadays, if you’re lucky, you can rely on your own scripting toolkit, if you’re not, you may have use Intel’s NAS Performance Toolkit, which is not bad, but farm from being “enterprise” ready. I’ve even seen some people trying to benchmark file services using SQLIO…

Architecturally speaking, FSCT is similar to other load-tests tools you would use of web application, for example: it is made of a controller, a server (the one to be benchmarked) and one or multiple clients. Optionally, you can also include an AD domain controller in the picture in order to simulate AD-based authentication. Nevertheless, FSCT is also compatible with workgroup environments, but in a degraded manner.

Note: Combining roles is a possibility but as you expect, it may negatively affect the tests. So if you’re short on machines, combine wisely and keep the other roles off the “server” role.

On the other hand, you can conduct test campaigns from more that one client simultaneously, that’s where the architectural choice is paying: up to my knowledge, no other tool can do that today.

Plan and deploy your test environment carefully…

  • First, take time to read the whole paper included in the package carefully. Everything you need to know about the tool is in there
  • Practise before conducting the “real” tests: since the tool is command-line based and due to the way it is distributed among systems, you may not get the result you expect from your first try (it’s not point-and-click)
  • Make sure all components involved are healthy: server & clients (network configuration, drivers…, but also network components (switches and if applicable, routers or access points…). A single component improperly working may severly affect the result of the tests (argh, hard-coded duplexing/link speed)
  • Copy FSCT to all systems involved (unless the server is not Windows-based) and build your own batch files to speed-up the configuration, the execution and finally the cleanup
  • Unless you want to reach the limits of an HP Proliant DL 58x, do not bump the client + user count to the maximum, plan then realistically. An in any case, it is not advisable to conduct your test in production, especially, for the network in general as well as for the sanity of the AD you would populate users in...

Don’t be afraid of command-line based execution

Okay, there is no UI but who cares? Me? Yes I’ve made my own little WinForm apps to save me time (I’ll post it in the coming weeks) but frankly, once your config files and batches are ready (it take 2 hours max), command-line rules supreme over the mouse;)

And before you ask, no, there is no PowerShell support, it sounds a little old-fashion I admit

Plan multiple test scenario’s keeping in mind important factors such as:

  • CIFS/SMB Version: depending on the client and server/Configuration OS version, the usage of SMB2 will greatly improve performances under virtually any circumstances. If you plan to use pre-Vista client OS or maybe have a mix of them, take this into account in your scenario’s
  • SMB-related security settings like signing and so on also affect performances
  • Other security configuration like TCP/IP stack hardening or IPSec
  • The presence of a file-based Anti-virus: it is wise to test with and without. You might be surprised by the performance loss an A-V implies, particularly on heavily used servers of course. BTW, since most of (if not all) A-V are implemented as file system filter driver, do not simply disable it during the tests, uninstall it, for certainty
  • Take into account the other side-activities, particularly server-side like: backups (using shadow copies or third party solutions), monitoring or other background processing tasks that may affect tests (reporting and so on…)
  • So called “performance-boost” tweaks like cache manager, NTFS tweaks, disk alignment, cluster sizes… All-in all, they may greatly affect the results. BTW, I will dedicate another post to those tweaks and debunk some myths at the same time as well

What do you get from the tests?

Besides generating the load itself, FSCT, assuming you’re working in a standard setup, will provide detailed tests results containing the following useful information’s retrieved from the server and client(s):

Data collected from the following performance counters:

  • \Processor(_Total)\% Processor Time
  • \PhysicalDisk(_Total)\Disk Write Bytes/sec
  • \PhysicalDisk(_Total)\Disk Read Bytes/sec
  • \Memory\Available Mbytes
  • \Processor(_Total)\% Privileged Time
  • \Processor(_Total)\% User Time
  • \System\Context Switches/sec
  • \System\System Calls/sec
  • \PhysicalDisk(_Total)\Avg. Disk Queue Length
  • \TCPv4\Segments Retransmitted/sec
  • \PhysicalDisk(_Total)\Avg. Disk Bytes/Read
  • \PhysicalDisk(_Total)\Avg. Disk Bytes/Write
  • \PhysicalDisk(_Total)\Disk Reads/sec
  • \PhysicalDisk(_Total)\Disk Writes/sec
  • \PhysicalDisk(_Total)\Avg. Disk sec/Read
  • \PhysicalDisk(_Total)\Avg. Disk sec/Write

As well as the following metrics (correlated to the number of users simulated):

  • % Overload
  • Throughput
  • # Errors
  • % Errors
  • Duration in ms

Once the % Overload is higher than 0%, it will indicate the threshold above which your file server infrastructure does not scale anymore for the given number of users

Special Cases

Using FSCT against DFS-N

Can FSCT work against DFS-N: yes it does. But it will not allow you to stress-test the DFS part of your design since it has no knowledge of it and does no embark any technology to capture DFS’ behavior during the load test. Moreover, it may require to configure the “server” part as if it was a non-Microsoft file server (see below for details). Moreover, capturing performance counters on the server using FSCT itself might be an issue, the workaround being the good old “manual” capture using perform or logman.

Using FSCT against a failover cluster

Using FSCT against a failover cluster works perfectly but with one limitation identical as above: the tool will ne be able to collect performance counters directly. Instead, you will have to plan for manual capture on the node designated as owner of the file share resource, or on both if you wish to perform failovers during the tests.

Using FSCT against non-Microsoft File Server or NAS

Assuming you can leave with the same limitations as stated above, FSCT will work like a charm against non-MS file server, including SOHO devices. Depending on the server or device’ capabilities, you might be able to collect a reduced set of performance indicators using SNMP polling for example. Of course, FSCT itself does not include any SNMP but there are plenty of tools available and during the test I lead Cacti was very helpful.

Ready, Go?

Well, not 100% ready yet. The only workload scenario available at RTM release time being “Home Folders”, you might not be able to validate your setup realistically. But according to MS, an SDK is on the way in order to allow the creation of custom workloads. In the mean time, you can already start playing with the tool itself and with the customization of “HomeFolder” profile config file but you will not go far with that.

Additional Resources

In a coming post I will cover practical usage of FSCT.

Marc


An existing connection was forcibly closed by the remote host

June 4, 2009 11:17 by Marc

Two different projects complaining about the same issue: nice troubleshooting challenge! One is SharePoint-based and the second is a, let’s say “entertaining” .Net-based application. They both make use of SQL Server as back-end data store and both complain about having “existing connection forcibly closed” reported in they stack trace when then attempt to connect to SQL.

This happens when a client application is trying to re-sued an existing TCP connection to a remote host while it closes it, making connection reuse impossible. There are actually multiple possible root causes which do no seem to be mutually exclusive:

Limit set on the number of connection allowed by SQL Server on a given instance

For a given SQL instance, you can set the maximum number of connections that can be used by applications. Depending on the way your application is written, multiple connection might be used for a single transaction… Raise the limit or set it to unlimited as necessary.

The (infamous) Scalable Networking Pack

The Scalable Networking Pack is a set of improvements brought to the Windows Networking stack. It is available as an add-on for Windows Server 2003 but is included from Service Pack 2 as well as from Windows Vista/2003.

This update greatly modifies the way Windows handles network connectivity at TCP-level and might therefore provoke the error. In short, the following settings should be modified on the SQL Server (or on any server acting as the server component):

In the registry, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

  • EnableTCPChimney (REG_DWORD) set to 0 (disabled)
  • EnableRSS (REG_DWORD) 0 (disabled)
  • EnableTCPA (REG_DWORD) 0 (disabled))

Applying the change requires a reboot.[UPDATE] Some MS sources report that a reboot is not necessary for some settings so I switch my statement to *might* require a reboot.

You’ll find a lot of trustworthy online resources recommending to disable the SMP…

On the other hand, recent NIC drivers may allow your system to work properly with these options set to enabled… Look at this page to get a list of SMP “partners”: http://technet.microsoft.com/en-us/network/cc984184.aspx.

Faulty NIC, NIC driver or driver settings 

Some NIC include a TCP Offload Engine (TOE). Incorrectly configured or running an out-dated, they will generate error at TCP-level.

In some cases, the TOE simply does not work, so you also might want to test with this function completely disabled. When editing you driver’s parameters, look for “Large Send Offload”, “Checksum offload”…

Import to note, you might also want to check the link speed and duplex at NIC level AND at switch port level, they might also cause the problem. Remember,: they must be identical on BOTH sides

Applying the change *might* requires a reboot.

Windows TCP/IP Stack Custom Configuration or Hardening

There are plenty of resources describing how to “harden” the Windows TCP/IP stack. Unfortunately most of them simply show the “how to”, not its consequences. Of of them being the performance decrease implied by hardening. You’ll also find those parameters under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\

In our case, the parameter “SynAttackProtect” set to 1 instead of 0 (disabled) will force Windows to be more restrictive regarding the incoming or TCP connection requests and well as more aggressive with the re)use of existing one. If the parameter is  enabled, the following additional parameters will also be taken into account:

  • TcpMaxPortsExhausted: Determines the maximum number of connections that can be opened before enabling protection against SYN attacks
  • TCPMaxHalfOpen: Determines the maximum number of connections that can be left “half-open” (waiting for re-use)
  • TCPMaxHalfOpenRetried: Same as above BUT applicable to connections that were effectively re-used by the original client

The parameters above are thresholds used by Windows to determine if a TCP-based (SYN) attack is in progress or not. They should only be used if the server is put in a high risk situation (DMZ or internet-facing) while there is not other security device put in place (Firewall…).

Note that, before Windows 2003 SP2, this SynAttackProtect is set to 0 while with SP2, it is set to 1 then with the latest versions of Windows, it returns to 0…

Automatic adjustment for the TCP window size (From Vista or 2008 only)

On the client side, Windows, starting from Vista, comes with a feature that allows to dynamically set the TCP windows size depending upon the network (remote host) conditions. See http://support.microsoft.com/kb/929868. But I frankly doubt it can be the root cause, I just documented it for comprehensiveness.

If your application is affected by those problem, I hope you’ll find the culprit amongst one of those.

Any network device catching the traffic at TCP-level

If there is any firewall in place, look at their logs, they might reveal that some connections are refused when the client attempts to re-use them.

More Information

Thanks to Tim B (MSFT) and Pascal B (MSFT) for the hints and guidance.

And cut!

The French Connection Poster


Windows cannot obtain the domain controller name for your computer network: Thanks AMD Opteron!

April 9, 2009 20:56 by Marc

This week’s challenge: find out how Windows DC locator process is related to AMD Opteron CPU’s…

Thanks to one of my favorite customer, we experienced that silly behavior on a set of new domain controllers running Server 2003 SP2 on servers equipped with ADM Opteron multi-core CPU’s.

When a Windows system is affected by this problem, the most common symptoms are:

  • Errors logged in the Event Log: Unexpected Network Error (Event ID 1054). Event Description: Windows cannot obtain the domain controller name for your computer network. (An unexpected network error occurred.) Group Policy processing aborted.
  • Ping command may report abnormal latency results
  • Other applications making use of the QueryPerformanceCounter will also report strange results

At boot time, Windows select a timer mechanism used to synchronize operations between processors and/or cores: it can be PM (power management) or TSC (time-stamp counter) timer. What determines the choice is the fact that the BIOS supports APIC/ACPI or not. In the case of the Opteron, in combination with an BIOS that is not up-to-date or improperly configured, Windows incorrectly selects TSC instead of PM.

As consequence of this, the different cores are getting desynchronized over time causing the symptoms described above. This can start happening a few minutes to a few hours after booting. Restarting the system temporarily solves the problem until the drift starts again with time.

Since many system and security mechanisms make use of time-stamps, they start failing with the time skew getting larger.

To override this dynamic behavior and force the usage of PM timer, the boot.ini must be edited in order to contain the /usepmtimer switch.

Wanna know more about the issue? Go there:

Greets to Thierry G and Arnaud M and thanks for the troubleshooting session!

And cut!


Two IE8 behavioral changes worth mentioning…

April 3, 2009 13:15 by Marc

Session-cookie handling

Have you noticed? When you log on a site that uses session cookie to store authentication information (like eBay), this cookie is by default shared between all instances (processes) of Internet Explorer.

Also, each tab is opened as a separated process which are all child of the same parent process. This parent process is handling this resource-sharing job.

From a Process Explorer point of view, it looks like this:

This new “feature”, called “Loosely-Coupled IE (LCIE)”, is described in details on the official MS IE Blog.

In order to workaround this behavior and be able to authenticate against the same web site but using different credentials, you need to start a new IE parent process which is running separately from the other one and therefore not sharing the same session-cookie. How to do? simply execute IE using the –nomerge parameter: C:\Program Files\Internet Explorer\iexplore.exe –normerge. for convenience, you can create a new IE shortcut which includes this parameter.

After starting this second parent IE, it will look like this in Process Explorer:

HTML5-compliance regarding the file path parameter

While trying to help a newsgroup poster making his IE-related script work with IE8, French MVP Gilles Laurent (Master es scripting and automation), bought to my attention that, in order to comply with HTML5 rules, IE8 was handling file path parameter from a file sector differently.

So in order to prevent information disclosure (the path to a file may include the user name if the file reside under the user 'profile), there are actually two changes combined to achieve that:

  • The IE security setting “include local directory path when uploading files to a server” (already present in IE7) is set to “Disable” instead of “Enabled” as it was with IE7 for the “Internet Zone”

  • When this setting is set to “Disable”, the path returned will always start with C:\fakepath\FILENAME.EXT instead of starting with the real path

Once again, the official MS IE Blog describes this behavior.

And cut!