Recently I had an intriguing issue at one of my customers I frequently visit. They saw their domain controllers having a CPU usage way above what we normally saw. Typically their domain controllers have between 0 – 15% CPU usage whereas now we saw all of their domain controllers running at 80 – 90% during business hours. A quick look showed us that the process which required this much CPU power was lsass.exe. Lsass.exe is responsible for handling all kind of requests towards Active Directory. If you want you can skip to the end to find the cause, but I’ll write this rather lengthy post nevertheless so that others can learn from the steps I took before finding the answer.
Some background: the environment consists of approximately 20.000 users. Technologies in the mix: Windows 7, Windows 2008 R2, SCCM, Exchange, … Our Domain Controllers are virtual, run 2008 R2 x64 SP1, have 4 vCPU’s and 16 GB RAM. RAM is a bit oversized but CPU is really the issue here. There are 8 Domain Controllers.
Here you can see a screenshot of a more or less busy domain controller. It’s not 80% on this screenshot but it’s definitely much more than we are used to see. Task Manager shows us without doubt that lsass.exe is mainly responsible.
Whenever a domain controller (or server) is busy and you are trying to find the cause, it’s a good idea to start with the built-in tool Perfmon. In the Windows 2003 timeframe articles you’ll see that they mention SPA (Server Performance Advisor) a lot, but for Windows 2008 R2 and up you don’t need this separate download anymore. Its functionality is included in Perfmon. Just go to Start > Run > Perfmon
Open up Data Collector Sets > System and right-click Active Directory Diagnostics > Start. By default it will collect data for 5’ and then it will compile a nice HTML report for you. And that’s where things went wrong for me. The compiling seemed to take a lot of time (20-30 minutes) and after that I ended up with no performance data and no report. I guess the amount of data to be processed was just to much. I found a workaround to get the report though:
While it is compiling: copy the folder mentioned in “Output” to a temporary location. In my example C:\PerfLogs\ADDS\20140909-0001
If the report would fail to compile you will see that the folder will be emptied. However we can try to compile to report for the data we copied by executing the following command:
- tracerpt *.blg *.etl -df RPT3870.tmp-report report.html -f html
The .tmp file seems to be crucial for the command and it should be present in the folder you copied. Normally you’ll see again that lsass.exe is using a lot of CPU:
The Active Directory section is pretty cool and has a lot of information. For several categories you can see the top x heavy CPU queries/processes.
A sample for LDAP queries, I had to erase quite some information as I try to avoid sharing customer specific details.
However in our case, for all of the possible tasks/categories, nothing stood out. So the search went on. We took a network trace on a domain controller using the built-in tracing capabilities (Setspn: Network Tracing Awesomeness)
- Netsh trace start capture=yes
- Netsh trace stop
In our environment taking a trace of 20 minutes or 1 minute resulted in the same. Due to the large amount of traffic passing by, only 1 minute of tracing data was available in the file. The file was approximately 130 MB.
Using Microsoft Network Monitor (Microsoft.com: Network Monitor 3.4 ) the etl file from the trace can be opened and analyzed. However: 1 minute of tracing contained about 328041 frames (in my case). In order to find a needle in this haystack I used the Top Users expert (Network Monitor plugin): Codeplex: NMTopUsers This plugin will give you an overview of all IP’s that communicated and how many data/packets they exchanged. It’s an easy way to see which IP is communicating more than average. If that’s a server like Exchange that might be legit, but if it’s a client something might be off.
As a starting point I took an IP with many packets:
I filtered the trace to only show traffic involving traffic with that IP
- IPv4.Address == x.y.z.a
Going through the data of that specific client seemed to reveal a lot of TCP 445 (SMB) traffic. However in Network Monitor this was displayed as all kinds of protocols:
As far as I can tell it seems that TCP 445 (SMB) is being used as a way for certain protocols to talk to Active Directory (lsass.exe on the domain controller). When a client logs on, you could explain TCP 445 as group policy objects being downloaded from the SYSVOL share. However in our case that definitely didn’t seem to be case. Browsing a bit through the traffic, it seemed that the LSAT and SAMR messages were more than interesting. So I changed my display filter:
- IPv4.Address == x.y.z.a AND (ProtocolName == “LSAT” OR ProtocolName == “SAMR”)
This resulted in traffic being displayed which seemed pretty easy to read and understand:
It seemed that the client was doing a massive amount of LsarLookupNames3 request. A typical request looked like this:
Moreover by opening multiple requests I could see that each request was holding a different username to be looked up. Now why would a client be performing queries to AD that seemingly involved all (or a large subset) of our AD user accounts. In my case, in the 60’ seconds trace I had my top client was doing 1015 lookups in 25 seconds. Now that has to count for some load if potentially hundreds of clients are doing this.
As the traffic was identified as SMB, I figured the open files (Computer Management) on the domain controller might show something:
Seems like a lot client have a connection to \lsarpc
And \samr is also popular. Now I got to be honest, both lsarpc and samr were completely unknown to me. I always thought whatever kind of lookups have to be performed against AD are over some kind of LDAP.
After using my favourite search engine I came up with a bit more background information:
It seems LsarOpenPolicy2, LsarLookupNames3, LsarClose are all operations that are performed against a pipe that is available over TCP 445. The LsarLookupNames3 operation seems to resolve usernames to SIDs.
Using the open files tool I tried identifying clients that were currently causing traffic. Keep in mind that there’s also traffic to those pipes that are absolutely valid. In my case I took some random clients and used “netstat –ano | findstr 445” to check if the SMB session remained open for more than a few seconds. Typically this would mean the client was actually hammering our domain controllers.
From this trace we can see that the process ID of the process performing the traffic is 33016. Using Task Manager or Process Explorer (SysInternals) you can easily identifiy the actual process behind this ID.
We could see that the process was WmiPrvSe.exe and that it’s parent process was Svchost.exe (C:\WINDOWS\system32\svchost.exe -k DcomLaunch). If you see multiple WmiPrvSe.exe processes, don’t be alarmed. Each namespace (e.g. root\cimv2) has it’s own process. Those processes are only running when actual queries are being handled.
In the above screenshot you can see, by hovering over the process, that this instance is responsible for the root\CIMV2 namespace. The screenshot was taken from another problem instance so the PID does not reflect the one I mentioned before.
The properties of the process / The threads tab:
You can clearly see that there’s a thread at the top which draws attention. If we open the stack:
Use the copy all button to copy paste in a text file:
We can clearly see that this process is the one performing the lookups. Now the problem remains… Who ordered this (WMI) query with the WMI provider? WmiPrvSe is only the executor, not the one who ordered it…
In order to get to the bottom of this I needed to find out who performed the WMI query. So enabling WMI tracing was the way to go. In order to correlate the events, it might be convenient to know when the WmiPrvSe.exe procss was created so that I know that I’m looking at the correct events. I don’t want to be looking at one of the the many SCCM initiated WMI queries! In order to know what time we should be looking for events, we’ll check the Security event log.
Using calc.exe we can easily convert the PID: 33016 to HEX: 80F8. Just set it to programmer mode, enter it having Dec selected and then select Hex
In our environment the security event log has entries for each process that is created. If that’s not the case for you, you can play around with auditpol.exe or update your GPO’s. We can use the find option and enter the HEX code to find the appropriate entry in the security event log. You might find some non related events, but in my case using the HEX code worked out pretty well. The screenshot is from another problem instance, but you get the point.
So now all we need is WMI tracing. Luckily with Windows 7 a lot of this stuff is available from with the event viewer: Applications and Services Log > View > Show Analytic And Debug Logs
Microsoft > Windows > WMI-Actvitiy > Trace
One of the problems with this log is that it fills up quite fast. Especially in an environment where SCCM is active as SCCM relies on WMI extensively. This log will show each query that is handled by WMI. The good part is that it will include the PID (process identifier) of the process requesting the query. The challenge here was to have tracing enabled, make sure this specific log wasn’t full as then it would just drop new events, and have the issue, which we couldn't reproduce, occur… With some hardcore patience and a bit of luck I found an instance pretty fast.
So I correlated the events with the process creation time of the WmiPvrSe.exe process:
Basically it’s a connect to the namespace, execute a query, get some more information and repeat process. Some of the queries:
- SELECT AddressWidth FROM Win32_Processor
- Select * from __ClassProviderRegistration
- select __RELPATH, AddressWidth from Win32_Processor
- select * from msft_providers where HostProcessIdentifier = 38668
- SELECT Description FROM Win32_TimeZone
And the last ones:
GroupOperationId = 260426; OperationId = 260427; Operation = Start IWbemServices::ExecQuery - SELECT Name FROM Win32_UserAccount; ClientMachine = Computer01; User = NT AUTHORITY\SYSTEM; ClientProcessId = 21760; NamespaceName = \\.\root\CIMV2
Now the Select name FROM Win32_UserAccount seems to be the winner here. That one definitely seems to be relevant to our issue. If we open up the MSDN page for WIN32_UserAccount: link there’s actually a warning: Note Because both the Name and Domain are key properties, enumerating Win32_UserAccount on a large network can negatively affect performance. Calling GetObject or querying for a specific instance has less impact.
Now the good part PID 21760 actually leads to something:
The process we found seems to be a service from our antivirus solution:
The service we found to be the culprit is the McAfee Product Improvement Program service (TelemetryServer, mctelsvc.exe). Some background information from McAfee: McAfee.com: Product Improvement Program
In theory this is the info they are gathering:
- Data collected from client system
- BIOS properties
- Operating System properties
- Computer model, manufacturer and total physical memory
- Computer type (laptop, desktop, or tablet)
- Processor name and architecture, Operating System architecture (32-bit or 64-bit), number of processor cores, and number of logical processors
- Disk drive properties such as name, type, size,and description
- List of all third-party applications (including name, version, and installed date)
- AV DAT date and version and Engine version
- McAfee product feature usage data
- Application errors generated by McAfee Processes
- Detections made by On-Access Scanner from VirusScan Enterprise logs
- Error details and status information from McAfee Product Logs
I don’t see any reason why they would need to execute that query… But the event log doesn’t lie. In order to be absolutely sure that this is the query that resulted in such a massive amount of traffic we’ll try to execute the query we suspect using wbemtest.
Start > Run > WbemTest
If you have Network Monitor running alongside this test you’ll see a lot of SAMR traffic passing by. So it seemed to be conclusive. The McAfee component had to go. After removing the Product Improvement Program component of each pc we can clearly see the load dropping:
To conclude: I know this is a rather lengthy post and I could also just have said “hey, if you have a lsass.exe CPU issues, just check if you got the McAfee Product Improvement Program component running” but with this kind of blog entries I want to share my methodology with others. For me troubleshooting is not only fixing the issue at hand. It’s also about getting an answer (what? why? how?) and it’s always nice if you learn a thing or two on your quest. In my case I learned about WMI tracing and lsarpc/samr. As always feedback is appreciated!