Tuning vCenter Operations Manager – Going from OMG THE SKY IS FALLING to Relevant Alerts

If you’re new to vCenter Operations Manager, or vCOPS as it is called (and that’s vCOPS, not VC Ops, even though VMware wants you to believe that is what it is called… ;)), you may notice that once your environment starts collecting data you’ll be alerted to everything under the sun, and then some! And let me tell you, that is AWESOME! Please do tell me about everything going on, as that is beneficial and useful. As the days, weeks and months drone on though, you really couldn’t care less about being alerted that your thick-provisioned Datastore, which is maxed out by configuration, is full. WE GET IT. IT’S FULL. STOP TELLING ME ABOUT IT! Or that your Security Scanning server (pick Retina, Nessus or your favorite choice) uses >100% CPU when the process is running. Totally get it. It’s not undersized; it’s just not USED except when a scan is running, and throwing more resources at it won’t make it faster or better.

HOW DO I STOP IRRELEVANT ERRORS FROM ANNOYING THE HELL OUT OF ME?!?!

That’s what this is all about! I’ve taken an environment which would normally have anywhere from 500-1000 “Warnings, Errors, Alerts” on a daily basis down to where I’m really only seeing what actually MATTERS, ignoring the majority of the ‘blah crap’ to focus on active alerts as they’re happening. For what it’s worth, the same alerts would always appear, and the important anomalies were getting lost under the weight of the useless.

To start things off, log in to vCOPS and click on the Configuration button.

Open Configuration

That seems simple enough, right? Then you’ll want to go in and modify your Default Policy by simply clicking on the blue “Default Policy” link.

Modify Default Settings

Now this is where we start getting into the meat of things. You may notice I’ve made a series of modifications. These are the Infrastructure Badge thresholds, which apply to the Infrastructure and not to specific VMs or Groups of VMs.

Workload Level, while cute and all, tends to annoy me more than not in my system. As you can see, I originally kept pushing the threshold higher and higher (80, 90, 95) and eventually just clicked on the square which “turns off” that particular alert. Next, the Time Level function keeps tracking Timing, which I’ve found to be less useful on a day-to-day alerting basis. Long-term the data is still collected and I can report against it, so I leverage the reporting function as needed.

Capacity Level applies to Capacity available in the Infrastructure (Datastores, etc.), which frankly I keep an eye on personally. If you find yourself thin provisioning by default, then keeping a feature like this activated is likely important to you. I have over 100 datacenters and ensure they’re not over-provisioned, so being told 100+ datacenters are “full” or “near full” is just useless and annoying. Then when it comes to Waste Level and Density Level, I keep a tight hand on how that is handled within the Infrastructure, so I have those turned off too. Again, judge your environment based upon your needs. You can always turn functions back on or tune them.

Infrastructure Badge Thresholds

VM Badge Thresholds are a little more important than Infrastructure in this regard. I like to be alerted that my Workload is high, but only to the point where it is basically maxed out. Adjust accordingly based upon knowing your environment’s use and function. If you have dozens of VMs which regularly butt up against this ceiling as part of their function, you may find yourself tuning this up higher as well.

vCOPS likes to predict the timing of things and be all like OMG YOUR VM IS GOING TO RUN OUT OF CPU or something. Yea. Thanks for the offer, but I’ll run a report for Undersized VMs, knowing that a majority of VMs are oversized to start with. So I turned this off. :)

You’ll note that Capacity Level is configured and activated, because here I DO want to know if a VM’s hard disk is going to run out of space (or already has). That’ll impact things, so I leave that on. Same as above for Waste and Density.

VM Badge Thresholds

I’ll be honest. I don’t use Groupings here because things are more isolated than they are ‘paired’ and I don’t need this calling out any false positives.   Consider that for your environment.  If you heavily use Grouping, awesome, definitely take advantage of this!

Groups Badge Thresholds

I’m not going to dive into the details of these next few tabs and instead will show you what MY settings are, but for the most part they’re less important than the first few tabs and the last few tabs.

Capacity and Time Remaining

Usable Capacity

Usage Calculation

Powered off and Idle VMs

Oversized and Undersized VMs

Underuse and Stress

This is really where the rubber meets the road with the Alerts. All of the configuration we made above, while important, only comes into play through what you have configured for Alerts. You may notice that I monitor Workload on Infrastructure and VMs but not Anomalies. Anomalies are cute and insightful… and very important if you have applications which are anomalous in nature. If you don’t though, EVERYTHING will report anomalies to the point of being annoying and useless. What that means is, when you’re alerted on anomalies, you’ll spend more time chasing false positives than actual problems. Yea, you may get lucky… but if you understand your environment well enough, you’ll get annoyed and turn this off just as I have. :)

Time Remaining and Capacity Remaining, while deactivated on my Infrastructure, are active on my VMs. (I’ll be honest… I’m not sure why I even have Time Remaining on for VMs, but Capacity Remaining will identify if I’m running out of VMDK hard disk space, so yay!)

While we did ignore Anomalies, I do not ignore Stress, as that reflects actual stress on the system at the time it is happening. That’s important: it lets you know something is happening, not simply that something is high or low relative to its established pattern, as an anomaly would. And lastly… Waste and Density… just don’t matter to me when I have this architected specifically for my needs. Clearing that alone got rid of a large chunk of erroneous alerts.

Alerts 

And lastly the Forecast and Trends function… Okay, seriously, there’s no reason this should be highlighted any more than merely reviewed. See how your environment compares; there’s nothing too important to call out here, but since it was the ‘6th’ tab, I couldn’t omit it. :)

Forecast and Trends

Nothing beats a good understanding, architecture and design

vCOPS as we all know is a tool, and how you use that tool or respectively let it use you is important.   When getting started with vCOPS drink from the firehose, tune your things so you see everything, even more than everything and scour and look at every single tab, function, report and alert!

Then, once you’ve taken it all in and understand your limits, start to scale it back so it becomes useful. Hopefully some of the settings included here help you. I literally went from thousands, THOUSANDS of alerts across my many hundreds of Datacenters, vCenters and beyond to where, on a good day, I can have –0- messages warning me. Yea, I said it. –0- ! ! ! And at this point, even on a ‘bad day’ I’m looking at ~25 or so alerts at a maximum when one or more of my datacenters are experiencing some kind of issue.

Give it a try, tune tune tune and enjoy!

Stop logging me out vCOPS! WHY DO YOU HATE ME?! – Modifying vCOPS Timeout!

I’m sure you’ve been in that situation… sitting there, using vCOPS day in and day out, only to get annoyed as all get out every time you go to refresh or do something and it’s all like HEY WHY DON’T YOU AUTHENTICATE AGAIN!    Well, look no further than here (and this respective KB Article) to cut it out!

Changing or disabling the UI session timeout for vCenter Operations Manager vApp (2015135)

To change the session timeout period for Standard and Advanced versions, set the <session-timeout> parameter to the required value in minutes for the desired timeout.

To disable the session timeout, set the <session-timeout> parameter to -1.
To change or disable the session timeout:

  1. Log in to the vCenter Operations Manager vApp UI VM.
  2. Open this file using a text editor:
    • For the Standard UI – /usr/lib/vmware-vcops/tomcat/webapps/vcops-vsphere/WEB-INF/web.xml
    • For the Enterprise Custom UI – /usr/lib/vmware-vcops/tomcat-enterprise/webapps/vcops-custom/WEB-INF/web.xml
  3. Locate the <session-config> parameter and change this to:
    <session-config>
    <session-timeout>value</session-timeout>
    </session-config>
    Where value is any value in minutes after which you want the session to timeout.
    For example:
      • To set the session to time out after 60 minutes, change this parameter to:
        <session-config>
        <session-timeout>60</session-timeout>
        </session-config>
      • To disable session timeout, change this parameter to:
        <session-config>
        <session-timeout>-1</session-timeout>
        </session-config>
  4. Restart the web services:
      • /etc/init.d/vcopswebenterprise restart
      • /etc/init.d/vcopsweb restart

You may notice that the value might ALSO be sitting at the default of 30. I’ve seen that happen as I’ve upgraded versions of vCOPS over time.

Also important to note: even if you’re running the Enterprise or Advanced versions, you’ll still want to modify the “Standard UI” configuration, so that vcops-vsphere is modified in addition to the vcops-custom file listed above for the Enterprise Custom UI.

One last comment: when you update your version of vCOPS you will NEED to go and change this setting again, each and every time. So if you’re loading in the latest .PAK file to upgrade, reset you shall, or else get annoyed again by timeouts!
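Since the timeout has to be re-applied after every upgrade, it can be handy to script the edit. Here is a minimal Python sketch (my own illustration, not something from the KB) that rewrites the <session-timeout> value in a web.xml. The function names and the idea of running Python on the UI VM are assumptions; the two file paths are the ones from the KB steps above. It does a plain text substitution so the rest of the file is left untouched, and you’d still need to restart the web services afterwards.

```python
import re
from pathlib import Path

# Paths taken from the KB steps above (Standard UI and Enterprise Custom UI).
WEB_XML_PATHS = [
    "/usr/lib/vmware-vcops/tomcat/webapps/vcops-vsphere/WEB-INF/web.xml",
    "/usr/lib/vmware-vcops/tomcat-enterprise/webapps/vcops-custom/WEB-INF/web.xml",
]

# Matches an existing <session-timeout>N</session-timeout> element (N may be -1).
_TIMEOUT_RE = re.compile(r"<session-timeout>\s*-?\d+\s*</session-timeout>")


def set_session_timeout(xml_text: str, minutes: int) -> str:
    """Return xml_text with every <session-timeout> value replaced by minutes.

    Use -1 to disable the timeout. Raises ValueError if no element is found,
    so a file with an unexpected layout is never silently left unchanged.
    """
    new_text, count = _TIMEOUT_RE.subn(
        f"<session-timeout>{minutes}</session-timeout>", xml_text
    )
    if count == 0:
        raise ValueError("no <session-timeout> element found")
    return new_text


def update_file(path: str, minutes: int) -> None:
    """Rewrite one web.xml in place, keeping a .bak copy of the original."""
    target = Path(path)
    original = target.read_text()
    Path(path + ".bak").write_text(original)
    target.write_text(set_session_timeout(original, minutes))
```

Calling update_file(path, 60) for each entry in WEB_XML_PATHS would set both UIs to 60 minutes. Nothing runs on import, so you decide when to apply it.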

Disabling alarms in vCenter and ignoring “Health Status Monitoring” Errors in vCOPS!

I don’t know about you, but I hate it when I log in to vCOPS only to find that ALL of my vCenters show a health status of 0 and red, removing any real chance of seeing whether there are any real PROBLEMS going on. It’s annoying as all hell. I know what you’re saying though: I can go in to the alerts and clear that Health error. Yea, I can do a lot of things, but if all I’m going to be doing is removing the alert and effectively ignoring it, then the error itself serves no purpose and should be removed!

So, here’s how you go about actually disabling this Alert from showing up in vCenter. Disclaimer: even with the alarm disabled it will OCCASIONALLY still show up in vCOPS. I don’t know why yet, and when I figure out WTF that means I’ll publish said results. In the meantime, here are the steps!

OMG VCENTER IS ALERTING ME OF PROBLEMS!!!!

Yea you’ve seen this error! The sky is falling, and all that.  Now if you find this error to be pointless and stupid, here are two ways to go about clearing it up!

Firstly, you can pop yourself into the Alarm settings tab, simply uncheck the box to “Enable this Alarm” and bam, the error will no longer ‘alert’ you and appear in your alarms section.  This is awesome. But if you have more than one vCenter like me, I mean not that my 100 vCenters really affords me the need to script this… But sometimes we want to disable alerts everywhere!  

You can run this command and it’ll show you the status of a particular Alarm definition. As you can see here, this one is configured as “True”.

Get-AlarmDefinition -Entity (Get-Folder -NoRecursion) -Name "Health Status Monitoring"

Get-AlarmDefinition -Entity (Get-Folder -NoRecursion) -Name "Health Status Monitoring" | Set-AlarmDefinition -Enabled:$False

By using the Set-AlarmDefinition cmdlet in PowerCLI we can very easily change the status of this alarm from True to False, effectively disabling the alarm and setting us up for “teh win”. What is even more awesome is that if you have other Alarms you’d like to disable, like License Logging Monitor and various others, you can simply run the same syntax, changing the -Name, and disable away! Awesome, right?!

This personally saves me loads of time, since I don’t have to log in to every vCenter to clear the stupid Alarm from the list. Hope this helps you!

VMware Log Insight EXPOSED! Splitting your Syslog with an axe!

Well, for those of you who have read my “Exposed” exposés in the past… I’ll do my best to provide in-depth coverage of this tool, lessons learned and so much more! Allow me a disclaimer for a moment: this IS a beta, so your mileage may vary and your feedback has a chance to shape the product.

You can read the infamous Jon Herlocker’s breakdown of the tool at the Office of the CTO Blog; Introducing VMware vCenter Log Insight

Jon provides some great stock photos, descriptions, images, use-cases and all that jazz… What I’ll show you, is Production Use, and I won’t be using any screenshots I didn’t take myself! :)

Getting Started with #LogInsight

WTF YOU’RE ALREADY HASHTAGGING IT! Yea I am, but I digress. :) Alright! Let’s focus on getting started! First things first, you should visit the VMware Log Insight Beta Community. There you can join the ‘discussion forums’… okay, I know you won’t seriously do that, but you can download the product!

And once you get it downloaded and you deploy the OVA/OVF you’re pretty much set! You may experience ‘errors’ when going through the configuration process. I personally re-deployed my OVF 3 times (remember, it’s a beta), but once I got past that and a few little browser mess-ups, it’s been SOLID since!

Login Page

Look at that, nice clean login… seems pretty straight-forward (hint… it is :))

Cracking Open the Log (insight…)

I know what you’re saying: DAMNIT, MORE BAD LOG PUNS. Yea, that’s right! Alright, so you pop it open and here’s your dashboard! You’ll notice events coming in and a very simple interface, perhaps too simple but simple nonetheless. The real keys come in the next few sections.

Overview

 

ESX_ESXi_Hosts SCSI_iSCSI_NFS

Once you start diving into the details you’ll start to see more and more events coming in, and in their relevant and relative categories.

SCSI_iSCSI_NFS_Blank_5Mins

I want to share with you this little experience… Sometimes you may click on a tab and be all like WTF HAPPENED THERE WERE EVENTS HERE 5 MINUTES AGO. And that’s exactly it.  If you’re on the “Last 5 minutes of data” section, you’re literally only going to get the last 5 minutes of data.  Expand it out to an hour or so and you’ll start to see those messages you had seen just minutes before! 

vCenter_Servers Events_Tasks_Alarms

And lastly, your main page goes on to list screens for various further event types… And I get it, this is all nice and interesting, but what does it mean?!

Diving into the weeds

Interactive_Analytics 

Once you start to get into “Interactive Analytics” you start to get into the details, or quite frankly into the damn SYSLOGs!

Interactive_Analysis_Search 

One particularly awesome piece of this is the ability to ‘type’ something into the Search bar. It indexes all types of requests in the background and gives you an idea of how many times certain types of events or names show up. For example, if you specify a Hostname you’ll see how many syslog messages had that hostname, or VM Guest, or you name it. Just type something in, and you’ll start to get some details and insight! (For security reasons, I chose details which had no particular relevance but still give you some ‘search’ context!)

Configuring your ESX environment!

You may notice upon reading the manuals which come with the software (hah, you’re never going to read those! ;)) that it comes with a tool called ‘Configure ESXi’ which will configure your environment. Let’s say you’re like me and cannot run that tool, or just outright choose not to… Well, here are some alternatives to get your ESX hosts configured so they can start reporting back to your newly created SYSLOG Server!

OMG THERE’S TOO MUCH DATA

That’s right. You find that your local traffic is okay, but you have a remote site which has a slower link, could be in a different country, or is hanging off a satellite link or something similarly ridiculous… Well, look no further!

When using VMware Log Insight you may want to change the amount of syslog data you’re receiving.

You can check your current logging levels with these PowerCLI commands:

  • Get-VMHost | Get-VMHostAdvancedConfiguration -Name "Config.HostAgent.log.level"
  • Get-VMHost | Get-VMHostAdvancedConfiguration -Name "Vpx.Vpxa.config.log.level"

Chances are that at Verbose you’ll be getting a load of data coming in: thousands of log messages, even as high as 5,000-10,000 in a 5 minute period.

I switched hostagent and vpx levels from "Verbose" to "Warning" and went down to ~10-15 logs for a 5 minute span!    If you have low bandwidth links this could mean significantly less impact.
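To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. The message counts are the ones quoted above; the average message size of 200 bytes is purely an assumed figure for illustration (measure your own), and the helper names are made up for this sketch.

```python
WINDOWS_PER_DAY = 24 * 60 // 5  # number of 5-minute windows in a day (288)


def messages_per_day(messages_per_5_min: int) -> int:
    """Scale a per-5-minute message count up to a full day."""
    return messages_per_5_min * WINDOWS_PER_DAY


def megabytes_per_day(messages_per_5_min: int, avg_bytes: int = 200) -> float:
    """Rough daily syslog volume; avg_bytes is an assumed average message size."""
    return messages_per_day(messages_per_5_min) * avg_bytes / 1_000_000


if __name__ == "__main__":
    # Counts quoted in the post: Verbose (high end) vs. Warning (high end).
    for label, count in [("verbose", 10_000), ("warning", 15)]:
        print(f"{label}: {messages_per_day(count):,} msgs/day, "
              f"~{megabytes_per_day(count):.1f} MB/day at 200 bytes each")
```

With those inputs, Verbose works out to 2,880,000 messages a day versus 4,320 at Warning, which is why the change matters so much on a slow link.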

And if you want to outright change those down to Warning as I did, or to any other value (say, Info) you can do it with these handy one-liners!

  • Get-VMHost | Set-VMHostAdvancedConfiguration -Name "Config.HostAgent.log.level" -Value "warning"
  • Get-VMHost | Set-VMHostAdvancedConfiguration -Name "Vpx.Vpxa.config.log.level" -Value "warning"

HOW THE HELL DO I POINT MY HOSTS TO THIS THOUGH!

I’m glad you asked that, I mean metaphorically of course, because I’m writing this, not you! NOT YOU! I went through various iterations to make this possible. I found setting the Syslog server easy, and configuring the firewalls equally easy whether via vSphere Client or PowerCLI, but I found reloading syslogd to be a pain in the ass. That is, until I came across this little gem!

I’d like to note I am stealing/borrowing this from Caleb in his post, Changing VMware ESXi 5.1 Syslog settings via PowerCLI. It worked like a charm and you shouldn’t feel ashamed to use it! Be sure you thank Caleb for this code of course!

    # Show the current syslog target on every host
    Get-VMHost | Get-VMHostAdvancedConfiguration -Name Syslog.global.logHost

    # Get each host connected to the vCenter
    foreach ($myHost in Get-VMHost)
    {
        # Display the ESXi host that you are applying the changes to
        Write-Host '$myHost = ' $myHost
        # Set the Syslog LogHost
        Set-VMHostAdvancedConfiguration -Name Syslog.global.logHost -Value 'server.domain.com,server2.domain.com' -VMHost $myHost
        # Use Get-EsxCli to reload the syslog service
        $esxcli = Get-EsxCli -VMHost $myHost
        $esxcli.system.syslog.reload()
        # Open the firewall on the ESXi host to allow syslog traffic
        Get-VMHostFirewallException -Name "syslog" -VMHost $myHost | Set-VMHostFirewallException -Enabled:$true
    }
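Once the hosts are pointed at the new server, it’s nice to confirm the Log Insight box is actually receiving syslog before you go hunting for ESXi events. The Python sketch below hand-builds a simple BSD-style syslog line and fires it over UDP. It assumes the server listens on the standard syslog port 514/UDP; the tag loginsight-test and both function names are made up for this example, and server.domain.com is just the placeholder hostname from the script above.

```python
import socket

SYSLOG_PORT = 514  # standard syslog port; assumed to be what the server listens on


def build_syslog_packet(message: str, facility: int = 1, severity: int = 6,
                        tag: str = "loginsight-test") -> bytes:
    """Build a minimal BSD-style syslog payload: <PRI>tag: message.

    PRI is facility * 8 + severity (facility 1 = user, severity 6 = info).
    """
    priority = facility * 8 + severity
    return f"<{priority}>{tag}: {message}".encode("utf-8")


def send_test_message(server: str, message: str, port: int = SYSLOG_PORT) -> None:
    """Send one test message over UDP (fire-and-forget, no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_syslog_packet(message), (server, port))
```

For example, send_test_message("server.domain.com", "hello from the syslog test"), and then search for loginsight-test in the Search bar. Since UDP gives no acknowledgement, a missing message could mean a firewall in between rather than a broken server.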

And honestly, that is about it! Once you’re set with the right level of verbosity of information, and syslogs pointing to your newly built VMware Log Insight server… then it’s just a matter of collecting, and reviewing with the occasional troubleshooting as needed.  

I did come across this little bug which I’m sure they’ll fix eventually..

Display_Error_While_Not_FullScreen 

If you’re not seeing the bug, it simply is, if you have the Log Insight log viewer NOT in full-screen mode (like you have half the screen showing log insight, and the other half, oh I don’t know… with VLC finishing off the 4th season of Battlestar Galactica…) it’ll seemingly ‘truncate’ the text on the screen, instead of simply moving to the next line.   I’m sure it’d be pretty easy to fix, so don’t get too annoyed by it! :)

In Summation or Building a Log Cabin for your troubleshooting…

Wow, you couldn’t let it go without another bad pun? Yea, probably not… :) There is a lot more to this tool than I could show you; unfortunately there are screens which I could not edit down enough without destroying the value of what you’d be seeing. This tool has vCenter Operations integration and the ability to pull and index all of your data points! I can see at a glance the errors which are showing up, and then drill down to find similarly correlated errors. I mean, the tool isn’t overly intelligent yet, but that is bound to come in time, and through our suggestions I hope!

I encourage you to check it out, especially if you don’t have something in place pulling your syslogs today, like Kiwi or Splunk.   This gives you a ‘single family’ set of solutions which in the end will have your virtual best interests at heart.   So check out the beta and let me know what you think.   I’ll keep rocking this tool out and continuing to pull in and index my extremely enormous virtual environment!   Enjoy!

Protecting vCenter Operations Manager (vCOPS) with SRM – Fixing CD Detach

So if you’ve read the KB Article vCenter Operations Manager 5.0.x: Using Site Recovery Manager to Protect a vApp Deployment (2031891), you’ll see that it states VCOPS IS NOT ACTUALLY SUPPORTED WITH SRM. But hey, who are we to listen to things like KB Articles!

Alright, the truth of the matter is… yes, for multiple reasons (IP Pools, the way VMware released the vApp, etc.) vCOPS is technically not supported with SRM. So now that you know what isn’t supported, let’s focus on fixing the challenge of having to constantly “detach” your CD drive, since you’re clearly ignoring this and continuing to SRM-protect your vCOPS environment :)

You’ll notice if you view the settings of the VMs in the vApp that the CD drive is shown as connecting to datastore ISO file “[]”.

vCenter Operations Manager CD drive ISO appears as []

I know what you’re thinking, “Well, I’ll just click Remove and be done with it!” A few things about that! If you choose to “Remove” it, you’ll get this error, which is annoying at best!

Reconfigure Virtual Machine; Invalid Configuration for device '0'

So, the way you get around this, is in the following steps…

  • Power down your vApp Virtual Machine. I tend to use “Shutdown guest operating system” to let the system halt gracefully.
  • Change the CD drive to “Client Device”, click “OK” and let it save those settings.
  • Go back into the settings and select “Remove” on the CD drive.

After these steps the CD drive will no longer exist in the Virtual Machine. You can power it back up and let it be protected by SRM without ever getting errors about the CD drive being connected to [].

Change vCOPS CD Drive to Client Device from ISO file

With vCOPS vApp Powered Down - Choose to remove the CD Drive

Now, sometimes… a risk of shutting down your vCOPS environment is that you’ll get some whacked out error of some kind.  I’ve been there, I’ve done that, and this tends to resolve it every time.

  • Log in to the UI VM Console
  • su admin
  • vcops-admin repair --ipaddress 192.168.0.x (IP address of the Analytics VM)
  • It’ll run for a while and you’ll see a screen like this – And then you’re set!

vCenter Operations Manager vApp was successfully repaired.

At times you may need to perform this activity, or even go so far as to repair in the event of an SRM failover.  

Hopefully this helps you out, as it has helped me out in managing my vCOPS environment(s) :)