
Preparing for life without Edgesight with ExtraHop

So, the rumors have been swirling and I think we have all come to the quiet realization that Edgesight is coming to an end. At least the Edgesight we know and love/hate.

For those of us who have continued with this labor of love, trying to squeeze every possible metric we could out of Edgesight, we will likely have to come to grips with the fact that the next generation of Edgesight will not offer the same level of metrics we have today. While we all await the next version of HDX Edgesight, we can be almost certain that the data model, and all of the custom queries we have written over the last three years, will not be the same.

Let's be honest, Edgesight has been a nice concept, but there have been extensive problems with the agent, both from a CPU standpoint (the Firebird service taking up 90% CPU) and in keeping the versions consistent. The real-time monitoring requires elevated permissions for the person looking into the server, forcing you to grant your service desk higher permissions than many engineers are comfortable with. I am, for the most part, a "tools" hater. In the last 15 years I have watched millions of dollars spent on any number of tools, all of which told me they would be the last tool I would ever need, and all of them, in my opinion, were for the most part underwhelming. I would say that Edgesight has been tolerable and it has done a great job of collecting metrics, but like most tools I have worked with it is agent-based, and it cannot log in real time. The console was so unusable that I literally have not logged into it for the last four years. (In case you were wondering why I don't answer emails with questions about the console.)

For me, depending on an agent to tell you there is an issue is a lot like telling someone to "yell for help if you start drowning." If a person is under water, it's a little tough for them to yell for help. With agents, whatever is wrong with the computer (CPU, disk I/O, memory) will likely impact the agent as well. The next best thing, which is what I believe Desktop Director does, is to interrogate a system via WMI. Thanks to folks like Brandon Shell, Mark Schill and the people at Citrix who set up the PowerShell SDK, we now have some very useful scripting that gives us the real-time logs we have desperately wanted. That works great for looking at a specific XenApp server, but in the Citrix world, where we are constantly "proving the negative," it does not provide the holistic view that Edgesight's downstream server metrics provided.

Proving the negative:

As some of you are painfully aware, Citrix is not just a Terminal Services delivery solution. In our world, XenApp is a web client, a database client, a printing client and a CIFS/SMB client. Poor performance over any of these protocols will result in a ticket landing in your queue, regardless of the downstream server's performance. Edgesight did a great job of providing this metric, letting you know whether you had a 40-second network delay getting to a DFS share or a 5000 ms delay waiting for a server to respond. It wasn't real-time, but it was better than anything I had used until then.

While I loved the data that Edgesight provided, the agent was problematic to work with, I had to wait until the next day to actually look at the data, it was yet another console to go to (unless you ran your own queries and did your own BI integration), and you needed to grant higher credentials for the service desk to use the real-time console.

Hey! Wouldn't it be great if there were a solution that would give me the metrics I need to get a holistic view of my environment? Even better, if it were agentless I wouldn't have to worry about which .NET Framework version I had, changes in my OS, the next security patch that takes away kernel-level access, or just all-around agent bloat from the other two dozen agents I already have on my XenApp server. Not to mention that the decoupling of GUIDs and images, thanks to PVS, has caused some agents to really struggle to function in this new world of provisioned server images.

It's early in my implementation, but I think I have found one: Extrahop.

Extrahop is the brain-child of ADC pioneer Jesse Rothstein, one of the original developers of the modern Application Delivery Controller. Extrahop sits on the wire, grabs pertinent data, and makes it available to your engineers and, if you want, your operations staff. Unlike Wireshark, a great troubleshooting tool in its own right, it does not force you to, figuratively, drink water from a fire hose. They have formed relationships with several vendors, gained insight into their packets, and are able to discriminate between the packets that are useful to you and those that are not. I am now able to see, in real time, without worrying about an agent, ICA launch times and the authentication time when a user launches an application. I can also see client latency and virtual channel bytes in and bytes out for printer, audio, mouse, clipboard, etc.

(The Client-Name, Login time and overall Load time as well as the Latency of my Citrix Session)

In addition to the Citrix monitoring, it helps us with "proving the negative" by providing detailed data about database, HTTP and CIFS connections. This means you can see, in real time, performance metrics for the application servers that XenApp is connecting to. If there is a specific URI that is taking 300 seconds to process, you will see it when it happens, without waiting until the next day for the data or having to go to edgesightunderthehood.com to see if John, David or Alain have written a custom query.

If there is a conf file that has an improper DNS entry, it will show up as a DNS Query failure. If your SQL Server is getting hammered and is sending RTOs, you will see it in real-time/near-time and be able to save yourself hours of troubleshooting.

(Below, you see the different metrics you can interrogate a XenApp server for.)


Extrahop Viewpoints:
Another advantage of Extrahop is that you can look at metrics from the point of view of the downstream application servers as well. If you publish an IE application that connects to a web server which in turn integrates with a downstream database server, you can go to that published web server and look at the performance of both the web server and the database server. If you have been a Citrix engineer for more than three years, you are already used to doing the other teams' troubleshooting for them, but this will make it even faster. You get a true, holistic view of your entire environment, even outside of XenApp, where you can find bottlenecks, flapping interfaces and tables that need indexing. If your clients are on an internal network, then depending on your topology you can even look at their performance on their workstations and tell whether the switch in the MDF is saturated.

Things I have noted so far looking at Extrahop Data:

  • SRV Record Lookup failures
  • Poorly written Database Queries
  • Excessive Retransmissions
  • Long login times (thus long load times)
  • Slow CIFS/SMB Traffic
  • Inappropriate User Behavior

Geocoding Packets:
Another feature I like is the geocoding of packets. This is very useful if you want to bind a geomap to your XenApp servers to see if there is any malware making connections to China or Russia, etc. (I have an ESUTH post on monitoring malware with Edgesight.) Again, this gives me a real-time look at all of my TCP connections through my firewall, or I can bind it to a specific XenApp, web server or even PC node. The specific image below is of my ASA 5505 and took less than 15 seconds to set up (not kidding).

On the wire (Extrahop) vs. On the System (Agent):
I know most of us are "systems" guys and not so much network guys. Because there is no agent on the system and it works on the wire, you have to approach it a little differently, and then you can see how you can live without an agent. Just about everything that happens in IT has to come across the wire, and you already have incumbent tools to monitor CPU, memory, disk and Windows events. The wire is the last "blind spot" that I have not had a great deal of visibility into, from a tools perspective, until I started using Extrahop. Yes, there was Wireshark, but for archival purposes and for looking at specific streams it is not quite as easy. Yes, you can filter and you can "Follow TCP Stream" with Wireshark, but it is going to give you very raw data. I even edited a tcpdump-based PowerShell script to write the data to SQL Server, thinking I could archive the data that way; I had 20 GB of data inside of 30 minutes. With Extrahop you can trigger wire captures based on specific metrics and events that it sees in the flow, and all of the sifting and stirring is done by Extrahop, leaving you to collect the gold nuggets.

Because it is agentless, you don't have questions like "Will Extrahop support the next edition of XenApp?", "Will Extrahop support Windows Server 2012?", "What version of the .NET Framework do I need to run Extrahop?" or "I am on server version X but my agents are on version Y."

The only question you have to answer to determine whether your next generation of hardware or software will be compatible with Extrahop is "Will it have an IP address?" If your product is going to have an IP address, you can use Extrahop with it. Now, you do have to use RFC-compliant protocols, and Extrahop has to continue to develop relationships with vendors for visibility, but in terms of deployment and maintenance it is a much simpler endeavor than with other vendors. The simplicity of monitoring on the wire is going to put an end to some of the more memorable headaches of my career revolving around agent compatibility.

Splunk/Syslog Integration:
So, I recently told my work colleagues that the next monitoring vendor that shows up saying I have to add yet another console is going to hear "no thanks." While the Extrahop console is actually quite good and lets you logically collate metrics, applications and devices the way you like, it also has extensive Splunk integration. If there are specific metrics that you want sent to an external monitor, you can send them to your syslog server and fold them into your existing syslog strategy, be it Envision, Kiwi Syslog Server or any other SIEM product. They have a JavaScript-based trigger solution that allows you to tap into custom flows and cherry-pick the metrics that are relevant to you. Currently, there is a very nice and extensive Splunk app for Extrahop.

I am currently logging (in real-time) the following with Extrahop:

  • DNS Failures (few people realize how badly poor DNS can wreck n-tiered environments)
  • ICA OPEN Events (to get logon times and authentication times)
  • HTTP User Agent Data
  • HTTP Performance Data

So if this works by monitoring the wire, isn’t it the Network team’s tool?
The truth is, it's everybody's tool; the only thing you need the network team to do is span ports for you (and then log in and check out their own important metrics). The DBAs can log in and check the performance of their queries. The network engineers can log in and check jitter, TCP retransmissions, RTOs and throughput. The Citrix guy can log in and check client latency, STA ticket delivery times, ICA channel throughput, and logon/launch times. The security team can look for TCP connections to China or Russia and catch people RDPing home to their home networks. The web team can check which user agents are the most popular to determine whether they need to spend more time accommodating tablets. Everybody has something they need on the wire. I sometimes fear that we tend to select our tools based on what technical pundits tell us to. In our world, from a vendor standpoint, we like to put things in boxes (which is a great irony given everyone's "think outside the box" buzz statement). We depend on thought leaders to put products in boxes and tell us which ones are leaders, visionaries, etc. I don't blame them for providing product evaluations that way; we have demanded it. For me, Extrahop is a great APM tool, but it is also a great network monitoring tool and has value to every branch of my IT department. This is not a product whose value can be judged by finding its bubble in a Gartner scatter plot.

Conclusion:
I have not even scratched the surface of what this product can do. The trigger engine basically gives you the ability to write nearly any rule you want to log or report any metric you want. Yes, there are likely things you can get with an agent that you cannot get without one, but in the last few years these agents have become a lot like a ball and chain. You basically install the appliance or import the VM, span the ports and watch the metrics come in. I have had to change my way of thinking about metrics gathering, from system-specific collection to siphoning data off the wire, but once you wrap your head around how it gets the data you really get a grasp of how much more flexibility you have with this product than with other agent-based solutions. The Splunk integration was the icing on the cake.

I hope to record a few videos showing how I am doing specific tasks, but please check out the links below as they have several very good live demos.

To download a trial version: (you have to register first)
http://www.extrahop.com/discovery/

Numerous webinars:
http://www.extrahop.com/resources/

Youtube Channel:
http://www.youtube.com/user/ExtraHopNetworks?feature=watch

Thanks for reading and happy holidays!

John



Digital Epidemiology: Edgesight Under the hood (Will be moved to EdgesightUndertheHood.com)

Okay, so no flat files, parsing or Kiwi syslogging today. Today I want to talk about Edgesight 5.x. If any of you have attempted to reverse engineer Edgesight, you have probably noticed that the tables are a lost cause. All of the key data you will want to harvest is located in the "views." I want to do a few blog posts on each of my favorite views and how you can pull statistics from them instantly via ad hoc queries. I will start by saying Citrix has created an outstanding front end delivered via the web interface. I am in no way knocking that interface; there are just times when the canned reports don't do it for you. Until the engineers at Citrix get their hands on a crystal ball, there will always be a use for good old-fashioned ad hoc queries. I am going to go over a few key queries against the vw_ctrx_archive_ica_roundtrip_perf view in your Edgesight database and show how you can open a query window and gather these statistics post haste or, if you are adept with Reporting Services, set up reports for yourself. I have pitched to the Synergy 2010 group that they let me host a breakout covering how to integrate some of what I do with SQL Server Reporting Services; I think I can cover a lot in a 90-minute session and let engineers take something away that they can use in their own environments. So, as I stated, the view of the day is vw_ctrx_archive_ica_roundtrip_perf. Open SQL Server Management Studio and log into the SQL Server hosting your database with an account that has datareader privileges. If your admin account does not work, your Edgesight service account will likely suffice, if your organization allows service accounts to be used in that manner.
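
If you are not sure which Citrix views are available in your Edgesight database, a quick way to list them is to query the system catalog. This is just a convenience sketch; the vw_ctrx prefix is an assumption based on the view names used in this post.

-- List the Citrix-related views in the Edgesight database
select name
from sys.views
where name like 'vw_ctrx%'
order by name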

@Today:
The @today variable holds the current day. That means that if you want to check between yesterday and the day before, you would change "convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-2" to "convert(varchar(10),dateadd(hh,-4,time_stamp),111) between @today-2 and @today-1".
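
For reference, here is a minimal sketch of how that variable is declared in the queries below, with the two date filters side by side:

declare @today datetime
set @today = convert(varchar,getdate(),111)
-- last two days (yesterday and today):
-- where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-2
-- yesterday and the day before only:
-- where convert(varchar(10),dateadd(hh,-4,time_stamp),111) between @today-2 and @today-1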

Find the number of ICA Sessions by server by time of day
About this query: 
In this query we declare three variables, two of which you must edit. The @begin and @end variables must hold the hours of the day that you want to search. So, if you wanted to know the number of unique users on each server between 8 AM and 2 PM, you would enter '08' for @begin and '14' for @end.

declare @begin varchar(2)
declare @end varchar(2)
declare @today datetime
set @today = convert(varchar,getdate(),111)
set @begin = '14'
set @end = '23'
select machine_name, count(distinct [user])
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(2),dateadd(hh,-4,time_stamp),108) between @begin and @end
and convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-2
group by machine_name
order by count(distinct [user]) desc
         

Find ICA Latency by user by day
About this query:
This query will show you the ICA latency for each user, sorted so that the user with the worst latency is first. If you wanted to check sessions on a specific server, you would add the following above the "group by" statement: and machine_name = '%netbiosNameOfXenAppServer%' (see the sketch after the query below).
        

declare @today datetime
set @today = convert(varchar,getdate(),111)
select [user], sum(network_latency_sum)/sum(network_latency_cnt) as "Latency"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-1
group by [user]
order by sum(network_latency_sum)/sum(network_latency_cnt) desc
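
As mentioned above, here is a sketch of the same query limited to a single XenApp server. XAPP01 is a hypothetical NetBIOS name, so substitute your own:

declare @today datetime
set @today = convert(varchar,getdate(),111)
select [user], sum(network_latency_sum)/sum(network_latency_cnt) as "Latency"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-1
and machine_name = 'XAPP01' -- hypothetical server name
group by [user]
order by sum(network_latency_sum)/sum(network_latency_cnt) desc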

        

ICA Latency by Server: 
About this query:
This query will show you the latency by server for a given day. This can be handy if you want to keep tabs on server health. If you note high latency for a particular server on a specific day, you may need to look and see whether there was a user connection that skewed the results or whether all sessions on that server had issues (see the sketch after this query).
         

declare @today datetime
set @today = convert(varchar,getdate(),111)
select machine_name, sum(network_latency_sum)/sum(network_latency_cnt) as "Latency"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-1
group by machine_name
order by sum(network_latency_sum)/sum(network_latency_cnt) desc
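
To tell whether high latency on a given server is a one-day blip or a trend, a variation like the one below breaks the last week down by day for a single server. XAPP01 is again a hypothetical server name:

declare @today datetime
set @today = convert(varchar,getdate(),111)
select convert(varchar(10),dateadd(hh,-4,time_stamp),111) as "Day",
sum(network_latency_sum)/sum(network_latency_cnt) as "Latency"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-7
and machine_name = 'XAPP01' -- hypothetical server name
group by convert(varchar(10),dateadd(hh,-4,time_stamp),111)
order by 1 desc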
         

Find total sessions by server by farm:
About this Query:
If you have more than one farm, then you can specify the farm name in this query to get the number of connections per server by farm name. Users in very large environments with multiple farms may find it handy to query by farm name.
          

declare @today datetime
set @today = convert(varchar,getdate(),111)
select machine_name, count(distinct [user])
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) = @today-3
and xen_farm_name = '%FarmName%'
group by machine_name
order by count(distinct [user]) desc
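
If you just want a quick head count per farm rather than per server, a small variation groups on the farm name itself for the last day:

declare @today datetime
set @today = convert(varchar,getdate(),111)
select xen_farm_name, count(distinct [user]) as "Users"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-1
group by xen_farm_name
order by count(distinct [user]) desc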
       

Conclusion:
There are at least four views that I like to work with directly, and I also integrate all of my queries, including the variables, into SQL Server Reporting Services, letting me customize reports for my specific needs. The eventual goal is to provide our operations and support team with a proactive list of users with high latency so that we can call them and let them know we noticed they were having issues. My next post will cover how to look at problematic downstream hosts that cause you to get a bunch of calls saying it's Citrix's fault!! I apologize for the lack of examples; I am limited in what I can show from my environment. As I stated, I am hoping to show all of this integration, including custom SQL reports, at Synergy 2010.
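
As a starting point for that proactive list, a sketch like the one below surfaces sessions since yesterday whose average latency exceeds a threshold. The 300 cutoff is just an assumed example value, averaged the same way as in the queries above, so tune it for your environment:

declare @today datetime
set @today = convert(varchar,getdate(),111)
select [user], machine_name, sum(network_latency_sum)/sum(network_latency_cnt) as "Latency"
from vw_ctrx_archive_ica_roundtrip_perf
where convert(varchar(10),dateadd(hh,-4,time_stamp),111) > @today-1
group by [user], machine_name
having sum(network_latency_sum)/sum(network_latency_cnt) > 300 -- assumed threshold, tune as needed
order by sum(network_latency_sum)/sum(network_latency_cnt) desc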

 If you have a specific query that you want, post it as a comment and I will reply with the SQL Query that gets you as close as I can.       

Thanks for reading!        

John