Home » Posts tagged 'Kusto Query'

Tag Archives: Kusto Query

Contoso.se

Welcome to contoso.se! My name is Anders Bengtsson and this is my blog about Azure infrastructure and system management. I am a senior engineer in the FastTrack for Azure team, part of Azure Engineering, at Microsoft.  Contoso.se has two main purposes, first as a platform to share information with the community and the second as a notebook for myself.

Everything you read here is my own personal opinion and any code is provided "AS-IS" with no warranties.

Anders Bengtsson

MVP
MVP awarded 2007,2008,2009,2010

My Books
Service Manager Unleashed
Service Manager Unleashed
Orchestrator Unleashed
Orchestrator 2012 Unleashed
OMS
Inside the Microsoft Operations Management Suite

Visualize Service Map data in a workbook

Service Map is a feature in Azure Monitor to automatically discovers communication between applications on both Windows and Linux servers. Service Map visualize collected data in a service map, with servers, processes, inbound and outbound connection latency, and ports across any TCP-connected architecture — more information about Service Map at Microsoft Docs.


The default Service Map view is very useful in many scenarios, but there is, from time to time, a need for creating custom views and reports based on the Service Map data. Custom views and reports are created with Kusto queries and workbooks. In this blog post, we will look at some examples of a visualize Service map data in a workbook.


One of the main reasons you may want to create customer workbooks based on Service Map data is that the default Service Map view only shows one hour of data, even if more data is collected.


Below is an image of Service Map, used in VM Insight. In the figure, you can see Windows server DC00 in the center and all processes on the server that communicates on the network. On the right side of the figure, we can see servers that DC00 communicates with, grouped on network ports. It is possible to select another server, for example, DC11, and see which process on DC11 communicating with the process on DC00.

All service map data is stored in two different tables, VMproccess and VMConnection. VMComputer has inventory data for servers. VMprocess has inventory data for TCP-connected processes on servers.

Here are a few sample queries to get you started.

To list all machines that have inbound communication on port 80 last week

VMConnection
| where DestinationPort == "80"
| where Direction == "inbound"
| where TimeGenerated > now(-7d)
| distinct Computer

To list unique processes on a virtual machine, for last week

VMProcess
| where Computer == "DC21.NA.contosohotels.com"
| where TimeGenerated > now(-7d)
| summarize arg_max(TimeGenerated, DisplayName, Description, Computer) by ExecutableName

To list all unique communication for a server, for last week

VMProcess
|VMConnection
| where Computer == "DC21.NA.contosohotels.com" 
| where TimeGenerated > now(-7d)
| summarize arg_max(TimeGenerated, Computer, Direction, ProcessName) by RemoteIp, DestinationPort  

To list all communication between two IP addresses

VMConnection
| where (SourceIp == "10.1.2.20" or SourceIp == "10.3.1.20") and (DestinationIp == "10.1.2.20" or DestinationIp == "10.3.1.20")
| where TimeGenerated > now(-7d)
| summarize arg_max(TimeGenerated, SourceIp, DestinationIp, Direction, ProcessName) by DestinationPort

With workbooks, you can create dynamic reports to visualize collected data. This is very useful in migration scenarios when building network traffic rules or needs to see dependencies between servers quickly. The picture below shows an example Workbook (download here) showing all traffic for a specific server and a summary (total MB) of network traffic per network port.

Return data only during office hours and workdays

Today I want to share a log query that only returns logs generated between 09 and 18, during workdays. The example is working with security events, without any filters. To improve query performances it is strongly recommended to add more filters, for example, event ID or account.

6.00:00:00 means Saturday and 7.00:00:00 means Sunday 🙂

let startDateOfAlert = startofday(now());
let StartAlertTime = startDateOfAlert + 9hours;
let StopAlertTime = startDateOfAlert + 18hours;
SecurityEvent
| extend localTimestamp = TimeGenerated + 2h
| extend ByPassDays = dayofweek(localTimestamp)
| where ByPassDays <> ‘6.00:00:00’
| where ByPassDays <> ‘7.00:00:00’
| where localTimestamp > StartAlertTime
| where localTimestamp < StopAlertTime
| order by localTimestamp asc

Setting up heartbeat failure alerts with Azure Monitor

One of the questions we receive regularly is how to use the Azure Monitor components to alert on machines that are not available, and then how to create availability reports using these tools.

My colleague Vanessa and I have been looking at the best ways of achieving this in a way that those who are migrating from tools like System Center Operations Manager would be familiar and comfortable with.

As the monitoring agent used by Azure Monitor on both Windows and Linux sends a heartbeat every minute, the easiest method to detect a server down event, regardless of server location, would be to alert on missing heartbeats. This means you can use one alert rule to notify for heartbeat failures, even if machines are hosted on-prem.

Log Ingestion Time and Latency
Before we look at the technical detail, it is worth calling out the Log Ingestion Time for Azure Monitor. This is particularly important if you are expecting heartbeat missed notifications within a specific time frame. In the Log Ingestion Time article, the following query is shared, which you can use to view the computers with the highest ingestion time over the last 8 hours. This can help you plan out the thresholds for the alerting settings.

Heartbeat
| where TimeGenerated > ago(8h)
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated | summarize percentiles(E2EIngestionLatency,50,95) by Computer
| top 20 by percentile_E2EIngestionLatency_95 desc

Alerting
You can use the following query in Logs to retrieve machines that have not sent a heartbeat in the last 5 minutes:

Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)

The query, based on Heartbeat, is good for reporting and dashboarding, but often using the Heartbeat Metric in the alert rule fields gives faster results. Read more about Metrics here. To create an alert rule based on metrics, you want to target the Workspace resource still, but, in the condition, you want to use the Heartbeat metric signal:

You will now be able to configure the alert options.

  1. Select the computers to alert on. You can choose Select All
  2. Change to Less or equal to, and enter 0 as your threshold value
  3. Select your aggregation granularity and frequency

The best results we have found during testing is an alert within two minutes of a machine shut down, with the above settings – keeping the ingestion and latency in mind.

Using these settings, you should get an alert for each unavailable machine within a few minutes after it becomes unavailable. But, as the signal relies on the heartbeat of the agent, this may also alert during maintenance times, or if the agent is stopped.

If you need an alert quickly, and you are not concerned with an alert flood, then use these settings.

However, if you want to ensure that you only alert on valid server outages, you may want to take a few additional steps. You can use Azure Automation Runbooks or Logic Apps as an alert response to perform some additional diagnostic steps, and trigger another alert based on the output. This could replicate the method used in SCOM with a Heartbeat Failure alert and a Failed to Connect alert.

If you are only monitoring Azure Hosted virtual machines, you could also use the Activity Log to look for Server Shutdown events, using the following query:

AzureActivity
| where OperationName == "Deallocate Virtual Machine" and ActivityStatus == "Succeeded"
| where TimeGenerated > ago(5m)

Reporting
Conversations about server unavailable alerts invariably lead to questions around the ability to report on Server Update/Availability. In the Logs blade, there are a few sample queries available relating to availability:

With the availability rate query by default returning the availability for monitored virtual machines for the last hour, but also providing you with an availability rate query that you can build on. This can be updated to show the availability for the last 30 days as follows:

let start_time=startofday(now()-30d);
let end_time=now();
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time

| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/1h)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets

Or, if you are storing more than one month of data, you can also modify the query to run for the previous month:

let start_time=startofmonth(datetime_add('month',-1,now()));
let end_time=endofmonth(datetime_add('month',-1,now()));
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time

| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 5m, start_time), Computer | extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/5m)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets

These queries can be used in a Workbook to create an availability report

Note that the availability report is based on heartbeats, not the actual service running on the server. For example, if multiple servers are part of an availability set or a cluster, the service might still be available even if one server is unavailable. 

Further reading



Disclaimer: Cloud is a very fast-moving target. It means that by the time you’re reading this post, everything described here could have been changed completely. The blog post is provided “AS-IS” with no warranties.

Monitor a process with Azure Monitor

A common question when working with Azure Monitor is monitoring of Windows services and processes running on Windows servers. In Azure Monitor we can monitor Windows Services and other processes the same way; by looking at process ID as a performance counter.

Even if a process can be monitored by looking at events, it is not always a reliable source. The challenge is that there is no “active monitoring” checking if the process is running at the moment when looking at only events. 

Each process writes a number of performance counters. None of these are collected by default in Azure Monitor, but easy to add under Windows Performance Counters.


The following query will show process ID for Notepad. If the Notepad process is not running, there will be no data. The alert rule, if needed, can be configured to generate an alert if zero results was returned during the last X minutes.

Perf
| where (Computer == “LND-DC-001.vnext.local”) and (CounterName == “ID Process”) and (ObjectName == “Process”)
| where InstanceName == “notepad”
| extend localTimestamp = TimeGenerated + 2h
| where TimeGenerated > ago(5m)
| project TimeGenerated , CounterValue, InstanceName
| order by TimeGenerated desc

Disclaimer: Cloud is a very fast-moving target. It means that by the time you’re reading this post everything described here could have been changed completely.
Note that this is provided “AS-IS” with no warranties at all. This is not a production-ready solution for your production environment, just an idea, and an example.