Category Archives: Log Analytics

Contoso.se

Welcome to contoso.se! My name is Anders Bengtsson and this is my blog about Azure infrastructure and system management. I am a senior engineer in the FastTrack for Azure team, part of Azure Engineering, at Microsoft.  Contoso.se has two main purposes, first as a platform to share information with the community and the second as a notebook for myself.

Everything you read here is my own personal opinion and any code is provided "AS-IS" with no warranties.

Anders Bengtsson

From Service Map to Network Security Group

Many data center migration scenarios include moving from a central firewall to multiple smaller firewalls and network security groups. A common challenge is how to configure each network security group (NSG). What should be allowed?

One way to map out which traffic to allow is using Service Map, as shown in previous blog posts. It is also possible to take it one step further, by automatically reading Service Map data from Log Analytics and building NSG rules based on the collected data.

To show an example of this, we have put together a PowerShell script. The script reads Service Map data for a specific server and builds an NSG and NSG rules based on the read data. The NSG is then attached to the server’s network adapter. Download the script here.
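
To make the idea concrete, here is a minimal sketch in Python of the core step: turning observed connections into a de-duplicated list of allow rules. This is not the PowerShell script mentioned above. The `Connection` record shape, the `build_inbound_rules` helper, and the rule dictionary keys are hypothetical names chosen for illustration, and nothing here calls Azure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One observed inbound TCP connection (hypothetical record shape)."""
    source_ip: str
    destination_port: int


def build_inbound_rules(connections, start_priority=100, step=10):
    """Return one allow rule per unique (source, port) pair."""
    # Remove duplicate observations and sort for a stable priority order.
    unique = sorted(set(connections), key=lambda c: (c.destination_port, c.source_ip))
    rules = []
    for index, conn in enumerate(unique):
        rules.append({
            "name": f"Allow-{conn.source_ip.replace('.', '-')}-{conn.destination_port}",
            "priority": start_priority + index * step,
            "direction": "Inbound",
            "access": "Allow",
            "protocol": "Tcp",  # Service Map data only covers TCP traffic
            "source_address_prefix": conn.source_ip,
            "destination_port_range": str(conn.destination_port),
        })
    return rules


observed = [
    Connection("10.0.1.4", 443),
    Connection("10.0.1.4", 443),   # duplicate observation, collapsed below
    Connection("10.0.2.7", 1433),
]
rules = build_inbound_rules(observed)
for rule in rules:
    print(rule["priority"], rule["name"])
```

Leaving gaps between priorities makes it easier to slot in manual rules later.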

Of course, there are some risks with this. For example, if there is an “evil process” running on the server and communicating on the network, then there will be an NSG rule for this too. Also, Service Map only collects data for TCP traffic, not UDP. Finally, the script expects the server to already exist in Azure, so you will not be able to use it to create NSG rules for servers that have not been migrated yet.

Thanks to Vanessa for good conversation and ideas 🙂

Disclaimer: Cloud is a very fast-moving target. It means that by the time you’re reading this post, everything described here could have been changed completely. The blog post is provided “AS-IS” with no warranties.

Return data only during office hours and workdays

Today I want to share a log query that only returns logs generated between 09:00 and 18:00 on workdays. The example works with security events, without any filters. To improve query performance, it is strongly recommended to add more filters, for example event ID or account.

dayofweek() returns the number of days since the preceding Sunday as a timespan, so 6.00:00:00 means Saturday and 0.00:00:00 means Sunday 🙂

let startDateOfAlert = startofday(now());
let StartAlertTime = startDateOfAlert + 9h;
let StopAlertTime = startDateOfAlert + 18h;
SecurityEvent
| extend localTimestamp = TimeGenerated + 2h
| extend ByPassDays = dayofweek(localTimestamp)
| where ByPassDays != time(6.00:00:00)
| where ByPassDays != time(0.00:00:00)
| where localTimestamp > StartAlertTime
| where localTimestamp < StopAlertTime
| order by localTimestamp asc
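
The same filter logic can be sketched outside Log Analytics, for example when post-processing exported records. The Python snippet below is an illustration only: `in_office_hours` is a hypothetical helper, the fixed two-hour offset is the same assumption the query makes, and it compares whole hours rather than exact timestamps.

```python
from datetime import datetime, timedelta, timezone

UTC_OFFSET = timedelta(hours=2)  # assumption: fixed offset, as in the query


def in_office_hours(time_generated_utc, start_hour=9, stop_hour=18):
    """Return True if the UTC timestamp falls on a workday between start and stop hour, local time."""
    local = time_generated_utc + UTC_OFFSET
    is_workday = local.weekday() < 5  # Python: Monday=0 ... Sunday=6
    return is_workday and start_hour <= local.hour < stop_hour


monday_morning = datetime(2019, 7, 15, 8, 0, tzinfo=timezone.utc)   # 10:00 local
saturday_noon = datetime(2019, 7, 13, 10, 0, tzinfo=timezone.utc)   # 12:00 local
print(in_office_hours(monday_morning), in_office_hours(saturday_noon))
```

Note that Python numbers weekdays from Monday, while dayofweek() in the query counts from Sunday.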

Monitoring Windows services with Azure Monitor

Another question we are asked regularly is how to use the Azure Monitor tools to create visibility on Windows service health. One of the best options for monitoring services across Windows and Linux uses the Change Tracking solution in Azure Automation.

The solution can track changes on both Windows and Linux. On Windows, it supports tracking changes on files, registry keys, services, and installed software. On Linux, it tracks changes to files, software, and daemons. There are several ways to onboard the solution: from a virtual machine, from an Automation account, or with an Azure Automation runbook. Read more about Change tracking and how to onboard at Microsoft Docs.

This blog post will focus on monitoring a Windows service, but the concept works the same for Linux daemons.

Changes to Windows services are collected every 30 minutes by default, but collection can be configured to run as often as every 10 seconds. It is important to note that the agent only tracks changes, not the current state. If there is no change, then no data is sent to Log Analytics and Azure Automation. Collecting only changes optimizes the performance of the agent.

Query collected data

To list the latest collected data, we can run the following query. Note that we use “let” to set the offset between UTC (the default time zone in Log Analytics) and our local time zone. An important thing to remember is what we said earlier: only changes are reported. In the example below, we can see that on 2019-07-15 the service changed state to running, but after this record we have no information. If the VM suddenly crashes, there is a risk that no “Stopped” event will be reported, and from a logging perspective it will look like the service is still running.

It is therefore important to monitor from more than one point of view. In this example, that means also monitoring the heartbeat from the VM.

let utcoffset = 2h; // difference between local time zone and UTC
ConfigurationData
| where ConfigDataType == "WindowsServices"
| where SvcDisplayName == "Print Spooler"
| extend localTimestamp = TimeGenerated + utcoffset
| project localTimestamp, Computer, SvcDisplayName, SvcState
| order by localTimestamp desc
| summarize arg_max(localTimestamp, *) by SvcDisplayName

Configure alert on service changes

As with other collected data, it is possible to configure an alert rule based on service changes. Below is a query that can be used to alert if the Print Spooler service is stopped. For the steps to configure the alert rule, see Microsoft Docs.

ConfigurationChange
| where ConfigChangeType == "WindowsServices" and SvcDisplayName == "Print Spooler" and SvcState == "Stopped"

You may be tempted to use a query to look for Event 7036 in the System log instead, but there are a few reasons why we would recommend that you use the ConfigurationChange data:

  • To be able to alert on Event 7036, you need to collect informational-level events from the System log across all Windows servers, which quickly becomes impractical when you have a large number of virtual machines
  • It requires more complex queries to alert on specific services
  • It is only available on Windows servers

Workbook report

With Azure Monitor workbooks, we can create interactive reports based on collected data. Read more about Workbooks at Microsoft Docs.

For our service monitoring scenario, this is a great way to build a report of current status and a dashboard.

The following query can be used to list the latest event for each Windows service on each server. With the case() function, we can display 1 for running services and 0 for stopped services.

let utcoffset = 2h; // difference between local time zone and UTC
ConfigurationData
| where ConfigDataType == "WindowsServices"
| extend localTimestamp = TimeGenerated + utcoffset
| extend Status = case(SvcState == "Stopped", "0",
    SvcState == "Running", "1",
    "NA"
    )
| project localTimestamp, Computer, SvcDisplayName, Status
| summarize arg_max(localTimestamp, *) by Computer, SvcDisplayName
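
For readers less familiar with arg_max, the Python sketch below shows the same idea on a handful of in-memory records: keep only the newest record per computer and service, then map the state to "1", "0", or "NA". The sample records and the helper names `latest_state` and `status_flag` are made up for illustration.

```python
from datetime import datetime


def latest_state(records):
    """Keep the newest record per (Computer, SvcDisplayName), similar to arg_max."""
    latest = {}
    for rec in records:
        key = (rec["Computer"], rec["SvcDisplayName"])
        if key not in latest or rec["TimeGenerated"] > latest[key]["TimeGenerated"]:
            latest[key] = rec
    return latest


def status_flag(state):
    """Map a service state to the same values the case() expression produces."""
    return {"Stopped": "0", "Running": "1"}.get(state, "NA")


# Hypothetical sample data
records = [
    {"TimeGenerated": datetime(2019, 7, 15, 8, 0), "Computer": "vm01",
     "SvcDisplayName": "Print Spooler", "SvcState": "Stopped"},
    {"TimeGenerated": datetime(2019, 7, 15, 9, 30), "Computer": "vm01",
     "SvcDisplayName": "Print Spooler", "SvcState": "Running"},
    {"TimeGenerated": datetime(2019, 7, 15, 9, 0), "Computer": "vm02",
     "SvcDisplayName": "Print Spooler", "SvcState": "Stopped"},
]
report = {key: status_flag(rec["SvcState"]) for key, rec in latest_state(records).items()}
print(report)
```

As with the query, the result only reflects the last reported change, not a live state.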

1 and 0 can easily be used as thresholds in a workbook to colour cells depending on status.

Workbooks can also be pinned to an Azure Dashboard, either all parts of a workbook or just some parts of it.

Setting up heartbeat failure alerts with Azure Monitor

One of the questions we receive regularly is how to use the Azure Monitor components to alert on machines that are not available, and then how to create availability reports using these tools.

My colleague Vanessa and I have been looking at the best ways of achieving this in a way that those who are migrating from tools like System Center Operations Manager would be familiar and comfortable with.

As the monitoring agent used by Azure Monitor on both Windows and Linux sends a heartbeat every minute, the easiest method to detect a server down event, regardless of server location, would be to alert on missing heartbeats. This means you can use one alert rule to notify for heartbeat failures, even if machines are hosted on-prem.

Log Ingestion Time and Latency
Before we look at the technical detail, it is worth calling out the Log Ingestion Time for Azure Monitor. This is particularly important if you are expecting heartbeat missed notifications within a specific time frame. In the Log Ingestion Time article, the following query is shared, which you can use to view the computers with the highest ingestion time over the last 8 hours. This can help you plan out the thresholds for the alerting settings.

Heartbeat
| where TimeGenerated > ago(8h)
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated
| summarize percentiles(E2EIngestionLatency,50,95) by Computer
| top 20 by percentile_E2EIngestionLatency_95 desc

Alerting
You can use the following query in Logs to retrieve machines that have not sent a heartbeat in the last 5 minutes:

Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
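
The logic of this query is simple enough to sketch in a few lines of Python, which can help when explaining it to colleagues coming from other tools. The sample heartbeats and the `missing_heartbeats` helper below are hypothetical and only illustrate the technique.

```python
from datetime import datetime, timedelta


def missing_heartbeats(heartbeats, now, threshold=timedelta(minutes=5)):
    """Return computers whose latest heartbeat is older than the threshold."""
    last_seen = {}
    for computer, timestamp in heartbeats:
        if computer not in last_seen or timestamp > last_seen[computer]:
            last_seen[computer] = timestamp
    return sorted(c for c, ts in last_seen.items() if ts < now - threshold)


now = datetime(2019, 7, 15, 12, 0)
heartbeats = [
    ("vm01", datetime(2019, 7, 15, 11, 50)),
    ("vm01", datetime(2019, 7, 15, 11, 59)),      # still reporting
    ("vm02", datetime(2019, 7, 15, 11, 50)),      # silent for 10 minutes
    ("vm03", datetime(2019, 7, 15, 11, 54, 59)),  # just over 5 minutes
]
print(missing_heartbeats(heartbeats, now))
```

One limitation is shared with the query: a machine that has never sent a heartbeat in the data being examined does not show up at all.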

The query, based on Heartbeat, is good for reporting and dashboarding, but an alert rule based on the Heartbeat metric often gives faster results. Read more about Metrics here. To create an alert rule based on metrics, you still target the workspace resource, but in the condition you use the Heartbeat metric signal.

You will now be able to configure the alert options.

  1. Select the computers to alert on. You can choose Select All
  2. Change to Less or equal to, and enter 0 as your threshold value
  3. Select your aggregation granularity and frequency

The best result we saw during testing was an alert within two minutes of a machine shutting down with the above settings, keeping ingestion time and latency in mind.

Using these settings, you should get an alert for each unavailable machine within a few minutes after it becomes unavailable. But, as the signal relies on the heartbeat of the agent, this may also alert during maintenance times, or if the agent is stopped.

If you need an alert quickly, and you are not concerned with an alert flood, then use these settings.

However, if you want to ensure that you only alert on valid server outages, you may want to take a few additional steps. You can use Azure Automation Runbooks or Logic Apps as an alert response to perform some additional diagnostic steps, and trigger another alert based on the output. This could replicate the method used in SCOM with a Heartbeat Failure alert and a Failed to Connect alert.

If you are only monitoring Azure-hosted virtual machines, you could also use the Activity Log to look for server shutdown events, using the following query:

AzureActivity
| where OperationName == "Deallocate Virtual Machine" and ActivityStatus == "Succeeded"
| where TimeGenerated > ago(5m)

Reporting
Conversations about server unavailable alerts invariably lead to questions about the ability to report on server uptime/availability. In the Logs blade, there are a few sample queries relating to availability.

The availability rate sample query returns, by default, the availability of monitored virtual machines for the last hour, and it gives you a query to build on. It can be updated to show the availability for the last 30 days as follows:

let start_time=startofday(now()-30d);
let end_time=now();
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time
| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 1h, start_time), Computer
| extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/1h)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets
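
To show how the numbers in this query come together, here is a small Python sketch of the same calculation for a single computer. It mirrors the query's bucket count (rounded number of buckets plus one). The sample timestamps and the `availability_rate` helper are hypothetical.

```python
from datetime import datetime, timedelta


def availability_rate(heartbeats, start_time, end_time, bucket=timedelta(hours=1)):
    """Percentage of time buckets that contain at least one heartbeat."""
    # Index of the bucket each heartbeat falls in, counted from start_time.
    buckets_with_data = {
        (ts - start_time) // bucket
        for ts in heartbeats
        if start_time < ts < end_time
    }
    # Same bucket count as the query: round((end - start) / bucket) + 1
    total_buckets = round((end_time - start_time) / bucket) + 1
    return len(buckets_with_data) * 100 / total_buckets


start = datetime(2019, 7, 1)
end = start + timedelta(hours=9)
# Heartbeats during the first five hours only, then the machine goes silent.
sample = [start + timedelta(hours=h, minutes=30) for h in range(5)]
sample.append(start + timedelta(minutes=45))  # second heartbeat in the first hour
print(availability_rate(sample, start, end))
```

A bucket counts as available as soon as it holds one heartbeat, so short outages inside a bucket are not visible. Smaller buckets give a more precise rate at the cost of a heavier query.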

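Or, if you are storing more than one month of data, you can also modify the query to run for the previous month. Note that this version uses 5-minute buckets, so the column names ending in “hour” and “hours” actually count 5-minute intervals: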

let start_time=startofmonth(datetime_add('month',-1,now()));
let end_time=endofmonth(datetime_add('month',-1,now()));
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time
| summarize heartbeat_per_hour=count() by bin_at(TimeGenerated, 5m, start_time), Computer
| extend available_per_hour=iff(heartbeat_per_hour>0, true, false)
| summarize total_available_hours=countif(available_per_hour==true) by Computer
| extend total_number_of_buckets=round((end_time-start_time)/5m)+1
| extend availability_rate=total_available_hours*100/total_number_of_buckets

These queries can be used in a Workbook to create an availability report.

Note that the availability report is based on heartbeats, not the actual service running on the server. For example, if multiple servers are part of an availability set or a cluster, the service might still be available even if one server is unavailable. 


Inside Azure Management [e-book]

[March 27, 2019] We are excited to announce that the Preview release of Inside Azure Management is now available, with more than 500 pages covering many of the latest monitoring and management features in Microsoft Azure!

This FREE e-book is written by Microsoft MVPs Tao Yang, Stanislav Zhelyazkov, Pete Zerger, and Kevin Greene, along with Anders Bengtsson.

Description: “Inside Azure Management” is the sequel to “Inside the Microsoft Operations Management Suite”, featuring an end-to-end deep dive into the full range of Azure management features and functionality, complete with downloadable sample scripts.

The chapter list in this edition is shown below:

  • Chapter 1 – Intro
  • Chapter 2 – Implementing Governance in Azure
  • Chapter 3 – Migrating Workloads to Azure
  • Chapter 4 – Configuring Data Sources for Azure Log Analytics
  • Chapter 5 – Monitoring Applications
  • Chapter 6 – Monitoring Infrastructure
  • Chapter 7 – Configuring Alerting and notification
  • Chapter 8 – Monitor Databases
  • Chapter 9 – Monitoring Containers
  • Chapter 10 – Implementing Process Automation
  • Chapter 11 – Configuration Management
  • Chapter 12 – Monitoring Security-related Configuration
  • Chapter 13 – Data Backup for Azure Workloads
  • Chapter 14 – Implementing a Disaster Recovery Strategy
  • Chapter 15 – Update Management for VMs
  • Chapter 16 – Conclusion

Download your copy here

Update Service Map groups with PowerShell

Service Map automatically discovers application components on Windows and Linux systems and maps the communication between services. With Service Map, you can view your servers in the way that you think of them: as interconnected systems that deliver critical services. Service Map shows connections between servers, processes, inbound and outbound connection latency, and ports across any TCP-connected architecture, with no configuration required other than the installation of an agent. Machine Groups allow you to see maps centered around a set of servers, not just one, so you can see all the members of a multi-tier application or server cluster in one map. (Source: Microsoft Docs)

A common question is how to update machine groups in Service Map automatically. Last week my colleague Jose Moreno and I worked with Service Map and investigated how to automate machine group updates. The result was a couple of PowerShell examples showing how to create and maintain machine groups with PowerShell. You can find all the examples on Jose's GitHub page. With these scripts we can now use a source, for example Active Directory groups, to set up and update machine groups in Service Map.

Building reports with Log Analytics data

A common question I see is how to present the data collected with Log Analytics. We can use View Designer in Log Analytics, Power BI, Azure Dashboard, and Excel PowerPivot. But in this blog post, I would like to show another way to build a “report” directly in the Azure Portal for Log Analytics data.

Workbooks is a feature in Application Insights to build interactive reports. Workbooks are configured under Application Insights but it’s possible to access data from Log Analytics.

In this example, we will build a workbook for failed logins in Active Directory. The source data (event Id 4625) is collected by the Security and Audit solution in Log Analytics.

If we run a query in Log Analytics to show these events, we can easily see failed login reason and number of events. But we would also like to drill down into these events and see account names. That is not possible in Log Analytics today, and this is where workbooks can bring value.

Any Application Insights instance can be used; no data needs to be collected by the instance (no extra cost) as we will use Log Analytics as a data source. In Application Insights, there are some default workbooks and quick start templates. For this example, we will use the “Default Template.”

In the workbook, we can use any Log Analytics workspace, in any subscription, as a source. It is also possible to use different workspaces for different parts of the workbook. The query used in this example is shown below; note that it returns data for the last 30 days.

SecurityEvent
| where AccountType == 'User' and EventID == 4625
| where TimeGenerated > ago(30d)
| extend Reason = case(
    SubStatus == '0xc000005e', 'No logon servers available to service the logon request',
    SubStatus == '0xc0000062', 'Account name is not properly formatted',
    SubStatus == '0xc0000064', 'Account name does not exist',
    SubStatus == '0xc000006a', 'Incorrect password',
    SubStatus == '0xc000006d', 'Bad user name or password',
    SubStatus == '0xc000006e', 'User logon blocked by account restriction',
    SubStatus == '0xc000006f', 'User logon outside of restricted logon hours',
    SubStatus == '0xc0000070', 'User logon blocked by workstation restriction',
    SubStatus == '0xc0000071', 'Password has expired',
    SubStatus == '0xc0000072', 'Account is disabled',
    SubStatus == '0xc0000133', 'Clocks between DC and other computer too far out of sync',
    SubStatus == '0xc000015b', 'The user has not been granted the requested logon right at this machine',
    SubStatus == '0xc0000193', 'Account has expired',
    SubStatus == '0xc0000224', 'User is required to change password at next logon',
    SubStatus == '0xc0000234', 'Account is currently locked out',
    strcat('Unknown reason substatus: ', SubStatus))
| project TimeGenerated, Account, Reason, Computer

In the workbook, on Column Settings, we can configure how the result will be grouped together. In this example, we will group by failed login reason and then account name.

When running the workbook, we get a list of failed login reasons and can expand each one to see account names and the number of failed logins. It is possible to add an extra filter to the query to remove “noise”, for example accounts with fewer than three failed login events.
It is also possible to pin a workbook, or part of a workbook, to an Azure Dashboard to easily access the information.
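
The grouping and the “noise” filter described above can be sketched in Python to show what the workbook is doing with the query result. This is an illustration only: the sample events, the shortened `REASONS` table, and the `group_failed_logins` helper are hypothetical, and the workbook itself needs no code.

```python
from collections import Counter, defaultdict

# Shortened mapping; the query above lists more sub-status codes.
REASONS = {
    "0xc000006a": "Incorrect password",
    "0xc0000072": "Account is disabled",
    "0xc0000234": "Account is currently locked out",
}


def group_failed_logins(events, min_count=3):
    """Group (account, substatus) events by reason, then account, dropping low counts."""
    counts = Counter(
        (REASONS.get(substatus, f"Unknown reason substatus: {substatus}"), account)
        for account, substatus in events
    )
    grouped = defaultdict(dict)
    for (reason, account), count in counts.items():
        if count >= min_count:  # remove "noise" below the threshold
            grouped[reason][account] = count
    return dict(grouped)


events = (
    [("CONTOSO\\anna", "0xc000006a")] * 4
    + [("CONTOSO\\bob", "0xc000006a")] * 2      # filtered out as noise
    + [("CONTOSO\\carl", "0xc0000234")] * 3
)
print(group_failed_logins(events))
```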

In the workbook you can also add more text fields, metric fields, and query fields, for example a time chart showing the number of events per day.

Ingestion Time in Log Analytics

A common topic around Log Analytics is ingestion time. How long does it take before an event is visible in Log Analytics?
The latency depends on three main areas: agent time, pipeline time, and indexing time. This is all described in this Microsoft Docs article.

In Log Analytics, as in Kusto, each table has a hidden DateTime column called IngestionTime. The time of ingestion is recorded for each record in that hidden column, and it can be read in a query with the ingestion_time() function. TimeGenerated is a timestamp from the source system, for example a Windows server. By comparing TimeGenerated and IngestionTime, we can estimate the end-to-end latency in getting the data into Log Analytics. More info around IngestionTime policy here.

In the image below, a test event is generated on a Windows server. Note the timestamp (Logged).

When the event is in Log Analytics, we can find it and compare IngestionTime and TimeGenerated. We can see that the difference is around a second. TimeGenerated is the same as “Logged” on the source system. This is just an estimate, as the clocks on the server and in Log Analytics might not be in sync.

If we want to calculate the estimated latency, we can use the following query. It will take all events and estimate the latency in minutes, and order it by latency.

Event
| extend LatencyInMinutes = datetime_diff('minute', ingestion_time(), TimeGenerated)
| project TimeGenerated, IngestionTime = ingestion_time(), LatencyInMinutes
| order by LatencyInMinutes

You can also summarize the average latency per hour, and generate a chart from the result, with the following query. This is useful when investigating latency over a longer period of time.

Event
| extend LatencyInMinutes = datetime_diff('minute', ingestion_time(), TimeGenerated)
| project TimeGenerated, IngestionTime = ingestion_time(), LatencyInMinutes
| summarize avg(LatencyInMinutes) by bin(TimeGenerated, 1h)
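
As a sanity check on what the summarized numbers mean, the calculation can be reproduced in Python on a few records. The sample timestamps and the `average_latency_per_hour` helper are hypothetical. One difference to be aware of: this sketch uses fractional minutes, while datetime_diff('minute', ...) in the query returns whole minutes.

```python
from collections import defaultdict
from datetime import datetime


def average_latency_per_hour(records):
    """Average (ingestion time - generated time) in minutes, per hour of TimeGenerated."""
    buckets = defaultdict(list)
    for generated, ingested in records:
        hour = generated.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((ingested - generated).total_seconds() / 60)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# (TimeGenerated, ingestion time) pairs, made up for illustration
records = [
    (datetime(2019, 7, 15, 10, 5), datetime(2019, 7, 15, 10, 7)),
    (datetime(2019, 7, 15, 10, 30), datetime(2019, 7, 15, 10, 34)),
    (datetime(2019, 7, 15, 11, 0), datetime(2019, 7, 15, 11, 1, 30)),
]
for hour, latency in average_latency_per_hour(records).items():
    print(hour, latency)
```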

Disclaimer: Cloud is a very fast-moving target. It means that by the time you’re reading this post everything described here could have been changed completely.
Note that this is provided “AS-IS” with no warranties at all. This is not a production-ready solution for your production environment, just an idea, and an example.

Analyze and visualize Azure Firewall with Log Analytics View Designer

A colleague and I have put together a sample view for Log Analytics to analyze and visualize Azure Firewall logs. You can download the sample view here. The sample view visualizes application rule and network rule log data. With View Designer in Azure Log Analytics, you can create custom views to visualize data in your Log Analytics workspace. Read more about View Designer here.

Monitor Linux Daemon with Log Analytics

In this blog post I would like to share an example of how daemons on Linux machines can be monitored with Log Analytics. Monitoring daemons is not listed as a feature directly in the Log Analytics portal, but it is possible to do. When a daemon is started or stopped, a line is written to Syslog. Syslog can be read by the Microsoft Monitoring Agent and sent to Log Analytics.

The only thing to configure is to enable collection of Syslog and the daemon facility.

If the daemon is stopped (the cron daemon in this example) the following lines are written to the syslog logfile

Soon after the same lines are written to Log Analytics as events in the Syslog table

You can now configure an alert, including a notification, for when the daemon stops. The alert can, for example, be visualized in Azure Monitor and sent by e-mail.
