• Recommended Azure Monitors


    General

    This document describes the recommended Azure monitors which can be implemented in Azure cloud application subscriptions.

    SMT incident priority mapping

    The priority “Blocker” is mostly used by developers to prioritize their tasks, and it is not applicable to the operations team.

    Alert severity | SMT incident priority | Time frame
    0-CRITICAL | Critical | <= 4 hrs
    1-ERROR | High | <= 12 hrs
    2-WARNING | Medium | <= 48 hrs (2 days)
    3 - Informational | Low | <= 96 hrs (4 days)
    4 - Verbose | No ticket | Action based on the notification and analysis
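The mapping above can be kept as a small lookup so that alert handlers derive the SMT priority consistently. This is a minimal sketch, not an SMT integration: the function name `smt_ticket` is hypothetical, and reading the hour values as a due-by deadline counted from the time the alert fired is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Severity -> (SMT priority, time frame in hours), copied from the mapping above.
# (None, None) means no ticket is raised for that severity.
SEVERITY_TO_SMT = {
    0: ("Critical", 4),
    1: ("High", 12),
    2: ("Medium", 48),
    3: ("Low", 96),
    4: (None, None),  # Verbose: act on the notification only
}


def smt_ticket(severity, fired_at):
    """Return (priority, due_by) for an alert, or None when no ticket is needed.

    Assumption: the time frame is counted from the moment the alert fired.
    """
    priority, hours = SEVERITY_TO_SMT[severity]
    if priority is None:
        return None
    return priority, fired_at + timedelta(hours=hours)


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
    print(smt_ticket(1, fired))  # 1-ERROR alert
    print(smt_ticket(4, fired))  # 4 - Verbose alert, no ticket
```

Keeping the mapping in one dictionary means a change to the priority scheme is made in a single place.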

    Recommended Azure Monitors

    Resource | Monitor | Signal type | Condition | Frequency | Period | Severity | Notification | Remarks
    All Resources | Resource Health | Resource Health | Previous resource status = All, Current resource status = All | Always | Current status | 4 - Verbose | MS Teams | Includes all future resource groups and future resources; excludes “Virtual machine instance from VMSS”
    All Resources | Service Health | Service Health | Event types: Service issue, Planned maintenance, Health advisories, Security advisories | Always | Current status | 4 - Verbose | MS Teams | Regions: North Europe, West Europe; Services: Alerts & Metrics, Activity Logs & Alerts and 21 more
    Azure SQL Database | CPU | Metric | app_cpu_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email | -
    Azure SQL Database | CPU | Metric | app_cpu_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email | -
    Azure SQL Database | Memory | Metric | app_memory_percent > 80 | 5 mins | 1 hour | 2-WARNING | Email | -
    Azure SQL Database | Memory | Metric | app_memory_percent > 95 | 5 mins | 1 hour | 1-ERROR | MS Teams & Email | -
    Azure SQL Database | Space | Metric | allocated_data_storage greater or less than dynamic threshold | 15 mins | 1 hour | 2-WARNING | Email | -
    AKS - Node | Node CPU | Metric | node_cpu_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include = True
    AKS - Node | Node Memory | Metric | node_memory_working_set_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include = True
    AKS - Node | Node Disk | Metric | node_disk_usage_percentage > 80 | 15 mins | 1 hour | 2-WARNING | Email | Name of the node: Include = True
    AKS - Node | Node Status (NotReady, Unknown) | Metric | kube_node_status_condition > 0 | 5 mins | 15 mins | 2-WARNING | Email | -
    AKS - Pods | Pod phases (Failed, Unknown, Pending) | Metric | kube_pod_status_phase >= 1 | 5 mins | 30 mins | 2-WARNING | Email | Phase of the pod: Include = Failed, Unknown, Pending
    AKS - Pods | Unschedulable Pods | Metric | unschedulable > 1 | 15 mins | 1 hour | 2-WARNING | Email | -
    AKS - Pods | Pods ready state percentage | Metric | podReadyPercentage (preview) | - | - | 2-WARNING | Email | -
    AKS - Containers | Restarting Containers | Metric | restartingContainerCount (preview) | - | - | 2-WARNING | Email | -
    AKS - Containers | OOM killed containers | Metric | oomKilledContainerCount (preview) | - | - | 2-WARNING | Email | -
    AKS - Containers | CPU Exceeded Percentage | Metric | cpuExceededPercentage (preview) | - | - | 2-WARNING | Email | -
    AKS - Containers | Memory working set exceeded percentage | Metric | memoryWorkingSetExceededPercentage (preview) | - | - | 2-WARNING | Email | -
    Application Gateway | Unhealthy backend host | Metric | UnhealthyHostCount > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email | -
    Application Gateway | Failed Requests | Metric | FailedRequests > 100 | 5 mins | 15 mins | 2-WARNING | Email | -
    Load balancer | SNAT Connection Status Count | Metric | SnatConnectionCount >= 1 | 5 mins | 15 mins | 2-WARNING | Email | Connection State = Failed, Pending
    Public IP Addresses | Under DDoS attack or not | Metric | IfUnderDDoSAttack > 0 | 1 min | 5 mins | 0-CRITICAL | MS Teams & Email | -
    Virtual machine scale set | CPU Usage | Metric | Percentage CPU > 90 | 15 mins | 1 hour | 2-WARNING | Email | -
    Container Registry | Storage Used | Metric | StorageUsed > 90% of the storage size included in the SKU | 15 mins | 1 hour | 3 - Informational | Email | To review: which ACR SKUs provide this metric
    Logic App | RunsFailed | Metric | RunsFailed > 0 | 1 hour | 12 hours | 3 - Informational | Email | -
    Log Analytics workspace | Container SIGKILL Error | Logs | Table rows count > 0 | 15 mins | 15 mins | 2-WARNING | Email | Query: Signal KILL error
    Log Analytics workspace | WAF_Possible_DDoS_Detected | Logs Query | count_ > 1000 | 15 mins | 15 mins | 1-ERROR | MS Teams & Email | Query: WAF_Possible_DDoS_Detected
    Log Analytics workspace | Node-restart-delayed triggered by Kured | Logs Query | - | - | - | 2-WARNING | Email | Query: Node-restart-delayed
    Log Analytics workspace | Node-restart-successful-Kured Action | Logs Query | - | - | - | OBSOLETE | - | Query: Node-restart-successful
    Azure SQL Database / server | Vulnerability Scan Report | Vulnerability Scan Report | - | - | - | - | - | -
    Failure | Failure Anomalies - ETAS-BCP-PT-Forensic-Logic-App | Application Insights Smart detector | Failure Anomalies detected | - | - | 3 - Informational | - | Resource: etas-bcp-pt-forensic-logic-app
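The threshold-based metric monitors listed above follow one pattern: a metric, a comparison, a threshold, two time values, a severity and a notification channel. The sketch below models a few of them as plain data and picks the most severe rule that a sample value breaches. It does not call any Azure API. The class and function names are hypothetical, and reading the two time values in each row (for example 5 mins and 1 hour) as evaluation frequency and lookback period is an assumption, as is the sanity check that the period is not shorter than the frequency.

```python
import operator
from dataclasses import dataclass
from datetime import timedelta

OPS = {">": operator.gt, ">=": operator.ge}


@dataclass(frozen=True)
class MetricRule:
    resource: str
    metric: str
    op: str
    threshold: float
    frequency: timedelta  # assumed: how often the rule is evaluated
    period: timedelta     # assumed: lookback window of each evaluation
    severity: int         # 0 = critical ... 4 = verbose
    notify: tuple

    def __post_init__(self):
        # Assumed sanity check: a window shorter than the interval would leave gaps.
        if self.period < self.frequency:
            raise ValueError("period must not be shorter than frequency")

    def breached(self, value):
        return OPS[self.op](value, self.threshold)


# A few rows from the list above, transcribed by hand.
RULES = [
    MetricRule("Azure SQL Database", "app_cpu_percent", ">", 80,
               timedelta(minutes=5), timedelta(hours=1), 2, ("Email",)),
    MetricRule("Azure SQL Database", "app_cpu_percent", ">", 95,
               timedelta(minutes=5), timedelta(hours=1), 1, ("MS Teams", "Email")),
    MetricRule("Application Gateway", "UnhealthyHostCount", ">", 0,
               timedelta(minutes=1), timedelta(minutes=5), 0, ("MS Teams", "Email")),
]


def most_severe(metric, value, rules=RULES):
    """Return the breached rule with the lowest severity number, or None."""
    hits = [r for r in rules if r.metric == metric and r.breached(value)]
    return min(hits, key=lambda r: r.severity, default=None)


if __name__ == "__main__":
    rule = most_severe("app_cpu_percent", 97)
    print(rule.severity, rule.notify)
```

Pairing a warning threshold with an error threshold on the same metric, as the SQL CPU and memory rows do, lets a single sample escalate from Email to MS Teams & Email without a second metric.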

    Requirements

    ACR: Trigger an alert when images are created or updated in the ACR?
    SQL DB: Slow / long-running queries?
    Service Principal secret / certificate expiry?
    AKS: Check whether we can send an alert if k8s is not able to scale in a new worker node.
    Visualization of Kured/AKS alerts: Currently we don't have a dashboard / visualization for Kured alerts. An overview over time would be helpful.
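For the Service Principal secret / certificate expiry item, the core check is simple once the expiry dates are available. The sketch below is only an illustration under stated assumptions: the function name, the 30-day warning window and the sample credential names are hypothetical, and collecting the real expiry dates from the directory is deliberately left out.

```python
from datetime import date, timedelta


def expiring_soon(credentials, today, warn_days=30):
    """Return credential names that expire within warn_days of today.

    Already expired credentials are included. Results are ordered by
    expiry date, soonest first.
    """
    cutoff = today + timedelta(days=warn_days)
    due = [(expiry, name) for name, expiry in credentials.items() if expiry <= cutoff]
    return [name for _, name in sorted(due)]


if __name__ == "__main__":
    # Hypothetical sample data: name -> expiry date.
    sample = {
        "sp-deploy": date(2024, 5, 10),
        "sp-backup": date(2025, 1, 1),
        "sp-monitor": date(2024, 4, 20),
    }
    print(expiring_soon(sample, today=date(2024, 4, 25)))
```

The returned list could feed the same notification channels used for the other monitors.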



    Refer to:
    https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview (overview diagram of Container insights)
    https://learn.microsoft.com/en-us/azure/azure-monitor/alerts/alerts-overview (diagram that explains Azure Monitor alerts)

  • Original article: https://blog.csdn.net/weixin_44388689/article/details/138123287