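PowerShell Skills Series - Automated On-Call Watch for the Spring Festival Holiday

Applies to PowerShell 5.1 and later

The Spring Festival holiday is a time for family reunions, but for IT operations teams the systems do not stop running just because people are on leave. Servers, databases, and network devices still need attention, and on-call staff are stretched thin. Covering the longest holiday with the fewest people is a classic problem every year before the festival.

The traditional approach is a shift roster, with on-call staff logging in at set times to check system status. This is inefficient, and human oversight can easily miss a critical alert. A better approach is an automated watch system in which scripts handle routine inspections, fault handling, and alert delivery, so that on-call staff step in only when something is genuinely wrong.

With support for Windows and Linux (through PowerShell Core), rich remote management capabilities, and deep .NET integration, PowerShell is well suited to this role. This article builds a holiday watch solution step by step in three parts: monitoring, automatic remediation, and alert notification.

Holiday watch monitoring system

The first requirement is an active inspection mechanism that checks key server metrics on a schedule. The following script checks free disk space, CPU utilization, and memory usage against thresholds and collects the results into structured objects for later processing.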

function Start-HolidayWatch {
param(
[string[]]$ComputerName = $env:COMPUTERNAME,
[double]$DiskThresholdPercent = 10,
[double]$CpuThresholdPercent = 90,
[double]$MemoryThresholdPercent = 90,
[pscredential]$Credential
)

$results = foreach ($computer in $ComputerName) {
$params = @{
ComputerName = $computer
ErrorAction = 'Stop'
}
if ($Credential) { $params.Credential = $Credential }

try {
# 获取操作系统信息
$os = Get-CimInstance Win32_OperatingSystem @params
$freeMemoryPercent = [math]::Round(
($os.FreePhysicalMemory / $os.TotalVisibleMemorySize) * 100, 2
)
$usedMemoryPercent = [math]::Round(100 - $freeMemoryPercent, 2)

# 获取磁盘信息
$disks = Get-CimInstance Win32_LogicalDisk -Filter 'DriveType=3' @params
$diskAlerts = foreach ($disk in $disks) {
$freePercent = [math]::Round(($disk.FreeSpace / $disk.Size) * 100, 2)
if ($freePercent -lt $DiskThresholdPercent) {
[PSCustomObject]@{
Drive = $disk.DeviceID
FreePercent = $freePercent
FreeGB = [math]::Round($disk.FreeSpace / 1GB, 2)
Status = 'WARNING'
}
}
}

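# Get CPU usage from Win32_Processor.LoadPercentage, averaged across processors (this call does no 2-second sampling of its own)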
$cpu = Get-CimInstance Win32_Processor @params
$cpuPercent = [math]::Round(
($cpu | Measure-Object -Property LoadPercentage -Average).Average, 2
)

[PSCustomObject]@{
Computer = $computer
Timestamp = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
CPUPercent = $cpuPercent
MemoryPercent = $usedMemoryPercent
DiskAlerts = $diskAlerts
CPUStatus = if ($cpuPercent -gt $CpuThresholdPercent) { 'WARNING' } else { 'OK' }
MemoryStatus = if ($usedMemoryPercent -gt $MemoryThresholdPercent) { 'WARNING' } else { 'OK' }
OverallStatus = if ($cpuPercent -gt $CpuThresholdPercent -or
$usedMemoryPercent -gt $MemoryThresholdPercent -or $diskAlerts) {
'ALERT'
} else { 'OK' }
}
}
catch {
[PSCustomObject]@{
Computer = $computer
Timestamp = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
OverallStatus = 'ERROR'
ErrorMessage = $_.Exception.Message
}
}
}

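# Summary report. @() guarantees an array, so .Count also works when exactly
# one host matches (a single PSCustomObject has no .Count in Windows PowerShell 5.1)
$alertCount = @($results | Where-Object OverallStatus -eq 'ALERT').Count
$errorCount = @($results | Where-Object OverallStatus -eq 'ERROR').Count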

[PSCustomObject]@{
ScanTime = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
TotalHosts = $ComputerName.Count
Alerts = $alertCount
Errors = $errorCount
Details = $results
}
}

# 执行监控——可以配合计划任务每 15 分钟运行一次
$watchParams = @{
ComputerName = 'SRV-WEB01', 'SRV-DB01', 'SRV-APP01'
DiskThresholdPercent = 15
CpuThresholdPercent = 85
MemoryThresholdPercent = 90
}
$report = Start-HolidayWatch @watchParams
$report.Details | Format-Table Computer, CPUPercent, MemoryPercent, CPUStatus, MemoryStatus, OverallStatus -AutoSize

Sample output:

Computer  CPUPercent MemoryPercent CPUStatus MemoryStatus OverallStatus
--------  ---------- ------------- --------- ------------ -------------
SRV-WEB01       12.5         62.30 OK        OK           OK
SRV-DB01        45.8         78.15 OK        OK           OK
SRV-APP01       91.2         92.50 WARNING   WARNING      ALERT

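Automatic remediation and emergency response

Monitoring is only the first step; the next level is to let the system handle common faults by itself. The script below shows three emergency actions: restarting services, cleaning up old logs, and freeing disk space. Every action is captured by Start-Transcript and returned as an action record. Note that the script has no rollback mechanism, so deleted files cannot be restored. Only the service check honors -ComputerName (through Get-Service -ComputerName, which exists in Windows PowerShell 5.1 but not in PowerShell 7); the log and disk cleanup steps act on the machine where the script runs.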

function Invoke-AutoRemediation {
param(
[string]$ComputerName = $env:COMPUTERNAME,
[string[]]$CriticalServices = @('W3SVC', 'MSSQLSERVER', 'WinRM'),
[double]$LogCleanupThresholdGB = 2,
[string]$LogPath = 'C:\Logs',
[string]$TranscriptPath = 'C:\HolidayWatch\Remediation.log'
)

# 记录所有操作
Start-Transcript -Path $TranscriptPath -Append -Force
$actions = @()

foreach ($svc in $CriticalServices) {
try {
$service = Get-Service -Name $svc -ComputerName $ComputerName -ErrorAction Stop
if ($service.Status -ne 'Running') {
Write-Warning "服务 [$svc] 状态异常: $($service.Status),尝试启动..."
$service | Start-Service -ErrorAction Stop
$actions += [PSCustomObject]@{
Time = Get-Date -Format 'HH:mm:ss'
Action = 'RestartService'
Target = $svc
Result = 'SUCCESS'
}
Write-Host "服务 [$svc] 已成功启动" -ForegroundColor Green
}
}
catch {
$actions += [PSCustomObject]@{
Time = Get-Date -Format 'HH:mm:ss'
Action = 'RestartService'
Target = $svc
Result = "FAILED: $($_.Exception.Message)"
}
Write-Error "服务 [$svc] 启动失败: $($_.Exception.Message)"
}
}

# 清理旧日志文件
if (Test-Path $LogPath) {
$oldLogs = Get-ChildItem -Path $LogPath -Recurse -File |
Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) }

$totalSize = ($oldLogs | Measure-Object -Property Length -Sum).Sum / 1GB
if ($totalSize -gt $LogCleanupThresholdGB) {
Write-Warning "日志目录占用 $([math]::Round($totalSize, 2)) GB,超过阈值,开始清理..."
$oldLogs | Remove-Item -Force -ErrorAction SilentlyContinue
$freed = [math]::Round($totalSize, 2)
$actions += [PSCustomObject]@{
Time = Get-Date -Format 'HH:mm:ss'
Action = 'CleanLogs'
Target = $LogPath
Result = "SUCCESS - 释放 ${freed} GB"
}
}
}

# 磁盘空间应急释放:清理临时文件和回收站
$systemDrive = Get-CimInstance Win32_LogicalDisk -Filter 'DeviceID="C:"'
$freePercent = [math]::Round(($systemDrive.FreeSpace / $systemDrive.Size) * 100, 2)

if ($freePercent -lt 15) {
Write-Warning "C 盘剩余空间仅 $freePercent%,执行应急清理..."
$tempPaths = @(
"$env:TEMP\*",
'C:\Windows\Temp\*',
'C:\Windows\SoftwareDistribution\Download\*'
)
foreach ($path in $tempPaths) {
Remove-Item -Path $path -Recurse -Force -ErrorAction SilentlyContinue
}
# 清理回收站
Clear-RecycleBin -DriveLetter C -Force -ErrorAction SilentlyContinue

$newFree = [math]::Round(
((Get-CimInstance Win32_LogicalDisk -Filter 'DeviceID="C:"').FreeSpace /
(Get-CimInstance Win32_LogicalDisk -Filter 'DeviceID="C:"').Size) * 100, 2
)
$actions += [PSCustomObject]@{
Time = Get-Date -Format 'HH:mm:ss'
Action = 'EmergencyDiskCleanup'
Target = 'C:'
Result = "释放空间: $freePercent% -> $newFree%"
}
}

Stop-Transcript
return $actions
}

# 检测到告警后自动触发修复
$report = Start-HolidayWatch -ComputerName 'SRV-APP01'
if ($report.Details.OverallStatus -eq 'ALERT') {
Write-Host "检测到告警,启动自动修复流程..." -ForegroundColor Yellow
$remediation = Invoke-AutoRemediation -ComputerName 'SRV-APP01'
$remediation | Format-Table Time, Action, Target, Result -AutoSize
}

Sample output:

检测到告警,启动自动修复流程...
Time  Action               Target  Result
----  ------               ------  ------
14:32 RestartService       W3SVC   SUCCESS
14:33 CleanLogs            C:\Logs SUCCESS - 释放 3.45 GB
14:34 EmergencyDiskCleanup C:      释放空间: 11.20% -> 18.65%

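Alert notification and on-call management

Once monitoring has found a problem and automatic remediation has been attempted, on-call staff must be notified promptly. The following script integrates three alert channels (email, a WeCom webhook, and a DingTalk webhook) and supports on-call rotation and alert escalation.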

function Send-HolidayAlert {
param(
[Parameter(Mandatory)]
[string]$Title,

[Parameter(Mandatory)]
[string]$Message,

[ValidateSet('Mail', 'WeCom', 'DingTalk', 'All')]
[string]$Channel = 'All',

# 邮件参数
[string]$SmtpServer = 'smtp.company.com',
[int]$SmtpPort = 587,
[string]$MailFrom = 'ops-holiday@company.com',
[string[]]$MailTo = @('oncall@company.com'),

# Webhook URL
[string]$WeComWebhookUrl,
[string]$DingTalkWebhookUrl,

# 告警级别
[ValidateSet('Info', 'Warning', 'Critical')]
[string]$Severity = 'Warning'
)

$severityEmoji = @{
Info = '[INFO]'
Warning = '[WARN]'
Critical = '[CRIT]'
}
$prefix = $severityEmoji[$Severity]
$body = "$prefix $Title`n`n$Message`n`n时间: $(Get-Date -Format 'yyyy-MM-dd HH:mm:ss')"

# 邮件通知
if ($Channel -in 'Mail', 'All') {
try {
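# -ErrorAction Stop turns SMTP failures into terminating errors so the catch block actually runs
Send-MailMessage -From $MailFrom -To $MailTo -Subject "$prefix $Title" `
-Body $body -SmtpServer $SmtpServer -Port $SmtpPort -Encoding UTF8 -ErrorAction Stop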
Write-Host "邮件告警已发送至: $($MailTo -join ', ')" -ForegroundColor Green
}
catch {
Write-Warning "邮件发送失败: $($_.Exception.Message)"
}
}

# 企业微信通知
if ($Channel -in 'WeCom', 'All' -and $WeComWebhookUrl) {
$wecomBody = @{
msgtype = 'text'
text = @{ content = $body }
} | ConvertTo-Json -Compress

try {
Invoke-RestMethod -Uri $WeComWebhookUrl -Method Post `
-Body $wecomBody -ContentType 'application/json' | Out-Null
Write-Host '企业微信告警已发送' -ForegroundColor Green
}
catch {
Write-Warning "企业微信发送失败: $($_.Exception.Message)"
}
}

# 钉钉通知
if ($Channel -in 'DingTalk', 'All' -and $DingTalkWebhookUrl) {
$dingBody = @{
msgtype = 'text'
text = @{ content = $body }
} | ConvertTo-Json -Compress

try {
Invoke-RestMethod -Uri $DingTalkWebhookUrl -Method Post `
-Body $dingBody -ContentType 'application/json' | Out-Null
Write-Host '钉钉告警已发送' -ForegroundColor Green
}
catch {
Write-Warning "钉钉发送失败: $($_.Exception.Message)"
}
}
}

function Get-CurrentOnCall {
param(
[hashtable]$Schedule = @{
# Entries on the same line must be separated by semicolons, not commas
'2026-02-08' = '张三'; '2026-02-09' = '张三'
'2026-02-10' = '李四'; '2026-02-11' = '李四'
'2026-02-12' = '王五'; '2026-02-13' = '王五'
'2026-02-14' = '张三'
},
[hashtable]$Contacts = @{
'张三' = @{ Email = 'zhangsan@company.com'; Phone = '138-0001-0001' }
'李四' = @{ Email = 'lisi@company.com'; Phone = '138-0002-0002' }
'王五' = @{ Email = 'wangwu@company.com'; Phone = '138-0003-0003' }
}
)

$today = Get-Date -Format 'yyyy-MM-dd'
$person = $Schedule[$today]
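if (-not $person) {
# No entry for today: fall back to the most recent earlier date in the schedule
$nearest = $Schedule.Keys | Sort-Object | Where-Object { $_ -le $today } | Select-Object -Last 1
# Indexing a hashtable with $null throws, so fail with a clear message instead
if (-not $nearest) { throw "值班表中没有 $today 或更早的排班记录" }
$person = $Schedule[$nearest]
}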

[PSCustomObject]@{
Date = $today
OnCall = $person
Email = $Contacts[$person].Email
Phone = $Contacts[$person].Phone
}
}

# 告警升级:Critical 级别同时通知值班人员和运维经理
$onCall = Get-CurrentOnCall
$alertParams = @{
Title = 'SRV-APP01 CPU 使用率持续超过 90%'
Message = "服务器 SRV-APP01 CPU 使用率 91.2%,内存使用率 92.5%。`n自动修复已尝试重启服务。`n当前值班: $($onCall.OnCall) ($($onCall.Phone))"
Channel = 'All'
Severity = 'Critical'
MailTo = @($onCall.Email, 'ops-manager@company.com')
}
Send-HolidayAlert @alertParams

Sample output:

邮件告警已发送至: zhangsan@company.com, ops-manager@company.com
企业微信告警已发送
钉钉告警已发送

Date       OnCall Email                Phone
----       ------ -----                -----
2026-02-08 张三   zhangsan@company.com 138-0001-0001

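Full deployment recommendations

After integrating the three modules above, you can create a scheduled task that runs an inspection round every 15 minutes during the holiday. The core logic is shown below. Note that the task launches pwsh.exe (PowerShell 7); because Get-Service -ComputerName exists only in Windows PowerShell 5.1, switch the executable to powershell.exe if Run-Watch.ps1 calls Invoke-AutoRemediation as written.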

# Deploy-HolidayWatch.ps1 - 注册计划任务
$action = New-ScheduledTaskAction -Execute 'pwsh.exe' -Argument '-File "C:\HolidayWatch\Run-Watch.ps1"'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15)
$settings = New-ScheduledTaskSettingsSet -AllowStartIfOnBatteries -DontStopIfGoingOnBatteries -StartWhenAvailable
Register-ScheduledTask -TaskName 'HolidayWatch-2026' -Action $action -Trigger $trigger -Settings $settings -RunLevel Highest -Description '春节假期自动化值守任务'

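Notes

  1. Credential security: store credentials used for remote management in Windows Credential Manager or Azure Key Vault, never in plain text inside a script. You can obtain them interactively with Get-Credential, or store them encrypted with Export-Clixml (on Windows the password is protected with DPAPI and can only be decrypted by the same user on the same machine).
  2. Network reachability: make sure the jump host that runs the scripts can reach the target servers, and that WinRM (ports 5985/5986) or SSH is configured correctly and allows remote connections.
  3. Alert storm protection: set an alert cooldown (for example, do not resend the same alert within 30 minutes) so that transient flapping does not bury on-call staff in duplicate notifications.
  4. Log persistence: record every inspection and remediation action in a persistent log file. Writing to both a local file and a central log platform (such as ELK) makes the post-holiday review easier.
  5. Remediation boundaries: automatic remediation should only perform known safe operations (such as restarting services or cleaning temporary files). Never let a script automatically perform high-risk operations such as dropping a database or rebooting a server.
  6. Pre-holiday drill: run at least one full end-to-end drill before the holiday, covering simulated alert triggering, automatic remediation, and notification delivery, to confirm that every link works.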
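
The cooldown described in note 3 can be sketched as follows. This is a minimal illustration, not part of the scripts above: `Test-AlertCooldown`, its parameters, and the alert key format are hypothetical names, and the state lives in an in-memory hashtable, so a scheduled task would need to persist it between runs (for example in a JSON file).

```powershell
# Hypothetical helper: returns $true if the alert may be sent, $false if the
# same alert key was already sent inside the cooldown window.
function Test-AlertCooldown {
    param(
        [Parameter(Mandatory)][string]$AlertKey,
        [Parameter(Mandatory)][hashtable]$State,
        [timespan]$Cooldown = (New-TimeSpan -Minutes 30),
        [datetime]$Now = (Get-Date)
    )

    if ($State.ContainsKey($AlertKey) -and ($Now - $State[$AlertKey]) -lt $Cooldown) {
        return $false              # sent recently: suppress the duplicate
    }

    $State[$AlertKey] = $Now       # remember when this alert was last sent
    return $true
}

# Usage with fixed timestamps so the behaviour is easy to follow
$state = @{}
$t0 = [datetime]'2026-02-08T10:00:00'
$first  = Test-AlertCooldown -AlertKey 'SRV-APP01/CPU' -State $state -Now $t0
$second = Test-AlertCooldown -AlertKey 'SRV-APP01/CPU' -State $state -Now $t0.AddMinutes(10)
$third  = Test-AlertCooldown -AlertKey 'SRV-APP01/CPU' -State $state -Now $t0.AddMinutes(31)
"$first $second $third"
```

Here the first call is allowed, the second is suppressed because only 10 minutes have passed, and the third is allowed again once the 30-minute window has elapsed. A caller would wrap its notification call in `if (Test-AlertCooldown ...) { ... }`.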

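PowerShell Skills Series - Azure Monitor Alerts

Applies to PowerShell 5.1 and later

Azure Monitor is the core observability service of the Microsoft Azure platform, providing metric collection, log analytics, and alert notification in one place. As cloud-native architectures grow more complex, operations teams often manage monitoring policies for dozens or even hundreds of Azure resources. Configuring alert rules one by one in the Azure portal is slow and error-prone, whereas managing alerts with PowerShell makes alert policies standardized, versioned, and deployable in bulk.

This article shows how to work with Azure Monitor alerts using PowerShell and the Az module: viewing existing alert rules, creating metric alerts, configuring action groups for notification delivery, and managing alert rules in bulk. These techniques suit day-to-day operations automation and also fit into infrastructure as code (IaC) workflows, so that monitoring policies are updated together with application deployments.

Before you start, make sure the Az PowerShell module is installed and you have authenticated to Azure. The cmdlets in all examples call the Azure Resource Manager (ARM) REST API and require Monitoring Contributor permission on the target resource group or subscription.

Connecting to Azure and retrieving resources to monitor

The first step is to connect to your Azure account and retrieve information about the target resources. Alert rules are bound to specific Azure resources (virtual machines, App Service, SQL databases, and so on), so we first confirm the subscription context and list the resources to monitor.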

# Install the Az modules used in this article (Az.Compute provides Get-AzVM)
Install-Module -Name Az.Accounts, Az.Monitor, Az.Resources, Az.Compute -Force -Scope CurrentUser

# 连接 Azure 账户
Connect-AzAccount

# 获取当前订阅信息
$context = Get-AzContext
$subscriptionId = $context.Subscription.Id
Write-Host ("当前订阅: {0} ({1})" -f $context.Subscription.Name, $subscriptionId)

# 查询目标资源组中的虚拟机
$resourceGroupName = "rg-production"
$vms = Get-AzVM -ResourceGroupName $resourceGroupName

Write-Host "`n===== 资源组中的虚拟机 =====" -ForegroundColor Cyan
foreach ($vm in $vms) {
$status = (Get-AzVM -ResourceGroupName $resourceGroupName -Name $vm.Name -Status).Statuses[1].DisplayStatus
Write-Host (" {0,-30} {1}" -f $vm.Name, $status)
}

# 选取第一台虚拟机的资源 ID 作为后续示例
$targetResourceId = $vms[0].Id
Write-Host ("`n目标资源 ID: {0}" -f $targetResourceId)

Sample output:

当前订阅: Production-Subscription (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

===== 资源组中的虚拟机 =====
vm-web-01 VM running
vm-web-02 VM running
vm-db-01 VM running
vm-api-01 VM deallocated

目标资源 ID: /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/rg-production/providers/Microsoft.Compute/virtualMachines/vm-web-01

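Once connected, we have the full ID of the target resource. When creating an alert rule later, this resource ID serves as the monitoring scope. To monitor metrics at the resource group or subscription level, use the resource group ID or subscription ID as the scope instead.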
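
To make the three scope levels concrete, the sketch below builds each ID as a plain string and then splits a resource ID back into its parts. The subscription GUID is a placeholder, and `rg-production` and `vm-web-01` are the example names used in this article; nothing here calls Azure.

```powershell
# Placeholder values: substitute your own subscription and resource group
$subscriptionId    = '00000000-0000-0000-0000-000000000000'
$resourceGroupName = 'rg-production'

# The three scope levels are nested prefixes of one another
$subscriptionScope  = "/subscriptions/$subscriptionId"
$resourceGroupScope = "$subscriptionScope/resourceGroups/$resourceGroupName"
$vmScope            = "$resourceGroupScope/providers/Microsoft.Compute/virtualMachines/vm-web-01"

# Split a resource ID back into its components
$parts  = $vmScope.Trim('/') -split '/'
$parsed = [PSCustomObject]@{
    SubscriptionId = $parts[1]
    ResourceGroup  = $parts[3]
    ResourceType   = '{0}/{1}' -f $parts[5], $parts[6]
    Name           = $parts[-1]
}
$parsed | Format-List
```

Because each scope is a prefix of the next, a simple `StartsWith` check is enough to tell whether a given resource falls under a resource group or subscription scope.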

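Creating a metric alert rule

Metric alerts are the most commonly used alert type in Azure Monitor. They are evaluated against the performance metrics that a resource emits. The following code creates a CPU usage alert for a virtual machine that fires when the average CPU usage over a 5-minute window exceeds 85%.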

# 定义告警规则参数
$alertRuleName = "cpu-high-vm-web-01"
$alertDescription = "虚拟机 CPU 使用率连续 5 分钟超过 85%,可能影响应用性能"
$windowSize = "PT5M"
$evaluationFrequency = "PT1M"
$threshold = 85
$operator = "GreaterThan"
$aggregation = "Average"
$severity = 2

# 获取目标虚拟机的资源 ID
$targetVm = Get-AzVM -ResourceGroupName $resourceGroupName -Name "vm-web-01"

# 创建告警条件
$condition = New-AzMetricAlertRuleV2Criteria `
-MetricName "Percentage CPU" `
-TimeAggregation $aggregation `
-Operator $operator `
-Threshold $threshold

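# Create the alert rule. The cmdlet is Add-AzMetricAlertRuleV2; -WindowSize and
# -Frequency take TimeSpan values, so the ISO 8601 strings are converted first.
# Rules are enabled by default (pass -DisableRule to create one disabled).
$alertRule = Add-AzMetricAlertRuleV2 `
-Name $alertRuleName `
-ResourceGroupName $resourceGroupName `
-WindowSize ([System.Xml.XmlConvert]::ToTimeSpan($windowSize)) `
-Frequency ([System.Xml.XmlConvert]::ToTimeSpan($evaluationFrequency)) `
-TargetResourceId $targetVm.Id `
-Condition $condition `
-Severity $severity `
-Description $alertDescription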

if ($alertRule) {
Write-Host "指标告警规则创建成功!" -ForegroundColor Green
Write-Host (" 规则名称: {0}" -f $alertRule.Name)
Write-Host (" 目标资源: {0}" -f $targetVm.Name)
Write-Host (" 监控指标: Percentage CPU")
Write-Host (" 聚合方式: {0}" -f $aggregation)
Write-Host (" 阈值: {0} {1}" -f $operator, $threshold)
Write-Host (" 检测窗口: {0}" -f $windowSize)
Write-Host (" 评估频率: {0}" -f $evaluationFrequency)
Write-Host (" 严重级别: Sev{0}" -f $severity)
}

Sample output:

指标告警规则创建成功!
规则名称: cpu-high-vm-web-01
目标资源: vm-web-01
监控指标: Percentage CPU
聚合方式: Average
阈值: GreaterThan 85
检测窗口: PT5M
评估频率: PT1M
严重级别: Sev2

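Several key parameters of a metric alert rule need tuning for the actual scenario. WindowSize determines the time span over which the metric is evaluated, and Frequency determines how often the condition is checked. A larger window produces fewer false positives but responds more slowly. For business-critical systems, a short window (PT1M to PT5M) combined with a moderate threshold balances sensitivity against stability.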
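
The relationship between the two settings can be worked out locally. The sketch below parses the ISO 8601 duration strings with the .NET `System.Xml.XmlConvert` class and derives how many evaluations fall inside one window and how many run per day; the variable names are illustrative only.

```powershell
# Parse ISO 8601 durations such as PT5M into TimeSpan values
$windowSize          = [System.Xml.XmlConvert]::ToTimeSpan('PT5M')
$evaluationFrequency = [System.Xml.XmlConvert]::ToTimeSpan('PT1M')

# How many evaluations look at (part of) the same 5-minute window
$evaluationsPerWindow = $windowSize.TotalMinutes / $evaluationFrequency.TotalMinutes

# How many times the condition is checked per day
$evaluationsPerDay = [timespan]::FromDays(1).TotalMinutes / $evaluationFrequency.TotalMinutes

# Consecutive evaluations share this much data, which is what smooths out short spikes
$overlap = $windowSize - $evaluationFrequency

"Window: $windowSize, frequency: $evaluationFrequency"
"Evaluations per window: $evaluationsPerWindow, per day: $evaluationsPerDay, overlap: $overlap"
```

With PT5M and PT1M, each evaluation shares four minutes of data with the previous one, so a spike shorter than the window is averaged down rather than firing immediately.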

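Configuring action groups for alert notification

After an alert rule fires, the notification needs a channel to reach operations staff. An Azure Monitor action group defines the response actions taken when an alert fires, such as sending email or SMS, calling a webhook, or invoking an Azure Function. The following code creates an action group with two actions: an email notification and a webhook.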

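# Create the email receiver. This is the syntax of the Az.Monitor releases that
# ship Set-AzActionGroup; newer releases replace it with
# New-AzActionGroupEmailReceiverObject and New-AzActionGroupWebhookReceiverObject.
$emailReceiver = New-AzActionGroupReceiver `
-Name "ops-team-email" `
-EmailReceiver `
-EmailAddress "ops-team@example.com"

# Create the webhook receiver (can feed an enterprise IM or incident management platform)
$webhookReceiver = New-AzActionGroupReceiver `
-Name "incident-webhook" `
-WebhookReceiver `
-ServiceUri "https://hooks.example.com/azure-alerts/incident"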

# 创建操作组
$actionGroupName = "ag-production-critical"
$actionGroupShortName = "prod-crit"

$actionGroup = Set-AzActionGroup `
-Name $actionGroupName `
-ResourceGroupName $resourceGroupName `
-ShortName $actionGroupShortName `
-Receiver $emailReceiver, $webhookReceiver

if ($actionGroup) {
Write-Host "操作组创建成功!" -ForegroundColor Green
Write-Host (" 操作组名称: {0}" -f $actionGroup.Name)
Write-Host (" 短名称: {0}" -f $actionGroup.ShortName)
Write-Host (" 接收人数量: {0}" -f $actionGroup.Receivers.Count)
Write-Host "`n 接收人详情:" -ForegroundColor Cyan
foreach ($receiver in $actionGroup.Receivers) {
Write-Host (" - {0}: {1}" -f $receiver.Name, $receiver.EmailAddress)
}
}

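# Look up the action group ID so it can be attached to an alert rule
$actionGroupId = (Get-AzActionGroup -ResourceGroupName $resourceGroupName -Name $actionGroupName).Id

# Pass this ID to Add-AzMetricAlertRuleV2 through its -ActionGroupId parameter
# when creating or updating a rule; no separate action object is required.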

Write-Host "`n操作组已就绪,可关联到告警规则。" -ForegroundColor Green
Write-Host (" Action Group ID: {0}" -f $actionGroupId)

Sample output:

操作组创建成功!
操作组名称: ag-production-critical
短名称: prod-crit
接收人数量: 2

接收人详情:
- ops-team-email: ops-team@example.com
- incident-webhook:

操作组已就绪,可关联到告警规则。
Action Group ID: /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/rg-production/providers/microsoft.insights/actionGroups/ag-production-critical

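Action groups are designed as resources independent of alert rules, so one action group can be reused by many rules. Organize action groups by team or response level, for example "Production - Critical", "Production - Warning", and "Test - Notification". When team membership changes, you only modify the action group instead of updating every alert rule.

Querying and managing alert rules in bulk

In a large Azure environment there may be dozens or even hundreds of alert rules. Checking them one by one is impractical and easy to get wrong. The following code queries the status of alert rules in bulk and generates an inventory report.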

# 查询资源组中所有指标告警规则
$alertRules = Get-AzMetricAlertRuleV2 -ResourceGroupName $resourceGroupName

Write-Host "===== 告警规则清单 =====" -ForegroundColor Cyan
Write-Host ("资源组: {0}" -f $resourceGroupName)
Write-Host ("规则总数: {0}" -f $alertRules.Count)
Write-Host ""

# 构建告警规则报告
$reportData = foreach ($rule in $alertRules) {
$targetResources = @()
foreach ($scope in $rule.Scopes) {
$parts = $scope -split "/"
$resourceName = $parts[-1]
$targetResources += $resourceName
}

$criteria = $rule.Criteria
$metricName = if ($criteria.AllOf) { $criteria.AllOf[0].MetricName } else { "N/A" }
$thresholdValue = if ($criteria.AllOf) { $criteria.AllOf[0].Threshold } else { "N/A" }
$operatorValue = if ($criteria.AllOf) { $criteria.AllOf[0].OperatorProperty } else { "N/A" }

[PSCustomObject]@{
规则名称 = $rule.Name
严重级别 = "Sev{0}" -f $rule.Severity
启用状态 = if ($rule.Enabled) { "已启用" } else { "已禁用" }
监控指标 = $metricName
阈值条件 = "{0} {1}" -f $operatorValue, $thresholdValue
检测窗口 = $rule.WindowSize
目标资源 = ($targetResources -join ", ")
操作组 = if ($rule.Actions.Count -gt 0) { "已配置 ({0} 个)" -f $rule.Actions.Count } else { "未配置" }
}
}

# 按严重级别排序并输出
$sortedReport = $reportData | Sort-Object 严重级别, 规则名称

foreach ($item in $sortedReport) {
Write-Host "---" -ForegroundColor Gray
Write-Host (" 规则名称: {0}" -f $item.规则名称)
Write-Host (" 严重级别: {0}" -f $item.严重级别)
Write-Host (" 启用状态: {0}" -f $item.启用状态)
Write-Host (" 监控指标: {0}" -f $item.监控指标)
Write-Host (" 阈值条件: {0}" -f $item.阈值条件)
Write-Host (" 检测窗口: {0}" -f $item.检测窗口)
Write-Host (" 目标资源: {0}" -f $item.目标资源)
Write-Host (" 操作组: {0}" -f $item.操作组)
}

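# Export a CSV report. The -f expression must be parenthesized, otherwise -f is parsed
# as a Join-Path parameter; GetTempPath() also works where $env:TEMP is not defined.
$reportPath = Join-Path ([System.IO.Path]::GetTempPath()) ("AlertRules-Report-{0:yyyyMMdd}.csv" -f (Get-Date))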
$sortedReport | Export-Csv -Path $reportPath -NoTypeInformation -Encoding UTF8
Write-Host ("`n报告已导出: {0}" -f $reportPath) -ForegroundColor Green

# 检查未配置操作组的规则
$noActionRules = $reportData | Where-Object { $_.操作组 -eq "未配置" }
if ($noActionRules) {
Write-Host "`n[警告] 以下告警规则未配置操作组,触发后不会发送通知:" -ForegroundColor Yellow
foreach ($rule in $noActionRules) {
Write-Host (" - {0}" -f $rule.规则名称) -ForegroundColor Yellow
}
}

Sample output:

===== 告警规则清单 =====
资源组: rg-production
规则总数: 8

---
规则名称: cpu-high-vm-api-01
严重级别: Sev2
启用状态: 已启用
监控指标: Percentage CPU
阈值条件: GreaterThan 90
检测窗口: 00:05:00
目标资源: vm-api-01
操作组: 已配置 (1 个)
---
规则名称: cpu-high-vm-web-01
严重级别: Sev2
启用状态: 已启用
监控指标: Percentage CPU
阈值条件: GreaterThan 85
检测窗口: 00:05:00
目标资源: vm-web-01
操作组: 已配置 (2 个)
---
规则名称: disk-space-vm-db-01
严重级别: Sev3
启用状态: 已启用
监控指标: OsDisk.Used
阈值条件: GreaterThan 80
检测窗口: 00:10:00
目标资源: vm-db-01
操作组: 未配置

报告已导出: /tmp/AlertRules-Report-20251106.csv

[警告] 以下告警规则未配置操作组,触发后不会发送通知:
- disk-space-vm-db-01

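Bulk review of alert rules is an important part of routine operations checks. The script deliberately includes a check for rules with no action group: such a rule notifies nobody even when it fires, so it is effectively useless. Make this script part of a regular inspection routine so that every alert rule stays in working order.

Log query alerts

Besides metric alerts, Azure Monitor supports alerts based on Log Analytics queries. Log alerts express their conditions in KQL (Kusto Query Language) and can run complex correlation analysis over structured log data. The following example creates a log alert through PowerShell to detect anomalous sign-in behavior.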

# 定义日志告警参数
$logAlertName = "anomalous-signin-detection"
$logAlertDescription = "检测 1 小时内同一账户从不同地区登录的行为,可能表示凭据泄露"
$workspaceResourceId = "/subscriptions/$subscriptionId/resourceGroups/$resourceGroupName/providers/Microsoft.OperationalInsights/workspaces/law-production"

# 构建 KQL 查询
$kqlQuery = @"
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType == 0
| summarize LoginCount = count(), DistinctLocations = dcount(Location),
Locations = make_set(Location, 10) by UserPrincipalName
| where DistinctLocations > 2
| project TimeGenerated = now(), UserPrincipalName, LoginCount, DistinctLocations, Locations
| order by DistinctLocations desc
"@

# Evaluation schedule: run every 15 minutes over a 60-minute window
$frequencyMinutes = 15
$windowMinutes = 60

# Alert condition: fire when the query returns at least one row.
# (Az.Monitor has no New-AzScheduledQueryRuleScheduleObject cmdlet; the schedule is
# passed to New-AzScheduledQueryRule as -WindowSize and -EvaluationFrequency.)
$conditionObject = New-AzScheduledQueryRuleConditionObject `
-Query $kqlQuery `
-TimeAggregation "Count" `
-Operator "GreaterThan" `
-Threshold 0 `
-FailingPeriodNumberOfEvaluationPeriod 1 `
-FailingPeriodMinFailingPeriodsToAlert 1

# Create the log alert rule
$logAlert = New-AzScheduledQueryRule `
-Name $logAlertName `
-ResourceGroupName $resourceGroupName `
-Location "eastus" `
-DisplayName "异常登录行为检测" `
-Description $logAlertDescription `
-Severity 2 `
-Enabled `
-WindowSize (New-TimeSpan -Minutes $windowMinutes) `
-EvaluationFrequency (New-TimeSpan -Minutes $frequencyMinutes) `
-CriterionAllOf $conditionObject `
-Scope $workspaceResourceId

if ($logAlert) {
Write-Host "日志查询告警规则创建成功!" -ForegroundColor Green
Write-Host (" 规则名称: {0}" -f $logAlert.Name)
Write-Host (" 显示名称: {0}" -f $logAlert.DisplayName)
Write-Host (" 严重级别: Sev{0}" -f $logAlert.Severity)
Write-Host (" 查询频率: 每 {0} 分钟" -f $frequencyMinutes)
Write-Host (" 查询窗口: {0} 分钟" -f $windowMinutes)
Write-Host (" 目标工作区: {0}" -f ($workspaceResourceId -split "/")[-1])
}

Sample output:

日志查询告警规则创建成功!
规则名称: anomalous-signin-detection
显示名称: 异常登录行为检测
严重级别: Sev2
查询频率: 每 15 分钟
查询窗口: 60 分钟
目标工作区: law-production

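Log alerts are far more flexible than metric alerts, at the price of higher Log Analytics query cost. When writing KQL, filter with where clauses as early as possible to reduce the amount of data scanned. The query time window determines how far back each query looks, and the evaluation frequency determines how often it runs; setting the two is a trade-off between detection latency and running cost.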
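
As a rough way to reason about that trade-off, the sketch below counts how many times a query runs per day and per 30 days at a few candidate frequencies. It only counts executions; it does not model Azure pricing, and the candidate frequencies are arbitrary examples.

```powershell
# Candidate evaluation frequencies in minutes (example values)
$frequenciesInMinutes = 5, 15, 60
$minutesPerDay = 24 * 60

$runs = foreach ($minutes in $frequenciesInMinutes) {
    [PSCustomObject]@{
        FrequencyMinutes = $minutes
        RunsPerDay       = [int]($minutesPerDay / $minutes)
        RunsPer30Days    = [int]($minutesPerDay / $minutes * 30)
    }
}

$runs | Format-Table -AutoSize
```

Moving from a 5-minute to a 15-minute frequency cuts the number of executions to a third (288 to 96 per day), at the cost of up to 10 extra minutes before a condition is noticed.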

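Notes

  1. Permissions: working with Azure Monitor alerts requires permissions such as Microsoft.Insights/metricAlerts/* and Microsoft.Insights/actionGroups/*. Consider creating a custom Azure role that grants no more than Monitoring Contributor level access, and avoid overly broad roles such as Owner or Contributor. In multi-team environments, use role separation to make responsibility for alert management explicit.

  2. Alert fatigue: badly chosen thresholds generate large numbers of useless alerts, and operations staff gradually start ignoring notifications. Start with relaxed thresholds for a trial week, observe how often the rules fire, and then tighten them step by step. For non-critical metrics, assign a lower severity (Sev3 or Sev4) to reduce the noise from frequent alerts.

  3. API version management: the cmdlets in the Az.Monitor module map to different versions of the ARM API, and their behavior can change when the module is updated. Pin the Az module version in production scripts (for example with RequiredVersion) and verify compatibility in a test environment before upgrading. Also follow Azure update announcements for breaking changes.

  4. Testing action groups: after creating an action group, use the "Test action group" feature in the Azure portal or trigger a test alert manually to confirm that email and webhook notifications actually arrive. A webhook endpoint may be blocked by firewall rules or authentication settings, and a successful creation result does not prove end-to-end reachability.

  5. Cost control: log alerts run Log Analytics queries, and every execution consumes scanned data. High-frequency log alerts (such as a complex query every minute) can incur significant extra cost. Set a budget alert in Azure Cost Management to track monthly spending on the Monitor service, and weigh query frequency and complexity against their benefit.

  6. Alert rules as code: keep alert rule definitions in JSON or PowerShell script files under Git version control. This records the history of every threshold change and lets you rebuild the whole monitoring setup quickly in a disaster recovery scenario. Combined with Azure DevOps Pipelines or GitHub Actions, alert rules can be deployed automatically and go through an approval workflow.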
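
A minimal sketch of note 6: rule definitions are kept as data, written to a JSON file that could be committed to Git, and read back for deployment. The file name `alert-rules.json` and the property names are assumptions made for this illustration; the loop only prints what it would deploy and does not call Azure.

```powershell
# Rule definitions as plain data (names and thresholds taken from this article's examples)
$ruleDefinitions = @(
    [PSCustomObject]@{ Name = 'cpu-high-vm-web-01'; Metric = 'Percentage CPU'; Operator = 'GreaterThan'; Threshold = 85; Severity = 2 }
    [PSCustomObject]@{ Name = 'cpu-high-vm-api-01'; Metric = 'Percentage CPU'; Operator = 'GreaterThan'; Threshold = 90; Severity = 2 }
)

# Write the definitions to a JSON file (in practice this file lives in the Git repository)
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'alert-rules.json'
$ruleDefinitions | ConvertTo-Json -Depth 3 | Set-Content -Path $path -Encoding UTF8

# Read the definitions back, as a deployment script would
$loaded = Get-Content -Path $path -Raw | ConvertFrom-Json

foreach ($rule in $loaded) {
    # A real pipeline would call the rule-creation cmdlet here
    "Would deploy {0}: {1} {2} {3} (Sev{4})" -f $rule.Name, $rule.Metric, $rule.Operator, $rule.Threshold, $rule.Severity
}

Remove-Item -Path $path
```

Keeping thresholds in a data file means a threshold change shows up as a one-line diff in code review, separate from the deployment logic.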

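PowerShell Skills Series - Notification and Alerting System

Applies to PowerShell 5.1 and later

The last link in operations automation is notification: the team needs to know when a deployment finishes, on-call staff must be woken when a service fails, and a full disk needs prompt handling. PowerShell can send notifications through many channels: email (SMTP), webhooks (Slack, Teams, DingTalk), Windows Toast notifications, and even SMS. Building a unified notification system so that every script reuses the same alerting mechanism is key to faster operational response.

This article covers the notification methods available in PowerShell and the design of a unified alerting system.

Email notifications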

function Send-AlertEmail {
param(
[Parameter(Mandatory)]
[string]$Subject,

[Parameter(Mandatory)]
[string]$Body,

[string[]]$To = @("admin@contoso.com"),

[string]$Priority = "Normal",

[string]$SmtpServer = "mail.contoso.com",

[int]$Port = 587
)

$params = @{
From = "alerts@contoso.com"
To = $To
Subject = "[PS-Alert] $Subject"
Body = $Body
BodyAsHtml = $true
SmtpServer = $SmtpServer
Port = $Port
Encoding = [System.Text.Encoding]::UTF8
Priority = $Priority
# Make SMTP failures terminating so the try/catch below can report them
ErrorAction = 'Stop'
}

if ($env:SMTP_USER -and $env:SMTP_PASS) {
$secPass = ConvertTo-SecureString $env:SMTP_PASS -AsPlainText -Force
$params.Credential = New-Object PSCredential($env:SMTP_USER, $secPass)
}

try {
Send-MailMessage @params
Write-Host "邮件已发送:$Subject" -ForegroundColor Green
} catch {
Write-Host "邮件发送失败:$($_.Exception.Message)" -ForegroundColor Red
}
}

# HTML 格式告警邮件
function Send-HtmlAlert {
param(
[string]$Title,
[string]$Message,
[ValidateSet("Info", "Warning", "Critical")]
[string]$Severity = "Info"
)

$color = switch ($Severity) {
"Info" { "#3498db" }
"Warning" { "#f39c12" }
"Critical" { "#e74c3c" }
}

$html = @"
<!DOCTYPE html>
<html><body style="font-family:Arial,sans-serif;padding:20px">
<div style="border-left:4px solid $color;padding:15px;background:#f8f9fa">
<h2 style="color:$color;margin:0">$Title</h2>
<p style="color:#555">$Message</p>
<p style="color:#999;font-size:12px">时间:$(Get-Date -Format 'yyyy-MM-dd HH:mm:ss') | 计算机:$($env:COMPUTERNAME)</p>
</div>
</body></html>
"@

$priority = if ($Severity -eq "Critical") { "High" } else { "Normal" }
Send-AlertEmail -Subject $Title -Body $html -Priority $priority
}

Send-HtmlAlert -Title "磁盘空间告警" -Message "服务器 SRV01 的 C 盘使用率已达到 92%" -Severity Warning

Sample output:

邮件已发送:磁盘空间告警

Webhook notifications

# Slack Webhook
function Send-SlackNotification {
param(
[Parameter(Mandatory)][string]$Message,
[ValidateSet("good", "warning", "danger")]
[string]$Color = "good",
[string]$WebhookUrl = $env:SLACK_WEBHOOK_URL
)

if (-not $WebhookUrl) {
Write-Host "未设置 SLACK_WEBHOOK_URL" -ForegroundColor Yellow
return
}

$body = @{
attachments = @(
@{
text = $Message
color = $Color
ts = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds()  # UTC epoch seconds, independent of culture and local time zone
fields = @(
@{ title = "Server"; value = $env:COMPUTERNAME; short = $true }
@{ title = "Time"; value = (Get-Date -Format 'HH:mm:ss'); short = $true }
)
}
)
} | ConvertTo-Json -Depth 5

try {
Invoke-RestMethod -Uri $WebhookUrl -Method Post -Body $body -ContentType "application/json"
Write-Host "Slack 通知已发送" -ForegroundColor Green
} catch {
Write-Host "Slack 发送失败:$($_.Exception.Message)" -ForegroundColor Red
}
}

# Microsoft Teams Webhook
function Send-TeamsNotification {
param(
[Parameter(Mandatory)][string]$Title,
[Parameter(Mandatory)][string]$Message,
[ValidateSet("Info", "Warning", "Error")]
[string]$Level = "Info",
[string]$WebhookUrl = $env:TEAMS_WEBHOOK_URL
)

if (-not $WebhookUrl) { return }

$color = switch ($Level) {
"Info" { "0078D7" }
"Warning" { "FFB900" }
"Error" { "E81123" }
}

$body = @{
"@type" = "MessageCard"
"@context" = "http://schema.org/extensions"
themeColor = $color
title = $Title
text = $Message
sections = @(
@{
facts = @(
@{ name = "Computer"; value = $env:COMPUTERNAME },
@{ name = "Timestamp"; value = (Get-Date -Format 'yyyy-MM-dd HH:mm:ss') }
)
}
)
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri $WebhookUrl -Method Post -Body $body -ContentType "application/json"
Write-Host "Teams 通知已发送" -ForegroundColor Green
}

# 钉钉 Webhook
function Send-DingTalkNotification {
param(
[Parameter(Mandatory)][string]$Message,
[string]$WebhookUrl = $env:DINGTALK_WEBHOOK_URL
)

if (-not $WebhookUrl) { return }

$body = @{
msgtype = "text"
text = @{ content = "[$($env:COMPUTERNAME)] $Message" }
} | ConvertTo-Json

Invoke-RestMethod -Uri $WebhookUrl -Method Post -Body $body -ContentType "application/json; charset=utf-8"
Write-Host "钉钉通知已发送" -ForegroundColor Green
}

Send-SlackNotification -Message "部署完成:MyApp v2.5.0" -Color "good"
Send-TeamsNotification -Title "服务告警" -Message "CPU 使用率超过 90%" -Level Warning

Sample output:

Slack 通知已发送
Teams 通知已发送

Unified alerting system

# 统一告警入口
function Send-Alert {
param(
[Parameter(Mandatory)]
[string]$Title,

[Parameter(Mandatory)]
[string]$Message,

[ValidateSet("Info", "Warning", "Critical")]
[string]$Severity = "Info",

[string[]]$Channels = @("Log"),

[hashtable]$ExtraData
)

$timestamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
$alertId = [guid]::NewGuid().ToString().Substring(0, 8)

$fullMessage = "[$alertId] [$Severity] $Title`n$Message"
if ($ExtraData) {
foreach ($key in $ExtraData.Keys) {
$fullMessage += "`n $key : $($ExtraData[$key])"
}
}

# 始终记录日志
$logDir = "C:\Logs\Alerts"
New-Item $logDir -ItemType Directory -Force | Out-Null
$logEntry = "[$timestamp] [$Severity] [$alertId] $Title | $Message"
Add-Content "$logDir\alerts-$(Get-Date -Format 'yyyyMM').log" -Value $logEntry -Encoding UTF8

# 根据渠道分发
foreach ($channel in $Channels) {
switch ($channel) {
"Email" {
Send-HtmlAlert -Title "[$Severity] $Title" -Message $Message -Severity $Severity
}
"Slack" {
$color = switch ($Severity) {
"Info" { "good" }
"Warning" { "warning" }
"Critical" { "danger" }
}
Send-SlackNotification -Message $fullMessage -Color $color
}
"Teams" {
$level = switch ($Severity) {
"Info" { "Info" }
"Warning" { "Warning" }
"Critical" { "Error" }
}
Send-TeamsNotification -Title $Title -Message $fullMessage -Level $level
}
"DingTalk" {
Send-DingTalkNotification -Message $fullMessage
}
"Log" {
$color = switch ($Severity) {
"Info" { "Green" }
"Warning" { "Yellow" }
"Critical" { "Red" }
}
Write-Host $logEntry -ForegroundColor $color
}
}
}
}

# 使用示例
Send-Alert -Title "部署完成" -Message "MyApp v2.5.0 已部署到生产环境" `
-Severity Info -Channels @("Log", "Slack")

Send-Alert -Title "磁盘空间告警" -Message "SRV01 C盘使用率 95%" `
-Severity Warning -Channels @("Log", "Email", "Teams") `
-ExtraData @{ Drive = "C:"; FreeGB = "10.2"; Threshold = "90%" }

Send-Alert -Title "服务宕机" -Message "MyApp 服务无响应" `
-Severity Critical -Channels @("Log", "Email", "Slack", "Teams", "DingTalk") `
-ExtraData @{ Server = "SRV01"; LastSeen = "5分钟前" }

Sample output:

[2025-08-01 08:30:15] [Info] [a1b2c3d4] 部署完成 | MyApp v2.5.0 已部署到生产环境
Slack 通知已发送
[2025-08-01 08:30:16] [Warning] [e5f6a7b8] 磁盘空间告警 | SRV01 C盘使用率 95%
邮件已发送:[Warning] 磁盘空间告警
Teams 通知已发送
[2025-08-01 08:30:17] [Critical] [c9d0e1f2] 服务宕机 | MyApp 服务无响应
邮件已发送:[Critical] 服务宕机
Slack 通知已发送
Teams 通知已发送
钉钉通知已发送

Windows Toast notifications

# 本地 Windows Toast 通知
function Send-ToastNotification {
param(
[Parameter(Mandatory)][string]$Title,
[Parameter(Mandatory)][string]$Message
)

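Add-Type -AssemblyName System.Windows.Forms
# System.Drawing provides [System.Drawing.SystemIcons]; load it explicitly
Add-Type -AssemblyName System.Drawing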

$notify = New-Object System.Windows.Forms.NotifyIcon
$notify.Icon = [System.Drawing.SystemIcons]::Warning
$notify.BalloonTipTitle = $Title
$notify.BalloonTipText = $Message
$notify.Visible = $true
$notify.ShowBalloonTip(5000)

Start-Sleep -Seconds 6
$notify.Dispose()
}

Send-ToastNotification -Title "部署完成" -Message "MyApp v2.5.0 已部署"

Sample output:

# A notification balloon pops up from the Windows system tray

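Notes

  1. Webhook security: a webhook URL is equivalent to a password. Do not hard-code it in scripts; use environment variables or a secret manager.
  2. Alert suppression: sending the same alert repeatedly within a short time causes alert fatigue. Add deduplication and suppression logic.
  3. Alert escalation: when a Critical alert is not handled, it should escalate automatically and notify more senior staff.
  4. Delivery failure handling: have a fallback for failed notifications (for example, try a backup mail server when the primary one fails).
  5. Severity levels: use Info, Warning, and Critical sensibly, and avoid marking every alert at the highest level.
  6. Time zones: distributed teams must account for time zone differences. Use UTC for alert timestamps or state the time zone explicitly.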
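
Note 4 can be sketched as a small wrapper that tries a list of senders in order until one succeeds. `Invoke-WithFallback` is a hypothetical helper written for this illustration, and the two script blocks stand in for real senders such as a primary and a backup mail relay. Real senders should raise terminating errors (for example with -ErrorAction Stop), otherwise the wrapper cannot see the failure.

```powershell
# Hypothetical helper: run each sender in turn and stop at the first success
function Invoke-WithFallback {
    param(
        [Parameter(Mandatory)][scriptblock[]]$Senders
    )

    $index = 0
    foreach ($sender in $Senders) {
        try {
            $null = & $sender          # discard sender output; only success or failure matters
            return "Delivered by sender #$index"
        }
        catch {
            Write-Warning "Sender #$index failed: $($_.Exception.Message)"
        }
        $index++
    }

    throw 'All notification channels failed.'
}

# Usage: the first sender simulates an unreachable primary server, the second succeeds
$result = Invoke-WithFallback -Senders @(
    { throw 'primary SMTP server unreachable' },
    { 'sent via backup relay' }
)
$result
```

If every sender fails, the function throws, so the calling script can still record the failure in its local log.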