Applies to PowerShell 5.1 and later
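The Spring Festival holiday is a time for family reunions, but for IT operations teams the systems do not stop running just because everyone is on leave. Servers, databases, and network devices still need attention, and on-call staff are stretched thin. Covering the longest holiday with the fewest people is a perennial problem in the weeks before Spring Festival.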
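The traditional approach is a shift roster: on-call staff log in at fixed times and check system status by hand. This is inefficient, and a moment of inattention can let a critical alert slip through. A better approach is an automated watch system in which scripts handle routine inspection, fault handling, and alert delivery, so that people step in only when something is really wrong.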
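PowerShell runs on Windows and, as PowerShell 7 (formerly PowerShell Core), on Linux. It has rich remote-management capabilities and deep .NET integration, which makes it a good fit for this role. This article builds a Spring Festival automated watch solution step by step in three parts: monitoring, automatic remediation, and alert notification. The examples rely on CIM classes and Windows services, so they target Windows hosts.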
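Holiday Watch Monitoring System

The first piece is an active inspection mechanism that checks key server metrics at regular intervals. The script below applies thresholds to free disk space, CPU load, and memory usage, and collects the results into structured objects for later processing.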
```powershell
function Start-HolidayWatch {
    [CmdletBinding()]
    param (
        [string[]]$ComputerName = $env:COMPUTERNAME,
        [double]$DiskThresholdPercent = 10,     # alert when free disk space falls below this percentage
        [double]$CpuThresholdPercent = 90,      # alert when CPU load exceeds this percentage
        [double]$MemoryThresholdPercent = 90,   # alert when memory usage exceeds this percentage
        [pscredential]$Credential
    )

    $results = foreach ($computer in $ComputerName) {
        $session = $null
        try {
            # Get-CimInstance has no -Credential parameter, so credentials go through a CIM session.
            $sessionParams = @{ ComputerName = $computer; ErrorAction = 'Stop' }
            if ($Credential) { $sessionParams.Credential = $Credential }
            $session = New-CimSession @sessionParams
            $params = @{ CimSession = $session; ErrorAction = 'Stop' }

            # Memory usage (both properties are reported in KB)
            $os = Get-CimInstance Win32_OperatingSystem @params
            $freeMemoryPercent = [math]::Round(($os.FreePhysicalMemory / $os.TotalVisibleMemorySize) * 100, 2)
            $usedMemoryPercent = [math]::Round(100 - $freeMemoryPercent, 2)

            # Local fixed disks (DriveType 3) whose free space is below the threshold
            $disks = Get-CimInstance Win32_LogicalDisk -Filter 'DriveType=3' @params
            $diskAlerts = foreach ($disk in $disks) {
                $freePercent = [math]::Round(($disk.FreeSpace / $disk.Size) * 100, 2)
                if ($freePercent -lt $DiskThresholdPercent) {
                    [PSCustomObject]@{
                        Drive       = $disk.DeviceID
                        FreePercent = $freePercent
                        FreeGB      = [math]::Round($disk.FreeSpace / 1GB, 2)
                        Status      = 'WARNING'
                    }
                }
            }

            # CPU load, averaged across all processor sockets
            $cpu = Get-CimInstance Win32_Processor @params
            $cpuPercent = [math]::Round(($cpu | Measure-Object -Property LoadPercentage -Average).Average, 2)

            [PSCustomObject]@{
                Computer      = $computer
                Timestamp     = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
                CPUPercent    = $cpuPercent
                MemoryPercent = $usedMemoryPercent
                DiskAlerts    = $diskAlerts
                CPUStatus     = if ($cpuPercent -gt $CpuThresholdPercent) { 'WARNING' } else { 'OK' }
                MemoryStatus  = if ($usedMemoryPercent -gt $MemoryThresholdPercent) { 'WARNING' } else { 'OK' }
                OverallStatus = if ($cpuPercent -gt $CpuThresholdPercent -or
                                    $usedMemoryPercent -gt $MemoryThresholdPercent -or
                                    $diskAlerts) { 'ALERT' } else { 'OK' }
            }
        }
        catch {
            # The host could not be reached or queried; report it instead of aborting the scan.
            [PSCustomObject]@{
                Computer      = $computer
                Timestamp     = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
                OverallStatus = 'ERROR'
                ErrorMessage  = $_.Exception.Message
            }
        }
        finally {
            if ($session) { Remove-CimSession -CimSession $session }
        }
    }

    # @() keeps .Count reliable when zero or one result matches.
    $alertCount = @($results | Where-Object OverallStatus -eq 'ALERT').Count
    $errorCount = @($results | Where-Object OverallStatus -eq 'ERROR').Count

    [PSCustomObject]@{
        ScanTime   = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
        TotalHosts = $ComputerName.Count
        Alerts     = $alertCount
        Errors     = $errorCount
        Details    = $results
    }
}

# Scan three servers with custom thresholds
$watchParams = @{
    ComputerName           = 'SRV-WEB01', 'SRV-DB01', 'SRV-APP01'
    DiskThresholdPercent   = 15
    CpuThresholdPercent    = 85
    MemoryThresholdPercent = 90
}
$report = Start-HolidayWatch @watchParams
$report.Details | Format-Table Computer, CPUPercent, MemoryPercent, CPUStatus, MemoryStatus, OverallStatus -AutoSize
```
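The command prints one row per host with the columns Computer, CPUPercent, MemoryPercent, CPUStatus, MemoryStatus, and OverallStatus. Hosts that could not be queried appear with OverallStatus set to ERROR and the metric columns empty.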
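Because each scan returns a structured object, it can be appended to a file for review after the holiday. The following is a minimal sketch, not part of the original scripts: `Save-WatchReport` and `Get-WatchHistory` are hypothetical helper names, and the JSON Lines layout (one compressed JSON document per line) is an assumption chosen for easy appending.

```powershell
# Hypothetical helper: append one scan report per line (JSON Lines) to a history file.
function Save-WatchReport {
    param (
        [Parameter(Mandatory)] [object]$Report,
        [Parameter(Mandatory)] [string]$Path
    )
    # Create the target directory on first use.
    $dir = Split-Path -Path $Path -Parent
    if ($dir -and -not (Test-Path -Path $dir)) {
        New-Item -Path $dir -ItemType Directory -Force | Out-Null
    }
    # -Depth 5 keeps the nested Details and DiskAlerts objects from being truncated.
    $Report | ConvertTo-Json -Depth 5 -Compress | Add-Content -Path $Path -Encoding UTF8
}

# Hypothetical helper: read the history back as objects, one per saved scan.
function Get-WatchHistory {
    param ([Parameter(Mandatory)] [string]$Path)
    Get-Content -Path $Path -Encoding UTF8 |
        Where-Object { $_ } |
        ForEach-Object { $_ | ConvertFrom-Json }
}

# Usage (assumed path):
#   Save-WatchReport -Report $report -Path 'C:\HolidayWatch\history.jsonl'
#   Get-WatchHistory -Path 'C:\HolidayWatch\history.jsonl' | Where-Object { $_.Alerts -gt 0 }
```

One line per scan keeps writes append-only, so a crash in the middle of a run cannot corrupt earlier entries.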
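Automatic Remediation and Emergency Response

Monitoring is only the first step. The next level is to let the system handle common faults on its own. The script below demonstrates three emergency actions: restarting stopped services, cleaning up old logs, and freeing disk space. Every action is recorded in a transcript log. The script does not roll anything back, and deleted files cannot be recovered, so restrict the cleanup paths to data that is safe to lose.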
```powershell
function Invoke-AutoRemediation {
    [CmdletBinding()]
    param (
        [string]$ComputerName = $env:COMPUTERNAME,
        [string[]]$CriticalServices = @('W3SVC', 'MSSQLSERVER', 'WinRM'),
        [double]$LogCleanupThresholdGB = 2,
        [string]$LogPath = 'C:\Logs',
        [string]$TranscriptPath = 'C:\HolidayWatch\Remediation.log'
    )

    # All three steps run on the target host through Invoke-Command, so the services,
    # log directory, and C: drive belong to $ComputerName, not to the machine running this script.
    # This also avoids Get-Service -ComputerName, which PowerShell 7 no longer supports.
    $remediate = {
        param ($CriticalServices, $LogCleanupThresholdGB, $LogPath)
        $actions = @()

        # 1. Start any critical service that is not running.
        foreach ($svc in $CriticalServices) {
            try {
                $service = Get-Service -Name $svc -ErrorAction Stop
                if ($service.Status -ne 'Running') {
                    Write-Warning "Service [$svc] is $($service.Status); trying to start it..."
                    $service | Start-Service -ErrorAction Stop
                    $actions += [PSCustomObject]@{
                        Time   = Get-Date -Format 'HH:mm:ss'
                        Action = 'RestartService'
                        Target = $svc
                        Result = 'SUCCESS'
                    }
                    Write-Host "Service [$svc] started successfully" -ForegroundColor Green
                }
            }
            catch {
                $actions += [PSCustomObject]@{
                    Time   = Get-Date -Format 'HH:mm:ss'
                    Action = 'RestartService'
                    Target = $svc
                    Result = "FAILED: $($_.Exception.Message)"
                }
                Write-Warning "Failed to start service [$svc]: $($_.Exception.Message)"
            }
        }

        # 2. Delete logs older than 7 days once they exceed the size threshold.
        if (Test-Path $LogPath) {
            $oldLogs = Get-ChildItem -Path $LogPath -Recurse -File |
                Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) }
            $totalSize = ($oldLogs | Measure-Object -Property Length -Sum).Sum / 1GB
            if ($totalSize -gt $LogCleanupThresholdGB) {
                Write-Warning "Old logs occupy $([math]::Round($totalSize, 2)) GB, above the threshold; cleaning up..."
                $oldLogs | Remove-Item -Force -ErrorAction SilentlyContinue
                # Upper bound: files that were locked and could not be deleted still count here.
                $freed = [math]::Round($totalSize, 2)
                $actions += [PSCustomObject]@{
                    Time   = Get-Date -Format 'HH:mm:ss'
                    Action = 'CleanLogs'
                    Target = $LogPath
                    Result = "SUCCESS - freed ${freed} GB"
                }
            }
        }

        # 3. Emergency cleanup when the system drive has less than 15% free space.
        $systemDrive = Get-CimInstance Win32_LogicalDisk -Filter 'DeviceID="C:"'
        $freePercent = [math]::Round(($systemDrive.FreeSpace / $systemDrive.Size) * 100, 2)
        if ($freePercent -lt 15) {
            Write-Warning "Drive C: has only $freePercent% free; running emergency cleanup..."
            $tempPaths = @(
                "$env:TEMP\*",
                'C:\Windows\Temp\*',
                'C:\Windows\SoftwareDistribution\Download\*'
            )
            foreach ($path in $tempPaths) {
                Remove-Item -Path $path -Recurse -Force -ErrorAction SilentlyContinue
            }
            Clear-RecycleBin -DriveLetter C -Force -ErrorAction SilentlyContinue

            # Query the drive once more to measure the result.
            $after = Get-CimInstance Win32_LogicalDisk -Filter 'DeviceID="C:"'
            $newFree = [math]::Round(($after.FreeSpace / $after.Size) * 100, 2)
            $actions += [PSCustomObject]@{
                Time   = Get-Date -Format 'HH:mm:ss'
                Action = 'EmergencyDiskCleanup'
                Target = 'C:'
                Result = "Free space: $freePercent% -> $newFree%"
            }
        }

        $actions
    }

    Start-Transcript -Path $TranscriptPath -Append -Force | Out-Null
    try {
        $invokeParams = @{
            ComputerName = $ComputerName
            ScriptBlock  = $remediate
            ArgumentList = $CriticalServices, $LogCleanupThresholdGB, $LogPath
            ErrorAction  = 'Stop'
        }
        $actions = Invoke-Command @invokeParams
    }
    catch {
        # The remote session failed, for example because WinRM itself is down on the target.
        $actions = [PSCustomObject]@{
            Time   = Get-Date -Format 'HH:mm:ss'
            Action = 'Connect'
            Target = $ComputerName
            Result = "FAILED: $($_.Exception.Message)"
        }
    }
    finally {
        Stop-Transcript | Out-Null
    }
    return $actions
}

# Run remediation only when the scan reports an alert
$report = Start-HolidayWatch -ComputerName 'SRV-APP01'
if ($report.Details.OverallStatus -eq 'ALERT') {
    Write-Host 'Alert detected; starting auto-remediation...' -ForegroundColor Yellow
    $remediation = Invoke-AutoRemediation -ComputerName 'SRV-APP01'
    $remediation | Format-Table Time, Action, Target, Result -AutoSize
}
```
Sample output:

```
Alert detected; starting auto-remediation...

Time  Action               Target  Result
----  ------               ------  ------
14:32 RestartService       W3SVC   SUCCESS
14:33 CleanLogs            C:\Logs SUCCESS - freed 3.45 GB
14:34 EmergencyDiskCleanup C:      Free space: 11.20% -> 18.65%
```
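Since cleanup deletes files permanently, it helps to be able to rehearse it without touching anything. The sketch below is one way to do that, assuming a hypothetical helper named `Remove-OldLog`; it relies on PowerShell's standard `SupportsShouldProcess` mechanism so that `-WhatIf` reports what would be removed and deletes nothing.

```powershell
# Hypothetical helper: delete files older than a cutoff, with -WhatIf support for dry runs.
function Remove-OldLog {
    [CmdletBinding(SupportsShouldProcess)]
    param (
        [Parameter(Mandatory)] [string]$Path,
        [int]$OlderThanDays = 7
    )
    $cutoff = (Get-Date).AddDays(-$OlderThanDays)
    $removed = 0

    Get-ChildItem -Path $Path -Recurse -File |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        ForEach-Object {
            # ShouldProcess returns $false under -WhatIf, so nothing is deleted in a dry run.
            if ($PSCmdlet.ShouldProcess($_.FullName, 'Remove old log file')) {
                Remove-Item -LiteralPath $_.FullName -Force
                $removed++
            }
        }

    # Summary object that can be logged alongside the other remediation actions.
    [PSCustomObject]@{
        Path    = $Path
        Cutoff  = $cutoff
        Removed = $removed
    }
}

# Dry run first, then the real cleanup:
#   Remove-OldLog -Path 'C:\Logs' -WhatIf
#   Remove-OldLog -Path 'C:\Logs'
```

Counting only files that were actually removed also gives a more honest figure than assuming every matching file was deleted.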
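Alert Notification and On-Call Management

Once monitoring has found a problem and automatic remediation has been attempted, the on-call engineer must be notified promptly. The script below integrates three alert channels (email, a WeCom webhook, and a DingTalk webhook) and looks up the current on-call engineer from a date-based rotation. Alerts carry a severity level; the script itself does not escalate automatically, so raising the severity of a persistent alert is left to the caller.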
```powershell
function Send-HolidayAlert {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [string]$Title,

        [Parameter(Mandatory)]
        [string]$Message,

        [ValidateSet('Mail', 'WeCom', 'DingTalk', 'All')]
        [string]$Channel = 'All',

        [string]$SmtpServer = 'smtp.company.com',
        [int]$SmtpPort = 587,
        [string]$MailFrom = 'ops-holiday@company.com',
        [string[]]$MailTo = @('oncall@company.com'),

        [string]$WeComWebhookUrl,
        [string]$DingTalkWebhookUrl,

        [ValidateSet('Info', 'Warning', 'Critical')]
        [string]$Severity = 'Warning'
    )

    $severityTag = @{
        Info     = '[INFO]'
        Warning  = '[WARN]'
        Critical = '[CRIT]'
    }
    $prefix = $severityTag[$Severity]
    $body = "$prefix $Title`n`n$Message`n`nTime: $(Get-Date -Format 'yyyy-MM-dd HH:mm:ss')"

    # Email. Send-MailMessage is marked obsolete by Microsoft but still ships with PowerShell.
    # Port 587 usually needs TLS and authentication: add UseSsl and Credential if your relay requires them.
    if ($Channel -in 'Mail', 'All') {
        try {
            $mailParams = @{
                From        = $MailFrom
                To          = $MailTo
                Subject     = "$prefix $Title"
                Body        = $body
                SmtpServer  = $SmtpServer
                Port        = $SmtpPort
                Encoding    = 'UTF8'
                ErrorAction = 'Stop'   # without this, SMTP failures never reach the catch block
            }
            Send-MailMessage @mailParams
            Write-Host "Mail alert sent to: $($MailTo -join ', ')" -ForegroundColor Green
        }
        catch {
            Write-Warning "Mail alert failed: $($_.Exception.Message)"
        }
    }

    # WeCom and DingTalk robots accept the same plain-text JSON payload.
    # The body is sent as UTF-8 bytes so that non-ASCII text survives on Windows PowerShell 5.1.
    $json = @{ msgtype = 'text'; text = @{ content = $body } } | ConvertTo-Json -Compress
    $payload = [System.Text.Encoding]::UTF8.GetBytes($json)
    $webhooks = [ordered]@{
        WeCom    = $WeComWebhookUrl
        DingTalk = $DingTalkWebhookUrl
    }

    foreach ($name in $webhooks.Keys) {
        $url = $webhooks[$name]
        if (($Channel -in $name, 'All') -and $url) {
            try {
                $response = Invoke-RestMethod -Uri $url -Method Post -Body $payload -ContentType 'application/json; charset=utf-8'
                # Both services answer HTTP 200 even on failure and report the real result in errcode.
                if ($response.errcode) { throw "errcode $($response.errcode): $($response.errmsg)" }
                Write-Host "$name alert sent" -ForegroundColor Green
            }
            catch {
                Write-Warning "$name alert failed: $($_.Exception.Message)"
            }
        }
    }
}

function Get-CurrentOnCall {
    param (
        [hashtable]$Schedule = @{
            '2026-02-08' = '张三'; '2026-02-09' = '张三'
            '2026-02-10' = '李四'; '2026-02-11' = '李四'
            '2026-02-12' = '王五'; '2026-02-13' = '王五'
            '2026-02-14' = '张三'
        },
        [hashtable]$Contacts = @{
            '张三' = @{ Email = 'zhangsan@company.com'; Phone = '138-0001-0001' }
            '李四' = @{ Email = 'lisi@company.com'; Phone = '138-0002-0002' }
            '王五' = @{ Email = 'wangwu@company.com'; Phone = '138-0003-0003' }
        }
    )

    $today = Get-Date -Format 'yyyy-MM-dd'
    $person = $Schedule[$today]

    if (-not $person) {
        # No entry for today: fall back to the most recent scheduled date,
        # or to the first entry if today is earlier than the whole schedule.
        $nearest = $Schedule.Keys | Sort-Object | Where-Object { $_ -le $today } | Select-Object -Last 1
        if (-not $nearest) { $nearest = $Schedule.Keys | Sort-Object | Select-Object -First 1 }
        $person = $Schedule[$nearest]
    }

    [PSCustomObject]@{
        Date   = $today
        OnCall = $person
        Email  = $Contacts[$person].Email
        Phone  = $Contacts[$person].Phone
    }
}

# Look up today's on-call engineer and send a critical alert.
# Add WeComWebhookUrl and DingTalkWebhookUrl to the hashtable to enable those channels;
# without them only the mail channel is used.
$onCall = Get-CurrentOnCall
$alertParams = @{
    Title    = 'SRV-APP01 CPU usage has stayed above 90%'
    Message  = "Server SRV-APP01: CPU usage 91.2%, memory usage 92.5%.`nAuto-remediation has tried restarting services.`nOn call: $($onCall.OnCall) ($($onCall.Phone))"
    Channel  = 'All'
    Severity = 'Critical'
    MailTo   = @($onCall.Email, 'ops-manager@company.com')
}
Send-HolidayAlert @alertParams
$onCall | Format-Table -AutoSize
```
Sample output (with both webhook URLs configured):

```
Mail alert sent to: zhangsan@company.com, ops-manager@company.com
WeCom alert sent
DingTalk alert sent

Date       OnCall Email                Phone
----       ------ -----                -----
2026-02-08 张三   zhangsan@company.com 138-0001-0001
```
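Escalation can be layered on top of the severity parameter by counting how many consecutive scans have reported the same problem. The following is a minimal sketch under stated assumptions: `Get-EscalatedSeverity` is a hypothetical helper, and the thresholds (warning on the first failing scan, critical from the fourth consecutive one) are illustrative values, not taken from the original scripts.

```powershell
# Hypothetical helper: map the number of consecutive alerting scans to a severity level.
function Get-EscalatedSeverity {
    param (
        [Parameter(Mandatory)] [int]$ConsecutiveAlerts,
        [int]$WarningAfter = 1,    # assumed: the first failing scan is a warning
        [int]$CriticalAfter = 4    # assumed: escalate from the fourth consecutive failing scan
    )
    if ($ConsecutiveAlerts -ge $CriticalAfter) { return 'Critical' }
    if ($ConsecutiveAlerts -ge $WarningAfter)  { return 'Warning' }
    return 'Info'
}

Get-EscalatedSeverity -ConsecutiveAlerts 0   # Info
Get-EscalatedSeverity -ConsecutiveAlerts 2   # Warning
Get-EscalatedSeverity -ConsecutiveAlerts 5   # Critical
```

The returned string matches the values accepted by the `-Severity` parameter of `Send-HolidayAlert`, so the result can be passed to it directly. The caller is responsible for tracking the consecutive count between runs, for example in a small state file.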
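Complete Deployment Recommendations

Once the three modules are combined into a single script, a scheduled task can run one inspection round every 15 minutes throughout the holiday. The core registration logic is as follows: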
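```powershell
# pwsh.exe is PowerShell 7; use 'powershell.exe' if the host only has Windows PowerShell 5.1.
$action   = New-ScheduledTaskAction -Execute 'pwsh.exe' -Argument '-NoProfile -File "C:\HolidayWatch\Run-Watch.ps1"'
$trigger  = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 15)
$settings = New-ScheduledTaskSettingsSet -AllowStartIfOnBatteries -DontStopIfGoingOnBatteries -StartWhenAvailable

# Registered without -User, the task runs as the current user and only while that user is logged on.
# For unattended operation, pass -User and -Password for a service account.
Register-ScheduledTask -TaskName 'HolidayWatch-2026' -Action $action -Trigger $trigger -Settings $settings -RunLevel Highest -Description 'Spring Festival holiday automated watch task'
```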
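The trigger above repeats every 15 minutes with no end date, so the task keeps running after everyone is back at work unless it is removed. One option, sketched here as an assumption rather than part of the original design, is a guard at the top of `Run-Watch.ps1` that exits outside the holiday window. `Test-HolidayWindow` is a hypothetical helper; the dates in the usage comment follow the on-call schedule shown earlier and should be replaced with your actual holiday.

```powershell
# Hypothetical helper: $true when $Now falls inside [Start, End).
function Test-HolidayWindow {
    param (
        [Parameter(Mandatory)] [datetime]$Start,
        [Parameter(Mandatory)] [datetime]$End,     # exclusive upper bound
        [datetime]$Now = (Get-Date)
    )
    ($Now -ge $Start) -and ($Now -lt $End)
}

# Usage at the top of Run-Watch.ps1 (placeholder dates):
#   if (-not (Test-HolidayWindow -Start '2026-02-08' -End '2026-02-15')) { return }
```

Making `-Now` a parameter keeps the check easy to rehearse before the holiday starts.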
Notes
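Credential security: store the credentials used for remote management in Windows Credential Manager or Azure Key Vault, and never write them into a script in plain text. You can obtain them interactively with Get-Credential or store them encrypted with Export-Clixml. On Windows, Export-Clixml protects the password with DPAPI, so the file can only be decrypted by the same user on the same machine; create it under the account that runs the scheduled task.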
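Network reachability: make sure the jump host that runs the scripts can reach every target server, and that WinRM (port 5985 for HTTP, 5986 for HTTPS) or SSH is configured correctly and accepts remote connections.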
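Alert storm protection: set an alert cooldown (for example, do not resend the same alert within 30 minutes) so that transient jitter does not bury on-call staff under duplicate notifications.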
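Log persistence: every inspection and remediation action must be written to a persistent log. Writing to both a local file and a central logging platform (such as ELK) makes the post-holiday review easier.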
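Remediation boundaries: automatic remediation should only perform known, safe operations (such as restarting a service or clearing temporary files). Never let a script drop a database, reboot a server, or carry out any other high-risk operation on its own.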
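Pre-holiday drill: run at least one full end-to-end drill before the holiday, covering a simulated alert, the automatic remediation, and delivery of the notification, to confirm that every stage works.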
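The cooldown mentioned under alert storm protection can be sketched as a small state file that remembers when each alert was last sent. This is an illustration under stated assumptions, not the article's implementation: `Test-AlertCooldown`, the alert key format, and the state file path are all hypothetical. The state is stored with Export-Clixml so that timestamps round-trip as DateTime values on both Windows PowerShell 5.1 and PowerShell 7.

```powershell
# Hypothetical helper: returns $true if the alert may be sent now and records the send time;
# returns $false while the same alert key is still inside its cooldown period.
function Test-AlertCooldown {
    param (
        [Parameter(Mandatory)] [string]$AlertKey,     # e.g. 'SRV-APP01/CPU' (assumed format)
        [Parameter(Mandatory)] [string]$StatePath,    # file that stores the last send times
        [int]$CooldownMinutes = 30,
        [datetime]$Now = (Get-Date)
    )

    # Load the previous state: a hashtable of alert key -> last send time.
    $state = @{}
    if (Test-Path -Path $StatePath) {
        $state = Import-Clixml -Path $StatePath
    }

    if ($state.ContainsKey($AlertKey)) {
        $elapsed = ($Now - $state[$AlertKey]).TotalMinutes
        if ($elapsed -lt $CooldownMinutes) { return $false }
    }

    # Record this send so later calls within the cooldown are suppressed.
    $state[$AlertKey] = $Now
    $state | Export-Clixml -Path $StatePath
    return $true
}

# Usage (assumed key and path):
#   if (Test-AlertCooldown -AlertKey 'SRV-APP01/CPU' -StatePath 'C:\HolidayWatch\cooldown.xml') {
#       Send-HolidayAlert -Title '...' -Message '...'
#   }
```

A suppressed call does not refresh the timestamp, so an alert that persists is sent again once the cooldown has elapsed.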