# Sensitive Word Filtering: Lessons Learned

While building the chat feature for a game, I needed to implement sensitive word filtering. After comparing several approaches, I settled on the DFA algorithm. This post records the implementation and the pitfalls I hit along the way.
## Comparing Filtering Algorithms

### Common approaches

| Algorithm | Time complexity | Space complexity | Characteristics | Suitable for |
|---|---|---|---|---|
| Brute-force matching | O(n·m·k) | O(1) | Simple to implement, slow | Small word lists, low real-time requirements |
| Trie | O(n) | O(m·k) | Prefix matching, fairly fast | Prefix-matching scenarios |
| DFA | O(n) | O(m·k) | State-machine transitions, fastest | High concurrency, large word lists |
| Aho-Corasick automaton | O(n) | O(m·k) | Multi-pattern matching | Very large word lists |

Where:

- n: length of the text being checked
- m: number of sensitive words
- k: average sensitive word length
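As a point of reference for the table, the brute-force row boils down to something like the following sketch (the function name is mine, purely illustrative):

```typescript
// Naive baseline: try every word against the whole text.
// Worst case O(n·m·k): n = text length, m = word count, k = average word length.
function bruteForceFilter(text: string, words: string[]): string {
  let result = text;
  for (const word of words) {
    // split/join sidesteps regex escaping of special characters
    result = result.split(word).join('*'.repeat(word.length));
  }
  return result;
}
```

It is fine for a handful of words, but it rescans the full text once per word, which is exactly the O(m) factor the DFA removes.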
### Why DFA

DFA (Deterministic Finite Automaton) is a popular choice for sensitive word filtering because:

- **Optimal time complexity**: a single pass over the text completes the check
- **Space traded for time**: the state machine is built once up front; each transition during a query is O(1)
- **Easy to extend**: sensitive words can be added or removed dynamically
- **Multi-byte character support**: handles Chinese and other multi-byte characters naturally
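The "easy to extend" point is visible in miniature: since the machine is nothing but nested maps, adding or retiring a word is an O(k) walk. A standalone sketch, with hypothetical helper names:

```typescript
type DfaNode = Map<string, any>;

// Insert a word: walk down, creating missing child nodes, then flag the end.
function addWord(root: DfaNode, word: string): void {
  let node = root;
  for (const ch of word) {
    if (!node.has(ch)) node.set(ch, new Map());
    node = node.get(ch);
  }
  node.set('isEnd', true); // only the final character carries the end flag
}

// Retire a word: clear its end flag; dead branches can be pruned lazily.
function removeWord(root: DfaNode, word: string): void {
  let node: DfaNode | undefined = root;
  for (const ch of word) {
    node = node.get(ch);
    if (!node) return; // word was never in the machine
  }
  node.set('isEnd', false);
}
```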
## How the DFA Works

### Core idea

The DFA matches sensitive words by building a tree-shaped state machine (a variant of a trie):

```text
Word list: ["敏感", "敏感词", "词库"]

Resulting DFA tree:

        root
       /    \
      敏     词
      |      |
      感(end) 库(end)
      |
      词(end)
```
### State transitions

Take the text "这里有敏感词" as an example:

```text
Step 1: read "这" -> root has no "这" branch, skip
Step 2: read "里" -> root has no "里" branch, skip
Step 3: read "有" -> root has no "有" branch, skip
Step 4: read "敏" -> transition from root to the "敏" node
Step 5: read "感" -> transition from "敏" to "感" (end flag found)
Step 6: "敏感" matched
Step 7: keep reading "词" -> transition from "感" to "词" (end flag found)
Step 8: "敏感词" matched
```
### Node layout

Each node carries two key pieces of information:

- **Child map**: a mapping from characters to child nodes
- **End flag (`isEnd`)**: marks whether this node ends a sensitive word
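Typed out, a node is a small recursive structure. This is only a sketch; the implementation in this post uses untyped `Map<string, any>` for brevity:

```typescript
// A node maps characters to children; the end flag lives under a reserved key,
// so the whole machine stays a single nested Map structure.
type TrieNode = Map<string, TrieNode | boolean>;

function makeNode(isEnd = false): TrieNode {
  const node: TrieNode = new Map();
  if (isEnd) node.set('isEnd', true);
  return node;
}
```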
## Full TypeScript Implementation

### Core class

```typescript
class SensitiveWordFilter {
  private sensitiveWordMap: Map<string, any>;
  private minMatchLength: number = 2;

  constructor(minMatchLength: number = 2) {
    this.minMatchLength = minMatchLength;
    this.sensitiveWordMap = new Map();
  }

  // Build the DFA (nested Maps) from the word list.
  makeSensitiveMap(sensitiveWordList: string[]): void {
    this.sensitiveWordMap = new Map();
    for (const word of sensitiveWordList) {
      let map = this.sensitiveWordMap;
      for (let i = 0; i < word.length; i++) {
        const char = word.charAt(i);
        if (!map.has(char)) {
          map.set(char, new Map());
        }
        map = map.get(char);
      }
      // Pitfall: mark only the final character as a word end. My first
      // version toggled isEnd while descending, which clobbered shorter
      // words like "敏感" whenever a longer word like "敏感词" was added.
      map.set('isEnd', true);
    }
  }
}
```
### Detecting a sensitive word

```typescript
// Starting at `index`, walk the DFA and return the longest match found.
private checkSensitiveWord(txt: string, index: number): { flag: boolean, sensitiveWord: string } {
  let currentMap = this.sensitiveWordMap;
  let flag = false;
  let wordNum = 0;
  let candidate = '';
  let sensitiveWord = '';

  for (let i = index; i < txt.length; i++) {
    const char = txt.charAt(i);
    currentMap = currentMap.get(char);
    if (!currentMap) {
      break; // no transition for this character: stop walking
    }
    wordNum++;
    candidate += char;
    if (currentMap.get('isEnd') === true && wordNum >= this.minMatchLength) {
      // Record the match, but keep walking so "敏感词" wins over "敏感".
      flag = true;
      sensitiveWord = candidate;
    }
  }
  return { flag, sensitiveWord };
}
```
### Filtering a text

```typescript
filterSensitiveWord(txt: string, check: boolean = false): string | boolean {
  let matchResult = { flag: false, sensitiveWord: '' };
  // Strip everything except CJK ideographs, digits and ASCII letters,
  // so separators like spaces or punctuation cannot split a word.
  const txtTrim = txt.replace(/[^\u4e00-\u9fa5\u0030-\u0039\u0061-\u007a\u0041-\u005a]+/g, '');

  for (let i = 0; i < txtTrim.length; i++) {
    matchResult = this.checkSensitiveWord(txtTrim, i);
    if (matchResult.flag) {
      if (check) {
        return false; // check mode: report dirty and stop immediately
      }
      // Replace every occurrence in the original text with asterisks.
      const regex = new RegExp(matchResult.sensitiveWord, 'gi');
      txt = txt.replace(regex, '*'.repeat(matchResult.sensitiveWord.length));
    }
  }
  return check ? true : txt;
}
```
### Usage

```typescript
const filter = new SensitiveWordFilter(2);
const sensitiveWords = ["敏感", "敏感词", "脏话", "政治", "暴力"];
filter.makeSensitiveMap(sensitiveWords);

const testText = "这是一条包含敏感词和脏话的测试文本";

const filtered = filter.filterSensitiveWord(testText, false);
console.log("Filtered:", filtered);

const isClean = filter.filterSensitiveWord(testText, true);
console.log("Clean:", isClean);
```
## Optimization Tricks

### Handling special characters

In practice, users try to dodge the filter with special characters:

```typescript
private preprocessText(txt: string): string {
  // Remove whitespace and common separator dots
  let processed = txt.replace(/[\s·•・]+/g, '');
  // Remove zero-width characters
  processed = processed.replace(/[\u200B-\u200D\uFEFF]/g, '');
  // Convert fullwidth digits/letters to their halfwidth equivalents
  processed = processed.replace(/[\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]/g,
    char => String.fromCharCode(char.charCodeAt(0) - 0xFEE0));
  return processed;
}
```
### Pinyin matching

To catch pinyin-based evasion, the filter can be extended with pinyin matching. This is only a sketch: the comparison against stored pinyin is a naive substring check, and masking pinyin hits back into the original text is left out.

```typescript
import { pinyin } from 'pinyin-pro';

class PinyinSensitiveFilter extends SensitiveWordFilter {
  private pinyinMap: Map<string, string> = new Map();

  makeSensitiveMap(sensitiveWordList: string[]): void {
    super.makeSensitiveMap(sensitiveWordList);
    this.pinyinMap = new Map();
    for (const word of sensitiveWordList) {
      const py = pinyin(word, { toneType: 'none', type: 'string' });
      this.pinyinMap.set(word, py);
    }
  }

  filterSensitiveWord(txt: string, check: boolean = false): string | boolean {
    const result = super.filterSensitiveWord(txt, check);
    if (check && result === false) return false;

    // Fall back to a pinyin comparison for text that passed the character check.
    const txtPinyin = pinyin(txt, { toneType: 'none', type: 'string' });
    for (const py of this.pinyinMap.values()) {
      if (txtPinyin.includes(py)) {
        return check ? false : result;
      }
    }
    return result;
  }
}
```
### Word variants

Sensitive words also show up as variants (e.g. "法L功", "氵去车仑工力"). A look-alike character map can generate variants to feed into the word list at build time:

```typescript
const similarChars: { [key: string]: string[] } = {
  '法': ['氵去', 'fa', 'FA'],
  '轮': ['车仑', 'lun', 'LUN'],
  '功': ['工力', 'gong', 'GONG']
};

// Expand a word into itself plus every single-substitution variant.
function generateVariants(word: string): string[] {
  const variants: string[] = [word];
  for (const [original, similars] of Object.entries(similarChars)) {
    if (word.includes(original)) {
      for (const similar of similars) {
        variants.push(word.replace(original, similar));
      }
    }
  }
  return variants;
}
```
## Performance

### Sharding the word list

For very large word lists, the trie can be sharded by word length (or by first character), so a lookup only touches one shard:

```typescript
class OptimizedSensitiveFilter {
  private wordMapByLength: Map<number, Map<string, any>> = new Map();

  makeSensitiveMap(sensitiveWordList: string[]): void {
    for (const word of sensitiveWordList) {
      const length = word.length;
      if (!this.wordMapByLength.has(length)) {
        this.wordMapByLength.set(length, new Map());
      }
      // Build this word's trie branch inside its length bucket
      let map = this.wordMapByLength.get(length)!;
      for (const char of word) {
        if (!map.has(char)) map.set(char, new Map());
        map = map.get(char);
      }
      map.set('isEnd', true);
    }
  }
}
```
### Bloom filter pre-screening

A Bloom filter can serve as a fast pre-screen. The sketch below assumes a `BloomFilter` implementation with `add` and `mightContain` methods (your own or a library's). Note that testing the whole input against a filter of individual words only helps for short inputs such as nicknames; for chat text you would test candidate substrings instead.

```typescript
class BloomFilterSensitiveFilter extends SensitiveWordFilter {
  private bloomFilter: BloomFilter;

  constructor(expectedItems: number, falsePositiveRate: number) {
    super();
    this.bloomFilter = new BloomFilter(expectedItems, falsePositiveRate);
  }

  makeSensitiveMap(sensitiveWordList: string[]): void {
    super.makeSensitiveMap(sensitiveWordList);
    for (const word of sensitiveWordList) {
      this.bloomFilter.add(word);
    }
  }

  filterSensitiveWord(txt: string, check: boolean = false): string | boolean {
    // Definitely-absent inputs skip the full DFA walk entirely.
    if (!this.bloomFilter.mightContain(txt)) {
      return check ? true : txt;
    }
    return super.filterSensitiveWord(txt, check);
  }
}
```
## Real-World Usage

### Game chat

```typescript
class ChatFilter {
  private filter: SensitiveWordFilter;

  constructor() {
    this.filter = new SensitiveWordFilter(2);
    this.loadSensitiveWords();
  }

  private loadSensitiveWords(): void {
    // In production this loads from a config service or database;
    // hard-coded here to keep the example self-contained.
    this.filter.makeSensitiveMap(['敏感', '脏话']);
  }

  processMessage(playerId: string, message: string): { allowed: boolean, content?: string } {
    const isClean = this.filter.filterSensitiveWord(message, true);
    if (!isClean) {
      // Mask rather than reject, and log the hit for moderation review.
      const filtered = this.filter.filterSensitiveWord(message, false) as string;
      this.logSensitiveWord(playerId, message);
      return { allowed: true, content: filtered };
    }
    return { allowed: true, content: message };
  }

  private logSensitiveWord(playerId: string, original: string): void {
    console.log(`[sensitive word hit] player: ${playerId}, content: ${original}`);
  }
}
```
### Nickname validation

```typescript
class NicknameValidator {
  private filter: SensitiveWordFilter;
  private minLength: number = 2;
  private maxLength: number = 12;

  constructor(filter: SensitiveWordFilter) {
    this.filter = filter;
  }

  validate(nickname: string): { valid: boolean, error?: string } {
    if (nickname.length < this.minLength) {
      return { valid: false, error: `Nickname needs at least ${this.minLength} characters` };
    }
    if (nickname.length > this.maxLength) {
      return { valid: false, error: `Nickname can have at most ${this.maxLength} characters` };
    }
    const isClean = this.filter.filterSensitiveWord(nickname, true);
    if (!isClean) {
      return { valid: false, error: 'Nickname contains sensitive words' };
    }
    if (!/^[\u4e00-\u9fa5a-zA-Z0-9]+$/.test(nickname)) {
      return { valid: false, error: 'Nickname may only contain Chinese characters, letters and digits' };
    }
    return { valid: true };
  }
}
```
## Tests

```typescript
describe('SensitiveWordFilter', () => {
  let filter: SensitiveWordFilter;

  beforeEach(() => {
    filter = new SensitiveWordFilter(2);
    filter.makeSensitiveMap(['敏感', '脏话', '政治']);
  });

  test('detects a sensitive word', () => {
    const result = filter.filterSensitiveWord('这是一条敏感信息', true);
    expect(result).toBe(false);
  });

  test('masks a sensitive word', () => {
    const result = filter.filterSensitiveWord('这是一条敏感信息', false);
    expect(result).toBe('这是一条**信息');
  });

  test('passes clean text', () => {
    const result = filter.filterSensitiveWord('这是一条正常信息', true);
    expect(result).toBe(true);
  });

  test('masks multiple sensitive words', () => {
    const result = filter.filterSensitiveWord('政治敏感和脏话', false);
    expect(result).toBe('****和**');
  });
});
```
## Summary

DFA is a solid default for sensitive word filtering: a single O(n) pass over the text, fast enough for chat workloads. A real deployment also has to handle evasion via pinyin, look-alike variants, and special characters. I hope this implementation is useful.