Inserting collected content into a word list (write-up of my research into sensitive-word filtering for Java web sites)

优采云 · Published: 2021-09-19 21:05

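　　Almost every website now needs sensitive-word filtering; it seems to have become standard equipment for a site. If your site has none, or you have not handled it properly, don't be surprised if the relevant authorities invite you in "for tea".

　　I have recently been looking into how to implement sensitive-word filtering on a Java web site and collected material from around the internet. After checking it myself, I am writing up the results here for your reference.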

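　　1. Sensitive-word filter utility class

　　Load the sensitive-word dictionary into an ArrayList, then use a nested (two-level) loop to find every substring of the input that matches an entry in the list. Each match is replaced with * characters, and the result is the masked string.

　　In my tests this method matched accurately and ran quickly.

　　Initializing the sensitive-word list: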

　　//Initialize the sensitive-word list
//(needs java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader, java.util.ArrayList)
public void InitializationWork()
{
    //Pre-build the mask: replceStr repeated replceSize times
    replaceAll = new StringBuilder(replceSize);
    for(int x=0;x < replceSize;x++)
    {
        replaceAll.append(replceStr);
    }
    //Load the word list from the classpath, one word per line
    arrayList = new ArrayList<String>();
    //try-with-resources closes both readers, even when reading fails
    try (InputStreamReader read = new InputStreamReader(SensitiveWord.class.getClassLoader().getResourceAsStream(fileName),encoding);
         BufferedReader bufferedReader = new BufferedReader(read)) {
        for(String txt = null;(txt = bufferedReader.readLine()) != null;){
            //Skip duplicate entries
            if(!arrayList.contains(txt))
                arrayList.add(txt);
        }
    } catch (IOException e) {
        //Also covers UnsupportedEncodingException, which is a subclass of IOException
        e.printStackTrace();
    }
}

　　Filtering sensitive words:

public String filterInfo(String str)
{
    sensitiveWordSet = new HashSet<String>();
    sensitiveWordList = new ArrayList<String>();
    StringBuilder buffer = new StringBuilder(str);
    //start index -> end index (exclusive) of the longest match found at that start
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(arrayList.size());
    String temp;
    for(int x = 0; x < arrayList.size();x++)
    {
        temp = arrayList.get(x);
        if(temp.isEmpty())
            continue;//an empty entry matches at every index and would loop forever
        int findIndexSize = 0;
        for(int start = -1;(start=buffer.indexOf(temp,findIndexSize)) > -1;)
        {
            findIndexSize = start+temp.length();//continue searching after this match
            Integer mapStart = hash.get(start);//end index already recorded for this start, if any
            if(mapStart == null || findIndexSize > mapStart)//new start, or a longer match at the same start
            {
                hash.put(start, findIndexSize);
            }
        }
    }
    //Replace the matches only after every word has been searched for
    for(Map.Entry<Integer, Integer> entry : hash.entrySet())
    {
        int startIndex = entry.getKey();
        int endIndex = entry.getValue();
        //Record the sensitive word so that hits can be counted
        String sensitive = buffer.substring(startIndex, endIndex);
        if (!sensitive.contains("*")) {//skip spans already masked by an overlapping match
            sensitiveWordSet.add(sensitive);
            sensitiveWordList.add(sensitive);
        }
        //replaceAll must hold at least endIndex-startIndex characters (see replceSize)
        buffer.replace(startIndex, endIndex, replaceAll.substring(0,endIndex-startIndex));
    }
    hash.clear();
    return buffer.toString();
}
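　　The fragments above depend on fields that are only in the download, so here is a minimal self-contained sketch of the same idea (a word list plus a nested loop over words and occurrences). The class name ListWordFilter and the method filter are my own placeholders, not part of the original utility, and the sketch masks character by character instead of using a pre-built mask string.

```java
import java.util.List;

/** Hypothetical stand-alone version of the list-based filter; not the original class. */
class ListWordFilter {

    /** Masks every occurrence of every word in the list with '*' characters. */
    static String filter(String text, List<String> words) {
        // end[i] = exclusive end of the longest match starting at i, or 0 if none
        int[] end = new int[text.length()];
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // an empty entry would match at every position
            }
            // Outer loop: each word. Inner loop: each occurrence of that word.
            for (int start = text.indexOf(word); start >= 0; start = text.indexOf(word, start + 1)) {
                end[start] = Math.max(end[start], start + word.length());
            }
        }
        // Mask only after all words have been searched, so matches cannot hide each other
        StringBuilder out = new StringBuilder(text);
        for (int i = 0; i < end.length; i++) {
            for (int j = i; j < end[i]; j++) {
                out.setCharAt(j, '*');
            }
        }
        return out.toString();
    }
}
```

　　With the words "bad", "badword" and "evil", ListWordFilter.filter("a badword and an evil plan", words) returns "a ******* and an **** plan". The cost grows with the number of words times the text length, which is worth keeping in mind for large dictionaries.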

　　Download: sensitiveword

　　Link: (password: qmcw; if it no longer works, use the download address at the end of the article)

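　　2. Java keyword filtering

　　This method matches with a regular expression. It is slightly slower than the first method, and its matching accuracy is good.

　　Main code: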

　　// Build the regular expression from keywords.properties
private static void initPattern() {
    StringBuilder patternBuffer = new StringBuilder();
    // keywords.properties is read from the classpath; each property name is one keyword
    try (InputStream in = KeyWordFilter.class.getClassLoader().getResourceAsStream("keywords.properties")) {
        Properties property = new Properties();
        property.load(in);
        Enumeration<?> enu = property.propertyNames();
        patternBuffer.append("(");
        while (enu.hasMoreElements()) {
            String scontent = (String) enu.nextElement();
            // Keywords are inserted as-is, so regex metacharacters in them stay active
            patternBuffer.append(scontent).append("|");
            //System.out.println(scontent);
            keywordsCount++;
        }
        // Drop the trailing "|" (assumes the file holds at least one keyword)
        patternBuffer.deleteCharAt(patternBuffer.length() - 1);
        patternBuffer.append(")");
        //System.out.println(patternBuffer);
        // On Unix, convert to UTF-8:
        // pattern = Pattern.compile(new
        // String(patternBuffer.toString().getBytes("ISO-8859-1"), "UTF-8"));
        // On Windows, convert to gb2312:
        // pattern = Pattern.compile(new String(patternBuffer.toString()
        // .getBytes("ISO-8859-1"), "gb2312"));
        // The encoding conversions above are disabled; compile the pattern as loaded
        pattern = Pattern.compile(patternBuffer.toString());
    } catch (IOException ioEx) {
        ioEx.printStackTrace();
    }
}

private static String doFilter(String str) {
    Matcher m = pattern.matcher(str);
    // while (m.find()) {// find each substring that matches the pattern
    // System.out.println("The result is here :" + m.group());
    // }
    // Choose the replacement; here each match is replaced with a single *
    str = m.replaceAll("*");
    return str;
}
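　　For comparison, here is a small self-contained sketch of the regular-expression approach that takes its keywords from a list instead of a properties file. The class name RegexWordFilter is a placeholder of mine. Two choices differ from the code above and are my own assumptions about what is wanted: keywords are escaped with Pattern.quote so they match literally, and each match is masked with one * per character.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical regex-based filter; a sketch, not the original KeyWordFilter. */
class RegexWordFilter {
    private final Pattern pattern; // null when there are no keywords

    RegexWordFilter(List<String> keywords) {
        String alternation = keywords.stream()
                .filter(k -> !k.isEmpty())
                // Longest first, so "badword" is tried before "bad"
                .sorted(Comparator.comparingInt(String::length).reversed())
                // Escape each keyword so characters such as '.' match literally
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = alternation.isEmpty() ? null : Pattern.compile(alternation);
    }

    String filter(String text) {
        if (pattern == null) {
            return text; // nothing to filter
        }
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build a mask as long as the matched keyword
            char[] stars = new char[m.group().length()];
            Arrays.fill(stars, '*');
            m.appendReplacement(out, new String(stars));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

　　With the keywords "bad", "badword" and "a.b", filtering "badword axb a.b bad" gives "******* axb *** ***": the quoted "a.b" no longer matches "axb", and the longer keyword wins where two keywords share a prefix.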

　　Download: keyword filter (关键字过滤器)

　　Link: (password: xi24; if it no longer works, use the download address at the end of the article)

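　　3. DFA filtering algorithm

　　This one uses the DFA (deterministic finite automaton) algorithm. I don't know much about this algorithm. In my tests its matching accuracy was poor, although it was reasonably fast. It can probably be improved, and I would welcome an expert doing so.

　　There are two main files: sensitivewordfilter.java and sensitivewordinit.java

　　Main code: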

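　　public int CheckSensitiveWord(String txt,int beginIndex,int matchType){
    boolean flag = false; //set once a complete sensitive word has been matched
    int matchFlag = 0; //number of characters matched so far, default 0
    int matchedLength = 0; //length of the last complete word matched
    char word = 0;
    Map nowMap = sensitiveWordMap;
    for(int i = beginIndex; i < txt.length() ; i++){
        word = txt.charAt(i);
        nowMap = (Map) nowMap.get(word); //look up the next state for this character
        if(nowMap != null){ //a transition exists; check whether a word ends here
            matchFlag++; //one more character matched
            if("1".equals(nowMap.get("isEnd"))){ //a complete word ends at this character
                flag = true;
                //Remember this length: returning matchFlag itself would also count characters
                //matched past the end of the word under the max rule, a likely source of false hits
                matchedLength = matchFlag;
                if(SensitivewordFilter.minMatchTYpe == matchType){ //min rule: return the first word found; max rule: keep looking
                    break;
                }
            }
        }
        else{ //no transition for this character: stop
            break;
        }
    }
    if(matchedLength < 2 || !flag){ //matches shorter than two characters are discarded
        matchedLength = 0;
    }
    return matchedLength;
}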
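　　The check above walks a nested map, but the code that builds it (sensitivewordinit.java) is not shown in this article. The sketch below is my guess at a structure consistent with that check: each character maps to a child map, and an "isEnd" entry of "1" marks the end of a word. The class name DfaWordFilter and its methods are placeholders, and the sketch always uses the longest match.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical nested-map DFA filter; a sketch, not the original two classes. */
@SuppressWarnings("unchecked")
class DfaWordFilter {
    private static final String IS_END = "isEnd";
    // Keys are Character (transitions) plus the String "isEnd" (word-end marker)
    private final Map<Object, Object> root = new HashMap<>();

    DfaWordFilter(List<String> words) {
        for (String word : words) {
            Map<Object, Object> node = root;
            for (int i = 0; i < word.length(); i++) {
                Character key = word.charAt(i);
                Map<Object, Object> next = (Map<Object, Object>) node.get(key);
                if (next == null) {
                    next = new HashMap<>();
                    next.put(IS_END, "0");
                    node.put(key, next);
                }
                node = next;
            }
            if (node != root) {
                node.put(IS_END, "1"); // a word ends at this state
            }
        }
    }

    /** Length of the longest word starting at beginIndex, or 0 if none starts there. */
    int matchLength(String txt, int beginIndex) {
        Map<Object, Object> node = root;
        int longest = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            Object next = node.get(txt.charAt(i));
            if (next == null) {
                break; // no transition for this character
            }
            node = (Map<Object, Object>) next;
            if ("1".equals(node.get(IS_END))) {
                longest = i - beginIndex + 1;
            }
        }
        return longest;
    }

    /** Masks every matched word with '*' characters. */
    String filter(String txt) {
        StringBuilder out = new StringBuilder(txt);
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(txt, i);
            for (int j = i; j < i + len; j++) {
                out.setCharAt(j, '*');
            }
            i += Math.max(len, 1); // skip past the match, or advance one character
        }
        return out.toString();
    }
}
```

　　With the words "ab" and "abcd", filtering "abcx abcd" gives "**cx ****": the scan passes "abc" in the first token, but only the completed word "ab" is masked. Each position is checked in time proportional to the longest word, regardless of how many words the dictionary holds.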

　　Download: sensitivewordfilter

　　Link: (password: mc1x; if it no longer works, use the download address at the end of the article)

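　　4. Multi-way tree search algorithm

　　This method uses a multi-way tree search. For how the algorithm works, see any data-structures reference. A Jar package is provided, so you can call it directly to filter text.

　　In my tests this method matched accurately but ran slowly.

　　Usage: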

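　　//Filter the sensitive words, masking them with '*'
FilteredResult result = WordFilterUtil.filterText(str, '*');
//Get the filtered content
System.out.println("Filtered string:\n"+result.getFilteredContent());
//Get the original string
System.out.println("Original string:\n"+result.getOriginalContent());
//Get the sensitive words that were replaced
System.out.println("Replaced sensitive words:\n"+result.getBadWords());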

　　Download: wordfilterutil

　　Link: (password: 5t2h; if it no longer works, use the download address at the end of the article)
