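Bookmarked: The Easy Way to Extract Useful Text from Arbitrary HTML
Sunday, June 29th, 2008
Original article: http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
Author: alexjc
The article presents an unconventional, simple, effective and fairly general method for extracting the main content of an HTML document. It approaches the problem from a statistics and machine-learning angle: the density of text relative to HTML code decides whether a line should be output. This sidesteps the difficulty of analysing HTML structure and tags and gets directly at the actual text content.
As the article describes, the main idea is as follows: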
- Parse the HTML code and keep track of the number of bytes processed.
- Store the text output on a per-line, or per-paragraph basis.
- Associate with each text line the number of bytes of HTML required to describe it.
- Compute the text density of each line by calculating the ratio of text to bytes.
- Then decide if the line is part of the content by using a neural network.
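
The steps above can be sketched in Python, the language the original author used. This is a minimal sketch and not the author's code: the names `DensityParser`, `BLOCK_TAGS`, `text_densities` and `extract_content` are made up for illustration, the byte counts are approximations (end tags are reconstructed, comments and doctypes are ignored), and the final neural-network step is replaced by a plain density threshold whose value of 0.5 is an assumption.

```python
from html.parser import HTMLParser

# Tags assumed to end a "line" of text; an illustrative choice, not exhaustive.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "ul", "ol", "table"}


class DensityParser(HTMLParser):
    """Collects (text, html_bytes) pairs, one per block-level chunk of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []     # finished chunks: (text, approximate HTML bytes)
        self._text = []     # text fragments of the chunk being built
        self._bytes = 0     # HTML bytes consumed since the last finished chunk
        self._skip = False  # True while inside <script> or <style>

    def flush_line(self):
        text = " ".join("".join(self._text).split())  # normalise whitespace
        if text:
            self.lines.append((text, self._bytes))
            self._bytes = 0  # markup with no text is charged to the next chunk
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_line()
        if tag in ("script", "style"):
            self._skip = True
        self._bytes += len(self.get_starttag_text().encode("utf-8"))

    def handle_endtag(self, tag):
        self._bytes += len(f"</{tag}>")
        if tag in ("script", "style"):
            self._skip = False
        if tag in BLOCK_TAGS:
            self.flush_line()

    def handle_data(self, data):
        self._bytes += len(data.encode("utf-8"))
        if not self._skip:
            self._text.append(data)


def text_densities(html):
    """Return (text, density) pairs, where density = text bytes / HTML bytes."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    parser.flush_line()
    return [(text, len(text.encode("utf-8")) / nbytes) for text, nbytes in parser.lines]


def extract_content(html, threshold=0.5):
    """Keep chunks whose density exceeds the threshold (stand-in for the neural network)."""
    return [text for text, density in text_densities(html) if density > threshold]
```

Navigation links and similar boilerplate carry a lot of markup for very little text, so their density is low, while ordinary paragraphs score close to 1. A fixed threshold is the crudest possible classifier; the article's point is that a trained network can make this decision more reliably.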
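The author implements a basic example in Python and then applies the machine-learning algorithms of FANN (Fast Artificial Neural Network, a neural network library) to make the results more robust. The reasoning is clear, the code is simple and the charts are easy to read; it is an excellent article.
Lai Yonghao (赖勇浩) of CSDN, author of the 恋花蝶 blog, has translated the article into Chinese: http://blog.csdn.net/lanphaday/archive/2007/08/13/1741185.aspx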