Part of #Language detection for classification and content-based web pages filtering# :
Publishing year : 2011
Conference : National Electronic City Congress
Number of pages : 5
Abstract: As the number of documents on the internet grows daily, automatic language detection becomes more important. In this paper we use a language detection system to classify and filter immoral web pages based on their contents. The system detects the 10 languages most used in immoral web pages, including Farsi. As a technique we introduce a new combined method that consists of three parts: a URL processor, a page encoding processor, and a text processor. To generate proper results, the system has a voter that combines the results of these three parts. We used immoral web pages and labeled web pages as the input data set, both to create a linguistic model for each language and to evaluate the system. Our experiments show 95% accuracy. No single method can solve the problem by itself: the name used in a page's address may not reveal the immorality of the page, and many web pages in different languages use the same encoding. The paper shows that the combination of the three methods gives very promising results. The paper consists of related work, problem definition, introduction of the solution, interpretation of results, and conclusion and future work.
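The voting idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the voting rule, the features, or the weights, so the weighted-sum voter, the function names (`url_scores`, `encoding_scores`, `text_scores`, `vote`), the hint tables, and the stop-word lists below are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

Scores = Dict[str, float]

# Hypothetical hint tables; the paper does not list its actual rules.
TLD_HINTS = {"ir": "fa", "de": "de", "fr": "fr", "ru": "ru"}
ENCODING_HINTS = {
    # One encoding can serve several languages, so the score is split.
    "windows-1256": {"fa": 0.5, "ar": 0.5},
    "windows-1251": {"ru": 1.0},
}
STOPWORDS = {
    "en": {"the", "and", "of"},
    "fa": {"و", "در", "به"},
    "de": {"der", "und", "die"},
}


def url_scores(url: str) -> Scores:
    """Guess a language from the top-level domain of the URL."""
    host = urlparse(url).hostname or ""
    lang = TLD_HINTS.get(host.rsplit(".", 1)[-1])
    return {lang: 1.0} if lang else {}


def encoding_scores(encoding: str) -> Scores:
    """Guess candidate languages from the declared page encoding."""
    return dict(ENCODING_HINTS.get(encoding.lower(), {}))


def text_scores(text: str) -> Scores:
    """Score languages by their share of stop-word hits in the text."""
    words = text.lower().split()
    counts = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: c / total for lang, c in counts.items() if c}


def vote(url: str, encoding: str, text: str,
         weights: Tuple[float, float, float] = (1.0, 1.0, 2.0)) -> Optional[str]:
    """Combine the three processors with a weighted sum (assumed rule)."""
    combined: Dict[str, float] = defaultdict(float)
    parts = (url_scores(url), encoding_scores(encoding), text_scores(text))
    for weight, scores in zip(weights, parts):
        for lang, score in scores.items():
            combined[lang] += weight * score
    # None signals that no processor produced any evidence.
    return max(combined, key=combined.get) if combined else None


if __name__ == "__main__":
    print(vote("http://example.ir/page", "windows-1256", "این متن در وب و به فارسی"))
    print(vote("http://example.de/x", "utf-8", "the cat and the dog"))
```

In the second call the URL suggests German while the text suggests English, and the text processor's higher (assumed) weight decides the outcome. This mirrors the abstract's point that the address or the encoding alone can mislead.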