With the rapid development of the Internet, the information available on the Web has become very rich. New technology is urgently needed to automatically find, extract and filter information from Web resources, and Web mining emerged to meet this need. The purpose of Web mining is to find and extract potentially useful patterns and hidden information from Web documents and Web activities. It combines traditional data mining technology with the Web and can play a role in many areas, such as mining search engine data, developing search engines and improving their quality and efficiency, determining authority pages, Web document classification, Web log mining, intelligent queries and the establishment of Meta-Web data warehouses.

The WWW serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other services. As the amount of information on the WWW grows rapidly, Web mining has become increasingly important within data mining. Web mining can be classified into three domains: Web Structure Mining, Web Content Mining and Web Usage Mining. Web Usage Mining generally involves three tasks: pre-processing, knowledge discovery and pattern analysis.
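The three Web Usage Mining tasks named above can be illustrated with a minimal Python sketch. It is not a method taken from this paper: the sample log lines, the 30-minute session timeout, the list of ignored file suffixes, and the function names `preprocess`, `discover_sessions` and `analyze` are all assumptions chosen for illustration, and the log is assumed to be in Common Log Format.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample in Common Log Format (invented for illustration).
SAMPLE_LOG = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /logo.png HTTP/1.1" 200 512
10.0.0.1 - - [01/Jan/2024:10:05:00 +0000] "GET /products.html HTTP/1.1" 200 2048
10.0.0.2 - - [01/Jan/2024:10:06:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:11:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.2 - - [01/Jan/2024:10:07:00 +0000] "GET /missing.html HTTP/1.1" 404 0
"""

LINE_RE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+'
)
IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")  # assumed noise
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold


def preprocess(log_text):
    """Task 1: parse the log and drop failed, non-GET and embedded-resource hits."""
    records = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        host, stamp, method, path, status = match.groups()
        if method != "GET" or status != "200" or path.endswith(IGNORED_SUFFIXES):
            continue
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        records.append((host, when, path))
    # Order by host, then by time, so sessions can be cut in one pass.
    return sorted(records, key=lambda r: (r[0], r[1]))


def discover_sessions(records):
    """Task 2: group each host's requests into sessions using the timeout."""
    sessions, last_seen, current = [], {}, {}
    for host, when, path in records:
        if host not in last_seen or when - last_seen[host] > SESSION_TIMEOUT:
            current[host] = []
            sessions.append(current[host])
        current[host].append(path)
        last_seen[host] = when
    return sessions


def analyze(sessions):
    """Task 3: count in how many sessions each page appears."""
    return Counter(page for session in sessions for page in set(session))


if __name__ == "__main__":
    found = discover_sessions(preprocess(SAMPLE_LOG))
    for page, count in analyze(found).most_common():
        print(page, count)
```

Identifying a user by host address alone is a simplification; real logs usually require additional cues such as user agent or cookies to separate users behind a shared address.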


2.1 Web-based data mining concept

Data mining is the extraction of potentially useful, previously unknown information, patterns and trends from large volumes of incomplete, noisy, fuzzy and random data drawn from practical applications. Web mining grew out of data mining; it is a new field in which data mining technology is applied to Web information. It has new characteristics compared with traditional data mining. First, the objects of Web mining are large numbers of Web documents that are distributed and heterogeneous, and each data source is itself heterogeneous. Second, a Web document is semi-structured or unstructured and lacks semantics that a machine can understand. Semi-structured means that although data on the Web has a certain structure, it has no specific model describing it: the data of each site is designed independently and changes dynamically, so it is not fully structured data. Traditional data mining, by contrast, is confined to structured data in databases and uses storage structures such as relational tables to discover knowledge. Consequently, some data mining techniques are not applicable to Web mining, and those that are can only be used after the Web documents have been pre-processed.
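As an illustration of what such pre-processing might look like, the following minimal Python sketch turns a semi-structured HTML page into a fixed-schema record that conventional mining techniques could consume. It uses only the standard library `html.parser` module; the class `PageExtractor`, the function `pretreat`, the sample page and the chosen record fields (title, links, text) are hypothetical and not taken from this paper.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the title, outgoing links and visible text of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and non-visible content
        if self._in_title:
            self.title = text
        else:
            self.text_parts.append(text)


def pretreat(html):
    """Map one HTML document to a record with a fixed set of fields."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }


# Hypothetical page, invented for illustration.
SAMPLE_HTML = """<html><head><title>Shop</title><style>p {color: red}</style></head>
<body><h1>Welcome</h1><p>See our <a href="/products.html">products</a>.</p>
<script>var x = 1;</script></body></html>"""

if __name__ == "__main__":
    print(pretreat(SAMPLE_HTML))
```

Every page, whatever its original layout, is reduced to the same three fields, which is the sense in which pre-processing imposes the structure that the Web documents themselves lack.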

