Computer Science Clay · 08-17-2017, 12:16 AM

E-MINE: A NOVEL WEB MINING APPROACH

ABSTRACT

In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. Hence, a large collection of documents, images, text files and other forms of data in structured, semi-structured and unstructured forms is available on the web. It has become increasingly difficult to identify relevant pieces of information, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. Thus, we propose a technique that mines the relevant data regions from a web page. This technique is based on three important observations about data regions on the web.

Introduction

Extracting the regularly structured data records from web pages is an important problem. So far, several attempts have been made to deal with the problem. The main disadvantage with the existing automatic approaches is their assumption that the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true. Thus, we propose a more effective method to mine the data region in a web page. The algorithm, eMine, finds the data regions formed by all types of tags using visual cues.

Related Work

The main related work in the area of mining data records from a web page is MDR (Mining Data Records). MDR is a well-known approach that exploits the regularities in the HTML tag structure directly. The MDR algorithm uses the HTML tag tree of the web page to extract data records from the page. However, an incorrect tag tree may be constructed due to the misuse of HTML tags, which in turn makes it impossible to extract data records correctly.

The Proposed Technique

We propose a novel and effective method, eMine, to mine the data region from a web page automatically. The basic criterion eMine uses is visual information, i.e. the locations on the screen at which tags are rendered.

How the Algorithm Works

The algorithm takes the HTML source of the web page as input. In step 2, we scan the HTML document for tags and identify the height and width of all the bounding rectangles, which gives the area of each bounding rectangle. Step 3 finds the largest of these bounding rectangles. Step 4 identifies the container that holds most of the relevant data region (along with some irrelevant regions). Step 5 identifies the actual relevant data region by filtering out the irrelevant regions.
The following sections provide more details about the individual modules associated with the algorithm.
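
Steps 2 and 3 above (computing areas and picking the largest bounding rectangle) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the `BoundingRect` class, its fields and the `largest_rectangle` function are hypothetical names, and the rectangles are assumed to have already been obtained from a rendering engine. Steps 4 and 5 are not shown because the post does not describe them in enough detail.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BoundingRect:
    """Hypothetical record for one rendered tag (names are assumptions)."""
    tag: str
    left: int
    top: int
    width: int
    height: int

    @property
    def area(self) -> int:
        # Step 2: the area follows directly from height and width.
        return self.width * self.height


def largest_rectangle(rects: Iterable[BoundingRect]) -> BoundingRect:
    """Step 3: return the bounding rectangle with the largest area."""
    return max(rects, key=lambda r: r.area)


if __name__ == "__main__":
    # Made-up sample rectangles, for illustration only.
    sample = [
        BoundingRect("table", 0, 0, 100, 50),
        BoundingRect("div", 0, 60, 300, 200),
        BoundingRect("td", 5, 5, 10, 10),
    ]
    print(largest_rectangle(sample).tag)
```

Note that the comparison is made purely on numbers (areas), which matches the point made in the conclusion that eMine compares numbers rather than strings or trees.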

Determining the Height and Width of All Bounding Rectangles

In the first step of the proposed technique, we determine the dimensions of all the bounding rectangles in the web page. A <table> tag may carry explicit height and width attributes; where present, we extract them. Where they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used, since it gives us the coordinates of a bounding rectangle. We scan the HTML file for tags. For each tag encountered, we determine the coordinates of the bounding rectangle of the corresponding tag and plot it.
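
The attribute-extraction part of this step could look something like the sketch below, which uses Python's standard `html.parser` module. It is only an assumption about how one might do it, not the method from the paper: `TableSizeCollector`, `_to_pixels` and `table_sizes` are hypothetical names, and a size that is missing or not given in plain pixels (for example a percentage) is recorded as `None`, meaning a rendering engine would be needed to resolve it.

```python
from html.parser import HTMLParser


def _to_pixels(value):
    """Return an int for plain pixel values such as "600" or "600px", else None."""
    if value is None:
        return None
    value = value.strip().lower()
    if value.endswith("px"):
        value = value[:-2]
    return int(value) if value.isdigit() else None


class TableSizeCollector(HTMLParser):
    """Collects (width, height) for every <table> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        attr_map = dict(attrs)
        self.sizes.append(
            (_to_pixels(attr_map.get("width")), _to_pixels(attr_map.get("height")))
        )


def table_sizes(html):
    """Scan the HTML source and return the declared size of each table."""
    collector = TableSizeCollector()
    collector.feed(html)
    collector.close()
    return collector.sizes


if __name__ == "__main__":
    page = '<table width="600" height="400"><tr><td>x</td></tr></table><table width="50%"></table>'
    # None marks a dimension that would have to come from the rendering engine.
    print(table_sizes(page))
```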

Conclusion

In this paper, we have proposed a new approach to extract structured data from web pages. Although the problem has been studied by several researchers, existing techniques make many strong assumptions. eMine is a purely visual, structure-oriented method that can correctly identify the data regions. Most of the current algorithms fail to determine the data region correctly when it consists of only one data record. Most of the approaches also fail in the case where a series of data records is separated by an advertisement and then followed by a single data record. eMine works correctly for this case. Further, the comparisons are made on numbers, unlike other methods where strings or trees are compared. Thus eMine overcomes the drawbacks of existing methods and performs significantly better than them.

vijayrabha · 08-17-2017, 12:16 AM

satyajit · 08-17-2017, 12:16 AM

sanjeev kr singh · 08-17-2017, 12:16 AM

To get information about the topic "A novel web mining approach", including the full report and PPT, and about related topics, refer to the page links below:

http://seminarsprojects.net/Thread-e-min...e=threaded

http://seminarsprojects.net/Thread-e-min...g-approach

priyansha · 08-17-2017, 12:16 AM

Sir/Madam,
I am Sampath, studying B.E. (ISE) at VCET, Mangalore. I want the seminar report and PPT on the topic "E-MINE: A novel web mining approach".
Please send those to my mail id "[email protected]".

Thank you.

butobvious · 08-17-2017, 12:16 AM

Hi friends, my name is Bharath Chandra. I am studying B.Tech ECE, final year.
Please send me info about the technical seminar topic "APPLE - a novel approach for direct energy weapon control". My mail id is [email protected]

nishawilson · 08-17-2017, 12:16 AM

I think you mentioned the wrong disadvantages, because it extracts only the center portion, which contains only useful data and no unwanted data.