Text Classification from Labeled and Unlabeled Documents using EM - Printable Version (Free Academic Seminars And Projects Reports, https://easyreport.in)
Text Classification from Labeled and Unlabeled Documents using EM - [email protected] - 10-04-2017

Text Classification from Labeled and Unlabeled Documents using EM
[attachment=673]

Abstract. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.

Introduction
Consider the problem of automatically classifying text documents. This problem is of great practical importance given the massive volume of online text available through the World Wide Web, Internet news feeds, electronic mail, corporate databases, medical patient records and digital libraries. Existing statistical text learning algorithms can be trained to approximately classify documents, given a sufficient set of labeled training examples. These text classification algorithms have been used to automatically catalog news articles (Lewis & Gale, 1994; Joachims, 1998) and web pages (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, & Slattery, 1998; Shavlik & Eliassi-Rad, 1998), automatically learn the reading interests of users (Pazzani, Muramatsu, & Billsus, 1996; Lang, 1995), and automatically sort electronic mail (Lewis & Knowles, 1997; Sahami, Dumais, Heckerman, & Horvitz, 1998).

Argument for the Value of Unlabeled Data
How are unlabeled data useful when learning classification? Unlabeled data alone are generally insufficient to yield better-than-random classification because there is no information about the class label (Castelli & Cover, 1995). However, unlabeled data do contain information about the joint distribution over features other than the class label. Because of this, they can sometimes be used together with a sample of labeled data to significantly increase classification accuracy in certain problem settings.
To see this, consider a simple classification problem, one in which instances are generated using a Gaussian mixture model. Here, data are generated according to two Gaussian distributions, one per class, whose parameters are unknown. Figure 1 illustrates the Bayes-optimal decision boundary (x > d), which classifies instances into the two classes shown by the shaded and unshaded areas. Note that it is possible to calculate d from Bayes rule if we know the Gaussian mixture distribution parameters (i.e., the mean and variance of each Gaussian, and the mixing parameter between them).

The Probabilistic Framework
This section presents a probabilistic framework for characterizing the nature of documents and classifiers. The framework defines a probabilistic generative model for the data, and embodies two assumptions about the generative process: (1) the data are produced by a mixture model, and (2) there is a one-to-one correspondence between mixture components and classes. The naive Bayes text classifier we will discuss later falls into this framework, as does the example in Section 2.
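As a concrete illustration of the Gaussian mixture example above, the sketch below computes the Bayes-optimal boundary d by solving P(c0)·N(x; mu0, sigma) = P(c1)·N(x; mu1, sigma) for x. This is not code from the paper; the function names are hypothetical, and for simplicity it assumes the two class-conditional Gaussians share one variance (the paper's setting only requires that the mixture parameters are known).

```python
import math

def bayes_boundary(mu0, mu1, sigma, p0, p1):
    """Bayes-optimal decision boundary d for two 1-D Gaussian classes.

    Assumes a shared variance sigma**2 and mixing (prior) weights p0, p1.
    Setting p0 * N(x; mu0, sigma) == p1 * N(x; mu1, sigma) and solving
    for x gives the midpoint plus a shift determined by the priors.
    """
    return (mu0 + mu1) / 2 + sigma**2 * math.log(p0 / p1) / (mu1 - mu0)

def posterior_class1(x, mu0, mu1, sigma, p0, p1):
    """P(class 1 | x) by Bayes rule (normalization constants cancel)."""
    def lik(mu):
        return math.exp(-(x - mu) ** 2 / (2 * sigma**2))
    return p1 * lik(mu1) / (p0 * lik(mu0) + p1 * lik(mu1))

# With equal priors the boundary is simply the midpoint of the means:
d = bayes_boundary(0.0, 2.0, 1.0, 0.5, 0.5)  # -> 1.0
```

At x = d the posterior probability of each class is exactly 0.5, which is why unlabeled data help: they pin down the mixture parameters (means, variances, mixing weight), and d then follows from Bayes rule without any additional labels.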