good classifier principle



Classification is a process of dividing a particle-laden gas stream into two, ideally at a particular particle size, known as the cut size. An important industrial application of classifiers is to reduce overgrinding in a mill by separating the grinding zone output into fine and coarse fractions. Many types of classifier are available, which can be categorized according to their operating principles. A distinction must be made between gas cleaning equipment, in which the aim is the removal of all solids from the gas stream, and classifiers in which a partition of the particle size distribution is sought. Prasher (1987) identifies the following categories: a) screens, b) cross-flow systems, c) elutriation, d) inertia systems, e) centrifugal systems without moving parts, f) centrifugal systems with rotating walls, and g) mechanical rotor systems. A classification process may combine these alternative principles, sometimes within a single separator, to achieve a desired result.

Screens contain apertures which are uniformly sized and spaced, and which may have circular, square or rectangular shapes. Particles which are smaller than the aperture in at least two dimensions pass through, and larger ones are retained on the surface. The screen is shaken or vibrated to assist motion of particles to the surface, and continuous screens are often tilted to further aid particle bed motion along the screen surface. Static (or low-frequency) screens, or grizzlies, have a different construction: they consist of parallel bars or rods with uniformly clear openings, often tapered from feed to discharge ends. The bars may lie horizontally above a bin, or be inclined to provide the feed to a crusher.

It is in principle possible to winnow out fines from a falling curtain of material of constant density by a cross-current of air. In practice, humidity of the air (and moisture on the particles) leads to blockage of the narrow ducts necessary to give a thin enough falling curtain for winnowing. It is possible to winnow thin flakes; Etkin et al. (1980) have successfully classified mica particles with an aspect ratio greater than 30.

Gravity counter-current classifiers (elutriators) have been reviewed by Wessel (1962). A simple example, the Gonnell (1928) classifier, consists of a long vertical cylindrical tube with a conical transition zone located at the bottom end. Air flows up the tube, carrying with it the finer particles. The disadvantage of this and many other gravity counter-current classifiers is the presence of a laminar velocity profile in the gas, a large cone angle leading to flow separation and eddy formation, settling out of fines due to the retarded velocities near the walls, and the noise of vibrators necessary to prevent particle adhesion to the walls. Their advantage lies in the good dispersion of powders achieved in the cylindrical section. In the zig-zag classifier, vortex formation leads to the acceleration of the main flow owing to a reduction of the effective tube cross-section. Fines follow the main gas stream and coarse particles travel to the wall, and fall back against the main gas flow. In this design, the sharpness of cut is low at each stage (zig-zag), but a required cut size is generally achievable even at high velocities.
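In a gravity counter-current classifier, the cut size follows from equating a particle's Stokes terminal velocity to the upward gas velocity: a particle of exactly the cut size neither rises with the gas nor falls against it. A minimal sketch of that balance (the property values in the example are illustrative assumptions, and the Stokes-law form only holds at low particle Reynolds number):

```python
import math

def stokes_cut_size(v_gas, rho_p, rho_f, mu):
    """Diameter (m) whose Stokes terminal velocity equals the gas velocity.

    Sets v_gas = g * d**2 * (rho_p - rho_f) / (18 * mu) and solves for d.
    Only valid at low particle Reynolds number (Re below roughly 0.25).
    """
    g = 9.81  # gravitational acceleration, m/s^2
    return math.sqrt(18 * mu * v_gas / (g * (rho_p - rho_f)))

# Illustrative example: quartz-like particles (2650 kg/m^3) in ambient air
d_cut = stokes_cut_size(v_gas=0.05, rho_p=2650.0, rho_f=1.2, mu=1.8e-5)
print(f"cut size = {d_cut * 1e6:.1f} um")
```

For a 0.05 m/s upward air flow this gives a cut size in the tens of microns, consistent with the superfine duty these elutriators are used for.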

In an inertial classifier, the particle-laden gas stream is turned through 180° by appropriate internal baffling. In order to reach the exit port, the gas passes through a further 180° to continue in the same direction it was travelling before it was diverted. The fines are able to follow, more or less, the same route as the gas. However, the momentum of coarser or denser particles prevents them from following the same trajectory and they fall into a collection zone after the first turn.

The capacities of these types of classifiers cover a wide range. Generally, higher-capacity machines have a poorer sharpness of cut. Typical high-capacity industrial units are the cone classifier (often built into some types of mills) and the cyclone. The feed is given a high tangential velocity and is introduced near to the top of the unit. The gas flows in a spiralling fashion towards the bottom end where it experiences a flow reversal and passes up as a central core. In the cone classifier, the central core of gas actually flows in a reverse spiral up the wall of a central feed. Under the influence of centrifugal force, coarse particles are thrown to the inner wall of the cone or cyclone. Particles less than the cut size are carried up the central vortex and are carried out of the unit by the bulk of the gas flow. The diameter and position of the vortex finder at the top of the unit are critical in the determination of a specified cut size. Further information on cyclones is given in the overview article on Gas-Solid Separation.
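In the centrifugal field of a cyclone or cone classifier, the same drag balance applies with the centrifugal acceleration v_t²/r in place of gravity: a particle at the cut size drifts outward exactly as fast as the inward radial gas flow carries it in. A rough equilibrium-orbit sketch (the operating values below are illustrative assumptions, not data for any particular unit):

```python
import math

def centrifugal_cut_size(v_tan, v_rad, r, rho_p, rho_f, mu):
    """Equilibrium-orbit cut size in a centrifugal classifier.

    Outward drift (Stokes settling under the centrifugal acceleration
    v_tan**2 / r) balances the inward radial gas velocity v_rad:
        d**2 * (rho_p - rho_f) * v_tan**2 / (18 * mu * r) = v_rad
    """
    return math.sqrt(18 * mu * v_rad * r / ((rho_p - rho_f) * v_tan ** 2))

# Illustrative values: 15 m/s tangential, 0.5 m/s inward flow at r = 0.1 m
d50 = centrifugal_cut_size(15.0, 0.5, 0.1, 2650.0, 1.2, 1.8e-5)
print(f"d50 = {d50 * 1e6:.1f} um")  # a few microns
```

Comparing this with the gravity case shows why centrifugal units reach much finer cuts: the centrifugal acceleration v_t²/r here is over two hundred times g.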

The Larox classifier is another high-capacity system, shown in Figure 1. The particles are dispersed by the feed falling across an inlet gas stream; the coarsest particles fall through the gas stream and into an outlet chute, and are thereby separated. Classification of the remainder occurs in a horizontal cyclone. There are three adjustable flights (A, B and C) to be positioned to give the best cut.

Spiral classifiers, such as the Alpine Mikroplex design for separation in the superfine region, were developed to partially overcome undesirable boundary layer effects associated with spinning fluids at stationary walls (Rumpf and Leschonski (1967)). Air is introduced tangentially at the periphery into a flat cylindrical space and moves along spiral flow lines into the center, from where it is drawn off. The fines follow the flow while the coarse particles spin round at the circumference; in some designs, this recirculating coarse stream is reclassified by passage of the incoming air through it. The coarse fraction leaves through a slit at the periphery (as in the Walther Classifier) or is removed using a screw extractor (as in the Alpine Mikroplex Classifier). The cut size theoretically has a stable circular trajectory in the classifying zone, but (in common with most other classifiers) separation is poorer with higher solids loadings.

To extend effective separation over a wider range of operating parameters, many classifiers are designed with a mechanical rotor built into them. The rotor has several effects: 1) large particles are deflected back into the classifier, thereby reducing the proportion of coarse particles in the fine product, 2) it aids recirculation of the air stream in some classifier types, and 3) the generation of a forced vortex keeps large particles at the periphery, but fines follow a helical trajectory to the center where they pass out with the exiting air.

what is a good classifier? (2/4) | skilja


In our small series about classification quality we used the precision-recall graph to show the difference between a very good and a so-so classifier in a recent post that you can find here. This graphical representation is very common and easy to understand. Apart from the absolute numbers for the recall (e.g. 85% correctly classified documents) it is also important to understand how classification quality can be influenced by a threshold applied to the classification result. This can already be seen in the precision-recall graph if you know what it should look like. But it becomes much more obvious if the errors are displayed as a function of the recall. We call this diagram the inverted-precision graph, which is described in this part 2 of our little series.

The graph can be easily created by the same type of benchmark test that is used for the precision-recall graph: either by measuring the classification quality against a golden test set, or by simply using the train-test split method, where a certain percentage of the training set is used for testing (e.g. 10%) while the remaining 90% are used for training. Of course this is repeated iteratively (in this case 10 times) until each document has been classified once.
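The iterated train-test split described above is ten-fold cross-validation. Assuming scikit-learn as a stand-in toolkit (the original benchmark tooling is not specified), the "each document is classified exactly once" procedure can be sketched as:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# A stand-in dataset; the blog's own benchmark uses a document corpus.
X, y = load_digits(return_X_y=True)

# 10-fold split: each sample is in the held-out 10% exactly once,
# so every sample receives exactly one cross-validated prediction.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(MultinomialNB(), X, y, cv=cv)
print(f"accuracy over all held-out folds: {(pred == y).mean():.3f}")
```

The array of held-out predictions (and the classifier's confidences, via `cross_val_predict(..., method="predict_proba")`) is exactly the raw material from which the precision-recall and inverted-precision graphs are drawn.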

The first curve in the inverted-precision graph plots the error rate as a function of the read rate (recall). Apparently the higher the recall, the higher the error rate that you need to accept. The error rate is shown on the left vertical axis. Let me show you how the graph also allows you to determine exactly the achievable recall and the threshold required for a desired error rate. On the right y-axis the threshold is plotted as a function of recall. By connecting the lines it is easy to see where we need to put the threshold to achieve a predefined error rate. The animated graphic below shows this step by step:
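Both curves can be computed directly from the classifier's top-choice confidences. A sketch on synthetic scores (the data here is randomly generated for illustration, not the Reuters results discussed below):

```python
import numpy as np

def inverted_precision_curve(confidence, correct):
    """For each possible threshold, compute the read rate (fraction of
    documents whose top confidence clears the threshold) and the error
    rate among those documents.

    `confidence` holds the classifier's top-choice scores and `correct`
    whether that top choice was right.
    """
    order = np.argsort(-confidence)           # descending confidence
    correct = np.asarray(correct)[order]
    conf = np.asarray(confidence)[order]
    n = len(conf)
    read_rate = np.arange(1, n + 1) / n       # accept the top k documents
    errors = np.cumsum(~correct)              # wrong answers among accepted
    error_rate = errors / np.arange(1, n + 1)
    threshold = conf                          # threshold admitting exactly k
    return read_rate, error_rate, threshold

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
truth = rng.uniform(size=1000) < conf         # higher confidence, more often right
rr, er, th = inverted_precision_curve(conf, truth)
```

Reading the result as in the animated graphic: pick the target error rate on `error_rate`, find the corresponding index, and `threshold` at that index is where the cut must be set; `read_rate` there is the recall you can expect.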

In a real-life example we have again used the well-known Reuters-21578 Apte test set. This set was assembled many years ago and includes 12,902 documents for 90 classes, with a fixed split between test and training data (3,299 vs. 9,603). The image shows the graph for a very good, linear classifier.

The second image shows the graph for a standard classifier. This is the same data as in the first post of the series but if you compare you see that the differences between a good and a mediocre classifier become much more obvious in this representation.

The discrepancy has mainly to do with normalization of results. Even if you accept that the absolute recall is low for the weak classifier at least the results should be normalized in the way that an error rate below 5% can somehow be achieved. This is obviously not the case. The inverted-precision graph is a good way to uncover this fact which might be due either to a weak classifier or to an incomplete training set. Therefore a good classification toolkit should always provide the means to create and visualize the results also in this way.

There are good technical reasons in the algorithms to explain the differences above, but this should not be the topic of this blog. More important for users is to understand that there are significant differences and that they become visible in the graphical evaluation. In an upcoming article we will drill even deeper and show the effect of classifier quality on the separation of selected pairs of classes. Stay tuned!


air classification working principles


The air drag force has an increasing influence as particles become smaller. A very small low mass particle is easily carried along with an air stream even when turning a corner. Thus, fine particles follow the air away from the classifying zone.

The ideal method of operating an air classifier would be to feed one particle at a time into the system to avoid collisions and interference. Naturally, practical operation involves a substantial cloud of material in the classification zone at all times. This does result in some loss of sharpness because of two factors: (1) coarse particles can collide with fines and carry them along to the oversize outlet and (2) particle collisions can cause an occasional large particle to bounce through into the fines recovery system.

When an air classifier is used in a milling circuit, care must be taken to avoid overload in the event of an increasing recirculating load. The pulverizer and the separator are both subject to overloading unless the feed system is adjusted to compensate for recirculation. This can become more difficult if a percentage of ungrindable material exists in the feed. Extraction of ungrindable material is possible in many cases, for example by using extraction screws.

naive bayes classifier explained


The Naive Bayes Classifier is a simple model that's usually used in classification problems. The math behind it is quite easy to understand and the underlying principles are quite intuitive. Yet this model performs surprisingly well in many cases, and this model and its variations are used in many problems. So in this article we are going to explain the math and the logic behind the model and also implement a Naive Bayes Classifier in Python and Scikit-Learn.

This article is part of a mini-series of two about the Naive Bayes Classifier. This will cover the theory, maths and principles behind the classifier. If you are more interested in the implementation using Python and Scikit-Learn, please read the other article, Naive Bayes Classifier Tutorial in Python and Scikit-Learn.

Classification tasks in Machine Learning are responsible for mapping a series of inputs X = [x1, x2, ..., xn] to a series of probabilities Y = [y1, y2, ..., ym]. This means that given one particular set of observations X = (x1, x2, ..., xn), we need to find out the probability that Y is yi, and in order to obtain a classification, we just need to choose the highest yi.

Yeah, I know, I also don't like these things explained this way. I know a formal explanation is necessary, but let's also try it in another way. Let's have this fictional table that we can use to predict if a city will experience a traffic jam.

So in a classification task, our goal would be to train a classifier model that can take information from the left (the weather outside, what kind of day it is and the time of the day) and can predict if the city will experience a traffic jam.

where y1 is the probability that there's no traffic jam and y2 is the probability that there is a traffic jam. We only need to choose the highest probability and we're done, we've obtained our prediction.

What we know from probability theory is that if X1 and X2 are independent values (meaning that, for example, the fact that the weather is rainy and that today is a weekend day are totally independent, with no conditional relation between them), then we can use this equation.

Now in our example, this assumption is true. There is absolutely no way that the fact that today is rainy is influenced by the fact that today is Saturday. But generally speaking, this assumption is not true in most cases. If we observe a large number of variables for a classification task, chances are that at least some of those variables are dependent (for example, education level and monthly income).

But the Naive Bayes Classifier is called naive just because it works based on this assumption. We consider all observed variables to be independent, because using the equation above helps us simplify the next steps.

So let's take a look back at our table to see what happens. Let's try to see what the probability is of there being a traffic jam given the fact that the weather is clear, today is a workday and it's morning time (first line in our table).
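Since the original table itself is not reproduced here, the calculation can be sketched on a small hypothetical stand-in for it; the counting logic is exactly the naive Bayes recipe described above, with the rows below invented purely for illustration:

```python
from collections import Counter

# Hypothetical rows standing in for the traffic-jam table:
# (weather, day type, time of day, traffic jam?)
rows = [
    ("clear", "workday", "morning", True),
    ("clear", "workday", "evening", True),
    ("rainy", "workday", "morning", True),
    ("clear", "weekend", "morning", False),
    ("rainy", "weekend", "evening", False),
    ("clear", "weekend", "noon",    False),
]

def naive_bayes_score(x, rows):
    """P(y) * prod_i P(x_i | y) for each class y, from raw counts."""
    classes = Counter(r[-1] for r in rows)
    scores = {}
    for y, n_y in classes.items():
        p = n_y / len(rows)                      # prior P(y)
        for i, value in enumerate(x):
            n_match = sum(1 for r in rows if r[-1] == y and r[i] == value)
            p *= n_match / n_y                   # likelihood P(x_i | y)
        scores[y] = p
    return scores

scores = naive_bayes_score(("clear", "workday", "morning"), rows)
prediction = max(scores, key=scores.get)         # pick the highest score
print(prediction)
```

With these made-up rows, "clear workday morning" scores higher for a jam than for no jam, so the model predicts a traffic jam; the painful part mentioned below is simply that every conditional probability has to be counted out like this.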

You can see that this is already becoming a painful process. You might have doubts, because the intuition behind this model looks very simple (although calculating so many probabilities may give you a headache), but it simply works very well and is used in many use cases. Let's see some of them.

If you're like me, all of this theory is almost meaningless unless we see the classifier in action. So let's see it used on a real-world example. We'll use a Scikit-Learn implementation in Python and play with a dataset. This was quite a lengthy article, so to make it easier, I've split this subject into a mini-series of two articles. For the implementation in Python and Scikit-Learn, please read Naive Bayes Classifier Tutorial in Python and Scikit-Learn.

mathematical concepts and principles of naive bayes


"Simplicity is the ultimate sophistication." – Leonardo da Vinci

With time, machine learning algorithms are becoming increasingly complex. This, in most cases, increases accuracy at the expense of higher training-time requirements. Fast-training algorithms that deliver decent accuracy are also available. These types of algorithms are generally based on simple mathematical concepts and principles. Today, we'll have a look at such a machine-learning classification algorithm, naive Bayes. It is an extremely simple, probabilistic classification algorithm which, astonishingly, achieves decent accuracy in many scenarios.

Naive Bayes Algorithm

In machine learning, naive Bayes classifiers are simple, probabilistic classifiers that use Bayes' Theorem. Naive Bayes makes strong (naive) independence assumptions between features. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a ball may be considered a soccer ball if it is hard, round, and about seven inches in diameter. Even if these features depend on each other or upon the existence of the other features, naive Bayes treats all of these properties as contributing independently to the probability that this ball is a soccer ball. This is why it is known as naive.

Naive Bayes models are easy to build. They are also very useful for very large datasets. Although naive Bayes models are simple, they are known to hold their own against even highly sophisticated classification models. Because they also require a relatively short training time, they make a good alternative for use in classification problems.

Mathematics Behind Naive Bayes

Bayes' Theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

    P(c|x) = P(x|c) P(c) / P(x)

Here:

P(c|x) is the posterior probability of class c (target) given predictor x (attributes). This represents the probability of c being true, provided x is true.
P(c) is the prior probability of the class. This is the observed probability of the class out of all the observations.
P(x|c) is the likelihood, the probability of the predictor given the class. This represents the probability of x being true, provided c is true.
P(x) is the prior probability of the predictor. This is the observed probability of the predictor out of all the observations.

Let's better understand this with the help of a simple example. Consider a well-shuffled deck of playing cards. A card is picked from that deck at random. The objective is to find the probability of a King, given that the card picked is red in color. Here:

    P(King | Red Card) = P(Red Card | King) x P(King) / P(Red Card)

where

P(Red Card | King) = probability of a red card given that the chosen card is a King = 2 red Kings / 4 Kings = 1/2
P(King) = probability that the chosen card is a King = 4 Kings / 52 cards = 1/13
P(Red Card) = probability that the chosen card is red = 26 red cards / 52 cards = 1/2

Hence, the posterior probability of a King given a red card is:

    P(King | Red Card) = (1/2) x (1/13) / (1/2) = 1/13 ≈ 0.077

Understanding Naive Bayes with an Example

Let's understand naive Bayes with one more example: predicting the weather from three predictors: humidity, temperature and wind speed. The training data is the following:

    Humidity   Temperature  Wind Speed  Weather
    Humid      Hot          Fast        Sunny
    Humid      Hot          Fast        Sunny
    Humid      Hot          Slow        Sunny
    Not Humid  Cold         Fast        Sunny
    Not Humid  Hot          Slow        Rainy
    Not Humid  Cold         Fast        Rainy
    Humid      Hot          Slow        Rainy
    Humid      Cold         Slow        Rainy

We'll use naive Bayes to predict the weather for the following test observation:

    Humidity   Temperature  Wind Speed  Weather
    Humid      Cold         Fast        ?

We have to determine which posterior is greater, Sunny or Rainy. For the class Sunny, the posterior is given by:

    Posterior(Sunny) = P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny) / evidence

Similarly, for the class Rainy:

    Posterior(Rainy) = P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy) / evidence

where

    evidence = P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny)
             + P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy)

Counting from the table:

    P(Sunny) = 0.5             P(Rainy) = 0.5
    P(Humid | Sunny) = 0.75    P(Humid | Rainy) = 0.50
    P(Cold | Sunny)  = 0.25    P(Cold | Rainy)  = 0.50
    P(Fast | Sunny)  = 0.75    P(Fast | Rainy)  = 0.25

Therefore, evidence = 0.0703 + 0.0313 = 0.1016, and

    Posterior(Sunny) = 0.0703 / 0.1016 ≈ 0.692
    Posterior(Rainy) = 0.0313 / 0.1016 ≈ 0.308

Since the posterior is greater for Sunny, we predict the sample is Sunny.

Applications of Naive Bayes

Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

Recommendation Systems: Naive Bayes classifiers are used in various inferencing systems for making certain recommendations to users out of a list of possible options.
Real-Time Prediction: Naive Bayes is a fast algorithm, which makes it an ideal fit for making predictions in real time.
Multiclass Prediction: This algorithm is also well-known for its multiclass prediction feature; we can predict the probability of multiple classes of the target variable.
Sentiment Analysis: Naive Bayes is used in sentiment analysis on social networking datasets like Twitter* and Facebook* to identify positive and negative customer sentiments.
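The weather example can be double-checked by recomputing the conditional probabilities directly from the eight training rows; this short sketch repeats the hand calculation in code:

```python
# The eight training rows from the weather example above.
rows = [
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Slow", "Sunny"),
    ("Not Humid", "Cold", "Fast", "Sunny"),
    ("Not Humid", "Hot", "Slow", "Rainy"),
    ("Not Humid", "Cold", "Fast", "Rainy"),
    ("Humid", "Hot", "Slow", "Rainy"),
    ("Humid", "Cold", "Slow", "Rainy"),
]
query = ("Humid", "Cold", "Fast")  # the test observation

def posterior(rows, query):
    """Naive Bayes posteriors: prior times the product of the
    per-attribute conditionals, normalised by the evidence."""
    numer = {}
    for y in {r[-1] for r in rows}:
        y_rows = [r for r in rows if r[-1] == y]
        p = len(y_rows) / len(rows)              # prior P(y)
        for i, v in enumerate(query):
            p *= sum(1 for r in y_rows if r[i] == v) / len(y_rows)
        numer[y] = p
    evidence = sum(numer.values())
    return {y: p / evidence for y, p in numer.items()}

post = posterior(rows, query)
print({y: round(p, 3) for y, p in post.items()})
```

Counting the conditionals from the table gives posteriors of 9/13 for Sunny and 4/13 for Rainy, so Sunny is predicted, matching the hand calculation.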
Text Classification: Naive Bayes classifiers are frequently used in text classification and provide a high success rate compared to other algorithms.
Spam Filtering: Naive Bayes is widely used in spam filtering for identifying spam email.

Why is Naive Bayes so Efficient?

An interesting point about naive Bayes is that even when the independence assumption is violated and there are clear, known relationships between attributes, it works decently anyway. There are two reasons that make naive Bayes a very efficient algorithm for classification problems.

Performance: The naive Bayes algorithm gives useful performance despite correlated variables in the dataset, even though it has a basic assumption of independence among features. The reason for this is that in a given dataset, two attributes may depend on each other, but the dependence may be distributed evenly across each of the classes. In this case, the conditional independence assumption of naive Bayes is violated, yet it is still the optimal classifier. Further, what eventually affects the classification is the combination of dependencies among all attributes. If we just look at two attributes, there may exist strong dependence between them that affects the classification. When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification. Therefore, it is the distribution of dependencies among all attributes over the classes that affects the classification of naive Bayes, not merely the dependencies themselves.

Speed: The main cause of the fast training of naive Bayes is that it converges toward its asymptotic accuracy at a different rate than other methods, like logistic regression, support vector machines, and so on. Naive Bayes parameter estimates converge toward their asymptotic values in order of log(n) examples, where n is the number of dimensions.
In contrast, logistic regression parameter estimates converge more slowly, requiring on the order of n examples. It is also observed that on several datasets logistic regression outperforms naive Bayes when training examples are available in abundance, but naive Bayes outperforms logistic regression when training data is scarce.

Practical Application of Naive Bayes: Email Classifier — Spam or Ham?

Let's see a practical application of naive Bayes for classifying email as spam or ham. We will use sklearn.naive_bayes to train a spam classifier in Python*:

    import os
    import io
    import numpy
    from pandas import DataFrame
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

The following example uses the MultinomialNB estimator. Creating the readFiles function, which walks a directory of email files and yields the body of each message (everything after the first blank line, which ends the headers):

    def readFiles(path):
        for root, dirnames, filenames in os.walk(path):
            for filename in filenames:
                path = os.path.join(root, filename)
                inBody = False
                lines = []
                f = io.open(path, 'r', encoding='latin1')
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True  # headers end at the first blank line
                f.close()
                message = '\n'.join(lines)
                yield path, message

Creating a function to help us build a DataFrame, then loading the spam and ham directories:

    def dataFrameFromDirectory(path, classification):
        rows = []
        index = []
        for filename, message in readFiles(path):
            rows.append({'message': message, 'class': classification})
            index.append(filename)
        return DataFrame(rows, index=index)

    data = DataFrame({'message': [], 'class': []})
    data = data.append(dataFrameFromDirectory('//SPAMORHAM/emails/spam/', 'spam'))
    data = data.append(dataFrameFromDirectory('//SPAMORHAM/emails/ham/', 'ham'))

Inspecting the DataFrame with data.head() shows one row per email, indexed by file path, with the message text and its class ('spam' or 'ham'). Next we vectorize the messages into a sparse word-count matrix and train MultinomialNB():

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(data['message'].values)

    classifierModel = MultinomialNB()
    targets = data['class'].values   # 'class' is the target
    classifierModel.fit(counts, targets)

The classifierModel is ready. Now, let's prepare sample email messages to see how the model works. Email number 1 is "Free Viagra now!!!", email number 2 is "A quick brown fox is not ready", and so on:

    examples = ['Free Viagra now!!!',
                "A quick brown fox is not ready",
                "Could you bring me the black coffee as well?",
                "Hi Bob, how about a game of golf tomorrow, are you FREE?",
                "Dude , what are you saying",
                "I am FREE now, you can come",
                "FREE FREE FREE Sex, I am FREE",
                "CENTRAL BANK OF NIGERIA has 100 Million for you",
                "I am not available today, meet Sunday?"]
    example_counts = vectorizer.transform(examples)

Now we use the classifierModel to predict, and check the prediction for each email:

    predictions = classifierModel.predict(example_counts)
    predictions
    array(['spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham'], dtype='|S4')

Therefore, the first email is spam, the second is ham, and so on.

End Notes

We hope you have gained a clear understanding of the mathematical concepts and principles of naive Bayes from this guide. It is an extremely simple algorithm, with oversimplified assumptions that might not hold in many real-world scenarios. In this article we explained why naive Bayes often produces decent results despite this. We feel naive Bayes is a very good algorithm, and its performance, despite its simplicity, is astonishing.



good classifier - an overview | sciencedirect topics


We begin by taking a preliminary look at the dataset. Then we examine the effect of selecting different attributes for nearest-neighbor classification. Next we study class noise and its impact on predictive performance for the nearest-neighbor method. Following that we vary the training set size, both for nearest-neighbor classification and for decision tree learning. Finally, you are asked to interactively construct a decision tree for an image segmentation dataset.

The glass dataset glass.arff from the U.S. Forensic Science Service contains data on six types of glass. Glass is described by its refractive index and the chemical elements that it contains; the aim is to classify different types of glass based on these features. This dataset is taken from the UCI datasets, which have been collected by the University of California at Irvine and are freely available on the Web. They are often used as a benchmark for comparing data mining algorithms.

Exercise 17.2.1. How many attributes are there in the dataset? What are their names? What is the class attribute? Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the Generic Object Editor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1: This sets the number of neighboring instances to use when classifying.

Exercise 17.2.2. What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring instances to k = 5 by entering this value in the KNN field. Here and throughout this section, continue to use cross-validation as the evaluation method.
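The same experiment can be sketched outside Weka; the following scikit-learn analogue is illustrative only (the built-in wine data stands in for glass.arff, which ships with Weka rather than scikit-learn):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)  # stand-in for the glass data

# 10-fold cross-validated accuracy, as in the Explorer's Test options box
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k)  # analogue of IBk with KNN = k
    scores = cross_val_score(knn, X, y, cv=10)
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```

If you do have glass.arff at hand, scipy.io.arff.loadarff can read it into arrays of the same shape.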

Now we investigate which subset of attributes produces the best cross-validated classification accuracy for the IBk algorithm on the glass dataset. Weka contains automated attribute selection facilities, which are examined in a later section, but it is instructive to do this manually.

Performing an exhaustive search over all possible subsets of the attributes is infeasible (why?), so we apply the backward elimination procedure described in Section 7.1 (page 311). To do this, first consider dropping each attribute individually from the full dataset, and run a cross-validation for each reduced version. Once you have determined the best eight-attribute dataset, repeat the procedure with this reduced dataset to find the best seven-attribute dataset, and so on.
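The backward elimination loop can be sketched as follows; this is a scikit-learn approximation, not Weka, and it stops as soon as no single deletion improves the estimate, whereas the exercise asks you to keep going and record every row of Table 17.1:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)  # stand-in for the glass data
remaining = list(range(X.shape[1]))

def cv_acc(cols):
    """10-fold cross-validated accuracy of 1-NN on an attribute subset."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=1),
                           X[:, cols], y, cv=10).mean()

best = cv_acc(remaining)
while len(remaining) > 1:
    # try dropping each attribute individually and keep the best deletion
    trials = [(cv_acc([c for c in remaining if c != drop]), drop)
              for drop in remaining]
    acc, drop = max(trials)
    if acc < best:   # no deletion helps any more: stop
        break
    best, remaining = acc, [c for c in remaining if c != drop]
print(f"kept attributes {remaining} with accuracy {best:.3f}")
```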

Exercise 17.2.4. Record in Table 17.1 the best attribute set and the greatest accuracy obtained in each iteration. The best accuracy obtained in this process is quite a bit higher than the accuracy obtained on the full dataset.

Exercise 17.2.5. Is this best accuracy an unbiased estimate of accuracy on future data? Be sure to explain your answer. (Hint: To obtain an unbiased estimate of accuracy on future data, we must not look at the test data at all when producing the classification model for which the estimate is being obtained.)

Nearest-neighbor learning, like other techniques, is sensitive to noise in the training data. In this section we inject varying amounts of class noise into the data and observe the effect on classification performance.

You can flip a certain percentage of class labels in the data to a randomly chosen other value using an unsupervised attribute filter called AddNoise, in weka.filters.unsupervised.attribute. However, for this experiment it is important that the test data remains unaffected by class noise. Filtering the training data without filtering the test data is a common requirement, and is achieved using a metalearner called FilteredClassifier, in weka.classifiers.meta, as described near the end of Section 11.3 (page 444). This metalearner should be configured to use IBk as the classifier and AddNoise as the filter. FilteredClassifier applies the filter to the data before running the learning algorithm. This is done in two batches: first the training data and then the test data. The AddNoise filter only adds noise to the first batch of data it encounters, which means that the test data passes through unchanged.
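The same train-only filtering can be sketched with a small wrapper estimator; the class below is illustrative (it is not Weka's FilteredClassifier, but it reproduces the batch behavior: noise is injected into the training data only, and the test fold of each cross-validation run passes through untouched):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

class NoisyTrainClassifier(BaseEstimator, ClassifierMixin):
    """Flip a percentage of class labels in the TRAINING data only,
    mimicking FilteredClassifier + AddNoise: test data is untouched."""
    def __init__(self, base, percent=10, seed=0):
        self.base, self.percent, self.seed = base, percent, seed
    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        y = np.asarray(y).copy()
        classes = np.unique(y)
        flip = rng.random(len(y)) < self.percent / 100.0
        for i in np.flatnonzero(flip):
            y[i] = rng.choice(classes[classes != y[i]])  # a different label
        self.model_ = clone(self.base).fit(X, y)
        return self
    def predict(self, X):
        return self.model_.predict(X)

X, y = load_wine(return_X_y=True)  # stand-in for the glass data
for pct in (0, 10, 50):
    clf = NoisyTrainClassifier(KNeighborsClassifier(n_neighbors=1), percent=pct)
    print(pct, cross_val_score(clf, X, y, cv=10).mean())
```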

Exercise 17.2.6. Record in Table 17.2 the cross-validated accuracy estimate of IBk for 10 different percentages of class noise and neighborhood sizes k = 1, k = 3, k = 5 (determined by the value of k in the k-nearest-neighbor classifier).

This section examines learning curves, which show the effect of gradually increasing the amount of training data. Again, we use the glass data, but this time with both IBk and the C4.5 decision tree learners, implemented in Weka as J48.

To obtain learning curves, use FilteredClassifier again, this time in conjunction with weka.filters.unsupervised.instance.Resample, which extracts a certain specified percentage of a given dataset and returns the reduced dataset. Again, this is done only for the first batch to which the filter is applied, so the test data passes unmodified through the FilteredClassifier before it reaches the classifier.
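The same idea gives a learning-curve sketch; again the wrapper below is illustrative rather than Weka's Resample filter, and the wine data stands in for glass:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # stand-in for J48

class ResampledTrainClassifier(BaseEstimator, ClassifierMixin):
    """Train on a random percentage of the training batch only;
    the test data reaches the classifier unmodified."""
    def __init__(self, base, percent=100, seed=0):
        self.base, self.percent, self.seed = base, percent, seed
    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n = max(1, int(len(y) * self.percent / 100.0))
        idx = rng.choice(len(y), size=n, replace=False)
        self.model_ = clone(self.base).fit(X[idx], y[idx])
        return self
    def predict(self, X):
        return self.model_.predict(X)

X, y = load_wine(return_X_y=True)
for pct in (10, 50, 100):
    for name, base in [("IBk", KNeighborsClassifier(n_neighbors=1)),
                       ("J48", DecisionTreeClassifier(random_state=0))]:
        acc = cross_val_score(ResampledTrainClassifier(base, percent=pct),
                              X, y, cv=10).mean()
        print(f"{name} at {pct}%: {acc:.3f}")
```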

Follow the procedure described in Section 11.2 (page 424). Load the file segment-challenge.arff (in the data folder that comes with the Weka distribution). This dataset has 20 attributes and 7 classes. It is an image segmentation problem, and the task is to classify images into seven different groups based on properties of the pixels.

Set the classifier to UserClassifier, in the weka.classifiers.trees package. We use a separate test set (performing cross-validation with UserClassifier is incredibly tedious!), so in the Test options box choose the Supplied test set option and click the Set button. A small window appears in which you choose the test set. Click Open file and browse to the file segment-test.arff (also in the Weka distribution's data folder). On clicking Open, the small window updates to show the number of attributes (20) in the data. The number of instances is not displayed because test instances are read incrementally (so that the Explorer interface can process larger test files than can be accommodated in main memory).

Click Start. UserClassifier differs from all other classifiers: It opens a special window and waits for you to build your own classifier in it. The tabs at the top of the window switch between two views of the classifier. The Tree visualizer shows the current state of your tree, and the nodes give the number of class values there. The aim is to come up with a tree of which the leaf nodes are as pure as possible. To begin with, the tree has just one node, the root node, containing all the data. More nodes will appear when you proceed to split the data in the Data visualizer.

Click the Data visualizer tab to see a two-dimensional plot in which the data points are color-coded by class, with the same facilities as the Visualize panel discussed in Section 17.1. Try different combinations of x- and y-axes to get the clearest separation you can find between the colors. Having found a good separation, you then need to select a region in the plot: This will create a branch in the tree. Here's a hint to get you started: Plot region-centroid-row on the x-axis and intensity-mean on the y-axis (the display is shown in Figure 11.14(a)); you can see that the red class (sky) is nicely separated from the rest of the classes at the top of the plot.

There are four tools for selecting regions in the graph, chosen using the dropdown menu below the y-axis selector. Select Instance identifies a particular instance. Rectangle (shown in Figure 11.14(a)) allows you to drag out a rectangle on the graph. With Polygon and Polyline you build a free-form polygon or draw a free-form polyline (left-click to add a vertex and right-click to complete the operation).

When you have selected an area using any of these tools, it turns gray. (In Figure 11.14(a) the user has defined a rectangle.) Clicking the Clear button cancels the selection without affecting the classifier. When you are happy with the selection, click Submit. This creates two new nodes in the tree, one holding all the instances covered by the selection and the other holding all remaining instances. These nodes correspond to a binary split that performs the chosen geometric test.

Switch back to the Tree visualizer view to examine the change in the tree. Clicking on different nodes alters the subset of data that is shown in the Data visualizer section. Continue adding nodes until you obtain a good separation of the classes, that is, until the leaf nodes in the tree are mostly pure. Remember, however, that you should not overfit the data because your tree will be evaluated on a separate test set.

When you are satisfied with the tree, right-click any blank space in the Tree visualizer view and choose Accept The Tree. Weka evaluates the tree against the test set and outputs statistics that show how well you did.

Exercise 17.2.12. You are competing for the best accuracy score of a hand-built UserClassifier produced on the segment-challenge dataset and tested on the segment-test set. Try as many times as you like. When you have a good score (anything close to 90% correct or better), right-click the corresponding entry in the Result list, save the output using Save result buffer, and copy it into your answer for this exercise. Then run J48 on the data to see how well an automatic decision tree learner performs on the task.

The classical SLT approach has two features which are important to point out. The capacity term in the generalization bound always involves some quantity which measures the size of the function space F. However, this quantity usually does not depend directly on the complexity of the individual functions in F; rather, it counts how many functions there are in F. Moreover, the bounds do not contain any quantity which measures the complexity of an individual function f itself. In this sense, all functions in the function space F are treated the same: as long as they have the same training error, their bound on the generalization error is identical. No function is singled out in any special way.

This can be seen as an advantage or a disadvantage. If we believe that all functions in F are similarly well suited to fit a certain problem, then it would not be helpful to introduce any ordering between them. However, often this is not the case. We already have some prior knowledge which we accumulated in the past. This knowledge might tell us that some functions f ∈ F are much more likely to be a good classifier than others. The Bayesian approach is one way to try to incorporate such prior knowledge into statistical inference. The general idea is to introduce some prior distribution π on the function space F. This prior distribution expresses our belief about how likely a certain function is to be a good classifier. The larger the value π(f) is, the more confident we are that f might be a good function. The important point is that this prior will be chosen before we get access to the data. It should be selected only based on background information or prior experience.
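The simplest PAC-Bayesian bound discussed next can be sketched, under standard notation (true risk R(f), empirical risk R_n(f) on n training points, prior π, confidence 1 − δ), as holding with probability at least 1 − δ simultaneously for all f in F:

```latex
R(f) \;\le\; R_n(f) \;+\; \sqrt{\frac{\ln\frac{1}{\pi(f)} + \ln\frac{1}{\delta}}{2n}}
```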

where π(f) denotes the value of the prior on f. This is the simplest PAC-Bayesian bound. The name comes from the fact that it combines the classical bounds (sometimes called PAC bounds, where PAC stands for probably approximately correct) and the Bayesian framework. Note that the right-hand side does not involve a capacity term for F, but instead punishes individual functions f according to their prior likelihood π(f). The more unlikely we believed f to be, the smaller π(f), and the larger the bound. This mechanism shows that among two functions with the same empirical risk on the training data, one prefers the one with the higher prior value π(f). For background reading on PAC-Bayesian bounds, see for example Section 6 of [Boucheron et al., 2005] and references therein.

The goal in this application was to build a predictive model that would detect frustration in learners. The evaluation data consisted of sensory observations recorded from 24 middle school students aged 12 to 13. The raw data from the camera, the posture sensor, the skin conductance sensor, and the pressure mouse was first analyzed to extract 14 features that included various statistics of different signals (see [9] for data acquisition details). Out of the 24 children, 10 got frustrated and the rest persevered on the task; thus, the evaluation data set consisted of 24 samples with 10 belonging to one class and 14 to another. The challenge here was to combine the 14 multimodal features to build a good classifier.

To this end, the work presented in [9] used a sensor fusion scheme based on Gaussian process classification. Specifically, the product of kernels shown in Equation (15.2) was used, where each individual kernel Ki was constructed as an RBF kernel with its own width hyperparameter. The sensor fusion procedure thus found optimal hyperparameters using evidence maximization, scaling all modalities appropriately. Table 15.1 compares alternative methods and highlights the advantage of sensor fusion with GP when using a leave-one-out strategy to evaluate classification schemes.
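As an illustration only (this is not the implementation of [9]), a product of per-feature RBF kernels collapses into a single anisotropic RBF, which is what makes the inverse widths interpretable as feature relevances:

```python
import numpy as np

def product_rbf(x, z, thetas):
    """Product of per-feature RBF kernels:
    K(x, z) = prod_i exp(-theta_i * (x_i - z_i)^2).
    A large theta_i makes differences in feature i matter more, so a
    small inverse parameter 1/theta_i signals a discriminative feature."""
    x, z, thetas = map(np.asarray, (x, z, thetas))
    return float(np.exp(-np.sum(thetas * (x - z) ** 2)))

# identical inputs give kernel value 1; distant inputs decay toward 0
print(product_rbf([1.0, 2.0], [1.0, 2.0], [0.5, 0.5]))  # → 1.0
print(product_rbf([0.0, 0.0], [3.0, 0.0], [0.5, 0.5]))
```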

Note that the values of the optimized hyperparameters [θ1, …, θk] can be used to determine the most discriminative features. Figure 15.2 shows the MATLAB boxplot of the inverse kernel parameters (i.e., 1/θi) corresponding to the different features, obtained during the 24 leave-one-out runs of the algorithm. A low value of the inverse kernel parameter corresponds to high discriminative capability of the feature. From the boxplot we can see that fidgets, velocity of the head, and ratio of postures were the three most discriminative features. However, as mentioned in [9], there were many outliers for the fidgets, which implies that they can be unreliable, possibly because of sensor failure and individual differences.

Figure 15.2. Finding the most discriminative features: the MATLAB boxplot of the inverse kernel hyperparameters, 1/θi, optimized during the 24 leave-one-out runs. The lines in the middle of the boxes represent the median, the bounding boxes represent quartile values, the whiskers show the extent of the values, and the pluses (+) represent statistical outliers. A low value of 1/θi corresponds to high discriminative power of the feature.

Very recently researchers have started looking at written language usage as a biometric trait (Fridman et al., 2013; Pokhriyal et al., 2014; Stolerman et al., 2014). Some of the cognitive modalities reported in the literature involve the use of biological signals captured through electrocardiograms, electroencephalograms, and electrodermal responses to provide possible individual-authentication modalities (Faria et al., 2011). However, these are invasive and require users to have electrodes placed on specific body parts. It is an exciting prospect to investigate the use of language by people as a cognitive biometric trait, based on the previously reported psycholinguistic study (Pennebaker et al., 2003). Our biometric analyses are performed on a very large corpus of real user data, with several thousand authors and writings. In general, such large-scale studies are not typical in biometrics, although they are essential in order to transition biometric systems from the lab to real life. Because we are evaluating language as a biometric modality, a large-scale study such as this is important for complete results, since conclusive results can only be obtained when large amounts of data are studied. This study also brings big data into biometrics, as our dataset is characterized by high volume and high noise (veracity).

We conclude that language can indeed be used as a biometric modality, as it does hold some biometric fingerprint of the author. We report reasonable performance (72% AUC), even when the data consisted of unstructured blogs collected from across the Internet. Our study indicates that blogs provide a diverse and convenient way to study authorship on the Internet. We found that better results are obtained with cleaner, high-quality texts. We found that if the number of authors is known, then even a few texts per author suffice to build a good classifier. However, the accuracy of the classifier is independent of the number of authors in the study. We also performed stricter testing, where our classifier had to correctly classify an unseen author. It classified genuine authors correctly 78% of the time, and impostors 76% of the time. Obviously these conclusions are data dependent, but they provide an encouraging lead.

Regarding the issue of permanence, as long as the author maintains a specific writing style, this methodology will work. As our features are canonical in nature, they should be resistant to moderate changes in writing style and are expected to capture the variability in the nature of blogs. More work needs to be done to better understand permanence and spoofability. For the dataset used in this study, we verified that multiple persons have not authored under the same author name. It is difficult to detect, in the blogs dataset, when a single person has written under multiple author names (i.e., created profiles with different names). In that sense, our results are for the worst-case scenario.

The problem of author attribution can also be formulated as a multiclass classification problem, such that during testing, each blog has to be attributed to one of the known classes (authors). However, in Internet-scale author attribution like ours, the number of authors is very large (on the order of 50K in this case), and the text written by most of the authors is usually small (4 blogs on average). Thus we get a large number of classes, with a very limited number of data points for each class. A simplistic solution can be devised in which each author is characterized by a signature, obtained by combining the blogs written by that author. A new blog is then compared to all the available signatures and assigned to the author with the most similar signature. Given that fusion of the data instances of an enrollee into a signature is an open area in biometrics, this is definitely an area of future work for our language biometrics paradigm.

An interesting extension to this research would be to work more closely with the psycholinguistic community to investigate additional language-based features that more effectively capture the cognitive fingerprint of a person. With a large set of features to work with, we could employ feature selection algorithms to reduce the feature space and increase the area under the ROC curve. So far, we have only performed our evaluative study on Tier 1 and compared this with all the other tiers combined. However, there needs to be a more detailed study of the other tiers individually to see how the statistics regarding authorship attribution vary with the tiers. Also, is the reach of an author's blog (how influential that author is) a measure of the personality of the author?

We take only the author name associated with a blog as the identity of the author; perhaps another statistic, such as an email id, could be used instead, so that we do not run into problems when different authors give themselves the same pen name. According to our algorithm, all blogs written by such authors would be treated as written by one author, when in essence they are different.

To get all the blogs written by a particular author, we have to scan through the entire dataset once, which creates a considerable bottleneck, as the entire dataset needs to be held in memory, and also limits the amount of data that we can process. An alternative would be to assume that each blog is written by a different author, so that there is no need to hold the entire dataset at once, and it can be processed sequentially.

ROC curves and their relatives are very useful for exploring the tradeoffs among different classifiers over a range of scenarios. However, they are not ideal for evaluating machine learning models in situations with known error costs. For example, it is not easy to read off the expected cost of a classifier for a fixed cost matrix and class distribution. Neither can you easily determine the ranges of applicability of different classifiers. For example, from the crossover point between the two ROC curves in Fig. 5.4 it is hard to tell for what cost and class distributions classifier A outperforms classifier B.

Cost curves are a different kind of display on which a single classifier corresponds to a straight line that shows how the performance varies as the class distribution changes. Again, they work best in the two-class case, although you can always make a multiclass problem into a two-class one by singling out one class and evaluating it against the remaining ones.

Fig. 5.5A plots the expected error against the probability of one of the classes. You could imagine adjusting this probability by resampling the test set in a nonuniform way. We denote the two classes by + and −. The diagonals show the performance of two extreme classifiers: one always predicts +, giving an expected error of one if the dataset contains no + instances and zero if all its instances are +; the other always predicts −, giving the opposite performance. The dashed horizontal line shows the performance of the classifier that is always wrong, and the x-axis itself represents the classifier that is always correct. In practice, of course, neither of these is realizable. Good classifiers have low error rates, so where you want to be is as close to the bottom of the diagram as possible.

The line marked A represents the error rate of a particular classifier. If you calculate its performance on a certain test set, its FP rate fp is its expected error on a subsample of the test set that contains only negative examples (P(+)=0), and its FN rate fn is the error on a subsample that contains only positive examples (P(+)=1). These are the values of the intercepts at the left and right, respectively. You can see immediately from the plot that if P(+) is smaller than about 0.2, predictor A is outperformed by the extreme classifier that always predicts −, while if it is larger than about 0.65, the other extreme classifier is better.

So far we have not taken costs into account, or rather we have used the default cost matrix in which all errors cost the same. Cost curves, which do take cost into account, look very similar (very similar indeed) but the axes are different. Fig. 5.5B shows a cost curve for the same classifier A (note that the vertical scale has been enlarged, for convenience, and ignore the gray lines for now). It plots the expected cost of using A against the probability cost function, which is a distorted version of P(+) that retains the same extremes: zero when P(+)=0 and one when P(+)=1. Denote by C[+|−] the cost of predicting + when the instance is actually −, and the reverse by C[−|+]. Then the axes of Fig. 5.5B are
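A reconstruction of those axes from the surrounding definitions (the fp and fn rates, and the costs C[+|−] and C[−|+]) reads, as a sketch:

```latex
% x-axis: probability cost function, a distorted version of P(+)
p_C[+] \;=\; \frac{P(+)\,C[-|+]}{P(+)\,C[-|+] \;+\; P(-)\,C[+|-]}
% y-axis: normalized expected cost of the classifier
\text{normalized expected cost} \;=\; fn \times p_C[+] \;+\; fp \times \bigl(1 - p_C[+]\bigr)
```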

The maximum value that the normalized expected cost can have is 1, which is why it is called normalized. One nice thing about cost curves is that the extreme cost values at the left and right sides of the graph are fp and fn, just as they are for the error curve, so you can draw the cost curve for any classifier very easily.

Fig. 5.5B also shows classifier B, whose expected cost remains the same across the range, i.e., its FP and FN rates are equal. As you can see, it outperforms classifier A if the probability cost function exceeds about 0.45, and knowing the costs we could easily work out what this corresponds to in terms of class distribution. In situations that involve different class distributions, cost curves make it easy to tell when one classifier will outperform another.
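Working out such a crossover is simple arithmetic; a minimal sketch, with illustrative fp/fn values chosen to mimic classifiers A and B:

```python
def normalized_expected_cost(fp, fn, pc):
    """Cost-curve line for one classifier at probability cost value pc."""
    return fn * pc + fp * (1 - pc)

# Illustrative rates: A has low fp but high fn, B has equal fp and fn.
fp_a, fn_a = 0.1, 0.4
fp_b, fn_b = 0.25, 0.25

# The two straight lines cross where their costs are equal:
# fn_a*pc + fp_a*(1-pc) = fn_b*pc + fp_b*(1-pc)
pc_cross = (fp_b - fp_a) / ((fn_a - fp_a) - (fn_b - fp_b))
print(f"A is cheaper below pc = {pc_cross:.2f}, B above it")
```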

In what circumstances might this be useful? To return to our example of predicting when cows will be in estrus, their 30-day cycle, or 1/30 prior probability, is unlikely to vary greatly (barring a genetic cataclysm!). But a particular herd may have different proportions of cows that are likely to reach estrus in any given week, perhaps synchronized with (who knows?) the phase of the moon. Then, different classifiers would be appropriate at different times. In the oil spill example, different batches of data may have different spill probabilities. In these situations cost curves can help to show which classifier to use when.

Each point on a lift chart, ROC curve, or recall-precision curve represents a classifier, typically obtained by using different threshold values for a method such as Naïve Bayes. Cost curves represent each classifier by a straight line, and a suite of classifiers will sweep out a curved envelope whose lower limit shows how well that type of classifier can do if the parameter is well chosen. Fig. 5.5B indicates this with a few gray lines. If the process were continued, it would sweep out the dotted parabolic curve.

The operating region of classifier B ranges from a probability cost value of about 0.25 to a value of about 0.75. Outside this region, classifier B is outperformed by the trivial classifiers represented by dashed lines. Suppose we decide to use classifier B within this range and the appropriate trivial classifier below and above it. All points on the parabola are certainly better than this scheme. But how much better? It is hard to answer such questions from an ROC curve, but the cost curve makes them easy. The performance difference is negligible if the probability cost value is around 0.5, and below a value of about 0.2 and above 0.8 it is barely perceptible. The greatest difference occurs at probability cost values of 0.25 and 0.75 and is about 0.04, or 4% of the maximum possible cost figure.

Ensembles of classifiers have been used to reduce uncertainties in the classification model and improve generalization performance [78]. It has been demonstrated that a good ensemble is one in which the individual classifiers are both accurate and make errors on different parts of the input space [73,78]. In other words, an ideal ensemble consists of good classifiers (not necessarily excellent) that disagree as much as possible on difficult cases. Diversity and accuracy are two important objective criteria and two key issues that should be considered when constructing ensembles [85,86]. For example, after creating classifiers based on the amount of error made for each class, Ahmadian et al. [79] took size, accuracy, and two other diversity measures into account in their use of an NSGA-II-based algorithm to choose the best ensembles. Ishibuchi and Nojima [75] also examined the performance of three multi-objective ensemble classifiers and concentrated on generating an ensemble of classifiers with high diversity. To avoid choosing from overfitting solutions, Oliveira et al. [78] jointly used diversity with the accuracy of the ensemble as a selection criterion. Previous studies have shown that an ensemble is often more accurate than any of the single classifiers in the ensemble [87,88]. Although there are several studies on accuracy and diversity, multi-objectivity in ensembles is still an important area of research that needs to be extensively explored.

Many real-world applications require estimating and monitoring the distribution of a population across different classes. An example of such applications is the crucial task of determining the percentage (or prevalence) of unemployed people across different geographical regions, genders, age ranges or even temporal observations. In the literature, this task has been called quantification [22,25,46,79].

Quantification is closely related to classification, but the goals differ: in classification the focus is on correctly guessing the real class label of every single individual, while in quantification the aim is to estimate class prevalence. The two tasks are distinct because, while a perfect classifier is also a perfect quantifier, a good classifier is not necessarily a good quantifier. Indeed, a classifier that generates a similar number of misclassified items over the different classes on the test set is a good quantifier, because the compensation of the misclassifications leads towards a perfect estimation of the class distribution.
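The compensation effect can be seen in a toy "classify and count" quantifier; the numbers below are illustrative:

```python
from collections import Counter

# True test labels and (imperfect) predictions: the classifier makes
# 3 mistakes in each direction, so the class counts compensate exactly.
truth = ["+"] * 10 + ["-"] * 10
preds = (["+"] * 7 + ["-"] * 3) + (["+"] * 3 + ["-"] * 7)

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
prevalence_true = Counter(truth)
prevalence_pred = Counter(preds)

print(accuracy)          # 0.7: a mediocre classifier...
print(prevalence_pred)   # ...but a perfect quantifier: same counts as truth
```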

Most of the works address the quantification problem for data presented in conventional attribute format. Given the ever-growing availability of web and social media, however, we have a flourishing of networking data, which represents a valuable new source of information. In this scenario an interesting question arises: how can quantification be performed in contexts where the observed entities are related to each other?

The potential impact of quantification techniques for networking data is high: today we are witnessing an ever more effective spread of social networks and social media, where people express their interests and disseminate information about their opinions, habits, and wishes. The possibility to analyze and quantify the percentage of individuals with specific characteristics or a particular behavior could help the analysis of many social aspects. For example, analyzing social platforms like Facebook or Google+, as we did in [47], where, as we have already discussed, users can choose to specify their education level, we could estimate the level of education of an entire population even in the presence of missing, or evolving, data. Following the same rationale, using a quantification approach, we could determine the distribution of the political orientation or the geographical origin of the social network population. In [47] we compare quantification approaches based on Demon [16], Infohiermap [71] and ego-networks on three different datasets: (i) Google+, where the target variable is the education level; (ii) CoRa, a reference-based graph built upon a computer science bibliographic library, where class labels represent topics; and (iii) IMDB, a movie-to-movie network where each node's label captures whether the opening weekend box-office sales exceeded $2 million. In these scenarios, we leveraged the homophilic behavior of social networks to assign labels to the unlabeled nodes belonging to each of the clusters generated by the selected algorithms. We applied two strategies: (i) density based, where the class label selected is the highest-frequency one of the denser community to which the unlabeled node belongs, and (ii) frequency based, where each unlabeled node is assigned the class having the greatest overall relative frequency across all the communities the node belongs to.
In the case of ego-network partitioning, the latter strategy leads to the selection of the highest-frequency class among the direct neighbors of the unlabeled ego node, as shown in Fig. 3. Experimental results highlight that the latter approach consistently outperforms the former in quantification quality (measured via KLD, the Kullback-Leibler divergence [25]). Moreover, the smaller the topologies used to assign class labels to unlabeled nodes, the more likely it is that the predicted class distribution will closely approximate the real one, as shown in Table 1 for Google+. In this particular scenario, ego-networks represent the best choice, since they enable a tighter bound on homophilic phenomena, confirming what was already observed in Section 3.3.1.
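KLD, the score used in these experiments, can be sketched for two class-prevalence vectors as:

```python
import math

def kld(true_dist, pred_dist, eps=1e-12):
    """Kullback-Leibler divergence between true and predicted class
    prevalences; 0 means the quantification is perfect. The small eps
    guards against zero prevalences."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(true_dist, pred_dist))

print(kld([0.6, 0.4], [0.6, 0.4]))   # 0.0 for a perfect estimate
print(kld([0.6, 0.4], [0.5, 0.5]))   # grows as the estimate drifts
```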

Table 1. Google+ Edu: mean of KLD scores for the frequency-based approaches. Tests were performed neglecting class labels for 20% and 30% of the graph nodes, sampled using a random strategy (RS), bottom-degree (BS), and top-degree (TS).
