what is a good classifier? (1/4) | skilja

Auto-classification can assign categories, and hence meaning, to documents with unprecedented speed and quality. The technology has developed over the last 15 years from the first tentative rule-based systems to today's elaborate statistical and semantic learn-by-example algorithms. Auto-classification is becoming an accepted, standard approach, provided either as a built-in function of business software or as a toolkit, much as OCR is used today.

But how do we know what makes a good classifier, and when optimal performance has been reached in a classification project? As in OCR, there can be huge differences between simple textbook open-source approaches and elaborate, sophisticated classifiers that incorporate the lessons learned over the last decades.

This series of four articles focuses on measuring classification quality and shows examples by graphically comparing standard classifiers with some of the most advanced technologies available today. This kind of evaluation lets readers understand the methodology of measuring classification quality and, at the same time, demonstrates clearly why a good classifier needs more than a simple standard algorithm.

The most important numbers by which any classification can be measured are precision and recall. Precision is the percentage of correctly categorized items relative to all items categorized; its complement is the error rate (the share of false positives). Recall is the percentage of items correctly classified into a class relative to the total number of items in the reference set for that class, and hence the correct rate. A threshold can be used to suppress errors, which creates a third set: the rejects. For a more detailed explanation, see an older post here: http://www.skilja.de/2012/measuring-classification-quality/
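To make these definitions concrete, here is a minimal Python sketch. It assumes a single-label setup in which the classifier returns one predicted label and a confidence per item, and items whose confidence falls below the threshold are rejected. The `Prediction` type, the `evaluate` function, and the sample data are hypothetical and serve only as illustration; they are not taken from any particular classifier.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    true_label: str        # label from the reference set
    predicted_label: str   # label assigned by the classifier
    confidence: float      # classifier confidence, assumed to lie in [0, 1]


def evaluate(predictions, threshold):
    """Split items into Correct, Errors and Rejects for a given threshold."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    correct = sum(1 for p in accepted if p.predicted_label == p.true_label)
    errors = len(accepted) - correct
    rejects = len(predictions) - len(accepted)
    # Precision: correct items relative to all items that were categorized.
    # Convention (an assumption here): precision is 1.0 when nothing is accepted.
    precision = correct / len(accepted) if accepted else 1.0
    # Recall: correct items relative to all items in the reference set.
    recall = correct / len(predictions) if predictions else 0.0
    return {
        "correct": correct,
        "errors": errors,
        "rejects": rejects,
        "precision": precision,
        "recall": recall,
    }


# Made-up sample data, for illustration only.
sample = [
    Prediction("invoice", "invoice", 0.95),
    Prediction("invoice", "invoice", 0.80),
    Prediction("contract", "invoice", 0.60),
    Prediction("contract", "contract", 0.55),
    Prediction("letter", "contract", 0.30),
]

for t in (0.0, 0.5, 0.7):
    print(t, evaluate(sample, t))
```

On this toy data, raising the threshold from 0.0 to 0.5 rejects one wrong item, so precision rises while recall stays the same. Raising it further to 0.7 removes the remaining error but also rejects a correct item, so recall drops.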

In the precision-recall graph, the precision and recall percentages are plotted against the threshold. This shows how both values evolve and allows a project designer to find the threshold that meets the target error rate. A good classifier reduces the number of errors smoothly as the threshold is raised, which produces a rising upper curve (precision). In the same way, the number of correct items diminishes, and the suppressed items form the reject set. This is shown in the schematic graph below with the three sets of items: Errors, Correct and Rejects.
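The threshold search described here can be sketched in a few lines of Python. The sketch assumes each item has already been reduced to a pair of (confidence, whether the prediction was correct), and it takes the error rate to be the share of wrong items among the accepted ones. The function names `sweep` and `pick_threshold` and the sample scores are hypothetical.

```python
def sweep(scored, thresholds):
    """Compute (threshold, error_rate, recall) for each threshold.

    scored: list of (confidence, is_correct) pairs, one per item.
    """
    total = len(scored)
    rows = []
    for t in thresholds:
        accepted = [ok for conf, ok in scored if conf >= t]
        correct = sum(accepted)  # True counts as 1
        errors = len(accepted) - correct
        # Assumption: error rate is measured on accepted items only.
        error_rate = errors / len(accepted) if accepted else 0.0
        recall = correct / total
        rows.append((t, error_rate, recall))
    return rows


def pick_threshold(rows, target_error_rate):
    """Return the first row (lowest threshold) that meets the target, or None."""
    for row in rows:  # rows are ordered by ascending threshold
        if row[1] <= target_error_rate:
            return row
    return None


# Made-up scores, for illustration only.
scored = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False), (0.60, True),
    (0.50, True), (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]
thresholds = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9

rows = sweep(scored, thresholds)
for t, error_rate, recall in rows:
    print(f"threshold={t:.1f}  error_rate={error_rate:.3f}  recall={recall:.3f}")

print(pick_threshold(rows, 0.20))
print(pick_threshold(rows, 0.05))
```

Picking the lowest threshold that meets the target keeps recall as high as possible. Plotting the `rows` values over the threshold gives the kind of curve discussed in this article. In this toy data the error rate does not fall monotonically, which is exactly what makes a threshold hard to choose.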

For a real-life example we used the well-known Reuters-21578 Apte test set. This set was assembled many years ago (available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html). It includes 12,902 documents in 90 classes, with a fixed split between test and training data (3,299 vs. 9,603).

The differences are obvious. Apart from the significant differences in the absolute values of recall and precision, the standard classifier shows very undesirable behavior. The threshold does not affect the error rate for a long time and then suddenly reduces the read rate (recall) drastically. A project designer will find it very difficult to choose the threshold that achieves the target error rate, because the function is very unsteady.

The good classifier behaves much better. The error rate decreases steadily with increasing threshold, with minimal effect on the read rate. Both values run almost in parallel, and it is easy to find the correct settings. The absolute rates are, of course, also much higher.

There are good technical reasons in the algorithms that explain these differences, but they are not the topic of this blog. It is more important to understand that there are significant differences and that they become visible in the graphical evaluation. In an upcoming article we will show an even better visualization of the true differences. Stay tuned!
