{"id":1079,"date":"2020-08-28T23:30:43","date_gmt":"2020-08-29T03:30:43","guid":{"rendered":"http:\/\/aristotle2digital.blogwyrm.com\/?p=1079"},"modified":"2021-11-19T23:16:03","modified_gmt":"2021-11-20T04:16:03","slug":"machine-classification-part-2-a-naive-bayes-classification-algorithm","status":"publish","type":"post","link":"https:\/\/aristotle2digital.blogwyrm.com\/?p=1079","title":{"rendered":"Machine Classification: Part 2 \u2013 A Na\u00efve Bayes Classification Algorithm"},"content":{"rendered":"\n<p>As discussed in the last column, classification can be a\ntricky thing.&nbsp; Much of the machine\nlearning buzz centers on classification problems.&nbsp; Typical examples include things like optical\ncharacter recognition (classify a set of pixels in a image as a particular\ncharacter), computer vision and image processing (classify a region on the\nground as flooded or not), and so on.&nbsp; <\/p>\n\n\n\n<p>This column focuses on one of the more common classification\nalgorithms: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Naive_Bayes_classifier\">na\u00efve\nBayes classifier<\/a> (NBC).&nbsp; Paraphrasing\nthe Wikipedia article, the NBC is a simple technique that produces a model that\nacts as an agent, which by looking at some collection of features associated\nwith an object, can place that object within the appropriate \u2018bucket\u2019.&nbsp; <\/p>\n\n\n\n<p>To create a concrete example, we\u2019ll use the scheme used by James\nMcCaffrey in his June 2019 Test Run column entitled <em><a href=\"https:\/\/docs.microsoft.com\/en-us\/archive\/msdn-magazine\/2019\/june\/test-run-simplified-naive-bayes-classification-using-csharp\">Simplified\nNaive Bayes Classification Using C#<\/a><\/em>.&nbsp;\nOne can imagine that we are pawn brokers in McCaffrey\u2019s universe.&nbsp; People frequently come in hawking jewelry and,\ngiven that we run a pawn shop, we should expect that some of our clientele are\nless than trustworthy.&nbsp; We want build a\nmodel that 
allows us to classify a gemstone as being real or fake based on its\ncolor, size, and style of cut.<\/p>\n\n\n\n<p>These three attributes of the gemstone will be the factors used\nto make the prediction and they are typically arranged in a list or array that\nis euphemistically called a vector (it is only euphemistically so as these lists\ndon\u2019t obey the accepted definition of a vector space).&nbsp; The gemstone vector will have 3 dimensions\nfor color, size, and style.&nbsp; Each\nattribute has various realizations as shown in this figure:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"857\" height=\"748\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2020\/08\/Gemstones.png\" alt=\"\" class=\"wp-image-1078\" srcset=\"https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2020\/08\/Gemstones.png 857w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2020\/08\/Gemstones-300x262.png 300w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2020\/08\/Gemstones-768x670.png 768w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2020\/08\/Gemstones-810x707.png 810w\" sizes=\"auto, (max-width: 857px) 100vw, 857px\" \/><\/figure>\n\n\n\n<p>To develop our model, we first have to pool together what we\nknow based on the gemstones we\u2019ve seen.&nbsp;\nFor example, if a kind-hearted woman who had fallen on hard times came\nin with a small, twisted, aqua-colored stone that we verified was authentic, then\nwe would enter into our database the entry:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Aqua,Small,Twisted,1<\/pre>\n\n\n\n<p>where the \u20181\u2019 means authentic or good.&nbsp; If some shady character, acting all tough, came\nin with a small, blue, pointed stone that we reluctantly took and found out\nlater was fake, we would amend our database to read:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Aqua,Small,Twisted,1<br>Blue,Small,Pointed,0
<\/pre>\n\n\n\n<p>where the \u20180\u2019 means fake or bad.&nbsp; Proceeding in this fashion, we produce a\ntraining set for our agent to gain experience with as it develops its own\ninternal model.&nbsp; For this initial prototype,\nI used the 40-element training set provided by McCaffrey (of which the first\ntwo points are as shown above).&nbsp; <\/p>\n\n\n\n<p>This kind of training is called supervised since we explicitly\nlabel each feature vector with the category to which it belongs.&nbsp; It is worth noting that there isn\u2019t a single Bayesian\nclassifier but rather a family of related algorithms.&nbsp; The basic concepts are the same but the\nparticular way in which the training set is characterized leads to better or\nworse performance based on context.&nbsp; In\nparticular, all NBCs assume that the value of a given attribute is conditionally independent of the value\nof any other attribute, given the class.<\/p>\n\n\n\n<p>Anyway, returning to McCaffrey\u2019s NBC, the structure of his\nalgorithm is most easily summarized in the following steps (the names of my\nPython routines that implement these steps are shown in parentheses):<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>The training data is digested, the dimension of the\nfeature vector is deduced, and the distinct values of each attribute are uniquely cataloged (find_distinct_values)<\/li><li>The marginals of the distributions are\ndetermined (calculate_Laplace_smoothed_marginals), with an added nuance to\nhandle the case where a feature combination is not present<\/li><li>Additional statistics are computed to facilitate\nthe classification scheme (characterize_data_set)<\/li><li>Finally, the model classifies the\nfeature vector of a new gemstone instance (calculate_evidence)<\/li><\/ol>\n\n\n\n<p>The primary data structure is the Python dictionary, which\nbuilds up around each attribute value discovered in the training set.&nbsp; Obviously, this limits the NBC to classifying\non known attribute values.&nbsp; In other words, if\na ruby-colored gemstone came on the scene\n
the agent\/model wouldn\u2019t know how to classify\nit.&nbsp; This situation would be the same for us manning the pawn shop when a person\nwhom we don\u2019t know whether to trust comes in with such a stone.<\/p>\n\n\n\n<p>The code for each function is listed here:<\/p>\n\n\n\n<div class=\"myQuoteDiv\">\n<pre>import numpy as np\n\ndef find_distinct_values(df,attributes_lst):\n    #catalog the unique values of each attribute seen in the training set\n    distinct_values = {}\n    for attribute in attributes_lst:\n        distinct_values[attribute] = set(df[attribute])\n\n    return distinct_values\n<\/pre>\n<\/div>\n\n\n\n<div class=\"myQuoteDiv\">\n<pre>def calculate_Laplace_smoothed_marginals(df,distinct_values):\n    #initialize the marginals\n    marginals = {}\n    for attribute_type in distinct_values:\n        for attribute in distinct_values[attribute_type]:\n            #initializing to [1,1] implements Laplace smoothing\n            marginals[attribute] = np.array([1,1])\n\n    #tally the [fake, authentic] counts for each attribute value\n    for attribute_type in distinct_values:\n        for attribute, authenticity in zip(df[attribute_type],df['authenticity']):\n            marginals[attribute][authenticity] += 1\n\n    return marginals\n<\/pre>\n<\/div>\n\n\n\n<div class=\"myQuoteDiv\">\n<pre>def characterize_data_set(df):\n    #count the fake and authentic samples in the training set\n    fake_label           = 0\n    true_label           = 1\n    summary              = {}\n    authenticity_data    = df['authenticity']\n    fake_counts          = len(np.where(authenticity_data==fake_label)[0])\n    true_counts          = len(np.where(authenticity_data==true_label)[0])\n\n    summary['num samples'] = fake_counts + true_counts\n    summary['num fake']    = fake_counts\n    summary['num true']    = true_counts\n\n    return summary\n<\/pre>\n<\/div>\n\n\n\n<div class=\"myQuoteDiv\">\n<pre>def calculate_evidence(distinct_values,smoothed_marginals,summary,sample_values):\n    fake_label           = 0\n    true_label           = 1\n    num_attributes       = len(distinct_values)\n\n    #class priors estimated from the training-set frequencies\n    prob_fake            = summary['num fake']\/summary['num samples']\n    prob_true            = summary['num true']\/summary['num samples']\n    #denominators adjusted to match the Laplace-smoothed counts\n    smoothed_num_fake    = summary['num fake'] + num_attributes\n    smoothed_num_true    = summary['num true'] + num_attributes\n\n    sample_evidence_fake = 1\n    for attribute in sample_values:\n        sample_evidence_fake *= smoothed_marginals[attribute][fake_label]\/smoothed_num_fake\n    sample_evidence_fake *= prob_fake\n\n    sample_evidence_true = 1\n    for attribute in sample_values:\n        sample_evidence_true *= smoothed_marginals[attribute][true_label]\/smoothed_num_true\n    sample_evidence_true *= prob_true\n\n    #normalize so the two evidence values behave as probabilities\n    normalization = sample_evidence_fake + sample_evidence_true\n\n    return sample_evidence_fake\/normalization, sample_evidence_true\/normalization\n<\/pre>\n<\/div>\n\n\n\n<p>Happily, the code reproduces the results of McCaffrey\u2019s original article, but preliminary tests with more varied training sets have been disappointing.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As discussed in the last column, classification can be a tricky thing.&nbsp; Much of the machine learning buzz centers on classification problems.&nbsp; Typical examples include things like optical character recognition&#8230; <a class=\"read-more-button\" href=\"https:\/\/aristotle2digital.blogwyrm.com\/?p=1079\">Read more\n
&gt;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1079","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts\/1079","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1079"}],"version-history":[{"count":0,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts\/1079\/revisions"}],"wp:attachment":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1079"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1079"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1079"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
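To see the four steps of the article exercised end-to-end, here is a minimal, self-contained sketch of the same pipeline in pure Python. It is a simplification, not the author's exact code: plain lists and dicts stand in for the pandas DataFrame and NumPy arrays, and the four-row training set is entirely made up for illustration (only its first two rows match the database entries quoted in the article).

```python
# Pure-Python sketch of the four-step naive Bayes pipeline described above.
# The training data here is a hypothetical stand-in, not McCaffrey's 40-element set.

def find_distinct_values(df, attributes_lst):
    # step 1: catalog the unique values of each attribute
    return {attr: set(df[attr]) for attr in attributes_lst}

def calculate_Laplace_smoothed_marginals(df, distinct_values):
    # step 2: per attribute value, count [fake, authentic] occurrences;
    # initializing the counts to [1, 1] implements Laplace smoothing
    marginals = {}
    for attribute_type in distinct_values:
        for attribute in distinct_values[attribute_type]:
            marginals[attribute] = [1, 1]
    for attribute_type in distinct_values:
        for attribute, authenticity in zip(df[attribute_type], df['authenticity']):
            marginals[attribute][authenticity] += 1
    return marginals

def characterize_data_set(df):
    # step 3: summary statistics of the class labels
    true_counts = sum(df['authenticity'])
    fake_counts = len(df['authenticity']) - true_counts
    return {'num samples': fake_counts + true_counts,
            'num fake': fake_counts,
            'num true': true_counts}

def calculate_evidence(distinct_values, smoothed_marginals, summary, sample_values):
    # step 4: normalized evidence that a new sample is fake (0) or authentic (1)
    num_attributes = len(distinct_values)
    prob = [summary['num fake'] / summary['num samples'],    # class priors
            summary['num true'] / summary['num samples']]
    smoothed_num = [summary['num fake'] + num_attributes,    # smoothed denominators
                    summary['num true'] + num_attributes]
    evidence = [1.0, 1.0]
    for label in (0, 1):
        for attribute in sample_values:
            evidence[label] *= smoothed_marginals[attribute][label] / smoothed_num[label]
        evidence[label] *= prob[label]
    normalization = evidence[0] + evidence[1]
    return evidence[0] / normalization, evidence[1] / normalization

# hypothetical four-row training set (first two rows are the article's examples)
training = {'color':        ['Aqua', 'Blue', 'Aqua', 'Blue'],
            'size':         ['Small', 'Small', 'Large', 'Small'],
            'style':        ['Twisted', 'Pointed', 'Twisted', 'Pointed'],
            'authenticity': [1, 0, 1, 0]}

distinct  = find_distinct_values(training, ['color', 'size', 'style'])
marginals = calculate_Laplace_smoothed_marginals(training, distinct)
summary   = characterize_data_set(training)
p_fake, p_true = calculate_evidence(distinct, marginals, summary,
                                    ['Aqua', 'Small', 'Twisted'])
print(p_fake, p_true)  # the two probabilities sum to one
```

On this toy data the sample (Aqua, Small, Twisted) matches the authentic rows, so the authentic probability dominates; note that, as the article warns, an unseen attribute value such as a ruby color would raise a KeyError here rather than be classified.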