Peter Seibel - Practical Common Lisp

The keys aren't evaluated. In other words, the value of type will be compared to the literal objects read by the Lisp reader as part of the ECASE form. In this function, that means the keys are the symbols ham and spam, not the values of any variables named ham and spam. So, if increment-count is called like this:

(increment-count some-feature 'ham)

the value of type will be the symbol ham, and the first branch of the ECASE will be evaluated and the feature's ham count incremented. On the other hand, if it's called like this:

(increment-count some-feature 'spam)

then the second branch will run, incrementing the spam count. Note that the symbols ham and spam are quoted when calling increment-count since otherwise they'd be evaluated as the names of variables. But they're not quoted when they appear in ECASE since ECASE doesn't evaluate the keys. [253] Technically, the key in each clause of a CASE or ECASE is interpreted as a list designator, an object that designates a list of objects. A single nonlist object, treated as a list designator, designates a list containing just that one object, while a list designates itself. Thus, each clause can have multiple keys; CASE and ECASE will select the clause whose list of keys contains the value of the key form. For example, if you wanted to make good a synonym for ham and bad a synonym for spam, you could write increment-count like this:

(defun increment-count (feature type)
  (ecase type
    ((ham good) (incf (ham-count feature)))
    ((spam bad) (incf (spam-count feature)))))

The E in ECASE stands for "exhaustive" or "error," meaning ECASE will signal an error if the key value is anything other than one of the listed keys. The plain CASE is more lenient, simply returning NIL when no clause matches.
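
To make the difference concrete, here's a minimal sketch of what you'd see at the REPL with a key, eggs, that isn't listed in either form (an illustrative interaction, not part of the original example):

;; CASE quietly returns NIL for an unlisted key;
;; ECASE signals an error of type TYPE-ERROR instead.
(case 'eggs
  (ham  :it-was-ham)
  (spam :it-was-spam))   ; => NIL

(ecase 'eggs
  (ham  :it-was-ham)
  (spam :it-was-spam))   ; signals an error: 'EGGS fell through the ECASE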

To implement increment-total-count, you need to decide where to store the counts; for the moment, two more special variables, *total-spams* and *total-hams*, will do fine.

(defvar *total-spams* 0)
(defvar *total-hams* 0)

(defun increment-total-count (type)
  (ecase type
    (ham (incf *total-hams*))
    (spam (incf *total-spams*))))

You should use DEFVAR to define these two variables for the same reason you used it with *feature-database*—they'll hold data built up while you run the program that you don't necessarily want to throw away just because you happen to reload your code during development. But you'll want to reset those variables if you ever reset *feature-database*, so you should add a few lines to clear-database as shown here:

(defun clear-database ()
  (setf
   *feature-database* (make-hash-table :test #'equal)
   *total-spams* 0
   *total-hams* 0))
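
As a quick sketch of why DEFVAR is the right choice here (a hypothetical REPL session, assuming the definitions above have been loaded): re-evaluating the DEFVAR forms during a reload leaves the accumulated counts alone, while clear-database resets them explicitly.

(incf *total-spams*)       ; pretend one spam has been trained on => 1
(defvar *total-spams* 0)   ; reloading the file re-evaluates this form...
*total-spams*              ; => 1; DEFVAR doesn't touch an already-bound variable
(clear-database)           ; the explicit reset
*total-spams*              ; => 0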

Per-Word Statistics

The heart of a statistical spam filter is, of course, the functions that compute statistics-based probabilities. The mathematical nuances [254] Speaking of mathematical nuances, hard-core statisticians may be offended by the sometimes loose use of the word probability in this chapter. However, since even the pros, who are divided between the Bayesians and the frequentists, can't agree on what a probability is, I'm not going to worry about it. This is a book about programming, not statistics. of why exactly these computations work are beyond the scope of this book—interested readers may want to refer to several papers by Gary Robinson. [255] Robinson's articles that directly informed this chapter are "A Statistical Approach to the Spam Problem" (published in the Linux Journal and available at http://www.linuxjournal.com/article.php?sid=6467 and in a shorter form on Robinson's blog at http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html) and "Why Chi? Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification" (available at http://garyrob.blogs.com/whychi93.pdf). Another article that may be useful is "Handling Redundancy in Email Token Probabilities" (available at http://garyrob.blogs.com//handlingtokenredundancy94.pdf). The archived mailing lists of the SpamBayes project (http://spambayes.sourceforge.net/) also contain a lot of useful information about different algorithms and approaches to testing spam filters. I'll focus instead on how they're implemented.

The starting point for the statistical computations is the set of measured values—the frequencies stored in *feature-database*, *total-spams*, and *total-hams*. Assuming that the set of messages trained on is statistically representative, you can treat the observed frequencies as probabilities of the same features showing up in hams and spams in future messages.

The basic plan is to classify a message by extracting the features it contains, computing the individual probability that a given message containing the feature is a spam, and then combining all the individual probabilities into a total score for the message. Messages with many "spammy" features and few "hammy" features will receive a score near 1, and messages with many hammy features and few spammy features will score near 0.
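
The rest of the chapter builds up the real versions of those pieces. Purely as a sketch of the shape of that pipeline (the averaging below is a deliberately naive stand-in for the actual combination step, and extract-features and spam-probability are assumed from elsewhere in the chapter):

(defun naive-score (text)
  "Toy scoring sketch: average the per-feature spam probabilities.
NOT the combination procedure developed later in the chapter."
  (let ((probs (mapcar #'spam-probability (extract-features text))))
    (if probs
        (/ (reduce #'+ probs) (length probs))
        1/2)))  ; no features at all: treat it as a coin flip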

The first statistical function you need is one that computes the basic probability that a message containing a given feature is a spam. From one point of view, the probability that a given message containing the feature is a spam is the ratio of spam messages containing the feature to all messages containing the feature. Thus, you could compute it this way:

(defun spam-probability (feature)
  (with-slots (spam-count ham-count) feature
    (/ spam-count (+ spam-count ham-count))))

The problem with the value computed by this function is that it's strongly affected by the overall probability that any message will be a spam or a ham. For instance, suppose you get nine times as much ham as spam in general. A completely neutral feature will then appear in one spam for every nine hams, giving you a spam probability of 1/10 according to this function.
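
To see that concretely, here's a worked sketch of the numbers just described, for a neutral feature seen in 1 spam and 9 hams:

;; spam-count = 1, ham-count = 9, so the first spam-probability computes:
(/ 1 (+ 1 9))   ; => 1/10, even though the feature carries no real information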

But you're more interested in the probability that a given feature will appear in a spam message, independent of the overall probability of getting a spam or ham. Thus, you need to divide the spam count by the total number of spams trained on and the ham count by the total number of hams. To avoid division-by-zero errors, if either of *total-spams* or *total-hams* is zero, you should treat the corresponding frequency as zero. (Obviously, if the total number of either spams or hams is zero, then the corresponding per-feature count will also be zero, so you can treat the resulting frequency as zero without ill effect.)

(defun spam-probability (feature)
  (with-slots (spam-count ham-count) feature
    (let ((spam-frequency (/ spam-count (max 1 *total-spams*)))
          (ham-frequency (/ ham-count (max 1 *total-hams*))))
      (/ spam-frequency (+ spam-frequency ham-frequency)))))

This version suffers from another problem—it doesn't take into account the number of messages analyzed to arrive at the per-word probabilities. Suppose you've trained on 2,000 messages, half spam and half ham. Now consider two features that have appeared only in spams. One has appeared in all 1,000 spams, while the other appeared only once. According to the current definition of spam-probability, the appearance of either feature predicts that a message is spam with equal probability, namely, 1.
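
Working through the numbers makes the problem plain (a sketch assuming *total-spams* and *total-hams* are both 1,000 and both features have a ham count of 0):

;; Feature seen in all 1,000 spams:
(let ((spam-frequency (/ 1000 (max 1 1000)))   ; => 1
      (ham-frequency  (/ 0    (max 1 1000))))  ; => 0
  (/ spam-frequency (+ spam-frequency ham-frequency)))   ; => 1

;; Feature seen in only one spam:
(let ((spam-frequency (/ 1 (max 1 1000)))      ; => 1/1000
      (ham-frequency  (/ 0 (max 1 1000))))     ; => 0
  (/ spam-frequency (+ spam-frequency ham-frequency)))   ; => 1, exactly the same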
