Peter Seibel - Practical Common Lisp

Title: Practical Common Lisp
Author: Peter Seibel
Publisher: Apress
Genre: Programming / in English
Year: 2005
ISBN: 1-59059-239-5

Testing the Filter

To test the filter, you need a corpus of messages of known types. You can use messages lying around in your inbox, or you can grab one of the corpora available on the Web. For instance, the SpamAssassin corpus [257] contains several thousand messages hand-classified as spam, easy ham, and hard ham. To make it easy to use whatever files you have, you can define a test rig that's driven off an array of file/type pairs. You can define a function that takes a filename and a type and adds it to the corpus like this:

[257] Several spam corpora, including the SpamAssassin corpus, are linked to from http://nexp.cs.pdx.edu/~psam/cgi-bin/view/PSAM/CorpusSets.

(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

The value of corpus should be an adjustable vector with a fill pointer. For instance, you can make a new corpus like this:

(defparameter *corpus* (make-array 1000 :adjustable t :fill-pointer 0))
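To see why the vector needs to be adjustable and have a fill pointer, here is a minimal sketch that pushes more entries than the initial size allows. The variable *demo-corpus* and the filenames are made up for illustration; add-file-to-corpus is repeated so the snippet runs on its own.

```lisp
;; Same definition as in the text, repeated so this runs stand-alone.
(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

;; Hypothetical corpus with a deliberately tiny initial size of 2.
(defparameter *demo-corpus* (make-array 2 :adjustable t :fill-pointer 0))

(add-file-to-corpus "a.txt" 'spam *demo-corpus*)
(add-file-to-corpus "b.txt" 'ham *demo-corpus*)
;; The third push exceeds the initial size, so the vector is extended.
(add-file-to-corpus "c.txt" 'ham *demo-corpus*)

(length *demo-corpus*)   ; 3, since LENGTH reports the fill pointer
(aref *demo-corpus* 2)   ; ("c.txt" HAM)
```

Because LENGTH and AREF respect the fill pointer, the rest of the test rig can treat the corpus like any other vector.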

If you have the hams and spams already segregated into separate directories, you might want to add all the files in a directory as the same type. This function, which uses the list-directory function from Chapter 15, will do the trick:

(defun add-directory-to-corpus (dir type corpus)
  (dolist (filename (list-directory dir))
    (add-file-to-corpus filename type corpus)))

For instance, suppose you have a directory mail containing two subdirectories, spam and ham, each containing messages of the indicated type; you can add all the files in those two directories to *corpus* like this:

SPAM> (add-directory-to-corpus "mail/spam/" 'spam *corpus*)
NIL
SPAM> (add-directory-to-corpus "mail/ham/" 'ham *corpus*)
NIL
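Since add-directory-to-corpus returns NIL either way, you might want a quick check that both directories were actually picked up. One way, sketched here with a hand-built stand-in corpus, is to tally the entries by type. The names *tally-corpus* and count-of-type and the filenames are hypothetical.

```lisp
;; Hand-built stand-in for *corpus*; no real files are needed.
(defparameter *tally-corpus* (make-array 4 :adjustable t :fill-pointer 0))

(dolist (entry '(("mail/spam/1" spam) ("mail/spam/2" spam) ("mail/ham/1" ham)))
  (vector-push-extend entry *tally-corpus*))

(defun count-of-type (type corpus)
  "Return how many CORPUS entries are labeled TYPE."
  ;; Each entry is a (filename type) list, so SECOND extracts the type.
  (count type corpus :key #'second))

(count-of-type 'spam *tally-corpus*)   ; 2
(count-of-type 'ham *tally-corpus*)    ; 1
```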

Now you need a function to test the classifier. The basic strategy will be to select a random chunk of the corpus to train on and then test the classifier on the remainder of the corpus, comparing the classification returned by the classify function to the known classification. The main thing you want to know is how accurate the classifier is: what percentage of the messages are classified correctly? But you'll probably also be interested in what messages were misclassified and in what direction: were there more false positives or more false negatives? To make it easy to perform different analyses of the classifier's behavior, you should define the testing functions to build a list of raw results, which you can then analyze however you like.

The main testing function might look like this:

(defun test-classifier (corpus testing-fraction)
  (clear-database)
  (let* ((shuffled (shuffle-vector corpus))
         (size (length corpus))
         (train-on (floor (* size (- 1 testing-fraction)))))
    (train-from-corpus shuffled :start 0 :end train-on)
    (test-from-corpus shuffled :start train-on)))
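The train-on computation can be checked in isolation. The helper below, training-count, is a hypothetical name that simply lifts the same expression out of test-classifier; exact rationals are used for the fraction so the results don't depend on floating-point rounding.

```lisp
;; Hypothetical helper: the same arithmetic test-classifier uses for train-on.
(defun training-count (size testing-fraction)
  "Number of messages to train on when TESTING-FRACTION is held back."
  (floor (* size (- 1 testing-fraction))))

(training-count 1000 1/5)   ; 800: train on 800, test on the other 200
(training-count 10 1/3)     ; 6: (* 10 2/3) is 20/3, and FLOOR rounds down
```

Because FLOOR rounds down, any leftover message lands in the testing portion rather than the training portion.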

This function starts by clearing out the feature database. [258] Then it shuffles the corpus, using a function you'll implement in a moment, and figures out, based on the testing-fraction parameter, how many messages it'll train on and how many it'll reserve for testing. The two helper functions train-from-corpus and test-from-corpus will both take :start and :end keyword parameters, allowing them to operate on a subsequence of the given corpus.

[258] If you wanted to conduct a test without disturbing the existing database, you could bind *feature-database*, *total-spams*, and *total-hams* with a LET, but then you'd have no way of looking at the database after the fact, unless you returned the values you used within the function.
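The note about binding the database variables with LET can be illustrated with stand-ins. Everything prefixed demo- below is hypothetical and only mimics the shape of the real database (a hash table plus a counter). The sketch also assumes the clearing function replaces the table with a fresh one rather than emptying it in place, since otherwise rebinding wouldn't protect the original table.

```lisp
;; Hypothetical stand-ins for the real database variables.
(defvar *demo-database* (make-hash-table :test #'equal))
(defvar *demo-total* 0)

(defun demo-clear ()
  ;; Installs a fresh table instead of calling CLRHASH on the old one.
  (setf *demo-database* (make-hash-table :test #'equal)
        *demo-total* 0))

(defun demo-train (word)
  (incf (gethash word *demo-database* 0))
  (incf *demo-total*))

;; Global state that should survive the isolated test.
(demo-clear)
(demo-train "existing")
(demo-train "another")

(defun isolated-test ()
  ;; Rebinding the special variables confines every SETF to this dynamic extent.
  (let ((*demo-database* *demo-database*)
        (*demo-total* *demo-total*))
    (demo-clear)
    (demo-train "temporary")
    ;; Return whatever you want to look at, since the bindings vanish on exit.
    (values *demo-total* (hash-table-count *demo-database*))))

(isolated-test)                            ; 1, 1
*demo-total*                               ; 2, untouched by the test
(gethash "temporary" *demo-database*)      ; NIL, NIL
```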

The train-from-corpus function is quite simple: it loops over the appropriate part of the corpus, uses DESTRUCTURING-BIND to extract the filename and type from the list found in each element, and then passes the text of the named file and the type to train. Since some mail messages, such as those with attachments, are quite large, you should limit the number of characters it'll take from the message. It'll obtain the text with a function start-of-file, which you'll implement in a moment, that takes a filename and a maximum number of characters to return. train-from-corpus looks like this:

(defparameter *max-chars* (* 10 1024))

(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))
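The looping pattern can be tried without train or start-of-file. The following sketch uses the same :start/:end handling and DESTRUCTURING-BIND, but only collects the type of each entry; *pairs* and types-in-range are made-up names for illustration.

```lisp
;; Hypothetical corpus of (filename type) pairs; no files are read.
(defparameter *pairs*
  (vector '("a.txt" spam) '("b.txt" ham) '("c.txt" spam) '("d.txt" ham)))

(defun types-in-range (corpus &key (start 0) end)
  "Collect the type of each CORPUS entry from START up to, but not including, END."
  (loop for idx from start below (or end (length corpus))
        collect (destructuring-bind (file type) (aref corpus idx)
                  (declare (ignore file))
                  type)))

(types-in-range *pairs* :start 1 :end 3)   ; (HAM SPAM)
(types-in-range *pairs* :start 2)          ; (SPAM HAM), since END defaults to the length
```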

The test-from-corpus function is similar except you want to return a list containing the results of each classification so you can analyze them after the fact. Thus, you should capture both the classification and score returned by classify and then collect a list of the filename, the actual type, the type returned by classify, and the score. To make the results more human readable, you can include keywords in the list to indicate which values are which.

(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list
             :file file
             :type type
             :classification classification
             :score score)))))
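Because each result is a property list, GETF is enough to pull values back out. As one possible analysis, here is a rough sketch that computes the fraction of correct classifications and lists the misclassified files. The sample results, correct-p, and accuracy are all invented for illustration and aren't output from a real run.

```lisp
;; Made-up results in the same shape test-from-corpus returns.
(defparameter *sample-results*
  '((:file "m1" :type ham  :classification ham  :score 0.1)
    (:file "m2" :type spam :classification spam :score 0.95)
    (:file "m3" :type ham  :classification spam :score 0.8)
    (:file "m4" :type spam :classification ham  :score 0.2)))

(defun correct-p (result)
  "True when the classifier's answer matches the known type."
  (eql (getf result :type) (getf result :classification)))

(defun accuracy (results)
  "Fraction of RESULTS classified correctly, as an exact rational."
  (/ (count-if #'correct-p results) (length results)))

(accuracy *sample-results*)   ; 1/2

;; Filenames of the misclassified messages.
(mapcar (lambda (result) (getf result :file))
        (remove-if #'correct-p *sample-results*))   ; ("m3" "m4")
```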

A Couple of Utility Functions

To finish the implementation of test-classifier, you need to write two utility functions that have nothing in particular to do with spam filtering: shuffle-vector and start-of-file.

An easy and efficient way to implement shuffle-vector is using the Fisher-Yates algorithm. [259] You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. This name follows the same naming convention as other destructive functions such as NCONC and NREVERSE. It looks like this:

[259] This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.

(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))
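Since the output is random, you can't check a shuffle against a fixed answer, but you can check its properties: the copy should contain exactly the same elements, and the original should be left alone. A small sketch, with both functions repeated so it runs on its own and *original* and *shuffled* as made-up names:

```lisp
;; Definitions repeated from the text so this snippet is self-contained.
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))

(defparameter *original* (vector 1 2 3 4 5 6 7 8 9 10))
(defparameter *shuffled* (shuffle-vector *original*))

;; The nondestructive version leaves its argument in its original order.
(equalp *original* #(1 2 3 4 5 6 7 8 9 10))            ; T

;; Sorting a copy of the shuffled vector recovers the original elements.
(equalp (sort (copy-seq *shuffled*) #'<) *original*)   ; T
```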

The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use READ-SEQUENCE to fill it in. So it might seem you could make a character array that's either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function FILE-LENGTH isn't entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it's ambiguous what FILE-LENGTH should do when passed a character stream; in most implementations, FILE-LENGTH always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.