Peter Seibel - Practical Common Lisp
- Title: Practical Common Lisp
- Publisher: Apress
- Genre: Programming
- Year: 2005
- ISBN: 1-59059-239-5
Testing the Filter
To test the filter, you need a corpus of messages of known types. You can use messages lying around in your inbox, or you can grab one of the corpora available on the Web. For instance, the SpamAssassin corpus [257] contains several thousand messages hand classified as spam, easy ham, and hard ham.
[257] Several spam corpora, including the SpamAssassin corpus, are linked to from http://nexp.cs.pdx.edu/~psam/cgi-bin/view/PSAM/CorpusSets .
To make it easy to use whatever files you have, you can define a test rig that's driven off an array of file/type pairs. You can define a function that takes a filename and a type and adds it to the corpus like this:
(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))
The value of corpus should be an adjustable vector with a fill pointer. For instance, you can make a new corpus like this:
(defparameter *corpus* (make-array 1000 :adjustable t :fill-pointer 0))
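To see why the vector needs both properties, here is a minimal sketch. The variable *tiny-corpus* and the file names are made up for illustration, and the vector is deliberately created too small so that VECTOR-PUSH-EXTEND has to grow it.

```lisp
;; Illustrative only: a corpus vector with room for just two entries.
(defparameter *tiny-corpus* (make-array 2 :adjustable t :fill-pointer 0))

(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

;; The file names are invented; nothing is read from disk here.
(add-file-to-corpus "msg-1.txt" 'ham *tiny-corpus*)
(add-file-to-corpus "msg-2.txt" 'spam *tiny-corpus*)
(add-file-to-corpus "msg-3.txt" 'ham *tiny-corpus*) ; exceeds the initial size of 2

(length *tiny-corpus*) ; LENGTH respects the fill pointer: only added entries count
(aref *tiny-corpus* 2) ; the pair stored by the third call
```

Because the fill pointer starts at 0, LENGTH reports only the entries added so far, not the allocated size, which is what the testing code relies on when it computes the size of the corpus.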
If you have the hams and spams already segregated into separate directories, you might want to add all the files in a directory as the same type. This function, which uses the list-directory function from Chapter 15, will do the trick:
(defun add-directory-to-corpus (dir type corpus)
  (dolist (filename (list-directory dir))
    (add-file-to-corpus filename type corpus)))
For instance, suppose you have a directory mail containing two subdirectories, spam and ham, each containing messages of the indicated type; you can add all the files in those two directories to *corpus* like this:
SPAM> (add-directory-to-corpus "mail/spam/" 'spam *corpus*)
NIL
SPAM> (add-directory-to-corpus "mail/ham/" 'ham *corpus*)
NIL
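Since both calls return NIL, a quick sanity check is to count how many entries of each type ended up in the corpus. The sketch below uses a stand-in vector, *demo-corpus*, with made-up file names rather than a real mail directory.

```lisp
;; A stand-in corpus built the same way as *corpus*; the entries are invented.
(defparameter *demo-corpus* (make-array 4 :adjustable t :fill-pointer 0))

(dolist (entry '(("mail/spam/0001" spam)
                 ("mail/spam/0002" spam)
                 ("mail/ham/0001" ham)))
  (vector-push-extend entry *demo-corpus*))

;; Each element is a (filename type) list, so SECOND extracts the type.
(count 'spam *demo-corpus* :key #'second)
(count 'ham *demo-corpus* :key #'second)
```

The same two COUNT expressions should work on *corpus* itself, assuming it was filled by add-directory-to-corpus as shown.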
Now you need a function to test the classifier. The basic strategy will be to select a random chunk of the corpus to train on and then test the classifier on the remainder of the corpus, comparing the classification returned by the classify function to the known classification. The main thing you want to know is how accurate the classifier is: what percentage of the messages are classified correctly? But you'll probably also be interested in what messages were misclassified and in what direction: were there more false positives or more false negatives? To make it easy to perform different analyses of the classifier's behavior, you should define the testing functions to build a list of raw results, which you can then analyze however you like.
The main testing function might look like this:
(defun test-classifier (corpus testing-fraction)
  (clear-database)
  (let* ((shuffled (shuffle-vector corpus))
         (size (length corpus))
         (train-on (floor (* size (- 1 testing-fraction)))))
    (train-from-corpus shuffled :start 0 :end train-on)
    (test-from-corpus shuffled :start train-on)))
This function starts by clearing out the feature database. [258]
[258] If you wanted to conduct a test without disturbing the existing database, you could bind *feature-database*, *total-spams*, and *total-hams* with a LET, but then you'd have no way of looking at the database after the fact, unless you returned the values you used within the function.
Then it shuffles the corpus, using a function you'll implement in a moment, and figures out, based on the testing-fraction parameter, how many messages it'll train on and how many it'll reserve for testing. The two helper functions train-from-corpus and test-from-corpus will both take :start and :end keyword parameters, allowing them to operate on a subsequence of the given corpus.
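The split arithmetic is easy to check in isolation. The helper below, split-point, is hypothetical and not part of the chapter's code; it only repeats the expression test-classifier uses to compute train-on.

```lisp
;; SPLIT-POINT is a made-up name for illustration; it mirrors the TRAIN-ON
;; computation: the index where training stops and testing starts.
(defun split-point (size testing-fraction)
  (floor (* size (- 1 testing-fraction))))

(split-point 1000 1/5) ; train on indices 0 to 799, test on 800 to 999
(split-point 10 0.25)  ; 10 * 0.75 = 7.5, and FLOOR rounds down to 7
```

Because FLOOR rounds down, any fractional message goes to the testing side, so the testing portion is never smaller than the requested fraction.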
The train-from-corpus function is quite simple: loop over the appropriate part of the corpus, use DESTRUCTURING-BIND to extract the filename and type from the list found in each element, and then pass the text of the named file and the type to train. Since some mail messages, such as those with attachments, are quite large, you should limit the number of characters it'll take from the message. It'll obtain the text with a function start-of-file, which you'll implement in a moment, that takes a filename and a maximum number of characters to return. train-from-corpus looks like this:
(defparameter *max-chars* (* 10 1024))
(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))
The test-from-corpus function is similar except you want to return a list containing the results of each classification so you can analyze them after the fact. Thus, you should capture both the classification and score returned by classify and then collect a list of the filename, the actual type, the type returned by classify, and the score. To make the results more human readable, you can include keywords in the list to indicate which values are which.
(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list
             :file file
             :type type
             :classification classification
             :score score)))))
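Because each result is a property list, simple analyses need little more than GETF. As a rough sketch, the hypothetical helper fraction-correct below computes the share of results whose classification matches the known type. It is not part of the chapter's code, it assumes classify reports the same symbols used as corpus types, and it treats any other classification as a miss. The sample results and scores are invented.

```lisp
;; FRACTION-CORRECT is a made-up helper for illustration.
(defun fraction-correct (results)
  (if (null results)
      0
      (/ (count-if (lambda (result)
                     (eql (getf result :type)
                          (getf result :classification)))
                   results)
         (length results))))

;; Hand-written sample results in the shape TEST-FROM-CORPUS returns.
(defparameter *sample-results*
  (list (list :file "a.txt" :type 'ham  :classification 'ham  :score 0.1)
        (list :file "b.txt" :type 'spam :classification 'spam :score 0.9)
        (list :file "c.txt" :type 'ham  :classification 'spam :score 0.7)
        (list :file "d.txt" :type 'spam :classification 'ham  :score 0.2)))

(fraction-correct *sample-results*) ; 2 of the 4 sample results match
```

Keeping the raw results separate from the analysis means a different question, such as counting false positives only, needs a new predicate but no new test run.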
A Couple of Utility Functions
To finish the implementation of test-classifier, you need to write the two utility functions that don't really have anything particularly to do with spam filtering, shuffle-vector and start-of-file.
An easy and efficient way to implement shuffle-vector is using the Fisher-Yates algorithm. [259]
[259] This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.
You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. This name follows the same naming convention as other destructive functions such as NCONC and NREVERSE. It looks like this:
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)
The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.
(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))
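A quick way to convince yourself that the pair behaves as described is to shuffle a small vector and check two things: the result holds the same elements, and the original is untouched. The sketch repeats both definitions so it can be loaded on its own; *original* and *shuffled* are names made up for the example.

```lisp
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))

(defparameter *original* (vector 1 2 3 4 5))
(defparameter *shuffled* (shuffle-vector *original*))

;; Sorting a copy of the shuffled vector recovers the original elements...
(sort (copy-seq *shuffled*) #'<)
;; ...and the original vector is still in its initial order.
*original*
```

The order of *shuffled* itself varies from run to run, which is why the check sorts a copy rather than comparing against a fixed expected order.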
The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use READ-SEQUENCE to fill it in. So it might seem you could make a character array that's either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function FILE-LENGTH isn't entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it's ambiguous what FILE-LENGTH should do when passed a character stream; in most implementations, FILE-LENGTH always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.
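One way to cope with this, sketched here under the assumption that FILE-LENGTH never underestimates the number of characters, is to treat its value as an upper bound: allocate a string of that size (capped at the maximum), let READ-SEQUENCE fill in as much as it can, and trim the string if fewer characters were actually read. The local names limit and chars-read are choices made for this sketch.

```lisp
;; A sketch of start-of-file that treats FILE-LENGTH as an upper bound only.
(defun start-of-file (file max-chars)
  (with-open-file (in file)
    (let* ((limit (min (file-length in) max-chars))
           (text (make-string limit))
           ;; READ-SEQUENCE returns the index of the first element of TEXT
           ;; it did not fill, that is, the number of characters read.
           (chars-read (read-sequence text in)))
      (if (< chars-read limit)
          (subseq text 0 chars-read) ; fewer characters than octets: trim
          text))))
```

For a file containing only single-octet characters the two counts agree and the string is returned as is; the SUBSEQ branch only matters when some characters occupy more than one octet.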