So I made a blog and some requests have started coming in. And of course I want to get some data about my visitors.
I could just download a program from somewhere and get all the statistics I could think of in less than ten minutes, but where is the fun in that? Better to spend hours trying to write some lisp code that barely manages to get the most basic stats from the logfiles.
You'll notice that I redefine the same function again and again and again. This is how I usually work: gradually refining functions until I think they work as I want. (This is where you're appalled by the lack of unit testing/formal verification. I basically fiddle until it works for the limited number of test cases I have.)
First, some libraries. I'm gonna chicken out of the hard work of the parsing because Edi Weitz already did that and made the cl-ppcre package, which implements Perl's regular expressions in Common Lisp.
(defpackage :com.jalat.logparse (:use :cl :cl-ppcre))
A short run-down of the regular expressions I'm going to use, for those who don't know Perl regular expressions:

most chars  match themselves.
\S          matches non-whitespace characters (anything but spaces, newlines and tabs).
\d          matches digits.
\w          matches alphanumeric characters (uses alphanumericp, so it might differ from implementation to implementation).
[abc]       matches one of a list of characters.
[^abc]      matches everything except the list of characters.
(pattern)   stores pattern in a "register" for later use.
+           repeats the previous character/group one or more times.
*           repeats the previous character/group zero or more times.
{n}         repeats the previous character/group n times.
|           means or, so (a|b) matches a or b.
^           at the start of a pattern means the beginning of the line, so "^a" does not match "ba".
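A quick way to get a feel for these building blocks is cl-ppcre's scan-to-strings, which returns the whole match plus a vector of register matches. A few small examples (the test strings here are mine, not from the log):

```lisp
;; cl-ppcre must be loaded first, e.g. (ql:quickload :cl-ppcre)

;; \d+ -- one or more digits:
(cl-ppcre:scan-to-strings "\\d+" "abc123def")   ; => "123"

;; (pattern) stores submatches in registers, returned as the second value:
(multiple-value-bind (whole regs)
    (cl-ppcre:scan-to-strings "(\\d+)/(\\w+)" "08/Jan")
  (list whole (coerce regs 'list)))             ; => ("08/Jan" ("08" "Jan"))

;; ^ anchors the pattern at the start of the string:
(cl-ppcre:scan "^a" "ba")                       ; => NIL (no match)
```

Note the doubled backslashes in the lisp string literals; the reason for those is explained a little further down.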
Here are two log entries; they are actually one request. The first request uses an old url, and I redirect the visitor to the new url. (Which by now is also an old url... If only planning things were fun I might actually start doing some one day.):
222.153.17.71 - - [08/Jan/2006:00:14:47 +0000] "GET /back-of-the-envelope.html HTTP/1.1" 302 302 "http://planet.lisp.org/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2" "-"
222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] "GET /blog.ucw?entry-id=2 HTTP/1.1" 200 15016 "http://planet.lisp.org/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2" "-"
The first part is the ip address of the user who looked at our blog.
As a simple demo of how to use cl-ppcre, I'll scan and collect the first string up until the first whitespace (hopefully gathering the ip address in the process). The register-groups-bind will do the scanning and storing (binding) for me.
(in-package :com.jalat.logparse)

(defparameter *teststrings*
  '("222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] \"GET /blog.ucw?entry-id=2 HTTP/1.1\" 200 15016 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\""
    "222.153.17.71 - - [08/Jan/2006:00:14:47 +0000] \"GET /back-of-the-envelope.html HTTP/1.1\" 302 302 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\""))

(defun split-string (logstring)
  (register-groups-bind (ipaddr)
      ("(\\S+) " logstring)
    (list ipaddr)))

COM.JALAT.LOGPARSE 39 > (first *teststrings*)
"222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] \"GET /blog.ucw?entry-id=2 HTTP/1.1\" 200 15016 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\""
COM.JALAT.LOGPARSE 40 > (split-string (first *teststrings*))
("222.153.17.71")
As you can see, the (\\S+) has matched non-whitespace characters, and the match is stored in the variable ipaddr. cl-ppcre returns the first match it can find; that is why I get the ip address and not just any substring consisting of non-whitespace. Why do I use \\S if \S is what matches a character? \ is the escape character for strings in lisp, so a single \ would just pass "S" on to cl-ppcre:
COM.JALAT.LOGPARSE 41 > (princ "\a")
a
"a"
COM.JALAT.LOGPARSE 42 > (princ "\\a")
\a
"\\a"
COM.JALAT.LOGPARSE 44 > (length "\a")
1
COM.JALAT.LOGPARSE 45 > (length "\\a")
2
I'm not going to bother with the ip addresses. I guess they might be useful if you want to try to count individual users, but then you have to be clever about different users behind a proxy/NAT. So I won't be doing that. Too complicated for me, and who really wants to reduce the perceived number of visitors anyway? I'm doing this to boost my ego, after all.
What I will do is use the time data. I can't use \\S+ here since there is a space in the string. Instead, I'll read up to the first [ and then collect all non-] characters.
(defun split-string (logstring)
  (register-groups-bind (when)
      ("\\[([^\\]]*)\\]" logstring)
    (list when)))

COM.JALAT.LOGPARSE 46 > (split-string (first *teststrings*))
("08/Jan/2006:00:14:48 +0000")
As you can see, I have to quote (twice) the ['s and ]'s that are literal, to stop them from being interpreted as a [] directive. If you have problems sorting out the expression above, it might help to work from the outside in, starting with \\[ and \\]. Inside those are the ( ), meaning that I want to collect the string matching the pattern inside the ( ). Finally, the inner expression [^\\]]* means zero or more characters that are not the character ']'.
I'm a bit lucky here: there are no [] in the log strings before the time field, so I can abuse the fact that cl-ppcre returns the first match it finds. But this is not the best way of writing a regular expression if we want performance. With this expression the scanner has to search the string for the start of the expression. If we tell the scanner where the expression starts, we can reduce the workload a bit:
(defun split-string (logstring)
  (register-groups-bind (when)
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\]" logstring)
    (list when)))
The ^ tells the scanner that our expression starts at the beginning of the string; then there are three substrings separated by spaces that I'm not interested in. Finally there is the date string, which I am interested in.
Next is to extract the url. The url is surrounded by "GET on one side and HTTP-something" on the other. I'm using a trick similar to the one above to match up to the ".
(defun split-string (logstring)
  (register-groups-bind (when request)
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\] \"GET (\\S*) HTTP[^\\\"]*\"" logstring)
    (list when request)))

COM.JALAT.LOGPARSE 5 > (split-string (first *teststrings*))
("08/Jan/2006:00:14:48 +0000" "/blog.ucw?entry-id=2")
COM.JALAT.LOGPARSE 7 > (split-string (second *teststrings*))
("08/Jan/2006:00:14:47 +0000" "/back-of-the-envelope.html")
Now, remember that this is only one request, which is being redirected. I don't really want to count it twice. The next element of the string helps us sort out requests that have been answered successfully from requests that fail, are redirected, or don't exist. This element is the result code; 200 means a successful request, so I'll hardcode 200 in my regexp string. The field after the result code I'll skip by matching but not collecting it. Then I'll collect the referring url: I want to see who has been nice and linked to me.
(defun split-string (logstring)
  (register-groups-bind (when request referrer)
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\"" logstring)
    (list when request referrer)))

COM.JALAT.LOGPARSE 9 > (split-string (car *teststrings*))
("08/Jan/2006:00:14:48 +0000" "/blog.ucw?entry-id=2" "http://planet.lisp.org/")
COM.JALAT.LOGPARSE 10 > (split-string (cadr *teststrings*))
NIL
One more detail: I don't like the date/time format; for now all I want is the date. It would also be nice to have a more sortable date. I will do this by splitting the date string into day, month and year, and combining them into a number. \d{2} will match two digits (the day part), \w{3} three alphanumeric characters (the month), and \d{4} the year.
(defparameter *months*
  '(("Jan" . "01") ("Feb" . "02") ("Mar" . "03") ("Apr" . "04")
    ("May" . "05") ("Jun" . "06") ("Jul" . "07") ("Aug" . "08")
    ("Sep" . "09") ("Oct" . "10") ("Nov" . "11") ("Dec" . "12")))

(defun numerical-date (date month year)
  (parse-integer
   (concatenate 'string
                year
                (cdr (assoc month *months* :test #'string=))
                date)))

(defun split-string (logstring)
  (register-groups-bind (date month year request referrer)
      ("^\\S+ \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4})[^\\]]*\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\"" logstring)
    (list (numerical-date date month year) request referrer)))

COM.JALAT.LOGPARSE 34 > (split-string (car *teststrings*))
(20060108 "/blog.ucw?entry-id=2" "http://planet.lisp.org/")
Ok, time to start counting. I'm just going to start with a hash table of dates and count how many requests I get for each date. I will also move the regexp out of the record-string function, so the regexp scanner only has to be built once. This should improve performance a bit.
(defparameter *logscanner*
  (create-scanner
   "^\\S+ \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4})[^\\]]*\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\""))

(defparameter *datehash* (make-hash-table))

(defun record-entry (date request referrer)
  (declare (ignore request referrer))
  (let* ((currentcount (gethash date *datehash* 0)))
    (setf (gethash date *datehash*) (1+ currentcount))))

(defun record-string (logstring)
  (register-groups-bind (day month year request referrer)
      (*logscanner* logstring)
    (record-entry (numerical-date day month year) request referrer)))

(defun parse-file (filename)
  (with-open-file (s filename)
    ;; read-line with eof-value 'eof in both forms, so an empty file
    ;; doesn't signal an error on the first read:
    (do ((string (read-line s nil 'eof) (read-line s nil 'eof)))
        ((eql string 'eof) nil)
      (record-string string))))

(defun printstats (filename)
  (parse-file filename)
  (dolist (x (sort (loop for k being the hash-key of *datehash*
                         collect (list k (gethash k *datehash*)))
                   #'> :key #'car))
    (format t "~a ~a~%" (car x) (cadr x))))

COM.JALAT.LOGPARSE 28 > (printstats "~/poundlog")
20060128 1986
20060127 325
20060126 33
20060125 29
20060124 39
20060123 25
20060122 73
20060121 30
20060120 32
20060119 67
20060118 103
20060117 116
20060116 96
20060115 98
20060114 89
20060113 130
20060112 123
The code above is quite simple, I hope. create-scanner creates a closure that I use in the now renamed record-string. The result is passed on to record-entry, which updates the hash table of hits. parse-file loops through a file calling record-string for each line. printstats just prints out the keys/values of the hash table.
Ok, this raises the obvious question: what happened on the 27th/28th of January? 30 visitors a day is unlikely enough (I'm not going to bother showing this, but there is a surprisingly large number of bots traversing the net :) so what on earth triggered almost 2000 requests on the 28th? Let's have a look at where they come from. Instead of keeping a single number for each date, I'll have a hash table for each date, and in that hash table record the number of times each url appears as a referrer:
(defun record-entry (date request referrer)
  (declare (ignore request))
  (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal))))
    (incf (gethash referrer referrerhash 0))
    (setf (gethash date *datehash*) referrerhash)))
Hash tables by default use eql to compare keys. This works well with the numerical keys I use for the dates, but won't work too well for the url strings. That is why I use the :test argument to choose a different equality test. The possible choices are eq, eql, equal and equalp. For string keys, equal will do nicely.
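The difference is easy to see at the REPL: two strings with the same characters are generally not eql (they are distinct objects), so an eql table treats them as different keys, while equal compares string contents. A small demonstration:

```lisp
(let ((eql-table   (make-hash-table))               ; default :test is eql
      (equal-table (make-hash-table :test 'equal)))
  (setf (gethash "http://planet.lisp.org/" eql-table) 1)
  (setf (gethash "http://planet.lisp.org/" equal-table) 1)
  ;; copy-seq guarantees a freshly-consed string, even if the
  ;; implementation coalesces identical literals:
  (list (gethash (copy-seq "http://planet.lisp.org/") eql-table)
        (gethash (copy-seq "http://planet.lisp.org/") equal-table)))
;; => (NIL 1)
```

The eql table fails to find the entry because the lookup string is a different object; the equal table finds it because the contents match.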
(defun printstats (filename)
  (setf *datehash* (make-hash-table))
  (parse-file filename)
  (dolist (date-entry (sort (loop for k being the hash-key of *datehash*
                                  collect (list k (gethash k *datehash*)))
                            #'> :key #'first))
    (format t "~%~a:~%" (first date-entry))
    (dolist (referrer (sort (loop for k being the hash-key of (second date-entry)
                                  collect (list k (gethash k (second date-entry))))
                            #'> :key #'second))
      (format t "  ~a ~a~%" (second referrer) (first referrer)))))
printstats is updated with an inner loop that loops over the hashes in the date hash table, sorting by the number of requests from each url. And the result:
COM.JALAT.LOGPARSE 19 > (printstats "~/poundlog")

20060128:
  621 http://www.jalat.com/blogs/lisp
  433 http://www.jalat.com/blogs/lisp?id=4
  242 http://planet.lisp.org/
  153 http://www.jalat.com/
  113 http://www.jalat.com/blogs/lisp?id=3
  67 http://www.jalat.com/blogs/lisp?id=1
  53 http://www.jalat.com/blogs/lisp?id=2
  42 http://weitz.de/hunchentoot/
  40 http://lemonodor.com/archives/001339.html
  31 http://lemonodor.com/
  28 http://www.lemonodor.com/
  17 http://www.google.com/reader/lens/
  7 http://keithdevens.com/weblog
  5 http://anarchaia.org/
I cut off most of this list; there are 108 different referrers for the 28th of January, and while there are some interesting entries there, they are not responsible for the spike on the 27th/28th. It turns out quite a few people read http://planet.lisp.org, and Mr. John Wiseman, aka lemonodor, mentioned my ramblings in his blog, which happens to be aggregated on planet.lisp.org.
Anyway, the biggest referrer seems to be myself. Mostly links to images, is my guess. Shouldn't be too hard to check. In fact, since it is possible to use a list as a key to the hash table, I only have to make the key to my hash a list of (referrer request). A simple two-line change:
(defun record-entry (date request referrer)
  (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal))))
    (incf (gethash (list referrer request) referrerhash 0))
    (setf (gethash date *datehash*) referrerhash)))

(defun printstats (filename)
  (setf *datehash* (make-hash-table))
  (parse-file filename)
  (dolist (date-entry (sort (loop for k being the hash-key of *datehash*
                                  collect (list k (gethash k *datehash*)))
                            #'> :key #'first))
    (format t "~%~a:~%" (first date-entry))
    (dolist (referrer (sort (loop for k being the hash-key of (second date-entry)
                                  collect (list k (gethash k (second date-entry))))
                            #'> :key #'second))
      (format t "  ~a ~a => ~a~%"
              (second referrer)
              (first (first referrer))
              (second (first referrer)))))
  *datehash*)

COM.JALAT.LOGPARSE 22 > (printstats "~/poundlog")

20060128:
  266 http://www.jalat.com/blogs/lisp => /static-files/pieces.gif
  227 http://www.jalat.com/blogs/lisp => /static-files/stylesheet.css
  197 http://www.jalat.com/blogs/lisp?id=4 => /static-files/pieces.gif
  186 http://www.jalat.com/blogs/lisp?id=4 => /static-files/stylesheet.css
  96 http://planet.lisp.org/ => /blogs/lisp?id=4
  94 http://planet.lisp.org/ => /blogs/lisp
  85 http://www.jalat.com/ => /blogs/lisp
  54 http://www.jalat.com/blogs/lisp => /blogs/lisp?id=3
  53 http://www.jalat.com/blogs/lisp?id=3 => /static-files/stylesheet.css
  52 http://planet.lisp.org/ => /blogs/lisp?id=3
  42 http://weitz.de/hunchentoot/ => /
  40 http://www.jalat.com/ => /static-files/stylesheet.css
  33 http://www.jalat.com/blogs/lisp?id=2 => /blogs/lisp?id=1
  31 http://www.jalat.com/blogs/lisp?id=3 => /blogs/lisp?id=2
  29 http://www.jalat.com/blogs/lisp?id=1 => /static-files/flipper-linux.jpg
  24 http://www.jalat.com/blogs/lisp => /blogs/lisp?id=1
  21 http://lemonodor.com/archives/001339.html => /blogs/lisp
Not just pictures, it seems. The stylesheet is a big contributor as well. I'm not all that interested in these since they aren't really hits, so I'm going to filter them out. At the same time I'll add in the missing hostname for the request strings.
(defun record-entry (date request referrer)
  (unless (scan "static-files|favicon.ico|flipper-linux.jpg" request)
    (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal)))
           (full-request (concatenate 'string "http://www.jalat.com" request)))
      (incf (gethash (list referrer full-request) referrerhash 0))
      (setf (gethash date *datehash*) referrerhash))))
scan is another function from cl-ppcre. It returns indexes to where the pattern is found in the string, or nil if the pattern is not found. I'm not interested in the indexes, but simply use the return value as a true/false test for whether the pattern is in the string. The resulting output is:
20060128:
  96 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp?id=4
  94 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp
  85 http://www.jalat.com/ => http://www.jalat.com/blogs/lisp
  54 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=3
  52 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp?id=3
  42 http://weitz.de/hunchentoot/ => http://www.jalat.com/
  33 http://www.jalat.com/blogs/lisp?id=2 => http://www.jalat.com/blogs/lisp?id=1
  31 http://www.jalat.com/blogs/lisp?id=3 => http://www.jalat.com/blogs/lisp?id=2
  24 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=1
  21 http://lemonodor.com/archives/001339.html => http://www.jalat.com/blogs/lisp
  16 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=2
  16 http://www.jalat.com/blogs/lisp?id=1 => http://www.jalat.com/blogs/lisp?id=2
  15 http://www.jalat.com/blogs/lisp?id=3 => http://www.jalat.com/blogs/lisp?id=1
  15 http://www.jalat.com/blogs/lisp?id=4 => http://www.jalat.com/blogs/lisp?id=3
  15 http://lemonodor.com/archives/001339.html => http://www.jalat.com/blogs/lisp?id=4
  12 http://www.lemonodor.com/ => http://www.jalat.com/blogs/lisp
  12 http://www.lemonodor.com/ => http://www.jalat.com/blogs/lisp?id=4
  12 http://lemonodor.com/ => http://www.jalat.com/blogs/lisp
  11 http://lemonodor.com/ => http://www.jalat.com/blogs/lisp?id=4
It works, even though it's not that pretty. I'm going to leave it at that anyway, because I'm trying to shorten these blog entries and I've hopefully demonstrated some regexps and hash table usage by now. Throw it into a table for more presentable output. And you probably want to combine www.domain.name and domain.name urls. I'm going to finish off with something a bit different:
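Combining the www.domain.name and domain.name urls could be done with one more cl-ppcre call before the referrer is used as a key. A minimal sketch (the helper name is mine, not part of the code above):

```lisp
;; Hypothetical helper: strip a leading "www." from the host part of a
;; referrer url, so "http://www.lemonodor.com/" and "http://lemonodor.com/"
;; end up counted as the same referrer. The register \1 keeps the scheme.
(defun normalize-referrer (referrer)
  (cl-ppcre:regex-replace "^(https?://)www\\." referrer "\\1"))

(normalize-referrer "http://www.lemonodor.com/") ; => "http://lemonodor.com/"
(normalize-referrer "http://planet.lisp.org/")   ; => "http://planet.lisp.org/"
```

Calling this on the referrer in record-entry before using it as a hash key would merge the two spellings.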
CL-DOT is a package by Juho Snellman that makes it easy to make data files for the dot program in the Graphviz suite. It works by defining methods that describe how different objects should be represented, and then passing a data structure to generate-graph. generate-graph will generate a graph that can be saved in different ways; I just save it to a dot file, and later use dot to generate the picture. It is also possible to have cl-dot generate ps/jpeg/etc. directly.
(defparameter *dothash* (make-hash-table))

(defun record-entry (date request referrer)
  (unless (scan "static-files" request)
    ;; Note: the keys of referrerhash are fresh lists, so the table
    ;; needs an equal test to combine counts for the same key.
    (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal)))
           (interned-request (intern (concatenate 'string "http://www.jalat.com" request)))
           (interned-referrer (intern referrer))
           (ref-dothash (gethash interned-referrer *dothash* (make-hash-table))))
      (incf (gethash (list interned-referrer interned-request) referrerhash 0))
      (setf (gethash date *datehash*) referrerhash)
      (incf (gethash interned-request ref-dothash 0))
      (incf (gethash :num-referrals ref-dothash 0))
      (setf (gethash interned-referrer *dothash*) ref-dothash))))
I have to make one workaround. cl-dot keeps track of objects by keeping them in a hash. That means that I have to turn the strings into symbols; intern does just that. Each entry in the *dothash* hash has a url as a key, and the value is another hash which lists the urls the first url refers to and the number of times it refers to them. I'm also putting in a special symbol, :num-referrals, that contains the sum of the url's referrals.
BTW: The use of hash tables here is a bit dubious. Each url usually only refers to a couple of other urls. So the hash tables are very small, and the overhead is high. A list would probably have been a more efficient choice.
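For comparison, the same per-url counting with an alist of (url . count) pairs might look like this. It's just a sketch of the alternative, not code used elsewhere in this post:

```lisp
;; Count a referral in an alist of (url . count) pairs. assoc does a
;; linear search, which is fine when each url only refers to a handful
;; of other urls. Returns the (possibly extended) alist.
(defun count-referral (alist url)
  (let ((entry (assoc url alist :test #'equal)))
    (if entry
        (progn (incf (cdr entry)) alist)    ; existing url: bump its count
        (acons url 1 alist))))              ; new url: add a fresh pair

(let ((counts '()))
  (setf counts (count-referral counts "/blogs/lisp"))
  (setf counts (count-referral counts "/blogs/lisp"))
  (setf counts (count-referral counts "/blogs/lisp?id=4"))
  counts)
;; => (("/blogs/lisp?id=4" . 1) ("/blogs/lisp" . 2))
```

For the handful of entries per url in this log, the linear assoc search costs next to nothing and avoids the per-table overhead.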
(defpackage :com.jalat.logparse
  (:use :cl :cl-ppcre :cl-dot))

(defmethod object-node ((object list))
  nil)

(defmethod object-knows-of ((object list))
  object)
I'm going to pass a bunch of urls in a list to the generate-graph function. I don't want to print the list in the graph, it's just a container for the data, so object-node for lists just returns nil. I do need cl-dot to model the contents of the list, so I have to make the contents known. I do this with the object-knows-of method.
(defmethod object-node ((object symbol))
  (make-instance 'node
                 :attributes (list :label object
                                   :shape :box)))

(defmethod object-points-to ((object symbol))
  (let ((reqhash (gethash object *dothash*)))
    (when reqhash
      (remove-if 'null
                 (loop for i being the hash-key of reqhash
                       collect (unless (eql :num-referrals i)
                                 (make-instance 'attributed
                                                :object i
                                                :attributes (list :label (gethash i reqhash)))))))))
I'm going to represent symbols (urls) as boxes, with the symbol itself as the label of the box. Urls point to other urls; finding the urls this url refers to is as simple as looking it up in the *dothash* table and removing the :num-referrals symbol.
(defun make-graph (filename)
  (with-open-file (out filename :direction :output :if-exists :supersede)
    (cl-dot:print-graph
     (cl-dot:generate-graph
      (mapcar #'first
              (subseq (sort (loop for k being the hash-key of *dothash*
                                  collect (let ((refhash (gethash k *dothash*)))
                                            (list k (gethash :num-referrals refhash))))
                            #'> :key #'second)
                      0 20)))
     out)))
I could just generate a huge graph with all the urls, but it gets quite messy. Here the loop generates a list of (url num-referrals) pairs, sort sorts it based on num-referrals, subseq makes a list of the first 20, mapcar makes a list of only the urls, generate-graph makes the graph, and print-graph writes it to a file.
Using dot I can now generate a graph of the top referring urls to my site. As you can see it's a bit messy even with 20 urls. (It's also blurry due to resizing...) I still think it's kind of neat in a way.
As usual: Feedback welcome at asbjxrn@bjxrnstad.net