Tuesday, October 7, 2008

Parsing logs

So I made a blog and some requests have started coming in. And of course I want to get some data about my visitors.

I could just download a program from somewhere and get all the statistics I could think of in less than ten minutes, but where is the fun in that? Better to spend hours to try to write some lisp code that barely manages to get the most basic stats from the logfiles.

You'll notice that I redefine the same function again and again and again. This is how I usually work. Gradually refining functions until I think they work as I want.(This is where you're appaled by the lack of unit testing/formal verification. I basically fiddle until it works for the limited amount of test cases I have.)

First some libraries. I'm gonna chicken out of the hard work of the parsing because Edi Weitz already did that and made the cl-ppcre package which implements perls regular expressions in Common Lisp.


(defpackage :com.jalat.logparse
  (:use :cl :cl-ppcre))

A short run-down of the regular expressions I'm going to use for those who don't know perl regular expressions:

most characters matches themselves.
\Smatches non-empty characters (anything but spaces, newlines and tabs).
\dmatches digits.
\w matches alphanumeric characters (Uses alphanumericp, might differ from implementation to implementation).
[abc]matches a list of characters.
[^abc]matches everything except the list of characters.
(pattern)stores pattern in a "register" for later use
+repeat previous character/group one or more times
*repeat previous character/group zero or more times
{n}repeat previous character/group n times
|means or, so (a|b) matches a or b.
^at the start of line, means the beginning of a line so "^a" does not match "ba"

Here's two log entries, this is actually one request. The first request use a old url, and I redirect the visitor to the new url. (Which by now also is an old url... If only planning things were fun I might actually start doing some one day.):


222.153.17.71 - - [08/Jan/2006:00:14:47 +0000] "GET /back-of-the-envelope.html HTTP/1.1" 302 302 "http://planet.lisp.org/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2" "-"
222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] "GET /blog.ucw?entry-id=2 HTTP/1.1" 200 15016 "http://planet.lisp.org/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2" "-"

The first part is the ip address of the user who looked at our blog. As a simple demo of how to use cl-ppcre, I'll scan and collect the first string up until the first whitespace. (hopefully gathering the ip address in the process.) The register-groups-bind will do the scanning and storing (binding) for me.


(in-package :com.jalat.logparse)

(defparameter *teststrings* '("222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] \"GET /blog.ucw?entry-id=2 HTTP/1.1\" 200 15016 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\"" 
                              "222.153.17.71 - - [08/Jan/2006:00:14:47 +0000] \"GET /back-of-the-envelope.html HTTP/1.1\" 302 302 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\""))

(defun split-string (logstring)
    (register-groups-bind 
      (ipaddr) 
      ("(\\S+) " logstring)
    (list ipaddr)))

COM.JALAT.LOGPARSE 39 > (first *teststrings*)
"222.153.17.71 - - [08/Jan/2006:00:14:48 +0000] \"GET /blog.ucw?entry-id=2 HTTP/1.1\" 200 15016 \"http://planet.lisp.org/\" \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2\" \"-\""

COM.JALAT.LOGPARSE 40 > (split-string (first *teststrings*))
("222.153.17.71")

As you can see, the (\\S+) has matched non-empty characters and is stored in the variable ipaddr. cl-ppcre returns the first match it can find, that is why I get the ip address and not just any substring consisting of non-whitespace. Why do I use \\S if \S is what matches a character? \ is the quote character for strings in lisp so a simple \ will just pass "S" on to cl-ppcre:



COM.JALAT.LOGPARSE 41 > (princ "\a")
a
"a"

COM.JALAT.LOGPARSE 42 > (princ "\\a")
\a
"\\a"

COM.JALAT.LOGPARSE 44 > (length "\a")
1

COM.JALAT.LOGPARSE 45 > (length "\\a")
2

I'm not going to bother with the ip adresses. I guess it might be useful if you want to try to count individual users, but then you have to be clever about different user behind a proxy/NAT. So I won't be doing that. Too complicated for me, and who really wants to reduce the percieved number of visitors anyway. I'm doing this to boost my ego, after all.

What I will do is to use the time data. I can't use \\S+ here since there is a space in the string. What I'll do is read the first [ and then collect all non-] characters.


(defun split-string (logstring)
    (register-groups-bind 
      (when) 
      ("\\[([^\\]]*)\\]" logstring)
    (list when)))

COM.JALAT.LOGPARSE 46 > (split-string (first *teststrings*))
("08/Jan/2006:00:14:48 +0000")

As you can see, I have to quote (twice) the ['s and ]'s that are literal, to stop them being interpreted as a [] directive. If you have problems sorting out the expression above, it might help to eliminate from outside and in starting with \\[ and \\]. Inside those are the ( ) meaning that I want to collect the string matching the pattern inside the ( ). Finally the inner expression [^\\]]* means 0 or more characters that are not the character ']'

I'm a bit lucky because of the fact that there are no [] in the log strings before the time description. This means that I abuse the fact that cl-ppcre returns the first match it finds. But this is not the best way of writing a regular expression if we want performance. With this expression the scanner have to search the string for the start of the expression. If we tell the scanner where the expression starts we can reduce the workload a bit:


(defun split-string (logstring)
    (register-groups-bind 
      (when) 
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\]" logstring)
    (list when)))

The ^ tells the scanner that our expression starts at the start of the string, then there are three substrings separated by spaces that I'm not interested in. Finally there is the date string which I am interested in.

Next is to extract the url. The url is surrounded by "GET and HTTPsomething". I'm using a similar trick to what I used above to match up to the ".


(defun split-string (logstring)
    (register-groups-bind 
      (when request) 
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\] \"GET (\\S*) HTTP[^\\\"]*\"" logstring)
    (list when request)))


COM.JALAT.LOGPARSE 5 > (split-string (first *teststrings*))
("08/Jan/2006:00:14:48 +0000" "/blog.ucw?entry-id=2")

COM.JALAT.LOGPARSE 7 > (split-string (second *teststrings*))
("08/Jan/2006:00:14:47 +0000" "/back-of-the-envelope.html")

Now, remember that this is only one request that is being redirected. I don't really want to count it twice. The next element of the string helps us sort out requests that has been answered successfully from requests that fail, are redirected or don't exist. This element is the result code, 200 means a successfull request so I'll hardcode 200 in my regexp string. The field after the result code I'll skip by matching but not collecting it. Then I'll collect the referring url. I want to see who have been nice and linked to me.


(defun split-string (logstring)
    (register-groups-bind 
      (when request referrer) 
      ("^\\S+ \\S+ \\S+ \\[([^\\]]*)\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\"" logstring)
    (list when request referrer)))

COM.JALAT.LOGPARSE 9 > (split-string (car *teststrings*))
("08/Jan/2006:00:14:48 +0000" "/blog.ucw?entry-id=2" "http://planet.lisp.org/")

COM.JALAT.LOGPARSE 10 > (split-string (cadr *teststrings*))
NIL

One more detail, I don't like the date/time format, for now all I want is the date. It would also be nice to have a more sortable date. I will do this by splitting up the date string in days months year and combining it to a number. \d{2} will match two digits (the day part), \w{3} three alphanumeric character (month) and \d{4} the year.


(defparameter *months* '(("Jan" . "01") ("Feb" . "02") ("Mar" . "03")
                         ("Apr" . "04") ("May" . "05") ("Jun" . "06")
                         ("Jul" . "07") ("Aug" . "08") ("Sep" . "09")
                         ("Oct" . "10") ("Nov" . "11") ("Dec" . "12")))

(defun numerical-date (date month year)
  (parse-integer 
   (concatenate 'string
                year
                (cdr (assoc month *months* :test #'string=))
                date)))

(defun split-string (logstring)
    (register-groups-bind 
      (date month year request referrer) 
      ("^\\S+ \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4})[^\\]]*\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\"" logstring)
    (list (numerical-date date month year) request referrer)))

COM.JALAT.LOGPARSE 34 > (split-string (car *teststrings*))
(20060108 "/blog.ucw?entry-id=2" "http://planet.lisp.org/")

Ok, time to start counting. I'm just going to start with a hash-table of dates and count how many requests I get for each date. I will also move the regexp out of the record-string function, so the regexp scanner only have to be built once. This should increase the performance a bit.


(defparameter *logscanner* (create-scanner "^\\S+ \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4})[^\\]]*\\] \"GET (\\S*) HTTP[^\\\"]*\" 200 \\S+ \"([^\\\"]+)\""))
(defparameter *datehash* (make-hash-table))

(defun record-entry (date request referrer)
  (declare (ignore request referrer))
  (let* ((currentcount (gethash date *datehash* 0)))
    (setf (gethash date *datehash*) (1+ currentcount))))

(defun record-string (logstring)
    (register-groups-bind 
      (day month year request referrer) 
      (*logscanner* logstring)
      (record-entry (numerical-date day month year) request referrer)))

(defun parse-file (filename)
  (with-open-file (s filename)
    (do ((string (read-line s) (read-line s nil 'eof)))
        ((eql string 'eof) nil)
      (record-string string))))

(defun printstats (filename)
  (parse-file filename)
  (dolist (x (sort 
               (loop for k being the hash-key of *datehash*
                     collect (list k (gethash k *datehash*)))
               #'>
               :key #'car))
    (format t "~a ~a~%" (car x) (cadr x))))


COM.JALAT.LOGPARSE 28 > (printstats "~/poundlog")
20060128 1986
20060127 325
20060126 33
20060125 29
20060124 39
20060123 25
20060122 73
20060121 30
20060120 32
20060119 67
20060118 103
20060117 116
20060116 96
20060115 98
20060114 89
20060113 130
20060112 123

The code above is quite simple I hope, create-scanner creates a closure that I use in the now renamed record-string. The result is passed on to record-entry that updates the hash table of hits. parse-file loops through a file calling record-string for each line. printstats just prints out the keys/values of the hash table.

Ok, this raises the obvious question, what happened on the 27/28 of January? 30 visitors a day is unlikely enough (I'm not going to bother showing this, but there is a surprisingly large amount of bots traversing the net :) so what on earth triggered almost 2000 requests on the 28th? Let's have a look at where they come from. Instead of keeping a single number for each date, I'll have a hash table for each date, and in that hash table record the number of times each url appear as a referrer:



(defun record-entry (date request referrer)
  (declare (ignore request))
  (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal))))
    (incf (gethash referrer referrerhash 0))
    (setf (gethash date *datehash*) referrerhash)))

Hash tables by default uses eql to compare values. This works well with the numerical key I use for the dates, but won't work too well for the url strings. That is why I the :test argument to choose a different equality test. The possible choices are eq, eql, equal and equalp. For string keys equal will do nicely.


(defun printstats (filename)
  (setf *datehash* (make-hash-table))
  (parse-file filename)
  (dolist (date-entry (sort 
                       (loop for k being the hash-key of *datehash*
                             collect (list k (gethash k *datehash*)))
                       #'>
                       :key #'first))
    (format t "~%~a:~%" (first date-entry))
    (dolist (referrer (sort 
                       (loop for k being the hash-key of (second date-entry)
                             collect (list k (gethash k (second date-entry))))
                       #'>
                       :key #'second))
      (format t "  ~a ~a~%" (second referrer) (first referrer)))))

printstats is updated with an inner loop that loops over the hashes in the date hashtable, sorting by the number of request from the url. And the result:


COM.JALAT.LOGPARSE 19 > (printstats "~/poundlog")

20060128:
  621 http://www.jalat.com/blogs/lisp
  433 http://www.jalat.com/blogs/lisp?id=4
  242 http://planet.lisp.org/
  153 http://www.jalat.com/
  113 http://www.jalat.com/blogs/lisp?id=3
  67 http://www.jalat.com/blogs/lisp?id=1
  53 http://www.jalat.com/blogs/lisp?id=2
  42 http://weitz.de/hunchentoot/
  40 http://lemonodor.com/archives/001339.html
  31 http://lemonodor.com/
  28 http://www.lemonodor.com/
  17 http://www.google.com/reader/lens/
  7 http://keithdevens.com/weblog
  5 http://anarchaia.org/

I cut off most of this list, it's 108 different referrers for the 28th january and while there are some interesting entries there, they are not responsible for the spike on the 27/28th. It turns out quite a few people read http://planet.lisp.org and that Mr. John Wiseman aka. lemonodor mentioned my ramblings in his blog which happens to be aggregated on planet.lisp.org.

Anyway, the biggest referrer seems to be myself. Mostly links to images, is my guess. Shouldn't be too hard to check. In fact, since is is possible to use a list as a key to the hash table, I only have to make the key to my hash a list of (referrer request). A simple two line change:


(defun record-entry (date request referrer)
  (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal))))
    (incf (gethash (list referrer request) referrerhash 0)) 
    (setf (gethash date *datehash*) referrerhash)))

(defun printstats (filename)
  (setf *datehash* (make-hash-table))
  (parse-file filename)
  (dolist (date-entry (sort 
                       (loop for k being the hash-key of *datehash*
                             collect (list k (gethash k *datehash*)))
                       #'>
                       :key #'first))
    (format t "~%~a:~%" (first date-entry))
    (dolist (referrer (sort 
                       (loop for k being the hash-key of (second date-entry)
                             collect (list k (gethash k (second date-entry))))
                       #'>
                       :key #'second))
      (format t "  ~a ~a => ~a~%" (second referrer) (first (first referrer)) (second (first referrer)))))
  *datehash*)


COM.JALAT.LOGPARSE 22 > (printstats "~/poundlog")

20060128:
  266 http://www.jalat.com/blogs/lisp => /static-files/pieces.gif
  227 http://www.jalat.com/blogs/lisp => /static-files/stylesheet.css
  197 http://www.jalat.com/blogs/lisp?id=4 => /static-files/pieces.gif
  186 http://www.jalat.com/blogs/lisp?id=4 => /static-files/stylesheet.css
  96 http://planet.lisp.org/ => /blogs/lisp?id=4
  94 http://planet.lisp.org/ => /blogs/lisp
  85 http://www.jalat.com/ => /blogs/lisp
  54 http://www.jalat.com/blogs/lisp => /blogs/lisp?id=3
  53 http://www.jalat.com/blogs/lisp?id=3 => /static-files/stylesheet.css
  52 http://planet.lisp.org/ => /blogs/lisp?id=3
  42 http://weitz.de/hunchentoot/ => /
  40 http://www.jalat.com/ => /static-files/stylesheet.css
  33 http://www.jalat.com/blogs/lisp?id=2 => /blogs/lisp?id=1
  31 http://www.jalat.com/blogs/lisp?id=3 => /blogs/lisp?id=2
  29 http://www.jalat.com/blogs/lisp?id=1 => /static-files/flipper-linux.jpg
  24 http://www.jalat.com/blogs/lisp => /blogs/lisp?id=1
  21 http://lemonodor.com/archives/001339.html => /blogs/lisp

Not just pictures it seems. The stylesheet is a big contributor as well. I'm not all that interested in this as they aren't really hits, so I'm going to filter it out. At the same time i'll add in the missing hostname for the request strings.


(defun record-entry (date request referrer)
  (unless (scan "static-files|favicon.ico|flipper-linux.jpg" request)
    (let* ((referrerhash (gethash date *datehash* (make-hash-table :test 'equal)))
           (full-request (concatenate 'string "http://www.jalat.com" request)))
      (incf (gethash (list referrer full-request) referrerhash 0))
      (setf (gethash date *datehash*) referrerhash))))

scan is another function from cl-ppcre. It returns indexes to where the pattern is found in the string, or nil if the pattern is not found. I'm not interested in the index, but simply use it as a true/false test for if the pattern is in the string. The resulting output is:


20060128:
  96 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp?id=4
  94 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp
  85 http://www.jalat.com/ => http://www.jalat.com/blogs/lisp
  54 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=3
  52 http://planet.lisp.org/ => http://www.jalat.com/blogs/lisp?id=3
  42 http://weitz.de/hunchentoot/ => http://www.jalat.com/
  33 http://www.jalat.com/blogs/lisp?id=2 => http://www.jalat.com/blogs/lisp?id=1
  31 http://www.jalat.com/blogs/lisp?id=3 => http://www.jalat.com/blogs/lisp?id=2
  24 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=1
  21 http://lemonodor.com/archives/001339.html => http://www.jalat.com/blogs/lisp
  16 http://www.jalat.com/blogs/lisp => http://www.jalat.com/blogs/lisp?id=2
  16 http://www.jalat.com/blogs/lisp?id=1 => http://www.jalat.com/blogs/lisp?id=2
  15 http://www.jalat.com/blogs/lisp?id=3 => http://www.jalat.com/blogs/lisp?id=1
  15 http://www.jalat.com/blogs/lisp?id=4 => http://www.jalat.com/blogs/lisp?id=3
  15 http://lemonodor.com/archives/001339.html => http://www.jalat.com/blogs/lisp?id=4
  12 http://www.lemonodor.com/ => http://www.jalat.com/blogs/lisp
  12 http://www.lemonodor.com/ => http://www.jalat.com/blogs/lisp?id=4
  12 http://lemonodor.com/ => http://www.jalat.com/blogs/lisp
  11 http://lemonodor.com/ => http://www.jalat.com/blogs/lisp?id=4

It works even though it's not that pretty. I'm going to leave it at that anyway, because I'm trying to shorten these blog entries and I've hopefully demonstrated some regexps and hash table usage by now. Throw it into a table for more presentable output. And you probably want to combine www.domain.name and domain.name urls. I'm going to finish off with someting a bit different:

CL-DOT is a package by Juho Snellman that makes it easy to make data files for the dot program in the Graphviz suite. It works by defining methods that describe how different objects should be represented, and then passing a data structure to generate-graph.

generate-graph will generate a graph that can be saved in different ways, I just save it to a dot file, and later use dot to generate the picture. It is also possible to have cl-dot generate ps/jpeg/etc. directly.

(defparameter *dothash* (make-hash-table))

(defun record-entry (date request referrer)
  (unless (scan "static-files" request)
    (let* ((referrerhash (gethash date *datehash* (make-hash-table)))
           (interned-request (intern (concatenate 'string "http://www.jalat.com" request)))
           (interned-referrer (intern referrer))
           (ref-dothash (gethash interned-referrer *dothash* (make-hash-table))))
      (incf (gethash (list interned-referrer interned-request) referrerhash 0))
      (setf (gethash date *datehash*) referrerhash)
      (incf (gethash interned-request ref-dothash 0))
      (incf (gethash :num-referrals ref-dothash 0))
      (setf (gethash interned-referrer *dothash*) ref-dothash))))

I have to make one workaround. cl-dot keeps track of object by keeping them in a hash. That means that I have to turn the strings into symbols. intern does just that. *dothash* Each entry in the *dothash* hash has a url as a key and the value is another hash which lists the urls the first url refers to and the number of times it refers to them. I'm also putting in a special symbol :num-referrals that contains the sum of the urls referrals.

BTW: The use of hash tables here is a bit dubious. Each url usually only refers to a couple of other urls. So the hash tables are very small, and the overhead is high. A list would probably have been a more efficient choice.


(defpackage :com.jalat.logparse
  (:use :cl :cl-ppcre :cl-dot))

(defmethod object-node ((object list))
  nil)

(defmethod object-knows-of ((object list))
  object)

I'm going to pass a bunch of urls in a list to the generate-graph function. I don't want to print the list in the graph, it's just a container for the data, so object-node for lists just return nil. I do need cl-dot to model the contents of the list, so I have to make the contents known. I do this with the object-knows-of method.


(defmethod object-node ((object symbol))
  (make-instance 'node
                 :attributes (list :label object
                                   :shape :box)))

(defmethod object-points-to ((object symbol))
  (let ((reqhash (gethash object *dothash*)))
    (when reqhash
      (remove-if 'null
                 (loop for i being the hash-key of reqhash
                       collect (unless (eql :num-referrals i)
                                 (make-instance 'attributed
                                                :object i
                                                :attributes (list :label (gethash i reqhash)))))))))

I'm going to represent symbols (urls) as boxes, and the symbol itself as the label of the box. Urls point to other urls, finding the urls this url refers to is as simple as looking it up in the *dothash* table and removing the :num-referrals symbol.


(defun make-graph (filename)
  (with-open-file (out filename :direction :output :if-exists :supersede)
    (cl-dot:print-graph
     (cl-dot:generate-graph 
      (mapcar #'first
              (subseq (sort (loop for k being the hash-key of *dothash*
                                  collect (let ((refhash (gethash k *dothash*)))
                                            (list k (gethash :num-referrals refhash))))
                            #'>
                            :key #'second)
                      0 20)))
     out)))



I could just generate a huge graph with all the urls, but it gets quite messy. Here the loop generates a list of (url num-referrals), sort sorts it based on the num-referrals, subseq makes alist of the first 20, mapcar makes a list of only the urls, generate-graph makes the graph, and print-graph writes it to a file.

Using dot I can now generate a graph of the top referring urls to my site. As you can see it's a bit messy even with 20 urls. (It's also blurry due to resizing...) I still think it's kind of neat in a way.

As usual: Feedback welcome at asbjxrn@bjxrnstad.net

No comments: