Detailed Amazon S3 Logging

Just as there is a wealth of value to be gained from analyzing your web logs, there is a lot to be learned from analyzing the activity logs from Amazon's S3 storage service.

In addition to getting a detailed view into your aggregate billing, you can answer questions like: which IP addresses does most of my outbound bandwidth go to? Do I have automated processes repeatedly uploading the same file to the same key, giving me no added value but costing me inbound bandwidth?

These are just some of the questions that you can quickly and easily address once detailed logging is available to you. First and foremost, you need to enable logging on the buckets you wish to track. I would advise tracking all of your buckets, even the bucket you set up to catch your logs. Better to cast a wide net early and filter down than to end up wishing you had captured something. You can always delete what you don't want, but you can't go back in time and capture something later.

Turning on logging is a matter of issuing a command for each one of your buckets, which takes just a few lines of Python using the boto framework:

First set the following environment variables with the proper values:

$~ AWS_ACCESS_KEY_ID=[YOUR_KEY_ID]
$~ AWS_SECRET_ACCESS_KEY=[YOUR_SECRET_KEY]
$~ export AWS_ACCESS_KEY_ID
$~ export AWS_SECRET_ACCESS_KEY
$~ python
>>> from boto.s3.connection import S3Connection
>>> con = S3Connection()
>>> buckets = con.get_all_buckets()
>>> for bucket in buckets:
...     bucket.enable_logging(
...          target_bucket='[YOUR_LOGGING_BUCKET]',
...          target_prefix=bucket.name)
>>>

You'll want to replace "[YOUR_LOGGING_BUCKET]" with the name of the bucket you create to hold all of your logs. Note that S3 will only deliver logs to a target bucket whose ACL grants write permission to Amazon's Log Delivery group, so make sure your logging bucket is set up that way first.

Now that you will be getting logs dumped into this bucket, you'll need a way to view and analyze them. Looking at individual lines in a log is not nearly as interesting or useful as being able to "pivot" on the data. For me the most natural way to do this is using SQL to apply aggregate functions (e.g. count, sum) and group by different fields.
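
For example, once the logs are loaded into the "log" table described later in this post, a query along these lines (just a sketch; it assumes the standard S3 log operation value REST.GET.OBJECT) shows which remote IPs account for the most outbound bytes:

-- Top 20 remote IPs by outbound bytes.
SELECT remote_ip,
       COUNT(*)        AS requests,
       SUM(bytes_sent) AS total_bytes_sent
  FROM log
 WHERE operation = 'REST.GET.OBJECT'
 GROUP BY remote_ip
 ORDER BY total_bytes_sent DESC
 LIMIT 20;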

So to do this you'll need to have some way of getting that data out of your logging bucket and into a database.

I have written the following Python, Bash, and SQL scripts to download and import this data into a MySQL database containing one denormalized table that matches the format of the log file. You can easily run this from a cron job at whatever frequency you need to keep your database up to date.

First the shell script:

#!/bin/bash
# Fetch any new log files and record which ones were downloaded.
cd logs
s3get.py [YOUR_LOGGING_BUCKET] | grep "Fetching " | cut -d " " -f 2 > process.list
cd ..
# Rebuild the cleaned log file, replacing the brackets around the timestamp
# with quotes so MySQL will treat it as a single field.
cp /dev/null clean_logs.txt
for x in `cat logs/process.list`; do
    cat logs/$x | sed 's/\[/"/' | sed 's/\]/"/' >> clean_logs.txt
done
# Load the cleaned rows into the local MySQL "logs" database.
mysql logs < load_log_data.sql

You'll need to replace "[YOUR_LOGGING_BUCKET]" with the same bucket you created above to hold all of your logs on S3. You'll also need to create the directory "logs" or modify the script to point to the path you want.

The "s3get.py" is a simple python script I wrote using the boto framework to download a single file or a group of files for a given bucket depending on how much or little of a key's name I provided (from nothing to fetch the entire bucket, which is what I do above, to a specific named key as the second parameter). This script will check for the existence of the base name of the key in the current working directory and skip if if a file of the same name exists (this is one of things that makes the bash script above re-entrant).

s3get.py:

#!/usr/bin/python
# s3get.py - download every key in a bucket (optionally restricted to a
# prefix), skipping any file that already exists locally.
from boto.s3.connection import S3Connection
import os
import sys

def main():
    if len(sys.argv) < 2:
        sys.exit('Usage: s3get.py BUCKET [PREFIX]')

    # Credentials come from the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    # environment variables set earlier.
    c = S3Connection()
    b = c.get_bucket(sys.argv[1])

    if len(sys.argv) == 3:
        blist = b.list(prefix=sys.argv[2])
    else:
        blist = b.list()

    for k in blist:
        fname = os.path.basename(k.name)
        # Skip keys that have already been downloaded; this is part of what
        # makes the calling shell script re-entrant.
        if os.path.exists(fname):
            print "Local copy of %s already exists." % fname
            continue

        try:
            print "Fetching %s" % k.name
        except UnicodeEncodeError:
            print "Fetching a file that had unprintable characters in filename"

        k.get_contents_to_filename(fname, cb=None, num_cb=10)

if __name__ == "__main__":
    main()

As stated previously, in order for these boto-based Python scripts to work, your AWS access key ID and secret access key environment variables must be set.

Finally, make sure you have a local MySQL server running with a database called "logs" and a single table called "log" with the following definition:

CREATE TABLE `log` (
  `bucket_owner` varchar(512) default NULL,
  `bucket` varchar(256) default NULL,
  `time` varchar(125) default NULL,
  `remote_ip` varchar(50) default NULL,
  `requestor` varchar(512) default NULL,
  `request_id` varchar(50) default NULL,
  `operation` varchar(125) default NULL,
  `keyname` varchar(2048) default NULL,
  `request_uri` varchar(4096) default NULL,
  `http_status` varchar(10) default NULL,
  `error_code` varchar(125) default NULL,
  `bytes_sent` int(11) default NULL,
  `object_size` int(11) default NULL,
  `total_time` int(11) default NULL,
  `turn_around_time` int(11) default NULL,
  `referrer` varchar(4096) default NULL,
  `user_agent` varchar(4096) default NULL,
  `ctime` datetime default NULL,
  UNIQUE KEY `unq_request_id_indx` (`request_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
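
Once the loader below has populated this table, a quick check for the second question raised earlier (automated processes re-uploading the same key) might look something like this sketch, which assumes the standard S3 log operation value REST.PUT.OBJECT:

-- Keys uploaded more than once in the last week, which usually means some
-- automated process is re-sending data it has already stored.
SELECT bucket, keyname, COUNT(*) AS uploads
  FROM log
 WHERE operation = 'REST.PUT.OBJECT'
   AND ctime > DATE_SUB(NOW(), INTERVAL 7 DAY)
 GROUP BY bucket, keyname
HAVING uploads > 1
 ORDER BY uploads DESC;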

The "load_log_data.sql" script loads the data and transform the text date column into an actual date that will accept date functions for better manipulation and analyzing of the data:

-- IGNORE plus the unique key on request_id means rows that were loaded on a
-- previous run are silently skipped.
LOAD DATA LOCAL INFILE 'clean_logs.txt' 
IGNORE
INTO TABLE log 
FIELDS TERMINATED BY ' ' 
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n' 
(   bucket_owner, bucket, time, remote_ip, requestor, 
    request_id, operation, keyname, request_uri, 
    http_status, error_code, bytes_sent, object_size, 
    total_time, turn_around_time, referrer, user_agent
);
-- Convert the GMT text timestamp into a local-time DATETIME for the rows
-- that were just loaded.
UPDATE log 
   SET ctime = CONVERT_TZ(STR_TO_DATE(time, '%d/%b/%Y:%k:%i:%s +0000'), '+00:00', '-05:00')
 WHERE ctime is null;

Lastly, note that Amazon S3 writes its log timestamps in GMT. I am in CDT and want all of my times to be local and relevant for me, which is what the CONVERT_TZ call above does. Adjust the "'-05:00'" to match your locale or desired time zone.
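
As a final example of why the ctime conversion is worth the trouble, a query like the following (again, just a sketch against the table above) rolls outbound bandwidth up by day:

-- Outbound bytes per day, using the converted local-time ctime column.
SELECT DATE(ctime)     AS day,
       COUNT(*)        AS requests,
       SUM(bytes_sent) AS total_bytes_sent
  FROM log
 GROUP BY DATE(ctime)
 ORDER BY day DESC;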