Labeled Tab-separated Values

Description

The Labeled Tab-separated Values (LTSV) format is a variant of Tab-separated Values (TSV). Each record in an LTSV file is represented as a single line. Fields are separated by TAB, and each field consists of a label and a value separated by ':'. With the LTSV format, you can parse each line simply by splitting on TAB (as with the original TSV format), and you can add any fields you like, in no particular order, as long as their labels are unique.

FAQ

Follow the link.

Example

The LTSV format was originally aimed at web server access logs, so the examples below show the same access log entry in the traditional Combined Log Format and in LTSV.

The configuration of traditional Combined Log Format on Apache is:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

and the access log will look like this (ref. http://httpd.apache.org/docs/2.2/logs.html):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The LTSV configuration carrying the same information is:

LogFormat "host:%h\tident:%l\tuser:%u\ttime:%t\treq:%r\tstatus:%>s\tsize:%b\treferer:%{Referer}i\tua:%{User-Agent}i" combined_ltsv

and the access log will then look like:

host:127.0.0.1<TAB>ident:-<TAB>user:frank<TAB>time:[10/Oct/2000:13:55:36 -0700]<TAB>req:GET /apache_pb.gif HTTP/1.0<TAB>status:200<TAB>size:2326<TAB>referer:http://www.example.com/start.html<TAB>ua:Mozilla/4.08 [en] (Win98; I ;Nav)

Here is a simple LTSV parser:

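#!/usr/bin/env ruby

# Read records line by line from the files given as arguments, or from standard input.
while gets
  # Drop the line terminator, split the fields on TAB, then split each field
  # at its first ":" only, because values may contain ":" themselves.
  record = Hash[$_.chomp.split("\t").map{|f| f.split(":", 2)}]
  p record
end

With this parser, you will get a hash like:

{"host"=>"127.0.0.1", "ident"=>"-", "user"=>"frank", "time"=>"[10/Oct/2000:13:55:36 -0700]", "req"=>"GET /apache_pb.gif HTTP/1.0", "status"=>"200", "size"=>"2326", "referer"=>"http://www.example.com/start.html", "ua"=>"Mozilla/4.08 [en] (Win98; I ;Nav)"}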
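
Going the other way is just as short. The following is a minimal sketch, not part of the specification, that serializes a Ruby hash into one LTSV line. The method name to_ltsv_line is a hypothetical choice made for illustration, and the sketch rejects values containing TAB, CR, or LF because the grammar provides no way to escape them.

```ruby
# Hypothetical helper (not from the LTSV specification): build one LTSV line
# from a hash. Values containing TAB, CR, or LF are rejected because the
# format has no escaping mechanism.
def to_ltsv_line(record)
  record.map { |label, value|
    value = value.to_s
    raise ArgumentError, "invalid value for #{label}" if value.match?(/[\t\r\n]/)
    "#{label}:#{value}"
  }.join("\t")
end

line = to_ltsv_line({ "host" => "127.0.0.1", "status" => 200, "req" => "GET / HTTP/1.0" })
puts line

# Parsing the line again restores the fields; every value comes back as a string.
p Hash[line.split("\t").map { |f| f.split(":", 2) }]
```

Note that the round trip does not preserve types: the integer 200 is read back as the string "200", since LTSV values are plain byte sequences.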

Definition

An LTSV file must be a byte sequence which matches the ltsv production in the following ABNF:

ltsv = *(record NL) [record]
record = [field *(TAB field)]
field = label ":" field-value
label = 1*lbyte
field-value = *fbyte

TAB = %x09
NL = [%x0D] %x0A
lbyte = %x30-39 / %x41-5A / %x61-7A / "_" / "." / "-" ;; [0-9A-Za-z_.-]
fbyte = %x01-08 / %x0B / %x0C / %x0E-FF
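
The grammar can be checked mechanically. The sketch below is one possible reading of the record production for a single line whose line terminator (NL) has already been removed. The method name valid_record? is hypothetical and does not come from any LTSV library. It relies on the fact that fbyte excludes exactly four bytes: NUL (%x00), TAB (%x09), LF (%x0A), and CR (%x0D).

```ruby
# Hypothetical validator for one LTSV record (line terminator already removed).
def valid_record?(line)
  bytes = line.b                      # work on raw bytes, as the ABNF does
  return true if bytes.empty?         # record = [field *(TAB field)] may be empty
  bytes.split("\t", -1).all? do |field|
    label, value = field.split(":", 2)   # only the first ":" is a separator
    next false if value.nil?             # the field has no ":" at all
    next false unless label.match?(/\A[0-9A-Za-z_.-]+\z/)   # label = 1*lbyte
    # TAB cannot remain after the split, so only NUL, LF, and CR are checked.
    value.each_byte.none? { |b| b == 0x00 || b == 0x0A || b == 0x0D }
  end
end

p valid_record?("host:127.0.0.1\tstatus:200")          # => true
p valid_record?("time:[10/Oct/2000:13:55:36 -0700]")   # => true
p valid_record?("no separator")                        # => false
```

The second call is valid because ':' is an ordinary fbyte: only the first ':' in a field ends the label.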

Recommendations for labeling

The LTSV specification is simple and primitive. Nevertheless, standardizing labels can improve the reusability of implementations for processing and analysis.

Labels for Web server's Log

The table below lists the recommended labels, their descriptions, and the corresponding format strings for Apache and nginx.

Recommended Label | Description | Format String of Apache mod_log_config | Format String of nginx log format
time | Time the request was received | %t | $time_local
host | Remote host | %h | $remote_addr
forwardedfor | X-Forwarded-For header | %{X-Forwarded-For}i | $http_x_forwarded_for
ident | Remote logname | %l | -
user | Remote user | %u | $remote_user
req | First line of request | %r | $request
method | Request method | %m | $request_method
uri | Request URI | %U%q | $request_uri
protocol | Requested Protocol (usually "HTTP/1.0" or "HTTP/1.1") | %H | $server_protocol
status | Status code | %>s | $status
size | Size of response in bytes, excluding HTTP headers. | %B (or '%b' for compatibility with combined format) | $body_bytes_sent
reqsize | Bytes received, including request and headers. | %I (mod_log_io required) | $request_length
referer | Referer header | %{Referer}i | $http_referer
ua | User-Agent header | %{User-agent}i | $http_user_agent
vhost | Host header | %{Host}i | $host
reqtime_microsec | The time taken to serve the request, in microseconds | %D | -
reqtime | The time taken to serve the request, in seconds | %T | $request_time
cache | X-Cache header | %{X-Cache}o | $upstream_http_x_cache
runtime | Execution time for processing some request, e.g. X-Runtime header for application server or processing time of SQL for DB server. | %{X-Runtime}o | $upstream_http_x_runtime
apptime | Response time from the upstream server | - | $upstream_response_time
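
Shared labels are what make analysis code reusable across servers. As an illustration only, the following sketch counts responses per status code and averages the request time, relying on nothing but the recommended labels status and reqtime. The method name summarize and the sample lines are made up for this example, and the code assumes every field is well formed.

```ruby
# Hypothetical analysis sketch: count responses per status code and compute
# the average of the "reqtime" field, using only the recommended labels.
def summarize(lines)
  counts = Hash.new(0)
  times = []
  lines.each do |line|
    # Same parsing approach as the simple parser: TAB first, then the first ":".
    record = line.chomp.split("\t").map { |f| f.split(":", 2) }.to_h
    counts[record["status"]] += 1 if record.key?("status")
    times << record["reqtime"].to_f if record.key?("reqtime")
  end
  average = times.empty? ? nil : times.sum / times.size
  { "status_counts" => counts, "average_reqtime" => average }
end

# Made-up sample records; field order and extra fields differ on purpose.
sample = [
  "status:200\treqtime:0.25\tvhost:example.com\n",
  "reqtime:0.75\tstatus:200\n",
  "status:404\tsize:0\n",
]
p summarize(sample)
```

Because fields are looked up by label, the differing field order and the extra vhost and size fields in the sample do not affect the result.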

A LogFormat example for Apache mod_log_config:

LogFormat "time:%t\tforwardedfor:%{X-Forwarded-For}i\thost:%h\treq:%r\tstatus:%>s\tsize:%B\treferer:%{Referer}i\tua:%{User-Agent}i\treqtime_microsec:%D\tcache:%{X-Cache}o\truntime:%{X-Runtime}o\tvhost:%{Host}i" ltsv
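
A directive like this is simply the chosen label and format string pairs joined by a literal \t. As a rough sketch, with apache_log_format being a hypothetical helper name and not an existing tool, such a line could be generated from a subset of the table:

```ruby
# Hypothetical helper: join label/format pairs into an Apache LogFormat line.
# The separator is the two characters backslash and "t", which Apache
# turns into a TAB when it writes the log.
def apache_log_format(fields, nickname)
  body = fields.map { |label, spec| "#{label}:#{spec}" }.join('\t')
  %(LogFormat "#{body}" #{nickname})
end

puts apache_log_format({ "time" => "%t", "host" => "%h", "status" => "%>s" }, "ltsv")
# prints: LogFormat "time:%t\thost:%h\tstatus:%>s" ltsv
```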

A log_format example for nginx:

log_format ltsv "time:$time_local"
                "\thost:$remote_addr"
                "\tforwardedfor:$http_x_forwarded_for"
                "\treq:$request"
                "\tstatus:$status"
                "\tsize:$body_bytes_sent"
                "\treferer:$http_referer"
                "\tua:$http_user_agent"
                "\treqtime:$request_time"
                "\tcache:$upstream_http_x_cache"
                "\truntime:$upstream_http_x_runtime"
                "\tvhost:$host";

Tools supporting LTSV

fluentd

fluentd (http://fluentd.org/) can parse LTSV files with the in_tail plugin. The configuration looks like this:

<source>
  type tail
  format ltsv
  time_format %d/%b/%Y:%H:%M:%S %z
  path /var/log/nginx/access_log
  pos_file /var/log/nginx/access_log.pos
  tag nginx.access
</source>

plugins for fluentd

ltsview

Plack::Middleware::AxsLog

combined2ltsv.pl

MCombined2LTSV.java

Parser Implementations

Perl

Ruby

Python

PHP

Java

D

Dart

Emacs Lisp

Scheme

node.js

Erlang

C#

Go

Clojure

Scala

bash / ksh

Vim

C89

Apache Pig

Apache Hive