Regular expression to extract cookie value from Apache access logs
First published on January 21, 2016
I was recently troubleshooting a problem where I needed to extract cookie values and IP addresses from Apache access logs. In short, cookies were being shared across sessions instead of being unique to each session. The Apache log entries looked something like this:
216.37.10.126 - [29/Oct/2015:23:59:46 -0400] "GET /user/profile HTTP/1.1" 503 17839 "https://www.site.com/user/login" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36" "SESSID=59n42qa556o2k08ekmbmlhgdg1; othercookie=(direct)" -
Using this command, I could extract the cookie values for SESSID and save them to a file:
grep -ir 'GET \/user\/profile HTTP\/1.1" 503' /web/logs/access_log | sed -r 's/.*SESSID\=(.*)[;|"].*/\1/' > 503_cookies.txt
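The sed regular expression wasn’t stopping the match at the first semicolon or quote, however. Because (.*) is greedy, it captures as much of the line as it can and only backs off far enough for the bracket expression to match the last semicolon or quote on the line, so the capture included the other cookies as well. Instead of using (.*) to capture any character, I had to use [^;"]* for “any run of characters that are not a semicolon or quote”, even though the match on the same characters happens outside of the parentheses. (The | inside the brackets is not alternation; within a bracket expression it matches a literal pipe, so [;"] is enough.)
grep -ir 'GET \/user\/profile HTTP\/1.1" 503' /web/logs/access_log | sed -r 's/.*SESSID\=([^;"]*)[;"].*/\1/' > 503_cookies.txt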
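To see the difference in isolation, here is a minimal sketch that runs both expressions against just the cookie field from the sample entry. The variable names (line, greedy, bounded) are made up for illustration, and it assumes GNU sed, where -r enables extended regular expressions (-E is an equivalent spelling).

```shell
# Cookie field from the sample log entry, used as stand-in input.
line='"SESSID=59n42qa556o2k08ekmbmlhgdg1; othercookie=(direct)" -'

# Greedy capture: (.*) backs off only as far as the LAST ; or " on the line.
greedy=$(printf '%s\n' "$line" | sed -r 's/.*SESSID=(.*)[;"].*/\1/')

# Bounded capture: [^;"]* cannot cross a ; or ", so it ends at the first one.
bounded=$(printf '%s\n' "$line" | sed -r 's/.*SESSID=([^;"]*)[;"].*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The greedy version returns the session ID plus everything up to the closing quote of the cookie field, while the bounded version returns only the session ID.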
Further work was needed to grab the IP addresses for the matches and save them to another file. Here I didn’t need a regular expression, as I could just grab the first column with awk:
grep -ir 'GET \/user\/profile HTTP\/1.1" 503' /web/logs/access_log | awk '{print $1}' > 503_ips.txt
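Then I could use the paste command to put each IP address and its cookie value on the same line of the report and collapse all duplicate entries. This works because paste joins the two files line by line, and both files were produced from the same grep output in the same order: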
paste 503_ips.txt 503_cookies.txt | sort | uniq
Each line then looked something like this:
216.37.10.126 59n42qa556o2k08ekmbmlhgdg1
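As a possible alternative to the two intermediate files, the IP address and cookie value can be pulled out in a single sed pass, which keeps each pair together without relying on line order. This is only a sketch, not the method used above: it assumes GNU sed and feeds in the sample entry from this post through a made-up variable (entry) in place of /web/logs/access_log.

```shell
# Sample entry from this post; a real run would read the access log instead.
entry='216.37.10.126 - [29/Oct/2015:23:59:46 -0400] "GET /user/profile HTTP/1.1" 503 17839 "https://www.site.com/user/login" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36" "SESSID=59n42qa556o2k08ekmbmlhgdg1; othercookie=(direct)" -'

# 1. grep keeps only the 503 responses for /user/profile.
# 2. sed captures the first field (the IP) and the SESSID value;
#    -n with the trailing p prints only lines where the substitution matched,
#    so entries without a SESSID cookie are dropped instead of passed through.
# 3. sort -u collapses duplicates, like sort | uniq.
pairs=$(printf '%s\n' "$entry" \
  | grep 'GET /user/profile HTTP/1.1" 503' \
  | sed -n -r 's/^([^ ]+) .*SESSID=([^;"]*)[;"].*/\1 \2/p' \
  | sort -u)

printf '%s\n' "$pairs"
```

One difference worth noting: without -n and p, sed prints non-matching lines unchanged, so a 503 entry with no SESSID cookie would land in the cookie file as a full log line.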