Git Statistics — Simpler, Faster

Posted: January 1, 2010 in linux

Today I’ve reworked LookAtGit, a project I originally implemented because (A) I was really bored, and (B) I wanted to learn Scala.

The new “v2” is done in Ruby, which is nice because the regexes there don’t make me want to hurt someone and Ruby is generally awesome. After a lot of corporate Python jobs, I’d forgotten how much I missed it. Ruby is fun.

So, lookatgit… I’ve also optimized it a HUGE amount since last time. For one, I found “git log --shortstat”, so it has to execute far fewer commands and can also skip binary commits.
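To give a feel for why one pass over “git log --shortstat” is enough, here is a minimal sketch of tallying per-author totals from that output. It is not the actual lookatgit code: the “@@” marker, the tally_shortstat name, and the sample log are all made up for illustration, and it assumes the log was produced with a format string of “@@%an <%ae>”.

```ruby
# Hypothetical marker (not from lookatgit) so author lines can be told
# apart from the stat lines that git prints after each commit.
AUTHOR_PREFIX = "@@"

# Tallies per-author totals from text shaped like the output of:
#   git log --shortstat --format="@@%an <%ae>"
def tally_shortstat(log_text)
  totals = Hash.new { |h, k| h[k] = { commits: 0, added: 0, removed: 0 } }
  author = nil
  log_text.each_line do |line|
    line = line.strip
    if line.start_with?(AUTHOR_PREFIX)
      author = line.delete_prefix(AUTHOR_PREFIX)
      totals[author][:commits] += 1
    elsif author && line =~ /\A\d+ files? changed/
      # Either count may be missing from a stat line; nil.to_i is 0.
      totals[author][:added]   += line[/(\d+) insertions?\(\+\)/, 1].to_i
      totals[author][:removed] += line[/(\d+) deletions?\(-\)/, 1].to_i
    end
  end
  totals
end

# Made-up sample input; the last commit has no stat line, as a merge might.
sample = <<~LOG
  @@Alice <alice@example.com>

   2 files changed, 10 insertions(+), 3 deletions(-)
  @@Bob <bob@example.com>

   1 file changed, 1 insertion(+)
  @@Alice <alice@example.com>
LOG

tally_shortstat(sample).each do |name, t|
  puts [name, t[:added] + t[:removed], t[:added], t[:removed], t[:commits]].join(",")
end
# Alice <alice@example.com>,13,10,3,2
# Bob <bob@example.com>,1,1,0,1
```

To try it on a real repo, the text could come from something like IO.popen(["git", "-C", repo, "log", "--shortstat", "--format=@@%an <%ae>"], &:read), where repo is the path to a checkout.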

The rewrite doesn’t yet offer some of the statistics-oriented statistics (SOS) that the previous version offered, but it does offer some additional reports and arbitrary field sorting, and it will make it possible to add lots more reports in the future.

To get started, check out the code from GitHub and read the README file in the “v2” directory.

You can see from those instructions how to do simple things like see the top 50 most active files in a repo, or the top 100 contributors sorted by arbitrary statistics.

Here’s an example from Spacewalk. Spacewalk is a HUGE repo. I’m limiting the length of the output below, but the report time is spent in the calculations rather than the output, so that doesn’t make it any faster. Even so, on a reasonable machine it only takes about a minute to scan the repo.

mdehaan@snowball:~/code/lookatgit/v2$ time ruby lookatgit.rb -r ~/code/spacewalk/ --limit 10 -T -F --header --verbose
scanning...
processing 9534 commits...
generating report...
--------------------------------------------------
TOP CONTRIBUTORS REPORT                           
name,lines_changed,lines_added,lines_removed,commit_ct
--------------------------------------------------
Miroslav Suchý <msuchy@redhat.com>,1587958,143182,1444776,1650
Jan Pazdziora <jpazdziora@redhat.com>,183227,121126,62101,1428
Michael Mraka <michael.mraka@redhat.com>,196978,94669,102309,781
Devan Goodwin <dgoodwin@redhat.com>,146951,44522,102429,611
Justin Sherrill <jsherril@redhat.com>,1398046,763492,634554,587
Pradeep Kilambi <pkilambi@redhat.com>,726023,361122,364901,476
Mike McCune <mmccune@gmail.com>,39470,27793,11677,447
jesus m. rodriguez <jesusr@redhat.com>,392192,112554,279638,430
Partha Aji <paji@redhat.com>,49294,29033,20261,406
Milan Zazrivec <mzazrivec@redhat.com>,42101,22507,19594,375
------------------------------------------
TOP FILES REPORT                          
filename,lines_changed,change_ct,author_ct,commit_ct
------------------------------------------
java/spacewalk-java.spec,2718,246,17,246
backend/spacewalk-backend.spec,2420,235,14,235
java/code/src/com/redhat/rhn/frontend/strings/jsp/StringResource_en_US.xml,23742,202,19,202
java/code/webapp/WEB-INF/struts-config.xml,8040,147,18,147
rel-eng/packages/spacewalk-java,291,146,13,146
web/spacewalk-web.spec,1739,139,11,139
java/code/src/com/redhat/rhn/frontend/strings/java/StringResource_en_US.xml,8999,128,14,128
schema/spacewalk/spacewalk-schema.spec,726,121,10,121
rel-eng/packages/spacewalk-backend,237,119,11,119
proxy/installer/spacewalk-proxy-installer.spec,533,99,7,99

real	1m1.190s
user	0m6.912s
sys	0m0.340s

The next step is to build in those “statistics-oriented statistics” and enhance the query capabilities. For instance, I’d like to be able to generate a report on the standard deviation of the times between commits, to show which developers on a given project are slacking off :) . Similarly, I’d like to generate aggregate statistics on a project so I can show that Project X contributors typically commit changes with certain distribution patterns, which may or may not be revealing.
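As a rough idea of what the time-between-commits report could compute, here is a small sketch. None of this exists in lookatgit yet: the commit_gap_stats name is hypothetical, the input is assumed to be Unix timestamps (for example from “git log --format=%at” for one author), and using the population standard deviation is just one possible choice.

```ruby
# Returns [mean, standard deviation] of the gaps, in seconds, between
# consecutive commits, or nil when there are fewer than two commits.
# Uses the population standard deviation (an assumption, not a given).
def commit_gap_stats(timestamps)
  sorted = timestamps.sort
  gaps = sorted.each_cons(2).map { |earlier, later| later - earlier }
  return nil if gaps.empty?

  mean = gaps.sum.to_f / gaps.size
  variance = gaps.sum { |g| (g - mean)**2 } / gaps.size
  [mean, Math.sqrt(variance)]
end

# Made-up timestamps: gaps of 10 and 20 seconds.
mean, sd = commit_gap_stats([0, 10, 30])
puts "mean gap: #{mean}s, std dev: #{sd}s"
# mean gap: 15.0s, std dev: 5.0s
```

A low standard deviation would suggest a steady committer, and a high one bursts of activity with long quiet stretches, though what that says about slacking off is open to interpretation.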

Contributors are very welcome. I currently do not have a project list, but if enough folks are interested we can get this going. It is not so much about what it can generate now but what we can generate in the future.

Comments
  1. i82much says:

    Very cool. Would be fun to put a front end on it, or make it a web service that analyzes github repos

    • mpdehaan says:

      I’m not working on (or maintaining) this and probably won’t for the near future, so feel free to fork/copy and do that. It would indeed be nice. I’d like github to do something like this built in.

  2. n3ko says:

    Nice work.
    I like the author summary.
    I see only one small flaw in it: some authors are split across separate rows

    User Name ,106936,90270,16666,1459,8.14,1.22
    User Name ,3009,2025,984,28,1781.21,64.44

    Git’s shortlog summary uses the .mailmap file in the root of the working tree and shows the combined result (commit count only):
    git shortlog -s
    1487 User Name
