Another Gource Video: Func
As a follow-up to my previous post, here’s a video from Func’s source control history, from conception until today. It differs a fair amount from Cobbler, mainly because commit attribution was preserved early on (due to using git am and related tools).
Source Code Visualization with Gource.
Gource is an amazing program for visualizing commit history in a git-based code project. What I like about it is that it can also show what areas of the project are active in an easy to understand way, to show whether there is community around a whole project or just aspects of it. What looks like a shiny useless visualization is, in fact, pretty useful stuff. I’ll get to that in a bit.
So, I needed something to scan, and past OSS things I’ve been involved with were logical first targets. To condense Cobbler’s history into a short video, I ran the following:
gource -s 0.03 --auto-skip-seconds 0.1 --file-idle-time 500 --max-files 500 --multi-sampling -1280x720 --stop-at-end --output-ppm-stream - | ffmpeg -y -b 3000K -r 24 -f image2pipe -vcodec ppm -i - -vcodec mpeg4 gource.mp4
Want to run this yourself? You will likely have to build gource from source. I’ll warn you that building from source involves installing a ton of deps, though all are in Ubuntu 9.10, and once ./configure finally passes it does build fast. Fedora was downlevel with respect to ftgl, and compiling ftgl from source was difficult, hence the Ubuntu usage. The parameters I use above result in a large video (75MB) but are intended for YouTube HD.
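If you want to render more than one repository, the one-liner above can be wrapped in a small script. This is only a sketch: the flags are copied from the command above, while the function names and the default output filename are my own placeholders, and it assumes gource and ffmpeg are both on your PATH.

```python
import subprocess

def build_commands(output="gource.mp4", resolution="1280x720"):
    """Return the (gource, ffmpeg) argument lists from the one-liner above."""
    gource = [
        "gource", "-s", "0.03",
        "--auto-skip-seconds", "0.1",
        "--file-idle-time", "500",
        "--max-files", "500",
        "--multi-sampling",
        "-" + resolution,
        "--stop-at-end",
        "--output-ppm-stream", "-",
    ]
    ffmpeg = [
        "ffmpeg", "-y", "-b", "3000K", "-r", "24",
        "-f", "image2pipe", "-vcodec", "ppm", "-i", "-",
        "-vcodec", "mpeg4", output,
    ]
    return gource, ffmpeg

def render(repo_dir, output="gource.mp4"):
    """Pipe gource's PPM stream into ffmpeg (hypothetical helper)."""
    gource_cmd, ffmpeg_cmd = build_commands(output)
    # gource runs inside the repository; ffmpeg writes to the current directory.
    gource = subprocess.Popen(gource_cmd, cwd=repo_dir, stdout=subprocess.PIPE)
    ffmpeg = subprocess.Popen(ffmpeg_cmd, stdin=gource.stdout)
    gource.stdout.close()  # so gource sees a closed pipe if ffmpeg exits early
    ffmpeg.wait()
    gource.wait()
    return ffmpeg.returncode
```

Calling render("/path/to/repo") should then behave like running the pipeline by hand from inside that checkout.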
The result is below. Note that I didn’t keep my source control commit attribution for the first couple of years on the project (lesson learned in how to use git!), I used to do development on devel and switched to master (this video shows master), and koan was grafted into the cobbler tree late in the game. Early on, I committed from two different user IDs. As a result, the video is not perfect: things tend to “pop” into view as releases happen. You’ll see the first outside attribution happen about halfway through, though of course this was happening much, much earlier. Still, the acceleration at the end, I think, means we achieved something pretty decent. Not all projects do.
Perhaps this is a start of a good meme. Get your code up on YouTube. Show us the life of your code and who you collaborate with.
I can see gource being immediately useful for one major purpose. When evaluating OSS software for use in business, you always need to know if the community is solid and self-sustaining. This allows you to watch a short video and find out. Coupled with looking through the mailing list archives, that’s a pretty good check. It can also help identify interesting patterns of large scale refactoring, new development, or stagnation.
Gource may also be a great way to explain open source to people who don’t immediately understand how collaboration can work, and how contributors come and go.
That is what I call good TV. It is also rather trippy to look at. Please turn up the Floyd.
If we had a free supercomputer and infinite development time, my ultimate dream visualization would be a 20×20 foot wall section of these graphs, showing multiple projects side by side, with developers flying between projects. Who flies between projects? Is that common? What are the clustering patterns of these projects that share contributors? Where are the hubs and spokes? Are the hubs bigger projects than the spokes? (Can we get that in 3D?). There is something to be learned here, even if we don’t know what that is.
Install Gource. Let’s see your project video.
VirtualBox & Wireless
I’ve arrived at a fairly nice home virtual machine developer setup that no longer cares whether I am docked or undocked. VirtualBox can set up a bridge from the GUI with no manual setup, and it can also bridge wlan0. AFAIK, Virt Manager can still do neither of these things, though perhaps if you edit the libvirt XML and dnsmasq configurations you could get close. Further, the newly created bridges do not cause problems with NetworkManager or Firefox (as I’ve had happen in Fedora numerous times).
What I have:
- Virtual machines are set up to be bridged on wlan0
- Virtual machines are configured statically, i.e. 192.168.1.150+N (outside the DHCP range of the router), gateway=router_ip, dns=host_ip
- Host runs dnsmasq with each machine in /etc/hosts
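Since dnsmasq serves names straight out of /etc/hosts, the addressing scheme above is easy to generate. A minimal sketch, assuming guests are numbered from 0 and using made-up guest names (the function name is also just for illustration):

```python
import ipaddress

def hosts_entries(names, base="192.168.1.150"):
    """Return /etc/hosts lines, giving guest N the address base + N."""
    start = ipaddress.IPv4Address(base)
    # Adding an integer to an IPv4Address yields the next address along.
    return ["%s\t%s" % (start + n, name) for n, name in enumerate(names)]

if __name__ == "__main__":
    # Hypothetical guest names; append the output to the host's /etc/hosts.
    for line in hosts_entries(["fedora-guest", "centos-guest", "debian-guest"]):
        print(line)
```

Each guest then gets the matching static address, the router as its gateway, and the host as its DNS server.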
(For purposes of full disclosure, this is on Ubuntu 9.10)
I would like to be using virt-manager and supporting my former Fedora cohorts instead, but this setup is way too portable and awesome. I’m also running on a Dual 3GHz system with 8GB of RAM, and … for some reason, VirtualBox seems faster (even with /dev/kvm present) and the graphics are also much better.
Anyway, I’m pleasantly surprised by this being as manageable as it is for a bridged setup.
I’d like it to be even easier, but I realize virtual machines outside of NAT are largely a server use case, so much more than that would probably be overkill — and server folk can handle it. I still can’t help but wonder what would happen if Virt-Manager was as usable as Virtual Box, and Virtual Box had Virt-Manager’s dnsmasq (with dynamic DNS so hostnames just worked everywhere) and also added a UI to configure it.
An aside: /sbin/dhclient-script clobbering /etc/resolv.conf? Not cool. Not cool.
(Really I’d like universally unique IPs all of the time, regardless of what networks I’m on. And a magic IP that is always the IP of the host the guest is running on. And SkyNet…)
Parsing CacheGrind from Ruby
Here’s a little script to show how many times files are accessed during a function call (or integration test) series, for use with tools like XDebug. In my particular application, I had a large codebase and didn’t know what files were touched (or not) during a relatively complex call chain.
require 'optparse'
require 'ostruct'

# Command-line options: --compact groups counts by directory instead of file.
options = OpenStruct.new()
options.compact = false
OptionParser.new do |opts|
  opts.banner = "Usage: cache_blaster.rb [options]"
  opts.on("-c", "--compact", "Report on directories, not files") do |c|
    options.compact = true
  end
end.parse!

# Count how often each file (or directory) shows up in an "fl=" line.
called = {}
ARGV.each do |filename|
  open(filename) do |handle|
    handle.each_line do |line|
      if line=~/^fl=(.*)/
        key = $1
        key = File.dirname(key) if key.include?("/") and options.compact
        called[key] = called.has_key?(key) ? called[key]+1 : 1
      end
    end
  end
end

# Sort by count, highest first; ties are broken by name, descending.
keys = called.keys.sort do |a,b|
  (called[b] == called[a]) ? b <=> a : called[b] <=> called[a]
end

keys.each do |file|
  begin
    printf("%06d | %s\n", called[file], file)
  rescue Errno::EPIPE
    # The reader (e.g. head) closed the pipe; ignore the error.
  end
end
Usage: (top 50 most accessed files)
ruby cache_blaster.rb /tmp/cachegrind* | head -n 50
The output is number of times something in each file was referenced. This could easily be adapted to also include function calls, or list top function calls for each file.
080976 | php:internal
029876 | /path/to/file/a.php
009454 | /path/to/file/b.php
005875 | /path/to/file/c.php
004552 | /path/to/file/d.php
002433 | /path/to/file/e.php
001522 | /path/to/file/f.php
(etc)
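The function-call variation mentioned above could look something like the following. It is a rough Python sketch rather than a port of the Ruby script, and it assumes the profile uses plain "fl=" and "fn=" lines, with each "fn=" belonging to the most recent "fl=". The function name is made up.

```python
import re
import sys
from collections import Counter

FL = re.compile(r"^fl=(.*)")
FN = re.compile(r"^fn=(.*)")

def count_functions(lines):
    """Count (file, function) pairs seen in cachegrind-style lines."""
    counts = Counter()
    current_file = None
    for line in lines:
        line = line.rstrip("\n")
        match = FL.match(line)
        if match:
            current_file = match.group(1)
            continue
        match = FN.match(line)
        # "cfn=" lines (callees) do not match because the pattern is anchored.
        if match and current_file is not None:
            counts[(current_file, match.group(1))] += 1
    return counts

if __name__ == "__main__":
    totals = Counter()
    for filename in sys.argv[1:]:
        with open(filename) as handle:
            totals.update(count_functions(handle))
    for (path, func), hits in totals.most_common():
        print("%06d | %s | %s" % (hits, path, func))
```

The output keeps the same zero-padded count column as the Ruby version, with the function name added as a third column.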
Obviously all this profiling data is already scanned by tools like KCacheGrind, though who really likes GUI tools for more complex data mining and for generating custom reports?
Flattening Hashes In Python
I was looking for a simple way to diff two complex hashes. In order to make the output nicely readable for humans, I’d first like to flatten the hashes. For example:
test = {
    "a" : [ "dog", "cat", "chicken" ],
    "b" : {
        "c" : 0,
        "d" : [ "red", "yellow", "blue" ],
    },
    "e" : "shiny"
}
Becomes:
{
    'a': ['dog', 'cat', 'chicken'],
    'b.c': 0,
    'b.d': ['red', 'yellow', 'blue'],
    'e': 'shiny',
}
Here is a very long Perl module that does this. Here’s my cut:
def _flatten_ds(self, ds, result=None, memo=""):
    # memo carries the dotted key path built up so far
    if result is None:
        result = {}
    assert isinstance(ds, dict)
    for (k, v) in ds.items():
        if memo == "":
            new_memo = k
        else:
            new_memo = "%s.%s" % (memo, k)
        if isinstance(v, dict):
            # recurse into nested hashes, writing into the same result
            self._flatten_ds(v, result=result, memo=new_memo)
        else:
            result[new_memo] = v
    return result
Almost 300 lines shorter than the Perl module.
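Once both hashes are flat, the diff I was originally after is just a walk over the combined keys. A minimal sketch, using a standalone flatten function in place of the method above; the "+", "-", and "~" output markers are my own invention:

```python
def flatten(ds, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    result = {}
    for key, value in ds.items():
        path = "%s.%s" % (prefix, key) if prefix else str(key)
        if isinstance(value, dict):
            result.update(flatten(value, path))
        else:
            result[path] = value
    return result

def diff_flat(old, new):
    """Return human-readable lines describing how new differs from old."""
    a, b = flatten(old), flatten(new)
    lines = []
    for key in sorted(set(a) | set(b)):
        if key not in b:
            lines.append("- %s = %r" % (key, a[key]))
        elif key not in a:
            lines.append("+ %s = %r" % (key, b[key]))
        elif a[key] != b[key]:
            lines.append("~ %s: %r -> %r" % (key, a[key], b[key]))
    return lines

if __name__ == "__main__":
    before = {"b": {"c": 0}, "e": "shiny"}
    after = {"b": {"c": 1}, "f": 2}
    print("\n".join(diff_flat(before, after)))
```

One caveat: an empty nested hash flattens to nothing, so it will not show up in the diff.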
Git Statistics — Simpler, Faster
Today I’ve reworked LookAtGit, which was a project I originally implemented because (A) I was really bored, and (B) I wanted to learn Scala.
The new “v2” is done in Ruby, which is nice because the regexes there don’t make me want to hurt someone and Ruby is generally awesome. I’d forgotten how much I missed it after a lot of corporate Python jobs. Ruby is fun.
So, lookatgit… I’ve also optimized it a HUGE amount since last time. For one, I found “git log --shortstat”, so it has to execute a billion fewer commands and can also skip binary commits.
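To show why the shortstat option saves so much work, here is a rough Python sketch of tallying per-author totals from a single git log run. It is not how lookatgit itself is written. The "@@author" marker and the function names are assumptions of mine, while %an and %ae are git’s standard author name and email placeholders.

```python
import re
import subprocess
from collections import defaultdict

INS = re.compile(r"(\d+) insertions?\(\+\)")
DEL = re.compile(r"(\d+) deletions?\(-\)")
MARK = "@@author "  # hypothetical marker to make author lines easy to spot

def tally(log_text):
    """Sum commits and lines added/removed per author from shortstat output."""
    stats = defaultdict(lambda: {"commits": 0, "added": 0, "removed": 0})
    author = None
    for line in log_text.splitlines():
        if line.startswith(MARK):
            author = line[len(MARK):]
            stats[author]["commits"] += 1
        elif author and "changed" in line:
            # git omits the insertions or deletions part when it is zero.
            for key, pattern in (("added", INS), ("removed", DEL)):
                match = pattern.search(line)
                if match:
                    stats[author][key] += int(match.group(1))
    return dict(stats)

def git_log(repo):
    """Run one git log over the whole history of the repo at this path."""
    return subprocess.check_output(
        ["git", "log", "--shortstat", "--pretty=format:" + MARK + "%an <%ae>"],
        cwd=repo, text=True)
```

Something like tally(git_log("/path/to/repo")) then gives one dictionary per author, which is enough to sort a contributor report by any field.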
The rewrite doesn’t quite yet offer some of the statistics-oriented statistics (SOS) that the previous version offered, but it does offer some additional reports, arbitrary field sorting, and will enable adding lots more other reports in the future.
To get started, check out from github and read the README file in the “v2” directory here.
You can see from those instructions how to do simple things like see the top 50 most active files in a repo, or the top 100 contributors sorted by arbitrary statistics.
Here’s an example from Spacewalk. Spacewalk is a HUGE repo, and while I’m limiting the length of the output below, the report time is spent in the calculations, rather than the output, so with a reasonable machine it will only take a minute to scan the repo.
mdehaan@snowball:~/code/lookatgit/v2$ time ruby lookatgit.rb -r ~/code/spacewalk/ --limit 10 -T -F --header --verbose
scanning...
processing 9534 commits...
generating report...
--------------------------------------------------
TOP CONTRIBUTORS REPORT
name,lines_changed,lines_added,lines_removed,commit_ct
--------------------------------------------------
Miroslav Suchý <msuchy@redhat.com>,1587958,143182,1444776,1650
Jan Pazdziora <jpazdziora@redhat.com>,183227,121126,62101,1428
Michael Mraka <michael.mraka@redhat.com>,196978,94669,102309,781
Devan Goodwin <dgoodwin@redhat.com>,146951,44522,102429,611
Justin Sherrill <jsherril@redhat.com>,1398046,763492,634554,587
Pradeep Kilambi <pkilambi@redhat.com>,726023,361122,364901,476
Mike McCune <mmccune@gmail.com>,39470,27793,11677,447
jesus m. rodriguez <jesusr@redhat.com>,392192,112554,279638,430
Partha Aji <paji@redhat.com>,49294,29033,20261,406
Milan Zazrivec <mzazrivec@redhat.com>,42101,22507,19594,375
------------------------------------------
TOP FILES REPORT
filename,lines_changed,change_ct,author_ct,commit_ct
------------------------------------------
java/spacewalk-java.spec,2718,246,17,246
backend/spacewalk-backend.spec,2420,235,14,235
java/code/src/com/redhat/rhn/frontend/strings/jsp/StringResource_en_US.xml,23742,202,19,202
java/code/webapp/WEB-INF/struts-config.xml,8040,147,18,147
rel-eng/packages/spacewalk-java,291,146,13,146
web/spacewalk-web.spec,1739,139,11,139
java/code/src/com/redhat/rhn/frontend/strings/java/StringResource_en_US.xml,8999,128,14,128
schema/spacewalk/spacewalk-schema.spec,726,121,10,121
rel-eng/packages/spacewalk-backend,237,119,11,119
proxy/installer/spacewalk-proxy-installer.spec,533,99,7,99
real 1m1.190s
user 0m6.912s
sys 0m0.340s
The next step is to build in those “statistics oriented statistics” and enhance the query capabilities. For instance, I’d like to be able to generate a report on the standard deviation of the times between commits, to show which developers on a given project are slacking off. Similarly, I’d like to generate aggregate statistics on a project so I can show that Project X contributors typically commit changes with certain distribution patterns, which may or may not be revealing.
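The commit-gap report is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not something lookatgit does today. It assumes you already have (author, Unix timestamp) pairs, for example from git log with the %at placeholder, and the function names are made up.

```python
from statistics import pstdev

def gap_stddev(timestamps):
    """Population standard deviation of the gaps between sorted commit times."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Fewer than two commits means there are no gaps to measure.
    return pstdev(gaps) if gaps else 0.0

def gap_report(commits):
    """commits: iterable of (author, unix_timestamp) pairs."""
    by_author = {}
    for author, stamp in commits:
        by_author.setdefault(author, []).append(stamp)
    return {author: gap_stddev(stamps) for author, stamps in by_author.items()}
```

A small deviation suggests a steady committer and a large one suggests bursts, though whether that really means anyone is slacking is another question.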
Contributors are very welcome. I currently do not have a project list, but if enough folks are interested we can get this going. It is not so much about what it can generate now but what we can generate in the future.
All The New Stuff You Can’t Use
Do you get excited about new language features? Probably. Can you use them immediately at work? Sometimes. If you write software that you distribute though, you often can’t!
Wouldn’t it be awesome if I could get all of the Python packages and Ruby gems (and CPAN) as RPMs on EL 4, pick any interpreter version I like, and run multiple interpreter versions on the same system?
Part of the problem with writing software that you want to be easy to set up and install for users is that you can’t use the shiny newness. For instance, if you have to support EL 2, you won’t be excited about new features in Python 2.5, as you’ll never be able to use them. Same deal with Rails 3 or TurboGears 4000.
If you are a hosted service, you could decide to do a lot of work and become a mini-distribution (packaging these things for yourself and tracking security updates and other bugfixes), but that’s a lot of duplicated IT across the world for everyone trying similar things. Each new library you want to use becomes a discussion with IT because someone needs to package it and look after updates. Ouch!
I am not a fan of the Java-style mode of deployment, as it seems to imply a great chance for security vulnerabilities (due to lack of updates when you package a sub-module with your code), slows progress, and also tends to encourage forking. However, I can kind of understand why it occurs: fear of the outside world breaking your code, or of not being able to deploy what you want where you want it.
Virtual appliances are also the wrong answer to deployment, because of the same update concerns, and the fact they take a sledgehammer to the problem and waste resources.
Ideally what I think I want is a cross-distribution build server that all upstream software projects could use that would automatically build packages for different interpreter versions and distributions. By cross distribution I not only mean all of Fedora, CentOS, and RHEL, but also Debian and Ubuntu. If we are lucky, also OS X. I’m tired of OS X deployments working differently.
Then I should be able to do:
yum install python25-simplejson python24-simplejson
and run both interpreters on the same box.
Some issues are to be had with contention over Apache, I’m assuming, though I think it would be really awesome if we could take every hosted service developer in the world out of having to maintain their own libraries when Enterprise Linuxes are too far behind and the likes of Fedora or Lefty Lemur are too fast-changing and unstable.
I don’t foresee this happening, of course… but you know… sometimes I think it would be nicer if the newness was easier to deploy on the oldness. To do this though, we really have to take the human distro-specific packagers out of the equation, and make a build service that is very very encouraging for all upstream developers to use. And it should (because of the upstream focus) ideally involve a partnership between distributions. This may also require unifying Debian and RHEL packaging in order to gain widespread adoption of software developers packaging their own content and submitting it to common build servers. If possible, do OS X as well.
Also, I want a pony.
Logging We Have Set On High
You thought there could be no more, but unfortunately, there are. It is time for another software Christmas carol!
Logging we have set on high
We don’t know just what this does
Intently grepping, much work remains,
And in confusion we sigh
(on) Echoing what /var/log contains
hm hm hm hm hm hm hm hm hm hm hm hm
in bash shell we cat it
hm hm hm hm hm hm hm hm hm hm hm hm
in bash shell we cat it
Thank you, thank you, I’ll be here all week.
Here We Come A Refactoring
One more! I must stop, but I cannot hold back my fans. All three of you!
Would you believe two?
Here We Come A Refactoring
Here we come a refactoring,
Among the code not lean,
Here we come a making,
A design so very clean.
Love and joy come to you,
And to your architecture too,
Let code impress you and how about a beer?
Simple code means more time for beer…