What Percentage of Users Join A Mailing List Or Contribute? / Random Statistical Thoughts
A while back I posted about The Drake Equation and trying to figure out the number of Open Source users for arbitrary applications out there. Even with popularity-contest apps in play (see Debian), it’s rather impossible to get hard numbers when not all users opt in, and the reasons for opting in or out vary depending on the type of user, with different applications being more geared for certain types of users, and applications being used on different distributions. Even tracking download sources (git, yum, http, mirrors) for all of Fedora, EPEL, and CentOS won’t bring us the kind of numbers that a proprietary app vendor can normally get just by asking a sales guy how many copies something sold. I thought EKG would help, but honestly even if we do gather all that data, it won’t tell us the answer to that question. Adding application-specific phone-home opt-in definitely won’t help. When someone asks me how many people are using foo, I just have to answer “I don’t know”.
Rather than trying to find that out by sifting through data, what about a straw poll to challenge some assumptions?
Do folks really think that 1 in 100 people contribute, or, depending on application domain, might that range wildly, anywhere between 1/10 and 1/1000?
What about people who join mailing lists or file bugs — what percentage of users do you think file bugs? (Exemption given for mythical bug-free projects)
How does that vary by application domain? How is a Firefox or GNOME user different in engagement ratios from the user of a systems management application?
I’ve heard the “1 in 100” figure thrown around a lot, but I haven’t really seen any hard evidence that the code contributor ratio is really 1 in 100. I do agree it’s /close/ to that range, but it would be interesting to know how that varies by different application domains and the degree of openness in which apps are run and whether contributions are actively asked for. If it’s 1/100, that is not necessarily bad; it may mean that you get more users.
(Now just in case Matt Asay decides to misquote me again, I’m not talking here about proving that contributions are small. What I’m doing here is trying to reverse engineer the size of the userbase from the data we have available, to show that usage of OSS projects is astronomically large. I have the github graphs for various projects to show that contributions are often large, data that he should definitely be looking at — but the user end of the equation is the area where we /don’t/ have those statistics — and those are things that EKG can’t track because the data is simply not there…)
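To make the "reverse engineer the size of the userbase" idea concrete, here is a minimal sketch of the arithmetic. It assumes you have a contributor count from somewhere (say, a commit log) and are willing to bracket the contributor ratio between 1/10 and 1/1000, the range floated above. The function name and the sample count of 50 are made up for illustration; nothing here is measured data.

```python
def estimate_userbase(contributors, users_per_contributor_low=10,
                      users_per_contributor_high=1000):
    """Return (min_users, max_users) implied by a contributor count.

    The two bounds are assumptions, not measurements: a 1/10 ratio means
    10 users per contributor, a 1/1000 ratio means 1000 users per contributor.
    """
    if contributors < 0:
        raise ValueError("contributors must be non-negative")
    if not 0 < users_per_contributor_low <= users_per_contributor_high:
        raise ValueError("bounds must be positive and ordered low <= high")
    return (contributors * users_per_contributor_low,
            contributors * users_per_contributor_high)


if __name__ == "__main__":
    # Hypothetical project with 50 known contributors.
    low, high = estimate_userbase(50)
    print(f"somewhere between {low} and {high} users")
```

The two-orders-of-magnitude spread in the result is the whole point: until the ratio is pinned down for a given application domain, the estimate is not worth much.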
This brings up another point: I think it would be worth trying to apply some analysis to Fedora’s download numbers, and to whatever verifiable numbers we can gather from other distributions, and seeing whether we can discern any trends there. Not that we’ll find anything, I just wonder what more analysis can be done.
Other silly relations that might be interesting: project language vs. contributor involvement, code size vs. contributor involvement, manpage size vs. mailing list size.
Probably a lot of that would not yield good data, but some might, and could help us understand our field more.
Perhaps we could start a field called Software Forensics.
Want to drill down and try to see how valid the 1/100 assumption is?
The first thing that needs to be done is to look statistically across a number of different opt-in metrics for a number of projects of varying sizes and see if the ratios of those metrics show consistent relationships as a function of project size. For however you want to define size…as long as you consistently apply that definition.
For example, on average, does the ratio of debian popcon stats to upstream project -devel mailinglist activity hold as ratio across many projects of different size and complexity? Is that ratio actually more of a shaped distribution?
That’s just one ratio, you could probably construct dozens, each with similar value as a tool..sort of like how there are different mathematical expressions for entropy depending on the context. Hmm I wonder.. can you build a partition function for project participation and define entropy in that context and grind away at estimating project health from a mathematical perspective….hmmm.
But I digress….If a ratio like that is very narrowly distributed around an average, that would support the idea that there is some sort of general participation ratio rule for generally successful projects. It won’t tell you what that magic partitioning is exactly. If there is an obvious distribution of the ratio values then it may be worthwhile to look at projects in the extremes of that distribution, searching for specific indicators of project health and correlating them with specific project policies. In that way you can point to specific project policies that tend to lower your participation ratio and lead to overall lowering of project health.
-jef
Jef Spaleta
June 3, 2009 at 6:23 pm
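The ratio check described in the comment above can be sketched roughly as follows. Everything in the `projects` table is invented for illustration (the project names, install counts, and post counts are not real popcon or mailing list data), and "one standard deviation from the mean" is just an assumed cutoff for what counts as an extreme.

```python
import statistics

# Hypothetical numbers for illustration only; not real popcon or list data.
projects = {
    "alpha": {"popcon_installs": 12000, "devel_posts_per_month": 120},
    "beta": {"popcon_installs": 500, "devel_posts_per_month": 4},
    "gamma": {"popcon_installs": 90000, "devel_posts_per_month": 300},
    "delta": {"popcon_installs": 2000, "devel_posts_per_month": 60},
}


def participation_ratios(projects):
    """Mailing list posts per popcon install, for each project."""
    return {
        name: m["devel_posts_per_month"] / m["popcon_installs"]
        for name, m in projects.items()
    }


def summarize(ratios):
    """Mean, sample standard deviation, and coefficient of variation."""
    values = list(ratios.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def extremes(ratios, k=1.0):
    """Projects whose ratio is more than k standard deviations from the mean."""
    summary = summarize(ratios)
    return sorted(
        name for name, r in ratios.items()
        if abs(r - summary["mean"]) > k * summary["stdev"]
    )


if __name__ == "__main__":
    ratios = participation_ratios(projects)
    print(summarize(ratios))
    print("worth a closer look:", extremes(ratios))
```

A small coefficient of variation would support the "general participation ratio rule" idea; a large one says the ratio is a distribution, and the projects returned by `extremes` are the ones to inspect for policy differences.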
“For example, on average, does the ratio of debian popcon stats to upstream project -devel mailinglist activity hold as ratio across many projects of different size and complexity? Is that ratio actually more of a shaped distribution?”
I like that thinking. If there’s no drift from that data, you can make more assumptions, or at least know what the relative factors are — if they are high you at least know the experiment might not work. Plus, I think that data might not be too hard to collect.
Finding OSS idealness is the problem of optimizing the Nth-dimensional functions of all the characteristics that may affect things (perhaps detectable from the above — but I suspect language mix and Lines of Code are obvious, percentage of posts by project owner maybe, frequency of releases, and as many other things as you can mix in) using something measurable like list activity (or some better list metric) as the fitness function. And of course avoid local maxima when doing that. Maybe try to maximize for different fitness functions.
Then we ultimately write an app you can use that says “you need to release 50% more each year, cuss 25% less, and write longer manpages”. Well, now I’m just being crazy but having the data would be interesting.
Right now we can only speculate that people are really afraid of million line java apps and don’t know how to code in Intercal.
mpdehaan
June 3, 2009 at 6:41 pm
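For the "optimize a fitness function without getting stuck on a local maximum" idea in the comment above, one simple approach is hill climbing with random restarts. The sketch below is only that, a sketch: `toy_fitness` is a made-up two-variable function standing in for something like list activity, with a deliberately planted small peak and a taller one, and the step size, iteration count, and number of restarts are arbitrary assumptions.

```python
import math
import random


def toy_fitness(x):
    """Hypothetical stand-in for 'list activity' as a function of two
    rescaled project characteristics. It has a small peak near (1, 1)
    and a taller one near (4, 4), so a single climb can get stuck."""
    a, b = x
    small = 1.0 * math.exp(-((a - 1) ** 2 + (b - 1) ** 2))
    tall = 2.0 * math.exp(-((a - 4) ** 2 + (b - 4) ** 2))
    return small + tall


def hill_climb(fitness, start, rng, step=0.1, iterations=2000):
    """Stochastic hill climbing: keep a random nearby point only if it scores higher."""
    best = list(start)
    best_score = fitness(best)
    for _ in range(iterations):
        candidate = [v + rng.uniform(-step, step) for v in best]
        score = fitness(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def random_restart(fitness, bounds, restarts=20, seed=0):
    """Run hill_climb from several random starting points and keep the best result."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        point, score = hill_climb(fitness, start, rng)
        if score > best_score:
            best, best_score = point, score
    return best, best_score


if __name__ == "__main__":
    point, score = random_restart(toy_fitness, [(0, 5), (0, 5)])
    print(point, score)
```

A climb that starts near (1, 1) settles on the small peak; the restarts are what give the search a chance of finding the taller one. With real project data the hard part is the fitness function itself, not the search.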
There’s no doubt that there are many complex factors…simple ratios are only going to get you so far as they can only really examine linear relationships between factors. But you have to start somewhere. If statistical analysis of ratioing identifies factors which appear to be independent or nearly independent of each other that would be a good start at charting the space. Forming a basis of independent factors gives you the ability to start building an empirical model. This sort of crap is done all the time in certain hard science fields as performance prediction tools to span highly nonlinear parameter spaces.
Deep nonlinear complexity is why I think a probabilistic analysis of an entropy-equivalent model is a compelling mathematical framework. Maximum Entropy optimization techniques are wonderful tools..once you have an entropy-alike function defined by a set of parameters that make sense for this problem space. But the space is so abstract, more abstract than the conceptual information theory examples used to introduce the entropy concept, that I don’t have a suggestion as to how to build an entropy model here. I’m not sure Shannon entropy applies..or if it does I’m not currently drunk enough yet to see it.
-jef
Jef Spaleta
June 3, 2009 at 7:03 pm
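As a naive starting point for the entropy discussion above, here is plain textbook Shannon entropy applied to how contributions are spread across contributors. This is not the entropy model the comment is asking for, and whether it says anything about project health is exactly the open question; the commit counts in the example are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts
    (for example, commits per contributor)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )


def normalized_entropy(counts):
    """Entropy divided by its maximum for this many contributors:
    1.0 means work is spread evenly, values near 0 mean one person does it all."""
    active = sum(1 for c in counts if c > 0)
    if active <= 1:
        return 0.0
    return shannon_entropy(counts) / math.log2(active)


if __name__ == "__main__":
    # Hypothetical commit counts for two four-person projects.
    print(normalized_entropy([85, 5, 5, 5]))    # one dominant contributor
    print(normalized_entropy([25, 25, 25, 25]))  # evenly shared work
```

The normalized form makes projects with different head counts comparable, which is the minimum you would need before comparing it against any other health indicator.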
Michael, I wasn’t trying to misquote you. In fact, I don’t think I did. It certainly wasn’t my intention to do so.
I was simply using your excellent post to suggest, as you do, that community is harder to make work than most people think. The current thinking is that you hire a community manager and, VOILA! You’ve got a community and it will magically do all your work for you/with you.
That’s not true, and I didn’t read anything in any of your posts that suggests the contrary, including this one. The best research I’ve seen (and blogged) suggests that in all open-source projects a core team does 85% of the work. If you’ve seen better research, I’d love to see it.
This isn’t a weakness of open source, it’s just a reality of working with people. It’s really hard to work with big teams. Period. Open source, however, delivers other benefits that make it worthwhile.
Do you disagree?
Matt Asay
June 3, 2009 at 9:38 pm
I would agree that it’s harder than most people think… mainly because most folks don’t try very hard and don’t have the experience of seeing it work. Is it actually hard? Not all that much so. The rest of your blog post I do disagree with — I can’t agree that half-closed business models are a natural evolution or an inevitable conclusion. This is selling OSS short on a grand scale. It is giving up. I see these as a failure of the upstream in fighting for what they believe in, and using OSS as a marketing point rather than embracing it. These are the folks that don’t grok community factors. Debian and Fedora have both been excellent examples of this done right. Even more so, the Linux kernel. All of these places are places where people collaborate without room for multiple business models to exist, and also the /lack/ of a business model to exist (for multiple companies, collaborating together, without holding things back).
Seeing you linked his comments and mine in the same areas, Marc Fleury and I also have /quite/ different views about how OSS projects should be run. His “professional open source” theory of employing everyone from a central source is quite different from how Fedora thrives — even if it was successful — but I can’t help but note he’s not doing it anymore. Simply put, the tools for collaborating and building the place for collaborating is what we do. Then we let what happens happen. This is because we know that the evolution of software is a biological process, not a mechanical one. This is one of the reasons all bits of producing the distro are open source, and anyone can maintain any package. This of course trickles down to projects as well.
Since you asked, I actually find working with big groups of people pretty easy, provided you realize that you are not working with a team that you give assignments to, but rather a group of people who have an interest in a common thing. I don’t try to say “you do X by Y” but instead share the ideas that I have and encourage folks to submit their own ideas, ideas being just as important a currency as code. The easy way to fail there is to think they are a part of a classical team and that you direct the effort. Rather, the effort directs you, and the way to succeed is not by attempting to control it, but by providing tools and places for people to collaborate and letting that evolve in its natural direction. This is all rather radical compared with the way most folks build software. And what comes out of that, to me, is surprising. And it all starts with relinquishing control, trusting the community of users and developers, and laying everything on the table. No secrets. No bits held back. All comers welcome.
No single person owns a properly run OSS project. Instead, the people often seen as the owners should only be enabling others and providing guidance. This is what I seek to do, and this is how OSS can change the world. It doesn’t change the world by being partially open or vendor controlled, it does so by opening the doors and questioning the way we build /everything/.
Many folks may think they can throw bits over a wall and call the job done, because they are so far removed from the OSS development model. These are the folks that are the first to say the community does not work or is dead, as you mentioned in the comments quoted on InfoWorld. Again, I think this does the OSS world a large disservice.
(Note that this is my personal blog of course, and I don’t pretend to speak for my employer)
mpdehaan
June 4, 2009 at 2:28 am
P.S. Marten Mickos will tell you that the ratio is actually 1/1000. He seemed to like open source a lot. I think it’s possible to be accurate without that implying a lack of affection for open source. The Kool-Aid tastes good either way.
Matt Asay
June 3, 2009 at 9:41 pm
I’m not familiar with Marten, linkage welcome.
Anyway, some projects, maybe — GNOME I might agree that it could be a larger disparity than that (lots of end users), but for some, quite smaller. Ultimately it matters how simple the project is to get involved with, and how interesting the project is, and whether quick wins are possible in contributing.
Some projects do a better job of enabling that than others; those that don’t meet those criteria need to work towards enabling it.
mpdehaan
June 4, 2009 at 2:12 am
Jef, a lot of this is probably beyond my abilities, and definitely beyond my bandwidth. (I didn’t take a lot of statistics, which is to blame).
Would love to know what all the data says though. I think a lot of people try to write books on software development and are just making stuff up. Especially for open projects, with effectively limitless team size if the idea is good enough … where that can go and what’s involved. Having models that are better than trial and error and blogging would be great.
This is why we need 10 more people with time to work on expanding EKG, among other things. I saw sourceforge bought ohloh (no longer in control of ex-Microsoft execs?) so perhaps there is some data to be had in cooperation with them also; ditto github.
Perhaps not but maybe we can ask. If they even had something that was as kludgy as Flickr’s API it might be a start… that just gives us code, but combined with mailing list data, hmm…
mpdehaan
June 4, 2009 at 2:48 am