When we (rigorously) measure effectiveness, what do we find? Initial results from an Oxfam experiment.
Guest post from ace evaluator Dr Karl Hughes (right, in the field. Literally.)
Just over a year ago now, I wrote a blog featured on FP2P – Can we demonstrate effectiveness without bankrupting our NGO and/or becoming a randomista? – about Oxfam’s attempt to up its game in understanding and demonstrating its effectiveness. Here, I outlined our ambitious plan of ‘randomly selecting and then evaluating, using relatively rigorous methods by NGO standards, 40-ish mature interventions in various thematic areas’. We have dubbed these ‘effectiveness reviews’. Given that most NGOs are currently grappling with how to credibly demonstrate their effectiveness, our ‘global experiment’ has grabbed the attention of some eminent bloggers (see William Savedoff’s post for a recent example). Now I’m back with an update.
The first thing to say is that the effectiveness reviews are now up on the web. Here you will find introductory material, a summary of the results for 2011/12, and some glossy (and hopefully easy to read) two-page summaries of each effectiveness review, as well as the full reports. (You may not want to download and print off the full technical reports for the quantitative effectiveness reviews unless you know what a p-value is. With the statistically challenged in mind, we have kindly created summary reports for these reviews, complete with traffic lights….). Eventually, all the effectiveness reviews we carry out/commission will be available from this site, unless there are good reasons why they cannot be publicly shared, e.g. security issues.
Plug over, I can now give you the inside scoop. In the first year (2011/12) we aimed to do 30 effectiveness reviews, and we managed to pull off 26. Not bad, but our experience in the first year made us realise that our post-first-year target of 40-ish reviews per year was perhaps overly ambitious. We have now scaled our ambitions down to 30-ish, both to avoid overburdening the organisation and to enable better quality control.
The issue of quality control, in particular, is critical, because there are certainly opportunities to strengthen the effectiveness reviews, particularly in terms of rigour. There is currently considerable interest in how to evaluate the impact of interventions that don’t lend themselves to statistical approaches, such as those seeking to bring about policy change (aka “small n” interventions) – see a recent paper by Howard White and Daniel Phillips. We have attempted to address this by developing an evaluation protocol based on process tracing, a methodology used by some case study researchers. However, we are struggling to ensure consistent application of this protocol; time and budgetary constraints, as well as the inaccessibility of certain data sources, no doubt militate against it. Nevertheless, we aim to improve things this year by overseeing the researchers’ work more tightly, coupled with more detailed guidelines and templates so that they better understand what is expected.
The reviews of our “large-n” interventions, i.e. those targeting large numbers of people, while in no way perfect, have perhaps fared better. This is, at least in part, because we are directly involved in setting up the data collection exercises, and we carry out the data analysis in-house. The key to their success is capturing quality data on plausible comparison populations and on the key factors that influence programme participation, and this has worked out better in some cases than in others. We are also attempting to measure things that just aren’t easy to measure, e.g. women’s empowerment and ‘resilience’. We are modifying our approaches and seeking to collaborate with academia to get better at this. Despite their shortcomings, at £10,000-ish a pop (excluding staff time), we believe these exercises deliver pretty good value for money.
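For readers curious about the mechanics, the comparison-group logic behind a “large-n” review can be sketched in a few lines. This is a deliberately simplified, hypothetical illustration – not Oxfam’s actual analysis, which matches on many participation factors at once – in which each intervention household is paired with the comparison household most similar on a single baseline characteristic, and outcomes are then averaged across the matched pairs:

```python
# Hypothetical sketch of nearest-neighbour matching on one baseline
# covariate (e.g. land holding), followed by a simple average-effect
# estimate on the matched pairs. All data below are illustrative.

def nearest_neighbour_match(intervention, comparison):
    """Pair each intervention household with the comparison household
    whose baseline covariate value is closest; return outcome pairs."""
    pairs = []
    for cov, outcome in intervention:
        _, matched_outcome = min(comparison, key=lambda c: abs(c[0] - cov))
        pairs.append((outcome, matched_outcome))
    return pairs

def average_effect(pairs):
    """Mean difference in outcomes between matched households."""
    return sum(t - c for t, c in pairs) / len(pairs)

# (covariate, outcome) tuples, e.g. (baseline land holding, asset index)
intervention = [(1.0, 5.2), (2.0, 6.1), (3.0, 7.0)]
comparison   = [(0.9, 4.8), (2.1, 5.5), (3.2, 6.2), (5.0, 8.0)]

effect = average_effect(nearest_neighbour_match(intervention, comparison))
```

In practice the matching would also need to handle poor overlap between the two groups and selection on unobserved factors – exactly the issues that make capturing good data on who participates, and why, so central to these reviews.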
Humanitarian programming is not my thing, but I am particularly pleased with the humanitarian effectiveness reviews that critically look at adherence to recognised quality standards. While there are some methodological tweaks needed here and there, the cohort of reviews presents an impartial and critical assessment of Oxfam’s performance and identifies key areas that need to be strengthened, e.g. gender mainstreaming.
So what do the effectiveness reviews reveal about Oxfam’s effectiveness? While the sample of projects is too small to draw any firm conclusions, the results for this particular cohort of projects are – as one might expect – mixed. For most projects, there is evidence of impact for some measures but none for others.
There are, no question, some clear success stories, such as a disaster risk reduction (DRR) project in Pakistan’s Punjab Province. Here, the intervention group reported receiving, on average, about 48 hours of advance warning of the devastating floods that hit Pakistan in the late summer of 2010, as compared with only 24 hours for the comparison group. Having had more time to prepare is one possible explanation for why the intervention households reported losing significantly less livestock and other productive assets. Oxfam’s research team is in the process of commissioning some qualitative research to drill down on this project to better understand what made it work.
Given Oxfam’s size and capacity to mobilise and make noise, it is no surprise that there is reasonably reliable evidence that many of the campaign projects have brought about at least some positive and meaningful changes, despite falling short of fully realising their lofty aims. However, the results for several of the sampled livelihoods and adaptation and risk reduction projects are, quite frankly, disappointing. Figuring out why these particular projects have not worked is just as critical for learning as is figuring out why the Pakistan one did.
Whether their findings are positive or negative, I have to admit that I am impressed with how seriously the effectiveness reviews are being taken by senior management. A management response system has been set up and embedded into the management line, where country teams formally commit themselves to taking action on the results.
That being said, the effectiveness reviews are in no way immune from internal controversy. The random nature of project selection is perhaps the biggest sticking point. While we do this to avoid ‘cherry picking’, inevitably some of the projects that are selected are small-scale and have little strategic relevance to the countries and regions. Some are also concerned about how much time and resources the effectiveness reviews are sucking up.
We know that what we are attempting to pull off can be improved on a number of fronts – rigour, learning, and the engagement and ownership of country teams. And the good thing is that we are able to modify and improve things as we go along. So any constructive criticism, advice, etc. is most welcome.