Last night (4/2) the Airbnb Tech Talk was all about Experiments at Airbnb. (NB: @AirbnbNerds does a good job posting videos of (many of) the talks, so watch here for the posting of the video).
I was extremely interested in hearing how @AirbnbNerds do experiments. A previous TechTalk I’d attended covered the methods they’ve tried to nudge hosts to price their properties accurately (i.e., at the market clearing price). [no video apparently].
The talk was extremely intelligent, but not geared toward me and my posse (designers of particular experimental manipulations).
The speakers, Jan Overgoor and Will Moss, discussed the analytical infrastructure they've developed for implementing a host of experiments. Most of the attendees seemed to be data scientists; this became particularly evident in the audience's "oooh" upon hearing that Airbnb has 18 data scientists plus 4 more programmers devoted to maintaining the data infrastructure. The role these teams play is self-evidently crucial: they've developed deep insight into appropriate sampling hygiene, valid statistical inference, and well-thought-through protocols for timing and executing experiments. I also absolutely loved the rigor with which they treat experiments as "code": checked into GitHub, and if conditions change or evolve, that change gets checked in too.
Let me try to pinpoint one thing I had hoped to hear about that wasn't possible to fit into the talk. My dream talk would have revealed an inventory of hypotheses they've formulated, based on deep knowledge of their customer base combined with behavioral insights from academic research. Truth be told, I would be out of a job if they had built a cogent guide for identifying good topics to probe, designing the specific conditions, prioritizing the sequence of nested or imbricated experiments, laying out clear steps for operationalizing each manipulation, and offering a concise, compelling way to think clearly about what counts as the appropriate control.
I don’t have pictures (my phone died), but Will Moss showed a beautiful dashboard they’ve built. Very well thought out and lovingly designed, it shows each experiment with 4 columns per cell: mean, % change, p-value, and a graphic sparkline.
Below are my complete notes.
Experiments at Airbnb
What we have learned and how we do it now
Jan – data scientist; Will – software engineer
Jan wears 2 hats: the A team (analytics) and the Search team (lots of experiments; they write the search algorithms)
Experiments- why & what
Some common pitfalls
Experiment Reporting Framework
Why – Why not just launch & see what happens?
Often it's really hard to tell the effect of a change. Lots of external effects.
Daily and weekly cycles (e.g., Tuesdays see more searches than Saturdays), seasonality, etc.
Little changes make little differences
2 references in footnote:
http://mcfunley.com/design-for-continuous-experimentation
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
The actual effect [in the up-and-down chart shown] is 1% – the treatment effect of a dummy experiment.
What is the p-value?
How do people tend to use it? They stop the experiment when p drops below some value (.05) – which makes them very likely to find effects where none exist.
Definition: the probability of B differing from A by at least the observed difference, if there were no actual difference between the two options
Better: it's a measure of whether the effect is bigger than the noise
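To make that definition concrete, here's a minimal sketch I put together afterward (mine, not from the talk): a standard pooled two-proportion z-test for the difference between two conversion rates, with made-up counts.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using the normal approximation (pooled two-proportion z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) if there were truly no difference between A and B
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Made-up numbers: 400/10,000 bookings in A vs 430/10,000 in B
print(two_proportion_p_value(400, 10_000, 430, 10_000))
```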
Pitfalls
Stopping too early
Not understanding the context
Assuming that the system works correctly (i.e., returning the things it should be returning)
Example experiment: increase the max value of the price slider from $300 to $1,000
After 8 days: a significant 4% increase in bookings – rejoice! The p-value hit .05
After 36 days: the delta was ~0, p = 0.4
When to Stop:
Don’t stop when p hits significance
Estimate in advance how long you'll run the experiment (ideal)
Interpret the progress of the delta and p-value over time (heuristic)
Stop when p is at or below a dynamic threshold (compromise)
Start with a low (strict) threshold and relax it over time
Dynamic decision boundary: Much more conservative about false positives
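They didn't spell out the exact curve, so here's my own toy simulation of the idea (the boundary() function below is a hypothetical stand-in, not Airbnb's actual boundary): under an A/A set-up, stopping the first time p dips below 0.05 produces far more false positives than stopping against a threshold that starts strict and relaxes toward 0.05 over the planned life of the experiment.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def p_value(c_a, n_a, c_b, n_b):
    # Pooled two-proportion z-test, two-sided (normal approximation).
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def boundary(day, total_days, start_alpha=0.001, final_alpha=0.05):
    # Hypothetical dynamic boundary: very strict early on, relaxing
    # linearly toward the usual 0.05 by the planned end of the experiment.
    return start_alpha + (final_alpha - start_alpha) * day / total_days

def run_aa(days=30, visitors_per_day=2000, rate=0.04, dynamic=False):
    # A/A experiment: both "treatments" are identical, so any "winner"
    # declared by early stopping is a false positive.
    conv = np.zeros(2)
    n = np.zeros(2)
    for day in range(1, days + 1):
        n_b = rng.binomial(visitors_per_day, 0.5)
        split = np.array([visitors_per_day - n_b, n_b])
        n += split
        conv += rng.binomial(split, rate)
        p = p_value(conv[0], n[0], conv[1], n[1])
        threshold = boundary(day, days) if dynamic else 0.05
        if p < threshold:
            return True  # peeked, saw "significance", stopped early
    return False

trials = 500
naive = sum(run_aa(dynamic=False) for _ in range(trials)) / trials
dyn = sum(run_aa(dynamic=True) for _ in range(trials)) / trials
print(f"false positives, stop at first p < 0.05: {naive:.1%}")
print(f"false positives, dynamic boundary:       {dyn:.1%}")
```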
Looking at Context
Huge re-design for Search
Before: Small map, little pictures
After: Big map, bigger pictures of properties
Test: delta near zero.
Breaking the results down by browser showed a bug with IE was dragging the whole test down: Chrome +2.07%, Firefox +2.81%, Safari +0.86%, but IE -3.66%, for a net drop of -0.27%.
Look at the funnel dynamics
Search to book: -0.31%, p = 0.37
Search to contact: -1.29%, p = 0.04
Contact to book: +0.99%, p = 0.06
Contact to accept: +1.58%, p = 0.00
Accept to book: -0.58%, p = 0.11
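To illustrate the kind of breakdown being described, here's a sketch of my own (the column names and numbers are made up, and it assumes pandas and statsmodels are available): group the assignment/outcome log by browser and test each slice separately; the same pattern applies to each funnel step.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-cell summary: assignments and bookings by browser.
cells = pd.DataFrame({
    "bucket":   ["control", "treatment"] * 4,
    "browser":  ["chrome", "chrome", "firefox", "firefox",
                 "safari", "safari", "ie", "ie"],
    "searched": [10_000, 10_000, 6_000, 6_000, 3_000, 3_000, 4_000, 4_000],
    "booked":   [420, 510, 250, 310, 120, 130, 170, 110],
})

for browser, rows in cells.groupby("browser"):
    ctl = rows[rows["bucket"] == "control"].iloc[0]
    trt = rows[rows["bucket"] == "treatment"].iloc[0]
    delta = trt["booked"] / trt["searched"] - ctl["booked"] / ctl["searched"]
    _, p = proportions_ztest([ctl["booked"], trt["booked"]],
                             [ctl["searched"], trt["searched"]])
    print(f"{browser:8s} search-to-book delta {delta:+.2%}  p = {p:.2f}")
```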
Regression is more rigorous:
Breaking results down into separate cells reduces the sample size in each cell
Computing interaction effects that way is onerous
It's hard to account for external factors
Regression handles all of this better
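My reading of the regression alternative, as a sketch (synthetic data and made-up column names, not Airbnb's actual model): fit one logistic regression over the whole sample with treatment, browser, and their interaction, instead of slicing the data into ever-smaller cells.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic user-level data standing in for the experiment log
# (column names are made up: booked, treatment, browser).
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "browser": rng.choice(["chrome", "firefox", "safari", "ie"], n),
})
# Simulate a treatment that helps everywhere except IE, where a bug hurts it.
base = 0.04 + 0.004 * df["treatment"]
base = np.where((df["browser"] == "ie") & (df["treatment"] == 1), 0.03, base)
df["booked"] = (rng.random(n) < base).astype(int)

# One logistic regression over the full sample; the interaction term asks
# whether the treatment effect differs by browser.
model = smf.logit("booked ~ treatment * C(browser)", data=df).fit()
print(model.summary())
```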
Scrutinize your set-up
Especially whether you're using a 3rd-party vendor for the set-up
Run dummy experiments – A/A tests
Hopefully it will return a null result
An example that helped us find a bug: uneven sample sizes across treatments uncovered a bias
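One cheap sanity check in this spirit (my sketch, not the speakers' code, assuming scipy is available): test whether the realized sample sizes are consistent with the intended 50/50 split – uneven sizes like the ones that exposed their bug would fail a chi-square check.

```python
from scipy.stats import chisquare

# Observed assignment counts from a (hypothetical) dummy A/A experiment
# that was supposed to split traffic 50/50.
observed = [50_412, 49_100]
expected = [sum(observed) / 2] * 2

stat, p = chisquare(observed, f_exp=expected)
if p < 0.001:
    print(f"sample sizes look skewed (p = {p:.2g}) -- check the bucketing code")
else:
    print(f"split looks consistent with 50/50 (p = {p:.2g})")
```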
The mixed group problem
Most visitors are not logged in
Some are assigned to more than one group (multiple cookies on multiple devices)
Exclude them from analysis
But they’re the most engaged users
Where does the bias come from?
Mixed group users are evenly distributed over the treatments. We can’t assign the big users to that category
Not equal sample sizes
The control contains users who would have been mixed-group users; they are more likely to book, and they are not excluded.
So the control has a huge advantage over the treatment.
Think hard about confounding problems.
Our solution is to have equal sample sizes across cells.
[I really don’t understand this issue clearly]
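Here's my attempt to reconstruct the issue with a toy simulation (mine, not from the talk): heavy multi-device users are more likely to end up "mixed" and excluded when the split is uneven, so the larger control cell keeps a disproportionate share of high-intent users; with an even split the distortion disappears.

```python
import random

def simulate(treat_share, users=200_000, heavy_frac=0.2,
             p_book_light=0.03, p_book_heavy=0.15):
    """Each heavy user has two cookies that are bucketed independently.
    Users whose cookies land in different buckets are 'mixed' and excluded."""
    bookings = {"control": 0, "treatment": 0}
    counts = {"control": 0, "treatment": 0}
    for _ in range(users):
        heavy = random.random() < heavy_frac
        cookies = 2 if heavy else 1
        groups = {"treatment" if random.random() < treat_share else "control"
                  for _ in range(cookies)}
        if len(groups) > 1:
            continue                      # mixed group: excluded from analysis
        g = groups.pop()
        counts[g] += 1
        p_book = p_book_heavy if heavy else p_book_light
        bookings[g] += random.random() < p_book
    return {g: bookings[g] / counts[g] for g in counts}

random.seed(1)
print("uneven 10/90 split:", simulate(treat_share=0.10))
print("even   50/50 split:", simulate(treat_share=0.50))
# Treatment and control are identical here, so any gap in the uneven-split
# booking rates is pure selection bias from excluding mixed-group users.
```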
Takeaways
Controlled experiments are the way to go
Use them in product development
But
Set time in advance or give it enough time to run
Interpret progress over time
Break results down into meaningful cohorts and funnel steps
Use regression to model subtle effects
Run dummy experiments to test your set up
Identify confounding factors in the set-up
He ends his part at 7:22
Experimental infrastructure
Mesos, Hadoop, Hive, and Presto are the tools we build on to democratize data
Running experiments is crucial, but there are a lot of pitfalls. I used to believe you could just throw users into different buckets and then test the differences.
Solve pitfalls once. Don’t duplicate work
A team should automate the things it approaches repeatedly
Design Tenets
Less susceptible to bad patterns (limit bias)
Analysis comes for free (we handle the nuances)
Deployment through Git(hub) – bring the same rigor to experiments that we have for code changes.
Automatic (and reliable) logging – we need to reliably know that we put a user into the treatment group.
Example: Vary number of results shown on page (currently defaults to 12; vs 18, 24)
Define the experiment and the analysis
Deploy – write the code to run the experiment
Analyze – check out the results.
The dashboard shows the following fields:
3 groups of columns plus a sparkline for each cell (18 per page is the control):
Mean, percent change, p-value, and the sparkline
Experiment, start date, end date
User cohort (all vs. new vs. returning)
Metric: All | Pivot [select a metric to pivot on] | button for "Show raw value"
Metrics include a dozen things (but not revenue), e.g., date searchers, number of date searches
Define the experiment: we use YAML (because if you show a PM words that are all lower-case and run together, they'll say this looks stupid); it's machine-readable and human-readable. The experiment: search results per page.
Yaml lists:
treatments:
Control:
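I didn't capture the full YAML, but the shape was roughly like the sketch below – the field and treatment names are my guesses, not Airbnb's actual schema – wrapped in Python just to show it parses, assuming PyYAML is installed.

```python
import yaml  # PyYAML

# Hypothetical experiment definition, loosely based on what was on screen.
definition = yaml.safe_load("""
search_results_per_page:
  description: Vary the number of search results shown per page
  treatments:
    results_12: {results_per_page: 12}
    results_18: {results_per_page: 18}
    results_24: {results_per_page: 24}
""")

print(definition["search_results_per_page"]["treatments"]["results_18"])
```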
Deploy: Deliver_experiment
The function: give it the name of the experiment and 4 lambda functions, one for each treatment, plus an "unknown" lambda – if something goes wrong, it delivers that fallback and the user won't know.
It assigns the user to a group, logs the assignment, and executes the lambda. There's also a JavaScript version. We try to cache as much as possible; if you can't run your experiment from the cache, it lops off half your sample.
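As best I could follow, the delivery call works something like this paraphrase in Python (the real thing isn't Python – they mentioned a JavaScript version too – and every name here is mine): you hand it the experiment name plus one callback per treatment and an "unknown" fallback for when anything goes wrong.

```python
import hashlib
import logging

logger = logging.getLogger("experiments")

def render_results(per_page):
    print(f"rendering {per_page} results per page")

def deliver_experiment(experiment, user_id, unknown, **treatments):
    """Assign the user to a treatment, log the assignment, run its callback.
    Hashing (experiment, user) gives a sticky, evenly split assignment."""
    try:
        names = sorted(treatments)
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        chosen = names[int(digest, 16) % len(names)]
        logger.info("experiment=%s user=%s treatment=%s",
                    experiment, user_id, chosen)
        return treatments[chosen]()
    except Exception:
        # If anything goes wrong, quietly serve the fallback;
        # the user never notices.
        return unknown()

# One lambda per treatment, plus the "unknown" fallback.
deliver_experiment(
    "search_results_per_page", user_id=42,
    unknown=lambda: render_results(12),
    results_12=lambda: render_results(12),
    results_18=lambda: render_results(18),
    results_24=lambda: render_results(24),
)
```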
Avoid bias:
user.get_treatment
Imagine that someone later comes back and says, "IE is a dog, we can't be showing IE users this treatment," and someone on another team jumps in and excludes IE.
Now it's no longer clear that we're getting equal-size cells.
Log correctly: there are 2 different ways to exclude IE. One says, "if IE, don't serve 24 results." The other is logged as "served 24," but if the browser is IE, the user is excluded from the log.
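The distinction, as I understood it, in a small illustration of my own (none of these names are theirs): the risky version overrides the treatment after the assignment has already been logged, while the safer version takes IE users out of the experiment before they're assigned and logged at all.

```python
from dataclasses import dataclass

@dataclass
class Request:
    browser: str

class User:
    def get_treatment(self, experiment):
        # Stand-in for the real assignment call: pretend everyone gets 24
        # results and pretend the assignment is logged here.
        print(f"logged: {experiment} -> 24")
        return 24

def render(per_page):
    return f"page with {per_page} results"

def serve_search_page_risky(user, request):
    # Risky: the user is assigned and logged (as 24), but an after-the-fact
    # IE override means they actually see 12 results.
    per_page = user.get_treatment("search_results_per_page")
    if request.browser == "ie":
        per_page = 12            # the analysis still thinks this user saw 24
    return render(per_page)

def serve_search_page_safer(user, request):
    # Safer: IE users never enter the experiment, so the log only contains
    # users who really saw what they were assigned.
    if request.browser == "ie":
        return render(12)
    return render(user.get_treatment("search_results_per_page"))

print(serve_search_page_risky(User(), Request(browser="ie")))
print(serve_search_page_safer(User(), Request(browser="ie")))
```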
Generate queries
Chronos (job scheduler)
Shared calculations (mixed groups, etc)
Experiments 1 through N
Database
UI for viewing data
Q: What trends are red flags?
A: In general, if it's fluctuating a lot, that shows you're looking too early. The p-value is cumulative – it's computed from the start of the experiment all the way up to that point. We most often see a high, stable p-value.
Q: How do you compute the dynamic decision boundary function?
A: I run simulations – this trend, this number of hits, etc. – asking how likely we are to miss a real effect. If you have enough data, I suggest you calculate a similar curve for your own customer base.
Q: How many people? A: 18 data scientists + 4 on the data infrastructure team (audience gasps with envy)
Q: On the mixed group problem – if you pick 10% for the experiment, why not pick 10% of the rest for the control? (Those users are more likely to book and are not excluded.)
A: Even if you only choose 10%, it will still have too many mixed users.
Q: Instead of looking at conversion, why not weight by sales value?
A: We do test on revenue, but those tests are a little different.
Q: Do you ever segment by frequency of use? When you start an experiment, you'll mostly get the more frequent users; frequent users can blanket the results, and only over time do the less frequent visitors begin to enter the mix.
A: We do segment out users
Q: Regression – was this a recommendation to use regression on cohorts?
A: The nice thing about regression is that you can do it all at once.
You can include all the factors in one big analysis; it gives you the estimates and whether the effects hold up. You can model many interactions at once. It's a little harder to automate and a little complicated.
Regressions can be done wrong in a number of ways, but it's a pretty stable approach.
Q: Multi-armed bandit testing? Any reason you've not implemented it?
A: We have not tried it. We found systems
The ranking function might be done best by a multi-armed bandit: http://www.cs.nyu.edu/~mohri/pub/bandit.pdf
Q: What is the ROI of engineering?
A: We haven't gotten into that; it's not possible to answer mechanistically.
One of the things we didn't mention: we also lump in key business metrics, e.g., revenue.
Q: With multiple experiments impacting the same KPI, do they interact?
A: Great question. Interaction between experiments happens often. There's no single answer.