When Data Goes Bad: How Sour Survey Samples Nearly Ruined the Research Industry

By David M. Schneer, Ph.D.

Part 1

Ask any well-trained researcher what the number one concern with any quantitative study is, and they will tell you it’s respondent quality. That is, it’s about the sample, stupid. Without good sample, data truly is “garbage in, garbage out.”

With the advent of the web as a means to gather data, Merrill Research became a pioneer in online data collection and gradually moved away from telephone data collection (yes, we once used phones to collect data) to the web. We did this only after running many side-by-side comparisons to ensure that the web could provide the same level of data quality to which we and our clients had grown accustomed. And it did.

Over the first decade of web data collection, the online sample business both soared and improved. Sample became more targeted and less expensive, and we had more choices as new firms entered the market. Fortunately, with few exceptions, the quality of sample remained solid. The ’90s were truly a boon for the industry. But that was not to last.

We began seeing problems with B2B sample when supposedly qualified respondents could not answer questions that should have been basic for anyone in their field. This caused us to examine data down to the respondent level and to replace the roughly 10% of “bad data” we were regularly receiving. Even companies we considered to offer the very highest-quality sample began to show poor results! Over-sampling became part of our standard practice, and we stepped up our efforts to identify and eliminate the bad data.
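For context on what over-sampling means in practice: if a fixed share of interviews is expected to be discarded, the field target has to be inflated accordingly. A small hypothetical sketch follows; the 10% figure is the replacement rate mentioned above, while the target of 400 completes is invented for illustration.

```python
import math

# Hypothetical: how many interviews to field so that the target number of
# usable completes survives cleaning, given an expected bad-data rate.
target_completes = 400      # invented target, for illustration only
expected_bad_rate = 0.10    # the ~10% replacement rate described above

interviews_to_field = math.ceil(target_completes / (1 - expected_bad_rate))
print(interviews_to_field)  # 445 interviews fielded to keep 400 after cleaning
```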

Fast-forward a couple of years and we began to see the same thing happen with B2C samples, but the problem was even worse. Using multiple web sample partners for a single study, we were amazed to see very different results. The data were so different that we knew not all sample sources could possibly be “right.” We implemented even more stringent data checking (e.g., eliminating speeders, straight-liners, and respondents who gave contradictory answers). This “bad data” comprised about 30% of the data collected. More cleaning and more replacement interviews followed. This took considerable professional time and was not a process that lent itself well to automation.
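To make those checks concrete, here is a minimal sketch of respondent-level screening of the kind described above, using pandas. The column names, example values, and thresholds are hypothetical; the article does not specify the exact rules Merrill applies.

```python
import pandas as pd

# Illustrative respondent-level data: one row per completed interview.
# Columns, values, and thresholds are hypothetical, not Merrill's actual rules.
df = pd.DataFrame({
    "respondent_id":    [1, 2, 3, 4],
    "duration_seconds": [480, 95, 520, 610],   # total interview length
    "q10_a": [3, 4, 5, 5],                      # items from a ratings grid
    "q10_b": [4, 4, 5, 2],
    "q10_c": [2, 4, 5, 4],
    "q5_owns_car":  ["Yes", "Yes", "No", "Yes"],
    "q6_car_brand": ["Toyota", "Ford", "BMW", "Honda"],
})

grid_cols = ["q10_a", "q10_b", "q10_c"]

# Speeders: finished far faster than a plausible reading pace allows.
median_time = df["duration_seconds"].median()
df["flag_speeder"] = df["duration_seconds"] < 0.4 * median_time

# Straight-liners: gave the identical answer to every item in the grid.
df["flag_straightliner"] = df[grid_cols].nunique(axis=1) == 1

# Contradictions: mutually exclusive answers to related questions,
# e.g. claiming to own no car yet naming the brand of their car.
df["flag_contradiction"] = (df["q5_owns_car"] == "No") & df["q6_car_brand"].notna()

df["bad_respondent"] = df[
    ["flag_speeder", "flag_straightliner", "flag_contradiction"]
].any(axis=1)

clean = df[~df["bad_respondent"]]
print(f"Removed {df['bad_respondent'].mean():.0%} of interviews; {len(clean)} usable remain.")
```

In practice, flagged interviews are reviewed and replaced with fresh completes rather than silently dropped, which is exactly why the process consumed so much professional time.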

Eventually, we added extra pre-screening to our surveys to “catch the bad guys” – the ones who pretend to be someone they are not, and those who don’t take the survey experience seriously, either by rushing through it or by giving false answers. We then compared data collected from the same sample provider for the same study, with and without the pre-screener we crafted, and found big differences. The data were so different that it became evident pre-screening matters.
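As an illustration of what such a pre-screen can look like (the specific questions Merrill crafted are not described in this article), a common industry approach combines a trap question, an identity-consistency check, and a fictitious “red herring” item. The sketch below assumes hypothetical question IDs and answer wording.

```python
def passes_prescreen(answers: dict) -> bool:
    """Return True if a respondent clears the pre-screen.

    `answers` maps hypothetical question IDs to the respondent's answers.
    The checks shown are common industry practices, not Merrill's exact screener.
    """
    # Trap question: instructs the respondent to pick one specific option;
    # anyone answering differently is not reading the questions.
    if answers.get("trap_select_somewhat_agree") != "Somewhat agree":
        return False

    # Identity consistency: the claimed job title must line up with the
    # department claimed in a separate question earlier in the screener.
    title = answers.get("job_title", "").lower()
    dept = answers.get("department", "").lower()
    if "network engineer" in title and dept != "it":
        return False

    # Red herring: claiming familiarity with a fictitious brand is a strong
    # signal of an over-claimer pretending to qualify.
    if answers.get("familiar_with_fictitious_brand") == "Very familiar":
        return False

    return True


# Example: a respondent who "recognizes" the fictitious brand is screened out.
print(passes_prescreen({
    "trap_select_somewhat_agree": "Somewhat agree",
    "job_title": "Network Engineer",
    "department": "IT",
    "familiar_with_fictitious_brand": "Very familiar",
}))  # -> False
```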

There is no question that data can be collected more quickly today than ever before. And there is no question that the cost of data collection has never been lower. However, our empirically driven conclusion is that much of the data collected via the web is over-valued (i.e., even if it were free and quick, it would not be worth using). For some studies, the error caused by poor-quality sample is small enough that decisions drawn from the data won’t be affected. But from what we have seen, conclusions drawn from other studies will be increasingly wrong: garbage in, garbage out.

Merrill Research – Experience You Can Count On