How To Clean A Marketing Research Questionnaire Dataset


Depending on who you talk to, cleaning market research data files for use with advanced statistical analyses takes between 50% and 95% of the total time spent working on a dataset. Careful cleaning is the only way to ensure that the results you see and the conclusions you draw reflect the treatment and conditions, rather than some tiny error dismissed long ago as inconsequential.

Hopefully, the tactics that follow will help you ensure the advanced statistical analyses you run are based on the best quality dataset possible.

First, and perhaps more important than anything else, save the first dataset you receive in a separate folder and then never touch it. This is your safety net for that one day down the road when you accidentally overwrite the entire file you’re working from. That day will come, and you need to build habits that keep it from becoming a nightmare. Never touch that separate copy.
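One way to make the "never touch it" rule harder to break is to script the archive step. The sketch below is only an illustration: the function name `archive_raw_file` and the folder layout are assumptions, not part of any particular workflow. It copies the delivered file into a separate folder, refuses to overwrite an existing archive, and removes write permission from the copy.

```python
import shutil
import stat
from pathlib import Path


def archive_raw_file(source: Path, archive_dir: Path) -> Path:
    """Copy the delivered file into archive_dir and mark the copy read-only.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    if target.exists():
        # Refuse to overwrite: the archived original must never change.
        raise FileExistsError(f"{target} is already archived")
    shutil.copy2(source, target)
    # Drop write permission so an accidental save fails loudly.
    target.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    return target
```

A call such as `archive_raw_file(Path("survey_export.csv"), Path("00_raw_do_not_touch"))` (both names hypothetical) would be run once, on the day the file arrives.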


Second, make it a habit to regularly save additional versions of the cleaned datafile. At a minimum, save a new file once a day. Even better, save a new file every time you make a major revision to the dataset. Never simply click on ‘Save.’ Use ‘Save As’ and give the file a new name, preferably one that ends in a number so that all the files remain in order. As before, this will come in handy on that one day you realize you implemented a correction…incorrectly, and need to return to the previous version without losing six days of work.
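If you clean data with scripts rather than a point-and-click package, the ‘Save As’ habit can be automated. This is a minimal sketch under assumed conventions: the `_v001` naming pattern and the helper names are illustrative, not a standard. Zero-padded numbers keep the files in order when sorted by name.

```python
import re
import shutil
from pathlib import Path


def next_version_path(folder: Path, stem: str, suffix: str) -> Path:
    """Return folder/<stem>_vNNN<suffix>, one above the highest version present."""
    pattern = re.compile(rf"{re.escape(stem)}_v(\d+){re.escape(suffix)}")
    versions = []
    for path in folder.iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            versions.append(int(match.group(1)))
    next_number = max(versions, default=0) + 1
    # Zero-padding keeps v002 ahead of v010 in an alphabetical listing.
    return folder / f"{stem}_v{next_number:03d}{suffix}"


def save_new_version(working_file: Path, folder: Path) -> Path:
    """Copy the working file to the next numbered version; never overwrite."""
    folder.mkdir(parents=True, exist_ok=True)
    target = next_version_path(folder, working_file.stem, working_file.suffix)
    shutil.copy2(working_file, target)
    return target
```

Calling `save_new_version` after each major revision leaves a trail of numbered files, so rolling back means opening the previous number.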

Begin your formal data checking process by running a frequency distribution of every variable.

  • Check that the numbers of men and women match your expectations, and that the distributions of age, income, education, ethnicity, and children make sense. For example, you know that far fewer than 50% of people have college degrees and far more than 10% of people have high school diplomas.
  • Specifically check that numbers that ought to be very small or very large are actually very small or very large. Make sure most people recognize Tide, Pepsi, and Oprah and that very few people recognize the Acco brand of paper clips.
  • Look for answer options that were selected by no one. Should these counts really be zero, or did the variables in the datafile inadvertently shift over by one column?
  • Follow through the logic of the skip patterns. If people answered questions in a certain way, do their follow up answers match those responses? Or did the rows or columns shift here too?
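The checks above can be sketched in pandas. Everything in this example is an assumption made for illustration: the toy data, the column names, the codebook, and the rule that `pet_type` is asked only when `owns_pet` is 1. It counts every offered answer option, lists options no one selected, and flags rows that break the skip pattern.

```python
import pandas as pd

# Hypothetical toy data; column names and codes are illustrative only.
df = pd.DataFrame({
    "gender": [1, 2, 2, 1, 2, 1],
    "owns_pet": [1, 0, 0, 1, 0, 1],         # 1 = yes, 0 = no
    "pet_type": [2, None, 3, 1, None, 2],   # asked only if owns_pet == 1
})

# Assumed codebook: every answer option offered for each question.
codebook = {"gender": [1, 2, 3], "owns_pet": [0, 1], "pet_type": [1, 2, 3, 4]}


def frequency_table(series: pd.Series, valid_codes: list) -> pd.Series:
    """Count each offered code, including codes that nobody selected."""
    return pd.Series({code: int((series == code).sum()) for code in valid_codes})


# Frequency distribution of every variable, plus options selected by no one.
frequencies = {col: frequency_table(df[col], codes) for col, codes in codebook.items()}
unused_options = {
    col: [code for code, n in freq.items() if n == 0]
    for col, freq in frequencies.items()
}

# Skip-pattern check: non-owners should have no pet_type, owners should have one.
answered_but_skipped = df[(df["owns_pet"] == 0) & df["pet_type"].notna()]
missing_follow_up = df[(df["owns_pet"] == 1) & df["pet_type"].isna()]

for col, freq in frequencies.items():
    print(col, freq.to_dict())
print("Never selected:", unused_options)
print("Rows violating the skip pattern:", list(answered_but_skipped.index))
```

In this toy data, gender code 3 and pet type 4 are never selected, and one non-owner has a pet type recorded. Whether those are real zeros or signs of shifted columns is the judgment the checklist asks you to make.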


Check variable labeling and coding.

  • Make sure numeric values are coded as numeric variables, not string variables, and vice versa. The variable type determines the order in which answer options appear in your outputs. It also determines which statistical tests will be ‘turned on’ by your statistical software.
  • Check that missing responses aren’t recoded into zeros, thereby making them appear to be valid responses.
  • Check that ‘Don’t know,’ ‘All of the above,’ and other non-substantive responses are correctly labeled and immediately identifiable. If not, it’s possible that those 9s and 99s will be improperly included in t-tests or correlations, or treated as valid responses for calculating means and standard deviations.
  • Check that every single variable and response option is correctly coded. For example, make sure that the label for ‘Male’ actually matches responses from men. This simple mistake has already caused numerous published academic articles to be retracted and knowledge within a discipline to change. Don’t add to that list!
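As a rough illustration of the first three coding checks, here is a pandas sketch. The data, the 1 to 5 scale, and the use of 99 for ‘Don’t know’ are all assumptions; substitute the codes from your own questionnaire.

```python
import pandas as pd

# Hypothetical toy data; the scale and the 99 code are assumptions.
df = pd.DataFrame({
    "age": ["34", "27", "51", "45"],    # numeric values stored as strings
    "satisfaction": [4, 99, 5, 0],      # 1-5 scale, 99 = "Don't know"
})
NON_SUBSTANTIVE = [99]
VALID_RANGE = (1, 5)

# 1. Numeric values should be numeric variables, not strings.
age_was_numeric = pd.api.types.is_numeric_dtype(df["age"])
df["age"] = pd.to_numeric(df["age"], errors="raise")  # fails loudly on bad values

# 2. A 0 on a 1-5 scale may be a missing response recoded into a zero.
sat = df["satisfaction"]
is_non_substantive = sat.isin(NON_SUBSTANTIVE)
in_range = sat.between(*VALID_RANGE)
suspect_rows = df[~in_range & ~is_non_substantive]

# 3. Keep 99s and other out-of-range codes out of means and tests.
raw_mean = sat.mean()                    # wrongly includes 99 and 0
clean_mean = sat.where(in_range).mean()  # only valid scale points

print("age stored as numeric on arrival:", age_was_numeric)
print("rows to investigate:", list(suspect_rows.index))
print("raw mean:", raw_mean, "clean mean:", clean_mean)
```

On this toy data the raw mean is 27.0, an impossible value for a five-point scale, while the mean of the valid responses is 4.5.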

Remove untrustworthy data.

  • Apply standard data quality processes to ensure that low-quality data is not part of the final dataset. Research participants can become bored or distracted partway through data collection, which may require removing some or all of their data. Don’t leave low-quality data in the file simply because you need the sample size. Data errors are most dangerous when they appear in the smallest samples.
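The article does not specify which quality checks to apply, so the sketch below assumes two commonly used ones: speeding (finishing in under a third of the median time) and straightlining (giving the same answer to every item in a grid). The thresholds, column names, and data are illustrative assumptions, not recommendations.

```python
import pandas as pd

# Hypothetical toy data; columns and thresholds are assumptions.
df = pd.DataFrame({
    "resp_id": [101, 102, 103, 104, 105],
    "duration_sec": [610, 95, 580, 640, 720],
    "q1": [4, 3, 2, 5, 1],
    "q2": [5, 4, 2, 5, 2],
    "q3": [3, 3, 1, 5, 1],
    "q4": [4, 2, 2, 5, 3],
})
grid_items = ["q1", "q2", "q3", "q4"]

# Speeding: completed in less than one third of the median duration.
speed_cutoff = df["duration_sec"].median() / 3
is_speeder = df["duration_sec"] < speed_cutoff

# Straightlining: only one distinct answer across the whole grid.
is_straightliner = df[grid_items].nunique(axis=1) == 1

# Keep a record of what was removed and why before dropping anything.
removed = df[is_speeder | is_straightliner].assign(
    speeder=is_speeder, straightliner=is_straightliner
)
clean = df[~(is_speeder | is_straightliner)]

print(removed[["resp_id", "speeder", "straightliner"]])
print("kept", len(clean), "of", len(df), "respondents")
```

Keeping the `removed` table alongside the cleaned file makes each exclusion auditable later.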

Besides ensuring that your data is top quality, running all of these checks also serves as exploratory analysis. It helps you understand the basic findings well enough to generate hypotheses to test with more advanced statistics.

The next time you need to run advanced statistical analyses, make sure you leave enough time for data cleaning and exploratory analysis. You’ll be grateful you did!


Canadian Viewpoint is a one-stop market research data collection and fieldwork company. For over 40 years, we have been trusted by clients ranging from global Fortune 500 companies to local, boutique market, social, and academic research firms. We specialize in providing high-quality solutions for offline, online, qualitative, and quantitative fieldwork. As long-term members of the Insights Association, accredited members of the Canadian Research and Insights Council (CRIC), and corporate members of ESOMAR, we uphold the highest industry standards. Our diverse range of services includes sample, programming and hosting, mall intercepts, central location recruitment, mystery shopping, in-home usage tests (HUTs), sensory testing, shelf testing, computer-assisted telephone interviewing (CATI), facial coding, and other cutting-edge technologies.
