Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate
a hypothesis based on what they observe, and then devise a variety of experiments to test it. Given the
rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to
find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven framework
for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two
or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques
and then pair them up for statistical hypothesis testing. We also generate additional information for further
analysis of the hypotheses that are deemed significant. The number of hypotheses generated can be
very large, and many of them are very similar. We develop algorithms to remove redundant hypotheses and
present a succinct set of significant hypotheses to users. We conducted a set of experiments to show the
efficiency and effectiveness of the proposed algorithms. The results show that our system can help users (1)
identify significant hypotheses efficiently; (2) isolate the reasons behind significant hypotheses efficiently;
and (3) find confounding factors that, together with discovered significant hypotheses, form instances of Simpson’s Paradox.
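
As an illustration of the pipeline summarized above, the following Python sketch enumerates sub-populations as attribute=value patterns, pairs patterns that differ in exactly one attribute value, and tests each pair. It is a minimal sketch under stated assumptions, not the algorithms developed in the paper: the function names, the brute-force enumeration (standing in for a frequent pattern miner), the two-proportion z-test, and the toy records are all hypothetical. It performs no redundancy removal and no multiple-testing correction.

```python
import math
from collections import defaultdict
from itertools import combinations


def frequent_patterns(records, attributes, min_support, max_len=2):
    """Map each attribute=value pattern to the records it covers, keeping
    patterns that cover at least `min_support` records. Brute-force
    enumeration (assumption); a real miner would prune the search space."""
    patterns = {}
    for k in range(1, max_len + 1):
        for attrs in combinations(attributes, k):
            groups = defaultdict(list)
            for rec in records:
                groups[frozenset((a, rec[a]) for a in attrs)].append(rec)
            for pattern, rows in groups.items():
                if len(rows) >= min_support:
                    patterns[pattern] = rows
    return patterns


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both groups are all-positive or all-negative
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def significant_hypotheses(records, attributes, target, positive,
                           min_support=20, alpha=0.05):
    """Pair sub-populations that differ in exactly one attribute value and
    keep the pairs whose rates of `target == positive` differ significantly."""
    patterns = frequent_patterns(records, attributes, min_support)
    found = []
    for p, q in combinations(patterns, 2):
        same_attrs = {a for a, _ in p} == {a for a, _ in q}
        if not same_attrs or len(p - q) != 1:
            continue  # not comparable: different attributes or >1 value differs
        rows_p, rows_q = patterns[p], patterns[q]
        x_p = sum(r[target] == positive for r in rows_p)
        x_q = sum(r[target] == positive for r in rows_q)
        p_value = two_proportion_p_value(x_p, len(rows_p), x_q, len(rows_q))
        if p_value < alpha:
            found.append((dict(p), dict(q), p_value))
    return found


def make(treatment, sex, good, total):
    """Build `total` toy records, the first `good` of which have a good outcome."""
    return [{"treatment": treatment, "sex": sex,
             "outcome": "good" if i < good else "poor"} for i in range(total)]


# Toy data (hypothetical): treatment A has a higher rate of good outcomes.
records = (make("A", "M", 40, 50) + make("A", "F", 40, 50)
           + make("B", "M", 25, 50) + make("B", "F", 25, 50))
hypotheses = significant_hypotheses(records, ["treatment", "sex"],
                                    "outcome", "good")
```

On this toy data, the comparisons between treatments are retained, both overall and within each sex, while the comparisons between sexes are not, because the outcome rates for the two sexes are identical.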
License type:
PublisherCopyrights
Funding Info:
Supported in part by Singapore Agency for Science, Technology and Research grant SERC 102 1010 0030