I would like to know if my methodology was ‘correct’:

I am trying to conduct an experiment on my stores.

I would like to find out the effect of a marketing campaign on the number of transactions.

Only about 20% of the stores are participating in the marketing campaign.

The original methodology was to use the entire 20% as the experimental group and the remaining 80% as the control group. Unfortunately, these two groups are incomparable in terms of number of transactions.

When plotted as box-and-whisker plots side by side, their distributions are not comparable (mean, median, quartiles, min, max, etc.).
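To make the comparison concrete, here is a minimal sketch of computing the statistics a box-and-whisker plot displays (plus the mean) for each group. The function name `summarize` and the transaction counts are made up for illustration; they are not my real data.

```python
import statistics

def summarize(transactions):
    """Return the statistics a box-and-whisker plot displays, plus the mean."""
    # "inclusive" treats the data as the whole population of stores in the group.
    q1, median, q3 = statistics.quantiles(transactions, n=4, method="inclusive")
    return {
        "min": min(transactions),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(transactions),
        "mean": statistics.mean(transactions),
    }

# Hypothetical monthly transaction counts per store (illustrative only).
campaign_stores = [120, 135, 150, 160, 410]
other_stores = [80, 95, 100, 110, 120, 130, 900]

print(summarize(campaign_stores))
print(summarize(other_stores))
```

Printing both summaries side by side shows the same information as the two box plots, in numeric form.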

So I filtered out the ‘outlier stores’ at each end until the box-and-whisker plots for the two groups were practically identical. I then ran a t-test on the filtered groups and failed to reject the null hypothesis (meaning that these groups are statistically the same prior to the promotion).

Now that we have two comparable groups at time -1 (before the promotion), we run the promotion for a month.

After the promotion month is over, we take the number of transactions from each group and run another t-test.

We rejected the null hypothesis in favor of the alternative hypothesis, which is that the two groups are now statistically different, at an alpha of 0.05.
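For reference, here is a rough sketch of the kind of two-sample t-test described above. I did not specify which variant I used, so this assumes Welch's unequal-variance version; the function name `welch_t` and the sample numbers are hypothetical. It returns only the test statistic and the approximate degrees of freedom. Turning those into a p-value requires a t-distribution, which a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator) divided by the group sizes.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical post-promotion transaction counts per store (illustrative only).
campaign_group = [152, 160, 149, 171, 158]
control_group = [140, 151, 138, 149, 145]

t_stat, dof = welch_t(campaign_group, control_group)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```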

My first question is: is this methodology okay?

My second question is: as an alternative to using box-and-whisker plots and removing outliers until both groups’ descriptive statistics are similar, can I assume a normal distribution and use the standard deviation to remove outliers and create comparable groups within my population?

The box-and-whisker method worked to get comparable groups, as confirmed by the t-test, but it is very manual. So I would like to create an automated method and was wondering whether assuming a normal distribution and removing outliers by standard deviation would be plausible.
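To show what I mean by the automated version, here is a minimal sketch that keeps only the stores within k standard deviations of their group's mean. The function name `within_k_stdev`, the cutoff `k=2.0`, and the numbers are all assumptions for illustration, and I am assuming the filter would be applied to each group separately.

```python
import statistics

def within_k_stdev(values, k=2.0):
    """Keep values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mean) <= k * sd]

# Hypothetical transaction counts for one group of stores (illustrative only).
stores = [100, 102, 98, 101, 99, 103, 97, 100, 500]

kept = within_k_stdev(stores, k=2.0)
print(kept)
```

One thing I am unsure about with this approach: the mean and standard deviation are themselves pulled by the very outliers being removed, so the result depends on the choice of `k` and on how skewed the transaction counts are.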

Sorry for the long read.

Thank you