Hey Ali, thanks for reading! On your second point: yes, you want to keep the test and train sets entirely separate, so you run the categorical variable conversion process again on the test set. As for the percentile idea, it sounds interesting but I’m not entirely sure I understand it. Just to confirm, is each categorical value represented as the percentile of its corresponding target value? Perhaps an example will help.
If this is the idea, wouldn’t the relative distances between target values be lost and replaced with rank comparisons? However, this gets me thinking: if we were to incorporate the percentiles alongside the actual target values, the model would be able to identify/form the entire distribution, which is more accurate than any single statistic.
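To check we're talking about the same thing, here's a rough sketch of how I'm reading the idea, in plain Python. The toy data and the names (`rows`, `cat_mean`, `cat_pct`, `cat_both`) are made up by me for illustration, and I'm assuming "percentile" means the rank of each category's mean target among all category means, which may not be what you had in mind.

```python
from statistics import mean

# Toy training data, invented for illustration: (category, target) pairs.
rows = [("a", 10.0), ("a", 12.0), ("b", 11.5), ("b", 12.5),
        ("c", 100.0), ("c", 104.0)]

# Group target values by category.
by_cat = {}
for cat, y in rows:
    by_cat.setdefault(cat, []).append(y)

# Mean target per category (plain mean target encoding).
cat_mean = {cat: mean(ys) for cat, ys in by_cat.items()}

# Assumed reading of the percentile idea: rank each category's mean
# among all category means and scale the rank to (0, 1].
ordered = sorted(cat_mean, key=cat_mean.get)
cat_pct = {cat: (i + 1) / len(ordered) for i, cat in enumerate(ordered)}

# Using both features together keeps the ordering and the distances.
cat_both = {cat: (cat_mean[cat], cat_pct[cat]) for cat in cat_mean}

for cat in ordered:
    print(cat, cat_mean[cat], round(cat_pct[cat], 3))
```

With these numbers, the mean for `c` is far from `a` and `b`, yet the percentiles come out evenly spaced. That is the loss of distance I mean, and pairing the two (as in `cat_both`) is what I was suggesting above.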
I look forward to your response! Thanks.