Hey Ali, thanks for reading! On your second point: yes, you want to keep the test and train sets entirely separate, so you run the categorical-variable conversion process again on the test set. As for the percentile idea, it sounds interesting, but I’m not entirely sure I understand it. Just to confirm, is each categorical value represented as the percentile of its corresponding target value? Perhaps an example would help.

If this is the idea, wouldn’t the relative distances between target values be lost and replaced with rank comparisons? That gets me thinking, though: if we were to incorporate the percentiles alongside the actual target value, the model would be able to reconstruct the entire distribution, which is more informative than any single statistic.
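
To check that I'm reading the idea correctly, here is a minimal sketch of my interpretation; it may not be what you had in mind. It assumes each category is first mapped to the mean of its target values, and that the percentile is then the rank of that mean among all category means. The function names and the toy data are made up for illustration.

```python
from collections import defaultdict


def mean_target_encoding(categories, targets):
    """Map each category to the mean of its target values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def percentile_encoding(mean_by_category):
    """Replace each category mean with its rank, scaled to [0, 1].

    Simplification: tied means are not averaged; they get consecutive
    ranks in whatever order the sort leaves them.
    """
    ordered = sorted(mean_by_category, key=mean_by_category.get)
    denom = max(len(ordered) - 1, 1)  # avoid dividing by zero for one category
    return {cat: rank / denom for rank, cat in enumerate(ordered)}


# Toy data (hypothetical): category "c" has a much larger target than "a" or "b".
categories = ["a", "a", "b", "b", "c", "c"]
targets = [1.0, 3.0, 2.0, 4.0, 99.0, 101.0]

means = mean_target_encoding(categories, targets)
pcts = percentile_encoding(means)

print(means)  # {'a': 2.0, 'b': 3.0, 'c': 100.0}
print(pcts)   # {'a': 0.0, 'b': 0.5, 'c': 1.0}
```

In this toy case the means are 2, 3, and 100, but the percentiles come out evenly spaced at 0, 0.5, and 1, so the large gap between "b" and "c" disappears. Keeping both columns as features would give the model the ordering and the magnitudes.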

I look forward to your response! Thanks.

ML & CS enthusiast. Let’s connect: https://www.linkedin.com/in/andre-ye. Check out my podcast: https://open.spotify.com/show/0wUzfk9C6nnH9G0tKXudUe
