20
c/ai-innovations • ellioth37 • 2d ago

Shoutout to the old school way of cleaning training data

Everyone says you need huge, perfect datasets for good AI, but I had a messy set of 10,000 customer service logs from a client in Boise. Instead of using a fancy auto-cleaner, I wrote a simple script to find the 500 most common spelling mistakes and fixed those by hand. The model trained on that data worked way better than the one trained on the 'cleaned' version from a popular tool. Has anyone else had better luck with a basic fix than with a full automated clean?
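
A minimal sketch of what that kind of workflow could look like, for anyone who wants to try it. This is not the script from the post: the function names, the tiny `vocabulary` set, and the sample `logs` are all made up for illustration, and it assumes a plain word list is available to decide what counts as a known word. The script only surfaces the most frequent unknown tokens; the `corrections` mapping is the part a person fills in by hand.

```python
import re
from collections import Counter

# Simple word tokenizer (assumption: English text, apostrophes allowed).
WORD_RE = re.compile(r"[A-Za-z']+")

def count_unknown_words(logs, vocabulary):
    """Count tokens that are not in the known vocabulary."""
    counts = Counter()
    for line in logs:
        for token in WORD_RE.findall(line.lower()):
            if token not in vocabulary:
                counts[token] += 1
    return counts

def apply_corrections(line, corrections):
    """Replace whole-word misspellings using a hand-reviewed mapping."""
    def fix(match):
        word = match.group(0)
        replacement = corrections.get(word.lower())
        if replacement is None:
            return word
        # Keep a leading capital if the original word had one.
        return replacement.capitalize() if word[0].isupper() else replacement
    return WORD_RE.sub(fix, line)

# Hypothetical sample data standing in for the real logs.
logs = [
    "Custmer says the refund was not recieved",
    "customer recieved the wrong item",
    "Refund issued to custmer",
]
vocabulary = {"customer", "says", "the", "refund", "was", "not",
              "received", "wrong", "item", "issued", "to"}

# Step 1: list the most frequent unknown tokens for manual review.
candidates = count_unknown_words(logs, vocabulary).most_common(500)

# Step 2: a person reviews the candidates and writes the fixes by hand.
corrections = {"custmer": "customer", "recieved": "received"}

# Step 3: apply only the reviewed fixes.
cleaned = [apply_corrections(line, corrections) for line in logs]
```

Keeping the review step manual means domain terms like product names, which would also show up as unknown tokens, can be left out of `corrections` instead of being "fixed" into something wrong.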
3 Comments
sage_green
Yeah, fixing the most common mistakes by hand is the way to go. Automated tools often miss the weird, specific stuff that actually matters.
5
the_alice • 1d ago
Honestly, I've seen automated tools catch some pretty weird edge cases lately. They're getting better at the specific stuff.
4
nancyj11 • 1d ago
Totally, I did the same thing with some forum posts last month.
3