The filtering capabilities available to most users are pretty robust, depending on what you use to interact with the Fediverse. I think it would be possible to filter out problematic bots, users, and even whole domains with the right kind of software.
Such a system might be constructed for one’s own scraping needs by taking any one of the current frontends or backends and customizing its behavior so that it mitigates issues, or ingests or ignores data, based on your own inputs. That way your model could be “riding along on a human surfboard with human guidance.”
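
To make that a bit more concrete, here is a rough sketch of what the ingest/ignore step could look like. It is only an illustration under some assumptions: posts are plain dicts with an `account` entry carrying an `acct` string and a `bot` flag (loosely modeled on how Mastodon's API describes accounts, though other software may differ), and `FilterRules`, `should_ingest`, `filter_posts`, and the `example.social` default are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FilterRules:
    """Human-maintained block lists (hypothetical structure, not a real API).

    Entries are expected to be lowercase.
    """
    blocked_domains: set[str] = field(default_factory=set)
    blocked_accounts: set[str] = field(default_factory=set)
    skip_bots: bool = True


def domain_of(acct: str, local_domain: str) -> str:
    """Return the domain part of 'user@domain', or local_domain if there is none."""
    _, sep, domain = acct.partition("@")
    return domain.lower() if sep else local_domain.lower()


def should_ingest(post: dict, rules: FilterRules,
                  local_domain: str = "example.social") -> bool:
    """Decide whether a scraped post passes the human-defined rules."""
    account = post.get("account", {})
    acct = account.get("acct", "").lower()

    # Assumption: the account dict exposes a boolean 'bot' flag.
    if rules.skip_bots and account.get("bot", False):
        return False
    if acct in rules.blocked_accounts:
        return False
    if domain_of(acct, local_domain) in rules.blocked_domains:
        return False
    return True


def filter_posts(posts: list[dict], rules: FilterRules,
                 local_domain: str = "example.social") -> list[dict]:
    """Keep only the posts that should be ingested."""
    return [p for p in posts if should_ingest(p, rules, local_domain)]
```

The rules live in ordinary sets, so a person can keep editing the block lists while the scraper runs and the next batch of posts picks up the changes. Something like `filter_posts(posts, FilterRules(blocked_domains={"spam.example"}))` would drop bot accounts and anything from that domain before the data ever reaches a model.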