Well, the legal issues regarding modification of the HTML present the biggest problem; otherwise MySpace would have been completely overrun by now! ;-)
With such draconian legal constraints, draconian measures would probably have to be implemented. What I would do is solve the problem client-side. Think about email clients: they need to be able to render a wide variety of HTML without compromising the user of the mail client. Thus, they won't execute any JavaScript and they'll refrain from retrieving external resources; in other words, they refuse to act on anything that we would normally protect against by removing it from the text. Just implement something similar and block any other browser.
If you have a spare domain lying around, and you don't mind users DoSing each other with image crashes or infinite JavaScript loops, just stuff the unsafe HTML there. They won't be able to steal cookies since there are no cookies to steal on that domain, and the threat is downgraded to that of visiting a random website on the web.
The final proposal requires the most work: create a parser that parses the document while keeping track of the original HTML, verbatim. The parser must be able to programmatically recognize all well-defined malformed HTML like <a href=url.html>. If it runs across a corrupt string of HTML that doesn't match its whitelist of "good" corrupt strings, it rejects the document outright. At the same time, it builds a DOM, which would be identical across all major browsers that the filter supports. This cross-browser consistency depends on the parser's ability to recognize when a corrupt string of HTML would be interpreted differently by different browsers. The DOM would also be created by inlining the HTML into a sandbox HTML document that emulates real-world conditions, so that later on, when analyzing the DOM, we could determine whether the user maliciously broke out of their container to wreak havoc on the document.
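To make the "whitelist of good corrupt strings" idea a bit more concrete, here's a rough sketch in Python of how one small piece (start tags) might look. Everything in it is my own assumption for illustration: the names `parse_start_tag` and `RejectedMarkup`, and the particular choice of which sloppy forms count as "well-defined" (unquoted values like `<a href=url.html>` are in; valueless attributes, duplicate attributes, and attributes jammed together without whitespace are out). A real filter would need a much longer, browser-verified list.

```python
import re

# Hypothetical whitelist of attribute forms whose behaviour we trust:
# double-quoted, single-quoted, and unquoted values (as in <a href=url.html>).
ATTR = re.compile(
    r"""\s+([a-zA-Z][a-zA-Z0-9-]*)   # attribute name, must follow whitespace
        \s*=\s*
        (?:"([^"<>]*)"|'([^'<>]*)'|([^\s"'<>=`]+))
    """,
    re.VERBOSE,
)
START_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")
TAG_END = re.compile(r"\s*>")


class RejectedMarkup(ValueError):
    """Raised when the markup is not on the whitelist of known forms."""


def parse_start_tag(source, pos=0):
    """Parse one start tag; return (name, attrs, verbatim_text, end_offset)."""
    opened = START_TAG.match(source, pos)
    if opened is None:
        raise RejectedMarkup("not a start tag")
    name = opened.group(1).lower()
    attrs = {}
    i = opened.end()
    while True:
        found = ATTR.match(source, i)
        if found is None:
            break
        key = found.group(1).lower()
        if key in attrs:
            # Duplicate attributes are rejected rather than resolved.
            raise RejectedMarkup(f"duplicate attribute: {key}")
        # Exactly one of the three value groups is set for any match.
        attrs[key] = next(g for g in found.group(2, 3, 4) if g is not None)
        i = found.end()
    closed = TAG_END.match(source, i)
    if closed is None:
        # Anything left over is a form we do not recognize: reject outright.
        raise RejectedMarkup(f"unrecognized markup at offset {i}")
    end = closed.end()
    # The verbatim slice is kept so the original HTML is never rewritten.
    return name, attrs, source[pos:end], end
```

Calling `parse_start_tag('<a href=url.html>')` gives back the tag name, the attributes, and the untouched original text, while something like `<a href="x"title="y">` raises `RejectedMarkup` instead of being guessed at.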
After finishing the parsing process successfully, it would then traverse the DOM and validate all the tags and attributes. Blacklisted content would immediately result in failure, and if you want to be lenient you would let everything else through and ensure that your blacklist of browser features is thoroughly up-to-date. Preferably you'd perform a code audit of all open-source browsers for hidden "features". You could also maintain a whitelist, and, for added leniency, use some fuzzy text matching algorithms to detect when a user made a typo.
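Here's a minimal sketch of that traversal in Python, assuming the DOM has already been boiled down to nested `(tag, attrs, children)` tuples. The tag and attribute lists are made-up placeholders, not a vetted policy, and I've taken "fuzzy matching" to mean the standard library's `difflib.get_close_matches`, used only to produce a "did you mean" hint when something isn't on the whitelist.

```python
from difflib import get_close_matches

# Placeholder policy for illustration only; a real list needs careful review.
ALLOWED = {
    "p": set(),
    "b": set(),
    "em": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}
BLOCKED_TAGS = {"script", "iframe", "object", "embed"}


def hint(word, candidates):
    """Return a 'did you mean' suffix if word looks like a typo of a candidate."""
    guess = get_close_matches(word, list(candidates), n=1, cutoff=0.6)
    return f" (did you mean '{guess[0]}'?)" if guess else ""


def validate(node, errors=None):
    """Walk a (tag, attrs, children) tree and collect every policy violation."""
    errors = [] if errors is None else errors
    tag, attrs, children = node
    if tag in BLOCKED_TAGS:
        errors.append(f"blocked tag <{tag}>")
    elif tag not in ALLOWED:
        errors.append(f"unknown tag <{tag}>" + hint(tag, ALLOWED))
    else:
        for name in attrs:
            if name.startswith("on"):
                # Event handler attributes are always refused.
                errors.append(f"blocked attribute '{name}' on <{tag}>")
            elif name not in ALLOWED[tag]:
                errors.append(
                    f"unknown attribute '{name}' on <{tag}>"
                    + hint(name, ALLOWED[tag])
                )
    for child in children:
        if not isinstance(child, str):  # plain strings are text nodes
            validate(child, errors)
    return errors
```

An empty list means the document passed. Since we can't touch the HTML ourselves, the hint is just there so the user can fix the typo and resubmit.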
If the DOM validation succeeds, the HTML is good. In short, there are two stages:
1. Generate a DOM from the document, using well-known behaviors for malformed HTML and rejecting too-badly-formed HTML
2. Traverse the DOM looking for legitimate features that need to be blocked
Back to reality: what we could start doing is profiling inconsistencies in how the parsers of the major browsers transform HTML into DOMs, covering everything from simple things like missing tags to full-fledged angle-bracket, missing-equals-sign attribute frenzies.
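The bookkeeping side of that profiling could be as simple as the Python sketch below. It assumes you've already got some way of feeding a test snippet to each browser and dumping the resulting DOM as indented text (that harness is the hard part and isn't shown); `normalize`, `group_by_dom` and `is_consistent` are names I made up, and the engine names are placeholders.

```python
from collections import defaultdict


def normalize(dump):
    """Drop trailing whitespace and outer blank lines so only structure counts."""
    return "\n".join(line.rstrip() for line in dump.strip().splitlines())


def group_by_dom(dumps):
    """Group engine names by the DOM dump each produced for the same input.

    dumps maps an engine name to its serialized DOM for one test snippet.
    """
    groups = defaultdict(list)
    for engine, dump in dumps.items():
        groups[normalize(dump)].append(engine)
    return sorted(sorted(names) for names in groups.values())


def is_consistent(dumps):
    """True when every engine built the same DOM for the snippet."""
    return len(group_by_dom(dumps)) <= 1
```

Any snippet for which `is_consistent` comes back `False` is a candidate for the parser's "reject outright" list, since we can't predict what the user's browser will make of it.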
But... if any company is stupid enough to go this route, they'll probably get it all wrong. In that case, security by obscurity is always a good second defense. ;-)
HTML Purifier - Standards Compliant HTML filtering