== String escaping/filtering functions are crypto ==

The advanced parts of XSS prevention

Outline:

1. Presenting the scenario (remember how we said HTML has lots of contexts?)
2. What a true filtering library should do; a sketch
3. Shopping for a filter library
4. Practical considerations
5. Considerations for format shifting (BBCode, Wikitext)

== Presenting the scenario: HTML, the ultimate use-case ==

Everything we've discussed up to this point involves cases where the user can't *change the context*. Plain text stays plain text regardless of what the user writes. If you want a user to be able to make his text bold, or underline it, or perform some other sort of formatting or organization, you now need to allow some subset of HTML. You may already be doing this with a function like nl2br(), which is relatively trivial and hard to get wrong. But once you add your first tag, questions arise:

1. What should I use to match this tag?
2. How do I let tags nest inside each other?
3. What attributes do I want to allow in this tag?
4. What are the valid values for the attributes inside this tag?

You might have been able to write a regular expression for that email address or that username, but you will not be able to write a regular expression for HTML. That's because HTML is recursive, and regular expressions cannot match arbitrarily nested structure.

== A sketch of a filtering library that works ==

For the longest time, the conventional wisdom was that it was impossible to write a secure HTML filtering library. Here we present a brief outline of a filtering library that works.

1. Parse, tokenize and build a DOM fragment tree of the document. You can use an algorithm like the one described by the HTML 5 specification, with the added bonus that you will interpret any malformed HTML the same way browsers do.
- This step is of unusual importance, because it eliminates a whole class of attacks against HTML filters: malformed HTML tags that are parsed one way by the filter, but another way by a browser. By constructing a DOM and then re-serializing it, you guarantee the output is well-formed and ensure all browsers treat your HTML the same way.
2. Walk through the DOM tree, performing the following operations:
- Remove or flatten nodes whose tag names are not on your whitelist of allowed elements. Since the text content inside may be valid even when the tag isn't, flattening is usually the better choice in terms of user-friendliness.
- Iterate through all attributes on an element. Remove attributes that don't match your whitelist of allowed attributes for that tag name. Attributes that remain should be validated according to their type as specified by the HTML specification. Depending on what you decide to allow, the most complicated algorithms will be those for URLs and CSS.
- If you're interested in standards compliance, you will need to move nodes around in order to ensure that the content models of all elements are satisfied.
3. Serialize again, to taste.

A few important details in this implementation:

* You use a whitelist, not a blacklist. Using a blacklist is a singular failing, because it means you fail to acknowledge the ingenuity of crackers and the mounds of legacy functionality that can cause arbitrary code execution.
* You completely and thoroughly validate all attributes. If you support CSS, that means building a CSS parser and going through every CSS property to ensure that it is valid.

There are 91 elements in the HTML 4.01 specification. Oh, and I omitted many, many implementation details. Happy coding!
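To make steps 1-3 concrete, here is a minimal sketch in PHP using the built-in DOMDocument parser. Everything in it is invented for illustration (the whitelists, the filterNode() helper, the sample input), and it skips the hard part entirely: validating the attribute values that survive.

    <?php
    // Tiny illustrative whitelists; not a recommendation.
    $allowedTags  = ['p' => true, 'b' => true, 'i' => true, 'a' => true];
    $allowedAttrs = ['a' => ['href' => true]];
    // Tags whose *contents* are code, not prose: remove, don't flatten.
    $dropTags     = ['script' => true, 'style' => true];

    function filterNode(DOMNode $node, array $allowedTags,
                        array $allowedAttrs, array $dropTags): void
    {
        // Snapshot the child list first: we mutate the tree as we walk it.
        foreach (iterator_to_array($node->childNodes) as $child) {
            if (!$child instanceof DOMElement) {
                continue; // text nodes pass through; the serializer escapes them
            }
            $tag = strtolower($child->tagName);
            if (isset($dropTags[$tag])) {
                $node->removeChild($child);
                continue;
            }
            // Depth-first, so any children we hoist below are already clean.
            filterNode($child, $allowedTags, $allowedAttrs, $dropTags);
            if (!isset($allowedTags[$tag])) {
                // Flatten: keep the children, drop the tag itself.
                while ($child->firstChild) {
                    $node->insertBefore($child->firstChild, $child);
                }
                $node->removeChild($child);
                continue;
            }
            // Drop attributes not whitelisted for this tag. A real filter
            // must also validate the values of the attributes that remain.
            foreach (iterator_to_array($child->attributes) as $attr) {
                if (!isset($allowedAttrs[$tag][strtolower($attr->name)])) {
                    $child->removeAttribute($attr->name);
                }
            }
        }
    }

    $dirty = '<p onclick="evil()">Hi <blink>there <b>friend</b></blink></p>';
    $doc = new DOMDocument();
    @$doc->loadHTML('<div>' . $dirty . '</div>'); // @ silences recovery warnings
    $root = $doc->getElementsByTagName('div')->item(0);
    filterNode($root, $allowedTags, $allowedAttrs, $dropTags);

    // Serialize only the fragment's children, not the wrapper <div>.
    $clean = '';
    foreach ($root->childNodes as $child) {
        $clean .= $doc->saveHTML($child);
    }
    echo $clean; // <p>Hi there <b>friend</b></p>

Note the depth-first recursion: a disallowed tag's children are filtered before they are hoisted, so nothing unfiltered is ever re-parented into the clean tree.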
== Shopping for a filter library ==

I don't expect you all to run out and write your own HTML filtering libraries now. So you'll need to know what to use. At this point, I should disclose a conflict of interest: I'm the author of HTML Purifier, a filtering library for PHP. So obviously I'm going to recommend that to you. If you're not on PHP? Make a system call to a PHP interpreter when you need to clean HTML. [ker-thunk!]

Quite frankly, though, the ideal library I've sketched above doesn't actually exist yet, not even in HTML Purifier (though we're working towards it, from a token-oriented model). Fortunately, you can get some measure of security if you restrict the HTML you are going to allow in your documents. Thus, here is Edward's easy checklist of things to look for in a filter library. Your answer should be YES to all of these questions:

1. Does it pass the XSS cheatsheet test? http://ha.ckers.org/xss.html
2. Does it use a whitelist?
3. Does it make the HTML well-formed when it's done? This means that it corrects malformed tags and performs tag-balancing.
4. Does it perform some checking on attributes? (Everyone cuts corners here, but it should at least filter explicitly against the well-known attacks.)
5. Is it well known, well established and well used?

== Practical considerations ==

The most important thing when deploying an HTML filter is to keep it up to date. The second most important thing is to cache the filter's output. HTML filtration is invariably an expensive operation, and becomes more expensive the more comprehensive the filter is. The usual implementation is two columns in your database table: one with the original HTML submitted by the user, and another with the cleaned-up version.

The final thing: if you did take my advice and used a DOM builder, you'll end up in the sticky situation of having some HTML text that you need to insert into a DOM, which means you have to parse it again. If at all possible, select (or hack up) an implementation that doesn't serialize the cleaned HTML back to a string before handing it to you. It will make the DOM much happier.

== Considerations for format shifting (BBCode, Wikitext) ==

What we've talked about so far is an HTML -> HTML transformation. Since building a working HTML filter is considered difficult, many people implement their own input format, which they then convert into HTML, a la BBCode, Textile, Markdown or Wikitext. These all have well-known, well-defined implementations. If you decide to offer this to your users, use them.

Be warned, however: some of these markup languages don't actually protect you against XSS! (They were designed with trusted writers in mind.) Markdown is known to have this problem; Textile requires you to use a "restricted" mode (and I'm not fully convinced it's 100% effective). In such cases, you end up having to run the output through an HTML filter anyway, so having a good HTML filter certainly is helpful. Also, any BBCode implementation will have the same traits as a stripped-down HTML filter, just parsing against [b] instead of <b>. Rules 3 (does it tag-balance?) and 5 (is it widely used?) are as applicable as ever.
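To close with something concrete: a sketch of that format-shifting pipeline, assuming the classic PHP Markdown library (whose entry point is Markdown()) and HTML Purifier as the filter. The variable names and the allowed-element list are placeholders, not recommendations.

    <?php
    require_once 'markdown.php';          // classic PHP Markdown
    require_once 'HTMLPurifier.auto.php'; // ships with HTML Purifier

    $config = HTMLPurifier_Config::createDefault();
    // Restrict output to the handful of elements your site actually needs.
    $config->set('HTML.Allowed',
        'p,br,b,i,em,strong,code,pre,blockquote,ul,ol,li,a[href]');
    $purifier = new HTMLPurifier($config);

    // Untrusted input: Markdown passes raw HTML through untouched...
    $userText = "Hello <script>alert(1)</script> *world*";
    $html     = Markdown($userText);

    // ...so the converted output still needs a real HTML filter.
    $clean = $purifier->purify($html);

    // Per the practical considerations above: store $clean alongside the
    // original input and serve the cached copy; purification is expensive.
    echo $clean;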