Against using metaphors too much:
"I was like 'why don't you just say it in terabytes of RAM?'" -- ekate

== String is not a type ==

The basics of XSS prevention
Which also happen to cover header splitting and SQL injection and a whole host of other nasties

Outline:

1. What is XSS?
2. Why is XSS bad?
3. How do we stop XSS?
4. Why is (3) a bad approach?
   also, dirty details of escaping in PHP
5. A conceptual model of strings in web applications
   also, a generalization to other forms
   (maybe also present the conceptual model for XSS)
6. The "right way": Make the tools do it for you
7. The "right way" in practice
8. Bonus! The UTF-8 story

== What is XSS? ==

XSS = Cross-Site Scripting

A fundamental part of the security of the web stems from the "Same-Origin policy", which restricts web sites from accessing information belonging to other web sites (on different domain names). XSS means that an attacker is able to get *his* script running on *your* domain, bypassing the Same-Origin policy.

Basic XSS scenario:

    <?php echo $description; ?>

What if:

    $description = '">';

then we get:

    " />

Presto, instant JavaScript execution.

== Why is XSS bad? ==

If you actually run the previous code, you'll get an undefined function error. What might an attacker do?

A silent attack may simply gain access to your account using your session cookie or "remember me" cookie.

A web worm might take over your account immediately, and attempt to spread the XSS vector to other places and infect other people. Example: Samy.

If you're a high-profile target with lots of users, or an open-source project with lots of users, the stakes are high. Even an XSS attack on a small scripts.mit site undermines trust and puts your users at risk.

== How do we stop XSS? ==

The first impulse of many people is "Nuke the scripty parts": str_replace() out any JavaScript, script tags, and so on. Don't do this.
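To see why nuking the scripty parts fails, here is a minimal sketch in Python. The filter function nuke_script_tags() is hypothetical and only stands in for the str_replace() approach; it is not taken from any real library.

```python
def nuke_script_tags(markup: str) -> str:
    """Naive blacklist filter (hypothetical): delete literal script tags, keep everything else."""
    return markup.replace("<script>", "").replace("</script>", "")


# Removing the inner tags splices the outer fragments back into a working tag.
nested = "<scr<script>ipt>alert(1)</scr</script>ipt>"
print(nuke_script_tags(nested))  # <script>alert(1)</script>

# No script tag at all, so the filter has nothing to remove.
handler = '<img src="x" onerror="alert(1)">'
print(nuke_script_tags(handler) == handler)  # True
```

Every patch to a filter like this (loop until nothing changes, ignore case, strip event handlers) tends to leave another gap, which is why the rest of these notes go a different way.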
[a full explanation would involve whitelists and a case where you might actually want to keep those things]

The next response is to "escape" the data.

In PHP, this means using htmlspecialchars() or htmlentities()
In Ruby, this means using CGI.escapeHTML()
In Python, this means cgi.escape() (html.escape() in Python 3)
(depending on the modules you're using)

== String is not a type ==

That is not the whole story. And in many cases, knowing just that is going to get you in trouble. We're going to be a little more sophisticated than that. We're actually going to build a conceptual model of strings, so that you understand why and when to apply escaping, and why the best practices work.

The very simplest story is that of plain-text: this is what a string would like you to believe it is. This text, however, can be structured in interesting ways, much like the text in a novel is structured into chapters and sections. Anything that is a "plaintext format" is eligible. Some notable web examples:

- HTML
- XML (yes, it's distinct)
- SQL
- JSON

And some not-so-webby examples:

- Rich Text Format
- LaTeX
- Shell code
- Any other programming language

When we *escape* some text, we are actually performing a format shift. Example:

    Plaintext          ->  HTML
    <bob@example.com>  ->  &lt;bob@example.com&gt;

Since plaintext to X is the easiest type of format shift, we're dealing with just this case in our first section. (We'll consider what happens if you want X to Y, such as Wikitext -> HTML, or even HTML -> safe HTML, later.)

Thus, the code we discussed earlier now means this:

    $html = htmlspecialchars($plaintext);
    $sql = mysql_real_escape_string($plaintext);

The story is not complete, though. One of the uncanny things about HTML is the fact that it is actually an amalgamation of many different formats.

    Regular text

There are three different forms in the code above:

1. http://example.com => URL (technically URI)
2. Double-quoted/single-quoted text => Text inside an HTML attribute
3. Regular text => Regular HTML text

And they have DIFFERENT rules.

1. javascript:alert(1); => this is active JavaScript when treated as a URL, but is ordinary text for an attribute or regular text.
2. He said, "Oh no!" => Valid regular text and single-quoted text, but not double-quoted text
3. You're funny. => Valid regular text and double-quoted text, but not single-quoted text

So, let's look at our original example again:

    Plaintext          ->  HTML (regular text)
    <bob@example.com>  ->  &lt;bob@example.com&gt;

This can happen in other languages too, although a gaffe is usually much more obvious. Example, SQL:

    Plaintext  ->  SQL (string)
    bob        ->  bob

But if you then put it in the SQL query without quotes (which signify that the inner text is a string), you'll (usually) get an error:

    SELECT * FROM users WHERE name=bob;
    ERROR 1054 (42S22): Unknown column 'bob' in 'where clause'

MySQL thinks that the input data is a column, not a string.

== How to apply this conceptual model ==

First, a few quick questions: secure or not?

First thing to point out is that you don't actually know if this is an HTML document or a text document. Let's say it's an HTML document.

This is secure.

Insecure.

Insecure.

Insecure. As a general rule, an escaping function that's secure in one context is insecure in another context. The correct function you should have used was json_encode(). (side-note: under the original JSON spec, RFC 4627, a single string is not a valid JSON text, but you can abuse it that way)

Now, what are some practical considerations?

1. You need to know what format your strings are in. If you're dealing with plaintext, this boils down to "Is this text escaped or not?"
2. You need to know where your string is going. If it's going into an SQL query, it should be SQL escaped; if it's going into HTML, it should be HTML escaped.

These questions can be as hard or as easy to answer as you want them to be.

THE HARD WAY: (read fast) PHP magic quotes automatically escapes your data for insertion into the database. If you want to do anything to this data, you un-escape it first, do some processing, then escape it again.
Sometimes you HTML escape it before you put it in the database, sometimes you don't. When stuff comes out of the database, you have to remember to HTML escape only the stuff you didn't already HTML escape, otherwise you'll have double-escaped it.

THE EASY WAY: Turn off PHP magic quotes. Never escape anything until it's absolutely necessary; that means SQL escaping right before you put it in a database, and HTML escaping right before you output it to the web. The answers to our questions are trivial, then:

1. Unescaped.
2. It should be obvious from the string concatenation right below what you should be escaping it as.

Important Note: Data in the database should be UNESCAPED. When your database receives the SQL data, it internally de-SQL-escapes it, and returns a "pure" version of the data. If this seems confusing, think about the difference between a browser's regular view and the source code view. You only see the HTML escaped version when you view source (which is interpreting HTML text as plaintext); it looks "normal" otherwise.

Another important note: There's no distinction between user data, environment variables, and internally generated data. If it's plaintext, it needs to be format-shifted before you use it for something else. A cautionary tale:

    This page

Is XSS-vulnerable, if the page is:

    index.php?">Whee!

== Jack-Of-All-Trades-Escaping ==

Let's look at one common problem in the light of our new conceptual model.

    $safe = htmlspecialchars(mysql_real_escape_string($input));

We used the SQL and HTML escaping functions, so won't this be "safe" in both contexts?

Something odd has happened: our string was first packaged up in a format suitable for databases... and then packaged up again in a format suitable for a browser. The only case when this would work properly is if a browser decoded this data, and then passed the result directly into an SQL query. Which is ridiculous.
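The double packaging can be made concrete with a rough sketch in Python. Note the assumptions: mysql_style_escape() is a hypothetical stand-in for a backslash-style SQL escaper (it is not a real library function and should not be used against a real database), and html.escape() plays the role of htmlspecialchars().

```python
import html


def mysql_style_escape(value: str) -> str:
    """Hypothetical stand-in for a backslash-style SQL string escaper (illustration only)."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


name = "O'Reilly"

# Plaintext -> SQL string -> HTML: two format shifts stacked on top of each other.
stacked = html.escape(mysql_style_escape(name))

# The browser undoes only the outer (HTML) layer, so the SQL backslash leaks through.
shown_in_browser = html.unescape(stacked)

print(stacked)            # O\&#x27;Reilly
print(shown_in_browser)   # O\'Reilly

# A single format shift round-trips cleanly.
print(html.unescape(html.escape(name)))  # O'Reilly
```

The stray backslash is the visible symptom; the underlying problem is that only one of the two layers ever gets decoded.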
This is the story of PHP magic quotes, which automatically performs DB escaping on all input data. This is why you should not use it.

You can imagine the opposite process, however:

    $safe = mysql_real_escape_string(htmlspecialchars($input));

Decoded by the database, and then decoded by the browser. But as we've discussed earlier, there's no guarantee that a browser is going to be handling the data: it might be placed in a plaintext variable or handed to an external API that speaks JSON.

Let this be known: MORE does not mean BETTER when it comes to escaping functions.

== The "right way": DOM building, SQL bound queries ==

Up until this point, we've considered escaping strictly in the context of string concatenation. Now we're going to ask, "Does string concatenation really make sense?" There are two considerations:

EASE OF USE / COMPLEXITY OF THE FORMAT

If I asked you to make a bitmap file, you wouldn't concatenate bytewise sequences to draw the file pixel by pixel. You'd use a high-level function for drawing lines or typesetting text (and you'd probably convert it to JPEG or PNG before outputting it).

HTML straddles the critical point where it's simple enough for people to get by using string concatenation, but complex enough for tools for generating HTML to be popular (e.g. FrontPage or a form building library). Why not use a tool to generate ALL your HTML?

MAKE IT REALLY HARD TO "DO THE WRONG THING"

Concatenating a string without escaping it is unfortunate because it is both wrong and the simplest way of doing things. Even a veteran programmer may forget an escaping function if they've already called it five hundred times today. This is also why bulk-escaping inputs beforehand is so attractive: it means you don't have to repeat htmlspecialchars() ad nauseam in the output code.

What if you used something other than string concatenation, which in its simplest form didn't require you to call the escape function because it already knows what you did?
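As one possible answer to that question, here is a small sketch using Python's built-in sqlite3 module with a bound parameter. The in-memory database, the users table, and the find_user() helper are all made up for illustration.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("bob",), ("O'Reilly",)])


def find_user(connection: sqlite3.Connection, name: str) -> list:
    """Look up users by name; the value is bound, never spliced into the SQL text."""
    cursor = connection.execute("SELECT name FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor.fetchall()]


print(find_user(conn, "O'Reilly"))         # ["O'Reilly"]
print(find_user(conn, "bob' OR '1'='1"))   # []
```

There is no escaping call to forget: the quote in O'Reilly and the injection attempt are both treated as plain data.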
SQL - QUERY BUILDER

Instead of:

    mysql_query("SELECT * FROM users WHERE name='" . mysql_real_escape_string($name) . "'");

Use:

    $sth = $dbh->prepare('SELECT * FROM users WHERE name = ?');
    $sth->execute(array($name));

In the second case (for most databases), the value of $name is never concatenated into the SQL query; it's sent separately in a data format that you don't have to care about. Since you don't need to escape anything now, there's no worry about forgetting! Also, doesn't that code look so much nicer...

HTML, XML - DOM BUILDER

Instead of:

    $html = '<b>' . htmlspecialchars($text) . '</b>';

Use:

    $b = $doc->createElement('b');
    $b->appendChild($doc->createTextNode($text));

Extra benefit: your code is guaranteed to be well-formed, and you can easily run a validator on it natively.

A sidenote on newlines: Plaintext newlines are not displayed in HTML. As a result, you've probably used a function like nl2br() to preserve newlines. With a DOM builder, you probably want to create a helper function that adds a stream of text nodes and empty br tags as children to an empty element, using your neighborhood "explode" or "split" function.

    // explode() takes the separator first, then the string
    foreach (explode("\n", $text) as $i => $part) {
        if ($i !== 0) $b->appendChild($doc->createElement('br'));
        $b->appendChild($doc->createTextNode($part));
    }

SHELL CODE - MULTI-ARGUMENT EXEC (PYTHON)

Instead of:

    os.system("stella " + shlex.quote(name))

Use:

    subprocess.call(["stella", name])

(kind of bad example)

URLs - URL BUILDER

Instead of:

    $url = 'index.php?name=' . urlencode($foo);

Use:

    $url = 'index.php?' . http_build_query(array('name' => $foo));

== Practical considerations of "the right way" ==

1. Verbosity/Ease of use

Compare:

    <p>Welcome <em><?php echo htmlspecialchars($username); ?></em>. Here's a <a href="http://example.com">link</a></p>

With (DOM):

    $p = $doc->createElement('p');
    $p->appendChild($doc->createTextNode('Welcome '));
    // createElement() does not escape "&" in its value argument,
    // so the username goes in as a text node instead
    $em = $doc->createElement('em');
    $em->appendChild($doc->createTextNode($username));
    $p->appendChild($em);
    $p->appendChild($doc->createTextNode(". Here's a "));
    $a = $doc->createElement('a', 'link');
    $a->setAttribute('href', 'http://example.com');
    $p->appendChild($a);

The DOM-builder version is substantially longer. There ought to be a shorter version, but many languages simply don't have it. This is something someone should write!

Another option is XSLT, but somewhere along the line an XML document still has to be programmatically generated.

One last thing: these tools are very much XML oriented. In order to get HTML output, a little bit of coaxing is necessary. (Talk to me later if you want to know more about it)

2. Native or not? (external libraries)

Concatenation is universal. You can use it anywhere, and not have to worry about dependencies. Some builder APIs are also built into a language: most languages have SQL bindings by default (obviating the need for a small wrapper class to emulate bindings).

If a builder API is not native, however, you have to go get one. And that brings with it all of the pains of an external library (not going to go into that here). Or you may choose to code one yourself, a fine example of yak shaving.

3. Performance/Memory issues

Concatenation works great with the input/output model. Once you've printed that bit of HTML, it gets handed to the output buffer, and eventually the client; there is no intrinsic need for your application to store it in memory.

A DOM structure, however, needs to be kept in its entirety in memory. As a whole it weighs more than a string version of the HTML, because each node is a full-fledged object and carries the associated overhead. It also needs to be serialized at the end, an extra cost not associated with concatenation (you could say with concatenation you're serializing from the very beginning).
These issues can be offset by a robust caching infrastructure (Squid, for example), even with heavily dynamic content.

== The UTF-8 story ==

Up until now, we've assumed that we've been dealing with text. This isn't strictly true: I can send null bytes and other nonsensical byte sequences over HTTP requests. In some cases, this can lead to security vulnerabilities, but at its basic level, it's just another form of validation you have to perform on user input: a username should not contain null bytes.

Why not properly encoding data is bad:

- Breaks in strict contexts, like XML
- XSS multibyte attack

To deal with this issue effectively, however, you need to know a little bit about character encodings.

[basic explanation of what encoding is]

The bottom line is that, whatever encoding you're using (preferably UTF-8), you should ensure that any text you deal with is in that encoding. Additionally, you should make sure that characters forbidden by the HTML specification are stripped out. How do you do that?

In an 8-bit encoding, that means stripping out bytes:

- 0x00 to 0x1F, excluding 0x09 (the tab) and 0x0A (the newline), and possibly 0x0D (carriage return) if you're feeling charitable.
- 0x7F
- If you're using an ISO 8859 charset (usually 8859-1, known as Latin-1), 0x80 to 0x9F

In a Unicode based encoding, that means stripping out:

- U+0000, U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F to U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDEF, and characters U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF

Since UTF-8 is variable width, you also have to check whether or not it's well-formed.

Don't write that down. Depending on your language, various parts of these steps will have already been done for you.
Many others will not have. For PHP, you have to do it all yourself. You can find a list of these codepoints in the HTML 5 specification under the Parser pre-processing section; for codepoints you 100% must remove, check the XML specification's Char designation. Any language with Unicode support will normally check well-formedness automatically; codepoint stripping will usually have to be done manually.

One last note: htmlentities() versus htmlspecialchars(). htmlentities() never makes sense. If the input encoding is capable of expressing certain characters, there's no point in entity-izing them. If the input encoding is not (which is when you'd want to allow entities of those characters), there's still no way of entity-izing them, because they're not supported in the first place.

[rewrite this]
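The two UTF-8 steps from this section (check well-formedness, then strip forbidden codepoints) might look roughly like this in Python. The helper names is_forbidden() and clean_utf8() are made up for this sketch, and the noncharacter range is taken as the full Unicode block U+FDD0 to U+FDEF, which is an assumption on my part; check it against the specifications mentioned above before relying on it.

```python
def is_forbidden(codepoint: int) -> bool:
    """True for control characters, surrogates, and noncharacters (assumed ranges)."""
    return (
        codepoint <= 0x08                      # U+0000 to U+0008
        or codepoint == 0x0B                   # vertical tab
        or 0x0E <= codepoint <= 0x1F           # remaining C0 controls
        or 0x7F <= codepoint <= 0x9F           # DEL and C1 controls
        or 0xD800 <= codepoint <= 0xDFFF       # surrogates
        or 0xFDD0 <= codepoint <= 0xFDEF       # noncharacter block
        or (codepoint & 0xFFFE) == 0xFFFE      # U+xxFFFE and U+xxFFFF in every plane
    )


def clean_utf8(raw: bytes) -> str:
    """Decode strictly, then drop forbidden codepoints."""
    # Strict decoding raises UnicodeDecodeError on ill-formed UTF-8.
    text = raw.decode("utf-8")
    return "".join(ch for ch in text if not is_forbidden(ord(ch)))


print(repr(clean_utf8(b"bob\x00\x07")))       # 'bob'
print(repr(clean_utf8(b"tab\there\n")))       # 'tab\there\n'
```

Here the language does the well-formedness check for you (the strict decode), and only the codepoint stripping is manual, which matches the split described above.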