HTML insecurities

Tuesday, March 16, 2010

Comet

This brief post discusses the history and evolution of the web, noting some security issues and some solutions.

Way back when, the first webserver was created, serving HTML documents.  HTML was designed to show documents with hypertext links, and also to allow the documents to have semantic markup that would be displayed to the reader.

HTML version 1 included a HEAD section and a BODY section.  The HEAD section contained information about the document, such as the TITLE.  The BODY section contained the text, markup for the text (e.g. H1 headers, LI lists, STRONG emphasis), hyperlinks to other documents, and embedded images.

Web pages were treated by the web browser as static entities, with no scripting and no direct inclusion of other HTML documents (i.e. no EMBED or FRAMES).  Such documents posed little security threat, as they did not contain interpreted code or third-party documents (apart from images).

In principle, it is straightforward to use secure coding practices to ensure that a browser cannot be exploited merely by viewing a web page.  In practice, a number of vulnerabilities in the area of image display have continued to affect even modern browsers.

For the computer to draw an image, it must understand the image file format, and for security, the image processing code must not assume that the image file is valid.  Image formats often contain header fields which specify the lengths of other fields, and a maliciously crafted image may store values in those fields that are inconsistent with a valid image.

When the number of image types understood by the browser is small (e.g. JPG and GIF), the developer and QA of the image display program can validate the program by systematically testing the limits of such fields, ensuring that specifying 0 or an overlarge number does not result in overflows or invalid processing.  In practice, many authors of browsers rely on third-party libraries for image processing, and those libraries may not have undergone rigorous security testing.
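As a rough illustration of the kind of field checking described above, here is a minimal sketch in Python that reads the width and height from a GIF header and refuses values of 0 or values above a limit.  The function name read_gif_dimensions and the MAX_DIMENSION limit are hypothetical choices made for this example, not taken from any particular browser or library.

```python
import struct

# Hypothetical upper bound chosen for this sketch; a real decoder would pick
# a limit based on how much memory it is prepared to allocate.
MAX_DIMENSION = 16384


def read_gif_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) from a GIF header, rejecting suspicious values."""
    # The signature is 6 bytes, followed by two little-endian 16-bit fields.
    if len(data) < 10:
        raise ValueError("truncated GIF header")
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    width, height = struct.unpack("<HH", data[6:10])
    # Never trust the header: test the limits before using the values.
    if width == 0 or height == 0:
        raise ValueError("zero image dimension")
    if width > MAX_DIMENSION or height > MAX_DIMENSION:
        raise ValueError("image dimension exceeds limit")
    return width, height
```

The point of the sketch is only that every length or size read from the file is checked before it is used to allocate memory or index a buffer; a full decoder has many more such fields to check.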

In such cases where third-party code is relied upon, it is the responsibility of the developer/QA to validate it.  Alternatively, if it is given tentative trust (e.g. it is a well-known library), it is the responsibility of the developer/QA to monitor the third-party library's maintenance, and to update the browser with the newest version of the library, should a security fix be available.

The text of a hypertext document, along with its associated markup, is much simpler to validate for security issues, and so there are many web sites (including this one) that allow users to write HTML, allowing for more useful display of textual information than plain text allows on its own.

There are several security issues that come up when a web site wishes to display on a web page some textual data that is entered by the user.  Many papers have been written on this subject (cf http://www.owasp.org).

First among the security issues affecting web display of user-entered content is the cross-site scripting (XSS) vulnerability, which allows an attacker to abuse the trust of the reader, by presenting web content as owned by the hosting site.  When the web content is merely words, the security issue can be ameliorated using various legal notices, ensuring that the reader realizes that the text being shown is not the "official statement" of the hosting entity.

The web has evolved from its early beginnings of displaying marked-up text with included images to a more dynamic environment.  EMBED, ECMAScript/JScript/JavaScript/etc., frames, Java, and the ability for a web page to read and rewrite its own content [cf. Document Object Model (DOM)] make for additional security considerations when including user-supplied data on a web page.  One generally wouldn't want one's site to be rewritten in an arbitrary fashion that may induce users to enter their credentials on a different site, or to host malicious scripts which could read data from a user's computer and send it to a third party.

The underlying HTTP protocol itself has also gotten more complex, and the ability of HEAD tags to modify caching and redirects also presents a security challenge.  Some of these HEAD issues, such as specifying the code page of the included body text, also complicate the textual parsing, as the character "<" which specifies the start of an HTML tag could possibly be stored as a different byte (or bytes) than it is in ASCII.
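To make the encoding point concrete, here is a small Python sketch showing how the same "<" character is stored under a few encodings.  The choice of encodings (UTF-16, the EBCDIC code page cp037, and UTF-7) is just an illustrative assumption; the helper name encoded_forms is made up for this example.

```python
def encoded_forms(ch: str) -> dict[str, bytes]:
    """Encode one character under several encodings for comparison."""
    names = ("ascii", "utf-16-le", "utf-16-be", "cp037")
    return {name: ch.encode(name) for name in names}


forms = encoded_forms("<")
for name, raw in forms.items():
    # Print the raw bytes in hex so the differences are visible.
    print(name, raw.hex())

# UTF-7 can also spell "<" without using the ASCII byte 0x3C at all.
sneaky = b"+ADw-script+AD4-"
naive_scan_finds_tag = b"<" in sneaky   # a byte-level filter sees nothing
decoded = sneaky.decode("utf-7")        # but the decoded text is a tag
```

A filter that scans raw bytes for 0x3C would miss the UTF-7 form, which is one reason to decode the input using a known character set before looking for markup.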

When a web site wishes to include only user-supplied text, with some markup in HTML, it needs to ensure that this text conforms to a subset of HTML.  Any HTML tags included by the user must be examined for malicious potential and any malicious tags must be stripped from the document.

The task of only allowing known good tags and stripping the remainder is made easier by the backward-compatible design of the web.  If a web browser does not understand markup, it is free to ignore the tags, and the result should be a valid page.

So what are the good tags?  Well, I would include most hyperlinks, HR, NOBR, PR, TABLE, headers [H1, H2, H3, H4, H5, H6], lists [OL, UL, DL], PRE, CODE, KBD, TT, VAR, SAMP, PLAINTEXT, emphasis [STRONG, EM], SUB, SUP, and usually one also wishes to allow non-structural markup such as B, I, U, SMALL, and BLINK unless one really cares about forcing adherence to accessibility standards.  One might also allow additional presentational markup, such as text-alignment directives.  In addition, one may choose to include IMG and FONT tags, if one trusts that the reader's web browser will not be vulnerable to IMG or FONT parsing attacks.

Now the above list, even with such obvious inclusions as the markup needed within lists (LI) and tables, is not very difficult to validate, and one may even add additional tags if one wishes (e.g. if you wanted a user to be able to specify an annoying background sound, and you trust the reader's sound parser won't be subverted).  Please do comment on this blog post for any other HTML markup that you think should be added to the "good" list.

In order to parse the user-submitted marked-up text to only allow good HTML markup, the web server must be able to recognize the "<" and ">" characters which delimit the HTML markup from the text; good practice is to ensure that the HTML page is served with a specified character set.

The simplest way to parse user-entered data is to disallow markup, and to change all "<" and ">" to "&lt;" and "&gt;", which are the character entities for these characters.  This will prevent markup from being injected into the displayed page.
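The escaping approach can be sketched with Python's standard library, whose html.escape function replaces "<", ">", and "&" (and, by default, quote characters) with character entities.  The wrapper name escape_user_text is a hypothetical name used only for this example.

```python
from html import escape


def escape_user_text(text: str) -> str:
    """Return text that is safe to place in an HTML page as plain text."""
    # escape() rewrites &, <, and > as entities; quote=True also rewrites
    # double and single quotes, which matters inside attribute values.
    return escape(text, quote=True)


sample = escape_user_text("<script>alert(1)</script>")
```

Note that "&" must be escaped as well as "<" and ">", otherwise user text that already contains an entity would be displayed differently from how it was typed.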

If one wishes to allow some markup, a web server may provide a menu of good markup that can be selected to mark up text.  This can become unwieldy if one wishes to allow all good tags, however.

In some cases, one wishes to allow a user to be able to copy in an HTML document, with embedded markup.  In this case, one must parse the document and remove any tags which are not good.
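Here is a minimal sketch of such a parse-and-strip step, using Python's standard html.parser module.  It is an illustration under stated assumptions, not a vetted sanitizer: the ALLOWED_TAGS set is only a small illustrative subset of a "good" list, all attributes are dropped except an http/https href on A, and the names AllowlistSanitizer, safe_href, and sanitize are hypothetical.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Illustrative subset only; a real site would choose its own list.
ALLOWED_TAGS = {"a", "hr", "p", "h1", "h2", "h3", "ol", "ul", "li", "pre",
                "code", "strong", "em", "sub", "sup", "b", "i", "u"}
VOID_TAGS = {"hr"}                  # tags that take no closing tag
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents
SAFE_SCHEMES = {"http", "https"}


def safe_href(value):
    """Return the URL if it has an explicitly allowed scheme, else None."""
    try:
        scheme = urlsplit(value or "").scheme.lower()
    except ValueError:
        return None
    return value if scheme in SAFE_SCHEMES else None


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep any text inside it
        kept = ""
        if tag == "a":
            for name, value in attrs:
                href = safe_href(value) if name == "href" else None
                if href:
                    kept = f' href="{escape(href, quote=True)}"'
                    break
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives with entities decoded, so escape it again.
            self.out.append(escape(data, quote=False))


def sanitize(markup: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<strong>hi</strong><script>alert(1)</script>') keeps the STRONG element and drops the script along with its contents.  The sketch does not balance unclosed tags or handle every browser parsing quirk, so it should be read as a demonstration of the allow-list idea rather than as production code.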

The two preceding methods can be combined to allow ease of use for people who are typing in their text instead of copying it, in addition to allowing easy inclusion of HTML documents.  If there is good markup not presented in the menu, there is a design choice the web site needs to make.  On one hand, having good markup that cannot be modified by the menu system will be a cause of confusion for people who do not know how to edit HTML.  On the other hand, preventing good markup from being in a document will break the structure and/or the format of the included text.

Personally, I recommend having an HTML mode (as this site does), and allowing all good markup to remain in the text (as this site does not).   It was a hassle to have to reformat a document that contains only valid HTML and not even any script, frames, or any embedded content including images, and removing good HTML does a disservice to the author, while providing no security benefits to the reader.

My next blog post will examine when good HTML goes bad, and will discuss the IMG tag.  Thank you, readers, for your time and comments; I am enjoying this security biographical log (blog), and the opportunity to share both security issues and my personal involvement with them.
:-)

The views expressed in this post are the opinions of the Infosec Island member that posted this content. Infosec Island is not responsible for the content or messaging of this post.

Unauthorized reproduction of this article (in part or in whole) is prohibited without the express written permission of Infosec Island and the Infosec Island member that posted this content--this includes using our RSS feed for any purpose other than personal use.