If you’ve ever read any kind of article that talks about sanitizing, filtering, cleaning, escaping, or validating user input, I’m sure you’ve come across this line: don’t trust it — ever! Before you can protect yourself from unwanted data, injections, script crashes, and the like, you’d be smart to follow the next three steps to ensure that you understand the methods, data, and its applications.
- Identify all areas that allow for user input. In other words, what can the user manipulate? It’s not necessarily just the form field you provided for them to fill out, but maybe a piece of data that you’re passing in via a query string, a hidden form field, or a browser header.
- Determine what is valid input. The more specific you are in asking for a piece of data, the more granular you can be when it comes to validating. For example, instead of loosely asking for an address, you’d ask for the street, city, state, and zip code, which allows you to validate each piece of data separately.
- Understand how you’re going to use that data. Whether you display it back on the screen, pass it to another program, or simply store it in a database, each application has its specific needs and requirements. For example, if you display data with HTML code on the screen, you won’t see the code, you’ll see the rendered version — which may or may not be desired — but if you’re storing this data in the database, you’d be OK.
After you’ve asked yourself those questions and thought about your data, you’ll have a good idea on where to start, but just knowing the concept won’t help you with implementation, so let’s take a closer look at each of these.
1. Identify all areas that allow for user input
There are a lot of places a user can change your data, so the best way to look at it is so: if you’re using a piece of data that you did not create and are not in control of every step of the way, you need to validate it. As a side note, filtering, validating, and escaping data are technically speaking three different things, but for simplicity sake, when I refer to validation, I’m really referring to some or all of those aspects.
Let’s take a look at how you might receive user controllable data:
- Form fields: these are the most obvious of all, because you’re directly asking the user for data.
- Hidden form fields: even though the average user doesn’t see those fields, anyone can easily alter the contents of them. So make sure to double check that data before you use it and verify that your script gracefully halts if that piece of data is missing or invalid.
- Cookies: if you ever store data in a cookie, you need to double check that the value inside has not been altered and doesn’t contain anything that might cause your script to break.
- Query strings: if your application relies on reading information from the URL, you also need to check that the values you grab are valid. If you’re using a value as an ID number to make a database query, it’s a good idea to ensure that the value truly consists of digits.
- Browser headers: sometimes an application uses information from the user’s browser, such as the referer or user agent, so it’s important to remember that this data can be manipulated as well.
- Responses: if your script receives data from other applications or servers that you don’t control, that would also be considered user input. For example, maybe you’re parsing an RSS feed or work with JSON reponses.
Checking the above does several things for you:
- It reduces the risk of your application being used for illicit purposes e.g. compromising your database or sending spam email.
- It provides you with more control over the data that enters your system e.g. ensuring that a phone number is really a phone number.
- It allows you to be more efficient in processing the data e.g. if you look up a product in the database via its ID number, but you notice the ID number consists of letters as apposed to a number, there’s no need to make a database query in the first place.
2. Determine what is valid input
This is a pretty tricky subject, because if you’re too specific, you might miss a scenario where perfectly valid data is rejected, but if you’re too unrestrictive, you’ll get data that’s not useful. Some of the things you can control are the length of characters — you should always set a limit — ranges e.g. a number between zero and fifty, and their data type e.g. a number or a string. You can validate input by using some of PHP‘s built-in functions like is_int, ctype_alpha, ctype_alnum, or something more custom like a regular expression or a user-defined function.
In addition, there are two approaches to validating user input: inclusive and exclusive. Inclusive means that you specifically tell your application what is acceptable input, and exclusive let’s you specify the kind of data you don’t want.
An inclusive example would be to say that you want a number greater than zero, but less than fifty. You’re telling the system what to expect. The exclusive version would state that a number less than zero or greater than fifty is unacceptable. Here you told the system what to prevent. A different example would be checking that a phone number consists of only digits, parenthesis, or hyphens (inclusive) versus ensuring that the phone number doesn’t contain any letters or other symbols (exclusive).
Most of the time you’ll probably end up using a combination of the two, depending on whichever scenario is easier to test.
3. Understand how you’re going to use that data
Here are several ways your data can be used and what the best practices are for working with such data:
- Database: if you’re planning on storing the data in a database, you need to escape it. Different database languages may have different ways of escaping, so pay attention to which one you’re using, but for MySQL, which is a popular and open source relational database management system (RDBMS), any data must pass through the mysql_real_escape_string() PHP function. It protects your database from injections.
- Display: if you’re going to display user supplied data on your website, you should pass it through PHP’s htmlspecialchars() function. It protects your website from cross-site scripting (XSS).
- URL: it’s possible that you need to send or save data from a user in the URL. If that’s the case, everything should pass through the urlencode() PHP function. This ensures that your URL remains valid.
- Files: perhaps you’re saving information in a log file. Generally speaking, you can store it anyway you want, but if you know which method you’ll be using to read that data, you should store it accordingly. If the file’s contents will be displayed on the screen, use method #2 described above.
- Transmission: sometimes you may be sending data from one server or application to another. You should validate your data before transmission appropriately and if you’re on the receiver’s end, you should validate it again to ensure that it hasn’t been modified somewhere in between. It’s similar to the concept of loose coupling, where each system is unaware of what the other system is doing, therefore you can’t assume that it has been validated on the other end.
The purpose of this article was to give you an idea of the methods, data, and its application in systems. There are exceptions to every rule, but now that you know what’s possible and how things are commonly utilized, you have the knowledge to take that and apply it to your own specific project.
If you have suggestions or questions, feel free to leave comments below.
Featured image by Patrick Robert Doyle.