URL encoding of a sequence of octets (bytes) is defined by RFC 1738 to allow the transmission of arbitrary data as part of a URL in the form of ASCII text.
Any octet can be expressed as %XX, where XX is the hexadecimal value of the octet. The XX is always exactly two hexadecimal digits long.
For example, a tab would be encoded as %09, a space as %20, the % as %25, etc.
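The %XX rule can be sketched in a few lines of Python. This is only an illustration of the encoding itself, not the code of the urlendec utilities; the function name percent_encode_all is made up for this example, and it deliberately encodes every octet without exception.

```python
def percent_encode_all(data: bytes) -> str:
    """Encode every octet as %XX, where XX is two uppercase hex digits."""
    return "".join(f"%{octet:02X}" for octet in data)


# A tab, a space, and a percent sign, as in the examples above.
print(percent_encode_all(b"\t %"))  # %09%20%25
```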
Not all bytes need to be encoded. For example, the letters A-Z and a-z and the digits 0-9 are usually not encoded (though a decoder should handle the case when they are). Some octets are considered safe, others unsafe. Which ones are safe depends on the protocol used. Refer to RFC 1738, section 2.2, for details.
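To illustrate the point about decoders, here is a small sketch using unquote_to_bytes from Python's standard urllib.parse module (a stand-in for illustration, not part of urlendec). The input mixes letters that were encoded unnecessarily with a letter that was left alone, and the decoder accepts both.

```python
from urllib.parse import unquote_to_bytes

# "%41" and "%42" are the letters A and B, encoded although they did not
# need to be; the "C" and the "d" were left unencoded.
decoded = unquote_to_bytes("%41%42C%2C%20d")
print(decoded)  # b'ABC, d'
```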
Additionally, whenever CGI arguments are transmitted, the blank space is often encoded as a plus (+). Any real plus is URL encoded (as %2B) to prevent confusion. To learn more about CGI, read the CGI Programming Is Simple! tutorial.
The urlendec collection comes with two utilities, urlencode and urldecode, that allow you to URL encode and decode any sequence of data.
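The plus convention can be tried out with quote_plus and unquote_plus from Python's standard urllib.parse module. They are used here only to show the convention and are not related to the urlendec utilities.

```python
from urllib.parse import quote_plus, unquote_plus

# Spaces become "+", while the real plus sign becomes %2B.
encoded = quote_plus("1 + 1 = 2")
print(encoded)                # 1+%2B+1+%3D+2

# Decoding reverses both substitutions.
print(unquote_plus(encoded))  # 1 + 1 = 2
```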
They can take the data from the command line or from standard input. They write to standard output. Their usage is:
urlencode [options] [string ...]
urldecode [options] [string ...]
If a string is specified, that is the input data. For example,
urlencode Hello, world!
will output:
Hello%2C%20world%21
Similarly,
urldecode Hello%2C%20world%21
will result in:
Hello, world!
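The same round trip can be reproduced in Python as a rough model of what the two commands do. This is a sketch under the assumption that only ASCII letters and digits are left unencoded; encode_default is a hypothetical name, not anything provided by urlendec.

```python
import string
from urllib.parse import unquote

# Assumption: only ASCII letters and digits are left unencoded.
UNENCODED = (string.ascii_letters + string.digits).encode("ascii")


def encode_default(data: bytes) -> str:
    """Percent-encode every octet that is not an ASCII letter or digit."""
    return "".join(chr(b) if b in UNENCODED else f"%{b:02X}" for b in data)


encoded = encode_default(b"Hello, world!")
print(encoded)           # Hello%2C%20world%21
print(unquote(encoded))  # Hello, world!
```

Note that Python's own urllib.parse.quote never encodes the underscore, dot, dash, or tilde, which is why the sketch uses a hand-written encoder instead.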
If no string exists on the command line (with an exception described under the -e option), input is taken from standard input. And, of course, it can be piped. For example,
date | urlencode
will produce something like this:
Wed%20Oct%2025%2019%3A35%3A03%20CDT%202000%0A
Note that it ends with %0A, which is the new line character URL encoded. That is because the date command prints one at the end of its output. If you didn't want it there, you could type:
urlencode `date`
In that case, the output would be:
Wed%20Oct%2025%2019%3A35%3A03%20CDT%202000
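The difference between the two outputs comes down to the trailing newline: a pipe passes it along, while the shell strips trailing newlines from a command substitution. The following Python sketch models just that difference with the standard quote function; the date string is the sample value from above, and the variable names are chosen for this example.

```python
from urllib.parse import quote

piped = "Wed Oct 25 19:35:03 CDT 2000\n"  # what date writes, newline included
substituted = piped.rstrip("\n")          # command substitution drops trailing newlines

with_newline = quote(piped, safe="")
without_newline = quote(substituted, safe="")
print(with_newline)     # Wed%20Oct%2025%2019%3A35%3A03%20CDT%202000%0A
print(without_newline)  # Wed%20Oct%2025%2019%3A35%3A03%20CDT%202000
```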
By default, urlencode will URL encode everything, except letters A-Z, a-z, and digits 0-9. Various command line options allow you to determine what should or should not be URL encoded. The options must precede the string (if you specify one). To see all options in alphabetical order, type:
urlencode -h
I will describe most of them here, but not alphabetically.
You can exclude individual values, or ranges of values, with the list option. A list (like all options) is preceded by a dash (-). The list is enclosed in square brackets. Non-printable characters may be URL encoded within the list. Four characters have special meaning:
Options can be grouped. In other words, several options (including lists) can follow a single dash.
Note: When entering the list from Unix shell, you need to prevent the shell from interpreting the [ as its own command. You can do so by preceding it with a \ or enclosing it with single quotes (but not both at the same time).
Examples:
urlencode -a\[0-7]
URL encode everything except octal digits.
urlencode '-x%d[89]'
URL encode the % sign and the octal digits. Leave everything else unencoded.
urlencode '-[%00-%1F]'
URL encode everything, except alphanumeric and control characters.
urlencode '-[:/.\-_]p'
URL encode everything, except A-Z, a-z, 0-9, colon, slash, dot, dash, and underline. Encode spaces into plus signs.
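The effect described for the last example can be modeled in Python. This is a sketch of the described behavior only, not of how urlencode parses or applies its lists; the function encode_keeping and its parameters keep and space_as_plus are hypothetical names introduced for illustration.

```python
import string

ALNUM = string.ascii_letters + string.digits


def encode_keeping(data: bytes, keep: str = "", space_as_plus: bool = False) -> str:
    """Percent-encode data, leaving letters, digits, and `keep` unencoded."""
    unencoded = (ALNUM + keep).encode("ascii")
    out = []
    for b in data:
        if space_as_plus and b == 0x20:
            out.append("+")          # a space becomes a plus sign
        elif b in unencoded:
            out.append(chr(b))       # excluded from encoding
        else:
            out.append(f"%{b:02X}")  # everything else becomes %XX
    return "".join(out)


# Keep colon, slash, dot, dash, and underscore; encode spaces as plus signs.
print(encode_keeping(b"http://example.com/a b_c?x=1+2", keep=":/.-_", space_as_plus=True))
# http://example.com/a+b_c%3Fx%3D1%2B2
```

Because the plus sign is not in the keep list, the real plus in the input is encoded as %2B and cannot be mistaken for an encoded space.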
Type urldecode -h for the full list.
The following options are common to both urlencode and urldecode:
The -e option is common to both urlencode and urldecode. It serves a double purpose:
It denotes that the next command line argument is not an option, even if it happens to start with a dash. Use this option whenever the string starts, or can start, with a dash.
Consider this example:
urlencode -p $SOMEVAR
This will work fine as long as $SOMEVAR expands to a string that does not start with a dash. But if the string happens to start with a dash, urlencode will assume it is part of the options. The problem disappears if we use the following command instead:
urlencode -p -e $SOMEVAR
It indicates that input is to be taken from the command line even if no string is listed. In other words, urldecode -e will do nothing.
Consider, for example, a CGI shell script containing the following command:
urldecode -p `cat somefile`
As long as somefile contains data (and it does not start with a dash), everything works fine. But, if for whatever reason the file no longer contains any data, cat will produce no output. Our command line will effectively become:
urldecode -p
This tells urldecode to read from standard input. But since nothing comes from standard input, it will wait forever, and your script will hang. The problem is easily solved by including the -e option:
urldecode -ep `cat somefile`
I have deliberately merged the -p -e into -ep in the above example, just to illustrate that, like all other options, the -e option can be grouped with other options. It tells urlencode/urldecode that the next argument is not an option argument. But the current argument is still scanned for options. So the p from the -ep is still applied.
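The rules for -e can be summarized in a short Python model. It is a simplified sketch of the behavior described above, not the utilities' actual argument parser: split_args is a hypothetical name, only single-letter options are handled, and list options in square brackets are ignored.

```python
def split_args(argv):
    """Return (option_letters, strings, read_stdin) for a list of arguments."""
    options = []
    i = 0
    end_of_options = False
    while i < len(argv) and not end_of_options and argv[i].startswith("-"):
        for letter in argv[i][1:]:
            options.append(letter)
            if letter == "e":
                # The next argument is data, but the rest of the current
                # argument is still scanned for options.
                end_of_options = True
        i += 1
    strings = argv[i:]
    # Standard input is used only when there is no string and no -e.
    read_stdin = not strings and "e" not in options
    return options, strings, read_stdin


print(split_args(["-p"]))                 # (['p'], [], True)
print(split_args(["-ep"]))                # (['e', 'p'], [], False)
print(split_args(["-p", "-e", "-dash"]))  # (['p', 'e'], ['-dash'], False)
```

The first call corresponds to the case that hangs while waiting for standard input, the second to the fix with -ep, and the third to a string that starts with a dash.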
Copyright © G. Adam Stanislav. All rights reserved.
Author's web sites: Whiz Kid Technomagic and Red Prince Castle.