How to determine if a user agent string has proper syntax or might be a hacking attempt?

Question

I was checking my server via awstats to see who was visiting my site and I have a user-agent of the following value:

}__test|O:21:\

Some research has led me to believe that someone was trying to hack my server.

To prevent such things from happening, what are the syntax rules in determining that a user agent string is 100% legitimate and not some hacker-crafted string? For example, what characters are/are not allowed in a proper user-agent string and do some characters need to be in a certain order?

score 4 · Answer 1 · answered Feb 17 '16 at 01:21

There are no rules. A user agent can be anything.

There's no reasonable way to whitelist user agents as there are a lot of legitimate ones and you do not want to accidentally block a legitimate user. There's also no way to block bad user agents because, once again, there is no standard way to determine if a user agent represents a bad user.

If you want to try to block bad bots you can compare a user agent against this database and see who it is and then make a semi-educated decision about whether you should block it or not. There are also some attempts to maintain lists of bad user agents but I don't know how current they are and new user agents appear every day.

score 3 · Answer 2 · edited Oct 07 '21 at 07:34

As the user agent is completely client controlled, it is a good thing to pay attention to it, as it can be used in various attacks.

Allowed Characters in User Agent

what characters are/are not allowed in a proper user-agent string and do some characters need to be in a certain order?

@Stephen Ostermiller already linked to RFC2616. It was updated in RFC7231, but nothing really changed:

User-Agent = product *( RWS ( product / comment ) )
[...]
product = token ["/" product-version]
product-version = token

It does however link to RFC7230 to specify how comments may look:

comment = "(" *( ctext / quoted-pair / comment ) ")"
ctext = HTAB / SP / %x21-27 / %x2A-5B / %x5D-7E / obs-text
[...]
quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )

This is a fancy way of saying that pretty much all characters are allowed in the comment part of the user agent. ()\ are the only ones that cannot be placed freely.

token is a bit more restrictive, as can be seen in RFC7230. It doesn't allow whitespace, double quotes, or any of (),/:;<=>?@[\]{}.
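To make the grammar above concrete, here is a minimal sketch of a syntax check in Python. The function names (is_valid_user_agent and the underscore helpers) are my own, not from any library, and it assumes the header value has already had leading and trailing whitespace stripped. It only tells you whether a string matches the RFC7231 grammar. It says nothing about intent: an attacker can send a perfectly well-formed user agent, and some legitimate clients send malformed ones.

```python
import string

# tchar per RFC 7230: letters, digits, and a fixed set of punctuation.
TCHAR = set(string.ascii_letters + string.digits + "!#$%&'*+-.^_`|~")


def _is_ctext(ch):
    """ctext = HTAB / SP / %x21-27 / %x2A-5B / %x5D-7E / obs-text"""
    o = ord(ch)
    return (ch in "\t " or 0x21 <= o <= 0x27 or 0x2A <= o <= 0x5B
            or 0x5D <= o <= 0x7E or 0x80 <= o <= 0xFF)


def _is_quotable(ch):
    """Second character of a quoted-pair: HTAB / SP / VCHAR / obs-text"""
    o = ord(ch)
    return ch in "\t " or 0x21 <= o <= 0x7E or 0x80 <= o <= 0xFF


def _token(s, i):
    """Return the index just past a token starting at i, or -1 if there is none."""
    j = i
    while j < len(s) and s[j] in TCHAR:
        j += 1
    return j if j > i else -1


def _product(s, i):
    """product = token ["/" product-version]"""
    j = _token(s, i)
    if j != -1 and j < len(s) and s[j] == "/":
        j = _token(s, j + 1)
    return j


def _comment(s, i):
    """Return the index just past a (possibly nested) comment at i, or -1."""
    if i >= len(s) or s[i] != "(":
        return -1
    depth = 0
    j = i
    while j < len(s):
        ch = s[j]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return j + 1
        elif ch == "\\":
            # quoted-pair: the backslash must be followed by a quotable character
            if j + 1 >= len(s) or not _is_quotable(s[j + 1]):
                return -1
            j += 1
        elif not _is_ctext(ch):
            return -1
        j += 1
    return -1  # ran out of input before the comment was closed


def is_valid_user_agent(value):
    """User-Agent = product *( RWS ( product / comment ) )"""
    i = _product(value, 0)
    if i == -1:
        return False
    while i < len(value):
        j = i
        while j < len(value) and value[j] in " \t":
            j += 1
        if j == i:  # RWS requires at least one space or tab
            return False
        if j < len(value) and value[j] == "(":
            i = _comment(value, j)
        else:
            i = _product(value, j)
        if i == -1:
            return False
    return True
```

With this sketch, a typical browser string such as Mozilla/5.0 (X11; Linux x86_64) passes, while the string from the question fails immediately because } is not a token character. Treat a failure as a hint for logging at most, not as a reason to block.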

How to filter user agents

what are the syntax rules in determining that a user agent string is 100% legitimate and not some hacker-crafted string?

As user agents can contain pretty much any character, reasonable filtering is impossible. And this isn't even considering that not all clients will follow the RFC (filtering should not be very restrictive, for usability reasons).

Filtering user input is a good first line of defense, but it should never be your only one, as it is extremely difficult to prevent all attacks with input filtering.

You need secure coding practices, and you need to implement proper defenses against common attacks. So if the user agent is echoed, you need to encode it to prevent XSS. If the user agent is stored in the database, you need to use prepared statements to defend against SQL injection. If you pass something to the PHP function unserialize, you need to keep object injection in mind (I'm mentioning it because the O:21 looks a bit like it might have been a test for that). And so on.
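As a rough illustration of the first two points, here is a sketch in Python rather than PHP. It uses only the standard library (html.escape and sqlite3). The visits table, the column name and both function names are made up for the example, and the unserialize point is not covered because it is PHP-specific.

```python
import html
import sqlite3


def render_user_agent(user_agent):
    """Encode the header for HTML output so any markup in it is displayed, not executed."""
    return "<td>" + html.escape(user_agent, quote=True) + "</td>"


def store_user_agent(conn, user_agent):
    """Insert via a bound parameter; the value is never spliced into the SQL text."""
    conn.execute("INSERT INTO visits (user_agent) VALUES (?)", (user_agent,))
    conn.commit()


if __name__ == "__main__":
    # A deliberately hostile value, stored and rendered without any filtering.
    hostile = "<script>alert(1)</script>'; DROP TABLE visits; --"
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user_agent TEXT)")
    store_user_agent(conn, hostile)
    print(render_user_agent(hostile))
    print(conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0])
```

The point of the design is that neither function inspects or rejects the user agent. The value is treated as opaque data at every boundary, so it does not matter what characters it contains.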

If you want an additional line of defense, you might think about using a WAF such as mod_security.

score 1 · Answer 3 · edited May 23 '17 at 12:37

The User-Agent header is defined in RFC2616 (HTTP/1.1), which builds on RFC1945 (HTTP/1.0). RFC2616 states:

The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests. The field can contain multiple product tokens (section 3.8) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.
   User-Agent     = "User-Agent" ":" 1*( product | comment )
Where product is defined as:
   product         = token ["/" product-version]
   product-version = token
   token           = 1*<any CHAR except CTLs or separators>
And comment as:
   comment        = "(" *( ctext | quoted-pair | comment ) ")"
   ctext          = <any TEXT excluding "(" and ")">

Source: Paulo Santos's answer to What is the standard format for a browser's User-Agent string?
