Regular Expression (RegEx) for Matching URLs as Defined in the RFC

When you try to find a regular expression for a URL, you end up with 20 different expressions that all have problems here and there.
Recently I found a very good web page that transforms the RFC definition directly into a regular expression; look here.
My modifications to the original regex are:
  • keeping only ftp and http, since the other protocols aren't required here
  • replacing / with \/ to make PHP's preg_replace() happy
  • adding an anchor part to the end of the http branch to allow URLs like http://example.com#test (this surely needs improvement: as written it accepts at most one letter or digit after the #, and I don't know where the valid anchor characters are listed)

And finally I came up with this:
(?:ftp:\/\/(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:\/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:\/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:http:\/\/(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)(?:\/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*)(?:\/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))?)?(#[a-zA-Z\d]?)?)
If you don't need ftp or http, the two alternatives are cleanly separated, so it is easy to strip one of them out.

It can be used to transform links in plain text (e.g. a forum post) into <a href="..."> markup.
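A minimal sketch of such a link transformation, written in Python rather than PHP. The pattern used here is a deliberately simplified stand-in (an assumption, not the RFC-derived regex from this post), and the names SIMPLE_URL and linkify are made up for illustration. Any compiled pattern, including the one above, could be passed in instead.

```python
import html
import re

# Simplified stand-in pattern (assumption): "http://" followed by anything
# that is not whitespace, an angle bracket or a double quote.
SIMPLE_URL = re.compile(r'http://[^\s<>"]+')


def linkify(text, pattern=SIMPLE_URL):
    """Escape text for HTML and wrap every pattern match in an <a href> tag."""
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Text between two URLs is escaped so it cannot inject markup.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify("see http://example.com/a?b=1 now"))
```

Escaping both the surrounding text and the URL itself matters in a forum setting, because the input comes from users.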
The following tests demonstrate how it behaves:
Original | Matched
http://example.com | http://example.com
http://www.example.com | http://www.example.com
http://127.0.0.1/ | http://127.0.0.1/
http://127.0.0.1.1/ | http://127.0.0.1
http://127.0.0/ | (no match)
http://127.0.0./ | (no match)
http://12345.0.0.1 | http://12345.0.0.1
http://localhost/ | http://localhost/
http://www.bla.de/in.php?muh=1/2 | http://www.bla.de/in.php?muh=1
http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex | http://127.0.0.1/balpage/index.php?r=site
http://www.bla.de?muh/muh | http://www.bla.de
muhhttp://www.bla.de/?muh=i/muh asd | http://www.bla.de/?muh=i
http://b.d/?a=%20@/ | http://b.d/?a=%20@
http://dreiländereck.de | http://dreil
http://bla.de/a/sd/ba.xyzfg/ba.php | http://bla.de/a/sd/ba.xyzfg/ba.php
http://bla.de:123/a/sd/ba.xyzfg/ba.php | http://bla.de:123/a/sd/ba.xyzfg/ba.php
http://bla.de:123456 | http://bla.de:123456
http://blub.deVeryLongEnding/asd | http://blub.deVeryLongEnding/asd
http://256.0.0.1 | http://256.0.0.1
http://bla.de:65540 | http://bla.de:65540

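As you can see, this regex has an error: I think "/" should also be matched inside the query string. German umlauts are valid in URLs now, too.
I also removed ftp and all upper-case ranges inside the regex to simplify it, so it has to be applied case-insensitively.
The strange ?: markers (non-capturing groups) are removed as well.
Next, I fixed the invalid IPs by limiting each number to one to three digits, and the port to at most five digits.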
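The effect of these changes can be checked in isolation. The sketch below is in Python rather than PHP. The two IP sub-patterns are taken from the first regex and from Version 2, while the names LOOSE_IP, BOUNDED_IP, LOWER_HEX and accepts are made up for illustration.

```python
import re

# IP sub-pattern of the first regex: any number of digits per group.
LOOSE_IP = re.compile(r"(?:\d+)(?:\.(?:\d+)){3}")
# IP sub-pattern of Version 2: one to three digits per group.
BOUNDED_IP = re.compile(r"(\d{1,3})(\.\d{1,3}){3}")
# Lower-case-only escape sequence, made case-insensitive by a flag.
LOWER_HEX = re.compile(r"%[a-f\d]{2}", re.IGNORECASE)


def accepts(pattern, candidate):
    """Return True if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


print(accepts(LOOSE_IP, "12345.0.0.1"))    # True: five digits slip through
print(accepts(BOUNDED_IP, "12345.0.0.1"))  # False: at most three digits
print(accepts(BOUNDED_IP, "256.0.0.1"))    # True: the value range is not checked
print(accepts(LOWER_HEX, "%2F"))           # True: upper case via the flag
```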

Version 2


http:\/\/(((([a-zäöüß\d](([a-zäöüß\d]|-)*[a-zäöüß\d])?\.)*([a-zäöüß](([a-zäöüß\d]|-)*[a-zäöüß\d])?))|((\d{1,3})(\.\d{1,3}){3}))(:(\d{1,5}))?)(\/(((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*)(\/((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*))*)?)?(\?((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2})|(\/))|[;:@&=])*))?(#[a-z\d]?)?
Original | Matched
http://example.com | http://example.com
http://www.example.com | http://www.example.com
http://127.0.0.1/ | http://127.0.0.1/
http://127.0.0.1.1/ | http://127.0.0.1
http://127.0.0/ | (no match)
http://127.0.0./ | (no match)
http://12345.0.0.1 | (no match)
http://localhost/ | http://localhost/
http://www.bla.de/in.php?muh=1/2 | http://www.bla.de/in.php?muh=1/2
http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex | http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex
http://www.bla.de?muh/muh | http://www.bla.de?muh/muh
muhhttp://www.bla.de/?muh=i/muh asd | http://www.bla.de/?muh=i/muh
http://b.d/?a=%20@/ | http://b.d/?a=%20@/
http://dreiländereck.de | http://dreiländereck.de
http://bla.de/a/sd/ba.xyzfg/ba.php | http://bla.de/a/sd/ba.xyzfg/ba.php
http://bla.de:123/a/sd/ba.xyzfg/ba.php | http://bla.de:123/a/sd/ba.xyzfg/ba.php
http://bla.de:123456 | http://bla.de:12345
http://blub.deVeryLongEnding/asd | http://blub.deVeryLongEnding/asd
http://256.0.0.1 | http://256.0.0.1
http://bla.de:65540 | http://bla.de:65540
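
A few rows of this table can be reproduced with a short script. This is a sketch in Python, not the PHP script used for this post. It assumes the pattern is applied case-insensitively (re.IGNORECASE), and VERSION_2, CASES and first_match are names chosen here for illustration. The expected values are taken from the table.

```python
import re

# Version 2 pattern from this post, split over several lines for readability.
VERSION_2 = re.compile(
    r"http:\/\/(((([a-zäöüß\d](([a-zäöüß\d]|-)*[a-zäöüß\d])?\.)*"
    r"([a-zäöüß](([a-zäöüß\d]|-)*[a-zäöüß\d])?))"
    r"|((\d{1,3})(\.\d{1,3}){3}))(:(\d{1,5}))?)"
    r"(\/(((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*)"
    r"(\/((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*))*)?)?"
    r"(\?((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2})|(\/))|[;:@&=])*))?"
    r"(#[a-z\d]?)?",
    re.IGNORECASE,
)

# (input text, expected match); None stands for "no match".
CASES = [
    ("http://example.com", "http://example.com"),
    ("http://127.0.0.1.1/", "http://127.0.0.1"),
    ("http://127.0.0/", None),
    ("http://12345.0.0.1", None),
    ("http://www.bla.de/in.php?muh=1/2", "http://www.bla.de/in.php?muh=1/2"),
    ("muhhttp://www.bla.de/?muh=i/muh asd", "http://www.bla.de/?muh=i/muh"),
    ("http://dreiländereck.de", "http://dreiländereck.de"),
    ("http://bla.de:123456", "http://bla.de:12345"),
]


def first_match(text):
    """Return the first URL found in text, or None if nothing matches."""
    match = VERSION_2.search(text)
    return match.group(0) if match else None


for text, expected in CASES:
    status = "ok" if first_match(text) == expected else "FAIL"
    print(f"{status:4} {text!r} -> {first_match(text)!r}")
```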

Remaining Errors

At the bottom of the table I added URLs that are not valid but still match.
For the invalid domain ending, the problem is that I would have to keep a list of all valid top-level domains.
The 256 IP shouldn't match, since IP numbers only go up to 255, but checking that would look too complex inside the regex.
The same goes for port 65540: the highest possible port is 65535.
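
One way around this is to leave the regex as it is and check the numbers afterwards in code. A minimal sketch in Python, assuming the URL has already been matched by the regex. The helper name has_valid_numbers and the two small helper patterns are hypothetical; only the limits 255 and 65535 come from the text above.

```python
import re

# Hypothetical helper patterns: split off host and optional port,
# and recognise a host that is written as four dotted numbers.
AUTHORITY = re.compile(r"http://([^/?#:]+)(?::(\d+))?")
DOTTED_QUAD = re.compile(r"\d+(?:\.\d+){3}")


def has_valid_numbers(url):
    """Check the IP range (0-255) and port range (0-65535) of a matched URL."""
    match = AUTHORITY.match(url)
    if match is None:
        return False
    host, port = match.group(1), match.group(2)
    # Only hosts made of four dotted numbers are treated as IP addresses.
    if DOTTED_QUAD.fullmatch(host):
        if any(int(part) > 255 for part in host.split(".")):
            return False
    if port is not None and int(port) > 65535:
        return False
    return True


print(has_valid_numbers("http://127.0.0.1/"))    # True
print(has_valid_numbers("http://256.0.0.1"))     # False
print(has_valid_numbers("http://bla.de:65540"))  # False
```

This keeps the pattern readable and moves the arithmetic to a place where it is a simple comparison.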

TODO

  • Collect more test URLs (maybe from user suggestions)
  • Simplify further (since the regex was generated automatically, there might be some redundancy)
  • Show this script's source code