Regular Expression (RegEx) for Matching URLs as Defined in the RFC
When you search for a regular expression that matches URLs, you end up finding 20 different expressions, each of which has problems here and there. Recently I found a very good webpage that transformed the RFC definition directly into a regular expression.
look here
My modifications to the original regex are:
- keeping only ftp and http, since all other protocols aren't required here
- replacing the / with \/ to make PHP's preg_replace() happy
- adding an anchor to the end of the http branch to allow URLs like http://example.com#test (this one surely needs improvement, but I don't know where the valid anchor characters are listed)
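On the question of valid anchor characters: RFC 3986 defines the fragment as any sequence of "pchar", "/" and "?", where pchar covers the unreserved characters, percent-encoded octets, the sub-delimiters, ":" and "@". The sketch below is an assumption-laden illustration, not part of the original regex: it uses Python's re module instead of PHP, and the names FRAGMENT, ORIGINAL_TAIL and match_tail are made up for this example. It also shows that a tail of the form #[a-zA-Z\d]? consumes at most one character after the "#".

```python
import re

# Fragment per RFC 3986: '#' followed by any number of unreserved characters,
# sub-delimiters, ':', '@', '/', '?' or percent-encoded octets.
# (Hypothetical name; this pattern is not taken from the article's regex.)
FRAGMENT = r"#(?:[A-Za-z\d\-._~!$&'()*+,;=:@/?]|%[A-Fa-f\d]{2})*"

# Anchor tail as written in the article: '#' plus at most ONE alphanumeric.
ORIGINAL_TAIL = r"#[a-zA-Z\d]?"


def match_tail(pattern, text):
    """Return the part of `text` matched at its start by `pattern`, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(match_tail(ORIGINAL_TAIL, "#test"))
    print(match_tail(FRAGMENT, "#test"))
```

If the broader character set is wanted, the FRAGMENT body could replace the anchor group at the end of the http branch, with "/" escaped as \/ for preg_replace().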
And finally I came up with this:
(?:ftp:\/\/(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:\/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:\/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:http:\/\/(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?)(?:\/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*)(?:\/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{2}))|[;:@&=])*))?)?(#[a-zA-Z\d]?)?)
If you don't need ftp or http, the two branches are easy to separate, so it's no problem to strip one of them out. The regex can be used to transform links in normal text (e.g. a forum post) into an <a href="..."> scheme.
The following tests demonstrate how it behaves:
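As a rough sketch of the link transformation, the following Python snippet wraps every match in an <a> element. Several things here are assumptions: Python's re and html modules stand in for PHP's preg_replace(), the name linkify is invented, and URL_PATTERN is a deliberately short stand-in that should be swapped for the full expression.

```python
import html
import re

# Stand-in pattern for illustration only (an assumption, not the article's
# regex); substitute the full expression in real use.
URL_PATTERN = re.compile(r"(?:http|ftp)://[^\s<>\"]+")


def linkify(text):
    """Escape `text` for HTML and wrap every URL match in an <a> element."""
    parts = []
    last = 0
    for m in URL_PATTERN.finditer(text):
        # Escape the plain text between the previous match and this one.
        parts.append(html.escape(text[last:m.start()]))
        url = html.escape(m.group(0), quote=True)
        parts.append('<a href="{0}">{0}</a>'.format(url))
        last = m.end()
    # Escape whatever follows the last match.
    parts.append(html.escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(linkify("see http://example.com/a?b=1 now"))
```

Escaping the surrounding text and the URL itself keeps user-supplied markup in a forum post from ending up in the page unescaped.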
Original | Matched | ok |
---|---|---|
http://example.com | http://example.com | |
http://www.example.com | http://www.example.com | |
http://127.0.0.1/ | http://127.0.0.1/ | |
http://127.0.0.1.1/ | http://127.0.0.1 | |
http://127.0.0/ | ||
http://127.0.0./ | ||
http://12345.0.0.1 | http://12345.0.0.1 | |
http://localhost/ | http://localhost/ | |
http://www.bla.de/in.php?muh=1/2 | http://www.bla.de/in.php?muh=1 | |
http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex | http://127.0.0.1/balpage/index.php?r=site | |
http://www.bla.de?muh/muh | http://www.bla.de | |
muhhttp://www.bla.de/?muh=i/muh asd | http://www.bla.de/?muh=i | |
http://b.d/?a=%20@/ | http://b.d/?a=%20@ | |
http://dreiländereck.de | http://dreil | |
http://bla.de/a/sd/ba.xyzfg/ba.php | http://bla.de/a/sd/ba.xyzfg/ba.php | |
http://bla.de:123/a/sd/ba.xyzfg/ba.php | http://bla.de:123/a/sd/ba.xyzfg/ba.php | |
http://bla.de:123456 | http://bla.de:123456 | |
http://blub.deVeryLongEnding/asd | http://blub.deVeryLongEnding/asd | |
http://256.0.0.1 | http://256.0.0.1 | |
http://bla.de:65540 | http://bla.de:65540 |
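As you can see, there is an error in that regex: I think "/" should also be matched inside the query string. German umlauts are valid in URLs now, too.
I also removed ftp and all uppercase character ranges inside the regex to simplify it, so the regex now has to be applied case-insensitively.
The strange ?: markers (non-capturing groups) are also removed.
Next, I fixed the invalid IPs by limiting each part to one to three digits (\d{1,3}); the port is limited to five digits (\d{1,5}).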
Version 2
http:\/\/(((([a-zäöüß\d](([a-zäöüß\d]|-)*[a-zäöüß\d])?\.)*([a-zäöüß](([a-zäöüß\d]|-)*[a-zäöüß\d])?))|((\d{1,3})(\.\d{1,3}){3}))(:(\d{1,5}))?)(\/(((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*)(\/((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*))*)?)?(\?((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2})|(\/))|[;:@&=])*))?(#[a-z\d]?)?
Original | Matched | ok |
---|---|---|
http://example.com | http://example.com | |
http://www.example.com | http://www.example.com | |
http://127.0.0.1/ | http://127.0.0.1/ | |
http://127.0.0.1.1/ | http://127.0.0.1 | |
http://127.0.0/ | ||
http://127.0.0./ | ||
http://12345.0.0.1 | ||
http://localhost/ | http://localhost/ | |
http://www.bla.de/in.php?muh=1/2 | http://www.bla.de/in.php?muh=1/2 | |
http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex | http://127.0.0.1/balpage/index.php?r=site/page&view=urlRegex | |
http://www.bla.de?muh/muh | http://www.bla.de?muh/muh | |
muhhttp://www.bla.de/?muh=i/muh asd | http://www.bla.de/?muh=i/muh | |
http://b.d/?a=%20@/ | http://b.d/?a=%20@/ | |
http://dreiländereck.de | http://dreiländereck.de | |
http://bla.de/a/sd/ba.xyzfg/ba.php | http://bla.de/a/sd/ba.xyzfg/ba.php | |
http://bla.de:123/a/sd/ba.xyzfg/ba.php | http://bla.de:123/a/sd/ba.xyzfg/ba.php | |
http://bla.de:123456 | http://bla.de:12345 | |
http://blub.deVeryLongEnding/asd | http://blub.deVeryLongEnding/asd | |
http://256.0.0.1 | http://256.0.0.1 | |
http://bla.de:65540 | http://bla.de:65540 |
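A few rows of the Version 2 table can be re-checked with a small harness like the one below. It is only a sketch under assumptions: the original test script is not shown, so Python's re module with the IGNORECASE flag stands in for it, and the names VERSION_2, first_match and SAMPLES are made up. The pattern is the Version 2 expression split across several string literals for readability.

```python
import re

# Version 2 of the expression, split into host, IP, port, path, query, anchor.
VERSION_2 = re.compile(
    r"http:\/\/(((([a-zäöüß\d](([a-zäöüß\d]|-)*[a-zäöüß\d])?\.)*"
    r"([a-zäöüß](([a-zäöüß\d]|-)*[a-zäöüß\d])?))"
    r"|((\d{1,3})(\.\d{1,3}){3}))"
    r"(:(\d{1,5}))?)"
    r"(\/(((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*)"
    r"(\/((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2}))|[;:@&=])*))*)?)?"
    r"(\?((([a-z\d$\-_.+!*'(),]|(%[a-f\d]{2})|(\/))|[;:@&=])*))?"
    r"(#[a-z\d]?)?",
    re.IGNORECASE,  # the uppercase ranges were removed from the pattern
)


def first_match(text):
    """Return the first substring of `text` matched by VERSION_2, or None."""
    m = VERSION_2.search(text)
    return m.group(0) if m else None


SAMPLES = [
    "http://example.com",
    "http://127.0.0.1.1/",
    "http://12345.0.0.1",
    "muhhttp://www.bla.de/?muh=i/muh asd",
    "http://dreiländereck.de",
    "http://bla.de:123456",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "|", first_match(sample))
```

Adding a new test URL then only means appending it to SAMPLES.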
Remaining errors
At the bottom of the table I added URLs which are not valid but still match. The problem is that I would have to keep a list of all valid top-level domains to reject something like http://blub.deVeryLongEnding/asd.
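One way to handle the top-level domain problem is to check the last host label in a separate step after the regex has matched. The sketch below is only an illustration in Python: KNOWN_TLDS is a tiny hypothetical sample (a real list would have to come from the registry IANA publishes), and HOST and has_known_tld are invented names.

```python
import re

# Hypothetical, deliberately tiny sample; NOT a complete list of valid TLDs.
KNOWN_TLDS = {"com", "de", "org", "net"}

# Pull the host part out of an http URL (everything up to '/', ':', '?', '#').
HOST = re.compile(r"^http://([^/:?#]+)", re.IGNORECASE)


def has_known_tld(url):
    """Return True if the host of `url` ends in a TLD from KNOWN_TLDS.

    IP addresses and single-label hosts such as localhost have no TLD to
    check, so they are accepted as they are.
    """
    m = HOST.match(url)
    if m is None:
        return False
    host = m.group(1).lower()
    if re.fullmatch(r"\d+(?:\.\d+){3}", host) or "." not in host:
        return True
    return host.rsplit(".", 1)[1] in KNOWN_TLDS


if __name__ == "__main__":
    print(has_known_tld("http://blub.deVeryLongEnding/asd"))
    print(has_known_tld("http://www.example.com"))
```

Keeping the list outside the regex means it can be updated without touching the expression.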
The IP starting with 256 shouldn't match, since each part of an IP only goes up to 255, but checking this would look too complex inside the regex.
The same goes for port 65540: the highest possible port is 65535.
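Since both range checks are awkward to express inside the regex, they can be done numerically after a match. The following is a minimal sketch in Python, assuming the URL has already been matched by the regex; HOST_PORT, IPV4_SHAPE and has_valid_ranges are hypothetical names introduced for this example.

```python
import re

# Split an already matched http URL into host and optional port.
HOST_PORT = re.compile(r"^http://(?P<host>[^/:?#]+)(?::(?P<port>\d+))?")
# Four dot-separated groups of digits, i.e. something shaped like an IPv4 address.
IPV4_SHAPE = re.compile(r"^\d+(?:\.\d+){3}$")


def has_valid_ranges(url):
    """Return False if an IP part exceeds 255 or the port exceeds 65535."""
    m = HOST_PORT.match(url)
    if m is None:
        return False
    host, port = m.group("host"), m.group("port")
    if IPV4_SHAPE.match(host) and any(int(part) > 255 for part in host.split(".")):
        return False
    if port is not None and int(port) > 65535:
        return False
    return True


if __name__ == "__main__":
    for url in ("http://256.0.0.1", "http://bla.de:65540", "http://127.0.0.1/"):
        print(url, has_valid_ranges(url))
```

This keeps the regex readable while still rejecting the last two rows of the table.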
TODO
- Collect more test URLs (maybe with user suggestions)
- Simplify more (since it was auto-created, there might be some redundancy)