I would like to match just the root of a URL and not the whole URL from a text string. Given:
http://www.youtube.com/watch?v=ClkQA2Lb_iE
http://youtu.be/ClkQA2Lb_iE
http://www.example.com/12xy45
http://example.com/random
I want to get the 2 last instances resolving to the www.example.com
or example.com
domain.
I heard regex is slow and this would be my second regex expression on the page so If there is anyway to do it without regex let me know.
I'm seeking a JS/jQuery version of this solution.
I recommend using the npm package psl (Public Suffix List). The "Public Suffix List" is a list of all valid domain suffixes and rules, not just Country Code Top-Level domains, but unicode characters as well that would be considered the root domain (i.e. www.??.??.cn, b.c.kobe.jp, etc.). Read more about it here.
Try:
npm install --save psl
Then with my "extractHostname" implementation run:
let psl = require('psl');
let url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
psl.get(extractHostname(url)); // returns youtube.com
I can't use an npm package, so below only tests extractHostname.
function extractHostname(url) {
var hostname;
//find & remove protocol (http, ftp, etc.) and get hostname
if (url.indexOf("//") > -1) {
hostname = url.split('/')[2];
}
else {
hostname = url.split('/')[0];
}
//find & remove port number
hostname = hostname.split(':')[0];
//find & remove "?"
hostname = hostname.split('?')[0];
return hostname;
}
//test the code
console.log("== Testing extractHostname: ==");
console.log(extractHostname("http://www.blog.classroom.me.uk/index.php"));
console.log(extractHostname("http://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("https://www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("www.youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("ftps://ftp.websitename.com/dir/file.txt"));
console.log(extractHostname("websitename.com:1234/dir/file.txt"));
console.log(extractHostname("ftps://websitename.com:1234/dir/file.txt"));
console.log(extractHostname("example.com?param=value"));
console.log(extractHostname("https://facebook.github.io/jest/"));
console.log(extractHostname("//youtube.com/watch?v=ClkQA2Lb_iE"));
console.log(extractHostname("http://localhost:4200/watch?v=ClkQA2Lb_iE"));
Regardless having the protocol or even port number, you can extract the domain. This is a very simplified, non-regex solution, so I think this will do.
*Thank you @Timmerz, @renoirb, @rineez, @BigDong, @ra00l, @ILikeBeansTacos, @CharlesRobertson for your suggestions! @ross-allen, thank you for reporting the bug!
A neat trick without using regular expressions:
var tmp = document.createElement ('a');
; tmp.href = "http://www.example.com/12xy45";
// tmp.hostname will now contain 'www.example.com'
// tmp.host will now contain hostname and port 'www.example.com:80'
Wrap the above in a function such as the below and you have yourself a superb way of snatching the domain part out of an URI.
function url_domain(data) {
var a = document.createElement('a');
a.href = data;
return a.hostname;
}
Try this:
var matches = url.match(/^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)/i);
var domain = matches && matches[1]; // domain will be null if no match is found
If you want to exclude the port from your result, use this expression instead:
/^https?\:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i
Edit: To prevent specific domains from matching, use a negative lookahead. (?!youtube.com)
/^https?\:\/\/(?!(?:www\.)?(?:youtube\.com|youtu\.be))([^\/:?#]+)(?:[\/:?#]|$)/i
There is no need to parse the string, just pass your URL as an argument to URL
constructor:
var url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
var hostname = (new URL(url)).hostname;
assert(hostname === 'www.youtube.com');
Parsing a URL can be tricky because you can have port numbers and special chars. As such, I recommend using something like parseUri to do this for you. I doubt performance is going to be a issue unless you are parsing hundreds of URLs.
URL.hostname
for readabilityIn the Babel era, the cleanest and easiest solution is to use URL.hostname
.
const getHostname = (url) => {
// use URL constructor and return hostname
return new URL(url).hostname;
}
// tests
console.log(getHostname("https://stackoverflow.com/questions/8498592/extract-hostname-name-from-string/"));
console.log(getHostname("https://developer.mozilla.org/en-US/docs/Web/API/URL/hostname"));
URL.hostname
is part of the URL API, supported by all major browsers except IE (caniuse):
Using this solution will also give you access to other URL properties and methods. This will be useful if you also want to extract the URL's pathname or query string params, for example.
URL.hostname
is faster than using the anchor solution or parseUri. However it's still much slower than gilly3's regex:
const getHostnameFromRegex = (url) => {
// run against regex
const matches = url.match(/^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)/i);
// extract hostname (will be null if no match is found)
return matches && matches[1];
}
// tests
console.log(getHostnameFromRegex("https://stackoverflow.com/questions/8498592/extract-hostname-name-from-string/"));
console.log(getHostnameFromRegex("https://developer.mozilla.org/en-US/docs/Web/API/URL/hostname"));
Test it yourself on this jsPerf
If you need to process a very large number of URLs (where performance would be a factor), I recommend using this solution instead. Otherwise, choose URL.hostname
for readability.
If you end up on this page and you are looking for the best REGEX of URLS try this one:
^(?:https?:)?(?:\/\/)?([^\/\?]+)
https://regex101.com/r/pX5dL9/1
It works for urls without http:// , with http, with https, with just // and dont grab the path and query path as well.
Good Luck
I tried to use the Given solutions, the Chosen one was an overkill for my purpose and "Creating a element" one messes up for me.
It's not ready for Port in URL yet. I hope someone finds it useful
function parseURL(url){
parsed_url = {}
if ( url == null || url.length == 0 )
return parsed_url;
protocol_i = url.indexOf('://');
parsed_url.protocol = url.substr(0,protocol_i);
remaining_url = url.substr(protocol_i + 3, url.length);
domain_i = remaining_url.indexOf('/');
domain_i = domain_i == -1 ? remaining_url.length - 1 : domain_i;
parsed_url.domain = remaining_url.substr(0, domain_i);
parsed_url.path = domain_i == -1 || domain_i + 1 == remaining_url.length ? null : remaining_url.substr(domain_i + 1, remaining_url.length);
domain_parts = parsed_url.domain.split('.');
switch ( domain_parts.length ){
case 2:
parsed_url.subdomain = null;
parsed_url.host = domain_parts[0];
parsed_url.tld = domain_parts[1];
break;
case 3:
parsed_url.subdomain = domain_parts[0];
parsed_url.host = domain_parts[1];
parsed_url.tld = domain_parts[2];
break;
case 4:
parsed_url.subdomain = domain_parts[0];
parsed_url.host = domain_parts[1];
parsed_url.tld = domain_parts[2] + '.' + domain_parts[3];
break;
}
parsed_url.parent_domain = parsed_url.host + '.' + parsed_url.tld;
return parsed_url;
}
Running this:
parseURL('https://www.facebook.com/100003379429021_356001651189146');
Result:
Object {
domain : "www.facebook.com",
host : "facebook",
path : "100003379429021_356001651189146",
protocol : "https",
subdomain : "www",
tld : "com"
}
This solution gives your answer plus additional properties. No JQuery or other dependencies required, paste and go.
Usage
getUrlParts("https://news.google.com/news/headlines/technology.html?ned=us&hl=en")
Output
{
"origin": "https://news.google.com",
"domain": "news.google.com",
"subdomain": "news",
"domainroot": "google.com",
"domainpath": "news.google.com/news/headlines",
"tld": ".com",
"path": "news/headlines/technology.html",
"query": "ned=us&hl=en",
"protocol": "https",
"port": 443,
"parts": [
"news",
"google",
"com"
],
"segments": [
"news",
"headlines",
"technology.html"
],
"params": [
{
"key": "ned",
"val": "us"
},
{
"key": "hl",
"val": "en"
}
]
}
Code
The code is designed to be easy to understand rather than super fast. It can be called easily 100 times per second, so it's great for front end or a few server usages, but not for high volume throughput.
function getUrlParts(fullyQualifiedUrl) {
var url = {},
tempProtocol
var a = document.createElement('a')
// if doesn't start with something like https:// it's not a url, but try to work around that
if (fullyQualifiedUrl.indexOf('://') == -1) {
tempProtocol = 'https://'
a.href = tempProtocol + fullyQualifiedUrl
} else
a.href = fullyQualifiedUrl
var parts = a.hostname.split('.')
url.origin = tempProtocol ? "" : a.origin
url.domain = a.hostname
url.subdomain = parts[0]
url.domainroot = ''
url.domainpath = ''
url.tld = '.' + parts[parts.length - 1]
url.path = a.pathname.substring(1)
url.query = a.search.substr(1)
url.protocol = tempProtocol ? "" : a.protocol.substr(0, a.protocol.length - 1)
url.port = tempProtocol ? "" : a.port ? a.port : a.protocol === 'http:' ? 80 : a.protocol === 'https:' ? 443 : a.port
url.parts = parts
url.segments = a.pathname === '/' ? [] : a.pathname.split('/').slice(1)
url.params = url.query === '' ? [] : url.query.split('&')
for (var j = 0; j < url.params.length; j++) {
var param = url.params[j];
var keyval = param.split('=')
url.params[j] = {
'key': keyval[0],
'val': keyval[1]
}
}
// domainroot
if (parts.length > 2) {
url.domainroot = parts[parts.length - 2] + '.' + parts[parts.length - 1];
// check for country code top level domain
if (parts[parts.length - 1].length == 2 && parts[parts.length - 1].length == 2)
url.domainroot = parts[parts.length - 3] + '.' + url.domainroot;
}
// domainpath (domain+path without filenames)
if (url.segments.length > 0) {
var lastSegment = url.segments[url.segments.length - 1]
var endsWithFile = lastSegment.indexOf('.') != -1
if (endsWithFile) {
var fileSegment = url.path.indexOf(lastSegment)
var pathNoFile = url.path.substr(0, fileSegment - 1)
url.domainpath = url.domain
if (pathNoFile)
url.domainpath = url.domainpath + '/' + pathNoFile
} else
url.domainpath = url.domain + '/' + url.path
} else
url.domainpath = url.domain
return url
}
Was looking for a solution to this problem today. None of the above answers seemed to satisfy. I wanted a solution that could be a one liner, no conditional logic and nothing that had to be wrapped in a function.
Here's what I came up with, seems to work really well:
hostname="http://www.example.com:1234" hostname.split("//").slice(-1)[0].split(":")[0].split('.').slice(-2).join('.') // gives "example.com"
May look complicated at first glance, but it works pretty simply; the key is using 'slice(-n)' in a couple of places where the good part has to be pulled from the end of the split array (and [0] to get from the front of the split array).
Each of these tests return "example.com":
"http://example.com".split("//").slice(-1)[0].split(":")[0].split('.').slice(-2).join('.') "http://example.com:1234".split("//").slice(-1)[0].split(":")[0].split('.').slice(-2).join('.') "http://www.example.com:1234".split("//").slice(-1)[0].split(":")[0].split('.').slice(-2).join('.') "http://foo.www.example.com:1234".split("//").slice(-1)[0].split(":")[0].split('.').slice(-2).join('.')
This is not a full answer, but the below code should help you:
function myFunction() {
var str = "https://www.123rf.com/photo_10965738_lots-oop.html";
matches = str.split('/');
return matches[2];
}
I would like some one to create code faster than mine. It help to improve my-self also.
String.prototype.trim = function(){return his.replace(/^\s+|\s+$/g,"");}
function getHost(url){
if("undefined"==typeof(url)||null==url) return "";
url = url.trim(); if(""==url) return "";
var _host,_arr;
if(-1<url.indexOf("://")){
_arr = url.split('://');
if(-1<_arr[0].indexOf("/")||-1<_arr[0].indexOf(".")||-1<_arr[0].indexOf("\?")||-1<_arr[0].indexOf("\&")){
_arr[0] = _arr[0].trim();
if(0==_arr[0].indexOf("//")) _host = _arr[0].split("//")[1].split("/")[0].trim().split("\?")[0].split("\&")[0];
else return "";
}
else{
_arr[1] = _arr[1].trim();
_host = _arr[1].split("/")[0].trim().split("\?")[0].split("\&")[0];
}
}
else{
if(0==url.indexOf("//")) _host = url.split("//")[1].split("/")[0].trim().split("\?")[0].split("\&")[0];
else return "";
}
return _host;
}
function getHostname(url){
if("undefined"==typeof(url)||null==url) return "";
url = url.trim(); if(""==url) return "";
return getHost(url).split(':')[0];
}
function getDomain(url){
if("undefined"==typeof(url)||null==url) return "";
url = url.trim(); if(""==url) return "";
return getHostname(url).replace(/([a-zA-Z0-9]+.)/,"");
}
function hostname(url) {
var match = url.match(/:\/\/(www[0-9]?\.)?(.[^/:]+)/i);
if ( match != null && match.length > 2 && typeof match[2] === 'string' && match[2].length > 0 ) return match[2];
}
The above code will successfully parse the hostnames for the following example urls:
http://WWW.first.com/folder/page.html first.com
http://mail.google.com/folder/page.html mail.google.com
https://mail.google.com/folder/page.html mail.google.com
http://www2.somewhere.com/folder/page.html?q=1 somewhere.com
https://www.another.eu/folder/page.html?q=1 another.eu
Original credit goes to: http://www.primaryobjects.com/CMS/Article145
Okay, I know this is an old question, but I made a super-efficient url parser so I thought I'd share it.
As you can see, the structure of the function is very odd, but it's for efficiency. No prototype functions are used, the string doesn't get iterated more than once, and no character is processed more than necessary.
function getDomain(url) {
var dom = "", v, step = 0;
for(var i=0,l=url.length; i<l; i++) {
v = url[i]; if(step == 0) {
//First, skip 0 to 5 characters ending in ':' (ex: 'https://')
if(i > 5) { i=-1; step=1; } else if(v == ':') { i+=2; step=1; }
} else if(step == 1) {
//Skip 0 or 4 characters 'www.'
//(Note: Doesn't work with www.com, but that domain isn't claimed anyway.)
if(v == 'w' && url[i+1] == 'w' && url[i+2] == 'w' && url[i+3] == '.') i+=4;
dom+=url[i]; step=2;
} else if(step == 2) {
//Stop at subpages, queries, and hashes.
if(v == '/' || v == '?' || v == '#') break; dom += v;
}
}
return dom;
}
Here's the jQuery one-liner:
$('<a>').attr('href', url).prop('hostname');
oneline with jquery
$('<a>').attr('href', document.location.href).prop('hostname');
// use this if you know you have a subdomain
// www.domain.com -> domain.com
function getDomain() {
return window.location.hostname.replace(/([a-zA-Z0-9]+.)/,"");
}
I personally researched a lot for this solution, and the best one I could find is actually from CloudFlare's "browser check":
function getHostname(){
secretDiv = document.createElement('div');
secretDiv.innerHTML = "<a href='/'>x</a>";
secretDiv = secretDiv.firstChild.href;
var HasHTTPS = secretDiv.match(/https?:\/\//)[0];
secretDiv = secretDiv.substr(HasHTTPS.length);
secretDiv = secretDiv.substr(0, secretDiv.length - 1);
return(secretDiv);
}
getHostname();
I rewritten variables so it is more "human" readable, but it does the job better than expected.
Well, doing using an regular expression will be a lot easier:
mainUrl = "http://www.mywebsite.com/mypath/to/folder";
urlParts = /^(?:\w+\:\/\/)?([^\/]+)(.*)$/.exec(mainUrl);
host = Fragment[1]; // www.mywebsite.com
in short way you can do like this
var url = "http://www.someurl.com/support/feature"
function getDomain(url){
domain=url.split("//")[1];
return domain.split("/")[0];
}
eg:
getDomain("http://www.example.com/page/1")
output:
"www.example.com"
Use above function to get domain name
Parse-Urls appears to be the JavaScript library with the most robust patterns
Here is a rundown of the features:
Chapter 1. Normalize or parse one URL
Chapter 3. Extract URIs with certain names
Chapter 4. Extract all fuzzy URLs
Code:
var regex = /\w+.(com|co\.kr|be)/ig;
var urls = ['http://www.youtube.com/watch?v=ClkQA2Lb_iE',
'http://youtu.be/ClkQA2Lb_iE',
'http://www.example.com/12xy45',
'http://example.com/random'];
$.each(urls, function(index, url) {
var convertedUrl = url.match(regex);
console.log(convertedUrl);
});
Result:
youtube.com
youtu.be
example.com
example.com
parse-domain - a very solid lightweight library
npm install parse-domain
const parseDomain = require("parse-domain");
Example 1
parseDomain("http://www.example.com/12xy45")
{ tld: 'com', domain: 'example', subdomain: 'www' }
Example 2
parseDomain("http://subsub.sub.test.ExAmPlE.coM/12xy45")
{ tld: 'com', domain: 'example', subdomain: 'subsub.sub.test' }
Why?
Depending on the use case and volume I strongly recommend against solving this problem yourself using regex or other string manipulation means. The core of this problem is that you need to know all the gtld and cctld suffixes to properly parse url strings into domain and subdomains, these suffixes are regularly updated. This is a solved problem and not one you want to solve yourself (unless you are google or something). Unless you need the hostname or domain name in a pinch don't try and parse your way out of this one.
My code looks like this. Regular expressions can come in many forms, and here are my test cases I think it's more scalable.
function extractUrlInfo(url){
let reg = /^((?<protocol>http[s]?):\/\/)?(?<host>((\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])|[[email protected]:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)))(\:(?<port>[0-9]|[1-9]\d|[1-9]\d{2}|[1-9]\d{3}|[1-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d|6553[0-5]))?$/
return reg.exec(url).groups
}
var url = "https://192.168.1.1:1234"
console.log(extractUrlInfo(url))
var url = "https://stackoverflow.com/questions/8498592/extract-hostname-name-from-string"
console.log(extractUrlInfo(url))
Try below code for exact domain name using regex,
String line = "http://www.youtube.com/watch?v=ClkQA2Lb_iE";
String pattern3="([\\w\\W]\\.)+(.*)?(\\.[\\w]+)";
Pattern r = Pattern.compile(pattern3);
Matcher m = r.matcher(line);
if (m.find( )) {
System.out.println("Found value: " + m.group(2) );
} else {
System.out.println("NO MATCH");
}
import URL from 'url';
const pathname = URL.parse(url).path;
console.log(url.replace(pathname, ''));
this takes care of both the protocol.
©2020 All rights reserved.