Check if a string is html or not

I have a certain string for which I want to check if it is a html or not. I am using regex for the same but not getting the proper result.

I validated my regex and it works fine here.

var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>");
return htmlRegex.test(testString);

Here's the fiddle but the regex isn't running in there. http://jsfiddle.net/wFWtc/

On my machine, the code runs fine but I get a false instead of true as the result. What am missing here?

Answers:

Answer

A better regex to use to check if a string is HTML is:

/^/

For example:

/^/.test('') // true
/^/.test('foo bar baz') //true
/^/.test('<p>fizz buzz</p>') //true

In fact, it's so good, that it'll return true for every string passed to it, which is because every string is HTML. Seriously, even if it's poorly formatted or invalid, it's still HTML.

If what you're looking for is the presence of HTML elements, rather than simply any text content, you could use something along the lines of:

/<\/?[a-z][\s\S]*>/i.test()

It won't help you parse the HTML in any way, but it will certainly flag the string as containing HTML elements.

Answer

Method #1. Here is the simple function to test if the string contains HTML data:

function isHTML(str) {
  var a = document.createElement('div');
  a.innerHTML = str;

  for (var c = a.childNodes, i = c.length; i--; ) {
    if (c[i].nodeType == 1) return true; 
  }

  return false;
}

The idea is to allow browser DOM parser to decide if provided string looks like an HTML or not. As you can see it simply checks for ELEMENT_NODE (nodeType of 1).

I made a couple of tests and looks like it works:

isHTML('<a>this is a string</a>') // true
isHTML('this is a string')        // false
isHTML('this is a <b>string</b>') // true

This solution will properly detect HTML string, however it has side effect that img/vide/etc. tags will start downloading resource once parsed in innerHTML.

Method #2. Another method uses DOMParser and doesn't have loading resources side effects:

function isHTML(str) {
  var doc = new DOMParser().parseFromString(str, "text/html");
  return Array.from(doc.body.childNodes).some(node => node.nodeType === 1);
}

Notes:
1. Array.from is ES2015 method, can be replaced with [].slice.call(doc.body.childNodes).
2. Arrow function in some call can be replaced with usual anonymous function.

Answer

Here's a sloppy one-liner that I use from time to time:

var isHTML = RegExp.prototype.test.bind(/(<([^>]+)>)/i);

It will basically return true for strings containing a < followed by ANYTHING followed by >.

By ANYTHING, I mean basically anything except an empty string.

It's not great, but it's a one-liner.

Usage

isHTML('Testing');               // false
isHTML('<p>Testing</p>');        // true
isHTML('<img src="hello.jpg">'); // true
isHTML('My < weird > string');   // true (caution!!!)
isHTML('<>');                    // false

As you can see it's far from perfect, but might do the job for you in some cases.

Answer

Using jQuery in this case, the simplest form would be:

if ($(testString).length > 0)

If $(testString).length = 1, this means that there is one HTML tag inside textStging.

Answer

My solution is

const element = document.querySelector('.test_element');

const setHtml = elem =>{
    let getElemContent = elem.innerHTML;

    // Clean Up whitespace in the element
    // If you don't want to remove whitespace, then you can skip this line
    let newHtml = getElemContent.replace(/[\n\t ]+/g, " ");

    //RegEX to check HTML
    let checkHtml = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/.test(getElemContent);

    //Check it is html or not
    if (checkHtml){
        console.log('This is an HTML');
        console.log(newHtml.trim());
    }
    else{
        console.log('This is a TEXT');
        console.log(elem.innerText.trim());
    }
}

setHtml(element);
Answer

There are fancy solutions involving utilizing the browser itself to attempt to parse the text, identifying if any DOM nodes were constructed, which will be… slow. Or regular expressions which will be faster, but… potentially inaccurate. There are also two very distinct questions arising from this problem:

Q1: Does a string contain HTML fragments?

Is the string part of an HTML document, containing HTML element markup or encoded entities? This can be used as an indicator that the string may require bleaching / sanitization or entity decoding:

/</?[a-z][^>]*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);/

You can see this pattern in use against all of the examples from all existing answers at the time of this writing, plus some… rather hideous WYSIWYG- or Word-generated sample text and a variety of character entity references.

Q2: Is the string an HTML document?

The HTML specification is shockingly loose as to what it considers an HTML document. Browsers go to extreme lengths to parse almost any garbage text as HTML. Two approaches: either just consider everything HTML (since if delivered with a text/html Content-Type, great effort will be expended to try to interpret it as HTML by the user-agent) or look for the prefix marker:

<!DOCTYPE html>

In terms of "well-formedness", that, and almost nothing else is "required". The following is a 100% complete, fully valid HTML document containing every HTML element you think is being omitted:

<!DOCTYPE html>
<title>Yes, really.</title>
<p>This is everything you need.

Yup. There are explicit rules on how to form "missing" elements such as <html>, <head>, and <body>. Though I find it rather amusing that SO's syntax highlighting failed to detect that properly without an explicit hint.

Answer

A little bit of validation with:

/<(?=.*? .*?\/ ?>|br|hr|input|!--|wbr)[a-z]+.*?>|<([a-z]+).*?<\/\1>/i.test(htmlStringHere) 

This searches for empty tags (some predefined) and / terminated XHTML empty tags and validates as HTML because of the empty tag OR will capture the tag name and attempt to find it's closing tag somewhere in the string to validate as HTML.

Explained demo: http://regex101.com/r/cX0eP2

Update:

Complete validation with:

/<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2>/i.test(htmlStringHere) 

This does proper validation as it contains ALL HTML tags, empty ones first followed by the rest which need a closing tag.

Explained demo here: http://regex101.com/r/pE1mT5

Answer

zzzzBov's answer above is good, but it does not account for stray closing tags, like for example:

/<[a-z][\s\S]*>/i.test('foo </b> bar'); // false

A version that also catches closing tags could be this:

/<[a-z/][\s\S]*>/i.test('foo </b> bar'); // true
Answer

All of the answers here are over-inclusive, they just look for < followed by >. There is no perfect way to detect if a string is HTML, but you can do better.

Below we look for end tags, and will be much tighter and more accurate:

import re
re_is_html = re.compile(r"(?:</[^<]+>)|(?:<[^<]+/>)")

And here it is in action:

# Correctly identified as not HTML:
print re_is_html.search("Hello, World")
print re_is_html.search("This is less than <, this is greater than >.")
print re_is_html.search(" a < 3 && b > 3")
print re_is_html.search("<<Important Text>>")
print re_is_html.search("<a>")

# Correctly identified as HTML
print re_is_html.search("<a>Foo</a>")
print re_is_html.search("<input type='submit' value='Ok' />")
print re_is_html.search("<br/>")

# We don't handle, but could with more tweaking:
print re_is_html.search("<br>")
print re_is_html.search("Foo &amp; bar")
print re_is_html.search("<input type='submit' value='Ok'>")
Answer

If you're creating a regex from a string literal you need to escape any backslashes:

var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
// extra backslash added here ---------------------^ and here -----^

This is not necessary if you use a regex literal, but then you need to escape forward slashes:

var htmlRegex = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;
// forward slash escaped here ------------------------^

Also your jsfiddle didn't work because you assigned an onload handler inside another onload handler - the default as set in the Frameworks & Extensions panel on the left is to wrap the JS in an onload. Change that to a nowrap option and fix the string literal escaping and it "works" (within the constraints everybody has pointed out in comments): http://jsfiddle.net/wFWtc/4/

As far as I know JavaScript regular expressions don't have back-references. So this part of your expression:

</\1>

won't work in JS (but would work in some other languages).

Answer

/<\/?[^>]*>/.test(str) Only detect whether it contains html tags, may be a xml

Answer

With jQuery:

function isHTML(str) {
  return /^<.*?>$/.test(str) && !!$(str)[0];
}
Answer

There is an NPM package is-html that can attempt to solve this https://github.com/sindresorhus/is-html

Tags

Recent Questions

Top Questions

Home Tags Terms of Service Privacy Policy DMCA Contact Us

©2020 All rights reserved.