RegEx for matching/replacing JavaScript comments (both multiline and inline)

I need to remove all JavaScript comments from a JavaScript source using the JavaScript RegExp object.

What I need is the pattern for the RegExp.

So far, I've found this:

compressed = compressed.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');

This pattern works OK for:

/* I'm a comment */

or for:

/*
 * I'm a comment aswell
*/

But doesn't seem to work for the inline:

// I'm an inline comment

I'm not much of an expert on RegEx and its patterns, so I need help.

Also, I would like to have a RegEx pattern which would remove all those HTML-like comments.

<!-- HTML Comment //--> or <!-- HTML Comment -->

And also those conditional HTML comments, which can be found in various JavaScript sources.

Thanks.

Answers:

Answer

NOTE: Regex is not a lexer or a parser. If you have some weird edge case where you need some oddly nested comments parsed out of a string, use a parser. For the other 98% of the time this regex should work.

I had pretty complex block comments going on with nested asterisks, slashes, etc. The regular expression at the following site worked like a charm:

http://upshots.org/javascript/javascript-regexp-to-remove-comments
(see below for original)

Some modifications have been made, but the integrity of the original regex has been preserved. In order to allow certain double-slash (//) sequences (such as URLs), you must use back reference $1 in your replacement value instead of an empty string. Here it is:

/\/\*[\s\S]*?\*\/|([^\\:]|^)\/\/.*$/gm

// JavaScript: 
// source_string.replace(/\/\*[\s\S]*?\*\/|([^\\:]|^)\/\/.*$/gm, '$1');

// PHP:
// preg_replace("/\/\*[\s\S]*?\*\/|([^\\:]|^)\/\/.*$/m", "$1", $source_string);

DEMO: https://regex101.com/r/B8WkuX/1
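
As a quick check of the $1 replacement described above, here is a minimal sketch. Only the pattern comes from this answer; the sample input and the names COMMENT_RE, sample and stripped are made up for illustration.

```javascript
// Pattern from this answer; all other names here are illustrative.
const COMMENT_RE = /\/\*[\s\S]*?\*\/|([^\\:]|^)\/\/.*$/gm;

const sample = [
  'var url = "http://example.com"; // trailing comment',
  '/* block',
  '   comment */',
  'var x = 1;'
].join('\n');

// '$1' puts back the single character captured just before the '//',
// so the character preceding a line comment is not swallowed.
// The '//' inside the URL is skipped because it is preceded by ':'.
const stripped = sample.replace(COMMENT_RE, '$1');

console.log(stripped);
```

The URL inside the string survives only because of the colon check, so this should be read as a heuristic rather than real string awareness.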

FAILING USE CASES: There are a few edge cases where this regex fails. An ongoing list of those cases is documented in this public gist. Please update the gist if you can find other cases.

...and if you also want to remove <!-- html comments --> use this:

/\/\*[\s\S]*?\*\/|([^\\:]|^)\/\/.*$|<!--[\s\S]*?-->/gm

(original - for historical reference only)

// DO NOT USE THIS - SEE ABOVE
/(\/\*([\s\S]*?)\*\/)|(\/\/(.*)$)/gm

Answer

I have been putting together an expression that needs to do something similar.
The finished product is:

/(?:((["'])(?:(?:\\\\)|\\\2|(?!\\\2)\\|(?!\2).|[\n\r])*\2)|(\/\*(?:(?!\*\/).|[\n\r])*\*\/)|(\/\/[^\n\r]*(?:[\n\r]+|$))|((?:=|:)\s*(?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/))|((?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/)[gimy]?\.(?:exec|test|match|search|replace|split)\()|(\.(?:exec|test|match|search|replace|split)\((?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/))|(<!--(?:(?!-->).)*-->))/g

Scary, right?

To break it down, the first part matches anything within single or double quotation marks.
This is necessary to avoid matching comment-like text inside quoted strings.

((["'])(?:(?:\\\\)|\\\2|(?!\\\2)\\|(?!\2).|[\n\r])*\2)

The second part matches multiline comments delimited by /* */

(\/\*(?:(?!\*\/).|[\n\r])*\*\/)

The third part matches single line comments starting anywhere in the line

(\/\/[^\n\r]*(?:[\n\r]+|$))

The fourth through sixth parts match anything within a regex literal.
This relies on a preceding equals sign or colon, or on the literal appearing directly before or after a regex method call.

((?:=|:)\s*(?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/))
((?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/)[gimy]?\.(?:exec|test|match|search|replace|split)\()
(\.(?:exec|test|match|search|replace|split)\((?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/))

And the seventh, which I originally forgot, removes the HTML comments.

(<!--(?:(?!-->).)*-->)

I had an issue with my dev environment issuing errors for a regex that broke a line, so I used the following solution

var ADW_GLOBALS = {
  quotations : /((["'])(?:(?:\\\\)|\\\2|(?!\\\2)\\|(?!\2).|[\n\r])*\2)/,
  multiline_comment : /(\/\*(?:(?!\*\/).|[\n\r])*\*\/)/,
  single_line_comment : /(\/\/[^\n\r]*[\n\r]+)/,
  regex_literal : /(?:\/(?:(?:(?!\\*\/).)|\\\\|\\\/|[^\\]\[(?:\\\\|\\\]|[^]])+\])+\/)/,
  html_comments : /(<!--(?:(?!-->).)*-->)/,
  regex_of_doom : ''
}
ADW_GLOBALS.regex_of_doom = new RegExp(
  '(?:' + ADW_GLOBALS.quotations.source + '|' + 
  ADW_GLOBALS.multiline_comment.source + '|' + 
  ADW_GLOBALS.single_line_comment.source + '|' + 
  '((?:=|:)\\s*' + ADW_GLOBALS.regex_literal.source + ')|(' + 
  ADW_GLOBALS.regex_literal.source + '[gimy]?\\.(?:exec|test|match|search|replace|split)\\(' + ')|(' + 
  '\\.(?:exec|test|match|search|replace|split)\\(' + ADW_GLOBALS.regex_literal.source + ')|' +
  ADW_GLOBALS.html_comments.source + ')' , 'g'
);

changed_text = code_to_test.replace(ADW_GLOBALS.regex_of_doom, function(match, $1, $2, $3, $4, $5, $6, $7, $8, offset, original){
  if (typeof $1 != 'undefined') return $1;
  if (typeof $5 != 'undefined') return $5;
  if (typeof $6 != 'undefined') return $6;
  if (typeof $7 != 'undefined') return $7;
  return '';
});

This returns anything captured by the quoted string text and anything found in a regex literal intact but returns an empty string for all the comment captures.
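
The idea behind the callback above (match strings and comments in one pass, put the strings back, drop the comments) can be sketched with a much smaller pattern. This is a simplified illustration, not the regex_of_doom: it makes no attempt to recognise regex literals or template literals, and TOKEN_RE and stripCommentsKeepStrings are names invented for this example.

```javascript
// Group 1: a single- or double-quoted string (escapes allowed).
// Remaining alternatives: a block comment, then a line comment.
const TOKEN_RE =
  /("(?:\\[\s\S]|[^"\\])*"|'(?:\\[\s\S]|[^'\\])*')|\/\*[\s\S]*?\*\/|\/\/[^\n\r]*/g;

function stripCommentsKeepStrings(code) {
  // If group 1 participated, the match is a string: return it untouched.
  // Otherwise the match is a comment: replace it with nothing.
  return code.replace(TOKEN_RE, (match, str) => (str !== undefined ? str : ''));
}

const sample = [
  'var s = "http://a.b/*x*/"; // gone',
  "/* also gone */var t = 'it\\'s // kept';"
].join('\n');

console.log(stripCommentsKeepStrings(sample));
```

Because the string alternative is tried first at every position, comment markers inside a quoted string are consumed as part of the string and never reach the comment alternatives.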

I know this is excessive and rather difficult to maintain but it does appear to work for me so far.

Answer

This is late to be of much use to the original question, but maybe it will help someone.

Based on @Ryan Wheale's answer, I've found this to work as a comprehensive capture to ensure that matches exclude anything found inside a string literal.

/(?:\r\n|\n|^)(?:[^'"])*?(?:'(?:[^\r\n\\']|\\'|[\\]{2})*'|"(?:[^\r\n\\"]|\\"|[\\]{2})*")*?(?:[^'"])*?(\/\*(?:[\s\S]*?)\*\/|\/\/.*)/g

The last group (all others are discarded) is based on Ryan's answer. Example here.

This assumes the code is well structured and valid JavaScript.

Note: this has not been tested on poorly structured code, which may or may not be recoverable depending on the JavaScript engine's own heuristics.

Note: this should hold for valid JavaScript < ES6. However, ES6 allows multi-line string literals, in which case this regex will almost certainly break, though that case has not been tested.


However, it is still possible to match something that looks like a comment inside a regex literal (see comments/results in the Example above).

I use the above capture after replacing all regex literals using the following comprehensive capture extracted from es5-lexer here and here, as referenced in Mike Samuel's answer to this question:

/(?:(?:break|case|continue|delete|do|else|finally|in|instanceof|return|throw|try|typeof|void|[+]|-|[.]|[/]|,|[*])|[!%&(:;<=>?[^{|}~])?(\/(?![*/])(?:[^\\\[/\r\n\u2028\u2029]|\[(?:[^\]\\\r\n\u2028\u2029]|\\(?:[^\r\n\u2028\u2029ux]|u[0-9A-Fa-f]{4}|x[0-9A-Fa-f]{2}))+\]|\\(?:[^\r\n\u2028\u2029ux]|u[0-9A-Fa-f]{4}|x[0-9A-Fa-f]{2}))*\/[gim]*)/g

For completeness, see also this trivial caveat.

Answer

If you click on the link below, you will find a comment removal script written with regex.

It is 112 lines of code that work together. It also works with MooTools, Joomla, Drupal and other CMS websites. I tested it on 800,000 lines of code and comments, and it works fine. It also handles nested parentheticals like ( abc(/nn/('/xvx/'))"// testing line") and comments that are between colons, and protects them. 23-01-2016. The linked code still has its comments in it.

Click Here

Answer

I wonder if this was a trick question given by a professor to students. Why? Because it seems to me it is IMPOSSIBLE to do this, with Regular Expressions, in the general case.

Your code (or whoever's code it is) can contain valid JavaScript like this:

let a = "hello /* ";
let b = 123;
let c = "world */ ";

Now if you have a regexp which removes everything between a pair of /* and */, it would break the code above: it would remove the executable code in the middle as well.
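
A minimal sketch of that failure, using a plain block-comment pattern chosen for illustration (the variable names source and broken are also just for this example):

```javascript
const source = [
  'let a = "hello /* ";',
  'let b = 123;',
  'let c = "world */ ";'
].join('\n');

// A naive block-comment pattern cannot tell that the delimiters
// it finds are sitting inside string literals.
const broken = source.replace(/\/\*[\s\S]*?\*\//g, '');

console.log(broken); // let a = "hello  ";
```

Everything from the first /* to the first */ is removed, including the declaration of b and the start of c.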

If you try to devise a regexp that would not remove comments which contain quotes then you cannot remove such comments. That applies to single-quote, double-quotes and back-quotes.

You cannot remove (all) comments with Regular Expressions in JavaScript, it seems to me; maybe someone can point out a way to do it for the case above.

What you can do is build a small parser which goes through the code character by character and knows when it is inside a string and when it is inside a comment, and when it is inside a comment inside a string and so on.

I'm sure there are good open source JavaScript parsers that can do this. Maybe some of the packaging and minifying tools can do this for you as well.

Answer

For block comment: https://regex101.com/r/aepSSj/1

Matches a slash character (group \1) only if it is followed by an asterisk.

(\/)(?=\*)

followed by the asterisk itself, matched without being captured

(?:\*)

followed, as few times as possible, by either the first group or any character at all; the alternation itself is not remembered, but the whole run is captured as the second group.

((?:\1|[\s\S])*?)

followed by an asterisk and the first group (the closing slash)

(?:\*)\1

For block and/or inline comment: https://regex101.com/r/aepSSj/2

where | means "or" and (?=\/\/(.*)) captures anything after any //

or https://regex101.com/r/aepSSj/3 to capture the third part too

all in: https://regex101.com/r/aepSSj/8
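
Putting the block-comment fragments above together in JavaScript might look like the sketch below. The assembled pattern is only a reading of the pieces listed in this answer; the patterns behind the regex101 links may differ, and BLOCK_RE, input and output are illustrative names.

```javascript
// (\/)(?=\*)   a slash that is followed by an asterisk (group 1)
// (?:\*)       the asterisk itself
// (...)*?      the comment body, as short as possible (group 2)
// (?:\*)\1     a closing asterisk followed by the same slash
const BLOCK_RE = /(\/)(?=\*)(?:\*)((?:\1|[\s\S])*?)(?:\*)\1/g;

const input = 'a /* one */ b /** two\n * lines */ c';
const output = input.replace(BLOCK_RE, '');

console.log(output); // a  b  c
```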

Answer

2019:

All the other answers come with pitfalls, so I wrote something that just works. Try it out:

function scriptComment(code){
    const savedText = [];
    return code
        // extract strings and save them
        .replace(/(['"`]).*?\1/gm,function (match) {
            var i = savedText.push(match);
            return (i-1)+'###';
        })
        // remove  // comments
        .replace(/\/\/.*/gm,'')
        // now extract all regex and save them
        .replace(/\/[^*\n].*\//gm,function (match) {
            var i = savedText.push(match);
            return (i-1)+'###';
        })
        // remove /* */ comments (lazy, so code between two comments is kept)
        .replace(/\/\*[\s\S]*?\*\//gm,'')
        // remove <!-- --> comments (lazy, for the same reason)
        .replace(/<!--[\s\S]*?-->/gm, '')
        // put the saved strings and regex literals back
        .replace(/\d+###/gm,function(match){
            var i = Number.parseInt(match);
            return  savedText[i];
        });
}
var cleancode = scriptComment(scriptComment.toString())
console.log(cleancode)


Old answer (does not work on sample code like this):

// won't execute the creative code ("Can't execute code form a freed script"),
navigator.userAgent.match(/\b(MSIE |Trident.*?rv:|Edge\/)(\d+)/);

function scriptComment(code){
    const savedText = [];
    return code
          // extract strings and regex 
        .replace(/(['"`]).*?\1/gm,function (match) {
            savedText.push(match);
            return '###';
        })
        // remove  // comments
        .replace(/\/\/.*/gm,'')
        // now extract all regex and save them
        .replace(/\/[^*\n].*\//gm,function (match) {
            savedText.push(match);
            return '###';
        })
        // remove /* */ comments
        .replace(/\/\*[\s\S]*\*\//gm,'')
        // remove <!-- --> comments
        .replace(/<!--[\s\S]*-->/gm, '')
        /*replace \ with \\ so we not lost \b && \t*/
        .replace(/###/gm,function(){
            return savedText.shift();
        })
   
}
var cleancode = scriptComment(scriptComment.toString())
console.log(cleancode)

Answer

Based on the attempts above (mostly Abhishek Simon's) and using UltraEdit, I found this to work for inline comments; it handles all of the characters within the comment.

(\s\/\/|$\/\/)[\w\s\W\S.]*

This matches comments at the start of the line or with a space before //

//public static final String LETTERS_WORK_FOLDER = "/Letters/Generated/Work";

but not

"http://schemas.us.com.au/hub/'>" +

So the only case it does not handle is something like

if(x){f(x)}//where f is some function

which just needs to be

if(x){f(x)} //where f is function

Answer

try this,

(\/\*[\w\'\s\r\n\*]*\*\/)|(\/\/[\w\s\']*)|(\<![\-\-\s\w\>\/]*\>)

should work :)

Answer

This works for almost all cases:

var RE_BLOCKS = new RegExp([
  /\/(\*)[^*]*\*+(?:[^*\/][^*]*\*+)*\//.source,           // $1: multi-line comment
  /\/(\/)[^\n]*$/.source,                                 // $2 single-line comment
  /"(?:[^"\\]*|\\[\S\s])*"|'(?:[^'\\]*|\\[\S\s])*'/.source, // - string, don't care about embedded eols
  /(?:[$\w\)\]]|\+\+|--)\s*\/(?![*\/])/.source,           // - division operator
  /\/(?=[^*\/])[^[/\\]*(?:(?:\[(?:\\.|[^\]\\]*)*\]|\\.)[^[/\\]*)*?\/[gim]*/.source
  ].join('|'),                                            // - regex
  'gm'  // note: global+multiline with replace() need test
);

// remove comments, keep other blocks
function stripComments(str) {
  return str.replace(RE_BLOCKS, function (match, mlc, slc) {
    return mlc ? ' ' :         // multiline comment (replace with space)
           slc ? '' :          // single-line comment
           match;              // divisor, regex, or string, return as-is
  });
}

The code is based on regexes from jspreproc, a tool I wrote for the riot compiler.

See http://github.com/aMarCruz/jspreproc

Answer

In plain simple JS regex, this:

my_string_or_obj.replace(/\/\*[\s\S]*?\*\/|([^:]|^)\/\/.*$/gm, ' ')

Answer

A bit simpler:

this works also for multiline - (<!--.*?-->)|(<!--[\w\W\n\s]+?-->)
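
For reference, a minimal sketch of how this pattern could be applied in JavaScript. The g flag, the sample markup and the names HTML_COMMENT_RE, html and cleaned are assumptions added for the example.

```javascript
// Pattern from this answer, with the g flag added so every comment is removed.
const HTML_COMMENT_RE = /(<!--.*?-->)|(<!--[\w\W\n\s]+?-->)/g;

const html = '<p>a</p><!-- one --><p>b</p><!--\n two\n--><p>c</p>';
const cleaned = html.replace(HTML_COMMENT_RE, '');

console.log(cleaned); // <p>a</p><p>b</p><p>c</p>
```

The first alternative handles comments on a single line, and the second handles comments that span lines, since [\w\W] matches any character including newlines.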

Answer

I was looking for a quick Regex solution too, but none of the answers provided work 100%. Each one ends up breaking the source code in some way, mostly due to comments detected inside string literals. E.g.

var string = "https://www.google.com/";

Becomes

var string = "https:
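
The breakage is easy to reproduce. A minimal sketch, using a deliberately naive line-comment pattern chosen for illustration (not taken from any particular answer):

```javascript
const line = 'var string = "https://www.google.com/";';

// Treats every '//' as the start of a comment, even inside a string.
const naive = line.replace(/\/\/.*/g, '');

console.log(naive); // var string = "https:
```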

For the benefit of those coming in from Google, I ended up writing a short function (in JavaScript) that achieves what the Regex couldn't do. Modify it for whatever language you are using to parse JavaScript.

function removeCodeComments(code) {
    var inQuoteChar = null;
    var inBlockComment = false;
    var inLineComment = false;
    var inRegexLiteral = false;
    var newCode = '';
    for (var i=0; i<code.length; i++) {
        if (!inQuoteChar && !inBlockComment && !inLineComment && !inRegexLiteral) {
            if (code[i] === '"' || code[i] === "'" || code[i] === '`') {
                inQuoteChar = code[i];
            }
            else if (code[i] === '/' && code[i+1] === '*') {
                inBlockComment = true;
            }
            else if (code[i] === '/' && code[i+1] === '/') {
                inLineComment = true;
            }
            else if (code[i] === '/' && code[i+1] !== '/') {
                inRegexLiteral = true;
            }
        }
        else {
            if (inQuoteChar && ((code[i] === inQuoteChar && code[i-1] != '\\') || (code[i] === '\n' && inQuoteChar !== '`'))) {
                inQuoteChar = null;
            }
            if (inRegexLiteral && ((code[i] === '/' && code[i-1] !== '\\') || code[i] === '\n')) {
                inRegexLiteral = false;
            }
            if (inBlockComment && code[i-1] === '/' && code[i-2] === '*') {
                inBlockComment = false;
            }
            if (inLineComment && code[i] === '\n') {
                inLineComment = false;
            }
        }
        if (!inBlockComment && !inLineComment) {
            newCode += code[i];
        }
    }
    return newCode;
}

Answer

Simple regex ONLY for multi-lines:

/\*((.|\n)(?!/))+\*/
