Remove everything within script and style tags

I have a variable named $articleText that contains HTML. Inside it are scripts and styles enclosed in <script> and <style> elements. I want to scan $articleText and remove that code. If I can also remove the <script>, </script>, <style> and </style> tags themselves, I would do that too.

I imagine I need to use regex; however, I am not skilled in it.

Can anyone assist?

I wish I could provide some code but like I said I am not skilled in regex so I don't have anything to show.

I cannot use DOM. I specifically need to use regex against these specific tags.

Answers:

Answer

Do not use regex on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.

<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';

// create a new DomDocument object
$doc = new DOMDocument();

// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);

removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);

// output cleaned html
echo $doc->saveHTML();

function removeElementsByTagName($tagName, $document) {
  $nodeList = $document->getElementsByTagName($tagName);
  for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
    $node = $nodeList->item($nodeIdx);
    $node->parentNode->removeChild($node);
  }
}

You can try it here: https://eval.in/private/4f225fa0dcb4eb


Answer

Even though regex is not a good tool for this kind of task, it may work for small, simple cases.


If you want to remove just the contents of the tags, use:

// preg_replace() returns the new string, so assign the result
$txt = preg_replace('/(<(script|style)\b[^>]*>).*?(<\/\2>)/is', '$1$3', $txt);


If you also want to remove the tags themselves, make the replacement string in the code above empty, i.e. just "".
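Both replacement variants can be compared side by side. This is only a minimal sketch: the sample string is made up for illustration, and the pattern is the one from this answer, so it shares the same limits (for example, a literal "</script>" inside a JavaScript string would end the match early).

```php
<?php
// Made-up sample input; $txt mirrors the variable name used in this answer.
$txt = '<p>Hi</p><script type="text/javascript">alert(1);</script>'
     . '<STYLE>p { color: red; }</STYLE><p>Bye</p>';

// i = case-insensitive, s = let "." match newlines too.
$pattern = '/(<(script|style)\b[^>]*>).*?(<\/\2>)/is';

// Variant 1: keep the opening and closing tags, drop only what is between them.
$contentsRemoved = preg_replace($pattern, '$1$3', $txt);

// Variant 2: drop the tags together with their contents.
$tagsRemoved = preg_replace($pattern, '', $txt);

echo $contentsRemoved, PHP_EOL;
echo $tagsRemoved, PHP_EOL;
```

The first echo leaves empty <script></script> and <STYLE></STYLE> pairs in place; the second leaves only the two paragraphs.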

Answer

I think this should do what you need (assuming there are no nested script and style tags):

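// .*? (rather than .+?) also matches empty tags such as <script src="x"></script>;
// the i flag also catches <SCRIPT> and <STYLE>
$articleText = preg_replace('/(<script[^>]*>.*?<\/script>|<style[^>]*>.*?<\/style>)/is', '', $articleText);

Answer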

Here's sample data:

$in = '
<html>
    <head>
        <script type="text/javascript">window.location="somehwere";</script>
        <style>
            .someCSS {border:1px solid black;}
        </style>
    </head>
    <body>
        <p>....</p>
        <div>
            <script type="text/javascript">document.write("bad stuff");</script>
        </div>
        <ul>
            <li><style type="text/css">#moreCSS {font-weight:900;}</style></li>
        </ul>
    </body>
</html>';

And now the spelled-out version:

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

removeByTag($dom,'style');
removeByTag($dom,'script');

var_dump($dom->saveHTML());

function removeByTag($dom,$tag) {
    $nodeList = $dom->getElementsByTagName($tag);
    removeAll($nodeList);
}

function removeAll($nodeList) {
    for ( $i = $nodeList->length; --$i >=0; ) {
        removeSelf($nodeList->item($i));
    }
}

function removeSelf($node) {
    $node->parentNode->removeChild($node);
}

And an alternate (does the same thing, just no function declarations):

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

for ( $list = $dom->getElementsByTagName('script'), $i = $list->length; --$i >=0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}

for ( $list = $dom->getElementsByTagName('style'), $i = $list->length; --$i >=0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}

var_dump($dom->saveHTML());

The trick is to iterate backwards when deleting nodes. And getElementsByTagName will traverse the entire DOM for you, so you don't have to (none of that hasChildNodes, nextSibling, nextChild stuff).

Perhaps the best solution is somewhere in between those two extreme examples.


Couldn't help myself; this is probably the best version of my suggestions. It doesn't need an index ($i) to muck things up: it simply keeps removing the first remaining match until none are left:

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);

removeElementsByTagName($dom,'script');
removeElementsByTagName($dom,'style');

function removeElementsByTagName($dom,$tagName) {
    $list = $dom->getElementsByTagName($tagName);
    while ( $node = $list->item(0) ) {
        $node->parentNode->removeChild($node);
    }
}

var_dump($dom->saveHTML());

The node list is live: as you remove nodes, the remaining matches shift down in the list, so 1 becomes 0, 2 becomes 1, etc. Keep doing this (while) until there aren't any more (->item returns null). I also wrapped this in a reusable function.
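The re-indexing described above is also why a forward loop with an incrementing index tends to miss nodes. The sketch below contrasts that with taking a snapshot of the list first via iterator_to_array(); the sample markup is made up, and the snapshot approach is an alternative I'm adding for comparison, not something from the answer itself.

```php
<?php
// Made-up sample markup with four <script> elements.
$html = '<div><script>a</script><script>b</script>'
      . '<script>c</script><script>d</script></div>';

// Forward loop over the live list: every removal shifts the remaining
// matches down by one, so the incrementing index steps over some of them.
$dom = new DOMDocument();
$dom->loadHTML($html);
$list = $dom->getElementsByTagName('script');
for ($i = 0; $i < $list->length; $i++) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterForward = $dom->getElementsByTagName('script')->length;

// Snapshot first: copy the matches into a plain array, then remove them.
// The array does not change while nodes are being removed.
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
    $node->parentNode->removeChild($node);
}
$leftAfterSnapshot = $dom->getElementsByTagName('script')->length;

echo "forward loop left: $leftAfterForward", PHP_EOL;
echo "snapshot loop left: $leftAfterSnapshot", PHP_EOL;
```

Iterating backwards, removing item(0) in a while loop, and iterating over a snapshot all avoid the skipped-node problem; which one to use is mostly a matter of taste.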

Answer

Assuming your concern is both keeping random styles from messing up your design and securing your site against user scripting, removing these tags alone will not keep you safe.

Consider the case of event attributes (ex: onmouseover, onclick):

<h1 onclick="console.log('user made this happen');">User Scripting Test</h1>

or even worse

<h1 onclick='function addCSSRule(a,b,c,d){"insertRule"in a?a.insertRule(b+"{"+c+"}",d):"addRule"in a&&a.addRule(b,c,d)}var style=document.createElement("style");style.appendChild(document.createTextNode("")),document.head.appendChild(style),sheet=style.sheet,addCSSRule(sheet,"*","color: #ff0!important");'>Messing with your styles!</h1>

With this, it's fairly trivial to start inserting all sorts of stuff into the document.

The last example (the stylesheet modification) is taken from David Walsh: https://davidwalsh.name/add-rules-stylesheets
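To make the event-attribute problem concrete, here is a rough sketch of how such attributes could be found and stripped with DOMXPath. The sample markup is made up, and this is an illustration only, not a complete sanitizer: it assumes the input parses cleanly and does nothing about other vectors such as javascript: URLs.

```php
<?php
// Made-up sample input with an inline event handler.
$html = '<h1 onclick="console.log(1);" class="title">User Scripting Test</h1>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select every attribute whose name starts with "on" (onclick, onmouseover, ...).
$xpath = new DOMXPath($dom);
$attrs = $xpath->query('//@*[starts-with(name(), "on")]');

// Copy the matches into an array first so removals cannot disturb the iteration.
foreach (iterator_to_array($attrs) as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

// Serialize just the <h1> element rather than the whole wrapped document.
$cleaned = $dom->saveHTML($dom->getElementsByTagName('h1')->item(0));
echo $cleaned, PHP_EOL;
```

The onclick attribute is gone and the class attribute survives, but closing one hole by hand like this is exactly the kind of whack-a-mole that makes a home-grown filter risky.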

The only solution

... is to use a proven third-party library that specializes in this. I suggest HTML Purifier. It'll rid your user input of styles, scripts, and pesky event attributes.

Answer

A regex to do this would be incredibly convoluted, because of the possibility of tags within tags and such confounding constructs as tag attributes.

I would suggest doing this in a DOM (either in PHP or JavaScript), which can identify and remove the undesired tags through actual parsing.
