I have a variable named $articleText that contains HTML. It includes script and style code inside <script> and <style> elements. I want to scan $articleText and remove these pieces of code. If I can also remove the <script>, </script>, <style> and </style> tags themselves, I would do that too.
I imagine I need to use regex, but I am not skilled in it.
Can anyone assist?
I wish I could provide some code, but as I said, I am not skilled in regex, so I don't have anything to show.
Do not use regex on HTML. PHP provides a tool for parsing DOM structures, appropriately called DOMDocument.
<?php
// some HTML for example
$myHtml = '<html><head><script>alert("hi mom!");</script></head><body><style>body { color: red;} </style><h1>This is some content</h1><p>content is awesome</p></body><script src="someFile.js"></script></html>';
// create a new DomDocument object
$doc = new DOMDocument();
// load the HTML into the DomDocument object (this would be your source HTML)
$doc->loadHTML($myHtml);
removeElementsByTagName('script', $doc);
removeElementsByTagName('style', $doc);
removeElementsByTagName('link', $doc);
// output cleaned html
echo $doc->saveHtml();
function removeElementsByTagName($tagName, $document) {
    $nodeList = $document->getElementsByTagName($tagName);
    // iterate backwards because the node list shrinks as nodes are removed
    for ($nodeIdx = $nodeList->length; --$nodeIdx >= 0; ) {
        $node = $nodeList->item($nodeIdx);
        $node->parentNode->removeChild($node);
    }
}
You can try it here: https://eval.in/private/4f225fa0dcb4eb
Documentation
- DomDocument: http://php.net/manual/en/class.domdocument.php
- DomNodeList: http://php.net/manual/en/class.domnodelist.php
- DomDocument::getElementsByTagName: http://us3.php.net/manual/en/domdocument.getelementsbytagname.php
Even though regex is not a good tool for this kind of task, it may work for a small, simple task.
If you want to remove just inner text of tag(s), use:
preg_replace('/(<(script|style)\b[^>]*>).*?(<\/\2>)/is', "$1$3", $txt);
If you also want to remove the tags, the replacement string in the above code would be empty, so just "".
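To compare the two replacement strings, here is a small sketch run on a made-up input. The `$txt` value and the variable names `$emptied` and `$removed` are assumptions for illustration, not part of the question.

```php
<?php
// Made-up input for illustration; note the upper-case STYLE tag.
$txt = '<p>Hi</p><script type="text/javascript">alert(1);</script><STYLE>p{color:red}</STYLE><p>Bye</p>';

// i = case-insensitive, s = let "." match newlines.
// \2 repeats whatever tag name group 2 matched, so <script> only pairs with </script>.
$pattern = '/(<(script|style)\b[^>]*>).*?(<\/\2>)/is';

// Keep the tags, drop only their contents.
$emptied = preg_replace($pattern, '$1$3', $txt);

// Drop the tags together with their contents.
$removed = preg_replace($pattern, '', $txt);

echo $emptied, "\n"; // <p>Hi</p><script type="text/javascript"></script><STYLE></STYLE><p>Bye</p>
echo $removed, "\n"; // <p>Hi</p><p>Bye</p>
```

The single-quoted '$1$3' avoids any doubt about PHP interpolating the replacement string.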
I think this should do what you need (assuming there are no nested script and style tags):
$articleText = preg_replace('/(<script[^>]*>.*?<\/script>|<style[^>]*>.*?<\/style>)/is', '', $articleText);
Here's sample data:
$in = '
<html>
<head>
<script type="text/javascript">window.location="somehwere";</script>
<style>
.someCSS {border:1px solid black;}
</style>
</head>
<body>
<p>....</p>
<div>
<script type="text/javascript">document.write("bad stuff");</script>
</div>
<ul>
<li><style type="text/css">#moreCSS {font-weight:900;}</style></li>
</ul>
</body>
</html>';
And now the spelled-out version:
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);
removeByTag($dom,'style');
removeByTag($dom,'script');
var_dump($dom->saveHTML());
function removeByTag($dom, $tag) {
    $nodeList = $dom->getElementsByTagName($tag);
    removeAll($nodeList);
}
function removeAll($nodeList) {
    for ( $i = $nodeList->length; --$i >= 0; ) {
        removeSelf($nodeList->item($i));
    }
}
function removeSelf($node) {
    $node->parentNode->removeChild($node);
}
And an alternate (does the same thing, just no function declarations):
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);
for ( $list = $dom->getElementsByTagName('script'), $i = $list->length; --$i >= 0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
for ( $list = $dom->getElementsByTagName('style'), $i = $list->length; --$i >= 0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
var_dump($dom->saveHTML());
The trick is to iterate backwards when deleting nodes. And getElementsByTagName will traverse the entire DOM for you, so you don't have to (none of that hasChildNodes, nextSibling, nextChild stuff).
Perhaps the best solution is somewhere in between those two extreme examples.
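To illustrate why iterating backwards matters, here is a minimal sketch comparing a forward loop with a backward loop on the same markup. The `$html` string and the variable names are assumptions made up for this comparison.

```php
<?php
// Illustrative markup (an assumption, not the sample data above): three sibling scripts.
$html = '<html><body><script>1</script><script>2</script><script>3</script></body></html>';

// Forward loop: the list is live, so after each removal the remaining nodes
// shift down while $i keeps growing, and a node gets skipped.
$forward = new DOMDocument();
$forward->loadHTML($html);
$list = $forward->getElementsByTagName('script');
for ($i = 0; $i < $list->length; $i++) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterForward = $forward->getElementsByTagName('script')->length;

// Backward loop: removing the last node never changes the index of the ones before it.
$backward = new DOMDocument();
$backward->loadHTML($html);
$list = $backward->getElementsByTagName('script');
for ($i = $list->length; --$i >= 0; ) {
    $node = $list->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterBackward = $backward->getElementsByTagName('script')->length;

echo $leftAfterForward, "\n";  // 1 (the second script survived)
echo $leftAfterBackward, "\n"; // 0
```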
Couldn't help myself; this is probably the best version of my suggestions. It doesn't include an incrementor ($i) to muck things up, and it always removes the first remaining match:
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTML($in);
removeElementsByTagName($dom,'script');
removeElementsByTagName($dom,'style');
function removeElementsByTagName($dom, $tagName) {
    $list = $dom->getElementsByTagName($tagName);
    while ( $node = $list->item(0) ) {
        $node->parentNode->removeChild($node);
    }
}
var_dump($dom->saveHTML());
The node list is live: as you remove nodes, the remaining ones move up in the list, so 1 becomes 0, 2 becomes 1, etc. Keep doing this (while) until there aren't any more (->item returns null). I also wrapped this in a reusable function.
Assuming the concern is both keeping random styles from messing up your design and securing your site against user scripting, removing these tags alone will not keep you safe.
Consider the case of event attributes (ex: onmouseover, onclick):
<h1 onclick="console.log('user made this happen');">User Scripting Test</h1>
or even worse
<h1 onclick='function addCSSRule(a,b,c,d){"insertRule"in a?a.insertRule(b+"{"+c+"}",d):"addRule"in a&&a.addRule(b,c,d)}var style=document.createElement("style");style.appendChild(document.createTextNode("")),document.head.appendChild(style),sheet=style.sheet,addCSSRule(sheet,"*","color: #ff0!important");'>Messing with your styles!</h1>
With this, it's fairly trivial to start inserting all sorts of stuff into the document.
Last example of stylesheet mods taken from David Walsh - https://davidwalsh.name/add-rules-stylesheets
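As a rough illustration of what handling event attributes involves, the sketch below strips every attribute whose name starts with "on" using DOMXPath. The `class="title"` attribute is an assumption added to show that other attributes survive. This is not a complete sanitizer: it does nothing about, for example, javascript: URLs in href attributes.

```php
<?php
// Input based on the onclick example above, plus a harmless attribute for contrast.
$html = '<h1 class="title" onclick="console.log(\'user made this happen\');">User Scripting Test</h1>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select every attribute whose name starts with "on" (onclick, onmouseover, ...).
// XPath results are a static snapshot, so removing while iterating is safe here.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

$h1 = $doc->getElementsByTagName('h1')->item(0);
var_dump($h1->hasAttribute('onclick')); // bool(false)
var_dump($h1->hasAttribute('class'));   // bool(true)
```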
... is to use a proven third-party library that specializes in this. I suggest HTML Purifier. It'll rid your user input of styles, scripts, and pesky event attributes.
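A minimal usage sketch, assuming HTML Purifier was installed with Composer (package ezyang/htmlpurifier) and that the autoloader lives at vendor/autoload.php relative to the script. The `$dirty` input is made up for illustration, and the default configuration is used as-is, so check the library's documentation for the settings your site needs.

```php
<?php
// Assumption: installed via Composer; adjust the path to your project layout.
require_once __DIR__ . '/vendor/autoload.php';

// Made-up user input containing an event attribute and a script element.
$dirty = '<h1 onclick="alert(1)">Title</h1><script>alert(2);</script><p>Text</p>';

// Default configuration; the library filters against its own whitelist of elements and attributes.
$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

$clean = $purifier->purify($dirty);
echo $clean;
```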
A regex to do this would be incredibly obtuse, because of the possibility of tags within tags, and such confounding constructs as tag attributes.
I would suggest doing this in a DOM (either in PHP or JavaScript), which can identify and remove the undesired tags through actual parsing.
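A sketch of how that could look in PHP for a fragment such as $articleText, assuming the fragment is not a full HTML document. The helper name `stripElements`, its signature and the sample input are hypothetical. The fragment is wrapped in a <div> so its nodes can be found again after the parser adds <html> and <body>; text outside ASCII may additionally need a charset declaration, which this sketch leaves out.

```php
<?php
// Hypothetical helper (name and signature are assumptions): removes the given
// elements, contents included, from an HTML fragment and returns the rest.
function stripElements(string $html, array $tagNames = ['script', 'style']): string
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment so it can be located again after parsing.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tagNames as $tagName) {
        $list = $doc->getElementsByTagName($tagName);
        // Iterate backwards because the node list is live.
        for ($i = $list->length; --$i >= 0; ) {
            $node = $list->item($i);
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the wrapper's children, not the added <html>, <body> or <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Made-up sample input.
$articleText = '<p>Intro</p><script>alert("x");</script><style>p { color: red; }</style><p>Body</p>';
$cleaned = stripElements($articleText);
echo $cleaned;
```

An unmatched </div> inside the fragment would close the wrapper early, so treat this as a starting point, not a hardened implementation.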