Remove formatting tags from string body of email

How do you remove all formatting tags when calling:

GmailApp.getInboxThreads()[0].getMessages()[0].getBody()

so that only the human-readable text remains?

The formatting can be discarded; the body text only needs to be parsed, but tags and entities such as:

"&" 
<br>

and possibly others, need to be removed.

Answers:

Answer

I am not sure what you mean by .getBody() - is this supposed to return a DOM body element?

However, the simplest solution for removing HTML tags is probably to let the browser render the HTML and then ask it for the text content:

var myHTMLContent = "hello &amp; world <br />!";
var tempDiv = document.createElement('div');
tempDiv.innerHTML = myHTMLContent;

// retrieve the cleaned content:
var textContent = tempDiv.innerText;

With the above example, the textContent variable will contain the text

"hello & world
!"

(Note the line break due to the <br /> tag.)

Answer

Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:

function getTextFromHtml(html) {
  return getTextFromNode(Xml.parse(html, true).getElement());
}

function getTextFromNode(x) {
  switch(x.toString()) {
    case 'XmlText': return x.toXmlString();
    case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
    default: return '';
  }
}

calling

getTextFromHtml("hello <div>foo</div>&amp; world <br /><div>bar</div>!");

will return

"hello foo& world bar!".

To explain, Xml.parse with the second param as "true" parses the document as an HTML page. We then walk the document (which will be patched up with missing HTML and BODY elements, etc. and turned into a valid XHTML page), turning text nodes into text and expanding all other nodes.

This is admittedly poorly documented; I wrote this by playing around with the Xml object and logging intermediate results until I got it to work. We need to document the Xml stuff better.

Answer

I noticed you are writing a Google Apps Script. There's no DOM in Google Apps Script, so you can't create elements or read the innerText property.

getBody() gives you the email's body in HTML. You can replace tags with this code:

var html = GmailApp.getInboxThreads()[0].getMessages()[0].getBody();
// Turn block-level closing tags and line breaks into newlines.
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, '  *');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*\/?>/ig, '\n'); // also matches "<br />"
// Strip whatever tags are left.
html = html.replace(/<[^>]+>/ig, '');

You may find more tags worth replacing. Remember that this code isn't meant for arbitrary HTML, only for the HTML returned by getBody(). Gmail has its own way of formatting the body and uses only a subset of the available HTML tags, so Gmail-specific code can be shorter.
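The replacements above remove tags but leave character entities such as &amp; in the text. As a minimal sketch, assuming plain JavaScript only, the same idea can be wrapped in a function that also decodes a few entities. The name htmlToPlainText and the entity table are hypothetical and deliberately incomplete; this is not a general HTML parser.

```javascript
// Hypothetical helper: converts Gmail-style HTML into readable plain text.
function htmlToPlainText(html) {
  // Small, incomplete table of named entities (an assumption; extend it
  // with whatever entities show up in your own mail).
  var entities = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

  var text = html
    .replace(/<li(\s[^>]*)?>/ig, '  * ')     // list items become bullets
    .replace(/<br\s*\/?>/ig, '\n')           // <br>, <br/> and <br />
    .replace(/<\/(div|p|li|ul|ol)>/ig, '\n') // block-level closers
    .replace(/<[^>]+>/g, '');                // strip every remaining tag

  // Decode entities only after the tags are gone, so that an escaped
  // "&lt;b&gt;" in the message survives as literal text.
  return text.replace(/&(?:#x([0-9a-f]+)|#([0-9]+)|([a-z]+));/ig,
    function (match, hex, dec, name) {
      // Numeric references: String.fromCharCode only covers code points
      // up to 0xFFFF, which is a known limitation of this sketch.
      if (hex) return String.fromCharCode(parseInt(hex, 16));
      if (dec) return String.fromCharCode(parseInt(dec, 10));
      var key = name.toLowerCase();
      return entities.hasOwnProperty(key) ? entities[key] : match;
    });
}
```

In a script it could be called as htmlToPlainText(message.getBody()), where message is a GmailMessage. Unknown entities are left untouched rather than guessed at.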

Answer

I found an easier way to accomplish this task.

Use the htmlBody advanced argument within the arguments of sendEmail(). Here's an example:

var threads = GmailApp.search('is:unread'); // find threads with unread messages
var messages = GmailApp.getMessagesForThreads(threads); // 2D array: messages[thread][message]

for (var i = 0; i < messages.length; ++i) {
  // Process only the most recent message in the thread; it already quotes
  // the earlier messages, so this reduces redundancy.
  var j = messages[i].length;
  var messageBody = messages[i][j-1].getBody(); // body of the message as HTML
  var messageSubject = messages[i][j-1].getSubject();
  GmailApp.sendEmail("[email protected]", messageSubject, "", {htmlBody: messageBody});
}

First I find all the threads containing unread messages, then I use GmailApp.getMessagesForThreads() to load their messages into a two-dimensional array. The for loop runs once per thread. I set j to the thread's message count so that messages[i][j-1] is the most recent message in the thread. getBody() returns that message's HTML body and getSubject() returns its subject. Finally, sendEmail(recipients, subject, body, optAdvancedArgs) sends the email, with the HTML passed in as htmlBody. The result is an email that arrives properly formatted, with all of its HTML features intact. The documentation for these methods can be found here: https://developers.google.com/apps-script/service_gmail

I hope this helps. The manual parsing method does work, but I still found bits and pieces of HTML left behind, so I thought I would give this a try. It worked for me; if I find any issues in the long run I will update this post. So far so good!
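The "last message of each thread" step described above can be isolated into a small function. This is a sketch only: getLatestMessages is a hypothetical name, and it assumes its input has the same shape as the two-dimensional array returned by getMessagesForThreads().

```javascript
// Hypothetical helper: given a 2D array of messages grouped by thread,
// return the most recent (last) message of each thread.
function getLatestMessages(messagesByThread) {
  return messagesByThread
    .filter(function (thread) { return thread.length > 0; }) // skip empty threads
    .map(function (thread) { return thread[thread.length - 1]; });
}
```

Because it only does array indexing, it can be tried outside Apps Script with ordinary arrays before being used on real GmailMessage objects.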

Answer

Google now has the getPlainBody() function, which returns the body of an email as plain text. It is a method of the GmailMessage class.

I had been using a script that converts emails I send into tasks, and Google broke it with a change to the functionality used in Corey's answer above. I've replaced it with the following.

var taskNote = ((thread.getMessages()[0]).getPlainBody()).substring(0,1000);
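As a minimal sketch, the one-liner above can be wrapped in a function that works on any object exposing getPlainBody(). The name getTaskNote and the maxLength parameter are hypothetical; only getPlainBody() comes from the Gmail service.

```javascript
// Hypothetical helper: message is assumed to be a GmailMessage (or any
// object with a getPlainBody() method); maxLength is an assumed parameter.
function getTaskNote(message, maxLength) {
  var limit = maxLength || 1000;             // default taken from the answer above
  var plain = message.getPlainBody() || '';  // guard against a missing body
  return plain.trim().substring(0, limit);   // drop surrounding whitespace, then truncate
}
```

Inside a script it could be called as getTaskNote(thread.getMessages()[0], 1000), where thread is a GmailThread.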
