DOWNLOAD THE CODE:
Download the Code 53877.zip

ScrapeText is a simple function that I use to pull plain text out of strings that contain markup-language formatting. I wrote this function to help build a report for a questionnaire system. The report needed to display only the plain text of each question. However, the text of each question was stored in heavily formatted HTML.

I searched online for a solution but didn’t find anything useful.Then, in a moment of clarity,a solution came to me:Write a function that accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the <and> tags. For example, if you run the code

SELECT dbo.ScrapeText
  (‘<a href=”www.sqlserver.com”>
  SQL Server</a>

the result is "SQL Server".

I wrote the ScrapeText function for SQL Server 2005 and SQL Server 2000. Listing 1 shows the code that does the main processing.This code first wraps the input string (which can be a string or an alphanumeric column in a table) with the > and <tags. The code then searches the input string for any nonbreaking spaces (&nbsp;), replacing them with standard spaces.

After the initial data cleansing is complete, the code examines the input string one character at a time from left to right. An imaginary pen is ready to start copying the desired characters to an output string. When a > tag is encountered, the pen is put to paper because it has reached the beginning of a desired string of text. The pen then writes each character to the output string until a <tag is encountered, at which point the pen stops writing. This process continues until the end of the input string is reached.
—Bill McEvoy

End of Article




You must log on before posting a comment.

If you don't have a username & password, please register now.

 
 

ADS BY GOOGLE