regex - How to parse Unicode characters in a UTF-8 HTML document with PHP


I have an HTML file generated by Google that begins with the following markup:

  <!doctype html><html><head><title>Ddd</title><meta http-equiv="content-type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=2">

and I use the following pattern to match Unicode text (Chinese and special characters):

  $pattern_Title = '/class=\"text1t\">[\'\w\s\:\d]+/u';

I know that I can use the "u" modifier to enable Unicode matching in PHP for UTF-8 documents. However, although this is a UTF-8 document, something goes wrong. When I run the PHP code against the online HTML page (without saving the contents to my computer), the pattern matches nothing while the "u" modifier is present. When I remove "u", the code works but fails to match Chinese characters. I then copied the HTML content into a string variable inside my PHP code and saved the file. When I run the code with "u" on that string, it works just fine.
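To make the two cases above easier to compare, here is a minimal sketch that runs the same pattern on an inline UTF-8 string and on a string containing bytes that are not valid UTF-8. Both sample strings and the helper name `describe_match` are hypothetical, and the GBK bytes are only a stand-in for what a remote fetch might return; an encoding mismatch is one possible cause, not a confirmed one. The sketch relies on `preg_match` returning `false` and `preg_last_error()` reporting `PREG_BAD_UTF8_ERROR` when a "u" pattern is given a subject that is not valid UTF-8.

```php
<?php
// Pattern from the question; the class name "text1t" comes from the post.
$pattern = '/class="text1t">[\'\w\s:\d]+/u';

// Case 1: markup stored as a UTF-8 literal in the source file (hypothetical sample).
$inline = '<span class="text1t">中文 title 123</span>';

// Case 2: the same kind of markup, but with the GBK bytes for the two Chinese
// characters. This is an assumed stand-in for content fetched in another encoding.
$fetched = "<span class=\"text1t\">\xD6\xD0\xCE\xC4</span>";

// Hypothetical helper: reports a match, no match, or a PCRE error.
function describe_match(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $m);
    if ($result === false) {
        // With the u modifier, PCRE rejects subjects that are not valid UTF-8.
        return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'invalid UTF-8' : 'other error';
    }
    return $result === 1 ? 'match: ' . $m[0] : 'no match';
}

echo describe_match($pattern, $inline), PHP_EOL;
echo describe_match($pattern, $fetched), PHP_EOL;
```

If the second call reports invalid UTF-8 for the real page content as well, the bytes being parsed are probably not in the encoding the `charset` declaration claims.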

So, I do not know how to fix the problem. There is a post on Stack Overflow about converting content to UTF-8 in PHP; I tried it, but it made no difference at all. The HTML code is generated by Google.

Any thoughts? Thank you in advance.

