Commit 76da7ee

html tag regex
1 parent 6e5414c commit 76da7ee

1 file changed: +56 -12 lines changed

tutorials/useful_regex.ipynb

Lines changed: 56 additions & 12 deletions
@@ -1,7 +1,7 @@
 {
 "metadata": {
 "name": "",
-"signature": "sha256:9fd7d5201ce5b97fadad65f2c30cfec993fc83907e04418b032bd1bbdac05ff4"
+"signature": "sha256:8f1ee7a7bfaeca0ee3e12b2387445faa10632d57277d59af6dbfdca9732e5910"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -41,7 +41,7 @@
 "output_type": "stream",
 "stream": "stdout",
 "text": [
-"Last updated: 06/07/2014 10:07:02 EDT\n",
+"Last updated: 06/07/2014 12:24:58 EDT\n",
 "\n",
 "CPython 3.4.1\n",
 "IPython 2.1.0\n"
@@ -264,7 +264,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"A regular expression to check for file extensions."
+"A regular expression to check for file extensions. \n",
+"\n",
+"Note: This approach is not recommended for thorough limitation of file types (parse the file header instead). However, this regex is still a useful alternative to e.g., a Python's `endswith` approach for quick pre-filtering for certain files of interest."
 ]
 },
 {
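The note added in this hunk contrasts a regex with Python's `endswith` for quick pre-filtering by file name. The notebook's own extension pattern is not part of this diff, so the sketch below uses a hypothetical pattern (`ext_pattern`) and hypothetical file names purely to illustrate the comparison. As the note says, neither check inspects file contents.

```python
import re

# Hypothetical extensions of interest; not taken from the notebook.
EXTENSIONS = ('.txt', '.csv')

# Assumed pattern: a non-empty stem, a dot, then one of the extensions.
ext_pattern = re.compile(r'.+\.(?:txt|csv)$', re.IGNORECASE)

# Hypothetical file names for illustration.
names = ['notes.txt', 'data.CSV', 'archive.tar.gz', 'txt', '.csv']

# Regex pre-filter: re.match anchors at the start, `$` at the end.
by_regex = [n for n in names if ext_pattern.match(n)]

# str.endswith accepts a tuple of suffixes; lower() makes it case-insensitive.
by_endswith = [n for n in names if n.lower().endswith(EXTENSIONS)]

print(by_regex)
print(by_endswith)
```

In this sketch the two filters differ only on `'.csv'`: the assumed regex requires at least one character before the dot, while `endswith` accepts the bare suffix.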
@@ -746,14 +748,6 @@
 "<br>"
 ]
 },
-{
-"cell_type": "heading",
-"level": 2,
-"metadata": {},
-"source": [
-"Time"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -810,14 +804,64 @@
 "\n",
 "for t in str_true:\n",
 " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
+"for f in str_false:\n",
+" assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
+],
+"language": "python",
+"metadata": {},
+"outputs": [],
+"prompt_number": 18
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"<br>\n",
+"<br>"
+]
+},
+{
+"cell_type": "heading",
+"level": 2,
+"metadata": {},
+"source": [
+"Checking for HTML tags"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"[[back to top](#Sections)]"
+]
+},
+{
+"cell_type": "code",
+"collapsed": false,
+"input": [
+"pattern = r\"\"\"</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>\"\"\"\n",
+"\n",
+"str_true = ('<a>', '<a href=\"something\">', '</a>', '<img src>')\n",
+" \n",
+"str_false = ('a>', '<a ', '< a >')\n",
+"\n",
+"for t in str_true:\n",
+" assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
 "\n",
 "for f in str_false:\n",
 " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
 ],
 "language": "python",
 "metadata": {},
 "outputs": [],
-"prompt_number": 33
+"prompt_number": 16
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"<font size=\"1px\">source: [http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/](http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx/)</font>"
+]
 }
 ],
 "metadata": {}

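For readers who want to try the "Checking for HTML tags" cell outside the notebook, here is a minimal standalone sketch with the JSON escaping removed. The pattern and the first seven sample strings come from the added cell. The helper names `starts_with_tag` and `is_single_tag`, and the extra samples `'<br/>'` and `'<a>text'`, are assumptions added for illustration. The notebook cell relies on `re` having been imported earlier; here the import is explicit.

```python
import re

# Pattern copied from the notebook cell added in this commit (unescaped from JSON).
pattern = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""


def starts_with_tag(text):
    # Hypothetical helper: re.match only anchors at the start of the string,
    # which is the check the notebook's assertions perform.
    return re.match(pattern, text) is not None


def is_single_tag(text):
    # Hypothetical stricter variant: re.fullmatch requires the whole string
    # to be consumed by the pattern.
    return re.fullmatch(pattern, text) is not None


samples = ['<a>', '<a href="something">', '</a>', '<img src>',
           'a>', '<a ', '< a >', '<br/>', '<a>text']

for s in samples:
    print(repr(s), starts_with_tag(s), is_single_tag(s))
```

Because `re.match` does not require the match to reach the end of the string, `'<a>text'` passes the first check but not the second. Which behavior is wanted depends on whether the goal is to detect a leading tag or to validate that a string is exactly one tag.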