From 453819639badbeab1706e7516ca32bd90aef7e5d Mon Sep 17 00:00:00 2001 From: Anthony Marakis Date: Wed, 12 Jul 2017 18:55:14 +0300 Subject: [PATCH 1/2] Update text.ipynb --- text.ipynb | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 53 insertions(+), 4 deletions(-) diff --git a/text.ipynb b/text.ipynb index 1ecabaf56..18865abf7 100644 --- a/text.ipynb +++ b/text.ipynb @@ -30,10 +30,8 @@ "* Text Models\n", "* Viterbi Text Segmentation\n", "* Information Retrieval\n", - "* Decoders\n", - " * Introduction\n", - " * Shift Decoder\n", - " * Permutation Decoder" + "* Information Extraction\n", + "* Decoders" ] }, { @@ -560,6 +558,57 @@ "Even though we are basically asking for the same thing, we got a different top result. The `diff` command shows the differences between two files. So the system failed us and presented us an irrelevant document. Why is that? Unfortunately our IR system considers each word independent. \"Remove\" and \"delete\" have similar meanings, but since they are different words our system will not make the connection. So, the `diff` manual which mentions a lot the word `delete` gets the nod ahead of other manuals, while the `rm` one isn't in the result set since it doesn't use the word at all." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## INFORMATION EXTRACTION\n", + "\n", + "**Information Extraction (IE)** is a method for finding occurrences of object classes and relationships in text. Unlike IR systems, an IE system includes (limited) notions of syntax and semantics. While it is difficult to extract object information in a general setting, for more specific domains the system is very useful. One model of an IE system makes use of templates that match with strings in a text.\n", + "\n", + "A typical example of such a model is reading prices from web pages. Prices usually appear after a dollar sign and consist of digits, optionally followed by a decimal point and two more digits. 
Before the price, usually there will appear a string like \"price:\". Let's build a sample template.\n", + "\n", + "With the following regular expression (*regex*) we can extract prices from text:\n", + "\n", + "`[$][0-9]+([.][0-9][0-9])?`\n", + "\n", + "Where `+` means 1 or more occurences and `?` means at most 1 occurence. Usually a template consists of a prefix, a target and a postfix regex. In this template, the prefix regex can be \"price:\", the target regex can be the above regex and the postfix regex can be empty.\n", + "\n", + "A template can match with multiple strings. If this is the case, we need a way to resolve the multiple matches. Instead of having just one template, we can use multiple templates (ordered by priority) and pick the match from the highest-priority template. We can also use other ways to pick. For the dollar example, we can pick the match closer to the numerical half of the highest match. For the text \"Price \\$90, special offer \\$70, shipping \\$5\" we would pick \"\\$70\" since it is closer to the half of the highest match (\"\\$90\")." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above is called *attribute-based* extraction, where we want to find attributes in the text (in the example, the price). A more sophisticated extraction system aims at dealing with multiple objects and the relations between them. When such a system reads the text \"\\$100\", it should determine not only the price but also which object has that price.\n", + "\n", + "Relation extraction systems can be built as a series of finite state automata. Each automaton receives as input text, performs transformations on the text and passes it on to the next automaton as input. An automata setup can consist of the following stages:\n", + "\n", + "1. **Tokenization**: Segments text into tokens (words, numbers and punctuation).\n", + "\n", + "2. 
**Complex-word Handling**: Handles complex words such as \"give up\", or even names like \"Smile Inc.\".\n", + "\n", + "3. **Basic-group Handling**: Handles noun and verb groups, segmenting the text into strings of verbs or nouns (for example, \"had to give up\").\n", + "\n", + "4. **Complex Phrase Handling**: Handles complex phrases using finite-state grammar rules. For example, \"Human+PlayedChess(\"with\" Human+)?\" can be one template/rule for capturing a relation of someone playing chess with others.\n", + "\n", + "5. **Structure Merging**: Merges the structures built in the previous steps." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finite-state, template-based information extraction models work well for restricted domains, but perform poorly as the domain becomes more general. There are, however, many other models to choose from, each with its own strengths and weaknesses. Some of them are the following:\n", + "\n", + "* **Probabilistic**: Using Hidden Markov Models, we can extract information in the form of prefix, target and postfix from a given text. HMMs have two advantages over templates: we can train them from data instead of designing elaborate templates, and a probabilistic approach behaves well even with noise. In a regex, if one character is off, we do not have a match, while a probabilistic approach degrades more smoothly.\n", + "\n", + "* **Conditional Random Fields**: One problem with HMMs is the assumption of state independence. CRFs are very similar to HMMs, but they do not share this constraint. In addition, CRFs make use of *feature functions*, which act as transition weights. For example, we can define a feature $f(x_{i}, e_{i})$ that equals 1 when the observation $e_{i}$ is \"run\" and the state $x_{i}$ is ATHLETE, and 0 otherwise. We can use multiple, overlapping features, and we can even use features for state transitions. 
Feature functions don't have to be binary (as in the example above); they can be real-valued as well. Also, a feature function can depend on any observation $e$, not just the current one. To bring it all together, we weight each transition by the sum of its features.\n", + "\n", + "* **Ontology Extraction**: This is a method for compiling information and facts in a general domain. A fact can be in the form of `NP is NP`, where `NP` denotes a noun phrase. For example, \"Rabbit is a mammal\"." + ] + }, { "cell_type": "markdown", "metadata": {}, From 8ea620d8026e29245c8c2dfb1973839aeb0a6bcf Mon Sep 17 00:00:00 2001 From: Anthony Marakis Date: Thu, 13 Jul 2017 02:00:51 +0300 Subject: [PATCH 2/2] fixing dollar signs --- text.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text.ipynb b/text.ipynb index 18865abf7..f1c61e175 100644 --- a/text.ipynb +++ b/text.ipynb @@ -574,14 +574,14 @@ "\n", "Where `+` means 1 or more occurences and `?` means at most 1 occurence. Usually a template consists of a prefix, a target and a postfix regex. In this template, the prefix regex can be \"price:\", the target regex can be the above regex and the postfix regex can be empty.\n", "\n", - "A template can match with multiple strings. If this is the case, we need a way to resolve the multiple matches. Instead of having just one template, we can use multiple templates (ordered by priority) and pick the match from the highest-priority template. We can also use other ways to pick. For the dollar example, we can pick the match closer to the numerical half of the highest match. For the text \"Price \\$90, special offer \\$70, shipping \\$5\" we would pick \"\\$70\" since it is closer to the half of the highest match (\"\\$90\")." + "A template can match with multiple strings. If this is the case, we need a way to resolve the multiple matches. 
Instead of having just one template, we can use multiple templates (ordered by priority) and pick the match from the highest-priority template. We can also use other ways to pick. For the dollar example, we can pick the match closer to the numerical half of the highest match. For the text \"Price $90, special offer $70, shipping $5\" we would pick \"$70\" since it is closer to the half of the highest match (\"$90\")." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The above is called *attribute-based* extraction, where we want to find attributes in the text (in the example, the price). A more sophisticated extraction system aims at dealing with multiple objects and the relations between them. When such a system reads the text \"\\$100\", it should determine not only the price but also which object has that price.\n", + "The above is called *attribute-based* extraction, where we want to find attributes in the text (in the example, the price). A more sophisticated extraction system aims at dealing with multiple objects and the relations between them. When such a system reads the text \"$100\", it should determine not only the price but also which object has that price.\n", "\n", "Relation extraction systems can be built as a series of finite state automata. Each automaton receives as input text, performs transformations on the text and passes it on to the next automaton as input. An automata setup can consist of the following stages:\n", "\n",
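
The price template described in the added cells can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the notebook or its accompanying module: `PRICE_RE`, `find_prices` and `pick_price` are hypothetical names, and the tie-breaking rule (first match wins) is an arbitrary choice. It uses the target regex quoted in the text and the "closest to half of the highest match" rule to resolve multiple matches.

```python
import re

# Target regex quoted in the notebook text: a dollar sign, one or more
# digits, and an optional two-digit decimal part.
PRICE_RE = re.compile(r"[$][0-9]+([.][0-9][0-9])?")


def find_prices(text):
    """Return every price-like match in `text` as a float, in order of appearance."""
    # m.group(0) is the full match, e.g. "$3.50"; drop the leading "$".
    return [float(m.group(0)[1:]) for m in PRICE_RE.finditer(text)]


def pick_price(text):
    """Pick the match closest to half of the highest match (hypothetical helper).

    Returns None when the text contains no price. On a tie, the match that
    appears first in the text is returned.
    """
    prices = find_prices(text)
    if not prices:
        return None
    half_of_highest = max(prices) / 2
    return min(prices, key=lambda p: abs(p - half_of_highest))


if __name__ == "__main__":
    # Half of the highest match ($90) is 45, and $70 is the closest to it.
    print(pick_price("Price $90, special offer $70, shipping $5"))  # 70.0
```

A prefix regex such as "price:" could be added in front of the target pattern in the same way; it is left out here because the sample sentence from the text does not use that exact prefix.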