
Text Analysis with GATE

Dr. Diana Maynard


University of Sheffield, UK

Search Solutions Tutorial 2017


London, UK
Outline of the Tutorial

• Introduction to NLP and Information Extraction


• Introduction to GATE
• Social media analysis
• Text Analysis for Semantic Search
• Semantic Search
• Semantic Annotation
• GATE MIMIR
• Example applications
• ExoPatent, TNA Web Archive, BBC News, EnviLOD,
Prospector, Political Futures Tracker, Brexit
1. NLP and Information Extraction
Oddly enough, people have successfully combined
information and toast...
The weather-forecasting toaster

l This weather-forecasting toaster, connected to a phone point, was designed in 2001 by a PhD student
l It accessed the Met Office website via a modem inside the toaster and translated the information into a 1, 2 or 3 for rain, cloud or sun
l The relevant symbol was then branded into the toast in the last few seconds of toasting

However, toast isn't actually a very good medium for getting your information…
Information overload or filter failure?
l We all know that there's masses of information available online these days, and it's constantly growing
l You often hear people talk about “information overload”
l But the real problem is not the amount of information, but our inability to filter it correctly
l Clay Shirky has an excellent talk on this topic: http://bit.ly/oWJTNZ
Big Data is not new!

Staff sorting 4M used tickets from #London Underground to analyse line use in 1939
NLP gives us a way to understand data

l sort the data to remove the rubbish from the interesting parts
l extract the relevant pieces of information
l link the extracted information to other sources of information
(e.g. from DBpedia)
l aggregate the information according to potential new
categories
l query the (aggregated) information
l visualise the results of the query
What is information extraction?

l The automatic discovery of new, previously unknown information, by automatically extracting information from different textual resources.
l A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further
l In IE, the goal is to discover previously unknown information, i.e. something that no one yet knows
IE is not Data Mining

Data mining is about using analytical techniques to find interesting patterns from large structured databases

Examples:
• using consumer purchasing patterns to predict which products to place close together on shelves in supermarkets
• analysing spending patterns on credit cards to detect fraudulent card use.
IE is not Web Search
l IE is also different from traditional web search or IR.
l In search, the user is typically looking for something that is already known and has been written by someone else.
l The problem lies in sifting through all the material that currently isn't relevant to your needs, in order to find the information that is.
l The solution often lies in better ways to ask the right question
l You can't ask Google to tell you about
– all the speeches made by Tony Blair about foot and mouth disease while he was Prime Minister
– all the documents in which a politician born in Sheffield is quoted as saying something about hospitals.

More about how we can do this will be revealed later...

GATE Mímir:
Answering Questions
Google Can't
Information Extraction Basics

l Entity recognition
is required for
l Relation extraction
which is required for
l Event recognition
which is required for
l Summarisation, answering questions, and other things
What is Entity Recognition?
l Entity Recognition is about recognising and classifying key Named Entities
and terms in the text

l A Named Entity is a Person, Location, Organisation, Date etc.

l A term is a key concept or phrase that is representative of the text

Mitt Romney, the favorite to win the Republican nomination for president in 2012
Person Term Date

l Entities and terms may be described in different ways but refer to the same
thing. We call this co-reference.

The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.

Organisation Location
What is Event Recognition?

l An event is an action or situation relevant to the domain, expressed by some relation between entities or terms.
l It is always grounded in time, e.g. the performance of a band, an election, the death of a person

Mitt Romney, the favorite to win the Republican nomination for president in 2012
Person Event Date
Why are Entities and Events Useful?

l They can help answer the “Big 5” journalism questions (who, what, when, where, why)
l They can be used to categorise the texts in different ways
– look at all texts about Trump
l They can be used as targets for opinion mining
– find out what people think about Trump
l When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text
– seeing how opinions about different American presidents have changed over the years
NLP components for text mining

l A text mining system is usually built up from a number of different NLP components
l First, you need some Information Extraction tools to do the donkey work of getting all the relevant pieces of information and facts.
l Then you need some tools to apply the reasoning to the facts, e.g. opinion mining, information aggregation, semantic technologies, dynamics analysis
l GATE is an example of a tool for text mining which allows you to combine all the necessary NLP components
Low-level linguistic processing components

l These are needed to pre-process the text ready for the more
complex IE tasks (NER, relations etc.)
Approaches to Information Extraction
Knowledge Engineering:
l rule based
l developed by experienced language engineers
l make use of human intuition
l easier to understand results
l development could be very time consuming
l some changes may be hard to accommodate

Learning Systems:
l use statistics or other machine learning
l developers do not need LE expertise
l requires large amounts of annotated training data
l some changes may require re-annotation of the entire training corpus
l can be unsupervised, but results are less good, and hard to adapt to a domain
GATE
What is GATE?

GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years

It includes:
• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
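
For illustration, here is a minimal sketch of running ANNIE over a document with GATE Embedded, GATE's Java API (the install path and .gapp file name vary between GATE versions, so treat them as assumptions):

import gate.*;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.net.URL;

public class AnnieSketch {
  public static void main(String[] args) throws Exception {
    Gate.setGateHome(new File("/path/to/gate"));  // assumption: local GATE install
    Gate.init();
    // Load the saved ANNIE application shipped with GATE (version-dependent path)
    SerialAnalyserController annie = (SerialAnalyserController)
        PersistenceManager.loadObjectFromFile(
            new File(Gate.getPluginsHome(), "ANNIE/ANNIE_with_defaults.gapp"));
    Corpus corpus = Factory.newCorpus("tutorial corpus");
    Document doc = Factory.newDocument(new URL("http://gate.ac.uk"));
    corpus.add(doc);
    annie.setCorpus(corpus);
    annie.execute();
    // Print the text of every Person annotation ANNIE found
    for (Annotation person : doc.getAnnotations().get("Person")) {
      System.out.println(gate.Utils.stringFor(doc, person));
    }
  }
}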


GATE components
l Language Resources (LRs), e.g. lexicons, corpora, ontologies
l Processing Resources (PRs), e.g. parsers, generators,
taggers
l Visual Resources (VRs), i.e. visualisation and editing
components
l Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic pre-processing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
What's in ANNIE?

• The ANNIE application contains a set of core PRs:


• Tokeniser
• Sentence Splitter
• POS tagger
• Gazetteers
• Named entity tagger (JAPE transducer)
• Orthomatcher (orthographic coreference)
• There are also other PRs available in the ANNIE plugin, which
are not used in the default application, but can be added if
necessary
• NP and VP chunker, morphological analysis
ANNIE Modules
ANNIE English Tokeniser

l Tokenisation is based on Unicode classes
l It chops a piece of text into words and spaces, and also has features for numbers, punctuation, symbols, capitalisation etc.
l It converts constructs involving apostrophes into more sensible combinations
l don’t → do + n't
l you've → you + 've
l Other tokenisers might have different definitions of Token
l Other languages might need different tokenisers, e.g. Chinese
Document with Tokens

Sentence Splitter

• The default splitter finds sentences based on Tokens
• Main problem is to find whether a full stop or line break denotes the end of a sentence or something else
• Also what to do with things like bullet points
• Uses a gazetteer of abbreviations etc. and a set of rules to find sentence delimiters
• There are a few sentence splitter variants available which work in slightly different ways
Document with Sentences

POS tagger

l Finds the part of speech for every word, e.g. noun, verb etc.
l Trained on WSJ, uses Penn Treebank tagset
l Other taggers may be trained on different corpora and have
different tags
l Default ruleset and lexicon can be modified manually (with a
little deciphering)
l Adds category feature to Token annotations
l Requires Tokeniser and Sentence Splitter to be run first
Morphological analyser
l Finds the root form of a word (singular noun, base form of verb)
l This is not the same as stemming (the root is a real word, a stem might not
be)
l e.g. educating: root = educate; stem = educat
l Does not perform derivational analysis (will not conflate verbs and nouns)
l Not an integral part of ANNIE, but can be found in the Tools plugin as an
“added extra”
l Flex based rules: can be modified by the user
l Generates “root” feature on existing Token annotations
l Requires tokenisation and POS tagging
Gazetteers

l Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, …)
l The lists are compiled into Finite State Machines
l Each gazetteer has an index file listing all the lists, plus features of each list that help to categorise the lists
l Each entry can also have specific features and values, e.g. a DBpedia link
l Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor
l Gazetteers generate Lookup annotations with relevant features corresponding to the list matched
l The ANNIE gazetteer has about 60,000 entries arranged in 80 lists
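
A minimal sketch of the two file types involved (the file names here are hypothetical, and the entry-feature separator is configurable):

lists.def — the index file, one line per list (listFileName:majorType:minorType):
    city.lst:location:city
    person_female.lst:person_first:female

city.lst — one entry per line; entries can optionally carry feature=value pairs after the separator:
    London
    Sheffield    dbpediaLink=http://dbpedia.org/resource/Sheffield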
Gazetteer editor

[Screenshot: the Gazetteer Editor, showing the definition file with its lists, and the entries for the selected list]
Named Entity Recognition
• Gazetteers can be used to find terms that suggest entities
• However, the entries can often be ambiguous
– “May Jones” vs “May 2010” vs “May I help you?”
– “Mr Parkinson” vs “Parkinson's Disease”
– “General Motors” vs. “General Smith”
• Handcrafted grammars are used to define patterns over the
Lookups and other annotations
• These patterns can help disambiguate, and they can combine different annotations, e.g. dates can be comprised of day + number + month (a minimal sketch follows)
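
For example, a minimal JAPE sketch of such a disambiguating pattern (the gazetteer majorType month is an illustrative name; the real ANNIE gazetteer uses its own list names):

Phase: Dates
Input: Token Lookup
Options: control = appelt

Rule: MonthYear
(
  {Lookup.majorType == month}   // e.g. “May”, from a gazetteer list of month names
  {Token.kind == number}        // e.g. “2010”
):date
-->
:date.Date = {rule = "MonthYear"}

Because this pattern fires on “May 2010” but not on “May Jones”, the person reading can be left to a separate rule over first-name lookups.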
Named Entity Grammars

• Hand-coded rules written in JAPE applied to annotations to identify NEs
• Phases run sequentially and constitute a cascade of FSTs over annotations
• Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates, addresses, money
• Basic NE grammars can be adapted for new applications, domains and languages
Example of a JAPE pattern-action rule

Rule: PersonName
(
  {Lookup.majorType == firstname}   // look for an entry in a gazetteer list of first names
  {Token.category == NNP}           // followed by a proper noun
):tag
-->
:tag.Person = {kind = fullName}     // create an annotation of type “Person”, giving it the feature kind with value “fullName”
JAPE rules are usually a bit more complex

l They can make use of any annotations already created, e.g.
– strings of text
– POS categories
– gazetteer lookup
– root forms of words
– document metadata, e.g. HTML tags
– annotations “in context”
l They can use regular expression operators, including negative constraints, “contains”, “within” etc.
l And the RHS of a rule can contain any Java code you like (see the sketch below)
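
A minimal sketch of such a rule (list and feature names are illustrative), combining gazetteer lookup with token orthography and using Java on the RHS:

Rule: TitledPerson
(
  {Lookup.majorType == title}        // e.g. “Mr”, “Dr”, from a gazetteer list of titles
  ({Token.orth == upperInitial})+    // one or more capitalised words
):person
-->
{
  // Java RHS: build the annotation by hand instead of using the shorthand
  gate.AnnotationSet match = (gate.AnnotationSet) bindings.get("person");
  gate.FeatureMap features = gate.Factory.newFeatureMap();
  features.put("rule", "TitledPerson");
  outputAS.add(match.firstNode(), match.lastNode(), "Person", features);
}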
Document with NEs
Using co-reference

l Different expressions may refer to the same entity
l The orthographic co-reference module (orthomatcher) matches proper names and their variants in a document
l [Mr Smith] and [John Smith] will be matched as the same person
l [International Business Machines Ltd.] will match [IBM]
Co-reference in GATE
Other NLP Toolkits

• UIMA
• OpenCalais
• Lingpipe
• OpenNLP
• Stanford Tools

l All integrated into GATE as plugins


3. Analysing Social Media
People don’t write “properly” on social media

l Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do
l Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%
l Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
l The last people I will listen2 about guns r those that know nothing about them&politicians who live in states w/strictest gun laws #cali #ny
Linguistic challenges of social media

• Language
• Problem: typically exhibits very different language style
• Solution: train specific language processing components
• Relevance
• Problem: topics and comments can rapidly diverge.
• Solution: train a classifier or use clustering techniques
• Lack of context
• Problem: hard to disambiguate entities
• Solution: data aggregation, metadata, entity linking
Challenges for NLP

l Noisy language: unusual punctuation, capitalisation, spelling, use of slang, sarcasm etc.
l Terse nature of microposts such as tweets
l Use of hashtags, @mentions etc causes problems for tokenisation #thisistricky
l Lack of context gives rise to ambiguities
l NER performs poorly on microposts, mainly because of linguistic pre-processing failure
– Performance of standard IE tools decreases from ~90% to ~40% when run on tweets rather than news articles
NLP Pre-Processing Pipeline

[Diagram: the pre-processing pipeline runs Language ID, Tokenisation and POS tagging over the text, followed by Named Entity Recognition and Linking; entity linking disambiguates each NE (e.g. “Professor Plum”) against DBpedia URIs such as dbpedia.org/resource/Michael_Jackson vs dbpedia.org/resource/Michael_Jackson_(writer)]
Pipelines for tweets
Errors have a cumulative effect: per-stage accuracies multiply into a much lower overall accuracy, so good performance is important at each stage.


Language Identification
Newswire:
The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse.

Twitter:
LADY GAGA IS BETTER THE 5th TIME OH BABY(:

je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it
[Dutch: “you're not Jacques Cousteau who has discovered a new species; it's obvious, they're covering their faces. Get over it”]

I'm at 地铁望京站 Subway Wangjing (Beijing) http://t.co/KxHzYm00
Tokenisation is only 80% accurate on tweets
Improper grammar, e.g. apostrophe usage:
doesn't → does n't
doesnt → doesn’t

Smileys and emoticons: loss of information (e.g. sentiment)
I <3 you
This piece ;,,( so emotional

Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *

Words run together / skipped spaces
I wonde rif Tsubasa is okay..
We need tools for hashtag analysis
l Hashtags need unravelling and disambiguating – the same character string can often be segmented in more than one way (e.g. #nowthatcherisdead reads as either “now Thatcher is dead” or “now that Cher is dead”):
l #nowthatcherisdead

l #powergenitalia

l #lesbocages

l #molestationnursery

l #teacherstalking

l #therapist
Test your social media skills

What do these hashtags mean?

l #kktny
l #fomo
l #jomo
l #ootd
l #wcw
Hashtag Hijacking

Hashtags are not always used in the same way, or for their original purpose:

Have sex to save the planet!
Tweet Normalisation
l “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
l OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
l For some components to work well (POS tagger, parser), we need normalisation
l BUT uppercasing and repetition often convey strong sentiment
l Other forms of “misspelling” might indicate information about the author
l First challenge: separate out-of-vocabulary and in-vocabulary words
l Second challenge: fix mis-spelled IV words, e.g. by Levenshtein edit distance (see the sketch below)
l The ZOMG phenomenon
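
A minimal sketch of Levenshtein edit distance in Java (the standard dynamic-programming formulation); picking the in-vocabulary word with the smallest distance to an OOV token is then one simple normalisation strategy:

public class EditDistance {
  // Classic dynamic-programming edit distance between two strings
  public static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                    d[i][j - 1] + 1),   // insertion
                           d[i - 1][j - 1] + cost);     // substitution
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    System.out.println(levenshtein("tomoro", "tomorrow"));  // prints 2
  }
}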
Part-of-speech tagging
• Similar issues as for normalisation – we don’t have big datasets
to train on
• Label unlabelled data with multiple taggers and accept tweets
where tagger votes never conflict
• Model words using Brown clustering and word representations
(Turian 2010)

2m, 2ma, 2mar, 2mara, 2maro, 2marrow, 2mor, 2mora, 2moro, 2morow,
2morr, 2morro, 2morrow, 2moz, 2mr, 2mro, 2mrrw, 2mrw, 2mw, tmmrw, tmo,
tmoro, tmorrow, tmoz, tmr, tmro, tmrow, tmrrow, tmrrw, tmrw, tmrww, tmw,
tomaro, tomarow, tomarro, tomarrow, tomm, tommarow, tommarrow,
tommoro, tommorow, tommorrow, tommorw, tommrow, tomo, tomolo,
tomoro, tomorow, tomorro, tomorrw, tomoz, tomrw, tomz
Named entities: lack of context causes ambiguity

Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!

Getting the NEs right is crucial

Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
How do we deal with all these problematic tweets?

l Typical NLP pipeline means that degraded performance has a knock-on effect along the chain
l Short sentences confuse language identification tools
l Linguistic processing tools have to be adapted to the domain:
– Retraining individual components on large volumes of data
– Adaptation of techniques from e.g. SMS analysis
– Development of new Twitter-specific tools (e.g. GATE's TwitIE)
l But... lack of standards, easily accessible data, common evaluation etc. are holding back development
4. Text Analytics for Semantic Search
Semantic Queries in Google
Searching for Things, Not Strings

• 500 million entities that Google “knows” about
• Used to provide more accurate search results
• Summaries of information about the entity being searched

http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html
Facebook Graph Search
Semantic Enrichment
l Textual mentions aren't actually that useful in isolation
– knowing that something is a “Person” isn't very helpful
– knowing which real-life Person it refers to can be very useful
l Disambiguating mentions against an ontology provides extra context
l This is where semantic enrichment comes in
l The end product is a set of textual mentions linked to an ontology, otherwise known as semantic annotations
l Annotations on their own can be useful but they can also
– be used to generate corpus level statistics
– be used for further ontology population
– form the basis of summaries
– be indexed to provide semantic search
Automatic Semantic Enrichment

• Use Text Mining, e.g.
• Information Extraction – recognise names of people, organisations, locations, dates, references, etc.
• Term recognition – identify domain-specific terms
• Automatically extend article metadata to improve search quality
Mining medical records

l Medical records contain a large amount of unstructured text
– letters between hospitals and GPs
– discharge summaries
l These documents might contain information not recorded elsewhere
– it turns out doctors don't like forms!
– often information-specific fields are ignored, with everything put in the free text area
Medical Records at SLAM

l The NIHR Biomedical Research Centre at the South London and Maudsley Hospital is using our text mining tools in a number of their studies
l Developed applications to extract:
– the results of mental state tests, and the date the test was administered
– education level (high school, university, etc.)
– smoking status
– medication history
l They have even had promising results predicting suicides
[Diagram: 30% of medical information is structured (e.g. lab-result tables of Hb, WBC, RBC and Plt values) – the portion of the medical record that is easy to search and manipulate with computers. 70% of medical information is in free text (e.g. “The peritoneum contains deposits of tumour... the tumour cells are negative for desmin.”) – ambiguous, nuanced, complex, and hard for computers to understand. GATE structures the free-text portion of the medical record and makes it available for research, management, and clinical care, and can link the text to external knowledge, terminologies and coding schemes for intelligent search and analysis. As used by several UK hospitals and world-leading medical record software vendors.]
Cancer Research: can GATE cure cancer?

l Genome Wide Association Studies (GWAS) aim to investigate genetic variants across the whole genome
– With enough cases and controls, this allows them to state that a given SNP (Single Nucleotide Polymorphism) is related to a given disease.
– A single study can be very expensive in both time and money to collect the required samples.
l Can we reduce the costs by analysing published articles to generate prior probabilities for each SNP?
Is GATE better than baking soda?
l In conjunction with IARC (the International Agency for Research on Cancer, part of the WHO) we developed a text analysis approach to mine PubMed
l We showed retrospectively that our approach would have saved over a year's worth of work and more than 1.5 million Euros
l We completed a new study which found a new cause for oral cancer
– Oral cancer is rare enough that traditional methods would have failed to find enough cases to make the study feasible
Government Web Archive

l We developed a semantic annotation application to process every crawled page in the archive.
l Entities annotated included: people, companies, locations, government departments, ministerial positions, social documents, dates, money....
l Where possible, annotations were linked to an ontology which
– was based on DBpedia
– was extended with UK government-specific concepts
– included the modelling of the evolution of government
l Annotations were indexed to allow for complex semantic querying of the collection
Political Futures Tracker

Where in the UK did Conservative MPs tweet more about the economy?
5.1 Semantic Annotation
Why ontologies for semantic search?
l Semantic annotation: rather than just annotating the word “Cambridge” as
a location, link it to an ontology instance
l Differentiate between Cambridge, UK and Cambridge, Mass.
l Semantic search via reasoning
l So we can infer that this document mentions a city in Europe.
l Ontologies tell us that this particular Cambridge is part of the country
called the UK, which is part of the continent Europe.
l Knowledge source
l If I want to annotate strikes in baseball reports, the ontology will tell me
that a strike involves a batter who is a person
l In the text “BA went on strike”, using the knowledge that BA is a company
and not a person, the IE system can conclude that this is not the kind of
strike it is interested in
More semantic search examples

Q: {ScalarValue}{MeasurementUnit} ->
A: “12 cm”, “190 g”, “two hours”
---
Q: {Reference} ->
A: JP-A-60-180889
A: Kalderon et al. (1984) Cell 39:499-509
Semantic Annotation
What is an Ontology?

l Set of concepts (instances and classes)
l Relationships between them (is-a, part-of, located-in)
l Multiple inheritance:
– Classes can have more than one parent
– Instances can have more than one class
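
A minimal illustration in Turtle notation (hypothetical triples for the Cambridge example above, not actual DBpedia data):

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbp-ont: <http://dbpedia.org/ontology/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# “Cambridge” as an instance of a class, with a part-of style relation to its country
dbpedia:Cambridge  rdf:type  dbp-ont:City ;
                   dbp-ont:country  dbpedia:United_Kingdom .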
DBpedia

l Machine readable knowledge on various entities and topics, including:
– 410,000 places/locations
– 310,000 persons
– 140,000 organisations
l For each entity we have:
– entity name variants (e.g. IBM, Int. Business Machines)
– a textual abstract
– reference(s) to corresponding Wikipedia page(s)
– entity-specific properties (e.g. latitude and longitude for places)
Example from DBpedia

[Screenshot: a DBpedia entity page, showing latitude & longitude properties and links to GeoNames and Freebase]

GeoNames
l 2.8 million populated places
– 5.5 million alternate names
l Knowledge about NUTS country sub-divisions
– use for enrichment of recognised locations with the implied
higher-level country sub-divisions
l However, the sheer size of GeoNames creates a lot of ambiguity during semantic enrichment
l We use it as an additional knowledge source, but not as the primary one (DBpedia plays that role)
5.2 Semantic Annotation In GATE
Information Extraction for the Semantic Web

l Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.
l For the Semantic Web, we need information in a hierarchical structure
l The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
l Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
Traditional NE Recognition

John [PER] lives in London [LOC]. He works there for Polar Bear Design [ORG].

Co-reference

same_as: John ↔ He
John [PER] lives in London [LOC]. He works there for Polar Bear Design [ORG].

Relations

live_in: John → London

Relations (2)

employee_of: John → Polar Bear Design

Relations (3)

based_in: Polar Bear Design → London

Richer NE Tagging

l Attachment of instances in the text to concepts in the domain ontology
l Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
Ontology-based IE

John lives in London. He works there for Polar Bear Design.


Ontology-based IE (2)

John lives in London. He works there for Polar Bear Design.


Automatic Semantic Annotation in ENVIA

l Locations (linked to DBpedia and GeoNames)
– Annotate the place name itself (e.g. Norwich) with the corresponding DBpedia and GeoNames URIs
– Also use knowledge of the implied reference to the level 1, 2, and 3 sub-divisions from the Nomenclature of Territorial Units for Statistics (NUTS).
– For Norwich, these are East of England (UKH – level 1), East Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3).
– Similarly use knowledge to retrieve nearby places
“South Gloucestershire” Example
Semantic Annotation (2)
l Organisations (linked to DBpedia)
– Names of companies, government organisations,
committees, agencies, universities, and other organisations
l Dates
– Absolute (e.g. 31/03/2012) and relative (yesterday)
l Measurements and Percentages
– e.g. 8,596 km2 , 1 km, one fifth, 10%
Semantic Search: An Overview
GATE Mímir

• can be used to index and search over text, annotations, and semantic metadata (concepts and instances)
• allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations
• is open source
What can GATE Mímir do that Google can't?

Show me:
• all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
• all abstracts written in French on patent documents from the last
6 months which mention any form of the word “transistor” in the
English abstract
• the names of the patent inventors of those abstracts
• all documents mentioning steel industries in the UK, along with
their location
Search News Articles for
Politicians born in Sheffield

http://demos.gate.ac.uk/mimir/gpd/search/gus
Easily Create Your Own
Custom GATE Mímir Interfaces

http://demos.gate.ac.uk/pin/
MIMIR: Searching Text Mining Results

• Searching and managing text annotations, semantic information, and full text documents in one search engine
• Queries over annotation graphs
• Regular expressions, Kleene operators
• Designed to be integrated as a web service in custom end-user systems with bespoke interfaces
• Demos at http://services.gate.ac.uk/mimir/
Scaling Up
l We annotated 1.08 million web pages using a GATE language analysis pipeline.
– Documents were crawled using Heritrix, with a total content size of 57 GiB, or 6.6 billion plain-text characters.
– The indexing server has 2 Intel Xeon 2.8GHz CPUs and 11 GB of RAM, and runs 64-bit Ubuntu Linux. The indexing process took 94 hours.
l We also indexed 150 million similar web pages, using two hundred Amazon EC2 Large Instances running for a week to produce a federated index
l Mímir runs on GateCloud.net, so it is easy to scale up
• Search for the string: Harriet Harman
• Search for the string: Harriet Harman says
• Search with morphological variants: Harriet Harman root:say
• Replace strings with NEs: {Person} root:say
• {Person} AND root:say – 11803 hits
• {Person} [0..5] root:say – 5495 hits
Patent Annotation: Data Model

An Example Text

Hands-On with patent data
http://demos.gate.ac.uk/mimir/patents/search/index

Text. Matches plain text.
Example: nanomaterial

Linguistic variations of text.
Example: (root:nanomaterial | root:nanoparticle)

Annotation. Matches semantic annotations.
Syntax: {Type feature1=value1 feature2=value2...}
Example: {Abstract lang="DE"}

Sequence Query. A sequence of other queries.
Syntax: Query1 [n..m] Query2...
Example: from {Measurement} [1..5] {Measurement}
Inclusion Queries

IN Query. Hits of one query only if inside another.
Syntax: Query1 IN Query2
Example: (root:nanomaterial | root:nanoparticle) IN {Abstract}
Finds the number of times these words are mentioned in patent abstracts (as well as links to the actual documents)

OVER Query. Hits of a query, only if overlapping hits of another.
Syntax: Query1 OVER Query2
Example: {Abstract} OVER (root:nanomaterial | root:nanoparticle)
Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
Date restrictions
(
  {Abstract lang="EN"} OVER
  (root:nanomaterial | root:nanoparticle)
)
IN
{PatentDocument date > 20050000}

(dates are in YYYYMMDD format)
Find references to literature or patents in the prior
art or background sections, which contain
nanomaterial/nanoparticle
({Reference type="Literature"}
|
{Reference type="Patent"}
) IN
({Section type="PriorArt"}
|
{Section type="BackgroundArt"}
)
OVER
(root:nanomaterial | root:nanoparticle)
Queries Using External Knowledge

{Measurement spec="1 to 100 volts"}

l Uses GNU Units (http://www.gnu.org/software/units/) to convert measurements and normalise them to SI units

{Measurement spec="1 to 100 kg m^2 / A s^3"}

l Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3 volts

{Measurement spec="1 to 100 m / s"}

l Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000 cm/sec
Searching LOD with SPARQL

• SQL-like query language for RDF data
• Simple protocol for querying remote databases over HTTP
• Query types:
• select: projections of variables and expressions
• construct: create triples (or graphs) based on query results
• ask: whether a query returns results (result is true/false) – see the sketch below
• describe: describes resources in the graph
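
A minimal ASK sketch in the same style as the SELECT example on the next slide (illustrative; class and property names follow the DBpedia ontology):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>

# true if the data contains at least one company founded in Sheffield
ASK {
  ?company rdf:type dbp-ont:Company ;
           dbp-ont:foundationPlace dbpedia:Sheffield .
}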
SPARQL Example

# Software companies founded in the US

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX umbel-sc: <http://umbel.org/umbel/sc/>

SELECT DISTINCT ?Company ?Location
WHERE {
  ?Company rdf:type dbp-ont:Company ;
           dbp-ont:industry dbpedia:Computer_software ;
           dbp-ont:foundationPlace ?Location .
  ?Location geo-ont:parentFeature dbpedia:United_States .
}
SPARQL Results

Try at: http://factforge.net/sparql

Documents mentioning Persons born in Sheffield:

{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield> }"}

The BBC web document does not mention Sheffield at all: http://www.bbc.co.uk/news/uk-politics-11494915

The relevant text snippet is: [screenshot]

Try these on the BBC News demo: http://services.gate.ac.uk/mimir/gpd/search/index

Gordon Brown [0..3] root:say

{Person} [0..3] root:say

{Person inst="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE {
  ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> .
  ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }"} [0..3] root:say
User Interfaces for SPARQL-based Semantic Search

• SPARQL-based semantic searches, tapping into LOD resources, are extremely powerful
• However, they are impossible for the vast majority of users to write
• User interfaces for SPARQL-based semantic search:
• Faceted searches (see ExoPatent next)
• Form-based searches (see EnviLOD, PIN)
• Text-based searches (natural language interfaces for querying ontologies), e.g. FREyA
Faceted Search: ExoPatent Example

l Use semantic information to expose linkages between documents, based on the intersecting relationships between various sets of data from:
– the FDA Orange Book (23,000 patented drugs)
– the Unified Medical Language System (UMLS) – a database of 370,000 medical terms
– patent bibliographic information
l Search for diseases, drug names, body parts, references to literature and other patents, numeric values, ranges
l Demo uses a small set of patents (40,000)
ExoPatent: Faceted Search
Annotated Patent Document
Find all applicants who filed patents related to
mitochondria, as well as drug names and active
ingredients
Semantic Search over Content and Annotations

http://demos.gate.ac.uk/trendminer/envilod
EnviLOD Semantic Search UI
Example Results
…and the underlying SPARQL Query

{Sem_Location dbpediaSparql="select distinct ?inst where {
  {{ ?inst <http://dbpedia.org/property/north> ?loc } UNION
   { ?inst <http://dbpedia.org/property/east> ?loc } UNION
   { ?inst <http://dbpedia.org/property/west> ?loc } UNION
   { ?inst <http://dbpedia.org/property/south> ?loc } UNION
   { ?inst <http://dbpedia.org/property/northeast> ?loc } UNION
   { ?inst <http://dbpedia.org/property/northwest> ?loc } UNION
   { ?inst <http://dbpedia.org/property/southeast> ?loc } UNION
   { ?inst <http://dbpedia.org/property/southwest> ?loc }
   FILTER(REGEX(STR(?loc), \"Sheffield\", \"i\"))} }"} AND (root:"flood")

Ongoing work: use GeoSPARQL instead, to be able to specify distances and reason with the richer information in GeoNames
Flooding in Oxford
We don't just have to look at politicians saying and measuring things
l If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
l Or positive/negative comments about certain
people/topics/things/companies
l In the Decarbonet project, we looked at people's opinions
about topics relating to climate change, e.g. fracking
l We could index on the sentiment annotations too
l Other people are using the combination of opinion mining and
MIMIR to look at e.g. customer feedback about products /
companies on a huge corpus
Explicitly Choosing The Search Classes
Choosing A Specific Instance
What diseases are in these documents?
What Pathogens?
Disease vs Disease Co-occurrences
Diseases vs Pathogens
Some cool applications with GATE
Back to Trump (there’s no escape)
Real-time Opinion Monitoring

[Chart: tweets vs replies]
Climate change, ISIS and Trump

Querying election data with MIMIR

• Dataset: every tweet by MP / Candidate / Party, plus all replies/retweets
• Find all tweets where a Conservative MP talked about the economy
Parties / themes co-occurrence

[Chart: co-occurrence of themes (Media and communications, Community and society, Borders and Immigration, Tax and revenue, Public health, Employment, Scotland, UK economy, Europe, NHS) against party groups (Labour Party MPs and candidates, Conservative Party MPs and candidates, UKIP candidates, SNP candidates, Green Party candidates, Liberal Democrats candidates, other MPs)]
Hate speech towards MPs on Twitter
[Charts: 2015 vs 2017 comparison]

http://greenwoodma.servehttp.com/data/buzzfeed/sunburst.html
Environmental behaviour analysis
• Based on the assumption that users in different behavioural stages communicate differently (different emotions, directives, etc.)

Pajarito @lindopajarito · 2h
Our building needs 40% of all energy consumed in Switzerland! :(
→ Desirability: negative sentiment (expressing personal frustration – anger/sadness)

DJPajarito @DJPajaritoGenial · 12h
I'm so proud when I remember to save energy and I know however small it's helping.
→ Buzz: positive sentiment (happiness/joy), I/we + present tense

HotelPajarito @HotelPajarito · 18h
Join us today to switch off a light for EH! :)
→ Invitation: positive sentiment (happy) + use of vocatives

Recognition of environmental terms in Decarbonet
A positive tweet
A negative tweet
A sarcastic tweet
Term recognition and sentiment analysis in Decarbonet

l http://services.gate.ac.uk/decarbonet/sentiment
KNOWMAK:
Mapping the state of European research

• Build an ontology to map between user queries (who's doing what, where?) and databases of projects, patents, publications
• Build indicators on the data (how many patents by which actors in which country?)
• Build visualisation tools to show the results
• Searching and mapping are done via keywords mapped to topics (ontology classes) based around Key Enabling Technologies (KETs) and Societal Grand Challenges (SGCs)
• This deals with the problem that the terms used can vary widely between users and between document types
• Ontology search demo
Topic visualisation

http://www.dcs.shef.ac.uk//~adam/stuff/knowmak/visualization/
Summary
l Text mining is a very useful prerequisite for doing all kinds of more interesting things, like semantic search
l Semantic web search allows you to do much more interesting kinds of search than standard text-based search
l Text mining is hard, so it won't always be correct
l This is especially true on lower-quality text such as social media
l We run an annual GATE training course in June in Sheffield, where you can spend a whole week learning all this and more!
Acknowledgements
This work is supported by:
• the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework and H2020 Programmes for R&D
• DecarboNet (610829) http://www.decarbonet.eu
• SoBigData (654024) http://www.sobigdata.eu
• COMRADES (687847) http://www.comrades-project.eu
• KNOWMAK (726992) http://knowmak.eu
• Nesta http://nesta.org.uk
What to ask Santa for this Christmas

• Great basic intro to NLP
• Uses GATE as examples
• Discusses other tools and the differences between them
• Chapters on semantic search, social media analysis, sentiment analysis, cool applications, and more
Key Publications
l K. Bontcheva, V. Tablan and H. Cunningham. Semantic Search over Documents and Ontologies. Bridging Between Information Retrieval and Databases, 31-53, 2014
l D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017
l V. Tablan, I. Roberts, H. Cunningham and K. Bontcheva. GATECloud.net: a Platform for Large-Scale, Open-Source Text Processing on the Cloud. Philosophical Transactions of the Royal Society A, 371(1983), 2013
l More papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html
Some useful links
• GATE: http://gate.ac.uk
• GateCloud: https://cloud.gate.ac.uk
• Annual GATE training course in June in Sheffield:
https://gate.ac.uk/family/training.html
• Download GATE: http://gate.ac.uk/download
• GATE blog posts on social media analysis:
http://gate4ugc.blogspot.co.uk/
• UK elections monitor http://gate.ac.uk/projects/pft
• Blog post on abuse of MPs:
• COMRADES project on disasters: http://gate.ac.uk/projects/comrades
• KNOWMAK project and demos: http://gate.ac.uk/projects/knowmak
• SoBigData project: http://sobigdata.eu
Has your head exploded yet?

Questions?
