
Text Analysis with GATE

Dr. Diana Maynard


University of Sheffield, UK

Search Solutions Tutorial 2017


London, UK
Outline of the Tutorial

• Introduction to NLP and Information Extraction


• Introduction to GATE
• Social media analysis
• Text Analysis for Semantic Search
• Semantic Search
• Semantic Annotation
• GATE MIMIR
• Example applications
• ExoPatent, TNA Web Archive, BBC News, EnviLOD,
Prospector, Political Futures Tracker, Brexit
1. NLP and Information Extraction
Oddly enough, people have successfully combined
information and toast...
The weather-forecasting toaster

l This weather-forecasting toaster, connected to a phone point, was designed in 2001 by a PhD student
l It accessed the Met Office website via a modem inside the toaster and translated the information into a 1, 2 or 3 for rain, cloud or sun
l The relevant symbol was then branded into the toast in the last few seconds of toasting

However, toast isn't actually a very good medium for getting your information…
Information overload or filter failure?
l We all know that there's masses of information available online these days, and it's constantly growing
l You often hear people talk about “information overload”
l But the real problem is not the amount of information, but our inability to filter it correctly
l Clay Shirky has an excellent talk on this topic: http://bit.ly/oWJTNZ
Big Data is not new!

Staff sorting 4M used tickets from #London Underground to analyse line use in 1939
NLP gives us a way to understand data

l sort the data to remove the rubbish from the interesting parts
l extract the relevant pieces of information
l link the extracted information to other sources of information
(e.g. from DBpedia)
l aggregate the information according to potential new
categories
l query the (aggregated) information
l visualise the results of the query
What is information extraction?

l The automatic discovery of new, previously unknown information, by automatically extracting information from different textual resources.
l A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further
l In IE, the goal is to discover previously unknown information, i.e. something that no one yet knows
IE is not Data Mining

Data mining is about using analytical techniques to find interesting patterns from large structured databases

Examples:
• using consumer purchasing patterns to predict which products to place close together on shelves in supermarkets
• analysing spending patterns on credit cards to detect fraudulent card use.
IE is not Web Search
l IE is also different from traditional web search or IR.
l In search, the user is typically looking for something that is already known and has been written by someone else.
l The problem lies in sifting through all the material that currently isn't relevant to your needs, in order to find the information that is.
l The solution often lies in better ways to ask the right question
l You can't ask Google to tell you about
– all the speeches made by Tony Blair about foot and mouth disease while he was Prime Minister
– all the documents in which a politician born in Sheffield is quoted as saying something about hospitals.

More about how we can do this will be revealed later...

GATE Mímir:
Answering Questions
Google Can't
Information Extraction Basics

l Entity recognition
is required for
l Relation extraction
which is required for
l Event recognition
which is required for
l Summarisation, answering questions, and other things
What is Entity Recognition?
l Entity Recognition is about recognising and classifying key Named Entities
and terms in the text

l A Named Entity is a Person, Location, Organisation, Date etc.

l A term is a key concept or phrase that is representative of the text

Mitt Romney, the favorite to win the Republican nomination for president in 2012
Person Term Date

l Entities and terms may be described in different ways but refer to the same
thing. We call this co-reference.

The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.

Organisation Location
What is Event Recognition?

l An event is an action or situation relevant to the domain, expressed by some relation between entities or terms.
l It is always grounded in time, e.g. the performance of a band, an election, the death of a person

Mitt Romney, the favorite to win the Republican nomination for president in 2012
Person Event Date
Why are Entities and Events Useful?

l They can help answer the “Big 5” journalism questions (who, what, when, where, why)
l They can be used to categorise the texts in different ways
– look at all texts about Trump
l They can be used as targets for opinion mining
– find out what people think about Trump
l When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text
– seeing how opinions about different American presidents have changed over the years
NLP components for text mining

l A text mining system is usually built up from a number of different NLP components
l First, you need some Information Extraction tools to do the donkey work of getting all the relevant pieces of information and facts.
l Then you need some tools to apply the reasoning to the facts, e.g. opinion mining, information aggregation, semantic technologies, dynamics analysis
l GATE is an example of a tool for text mining which allows you to combine all the necessary NLP components
Low-level linguistic processing components

l These are needed to pre-process the text ready for the more
complex IE tasks (NER, relations etc.)
Approaches to Information Extraction
Knowledge Engineering:
l rule based
l developed by experienced language engineers
l make use of human intuition
l easier to understand results
l development could be very time consuming
l some changes may be hard to accommodate

Learning Systems:
l use statistics or other machine learning
l developers do not need LE expertise
l requires large amounts of annotated training data
l some changes may require re-annotation of the entire training corpus
l can be unsupervised, but results are less good, and hard to adapt to a domain
GATE
What is GATE?

GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years

It includes:
• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
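
For illustration, here is a minimal sketch of running ANNIE over a document with GATE Embedded, GATE's Java API (the install path and .gapp file name vary between GATE versions, so treat them as assumptions):

import gate.*;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.net.URL;

public class AnnieSketch {
  public static void main(String[] args) throws Exception {
    Gate.setGateHome(new File("/path/to/gate"));  // assumption: local GATE install
    Gate.init();
    // Load the saved ANNIE application shipped with GATE (version-dependent path)
    SerialAnalyserController annie = (SerialAnalyserController)
        PersistenceManager.loadObjectFromFile(
            new File(Gate.getPluginsHome(), "ANNIE/ANNIE_with_defaults.gapp"));
    Corpus corpus = Factory.newCorpus("tutorial corpus");
    Document doc = Factory.newDocument(new URL("http://gate.ac.uk"));
    corpus.add(doc);
    annie.setCorpus(corpus);
    annie.execute();
    // Print the text of every Person annotation ANNIE found
    for (Annotation person : doc.getAnnotations().get("Person")) {
      System.out.println(gate.Utils.stringFor(doc, person));
    }
  }
}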


GATE components
l Language Resources (LRs), e.g. lexicons, corpora, ontologies
l Processing Resources (PRs), e.g. parsers, generators,
taggers
l Visual Resources (VRs), i.e. visualisation and editing
components
l Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic pre-processing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
What's in ANNIE?

• The ANNIE application contains a set of core PRs:


• Tokeniser
• Sentence Splitter
• POS tagger
• Gazetteers
• Named entity tagger (JAPE transducer)
• Orthomatcher (orthographic coreference)
• There are also other PRs available in the ANNIE plugin, which
are not used in the default application, but can be added if
necessary
• NP and VP chunker, morphological analysis
ANNIE Modules
ANNIE English Tokeniser

l Tokenisation is based on Unicode classes
l It chops a piece of text into words and spaces, and also has features for numbers, punctuation, symbols, capitalisation etc.
l It converts constructs involving apostrophes into more sensible combinations
l don’t → do + n't
l you've → you + 've
l Other tokenisers might have different definitions of Token
l Other languages might need different tokenisers, e.g. Chinese
Document with Tokens

Sentence Splitter

• The default splitter finds sentences based on Tokens
• Main problem is to find whether a full stop or line break denotes the end of a sentence or something else
• Also what to do with things like bullet points
• Uses a gazetteer of abbreviations etc. and a set of rules to find sentence delimiters
• There are a few sentence splitter variants available which work in slightly different ways
Document with Sentences

POS tagger

l Finds the part of speech for every word, e.g. noun, verb etc.
l Trained on WSJ, uses Penn Treebank tagset
l Other taggers may be trained on different corpora and have
different tags
l Default ruleset and lexicon can be modified manually (with a
little deciphering)
l Adds category feature to Token annotations
l Requires Tokeniser and Sentence Splitter to be run first
Morphological analyser
l Finds the root form of a word (singular noun, base form of verb)
l This is not the same as stemming (the root is a real word, a stem might not
be)
l e.g. educating: root = educate; stem = educat
l Does not perform derivational analysis (will not conflate verbs and nouns)
l Not an integral part of ANNIE, but can be found in the Tools plugin as an
“added extra”
l Flex based rules: can be modified by the user
l Generates “root” feature on existing Token annotations
l Requires tokenisation and POS tagging
Gazetteers

l Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, …)
l The lists are compiled into Finite State Machines
l Each gazetteer has an index file listing all the lists, plus features of each list that help to categorise the lists
l Each entry can also have specific features and values, e.g. a DBpedia link
l Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor
l Gazetteers generate Lookup annotations with relevant features corresponding to the list matched
l The ANNIE gazetteer has about 60,000 entries arranged in 80 lists
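
A minimal sketch of the two file types involved (the file names here are hypothetical, and the entry-feature separator is configurable):

lists.def — the index file, one line per list (listFileName:majorType:minorType):
    city.lst:location:city
    person_female.lst:person_first:female

city.lst — one entry per line; entries can optionally carry feature=value pairs after the separator:
    London
    Sheffield    dbpediaLink=http://dbpedia.org/resource/Sheffield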
Gazetteer editor

[Screenshot: the Gazetteer Editor, showing the definition file with its lists, and the entries for the selected list]
Named Entity Recognition
• Gazetteers can be used to find terms that suggest entities
• However, the entries can often be ambiguous
– “May Jones” vs “May 2010” vs “May I help you?”
– “Mr Parkinson” vs “Parkinson's Disease”
– “General Motors” vs. “General Smith”
• Handcrafted grammars are used to define patterns over the
Lookups and other annotations
• These patterns can help disambiguate, and they can combine different annotations, e.g. dates can be comprised of day + number + month (a minimal sketch follows)
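
For example, a minimal JAPE sketch of such a disambiguating pattern (the gazetteer majorType month is an illustrative name; the real ANNIE gazetteer uses its own list names):

Phase: Dates
Input: Token Lookup
Options: control = appelt

Rule: MonthYear
(
  {Lookup.majorType == month}   // e.g. “May”, from a gazetteer list of month names
  {Token.kind == number}        // e.g. “2010”
):date
-->
:date.Date = {rule = "MonthYear"}

Because this pattern fires on “May 2010” but not on “May Jones”, the person reading can be left to a separate rule over first-name lookups.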
Named Entity Grammars

• Hand-coded rules written in JAPE applied to annotations to identify NEs
• Phases run sequentially and constitute a cascade of FSTs over annotations
• Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates, addresses, money
• Basic NE grammars can be adapted for new applications, domains and languages
Example of a JAPE pattern-action rule

Rule: PersonName
(
  {Lookup.majorType == firstname}   // look for an entry in a gazetteer list of first names
  {Token.category == NNP}           // followed by a proper noun
):tag
-->
:tag.Person = {kind = fullName}     // create an annotation of type “Person”, giving it the feature kind with value “fullName”
JAPE rules are usually a bit more complex

l They can make use of any annotations already created, e.g.
– strings of text
– POS categories
– gazetteer lookup
– root forms of words
– document metadata, e.g. HTML tags
– annotations “in context”
l They can use regular expression operators, including negative constraints, “contains”, “within” etc.
l And the RHS of a rule can contain any Java code you like (see the sketch below)
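
A minimal sketch of such a rule (list and feature names are illustrative), combining gazetteer lookup with token orthography and using Java on the RHS:

Rule: TitledPerson
(
  {Lookup.majorType == title}        // e.g. “Mr”, “Dr”, from a gazetteer list of titles
  ({Token.orth == upperInitial})+    // one or more capitalised words
):person
-->
{
  // Java RHS: build the annotation by hand instead of using the shorthand
  gate.AnnotationSet match = (gate.AnnotationSet) bindings.get("person");
  gate.FeatureMap features = gate.Factory.newFeatureMap();
  features.put("rule", "TitledPerson");
  outputAS.add(match.firstNode(), match.lastNode(), "Person", features);
}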
Document with NEs
Using co-reference

l Different expressions may refer to the same entity
l The orthographic co-reference module (orthomatcher) matches proper names and their variants in a document
l [Mr Smith] and [John Smith] will be matched as the same person
l [International Business Machines Ltd.] will match [IBM]
Co-reference in GATE
Other NLP Toolkits

• UIMA
• OpenCalais
• Lingpipe
• OpenNLP
• Stanford Tools

l All integrated into GATE as plugins


3. Analysing Social Media
People don’t write “properly” on social media

l Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do
l Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%
l Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
l The last people I will listen2 about guns r those that know nothing about them&politicians who live in states w/strictest gun laws #cali #ny
Linguistic challenges of social media

• Language
• Problem: typically exhibits very different language style
• Solution: train specific language processing components
• Relevance
• Problem: topics and comments can rapidly diverge.
• Solution: train a classifier or use clustering techniques
• Lack of context
• Problem: hard to disambiguate entities
• Solution: data aggregation, metadata, entity linking
Challenges for NLP

l Noisy language: unusual punctuation, capitalisation, spelling, use of slang, sarcasm etc.
l Terse nature of microposts such as tweets
l Use of hashtags, @mentions etc causes problems for tokenisation #thisistricky
l Lack of context gives rise to ambiguities
l NER performs poorly on microposts, mainly because of linguistic pre-processing failure
– Performance of standard IE tools decreases from ~90% to ~40% when run on tweets rather than news articles
NLP Pre-Processing Pipeline

[Diagram: the pre-processing pipeline runs Language ID, Tokenisation and POS tagging over the text, followed by Named Entity Recognition and Linking; entity linking disambiguates each NE (e.g. “Professor Plum”) against DBpedia URIs such as dbpedia.org/resource/Michael_Jackson vs dbpedia.org/resource/Michael_Jackson_(writer)]
Pipelines for tweets
Errors have a cumulative effect: per-stage accuracies multiply into a much lower overall accuracy, so good performance is important at each stage.


Language Identification
Newswire:
The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse.

Twitter:
LADY GAGA IS BETTER THE 5th TIME OH BABY(:

je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it
[Dutch: “you're not Jacques Cousteau who has discovered a new species; it's obvious, they're covering their faces. Get over it”]

I'm at 地铁望京站 Subway Wangjing (Beijing) http://t.co/KxHzYm00
Tokenisation is only 80% accurate on tweets
Improper grammar, e.g. apostrophe usage:
doesn't → does n't
doesnt → doesn’t

Smileys and emoticons: loss of information (e.g. sentiment)
I <3 you
This piece ;,,( so emotional

Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *

Words run together / skipped spaces
I wonde rif Tsubasa is okay..
We need tools for hashtag analysis
l Hashtags need unravelling and disambiguating – the same character string can often be segmented in more than one way (e.g. #nowthatcherisdead reads as either “now Thatcher is dead” or “now that Cher is dead”):
l #nowthatcherisdead

l #powergenitalia

l #lesbocages

l #molestationnursery

l #teacherstalking

l #therapist
Test your social media skills

What do these hashtags mean?

l #kktny
l #fomo
l #jomo
l #ootd
l #wcw
Hashtag Hijacking

Hashtags are not always used in the same way, or for their original purpose:

Have sex to save the planet!
Tweet Normalisation
l “RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
l OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
l For some components to work well (POS tagger, parser), we need normalisation
l BUT uppercasing and repetition often convey strong sentiment
l Other forms of “misspelling” might indicate information about the author
l First challenge: separate out-of-vocabulary and in-vocabulary words
l Second challenge: fix mis-spelled IV words, e.g. by Levenshtein edit distance (see the sketch below)
l The ZOMG phenomenon
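
A minimal sketch of Levenshtein edit distance in Java (the standard dynamic-programming formulation); picking the in-vocabulary word with the smallest distance to an OOV token is then one simple normalisation strategy:

public class EditDistance {
  // Classic dynamic-programming edit distance between two strings
  public static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                    d[i][j - 1] + 1),   // insertion
                           d[i - 1][j - 1] + cost);     // substitution
      }
    }
    return d[a.length()][b.length()];
  }

  public static void main(String[] args) {
    System.out.println(levenshtein("tomoro", "tomorrow"));  // prints 2
  }
}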
Part-of-speech tagging
• Similar issues as for normalisation – we don’t have big datasets
to train on
• Label unlabelled data with multiple taggers and accept tweets
where tagger votes never conflict
• Model words using Brown clustering and word representations
(Turian 2010)

2m, 2ma, 2mar, 2mara, 2maro, 2marrow, 2mor, 2mora, 2moro, 2morow,
2morr, 2morro, 2morrow, 2moz, 2mr, 2mro, 2mrrw, 2mrw, 2mw, tmmrw, tmo,
tmoro, tmorrow, tmoz, tmr, tmro, tmrow, tmrrow, tmrrw, tmrw, tmrww, tmw,
tomaro, tomarow, tomarro, tomarrow, tomm, tommarow, tommarrow,
tommoro, tommorow, tommorrow, tommorw, tommrow, tomo, tomolo,
tomoro, tomorow, tomorro, tomorrw, tomoz, tomrw, tomz
Named entities: lack of context causes ambiguity

Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!

Getting the NEs right is crucial

Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
How do we deal with all these problematic tweets?

l Typical NLP pipeline means that degraded performance has a knock-on effect along the chain
l Short sentences confuse language identification tools
l Linguistic processing tools have to be adapted to the domain:
– Retraining individual components on large volumes of data
– Adaptation of techniques from e.g. SMS analysis
– Development of new Twitter-specific tools (e.g. GATE's TwitIE)
l But... lack of standards, easily accessible data, common evaluation etc. are holding back development
4. Text Analytics for Semantic Search
Semantic Queries in Google
Searching for Things, Not Strings

• 500 million entities that Google “knows” about
• Used to provide more accurate search results
• Summaries of information about the entity being searched

http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html
Facebook Graph Search
Semantic Enrichment
l Textual mentions aren't actually that useful in isolation
– knowing that something is a “Person” isn't very helpful
– knowing which real-life Person it refers to can be very useful
l Disambiguating mentions against an ontology provides extra context
l This is where semantic enrichment comes in
l The end product is a set of textual mentions linked to an ontology, otherwise known as semantic annotations
l Annotations on their own can be useful but they can also
– be used to generate corpus level statistics
– be used for further ontology population
– form the basis of summaries
– be indexed to provide semantic search
Automatic Semantic Enrichment

• Use Text Mining, e.g.
• Information Extraction – recognise names of people, organisations, locations, dates, references, etc.
• Term recognition – identify domain-specific terms
• Automatically extend article metadata to improve search quality
Mining medical records

l Medical records contain a large amount of unstructured text
– letters between hospitals and GPs
– discharge summaries
l These documents might contain information not recorded elsewhere
– it turns out doctors don't like forms!
– often information-specific fields are ignored, with everything put in the free text area
Medical Records at SLAM

l The NIHR Biomedical Research Centre at the South London and Maudsley Hospital is using our text mining tools in a number of their studies
l Developed applications to extract:
– the results of mental state tests, and the date the test was administered
– education level (high school, university, etc.)
– smoking status
– medication history
l They have even had promising results predicting suicides
[Diagram: 30% of medical information is structured (e.g. lab-result tables of Hb, WBC, RBC and Plt values) – the portion of the medical record that is easy to search and manipulate with computers. 70% of medical information is in free text (e.g. “The peritoneum contains deposits of tumour... the tumour cells are negative for desmin.”) – ambiguous, nuanced, complex, and hard for computers to understand. GATE structures the free-text portion of the medical record and makes it available for research, management, and clinical care, and can link the text to external knowledge, terminologies and coding schemes for intelligent search and analysis. As used by several UK hospitals and world-leading medical record software vendors.]
Cancer Research: can GATE cure cancer?

l Genome Wide Association Studies (GWAS) aim to investigate genetic variants across the whole genome
– With enough cases and controls, this allows them to state that a given SNP (Single Nucleotide Polymorphism) is related to a given disease.
– A single study can be very expensive in both time and money to collect the required samples.
l Can we reduce the costs by analysing published articles to generate prior probabilities for each SNP?
Is GATE better than baking soda?
l In conjunction with IARC (the International Agency for Research on Cancer, part of the WHO) we developed a text analysis approach to mine PubMed
l We showed retrospectively that our approach would have saved over a year's worth of work and more than 1.5 million Euros
l We completed a new study which found a new cause for oral cancer
– Oral cancer is rare enough that traditional methods would have failed to find enough cases to make the study feasible
Government Web Archive

l We developed a semantic annotation application to process every crawled page in the archive.
l Entities annotated included: people, companies, locations, government departments, ministerial positions, social documents, dates, money....
l Where possible, annotations were linked to an ontology which
– was based on DBpedia
– was extended with UK government-specific concepts
– included the modelling of the evolution of government
l Annotations were indexed to allow for complex semantic querying of the collection
Political Futures Tracker

Where in the UK did Conservative MPs tweet more about the economy?
5.1 Semantic Annotation
Why ontologies for semantic search?
l Semantic annotation: rather than just annotating the word “Cambridge” as
a location, link it to an ontology instance
l Differentiate between Cambridge, UK and Cambridge, Mass.
l Semantic search via reasoning
l So we can infer that this document mentions a city in Europe.
l Ontologies tell us that this particular Cambridge is part of the country
called the UK, which is part of the continent Europe.
l Knowledge source
l If I want to annotate strikes in baseball reports, the ontology will tell me
that a strike involves a batter who is a person
l In the text “BA went on strike”, using the knowledge that BA is a company
and not a person, the IE system can conclude that this is not the kind of
strike it is interested in
More semantic search examples

Q: {ScalarValue}{MeasurementUnit} ->
A: “12 cm”, “190 g”, “two hours”
---
Q: {Reference} ->
A: JP-A-60-180889
A: Kalderon et al. (1984) Cell 39:499-509
Semantic Annotation
What is an Ontology?

l Set of concepts (instances and classes)
l Relationships between them (is-a, part-of, located-in)
l Multiple inheritance:
– Classes can have more than one parent
– Instances can have more than one class
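
A minimal illustration in Turtle notation (hypothetical triples for the Cambridge example above, not actual DBpedia data):

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbp-ont: <http://dbpedia.org/ontology/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# “Cambridge” as an instance of a class, with a part-of style relation to its country
dbpedia:Cambridge  rdf:type  dbp-ont:City ;
                   dbp-ont:country  dbpedia:United_Kingdom .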
DBpedia

l Machine readable knowledge on various entities and topics, including:
– 410,000 places/locations
– 310,000 persons
– 140,000 organisations
l For each entity we have:
– entity name variants (e.g. IBM, Int. Business Machines)
– a textual abstract
– reference(s) to corresponding Wikipedia page(s)
– entity-specific properties (e.g. latitude and longitude for places)
Example from DBpedia

[Screenshot: a DBpedia entity page, showing latitude & longitude properties and links to GeoNames and Freebase]

GeoNames
l 2.8 million populated places
– 5.5 million alternate names
l Knowledge about NUTS country sub-divisions
– use for enrichment of recognised locations with the implied
higher-level country sub-divisions
l However, the sheer size of GeoNames creates a lot of ambiguity during semantic enrichment
l We use it as an additional knowledge source, but not as the primary one (DBpedia plays that role)
5.2 Semantic Annotation In GATE
Information Extraction for the Semantic Web

l Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.
l For the Semantic Web, we need information in a hierarchical structure
l The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
l Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
Traditional NE Recognition

John [PER] lives in London [LOC]. He works there for Polar Bear Design [ORG].

Co-reference

same_as: John ↔ He
John [PER] lives in London [LOC]. He works there for Polar Bear Design [ORG].

Relations

live_in: John → London

Relations (2)

employee_of: John → Polar Bear Design

Relations (3)

based_in: Polar Bear Design → London

Richer NE Tagging

l Attachment of instances in the text to concepts in the domain ontology
l Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
Ontology-based IE

John lives in London. He works there for Polar Bear Design.


Ontology-based IE (2)

John lives in London. He works there for Polar Bear Design.


Automatic Semantic Annotation in ENVIA

l Locations (linked to DBpedia and GeoNames)
– Annotate the place name itself (e.g. Norwich) with the corresponding DBpedia and GeoNames URIs
– Also use knowledge of the implied reference to the level 1, 2, and 3 sub-divisions from the Nomenclature of Territorial Units for Statistics (NUTS).
– For Norwich, these are East of England (UKH – level 1), East Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3).
– Similarly use knowledge to retrieve nearby places
“South Gloucestershire” Example
Semantic Annotation (2)
l Organisations (linked to DBpedia)
– Names of companies, government organisations,
committees, agencies, universities, and other organisations
l Dates
– Absolute (e.g. 31/03/2012) and relative (yesterday)
l Measurements and Percentages
– e.g. 8,596 km2 , 1 km, one fifth, 10%
Semantic Search: An Overview
GATE Mímir

• can be used to index and search over text, annotations, and semantic metadata (concepts and instances)
• allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations
• is open source
What can GATE Mímir do that Google can't?

Show me:
• all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
• all abstracts written in French on patent documents from the last
6 months which mention any form of the word “transistor” in the
English abstract
• the names of the patent inventors of those abstracts
• all documents mentioning steel industries in the UK, along with
their location
Search News Articles for
Politicians born in Sheffield

http://demos.gate.ac.uk/mimir/gpd/search/gus
Easily Create Your Own
Custom GATE Mímir Interfaces

http://demos.gate.ac.uk/pin/
MIMIR: Searching Text Mining Results

• Searching and managing text annotations, semantic information, and full text documents in one search engine
• Queries over annotation graphs
• Regular expressions, Kleene operators
• Designed to be integrated as a web service in custom end-user systems with bespoke interfaces
• Demos at http://services.gate.ac.uk/mimir/
Scaling Up
l We annotated 1.08 million web pages using a GATE language analysis pipeline.
– Documents were crawled using Heritrix, with a total content size of 57 GiB, or 6.6 billion plain-text characters.
– The indexing server has 2 Intel Xeon 2.8GHz CPUs and 11 GB of RAM, and runs 64-bit Ubuntu Linux. The indexing process took 94 hours.
l We also indexed 150 million similar web pages, using two hundred Amazon EC2 Large Instances running for a week to produce a federated index
l Mímir runs on GateCloud.net, so it is easy to scale up
• Search for the string: Harriet Harman
• Search for the string: Harriet Harman says
• Search with morphological variants: Harriet Harman root:say
• Replace strings with NEs: {Person} root:say
• {Person} AND root:say – 11803 hits
• {Person} [0..5] root:say – 5495 hits
Patent Annotation: Data Model

An Example Text

Hands-On with patent data
http://demos.gate.ac.uk/mimir/patents/search/index

Text. Matches plain text.
Example: nanomaterial

Linguistic variations of text.
Example: (root:nanomaterial | root:nanoparticle)

Annotation. Matches semantic annotations.
Syntax: {Type feature1=value1 feature2=value2...}
Example: {Abstract lang="DE"}

Sequence Query. A sequence of other queries.
Syntax: Query1 [n..m] Query2...
Example: from {Measurement} [1..5] {Measurement}
Inclusion Queries

IN Query. Hits of one query only if inside another.
Syntax: Query1 IN Query2
Example: (root:nanomaterial | root:nanoparticle) IN {Abstract}
Finds the number of times these words are mentioned in patent abstracts (as well as links to the actual documents)

OVER Query. Hits of a query, only if overlapping hits of another.
Syntax: Query1 OVER Query2
Example: {Abstract} OVER (root:nanomaterial | root:nanoparticle)
Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
Date restrictions
(
  {Abstract lang="EN"} OVER
  (root:nanomaterial | root:nanoparticle)
)
IN
{PatentDocument date > 20050000}

(dates are in YYYYMMDD format)
Find references to literature or patents in the prior
art or background sections, which contain
nanomaterial/nanoparticle
({Reference type="Literature"}
|
{Reference type="Patent"}
) IN
({Section type="PriorArt"}
|
{Section type="BackgroundArt"}
)
OVER
(root:nanomaterial | root:nanoparticle)
Queries Using External Knowledge

{Measurement spec="1 to 100 volts"}

l Uses GNU Units (http://www.gnu.org/software/units/) to convert measurements and normalise them to SI units

{Measurement spec="1 to 100 kg m^2 / A s^3"}

l Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3 volts

{Measurement spec="1 to 100 m / s"}

l Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000 cm/sec
Searching LOD with SPARQL

• SQL-like query language for RDF data
• Simple protocol for querying remote databases over HTTP
• Query types:
• select: projections of variables and expressions
• construct: create triples (or graphs) based on query results
• ask: whether a query returns results (result is true/false) – see the sketch below
• describe: describes resources in the graph
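
A minimal ASK sketch in the same style as the SELECT example on the next slide (illustrative; class and property names follow the DBpedia ontology):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>

# true if the data contains at least one company founded in Sheffield
ASK {
  ?company rdf:type dbp-ont:Company ;
           dbp-ont:foundationPlace dbpedia:Sheffield .
}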
SPARQL Example

# Software companies founded in the US

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX umbel-sc: <http://umbel.org/umbel/sc/>

SELECT DISTINCT ?Company ?Location
WHERE {
  ?Company rdf:type dbp-ont:Company ;
           dbp-ont:industry dbpedia:Computer_software ;
           dbp-ont:foundationPlace ?Location .
  ?Location geo-ont:parentFeature dbpedia:United_States .
}
SPARQL Results

Try at: http://factforge.net/sparql

Documents mentioning Persons born in Sheffield:

{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield> }"}

The BBC web document does not mention Sheffield at all: http://www.bbc.co.uk/news/uk-politics-11494915

The relevant text snippet is: [screenshot]

Try these on the BBC News demo: http://services.gate.ac.uk/mimir/gpd/search/index

Gordon Brown [0..3] root:say

{Person} [0..3] root:say

{Person inst="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }"} [0..3] root:say

{Person sparql="SELECT ?inst WHERE {
  ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> .
  ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }"} [0..3] root:say
User Interfaces for SPARQL-based Semantic Search

• SPARQL-based semantic searches, tapping into LOD resources, are extremely powerful
• However, they are impossible for the vast majority of users to write
• User interfaces for SPARQL-based semantic search:
• Faceted searches (see ExoPatent next)
• Form-based searches (see EnviLOD, PIN)
• Text-based searches (natural language interfaces for querying ontologies), e.g. FREyA
Faceted Search: ExoPatent Example

l Use semantic information to expose linkages between documents, based on the intersecting relationships between various sets of data from:
– the FDA Orange Book (23,000 patented drugs)
– the Unified Medical Language System (UMLS) – a database of 370,000 medical terms
– patent bibliographic information
l Search for diseases, drug names, body parts, references to literature and other patents, numeric values, ranges
l Demo uses a small set of patents (40,000)
ExoPatent: Faceted Search
Annotated Patent Document
Find all applicants who filed patents related to
mitochondria, as well as drug names and active
ingredients
Semantic Search over Content and Annotations

http://demos.gate.ac.uk/trendminer/envilod
EnviLOD Semantic Search UI
Example Results
…and the underlying SPARQL Query

{Sem_Location dbpediaSparql="select distinct ?inst where {
  {{ ?inst <http://dbpedia.org/property/north> ?loc } UNION
   { ?inst <http://dbpedia.org/property/east> ?loc } UNION
   { ?inst <http://dbpedia.org/property/west> ?loc } UNION
   { ?inst <http://dbpedia.org/property/south> ?loc } UNION
   { ?inst <http://dbpedia.org/property/northeast> ?loc } UNION
   { ?inst <http://dbpedia.org/property/northwest> ?loc } UNION
   { ?inst <http://dbpedia.org/property/southeast> ?loc } UNION
   { ?inst <http://dbpedia.org/property/southwest> ?loc }
   FILTER(REGEX(STR(?loc), \"Sheffield\", \"i\"))} }"} AND (root:"flood")

Ongoing work: use GeoSPARQL instead, to be able to specify distances and reason with the richer information in GeoNames
Flooding in Oxford
We don't just have to look at politicians saying and measuring things
l If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
l Or positive/negative comments about certain
people/topics/things/companies
l In the Decarbonet project, we looked at people's opinions
about topics relating to climate change, e.g. fracking
l We could index on the sentiment annotations too
l Other people are using the combination of opinion mining and
MIMIR to look at e.g. customer feedback about products /
companies on a huge corpus
Explicitly Choosing The Search Classes
Choosing A Specific Instance
What diseases are in these documents?
What Pathogens?
Disease vs Disease Co-occurrences
Diseases vs Pathogens
Some cool applications with GATE
Back to Trump (there’s no escape)
Real-time Opinion Monitoring

[Chart: tweets vs replies]
Climate change, ISIS and Trump

Querying election data with MIMIR

• Dataset: every tweet by MP / Candidate / Party, plus all replies/retweets
• Find all tweets where a Conservative MP talked about the economy
Parties / themes co-occurrence

[Chart: co-occurrence of themes (Media and communications, Community and society, Borders and Immigration, Tax and revenue, Public health, Employment, Scotland, UK economy, Europe, NHS) against party groups (Labour Party MPs and candidates, Conservative Party MPs and candidates, UKIP candidates, SNP candidates, Green Party candidates, Liberal Democrats candidates, other MPs)]
Hate speech towards MPs on Twitter
[Charts: 2015 vs 2017 comparison]

http://greenwoodma.servehttp.com/data/buzzfeed/sunburst.html
Environmental behaviour analysis
• Based on the assumption that users in different behavioural stages communicate differently (different emotions, directives, etc.)

Pajarito @lindopajarito · 2h
Our building needs 40% of all energy consumed in Switzerland! :(
→ Desirability: negative sentiment (expressing personal frustration – anger/sadness)

DJPajarito @DJPajaritoGenial · 12h
I'm so proud when I remember to save energy and I know however small it's helping.
→ Buzz: positive sentiment (happiness/joy), I/we + present tense

HotelPajarito @HotelPajarito · 18h
Join us today to switch off a light for EH! :)
→ Invitation: positive sentiment (happy) + use of vocatives

Recognition of environmental terms in Decarbonet
A positive tweet
A negative tweet
A sarcastic tweet
Term recognition and sentiment analysis in Decarbonet

l http://services.gate.ac.uk/decarbonet/sentiment
KNOWMAK:
Mapping the state of European research

• Build an ontology to map between user queries (who's doing what, where?) and databases of projects, patents, publications
• Build indicators on the data (how many patents by which actors in which country?)
• Build visualisation tools to show the results
• Searching and mapping are done via keywords mapped to topics (ontology classes) based around Key Enabling Technologies (KETs) and Societal Grand Challenges (SGCs)
• This deals with the problem that the terms used can vary widely between users and between document types
• Ontology search demo
Topic visualisation

http://www.dcs.shef.ac.uk//~adam/stuff/knowmak/visualization/
Summary
l Text mining is a very useful prerequisite for doing all kinds of more interesting things, like semantic search
l Semantic web search allows you to do much more interesting kinds of search than standard text-based search
l Text mining is hard, so it won't always be correct
l This is especially true on lower-quality text such as social media
l We run an annual GATE training course in June in Sheffield, where you can spend a whole week learning all this and more!
Acknowledgements
This work is supported by:
• the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework and H2020 Programmes for R&D
• DecarboNet (610829) http://www.decarbonet.eu
• SoBigData (654024) http://www.sobigdata.eu
• COMRADES (687847) http://www.comrades-project.eu
• KNOWMAK (726992) http://knowmak.eu
• Nesta http://nesta.org.uk
What to ask Santa for this Christmas

• Great basic intro to NLP
• Uses GATE as examples
• Discusses other tools and the differences between them
• Chapters on semantic search, social media analysis, sentiment analysis, cool applications, and more
Key Publications
l K. Bontcheva, V. Tablan and H. Cunningham. Semantic Search over Documents and Ontologies. Bridging Between Information Retrieval and Databases, 31-53, 2014
l D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017
l V. Tablan, I. Roberts, H. Cunningham and K. Bontcheva. GATECloud.net: a Platform for Large-Scale, Open-Source Text Processing on the Cloud. Philosophical Transactions of the Royal Society A, 371(1983), 2013
l More papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html
Some useful links
• GATE: http://gate.ac.uk
• GateCloud: https://cloud.gate.ac.uk
• Annual GATE training course in June in Sheffield:
https://gate.ac.uk/family/training.html
• Download GATE: http://gate.ac.uk/download
• GATE blog posts on social media analysis:
http://gate4ugc.blogspot.co.uk/
• UK elections monitor http://gate.ac.uk/projects/pft
• Blog post on abuse of MPs:
• COMRADES project on disasters: http://gate.ac.uk/projects/comrades
• KNOWMAK project and demos: http://gate.ac.uk/projects/knowmak
• SoBigData project: http://sobigdata.eu
Has your head exploded yet?

Questions?
