Lecture 1: Introduction to Computational Linguistics
Nigel Collier
Faculty of Modern and Medieval Languages and Linguistics
Li18 Computational Linguistics
1
Summary
1. Course Admin.
2. What is Computational Linguistics?
3. Language Models?
4. Complexity of Language Tasks
2
Course Admin.
Course Supervisors for Undergraduates
• Should be in touch with you by the end of the first week of term - if
not then please contact me (nhc30)
3
Course Admin.
Course Supervisors
Carlos Balhana ceb81
Alan Ansell aja63
Andrew Caines apc38
Nigel Collier nhc30
Evgeniia Razumovska er563
Olga Majewska om304
Ulla Petti ump20
Fangyu Liu fl399
4
Course Admin.
Course Textbook
• In 2018 I updated the course to bring it up to speed with the latest
developments that are shaping the field
• The main course textbook remains Speech and Language
Processing by Daniel Jurafsky and James Martin (‘J&M’)
• Second Edition. International Edition. 2008 - There are 4
copies in the library, 2 of them on overnight loan. Ask your DoS to
get a copy for the college library
5
Course Admin.
Make sure you know which edition/version you’ve got
[Figure labels: “You’re ready to go” and “Be mindful”]
6
Course Admin.
The lecture notes and slides will be drawing on new material
• Third Edition. Online. In draft - available online at
https://web.stanford.edu/~jurafsky/slp3/
• Page references in the lecture notes will point to the Second
edition. Where necessary I will indicate chapter titles for new
material in the online Third edition
• Recommended reading and Going Further in the lecture notes
will point to published papers. Check Google Scholar!
• Feedback is very much encouraged!
7
Undergraduates – Part IIA
Mathieu Barrier mb2407
Tabassum Chowdhury tc592
Georgia Clothier gc629
Benjamin Conway bpc41
Jacob Davies jed74
Beatrice Greenhalgh beg31
Ben Hunt bsh35
Harriet Innes hi257
Madeleine Jaeger maj70
Eve James emj42
Mirriam Lay ml968
Rosa Millard ram204
Jasper Pennings jcp75
Alex Provost awbp2
Charis Saer cas240
Suchir Salhan sas245
Emily Shen eys23
Alice Sizer as2999
Jadd Virji jv427
Katherine Wu jw2134
8
Undergraduates – IIB, MML, Erasmus
Patrick Farnworth IIB prf24
Henrietta Manning IIB hejm3
Anastasia Karamzina MML aak62
Sandra Perosa MML sp924
9
Course Admin.
Other Undergraduates
If you haven’t seen your name, please email me (nhc30).
TAL MPhils
Welcome onboard!
Other Language Science Interdisciplinary Programme MPhils
Please email me (nhc30).
10
Study and Supervisions
• Lecture notes and slides will be available on Moodle before the
lecture
• Recommended Reading in lecture notes to extend your
understanding (focus on quality and accessibility)
• Going Further - not obligatory - to take your understanding of the
subject beyond the course (focus on quality and influence on the
field. Warning: no boundary on technical accessibility)
• Post-lecture exercises and pre-lecture exercises to help you study
• Starred exercises to be handed in for supervisions
• No mathematical or programming ability required
• Maths crib sheet available on Moodle
11
Michaelmas Overview
12
What Topics Changed in 2021?
• Lecture 2: added in a subsection on byte-pair encoding;
• Lecture 6: Named entity recognition is now added to part of speech
as an example of sequence labelling;
• Lecture 7: Treebanking has been moved here to show how human
labelled corpora can be useful;
• Updated Recommended and Going Further references
• Python for Computational Linguistics becomes integrated into the
course!
13
Undergraduates: marking and examinations in 2021/22
• 20%: Python Lab practicals (assessed in Week 7 of Michaelmas and Lent)
• 80%: End of year written examination
14
Information about Programming
Python is one of the most popular and well supported programming languages used
for Natural Language Processing. Python for Computational Linguists – self-paced
Jupyter notebooks:
https://github.com/cambridgeltl/python4cl
15
Information about Programming
Undergraduates: The Python Programming for Linguists course is now
a core part of Li18 – self-paced Jupyter notebooks supported by 6
hours of labs per term in Michaelmas and Lent. See Moodle for
how to get started. The first lab is on Tuesday October 12th from 1pm to
3pm. Assessments are in Week 7 of Michaelmas and Lent.
MPhils: The Python Programming for Linguists course is voluntary and
carries no credit, but is recommended. It can be pursued as a self-paced
activity online from home. Again, see Moodle for how to get started.
16
Python for Computational Linguists
• A self-paced course on Python specially designed for Li18 students
• Uses Jupyter Notebooks
• Trialled on a volunteer basis in 2019 and 2020, now an essential part of
the course
• Two sets of modules – one in Michaelmas and one in Lent
• Expect 7 to 12 hours of work per term
• Full instructions on Moodle
17
What is Computational Linguistics?
• A language science that involves building computational models of
languages
• A computational model can refer to any formally (mathematically)
specified model that describes a language phenomenon
19
Computational Linguistics is a Multi-disciplinary Field
20
Computational Linguistics Splits into Two Broad Areas
Natural Language Processing
• Construction of language models for use in computational tasks and
applications, e.g. Machine Translation (MT), Question Answering (QA)
or Conversational Agents
• A branch of computer science (‘natural’ language as opposed to artificial
programming languages)
• Our focus in this course
Computational Cognitive Linguistics
• Construction of language models to further our understanding of the
cognition of language
• Includes computational psycholinguistics and computational neurolinguistics
21
Natural Language Analysis
NLP models automatically analyse language to produce
the possible structures/annotations that you have been taught to think about.
Consider some linguistic ambiguities you have encountered ...
o Morphology
o Syntax
o Semantics
o Pragmatics
o ...
You have been learning to associate structure (or annotation) with linguistic units
and, in cases of ambiguity, to demonstrate that there is more than one possible
structure
23
Combining Language Models
From a computational perspective language is full of ambiguity. A sentence may
have a high number of possible meanings if it contains a number of different
types of ambiguity.
I made her duck
24
Combining Language Models
I made her duck
1. I cooked waterfowl for her
2. I cooked waterfowl belonging to her
3. I created the (plaster?) duck she owns
4. I caused her to quickly lower her head
5. I turned her into a duck
Several types of ambiguity combine to cause many meanings:
• morphological (her can be a dative pronoun or possessive pronoun and
duck can be a noun or a verb)
• syntactic (make can behave both transitively and ditransitively; make can
select a direct object or a verb)
• semantic (make can mean create, cause, cook, ...)
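To see how these ambiguities multiply, here is a minimal Python sketch that enumerates every combination of choices for the three ambiguous words. The category labels follow the slide, but the option lists and the hand-written READINGS table are illustrative assumptions, not the output of a real parser.

```python
from itertools import product

# Hypothetical, hand-picked options for each ambiguous word in "I made her duck".
# Non-possessive "her" is labelled "dative" simply to follow the slide's wording.
MAKE = ["cook", "create", "cause", "transform"]
HER = ["dative pronoun", "possessive pronoun"]
DUCK = ["noun", "verb"]

# Hand-written table of the combinations that yield a sensible reading,
# mirroring the five paraphrases listed above.
READINGS = {
    ("cook", "dative pronoun", "noun"): "I cooked waterfowl for her",
    ("cook", "possessive pronoun", "noun"): "I cooked waterfowl belonging to her",
    ("create", "possessive pronoun", "noun"): "I created the duck she owns",
    ("cause", "dative pronoun", "verb"): "I caused her to quickly lower her head",
    ("transform", "dative pronoun", "noun"): "I turned her into a duck",
}

# Every way of combining one choice per word.
candidates = list(product(MAKE, HER, DUCK))
# Only the combinations that the table licenses as real readings.
licensed = [(combo, READINGS[combo]) for combo in candidates if combo in READINGS]

print(len(candidates), "candidate combinations,", len(licensed), "licensed readings")
for combo, paraphrase in licensed:
    print(combo, "->", paraphrase)
```

Even in this toy setting a four-word sentence gives many candidate analyses, and a model has to decide which combinations are compatible and which one the speaker intended.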
25
Combining Language Models
What types of ambiguity are we seeing in these examples?
At the party there were young men and women.
My neighbor’s hat was taken by the wind. He tried to catch it.
Thank you for not eating or playing music without earphones.
Doctor testifies in horse suit.
26
Complexity of Language Tasks and Applications
* Thanks to Dan Jurafsky for this visual.
28
Complexity of Language Tasks
Sentiment Analysis – Classifying Product Reviews
"... Julie Delpy is far too good for this movie. She imbues Serafine with spirit,
spunk, and humanity. This isn’t necessarily a good thing, since it prevents us
from relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS as a
completely mindless, campy entertainment experience. Delpy’s injection of class
into an otherwise classless production raises the spectre of what this film could
have been with a better script and a better cast ... She was radiant, charismatic,
and effective ..."
- "a good actor trapped in a bad movie" from Pang et al. (2002).
Pang, B., Lee, L., & Vaithyanathan, S. (2002, July). Thumbs up?: sentiment classification using
machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in
natural language processing-Volume 10 (pp. 79-86)
29
Complexity of Language Tasks
What features of the text help to predict the number of stars? Are the features
hard to identify and disambiguate?
Sentiment Analysis – Classifying Product Reviews
"... Julie Delpy is far too good for this movie. She imbues Serafine with spirit,
spunk, and humanity. This isn’t necessarily a good thing, since it prevents us
from relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS as a
completely mindless, campy entertainment experience. Delpy’s injection of class
into an otherwise classless production raises the spectre of what this film could
have been with a better script and a better cast ... She was radiant, charismatic,
and effective ..."
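One way to explore these questions is to count sentiment-bearing words in the review. The sketch below is a deliberately naive illustration: the POSITIVE and NEGATIVE word lists are hypothetical and hand-picked for this one text, and the method is not the one used by Pang et al. (2002).

```python
import re

REVIEW = (
    "Julie Delpy is far too good for this movie. She imbues Serafine with "
    "spirit, spunk, and humanity. This isn't necessarily a good thing, since "
    "it prevents us from relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS "
    "as a completely mindless, campy entertainment experience. Delpy's "
    "injection of class into an otherwise classless production raises the "
    "spectre of what this film could have been with a better script and a "
    "better cast ... She was radiant, charismatic, and effective ..."
)

# Hypothetical mini-lexicons, chosen by hand for illustration only.
POSITIVE = {"good", "spirit", "spunk", "humanity", "class", "better",
            "radiant", "charismatic", "effective"}
NEGATIVE = {"mindless", "campy", "classless"}


def count_cues(text):
    """Return (positive, negative) counts of lexicon words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for token in tokens if token in POSITIVE)
    neg = sum(1 for token in tokens if token in NEGATIVE)
    return pos, neg


pos, neg = count_cues(REVIEW)
# Naive decision rule: whichever kind of cue word is more frequent wins.
label = "positive" if pos > neg else "negative"
print(pos, neg, label)
```

Positive cue words heavily outnumber negative ones, so the naive rule labels the review positive, although the passage describes a good actor trapped in a bad movie. Words such as "good" and "better" praise the actor or a hypothetical better film rather than the film under review, which is why such features are hard to identify and disambiguate.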
30
Complexity of Language Tasks
Question Answering: Alexa, Amazon’s virtual assistant
32
Complexity of Language Tasks
Information Extraction about rocket launches
33
Adherence to Linguistic Theory
• The field of computational linguistics has not derived directly from
traditional linguistics.
• It is an interdisciplinary subject that is as closely related to mathematics
and computer science as it is to linguistics.
• The extent to which computational models of language draw directly from
linguistic theory is very varied.
34
What Types of Language Models Will We Look At?
• There are many different types of language model and ways of describing
them.
• The choice of the model will depend on the linguistic unit being
described, and often the task to which it is applied.
• In this course we will look at: rule-based models, finite state machines,
(lexical and context-free grammar) statistical models, neural models.
35
Exercises (see Lecture Notes for details)
Post-Lecture Exercise
1. Read J&M (2nd edition), Chapter 1.
Pre-Lecture Exercises
1. Read about Eliza in Weizenbaum (1966) and try it online. Think about the
process that Eliza uses to identify keywords and transform them into
responses. What linguistic knowledge would be necessary to make it more
proficient in its task?
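As a starting point for thinking about the Eliza exercise, here is a minimal Python sketch of keyword spotting followed by a response template. The rules, pronoun swaps and default reply are hypothetical examples written for this illustration; they are not Weizenbaum's original script, which is considerably richer.

```python
import re

# Hypothetical pronoun swaps, so that "my" in the input becomes "your" in the reply.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

# Hypothetical (pattern, response template) rules, tried in order.
RULES = [
    (re.compile(r".*\bi am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r".*\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."


def reflect(fragment):
    """Swap first-person words for second-person ones, word by word."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance):
    """Reply using the first rule whose keyword pattern matches the input."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT


print(respond("I am unhappy about my exams."))
print(respond("The weather is nice."))
```

The sketch has no knowledge of morphology, syntax or meaning: it only matches surface strings, which is a useful contrast when considering what linguistic knowledge would make such a system more proficient.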
36