Natural Language Processing
(Weekly Laboratory Assignments)
Sumit Kumar Banerjee
Contents
1
2
3 Assignments on Language Modeling
  3.1 Question 1
  3.2 Question 2
  3.3 Question 3
  3.4 Question 4
  3.5 Question 5
  3.6 Question 6
  3.7 Question 7
Chapter 1
Chapter 2
Chapter 3
Language Modeling
3.1 Write a Python program to implement a Unigram Language Model with Laplace Smoothing.
from collections import Counter

def unigram_model(corpus):
    tokens = corpus.split()
    counts = Counter(tokens)
    v = len(counts)                        # vocabulary size
    total_tokens = sum(counts.values())

    def prob(word):
        # Laplace (add-one) smoothing
        return (counts[word] + 1) / (total_tokens + v)

    return prob

corpus = "the cat sat on the mat the cat ate fish"
model = unigram_model(corpus)
print("P(cat) =", model("cat"))
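As a quick check of the smoothing, the add-one unigram probabilities should sum to 1 over the seen vocabulary. The sketch below rebuilds the model so it runs on its own; it is an illustration, not part of the assignment.

```python
from collections import Counter

# Rebuild the unigram model from Question 3.1 so this check is self-contained.
def unigram_model(corpus):
    tokens = corpus.split()
    counts = Counter(tokens)
    v = len(counts)
    total_tokens = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total_tokens + v)

corpus = "the cat sat on the mat the cat ate fish"
model = unigram_model(corpus)

# Each word contributes (count + 1) / (total + V); over the vocabulary the
# numerators sum to total + V, so the mass is exactly 1.
total_mass = sum(model(w) for w in set(corpus.split()))
print(total_mass)
```

Note that words outside the training vocabulary still receive the small probability 1 / (total + V), so the distribution over all possible words sums to slightly more than 1; that is a known limitation of this simple formulation.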
3.2 Write a Python program to implement a Bigram Model with Laplace Smoothing.
from collections import defaultdict

def bigram_model(corpus):
    tokens = corpus.split()
    b = defaultdict(int)   # bigram counts
    u = defaultdict(int)   # unigram (history) counts
    vocab = set(tokens)
    for i in range(len(tokens) - 1):
        u[tokens[i]] += 1
        b[(tokens[i], tokens[i + 1])] += 1
    v = len(vocab)

    def prob(w1, w2):
        # Laplace-smoothed P(w2 | w1)
        return (b[(w1, w2)] + 1) / (u[w1] + v)

    return prob

corpus = "the cat sat on the mat the cat ate fish"
model = bigram_model(corpus)
print("P(cat | the) =", model("the", "cat"))
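A common use of a bigram model is scoring a whole sentence by chaining the conditional probabilities. The helper `sentence_prob` below is an illustrative addition, not part of the assignment; the model itself is rebuilt so the sketch runs on its own.

```python
from collections import defaultdict

# Rebuild the bigram model from Question 3.2 so this sketch is self-contained.
def bigram_model(corpus):
    tokens = corpus.split()
    b, u = defaultdict(int), defaultdict(int)
    vocab = set(tokens)
    for i in range(len(tokens) - 1):
        u[tokens[i]] += 1
        b[(tokens[i], tokens[i + 1])] += 1
    v = len(vocab)
    return lambda w1, w2: (b[(w1, w2)] + 1) / (u[w1] + v)

def sentence_prob(sentence, prob):
    # Approximate P(w1..wn) as the product of P(w_i | w_{i-1}).
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= prob(w1, w2)
    return p

model = bigram_model("the cat sat on the mat the cat ate fish")
print(sentence_prob("the cat sat", model))
```

Ignoring the probability of the first word is a simplification; a fuller treatment would add a sentence-start marker so the first word is also conditioned.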
3.3 Write a Python program to implement a Trigram Text Generator.
from random import choice
from collections import defaultdict

def trigram_generator(corpus, start, length=10):
    o = corpus.split()
    t = defaultdict(list)
    # map each word pair to the list of words that followed it
    for i in range(len(o) - 2):
        t[(o[i], o[i + 1])].append(o[i + 2])
    text = list(start)
    for _ in range(length):
        pair = tuple(text[-2:])
        # fall back to <END> when the pair was never seen
        next_word = choice(t.get(pair, ["<END>"]))
        if next_word == "<END>":
            break
        text.append(next_word)
    return " ".join(text)

corpus = input()
print(trigram_generator(corpus, ("the", "cat")))
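To see why the generator can produce different outputs on different runs, it helps to inspect the trigram table directly. For the toy corpus used in the earlier questions, the pair ("the", "cat") has two recorded continuations, and `choice` picks one at random:

```python
from collections import defaultdict

# Build the same trigram table that trigram_generator builds internally.
corpus = "the cat sat on the mat the cat ate fish"
o = corpus.split()
t = defaultdict(list)
for i in range(len(o) - 2):
    t[(o[i], o[i + 1])].append(o[i + 2])

print(t[("the", "cat")])  # ['sat', 'ate']
print(t[("cat", "sat")])  # ['on']
```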
3.4 Write a Python program to perform Bigram Spell Correction.
from difflib import get_close_matches

# p is a bigram probability function such as the one returned by
# bigram_model() from Question 3.2.
def spell_correct(sentence, vocab, p):
    words = sentence.split()
    o = [words[0]]
    for i in range(1, len(words)):
        if words[i] not in vocab:
            # candidate corrections by string similarity
            c1 = get_close_matches(words[i], vocab)
            if c1:
                # rank candidates by bigram probability given the previous word
                s = [(c, p(o[-1], c)) for c in c1]
                words[i] = max(s, key=lambda x: x[1])[0]
        o.append(words[i])
    return " ".join(o)

corpus = "the cat sat on the mat"
vocab = set(corpus.split())
model = bigram_model(corpus)
print(spell_correct("the cet sat on teh mat", vocab, model))
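The quality of the correction depends on what difflib.get_close_matches returns as candidates. A quick standalone probe with the same toy vocabulary shows it ranks "cat" as the best in-vocabulary match for the misspelling "cet":

```python
from difflib import get_close_matches

vocab = ["the", "cat", "sat", "on", "mat"]
# Defaults: up to n=3 matches with similarity ratio >= 0.6.
matches = get_close_matches("cet", vocab)
print(matches)
```

If no candidate clears the 0.6 similarity cutoff, the list is empty and `spell_correct` leaves the word unchanged, which is why the `if c1:` guard is needed.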
3.5 Write a Python program to perform Viterbi POS Tagging.
N = [ ’ Noun ’ , ’ Verb ’ ]
s t a r t _ p = { ’ Noun ’ : 0 . 6 , ’ Verb ’ : 0 . 4 }
T = { ’ Noun ’ : { ’ Noun ’ : 0 . 1 , ’ Verb ’ : 0 . 9 } ,
’ Verb ’ : { ’ Noun ’ : 0 . 8 , ’ Verb ’ : 0 . 2 } }
E = { ’ Noun ’ : { ’ f i s h ’ : 0 . 5 , ’ eat ’ : 0 . 5 } ,
’ Verb ’ : { ’ f i s h ’ : 0 . 4 , ’ eat ’ : 0 . 6 } }
d e f v i t e r b i ( o , N, s tar t_p , T, E ) :
V = [{}]
path = {}
f o r s i n N:
V [ 0 ] [ s ] = s t a r t _ p [ s ] ∗ E [ s ] . g e t ( o [ 0 ] , 1 e −4)
path [ s ] = [ s ]
f o r t in range (1 , len ( o ) ) :
V. append ( { } )
new_path = {}
f o r s i n N:
(P , S ) = max ( (V[ t − 1 ] [ x ] ∗ T [ x ] [ s ] ∗ E [ s ] . g e t ( o [ t ] , 1 e −4) , x )
V[ t ] [ s ] = P
new_path [ s ] = path [ S ] + [ s ]
path = new_path
( prob , s t a t e ) = max ( (V[ l e n ( o ) − 1 ] [ s ] , s ) f o r s i n N)
r e t u r n path [ s t a t e ]
p r i n t ( v i t e r b i ( [ ’ f i s h ’ , ’ eat ’ ] , sn , sta rt_ p , T, E ) )
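Because the observation sequence here is short, the Viterbi result can be cross-checked against brute-force enumeration of every tag sequence. The sketch below redefines the same model tables so it runs independently; `joint` and `brute_force` are illustrative helpers, not part of the assignment.

```python
from itertools import product

N = ['Noun', 'Verb']
start_p = {'Noun': 0.6, 'Verb': 0.4}
T = {'Noun': {'Noun': 0.1, 'Verb': 0.9},
     'Verb': {'Noun': 0.8, 'Verb': 0.2}}
E = {'Noun': {'fish': 0.5, 'eat': 0.5},
     'Verb': {'fish': 0.4, 'eat': 0.6}}

def joint(seq, obs):
    # Joint probability of a full tag sequence and the observations.
    p = start_p[seq[0]] * E[seq[0]].get(obs[0], 1e-4)
    for t in range(1, len(obs)):
        p *= T[seq[t - 1]][seq[t]] * E[seq[t]].get(obs[t], 1e-4)
    return p

def brute_force(obs):
    # Try every possible tag sequence and keep the most probable one.
    return list(max(product(N, repeat=len(obs)), key=lambda seq: joint(seq, obs)))

print(brute_force(['fish', 'eat']))  # ['Noun', 'Verb'], matching viterbi()
```

Enumeration costs |N|^T sequences, which is exactly the exponential blow-up that Viterbi's dynamic programming avoids.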
3.6 Write a Python program to compute the Forward Probability.
# Assumes HMM parameter dictionaries named states, start_p, trans_p and
# emit_p are already defined (e.g. the Noun/Verb tables from Question 3.5
# under these names).
def forward(obs, states, start_p, trans_p, emit_p):
    fwd = [{}]
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s].get(obs[0], 0.0001)
    for t in range(1, len(obs)):
        fwd.append({})
        for s in states:
            # sum over all predecessor states, then apply the emission
            fwd[t][s] = (sum(fwd[t - 1][s0] * trans_p[s0][s] for s0 in states)
                         * emit_p[s].get(obs[t], 0.0001))
    return sum(fwd[-1][s] for s in states)

print(forward(['fish', 'eat'], states, start_p, trans_p, emit_p))
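The forward probability should equal the sum of the joint probabilities over all possible state sequences, not just the best one as in Viterbi. The standalone check below uses the Noun/Verb tables from Question 3.5 as the HMM parameters; `total_by_enumeration` is an illustrative helper.

```python
from itertools import product

states = ['Noun', 'Verb']
start_p = {'Noun': 0.6, 'Verb': 0.4}
trans_p = {'Noun': {'Noun': 0.1, 'Verb': 0.9},
           'Verb': {'Noun': 0.8, 'Verb': 0.2}}
emit_p = {'Noun': {'fish': 0.5, 'eat': 0.5},
          'Verb': {'fish': 0.4, 'eat': 0.6}}

def forward(obs):
    fwd = [{}]
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s].get(obs[0], 0.0001)
    for t in range(1, len(obs)):
        fwd.append({})
        for s in states:
            fwd[t][s] = (sum(fwd[t - 1][s0] * trans_p[s0][s] for s0 in states)
                         * emit_p[s].get(obs[t], 0.0001))
    return sum(fwd[-1][s] for s in states)

def total_by_enumeration(obs):
    # Sum the joint probability of every state sequence explicitly.
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = start_p[seq[0]] * emit_p[seq[0]].get(obs[0], 0.0001)
        for t in range(1, len(obs)):
            p *= trans_p[seq[t - 1]][seq[t]] * emit_p[seq[t]].get(obs[t], 0.0001)
        total += p
    return total

obs = ['fish', 'eat']
print(forward(obs), total_by_enumeration(obs))  # the two values agree
```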
3.7 Write a Python program to perform HMM Named Entity Recognition.
states = ['O', 'PER']
start_p = {'O': 0.9, 'PER': 0.1}
trans_p = {'O': {'O': 0.9, 'PER': 0.1}, 'PER': {'O': 0.4, 'PER': 0.6}}
# 'Smith': 0.3 is an assumed value, chosen so the PER emissions sum to 1.
emit_p = {'O': {'I': 0.4, 'live': 0.6}, 'PER': {'John': 0.7, 'Smith': 0.3}}

def viterbi(o, N, start_p, T, E):
    V = [{}]
    path = {}
    for s in N:
        V[0][s] = start_p[s] * E[s].get(o[0], 1e-4)
        path[s] = [s]
    for t in range(1, len(o)):
        V.append({})
        new_path = {}
        for s in N:
            (P, S) = max((V[t - 1][x] * T[x][s] * E[s].get(o[t], 1e-4), x)
                         for x in N)
            V[t][s] = P
            new_path[s] = path[S] + [s]
        path = new_path
    (prob, state) = max((V[len(o) - 1][s], s) for s in N)
    return path[state]

print(viterbi(['John', 'Smith'], states, start_p, trans_p, emit_p))
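With these parameters, a sentence containing no names should decode as all 'O' tags. The standalone sketch below re-runs the same Viterbi decoder on ['I', 'live'] (the 0.3 emission for 'Smith' is an assumed value, as noted above):

```python
states = ['O', 'PER']
start_p = {'O': 0.9, 'PER': 0.1}
trans_p = {'O': {'O': 0.9, 'PER': 0.1}, 'PER': {'O': 0.4, 'PER': 0.6}}
# 'Smith': 0.3 is an assumed value so the PER emissions sum to 1.
emit_p = {'O': {'I': 0.4, 'live': 0.6}, 'PER': {'John': 0.7, 'Smith': 0.3}}

def viterbi(o, N, start_p, T, E):
    V = [{}]
    path = {}
    for s in N:
        V[0][s] = start_p[s] * E[s].get(o[0], 1e-4)
        path[s] = [s]
    for t in range(1, len(o)):
        V.append({})
        new_path = {}
        for s in N:
            (P, S) = max((V[t - 1][x] * T[x][s] * E[s].get(o[t], 1e-4), x)
                         for x in N)
            V[t][s] = P
            new_path[s] = path[S] + [s]
        path = new_path
    (prob, state) = max((V[len(o) - 1][s], s) for s in N)
    return path[state]

tags = viterbi(['I', 'live'], states, start_p, trans_p, emit_p)
print(tags)  # ['O', 'O']
```

Words absent from the emission tables fall back to the tiny probability 1e-4, so an unseen word is tagged mainly on the strength of the transition probabilities around it.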