AI Lab Report
Abstract—In this report, we have addressed seven out of the eight assigned lab problems. The tasks include: Week 1: Modeling problems as state-space search challenges and solving them using BFS/DFS. This was achieved by implementing the missionaries and cannibals problem, as well as the rabbit leap problem. Week 2: Designing a graph search agent and understanding the use of a hash table and queue in state space search. This is done by implementing the Puzzle-8 problem and a plagiarism detection system using the A* search algorithm. Week 3: Understanding the use of a heuristic function for reducing the size of the search space. This we did by solving the game of marble solitaire and by writing programs for generating k-SAT problems and solving a set of uniform random 3-SAT problems for different combinations of m and n, comparing their performance. Week 4: Non-deterministic Search — Simulated Annealing. For problems with large search spaces, randomized search becomes a meaningful option given partial/full information about the domain. Week 6: Three key objectives are pursued: determining the tolerable error for stored patterns in the provided code, formulating an energy function for the Eight-rook problem while selecting suitable weights, and solving a 10-city Traveling Salesman Problem (TSP) using a Hopfield network while estimating the necessary weight count. Through these endeavors, the study aims to evaluate the viability of Hopfield networks in real-world combinatorial optimization tasks, shedding light on their capabilities and limitations in practical problem-solving contexts. Week 7: We implemented a binary bandit using a stochastic reward system and developed a 10-armed bandit with non-stationary rewards to observe how an agent using a modified epsilon-greedy approach adapts in dynamic environments. The report highlights crucial elements of MENACE and the challenges of non-stationary rewards in RL, showcasing effective strategies for tackling these problems. Week 8: We explored Markov Decision Processes (MDPs) and their applications in real-world problems like the Gbike bicycle rental problem. We formulated the problem using states, actions, rewards, and transition probabilities and solved it using policy iteration. This involved evaluating the current policy and improving it iteratively to maximize long-term rewards. Optimizations such as truncated Poisson distributions and precomputed probabilities were implemented for computational efficiency.

Index Terms—State-space search, Breadth-First Search, Depth-First Search, AI Algorithms, Complexity Analysis, Simulated Annealing, Optimization, Markov Decision Process, Reinforcement Learning, Non-stationary rewards, Policy iteration, Value iteration, Discount factor, Transition probabilities, Reward function, Truncated Poisson distribution, Dynamic programming

I. WEEK 1: TO BE ABLE TO MODEL A GIVEN PROBLEM IN TERMS OF A STATE SPACE SEARCH PROBLEM AND SOLVE THE SAME USING BFS/DFS

A. Missionaries and Cannibals problem

The Missionaries and Cannibals problem is modeled as a state-space search problem where the goal is to transport three missionaries and three cannibals across a river using a boat that can hold up to two people. The search space consists of possible configurations of people on either side of the river. BFS ensures finding the optimal solution, while DFS may return a solution faster but does not guarantee optimality.

(i) A state is represented in this problem as (M, C, B), where M is the number of missionaries on the left bank, C is the number of cannibals on the left bank, and B represents the boat's position (1 if the boat is on the left, 0 if it is on the right). The initial state is (3, 3, 1) and the goal state is (0, 0, 0). While defining the state transitions, we need to keep in mind that the number of cannibals should not exceed the number of missionaries. The possible transitions therefore move 1 missionary and 1 cannibal, 2 missionaries, 2 cannibals, 1 missionary, or 1 cannibal across the river. There are about 32 possible states (4 × 4 × 2), because M and C each have four possibilities (0, 1, 2, 3) and B has only two (0, 1).

(ii) Breadth-First Search (BFS) is a level-order traversal. We started from the initial state (3, 3, 1), explored all valid states reachable by one boat trip, and continued expanding states until the goal state (0, 0, 0) was reached. BFS guarantees finding the shortest path if a solution exists, making it optimal for this problem.

(iii) Depth-First Search (DFS) explores as far as possible along each branch before backtracking. It does not guarantee finding the optimal solution. We started from the initial state (3, 3, 1) and explored as deep as possible in one branch of possible states. If a solution was found, we returned it; otherwise, we backtracked.

(iv) BFS guarantees the shortest path and is always optimal. DFS does not guarantee optimality: the solution obtained might have more steps than necessary, or it might explore unproductive branches before finding the goal. The time complexity of BFS is O(b^d), where b is the branching factor (the average number of successors of a state) and d is the depth of the shallowest goal state. BFS explores every state at a given depth level, making it slower in terms of execution time but ensuring optimality. The time complexity of DFS is O(b^m), where m is the maximum depth of the search tree. DFS can get lucky and find a solution quickly, but it may also explore a much larger portion of the tree if the solution lies deep. The space complexity of BFS is O(b^d), whereas that of DFS is O(bm), where m is the max depth, so DFS is more space-efficient than BFS.
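For illustration, a minimal BFS sketch over the (M, C, B) representation described above is given below. The helper names (is_valid, get_successors) are our own, and the actual implementation in the repository may differ.

from collections import deque

MOVES = [(1, 1), (2, 0), (0, 2), (1, 0), (0, 1)]  # (missionaries, cannibals) in the boat

def is_valid(m, c):
    # Cannibals may never outnumber missionaries on a bank that has missionaries.
    return (0 <= m <= 3 and 0 <= c <= 3 and
            (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c))

def get_successors(state):
    m, c, b = state
    sign = -1 if b == 1 else 1            # the boat carries people away from its current bank
    for dm, dc in MOVES:
        nm, nc = m + sign * dm, c + sign * dc
        if is_valid(nm, nc):
            yield (nm, nc, 1 - b)

def bfs(start=(3, 3, 1), goal=(0, 0, 0)):
    frontier = deque([(start, [start])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for nxt in get_successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

print(bfs())  # prints an optimal sequence of (M, C, B) states (11 crossings)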
B. Rabbit leap problem

In the rabbit leap problem, three east-bound rabbits stand in a line blocked by three west-bound rabbits. They are crossing a stream with stones placed in a line in the east-west direction, and there is one empty stone between them. The rabbits can only move forward one or two steps; they can jump over one rabbit if the need arises, but not more than that. We need to find out whether they can cross each other without stepping into the water.

(i) The initial state is represented by (E, E, E, -, W, W, W) (three east-bound rabbits on the left, three west-bound on the right, and one empty stone in the middle). The goal state is represented by (W, W, W, -, E, E, E). The search space can be around 7! arrangements, but only a subset of these is reachable through valid moves.

(ii) Using BFS, which is a level-order traversal, we search the states level by level until the goal state is reached. The solution is implemented in the code whose GitHub repository is attached at the end.

(iii) The DFS implementation explores deeper into the search space first and provides a sequence of states leading to the goal. However, it may not necessarily be optimal, as DFS can get stuck in deep paths without finding the shortest route.

(iv) The comparison between DFS and BFS remains the same as in the missionaries and cannibals problem, and the time and space complexity analysis carries over as well.
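For concreteness, one possible successor function for the tuple representation above is sketched below (our own helper, not necessarily the one used in the repository). We assume east-bound rabbits ('E') only move toward higher indices and west-bound rabbits ('W') only move toward lower indices, by one or two stones; since there is exactly one empty stone, a two-stone move always jumps over exactly one rabbit.

def rabbit_successors(state):
    # state is a tuple of 'E', 'W' and '-' of length 7
    successors = []
    gap = state.index('-')
    for src, animal in [(gap - 1, 'E'), (gap - 2, 'E'), (gap + 1, 'W'), (gap + 2, 'W')]:
        if 0 <= src < len(state) and state[src] == animal:
            new_state = list(state)
            new_state[gap], new_state[src] = new_state[src], '-'
            successors.append(tuple(new_state))
    return successors

print(rabbit_successors(('E', 'E', 'E', '-', 'W', 'W', 'W')))  # the four legal opening moves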
II. WEEK 2: TO DESIGN A GRAPH SEARCH AGENT AND UNDERSTAND THE USE OF A HASH TABLE, QUEUE IN STATE SPACE SEARCH.

A. In-lab Discussion: Puzzle-8 problem

Algorithm 1 Graph Search Algorithm
function GRAPHSEARCH(start state, goal state)
  Initialize the frontier as a queue with the start node
  Initialize an empty set for visited nodes: visited ← ∅
  while frontier is not empty do
    currentNode ← Remove the first node from frontier
    if currentNode is equal to goal state then
      return Solution found: path to goal
    end if
    if currentNode is not in visited then
      Add currentNode to visited
      for each successor of currentNode do
        if successor is not in visited and not in frontier then
          Add successor to frontier
        end if
      end for
    end if
  end while
  return No solution found
end function

The frontier holds the nodes that are yet to be explored; the data structure used for it determines the search strategy:
• BFS: Uses a queue (FIFO) to select nodes.
• DFS: Uses a stack (LIFO) to select nodes.
• A*: Uses a priority queue to select nodes based on the lowest cost (g + h).

A visited set (a hash table) stores already visited nodes to prevent revisiting them, ensuring the algorithm doesn't get stuck in cycles or redundant paths. For selecting nodes,
• For BFS, select the node from the front of the queue (FIFO).
• For DFS, select the node from the top of the stack (LIFO).
• For A*, select the node with the lowest cost (g + h) from the priority queue.

After selecting a node, we check if it's the goal node. If yes, the algorithm reconstructs the path by backtracking through the node's parents and returns the solution. The agent generates neighboring nodes from the current state. Successors are generated based on possible legal moves or transitions in the environment. These successors are added to the frontier unless they've already been visited or are already in the frontier. If the goal is reached, we backtrack from the goal node to the start node using parent pointers to reconstruct the solution path.

This section outlines the key functions used to simulate the environment for the Puzzle-8 problem, which involves sliding tiles to reach a goal state.

GOAL_STATE = [1, 2, 3, 4, 5, 6, 7, 8, 0]

1) Node Class: Represents the current state of the puzzle, its parent node, and the cost (number of moves) to reach that state.

class Node:
    def __init__(self, state, parent=None, g=0):
        self.state = state
        self.parent = parent
        self.g = g

2) Utility Functions:
• is_solvable(state): Checks if the puzzle is solvable; if it is not, the state is reshuffled until a solvable configuration is found.
• get_empty_tile_index(state): Returns the index of the empty tile (0) in the state.
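A compact Python sketch of the graph search agent described above is given below, here specialized to BFS with a FIFO queue, a hash table of visited states, and parent pointers for path reconstruction. The get_successors helper is an assumption for illustration: it returns the states reachable by sliding a tile into the empty position.

from collections import deque

def get_successors(state):
    # Slide a neighboring tile into the empty cell (value 0) of a 3x3 board.
    successors = []
    empty = state.index(0)
    row, col = divmod(empty, 3)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            swap = 3 * r + c
            new_state = list(state)
            new_state[empty], new_state[swap] = new_state[swap], new_state[empty]
            successors.append(tuple(new_state))
    return successors

def graph_search(start, goal):
    frontier = deque([tuple(start)])
    parent = {tuple(start): None}          # hash table: state -> parent (doubles as the visited set)
    while frontier:
        current = frontier.popleft()       # FIFO selection -> BFS
        if current == tuple(goal):
            path = []
            while current is not None:     # reconstruct the path via parent pointers
                path.append(current)
                current = parent[current]
            return path[::-1]
        for successor in get_successors(current):
            if successor not in parent:    # not visited and not already in the frontier
                parent[successor] = current
                frontier.append(successor)
    return None

GOAL_STATE = [1, 2, 3, 4, 5, 6, 7, 8, 0]
print(graph_search([1, 2, 3, 4, 5, 6, 0, 7, 8], GOAL_STATE))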
B. Plagiarism detection problem

Plagiarism detection plays a vital role in academic and content-driven fields, where it's essential to spot similarities between documents. A common method for tackling this challenge is treating it as a sequence alignment problem. The objective is to compare two texts by calculating the edit distance—how many insertions, deletions, or substitutions are needed to convert one text into another. We use the A* search algorithm to solve the problem. Given two textual documents, the problem is to find the alignment that minimizes the edit distance, indicating the level of similarity between the documents. A lower edit distance signifies a higher likelihood of plagiarism. The edit distance between two sequences (strings) is the minimum number of operations (insertions, deletions, and substitutions) needed to convert one sequence into the other. This can be computed using dynamic programming or search algorithms like A*. The initial state corresponds to the start of both documents. The goal state is reached when all sentences in both documents are aligned.

The plagiarism detection system uses the A* search algorithm to efficiently compute the edit distance between two documents. A* is a search algorithm that finds the least-cost path in a graph, making it suitable for sequence alignment tasks. In this case, the graph is represented by the various states of alignment between the two texts. g(n) represents the cost of aligning the two substrings up to the current point (i.e., the edit distance so far). h(n) is the heuristic estimate of the remaining cost to align the rest of the substrings. The system explores different alignments, guided by this cost function, and finds the alignment with the smallest total cost. The heuristic used is the minimum of the remaining lengths of the two strings, intended to keep the estimate optimistic (i.e., to avoid overestimating the remaining cost).

Successor Generation: For each state (i.e., partial alignment of the two strings), the algorithm generates successor states by considering three possible actions: inserting, deleting, or substituting a character.

Case 1:
Input:
doc1 = "This is so beautiful. It is great to hear."
doc2 = "This is so beautiful. It is great to hear."

Output:
Sentence from doc1           Sentence from doc2              Edit Distance
This is so beautiful         This is so beautiful            0
It is great to hear          It is great to hear             0

Case 2:
Input:
doc1 = "This is so beautiful. It is great to hear."
doc2 = "This is incredibly beautiful. It is wonderful to listen."

Output:
Sentence from doc1           Sentence from doc2              Edit Distance
This is so beautiful         This is incredibly beautiful    7
It is great to hear          It is wonderful to listen       7

Case 3:
Input:
doc1 = "The quick brown fox."
doc2 = "A cat sat."

Output:
Sentence from doc1           Sentence from doc2              Edit Distance
The quick brown fox          A cat sat                       23

Case 4:
Input:
doc1 = "The cat is on the roof."
doc2 = "The dog is on the roof."

Output:
Sentence from doc1           Sentence from doc2              Edit Distance
The cat is on the roof       The dog is on the roof          2

Let's understand what each case here means:
Case 1: Identical documents with zero edit distance.
Case 2: Slightly modified document with minor changes (synonyms, etc.) leading to a low edit distance.
Case 3: Completely different documents resulting in a high edit distance.
Case 4: Partial overlap where two sentences have some common words, resulting in a low edit distance.

And of course, you can check out the code from our GitHub repository.
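A minimal sketch of the A* alignment described above is shown below, here at the character level. State (i, j) means the first i characters of s1 and the first j characters of s2 have been aligned, and g is the edit cost so far. For the heuristic we use the difference of the remaining lengths, which never overestimates the remaining cost (the report itself uses the minimum of the remaining lengths); the function name astar_edit_distance is ours.

import heapq

def astar_edit_distance(s1, s2):
    def h(i, j):
        # Admissible estimate: at least this many insertions/deletions are still required.
        return abs((len(s1) - i) - (len(s2) - j))

    start = (0, 0)
    open_list = [(h(0, 0), 0, start)]            # entries are (f = g + h, g, state)
    best_g = {start: 0}
    while open_list:
        f, g, (i, j) = heapq.heappop(open_list)
        if (i, j) == (len(s1), len(s2)):
            return g                             # goal: both strings fully aligned
        if g > best_g.get((i, j), float("inf")):
            continue
        # Successors: substitution/match, deletion from s1, insertion from s2.
        moves = []
        if i < len(s1) and j < len(s2):
            moves.append(((i + 1, j + 1), 0 if s1[i] == s2[j] else 1))
        if i < len(s1):
            moves.append(((i + 1, j), 1))
        if j < len(s2):
            moves.append(((i, j + 1), 1))
        for state, cost in moves:
            ng = g + cost
            if ng < best_g.get(state, float("inf")):
                best_g[state] = ng
                heapq.heappush(open_list, (ng + h(*state), ng, state))
    return None

print(astar_edit_distance("kitten", "sitting"))  # 3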
III. WEEK 3: TO UNDERSTAND THE USE OF A HEURISTIC FUNCTION FOR REDUCING THE SIZE OF THE SEARCH SPACE. EXPLORE NON-CLASSICAL SEARCH ALGORITHMS FOR LARGE PROBLEMS.

A. Solving marble solitaire

Fig. 1. Initial configuration.

The initial configuration of the marble board, depicted in Fig. 1, features an arrangement where one space is vacant. The objective is to eliminate marbles by jumping over them, ultimately achieving a configuration with only one marble remaining at the center of the board.

1) Priority Queue-Based Search with Path Cost: In a priority queue-based search, each state is maintained in a queue that prioritizes nodes according to their path cost, i.e., the cumulative cost incurred to reach that node. A typical way to manage this is a min-heap, which gives efficient access to the state with the lowest cost.

2) Heuristic Functions with Justification: Heuristic Function 1: Count of Remaining Marbles. This heuristic simply tallies the number of marbles still present on the board. A lower count of remaining marbles indicates proximity to the goal state, thereby effectively guiding the search. Heuristic Function 2: Distance to Center Position. This heuristic computes the cumulative distance of all marbles from the central position. Since the aim is to leave a single marble at the center, minimizing the total distance of the marbles is likely to expedite reaching the goal.

3) Best-First Search Algorithm: The best-first search employs a priority queue to navigate nodes based on their heuristic values, concentrating on the most promising nodes first.

4) A* Algorithm: The A* algorithm integrates the actual cost incurred to reach a node with the heuristic estimate of the cost to reach the goal from that node.

5) Comparison of Results from Various Search Algorithms: Analysis:
• Priority Queue Search: This algorithm effectively explores states based on path costs but may not leverage heuristics optimally.
• Best-First Search: Typically more efficient than pure path-cost searches, yet it does not guarantee finding the optimal solution, as it neglects the cost to reach a node.
• A* Algorithm: Generally regarded as the most effective method for pathfinding problems, A* strikes a balance between cost and heuristic, thereby ensuring both optimality and completeness.
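The two heuristics above can be sketched as follows for a board stored as a 7x7 grid, where 1 marks a marble, 0 an empty hole, and -1 an invalid cell; the grid encoding and function names are assumptions made for illustration.

CENTER = (3, 3)  # centre of the standard 7x7 board

def h1_remaining_marbles(board):
    # Heuristic 1: number of marbles still on the board.
    return sum(cell == 1 for row in board for cell in row)

def h2_distance_to_center(board):
    # Heuristic 2: cumulative Manhattan distance of every marble from the centre.
    return sum(abs(r - CENTER[0]) + abs(c - CENTER[1])
               for r, row in enumerate(board)
               for c, cell in enumerate(row) if cell == 1)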
B. K-SAT problem

The k-SAT problem is a key challenge in computational theory that concerns the satisfiability of boolean formulas expressed in conjunctive normal form (CNF). This report discusses the implementation of a program aimed at generating uniform random k-SAT problems and evaluating the performance of various search algorithms used to solve them. The algorithms examined include Hill Climbing, Beam Search, and Variable Neighborhood Descent.

The objective is to randomly generate k-SAT problems based on the parameters k (the number of literals in each clause), m (the number of clauses), and n (the number of distinct variables). Each clause is designed to contain distinct variables or their negations, resulting in instances that fall under the category of fixed clause length models of SAT, known as uniform random k-SAT problems. To accomplish this, we implemented a function that creates m clauses, each containing k distinct variables. This function ensures that the variables are either presented in their positive form or negated at random.
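A small sketch of such a generator is given below (the actual function in the repository may differ): each clause samples k distinct variables out of n and negates each with probability 0.5, with a literal encoded as a signed integer.

import random

def generate_ksat(k, m, n):
    # Returns a list of m clauses; each clause is a list of k signed integers,
    # where +i means variable x_i and -i means its negation.
    clauses = []
    for _ in range(m):
        variables = random.sample(range(1, n + 1), k)
        clause = [v if random.random() < 0.5 else -v for v in variables]
        clauses.append(clause)
    return clauses

print(generate_ksat(3, 5, 4))  # e.g. a uniform random 3-SAT instance with 5 clauses over 4 variables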
We implemented three search algorithms to solve the generated k-SAT problems:
• Hill Climbing: This algorithm initiates with a random assignment of truth values to the variables. It evaluates the current solution and iteratively improves it by flipping the value of one variable at a time, consistently moving to the neighboring solution that provides the best increase in satisfaction.
• Beam Search: This algorithm maintains a fixed number of the best solutions, determined by a defined beam width, and explores the neighbors of each solution. It retains only the most promising solutions for further exploration.
• Variable Neighborhood Descent: This algorithm explores multiple neighborhoods, making local improvements iteratively until no further enhancements can be achieved.

To guide the search algorithms, we implemented two heuristic functions:
• Heuristic Function 1: Counts the number of satisfied clauses.
• Heuristic Function 2: Calculates the number of unsatisfied clauses.

1) Hill Climbing Algorithm:
1) Initialize a random solution (assignment of truth values to variables).
2) Evaluate the solution (count the number of satisfied clauses).
3) While true:
   a) Generate neighbors by flipping the value of each variable one at a time.
   b) Evaluate each neighbor and find the best neighbor.
   c) If the best neighbor is better than the current solution:
      i) Update the current solution to the best neighbor.
      ii) Update the current evaluation score.
   d) If no neighbor improves the solution, return the current solution.
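A possible Python sketch of the hill-climbing procedure above, using the number of satisfied clauses as the evaluation function and clauses in the signed-integer encoding from the generator sketch (helper names are ours):

import random

def count_satisfied(clauses, assignment):
    # assignment maps variable index -> bool; a clause is satisfied if any literal is true.
    return sum(any((assignment[abs(l)] if l > 0 else not assignment[abs(l)]) for l in clause)
               for clause in clauses)

def hill_climbing(clauses, n):
    assignment = {v: random.choice([True, False]) for v in range(1, n + 1)}
    score = count_satisfied(clauses, assignment)
    while True:
        best_var, best_score = None, score
        for v in assignment:                       # flip each variable, keep the best improvement
            assignment[v] = not assignment[v]
            s = count_satisfied(clauses, assignment)
            if s > best_score:
                best_var, best_score = v, s
            assignment[v] = not assignment[v]      # undo the flip
        if best_var is None:                       # no neighbor improves the solution
            return assignment, score
        assignment[best_var] = not assignment[best_var]
        score = best_score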
2) Beam Search Algorithm:
1) Initialize a list of solutions with a random solution (assignment of truth values).
2) While true:
   a) Create an empty list for new solutions.
   b) For each solution in the current list:
      i) Generate neighbors by flipping the value of each variable.
      ii) Add each neighbor to the new solutions list.
   c) Sort the new solutions based on their evaluation scores (number of satisfied clauses).
   d) Retain only the top w solutions (beam width).
   e) If the best solution satisfies all clauses, return it.

3) Variable Neighborhood Descent Algorithm:
1) Initialize a random solution (assignment of truth values to variables).
2) Evaluate the solution (count the number of satisfied clauses).
3) While true:
   a) Set a flag 'improved' to false.
   b) For each neighborhood:
      i) For each variable in the neighborhood:
         A) Generate a neighbor by flipping the variable's value.
         B) Evaluate the neighbor.
         C) If the neighbor is better than the current solution:
            D) Update the current solution to this neighbor.
            E) Update the evaluation score.
            F) Set 'improved' to true and break the loop.
      ii) If an improvement was made, break to start from the first neighborhood.
   c) If no improvement was made, return the current solution.

We conducted experiments with various combinations of m and n to assess the performance of the algorithms. Each algorithm was tested on several randomly generated k-SAT problems, measuring effectiveness based on the number of satisfied clauses.

We have outlined the methodology for generating uniform random 3-SAT problems and evaluating different search algorithms. The implementation of Hill Climbing, Beam Search, and Variable Neighborhood Descent, coupled with heuristic evaluations, provides insights into the effectiveness of each approach. The outputs from the experiments reveal critical performance metrics that can be analyzed to determine the most effective strategy for solving k-SAT problems.
IV. WEEK 4: NON-DETERMINISTIC SEARCH — SIMULATED ANNEALING. FOR PROBLEMS WITH LARGE SEARCH SPACES, RANDOMIZED SEARCH BECOMES A MEANINGFUL OPTION GIVEN PARTIAL/FULL INFORMATION ABOUT THE DOMAIN.

A. Traveling Salesman Problem

The Traveling Salesman Problem (TSP) is a well-known NP-hard problem. It involves determining the shortest possible route that visits each city exactly once and returns to the starting point, given a graph where nodes represent cities and edges represent the travel cost between them.

We implemented Simulated Annealing to solve the TSP with 20 tourist destinations across Rajasthan. This algorithm efficiently improves the route by iteratively exploring neighboring solutions and either accepting improvements or, occasionally, worse solutions to avoid getting stuck in local optima. The 20 tourist spots include: 1. Jaipur (Amber Fort) 2. Jaisalmer (Jaisalmer Fort) 3. Udaipur (City Palace) 4. Jodhpur (Mehrangarh Fort) 5. Mount Abu (Dilwara Temples) 6. Bikaner (Junagarh Fort) 7. Ajmer (Dargah Sharif) 8. Pushkar (Brahma Temple) 9. Ranthambore (National Park) 10. Alwar (Sariska Tiger Reserve) 11. Bundi (Taragarh Fort) 12. Chittorgarh (Chittorgarh Fort) 13. Bharatpur (Keoladeo National Park) 14. Kota (Seven Wonders Park) 15. Shekhawati (Frescoes) 16. Kumbhalgarh (Kumbhalgarh Fort) 17. Jhalawar (Jhalawar Fort) 18. Barmer (Barmer Fort) 19. Sikar (Khatu Shyam Ji) 20. Nathdwara (Shrinathji Temple)

Key steps in the process include:

Cost Calculation: The calculate_cost() function computes the total distance by summing up the travel distances between consecutive cities in the tour, and then adds the return distance to the starting city.

Neighbor Generation: The generate_neighbor() function randomly swaps two cities in the current tour, providing a new neighboring solution.

Acceptance Criterion: The acceptance rule follows the standard Simulated Annealing procedure, where new solutions are accepted if they reduce the total cost. Worse solutions are accepted probabilistically, depending on the temperature.

Cooling Schedule: The temperature decreases with each iteration using a cooling rate. We chose a rate of 0.995 to allow slow cooling and thorough exploration of the solution space.
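The core loop implied by these steps can be sketched as follows; calculate_cost() and generate_neighbor() match the descriptions above, while the distance matrix, initial temperature, and iteration count are placeholder assumptions.

import math
import random

def calculate_cost(tour, dist):
    # Total length of the tour, including the return leg to the starting city.
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def generate_neighbor(tour):
    # Swap two randomly chosen cities.
    i, j = random.sample(range(len(tour)), 2)
    neighbor = tour[:]
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor

def simulated_annealing(dist, temperature=10000.0, cooling_rate=0.995, iterations=50000):
    current = list(range(len(dist)))
    random.shuffle(current)                       # random initial tour
    current_cost = calculate_cost(current, dist)
    best, best_cost = current[:], current_cost
    for _ in range(iterations):
        candidate = generate_neighbor(current)
        candidate_cost = calculate_cost(candidate, dist)
        delta = candidate_cost - current_cost
        # Accept improvements always; accept worse tours with probability exp(-delta/T).
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        temperature *= cooling_rate
    return best, best_cost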
Additional steps involved:
1. Distance Matrix: A symmetric matrix representing the distances between all pairs of tourist spots, using real-world data.
2. Initial Solution: A random permutation of the cities forms the starting tour.
3. Neighbor Generation: The algorithm explores neighboring tours by swapping two cities in the current tour.
4. Acceptance Criterion: Even when a new tour has a higher cost, it may still be accepted based on the current temperature to avoid local optima.
5. Cooling Schedule: As iterations progress, the temperature reduces, decreasing the likelihood of accepting worse solutions.
6. Output: The best tour and its associated cost are reported.

The algorithm parameters include:
• Initial Temperature: Controls the likelihood of accepting worse solutions at the start.
• Cooling Rate: Governs the rate at which the temperature decreases.
• Number of Iterations: Determines how many steps the algorithm performs before termination.

Results: After running the algorithm, we obtained an optimized tour of Rajasthan's tourist spots. The initial tour had a cost of 8305, which reduced to 4940 after several iterations, showing a successful optimization process. The final results include both the best tour order and its associated cost:
• Best Tour: The order in which the cities are visited.
• Best Cost: The total distance of the optimal tour.

For the VLSI-based Traveling Salesman Problems (TSP), we used the same code for the 5 problems; we only changed the file paths for the different datasets.

B. Problem Results

1) Problem 1: XQF131 (131 Points):
• This is a tour for the XQF131 VLSI instance. It has length 564.

2) Problem 2: XQG237 (237 Points):
• Best tour length: 1277.551
• Optimal tour length: 1019
• Tour: 153, 152, 151, 150, 149, . . . , 221
• Graphical Output:
• This is a tour for the XQG237 VLSI instance. It has length 1019.
Hopfield networks, a form of Recurrent Neural Networks (RNNs) introduced by John Hopfield in 1982, represent the first instance of associative neural networks capable of producing emergent associative memory. Associative memory allows retrieval and completion of a memory using incomplete or noisy stimuli, akin to recalling a memory triggered by hearing a familiar song. Hopfield networks operate with nodes connected by symmetric weights, each node updating its state based on the states of the others.

PROBLEM 1: EIGHT-ROOK PROBLEM

An energy function with suitably chosen weights penalizes placements that violate the one-rook-per-row and one-rook-per-column constraints; for a valid placement of the eight rooks the energy is zero, indicating a solution.

PROBLEM 2: PATTERN RECOGNITION AND RECONSTRUCTION

We implemented a Hopfield network to recognize and reconstruct patterns.

Approach
• Four patterns (D, J, C, M) were defined as matrices and stored in an array X.
• Hebb's rule was applied to compute the weight matrix W based on the outer product of each pattern with itself.
• A noisy starting pattern was iteratively updated until convergence, determined by the difference between successive iterations falling below a predefined threshold.
Error Tolerance
• 25 random patterns were generated.
• Noise levels were introduced, and the average Hamming distance was computed between original and retrieved patterns.
• Results quantified the network's robustness to noise, with higher error tolerance indicating better performance.
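A minimal sketch of Hebbian storage and asynchronous retrieval consistent with the approach above, for patterns flattened to ±1-valued vectors (the array name X follows the description; the details of the update loop are our assumption):

import numpy as np

def train_hebbian(X):
    # X: array of shape (num_patterns, N) with entries in {-1, +1}.
    num_patterns, N = X.shape
    W = np.zeros((N, N))
    for p in X:
        W += np.outer(p, p)        # Hebb's rule: sum of outer products
    np.fill_diagonal(W, 0)          # no self-connections
    return W / N

def recall(W, pattern, max_iters=100):
    state = pattern.copy()
    for _ in range(max_iters):
        previous = state.copy()
        for i in np.random.permutation(len(state)):   # asynchronous updates
            state[i] = 1 if W[i] @ state >= 0 else -1
        if np.array_equal(state, previous):            # converged: no change in a full sweep
            return state
    return state

Increasing the noise on the starting pattern and measuring the average Hamming distance between recalled and stored patterns gives the error-tolerance figures described above.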
PROBLEM 3: TRAVELING SALESMAN PROBLEM (TSP)

The Traveling Salesman Problem (TSP) is a combinatorial optimization problem where the goal is to find the shortest possible route that visits each city exactly once and returns to the original city. In this script, the TSP is approached using a neural network paradigm known as the Hopfield network. The script first defines a set of points representing city coordinates and generates permutations of these points. Then, it employs a customized energy calculation function within the framework of a Hopfield network to compute the energy associated with each permutation, considering distances between cities.

By iteratively updating the network's state to minimize energy, akin to converging to a stable state in the network's dynamics, the script aims to find the permutation (route) with the minimum energy, which corresponds to the optimal TSP route. Through this iterative process of minimizing energy, the Hopfield network attempts to solve the TSP by converging towards the optimal solution represented by the permutation with the lowest energy.
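The energy expression itself did not survive extraction. For reference, a standard Hopfield-Tank energy for an n-city tour, which the formulation described here appears to follow, is reproduced below as an assumption; x_{i,p} = 1 when city i occupies position p in the tour, d_{ij} is the inter-city distance, position indices are taken modulo n, and A, B, C, D are penalty weights.

E(x) = \frac{A}{2}\sum_{i}\sum_{p \neq q} x_{i,p}\, x_{i,q}
     + \frac{B}{2}\sum_{p}\sum_{i \neq j} x_{i,p}\, x_{j,p}
     + \frac{C}{2}\Big(\sum_{i}\sum_{p} x_{i,p} - n\Big)^{2}
     + \frac{D}{2}\sum_{i}\sum_{j \neq i}\sum_{p} d_{ij}\, x_{i,p}\,\big(x_{j,p+1} + x_{j,p-1}\big)

The first three terms vanish exactly when x encodes a valid permutation of the cities, and the last term then measures the length of the encoded tour, so minimizing E(x) favours short, valid tours.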
The above energy function E(x) is used within the framework of a Hopfield network to solve the Traveling Salesman Problem (TSP). This function penalizes configurations violating TSP constraints, like revisiting or skipping cities. By iterating through permutations of cities, the code computes the energy for each permutation, selecting the one with minimum energy as the optimal route.

The Hopfield network iteratively adjusts neuron states to minimize this energy, converging to a stable minimum corresponding to the optimal route. This iterative process continues until convergence, gradually exploring permutations. Finally, the optimal route is extracted from the stable state of the network, indicating the order in which to visit the cities so as to minimize the total distance traveled.

CONCLUSION

The Hopfield network successfully addressed a variety of combinatorial problems:
• Associative memory retrieval and reconstruction.
• Constraint satisfaction in the 8-rook problem.
• Approximate solutions for TSP using energy minimization.

VI. WEEK 7: UNDERSTANDING EXPLOITATION-EXPLORATION IN SIMPLE N-ARM BANDIT REINFORCEMENT LEARNING TASK, EPSILON-GREEDY ALGORITHM

We have tried to understand the concepts involved in reinforcement learning and to implement the epsilon-greedy algorithm for binary and multi-armed bandit problems. Another objective was to understand the Matchbox Educable Naughts and Crosses Engine (MENACE) developed by Donald Michie. We implemented a binary bandit using a stochastic reward system and developed a 10-armed bandit with non-stationary rewards to observe how an agent using a modified epsilon-greedy approach adapts in dynamic environments.

Reinforcement learning (RL) is an area of machine learning where agents learn to make decisions by interacting with their environment, aiming to maximize cumulative rewards. Our focus lies in solving simple decision-making tasks using the epsilon-greedy algorithm, which balances exploration and exploitation. We analyze the MENACE system, an early RL-based game-playing engine for naughts and crosses, and implement bandit problems, a common testbed for studying RL algorithms. In the binary bandit problem, the agent chooses between two actions, each producing rewards that follow a probabilistic pattern. We apply the epsilon-greedy algorithm to balance exploration and exploitation, aiming to optimize the overall reward. Furthermore, we tackle a 10-armed bandit problem with non-stationary rewards, where the expected value of each action shifts over time. This creates the need for adjusting standard RL strategies to better track these changes. Through these implementations, we gain valuable insights into important RL principles, such as managing the exploration-exploitation balance and adapting to dynamic reward environments.

A. P1: MENACE (Matchbox Educable Naughts and Crosses Engine)

In MENACE, each game state is represented by a unique configuration of the tic-tac-toe board, with a matchbox dedicated to each specific state. This matchbox holds the beads that signify the possible actions available to the agent. The selection of actions is determined by the beads within the matchbox: the number of beads allocated to each action reflects its probability of being chosen, and a greater bead count corresponds to a higher likelihood of selecting that action. After each game, the system updates the matchboxes according to the outcome. This reinforcement process strengthens the likelihood of selecting winning actions while decreasing the chances of choosing those that led to losses. The implementation also emphasizes several critical aspects. Initially, the system assigns an initial count of beads for each possible action across all game states. Following the game, the agent updates the bead counts based on the results, ensuring that successful actions receive reinforcement. Furthermore, MENACE incorporates an exploration-exploitation tradeoff; it allows the agent to discover new actions by randomly selecting beads while simultaneously exploiting previously successful actions by reinforcing the beads associated with those actions.
B. P2: Binary Bandit with Epsilon-Greedy Algorithm

The epsilon-greedy algorithm is a fundamental approach for addressing the N-armed bandit problem, where the goal is to select actions that maximize long-term rewards. Picture having two slot machines (Bandit A and Bandit B) with unknown payout probabilities. This algorithm aids in deciding which machine to play to maximize winnings. The challenge is finding the right balance between exploration—trying out both machines—and exploitation—focusing on the machine that appears to yield better results. The epsilon-greedy algorithm addresses this balance using an epsilon (ε) parameter. With probability ε, the agent opts for a random machine, allowing for exploration. In contrast, with probability (1 − ε), the agent chooses the machine that has the highest estimated average reward based on previous outcomes. The estimated average reward serves as a historical record of each machine's performance. Initially, both machines are assigned an equal average reward (often set at 0.5, reflecting no prior knowledge). As the agent plays, the algorithm updates these average rewards based on the actual results (1 for a win and 0 for a loss). Over time, the machine with more wins will accumulate a higher estimated average reward, increasing its likelihood of being chosen for exploitation in future rounds. By strategically adjusting the exploration factor (ε) and continuously updating the estimated rewards through gameplay, the epsilon-greedy algorithm strives to identify the machine that consistently performs better, thereby maximizing long-term winnings.
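A short sketch of this scheme for the two-armed (binary) bandit is shown below; the payout probabilities of the two bandits and ε = 0.1 are illustrative assumptions, not the values used in the experiments.

import random

def binary_bandit_epsilon_greedy(probs=(0.4, 0.7), epsilon=0.1, steps=1000):
    estimates = [0.5, 0.5]              # initial estimated average reward for each machine
    counts = [0, 0]
    total_reward = 0
    for _ in range(steps):
        if random.random() < epsilon:                   # explore with probability epsilon
            action = random.randrange(2)
        else:                                           # otherwise exploit the best estimate
            action = 0 if estimates[0] >= estimates[1] else 1
        reward = 1 if random.random() < probs[action] else 0   # stochastic 0/1 reward
        counts[action] += 1
        # incremental sample-average update of the estimate
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward

print(binary_bandit_epsilon_greedy())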
C. P3: 10-armed bandit in which all ten mean rewards start out equal and then take independent random walks

Let us say you're in a casino with ten slot machines instead of just two, and this time, the payout rates for each machine are not fixed—they change over time. This scenario presents a challenge known as non-stationary rewards. The epsilon-greedy algorithm we discussed earlier struggles to adapt to this situation. In the previous case, the payout rates were constant, allowing the algorithm to learn which machine was more rewarding based on past experiences. You may have noticed patterns, such as machine A providing more wins than machine B. However, in this new context, the situation is quite different. The machine that performs well today may not necessarily be the best choice tomorrow. For instance, machine A might be winning today, but tomorrow, machine C could take the lead! This is where the standard epsilon-greedy algorithm falls short. It relies on estimated average rewards derived from past outcomes. In a stationary environment, these estimated rewards serve as reliable indicators of future performance. However, in a non-stationary setting, past wins lose their significance as the reward dynamics continuously shift. Consequently, the algorithm may continue to exploit a machine that was once profitable but is no longer the top performer, resulting in missed opportunities and decreased overall winnings. To illustrate, think of dining at a restaurant with ten different dishes, each with varying tastes (the rewards). If the flavors of each dish change frequently, the epsilon-greedy algorithm would be akin to sticking with a dish you enjoyed yesterday, even if it's not the best choice today. This approach causes you to overlook the possibility of discovering other dishes that might be more appealing based on the current "menu."

D. P4: Evaluating a Modified Epsilon-Greedy Algorithm in a Non-Stationary 10-Armed Bandit Environment

Imagine returning to the casino with ten slot machines, this time equipped with a special tool—the modified epsilon-greedy algorithm! This enhanced strategy helps you tackle the fluctuating payout rates (non-stationary rewards) that posed challenges for the standard epsilon-greedy algorithm. A crucial element of this new approach is the forgetting factor (α), a value that ranges between 0 and 1. You can think of it as a dial that determines how much importance is given to past wins and losses when selecting which machine to play next. When the forgetting factor is set high (closer to 1), the modified algorithm emphasizes recent outcomes. Similar to how you would favor a machine that just paid out generously, the algorithm concentrates on actions that have yielded high rewards recently. This responsiveness enables it to quickly adapt to shifts in the reward environment. On the other hand, a low α (closer to 0) gives greater importance to historical wins, akin to the basic algorithm's approach. The exciting aspect of this modified algorithm is that it updates the estimated value of each machine (or arm) using a formula that incorporates the forgetting factor rather than just averaging the old estimate with the new reward. The formula is as follows:

NewCalcVal = (1 − α) × OldCalcVal + α × NewReward

This equation takes into account previous performance while allowing the new reward to have a stronger influence, depending on the value of α. A higher α places more emphasis on the new reward, enabling the estimated value to adapt more rapidly to changes in the actual payout of the machine.
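In code, the only change from the sample-average update used in P2 is the constant step size; a small sketch of the update (variable names ours) is:

def update_estimate(old_value, reward, alpha=0.1):
    # Exponential recency-weighted average: NewCalcVal = (1 - alpha) * OldCalcVal + alpha * NewReward.
    return (1 - alpha) * old_value + alpha * reward

# Example: an arm estimated at 0.5 that just paid out 1.0 moves to 0.55 with alpha = 0.1.
print(update_estimate(0.5, 1.0))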
VII. WEEK 8: UNDERSTAND THE PROCESS OF SEQUENTIAL DECISION MAKING (STOCHASTIC ENVIRONMENT) AND THE CONNECTION WITH REINFORCEMENT LEARNING

PROBLEM FOR IN-LAB DISCUSSION

Suppose that an agent is situated in a 4 × 3 environment as shown in Figure 1. Beginning in the start state, it must choose an action at each time step. The interaction with the environment terminates when the agent reaches one of the goal states, marked +1 or −1. The environment is fully observable, so the agent always knows where it is.
The agent can take the following actions in each state: Up, Down, Left, and Right. However, the environment is stochastic: the action taken achieves the intended effect with a probability of 0.8, but the rest of the time the agent moves at right angles to the intended direction with equal probabilities. If the agent bumps into a wall, it stays in the same square.

Rewards:
• Moving to any state (except terminal states): r(s) = −0.04.
• Moving to terminal states: r(s) = +1 or −1 respectively.

Task: Use Value Iteration to find the value function corresponding to the optimal policy. Repeat this for the following reward structures:
• r(s) = −2
• r(s) = 0.1
• r(s) = 0.02
• r(s) = 1
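A compact value iteration sketch for this grid world is given below; the grid layout (a wall at (1, 1), terminal states in the rightmost column), the discount factor, and the convergence threshold are assumptions made for illustration.

def value_iteration(step_reward=-0.04, gamma=0.99, theta=1e-6):
    rows, cols = 3, 4
    walls = {(1, 1)}
    terminals = {(0, 3): 1.0, (1, 3): -1.0}           # (row, col): terminal reward
    actions = {'Up': (-1, 0), 'Down': (1, 0), 'Left': (0, -1), 'Right': (0, 1)}
    perpendicular = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                     'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}
    states = [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in walls]

    def move(s, a):
        r, c = s[0] + actions[a][0], s[1] + actions[a][1]
        # Bumping into a wall or the boundary leaves the agent where it was.
        return (r, c) if 0 <= r < rows and 0 <= c < cols and (r, c) not in walls else s

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                V_new = terminals[s]
            else:
                best = float('-inf')
                for a in actions:
                    # Intended move with probability 0.8, each perpendicular move with 0.1.
                    outcomes = [(0.8, move(s, a))] + \
                               [(0.1, move(s, side)) for side in perpendicular[a]]
                    q = sum(p * (step_reward + gamma * V[s2]) for p, s2 in outcomes)
                    best = max(best, q)
                V_new = best
            delta = max(delta, abs(V_new - V[s]))
            V[s] = V_new
        if delta < theta:
            return V

print(value_iteration())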
PROBLEM DESCRIPTION

The Gbike bicycle rental problem involves managing two locations where bicycles can be rented. The goal is to maximize daily profit while accounting for costs associated with moving bikes between locations and managing bike requests and returns. The problem is modeled as a Markov Decision Process (MDP).

• Reward Function: The immediate reward is calculated as:

R(s, a) = Rental Revenue − Moving Cost − Parking Cost

• Discount Factor: γ = 0.9 is used to weigh future rewards.

Steps in Solving the Problem

We solve the problem using the Policy Iteration algorithm. The algorithm alternates between policy evaluation and policy improvement.

1. Policy Evaluation: In this step, we evaluate the current policy π by solving the Bellman Expectation Equation:

V(s) = R(s, π(s)) + γ Σ_{s′} P(s′ | s, π(s)) V(s′)

Here, V(s) is the value of state s, R(s, π(s)) is the immediate reward for taking action π(s) in state s, and P(s′ | s, π(s)) is the probability of transitioning to state s′ from s.

2. Policy Improvement: In this step, we improve the policy by choosing actions that maximize the expected return for each state:

π′(s) = arg max_a [ R(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]
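The alternation described above can be sketched generically as follows; here states and actions are lists, while the transition probabilities P and rewards R are assumed to be dictionaries precomputed for the Gbike problem (e.g., using truncated Poisson distributions), which is not shown.

def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-4):
    # P[(s, a)] is a list of (probability, next_state); R[(s, a)] is the immediate reward.
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in P[(s, a)])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in P[(s, a)]))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V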
The complete code for all four labs can be found at the following link: https://github.com/Lalwaniamisha789/CS-307-Lab-Report.git.
X. S OFTWARE U SED