
Science Extension Portfolio

The document outlines a research portfolio focused on using machine learning to investigate the relationship between copolymer structure and antimicrobial activity against specific bacteria. The student explores various interests, ultimately choosing to study antibacterial copolymers due to rising antibiotic resistance and the potential of machine learning in drug discovery. The document details the student's journey, including mentorship applications, relevant literature, and initial research findings in the field of antimicrobial peptides and polymers.

Uploaded by

ethan.hider

Research Portfolio

Beyond Antibiotics: A machine learning guided investigation of the relationship between
copolymer structure and antimicrobial activity against Pseudomonas aeruginosa and
Mycobacterium smegmatis.

Student Number: 36606827

23/10/23
Initial brainstorm:
- Rising antibiotic resistance
- Use of machine learning
- Drug research
- Neurology
- Psychiatry
- Automotive Aerodynamics
- Delving into religious medical practices combined with modern medicine
- Astronomy
- Biology (medical related)
- Chemistry (medical related)

These are general areas that interest me. Although extremely broad, they are still a
starting point.

26/10/23
Two videos appeared in my YouTube recommendations:

https://www.youtube.com/watch?v=gg7WjuFs8F4&ab_channel=GoogleDeepMind – This
video by Google DeepMind, an artificial intelligence research laboratory owned by Google,
explains why understanding protein folding matters and how AI has become an essential tool
for predicting protein structure. It follows the DeepMind team through the CASP competition
and their work on predicting the SARS-CoV-2 proteins, presenting the AlphaFold system as a
huge breakthrough for drug discovery and disease understanding.

https://www.youtube.com/watch?v=gVzPMZqOTo4&ab_channel=SciShow – Similar to the
previous video, SciShow covers the protein folding problem: because there are so many
possible folds, determining a protein's structure is very difficult. The video then discusses
the potential impact AI may have in medicine and environmental science. If AI makes it
possible to create new proteins and put them to use, it could bring huge advancements for
society in many areas and a greater understanding of the nature of life.

These two videos sparked an area of interest I want to pursue, one that combines computer
science with biology and chemistry. Hence, I looked into machine learning and its possible
uses within scientific research:

https://vial.com/blog/articles/the-role-of-machine-learning-in-drug-design-advancements-and-challenges/?utm_source=organic

According to the website, the three main tasks machine learning can be used for in
drug discovery and development are:

I came across the following paper: “Artificial intelligence: A powerful paradigm for
scientific research” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8633405/).

This paper explains how machine learning (ML), a subset of artificial intelligence, can
be applied across multidisciplinary fields involving technology and science. ML techniques
allow scientists to examine and analyse experimental data more efficiently and effectively.
Furthermore, ML models can be used for predicting outcomes, categorising data and sorting
it according to its relevance.

Hence, I want to narrow my area of interest to using machine learning for medical
research, since my interests lie within the medical field.

27/10/23
My Science Extension teacher advised me to look into mentoring. This is recommended
throughout the research project, and a mentor will be able to guide me and help me find an
area of research. With a Google search I narrowed my mentorship options to:
- UNSW SciX
- Sydney University
- UTS Mentor Science+

Because of when I began applying, I could only apply for UNSW SciX and the Sydney
University program.

Unlike the other mentoring programs, UNSW SciX is heavily structured and has set topics
that students can choose to pursue; the following image shows the possible projects to
choose from. Below is the structure of the UNSW SciX program:

Looking at the projects offered by UNSW, my areas of interest were:
- Unravelling Antibacterial Polymers with Machine Learning
- Mapping the brain
- Tracing viral outbreaks
- RNA in disease

While I found all of them interesting, the first one, “Unravelling Antibacterial Polymers
with Machine Learning”, aligned with the research I had already conducted and combined my
interests in computer science, biology and chemistry.

I decided to pursue this topic over the others since I was already fairly familiar with the
area it covers: using machine learning techniques to research and understand the
structure–function relationship of copolymers, specifically the antibacterial properties
they exhibit. This research is needed because rising antibiotic resistance calls for novel
alternative therapies, which antibacterial copolymers could potentially provide.

3/11/2023
I found an article of interest that was somewhat related to my research topic:

Sowers, A., Wang, G., Xing, M., & Li, B. (2023). Advances in Antimicrobial Peptide
Discovery via Machine Learning and Delivery via Nanotechnology. Microorganisms, 11(5),
1129. https://doi.org/10.3390/microorganisms11051129

It discusses the potential of antimicrobial peptides (AMPs) as an alternative to
antibiotics, which is required because of increasing antibiotic resistance. However,
AMPs have many limitations due to “their haemolytic activity, bioavailability,
degradation from proteolytic enzymes, and high-cost production.” Hence, the article covers
nanotechnology approaches for AMP delivery and the use of machine learning in
predicting AMPs with minimal toxicity. Furthermore, it covers AMPs and
their relationship with human diseases, including:
- Role in respiratory diseases
- Role in autoimmune diseases
- Role in Cancer
- Role in Cardiovascular Diseases
- Role in Neurodegenerative diseases

From my current research I had multiple questions specific to my interests (these
questions are rough and subject to change):
- Which machine learning model is best at predicting the AMP structure with the best
antimicrobial activity?
- Which AMPs are most effective against different types of diseases?
- How can we develop, produce and apply AMPs for commercial use while ensuring they are
cheap, reliable and equally accessible to all?

8/11/2023
Unfortunately, my desired project was fully taken, so I emailed the organisers to see
whether they could get me a spot in it.

12/11/2023
Thanks to the UNSW sciX director, I was able to get my desired project.

14/11/2023
Confirmation of entry

A one-pager summarising the whole project was attached with the confirmation of entry.

15/11/2023
As part of the UNSW SciX program, pre-work has been allocated that must be completed as a
prerequisite for this project: three publications to read and Python pre-work. The
following image shows this:

1/12/23 – 6/1/24
I have been learning Python intensively every day with the DataCamp courses provided by
UNSW.

For my project, I will not be going beyond supervised learning; hence, it is unnecessary
for me to learn unsupervised machine learning with Python.

7/1/2024
Within the Microsoft Teams channel there are three publications that all students doing the
same project as me must read and understand.

The first publication is called: High-Throughput Synthesis of Antimicrobial Copolymers and
Rapid Evaluation of Their Bioactivity.

Judzewitsch, P. R., Zhao, L., Wong, E. H. H., & Boyer, C. (2019, May 17). High-Throughput
Synthesis of Antimicrobial Copolymers and Rapid Evaluation of Their Bioactivity.
Macromolecules, 52(11), 3975–3986. https://doi.org/10.1021/acs.macromol.9b00290

Summary of the article:

- The aim of this experiment was to use high-throughput PET-RAFT (photoinduced
electron transfer reversible addition-fragmentation chain transfer) polymerisation to
test different combinations of seven monomers against three bacterial species:
Pseudomonas aeruginosa (Gram-negative), Staphylococcus aureus (Gram-positive) and
Mycobacterium smegmatis.

The seven monomers in question:


- Cationic
• Boc – Aem (Library 1)
• Boc – AEA
• DMAEA (Library 2)
• AAPTAC (Library 3)
- Hydrophobic
• PEAm
• NIPAM
• EHA
- Hydrophilic:
• Heam
• PEGA
• DEGA

- The article addressed the nature of antimicrobial copolymers, linking them to
antimicrobial peptides. Antimicrobial peptides are a promising alternative to
antibiotics; however, because of their high cost, protease degradation and
haemolytic activity, research has turned to antimicrobial copolymers. New
polymerisation techniques make these viable substances that mimic the activity of
antimicrobial peptides without the same drawbacks.
- Antimicrobial peptides work because their “positive charges are attracted to the
exposed negative phospholipid head group of the bacterial cell wall, while the
hydrophobic amino acid residues insert into and disrupt the phospholipid bilayer. This
causes loss of integrity in the cell wall and therefore intracellular fluid leakage, cell
lysis, and finally, death.”
- Antimicrobial copolymers work in the same way; according to this paper, the key factors
are the charge of the polymer and the type of membrane, which depends on the
bacterial species.

- First library copolymers were found to be most effective against P. aeruginosa, while
third library copolymers were found to be most effective against M. smegmatis.
- There was great success against M. smegmatis, and further optimisation and research
were recommended, because the functionality of these copolymers mimics amino
acids in antimicrobial peptides.
- The more strongly positive first library and third library copolymers showed the most
promising antimicrobial activity. This reinforces why the charge of the polymer and the
type of bacterial cell membrane are extremely important. Furthermore, the hydrophobicity
of the copolymer is important.
- The copolymers had no success against the Gram-positive bacterium, which the paper
attributes to the copolymers being positively charged.

From this paper, the biggest takeaway is that hydrophobicity, charge, size and length, and
the arrangement of the functional groups have a huge impact on antimicrobial activity.

8/1/2024
The second publication is called: Towards Sequence-Controlled Antimicrobial Polymers:
Effect of Polymer Block Order on Antimicrobial Activity

Reference: Judzewitsch, P. R., Nguyen, T. K., Shanmugam, S., Wong, E. H. H., & Boyer, C.
(2018). Towards Sequence-Controlled Antimicrobial Polymers: Effect of Polymer Block
Order on Antimicrobial Activity. Angewandte Chemie (International ed. in English), 57(17),
4559–4564. https://doi.org/10.1002/anie.201713036

- The multiblock copolymers generally had a 50%:20%:30% (cationic : hydrophobic :
hydrophilic) composition, and an all-acrylamide system of monomers was used:
• tert-Butyl (2-acrylamidoethyl) carbamate (Boc-AEAm, monomer A)
• 2-Phenylethyl acrylamide (PEAm, monomer B)
• 2-Hydroxyethyl acrylamide (HEAm, monomer C)

- These were chosen to mimic the cationic, hydrophobic and hydrophilic functionalities of
the amino acids lysine, phenylalanine and serine respectively. This paper and
previous papers have found that the antimicrobial and haemolytic activity of the
polymers could be tuned by varying the monomer/polymer block order.
- A section of the paper summarises the results and takeaways from this paper being:
“This study has shown that bacteria genus specificity can be tuned simply via the
order of polymer blocks and to some extent via the combined modulation of polymer
chain length. Manipulating blocks to contain localised cationic segments coupled with
amphiphilic sections showed specific action against P.aeruginosa. Furthermore,
antimicrobial activity and haemolytic activity are dependent on distribution of
monomers within blocks. Indeed, the localised ratio of hydrophobic to hydrophilic
functional groups within amphiphilic sections appears to be a critical factor to
influence biocompatibility as well as antimicrobial activity. This shows that tailoring
individual block structures rather than global compositions may yield more specific
biological outcomes.”
- This paper has shown some potential independent variables I can use when
conducting my own research.
- So far, no papers have shown efficacy against Gram-positive bacteria, and this can
be attributed to the structure of the cell membrane and the charge of the copolymers
themselves.

The third publication is called: High-Throughput Process for the Discovery of Antimicrobial
Polymers and Their Upscaled Production via Flow Polymerisation.

Reference: Judzewitsch, P. R., Corrigan, N., Trujillo, F., Xu, J., Moad, G., Hawker, C. J.,
Wong, E. H. H., & Boyer, C. (2020). High-Throughput Process for the Discovery of
Antimicrobial Polymers and Their Upscaled Production via Flow Polymerisation.
Macromolecules, 53(2), 631–639. https://doi.org/10.1021/acs.macromol.9b02207

This paper used PET-RAFT like the other papers, synthesising a library of potential
antimicrobial polymers against Pseudomonas aeruginosa, and then performed a
structure–property analysis on the results.

Note: For the time being, I have concluded that this structure–property analysis is the
basis of my own research project.

Some statements quoted from the paper that are also key takeaways from the paper:
- “In these studies, the chain length, block order, and monomer structure for a library of
tailored copolymers was shown to systematically impact their antimicrobial efficacy.”

- “Block copolymers with (cationic-ran-hydrophobic)-block-hydrophilic
structures often showed enhanced antimicrobial activity compared to random
cationic-ran-hydrophobic-ran-hydrophilic terpolymers with identical monomer molar
ratios and overall chain length.”

Across these three papers it is evident that fine-tuning hydrophobicity/hydrophilicity,
controlling monomer sequencing and the ratios between monomers, and using monomers
suited to each bacterium are necessary for effective antimicrobial mechanisms that are
comparable to the mechanisms of antimicrobial peptides.

15/1/2024
Today was my first day at the UNSW SciX program. We went through the basics of polymers
and machine learning, and began our research project by cleaning the data, which had been
collected by experimentalists. This data was a library of 160 polymers, each with a
different composition, specifically their cationic : hydrophobic : hydrophilic ratios and
their block sequences. In general, changes to the structure and functionalities of the
polymers were the feature variables, and the target variable was the MIC against the
different bacterial genera.
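
A minimal sketch of what a cleaning step like this could look like, assuming the library is held in a pandas DataFrame. The column names (`cationic_pct`, `hydrophobic_pct`, `hydrophilic_pct`, `MIC_PA`), the example values, the unit and the activity cut-off of 64 are all hypothetical and are not taken from the SciX dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the experimental polymer library.
raw = pd.DataFrame({
    "cationic_pct":    [50, 50, 70, None, 30],
    "hydrophobic_pct": [20, 30, 30, 20, 40],
    "hydrophilic_pct": [30, 20, 0, 30, 30],
    "MIC_PA":          [32, 128, None, 64, 256],  # assumed unit: µg/mL
})


def clean_library(df, feature_cols, target_col):
    """Drop rows with a missing feature or target value and re-index."""
    return df.dropna(subset=feature_cols + [target_col]).reset_index(drop=True)


features = ["cationic_pct", "hydrophobic_pct", "hydrophilic_pct"]
clean = clean_library(raw, features, "MIC_PA")

# Turn the continuous MIC into a class label for classification models.
# A lower MIC means a more potent polymer; the cut-off is an assumption.
ACTIVE_THRESHOLD = 64
clean["active"] = (clean["MIC_PA"] <= ACTIVE_THRESHOLD).astype(int)

print(clean)
```

Dropping incomplete rows is only one option; with a library of 160 polymers it may be worth checking how many rows are lost before choosing it over filling in the gaps.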

This is a screenshot of the dataset provided to us. There are many potential independent
variables I can test in my research project; however, many parts of the datasheet are
also missing.

I talked to my supervisors, and they told me to get in contact with Cyrille Boyer,
Edgar, and other relevant researchers who were co-authors of the three papers discussed
above. However, I never received a reply to my emails, and when I went to their offices
in the chemical engineering faculty I could not find any of them.

I have decided to keep researching and leave this issue at the back of my mind for the
time being.

16/1/2024
Second day of the program, and today we started coming up with a hypothesis and research
question.

My process of coming up with a research question and hypothesis:
I first reviewed the three publications, and a specific passage from the second publication
caught my attention:

“High order multiblock copolymers are attractive for antimicrobial applications as one can
manipulate localised domain concentration within a polymer chain, potentially mimicking the
functional group spatial segregation endowed by the precise monomer sequence and
secondary structures in AMP.”

All three publications share the finding that the chemical composition of a polymer is
what determines its antimicrobial activity and its bacterial genus specificity. Hence, I
devised potential independent variables, with my dependent variable being the
antimicrobial activity measured by MIC (minimum inhibitory concentration).

Potential independent variables: global hydrophobicity, polymer charge, dPn, amine
functional groups, block sequence

I realised that dPn and block sequence could not be used with the provided dataset:
because synthesising polymers is so laborious, the dataset had very little variation in
dPn and block sequence, so I decided to remove them.

I also decided to determine global hydrophobicity from the hydrophobic and
hydrophilic blocks of the polymer, combining them into one term called “global
hydrophobicity”.

Hence, I came up with my research question:

“What influence do the hydrophobicity, polymer charge and amine functional groups
within the polymers have on antimicrobial activity?”

I decided this was horrible and reworded it to:

“How do variations in chemical composition impact the antimicrobial activity and bacterial
genus specificity of polymers?”

I then composed my hypothesis:

“Variations in amine functional groups, polymer charge and global hydrophobicity of
copolymers result in differing antimicrobial activity between Pseudomonas aeruginosa and
Staphylococcus aureus, due to the distinct composition of the cell membranes.”

17/1/2024
Code I wrote today → There was no real progression in the code, as I was just
copying what they said, and they all told me the values here are the best values for the
code. Later on, I may review this code and see how I can make it more efficient and
effective.

The performance of the models here:

Logistic Regression:

RandomForestClassifier:

It is pretty clear here that RandomForestClassifier is superior on precision, as well as on
F1 score, accuracy and recall. Currently my understanding of these metrics is limited;
however, once I finalise my methodology, I will get a better understanding of which
metric is most useful for judging the performance of the machine learning models.
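
As a reference for the four metrics named above, here is a small sketch that computes them from the counts of a binary confusion matrix, using their standard definitions. The counts are made up for illustration (a hypothetical test set of 48 polymers) and are not the results of the models in this portfolio.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the polymers predicted active, how many really are active.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly active polymers, how many the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 10 truly active polymers, 38 truly inactive.
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case the accuracy is high (0.875) even though 4 of the 10 active polymers are missed (recall 0.6), which is why accuracy alone can be misleading when one class is rare.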

18/1/2024
One major constraint in testing synthetic antimicrobial copolymers is how labour intensive
it is. Experimentalists have therefore tested a large set of polymers and compiled the
results into a dataset, which has been provided to me as part of the SciX program. The goal
is to identify what makes a good polymer and what makes a bad polymer, so that in the
future experimentalists do not waste labour testing polymers that will not perform well.
Through the use of machine learning, we can determine which polymers experimentalists
should keep testing.

This means my project may have a second research question, which would be:
“How can we determine the optimal chemical composition of polymers to result in the most
antimicrobial activity without the constraints of experimenting?”

This is a very iffy question, but it captures the machine learning aspect of my research
project, which in itself could provide me with one hypothesis.

19/1/2024
There weren’t many lab sessions today as it was the end of the summer program, so
there were lab tours, Q and A sessions, and a seminar on exam technique held at the end.

30/1/2024
I sent an email to my teacher regarding my research question, hypothesis, and the direction
of my research project. This was his reply:

9/2/2024
I did not get a chance over the past few days to respond to the email and fix my
project. However, today I caught up with my teacher and have been thinking about how I can
fix my hypothesis and research question, and how to weave in the machine learning aspect
as a subsection of my research project.

11/2/2024
I decided to merge my research questions into one:
“What variations in the chemical composition of copolymers have the greatest influence over
the antimicrobial activity against Pseudomonas aeruginosa and Mycobacterium smegmatis, and
how can machine learning models be utilised to identify and prioritise these factors?”

Independent Variables:
• The overall hydrophobicity of the copolymers (clogP)
• Composition of the cationic (type A), hydrophobic (type B1 and B2) and hydrophilic
(type C) monomers
• The type of monomer for each type
• Length of polymer (dPn)

Hypothesis: The presence of cationic monomers containing primary amine and quaternary
ammonium functionalities has the greatest influence on the antimicrobial activity of
copolymers against Pseudomonas aeruginosa and Mycobacterium smegmatis respectively.
Furthermore, overall copolymer hydrophobicity and the specific compositional ratio of
cationic, hydrophobic, and hydrophilic monomers are additional significant factors in
determining antimicrobial potency. Moreover, polymer chain length has a less pronounced
impact on antimicrobial efficacy against both bacterial strains.

I may need a sub-hypothesis that answers the second part of the question, on how
machine learning models may be utilised to identify and prioritise these factors. Look
into this later.

Possible title ideas:

1. Beyond Antibiotics: Designing potent antimicrobial polymers with Machine Learning
2. Machine Learning-Guided design of Antimicrobial copolymers: An alternative to
traditional antibiotics
3. Predicting Antimicrobial potency of copolymers: A machine learning approach in
optimising composition.
4. Predicting the impact of copolymer composition on Antimicrobial activity against
P. aeruginosa and M. smegmatis: A machine learning driven approach.

Current objectives I may need to add to my timeline for the research proposal (very rough):
• Apply all the different machine learning models, with their hyperparameters optimised
through GridSearchCV.
• Using the set methodology for determining machine learning model performance,
graph the performance of all the different machine learning models.
• Then apply the best machine learning model to the dataset.
• Using SHAP or Lasso (to be determined later), analyse the data.
• With this analysis, begin writing the publication, comparing the results against the
literature review.
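
The first objective above could be sketched as follows with scikit-learn's `GridSearchCV`. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification` to match the size of a 160-polymer library), and the parameter grid and the choice of F1 as the scoring metric are placeholders rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the polymer library (160 samples, 6 features).
X, y = make_classification(n_samples=160, n_features=6, n_informative=4,
                           random_state=0)

# Small, illustrative grid; these values are assumptions, not tuned choices.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # assumed metric to optimise
    cv=5,          # 5-fold cross-validation on the supplied data
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The same pattern applies to the other models by swapping the estimator and its grid. Each combination is scored by cross-validation, so the held-out test set is not needed for tuning.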

4/3/2024
I have come up with a new title:

Beyond Antibiotics: A machine learning-guided investigation of key copolymer
characteristics against Pseudomonas aeruginosa and Mycobacterium smegmatis.

These are the machine learning models I will test (I was told to use these by my
mentors):
- Logistic regression
- Random forest classifier
- Support vector classifier
- Multi-layer perceptron
- K-nearest neighbours

I changed the research question and hypothesis slightly:
Research question: “What are the key copolymer characteristics that have the greatest
influence over the antimicrobial activity against Pseudomonas aeruginosa and
Mycobacterium smegmatis, and how can machine learning models be utilised to identify and
prioritise these factors?”

Hypothesis: The presence of cationic monomers containing primary amine and quaternary
ammonium functionalities will have the greatest influence on the antimicrobial activity of
copolymers against Pseudomonas aeruginosa and Mycobacterium smegmatis respectively.
Furthermore, overall copolymer hydrophobicity/hydrophilicity and the specific compositional
ratio of cationic, hydrophobic, and hydrophilic monomers are additional significant factors
in determining antimicrobial potency. Moreover, polymer chain length has a less pronounced
impact on antimicrobial efficacy against both bacterial strains.

This hypothesis conveys what I expect to find, which is consistent with the
literature within this research area.

The dependent variable: Antimicrobial potency – This will be measured through the Shapley
values, by seeing which independent variable (also known as a feature) has the greatest
quantitative influence on the predictions of the ML model.

5/3/2024
Today I decided to develop my methodology, since my research proposal is coming up. My
methodology is going to be pretty simple, since it is just plugging the data into a program,
running the model, and then analysing the SHAP values, linking them back to the
literature and discussing the results.

I decided the best way to refine and develop my methodology was to read some
literature on it. Luckily, my mentor had provided us with additional publications
of interest:

Predicting Antimicrobial Activity of Conjugated Oligoelectrolyte Molecules via Machine
Learning
Armi Tiihonen, Sarah J. Cox-Vazquez, Qiaohao Liang, Mohamed Ragab, Zekun Ren, Noor
Titan Putri Hartono, Zhe Liu, Shijing Sun, Cheng Zhou, Nathan C. Incandela, Jakkarin
Limwongyut, Alex S. Moreland, Senthilnath Jayavelu, Guillermo C. Bazan, and Tonio
Buonassisi
Journal of the American Chemical Society 2021 143 (45), 18917-18931
DOI: 10.1021/jacs.1c05055

Relevant information from the article:
- “A common property of these compounds is their ability to intercalate and disrupt
biological membranes, which is their presumed antibiotic mechanism of action. This
proposed mechanism is advantageous for developing new antibiotics, as bacteria are
expected to be less likely to develop resistance to membrane disruptors, compared to
specific receptors. However, there is an absence of a known specific binding site that
may guide the design of chemical structures.”
- The article talks about “molecular representations”, “numeric vectors consisting of
molecular descriptor values”. Essentially, these are ways of describing the
structure of the molecules, such as “feature engineering relying on domain knowledge
or by attempting to form general fingerprints such as Morgan fingerprints.” This
section of the article is irrelevant for my project, because my dataset has
already outlined all the features, giving me a range of independent variables to test.
The article instead intends to use deep learning as well as unsupervised
machine learning to identify the features (otherwise known as independent
variables), and then uses SHAP to determine a quantitative value for each independent
variable in its relationship with antimicrobial activity. I will be using
supervised machine learning models.
- The article states that random forest regression and gradient boosting are effective
models. → Will have to look into this in the future.
- Shapley additive explanations (SHAP) is an effective ML model analysis method. It is
a game-theoretic approach that measures each feature’s influence on the final
outcome, showing how each feature affects each final prediction.
- Stokes et al. identified halicin using a deep learning approach based on
directed message passing neural networks.
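
To make the game-theoretic idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over all coalitions of the other features. It is not the `shap` package, which uses faster model-specific algorithms; the toy model, its three feature names and the all-zero baseline are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Features that are absent from a coalition take their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {i}) - value(set(subset))
                total += weight * gain
        phi.append(total)
    return phi


def toy_model(x):
    """Made-up score from three hypothetical polymer features."""
    charge, hydrophobicity, length = x
    return 2.0 * charge + 1.0 * hydrophobicity + 0.5 * charge * hydrophobicity


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)  # approximately [2.5, 2.5, 0.0]
```

The values add up to the difference between the prediction for the instance (5.0) and for the baseline (0.0). The interaction term is shared equally between the two features involved, and the unused `length` feature receives zero.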

What should be noted about this article is that it analyses conjugated oligoelectrolytes
(COEs), whilst I am studying antimicrobial copolymers. There is a distinction in structure,
but the antimicrobial mechanism is very similar. Unlike antimicrobial copolymers, COEs are
not primarily studied as alternatives to antibiotics.

The above image shows how the article used machine learning. I will take a similar approach;
however, the fingerprints and graph parts are unnecessary. I will use multiple models, and
then use SHAP to analyse the models that have the best performance.
- The R² was 0.65, which is very good for such a model. I aim to reach
this level of performance in my models as well.
- The split was 80%/20% train/test. As shown previously, I used 30% of the data as
my test set, which I will change to match this article.
- The rest of the article goes through a comparison of the molecular fingerprints from
different models. Most of it is irrelevant for my case.
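
A minimal sketch of the 80%/20% split mentioned above, using scikit-learn's `train_test_split`. The feature matrix is random and the class balance (25% active) is an assumption made only so the example runs; stratifying on the label is a suggestion, not something the article or the mentors prescribed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(160, 4))       # 160 polymers, 4 made-up features
y = np.array([1] * 40 + [0] * 120)  # assumed 25% "active" class

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # 80% train / 20% test
    stratify=y,       # keep the class ratio the same in both parts
    random_state=42,  # reproducible shuffle
)

print(X_train.shape, X_test.shape)
print("Active fraction in train/test:", y_train.mean(), y_test.mean())
```

With 160 polymers this leaves 128 for training and 32 for testing. On a dataset this small the test metrics will be noisy, so fixing `random_state` at least keeps comparisons between models consistent.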

I looked into model performance measurements, and this link covered everything I needed
to know:

https://medium.com/@abhishekjainindore24/a-comprehensive-guide-to-performance-metrics-in-machine-learning-4ae5bd8208ce

With this information, it is clear that recall should be weighted more heavily than
the other metrics. False positives are acceptable, but false negatives mean that
promising directions of research could be ignored because a polymer was wrongly predicted
to be inactive.

6/3/2024
In preparation for my scientific research proposal, I have developed this methodology:

“This study is conducted by building the machine learning models on a dataset with a
compilation of tested copolymers from recent studies by the Boyer Lab (Judzewitsch,
Nguyen, Shanmugam, Wong, & Boyer, 2018) (Judzewitsch, et al., 2020) (Zhao, Judzewitsch,
Wong, & Boyer, 2019). The selected machine learning models are: Logistic Regression,
Random Forest Classifier, Support Vector Machine, Multi-Layer Perceptron and K-Nearest
Neighbour, which are prominent classification models used in previous studies relevant to
this research (Tiihonen, et al., 2021). The dataset is feature engineered so that it only
contains what is relevant to my hypothesis. The dependent variable, also known as the target
variable, is the MIC (Minimum Inhibitory Concentration) value for the relevant bacterial
strain, and the independent variables, also known as features, are specific copolymer
characteristics (e.g. hydrophobicity, degree of polymerisation). Furthermore, Pseudomonas
aeruginosa and Mycobacterium smegmatis were chosen within the dataset due to the difference
in cell membranes of the two bacterial strains, allowing for further understanding of
copolymer interactions with the cell membrane.

All the selected machine learning models are then trained with 80% of the dataset and tested
by making the models predict the target variable for the remaining 20% of the dataset from
features they have not seen. The machine learning models' performance would then be
evaluated by comparing the classification reports of all models, showing the accuracy,
precision, recall, and F1 score (a combination of precision and recall) of the models.
However, since the MIC values of the two bacterial strains are distributed differently, the
machine learning models may perform differently depending on the bacterial strain. Hence, I
will select the best performing model in predicting the MIC of Pseudomonas aeruginosa and the
best model for Mycobacterium smegmatis, then apply the SHAP (SHapley Additive exPlanations)
package, providing quantitative data on the influence each feature had on the model's
predictions. This allows me to identify trends in the relationship between copolymer
structure and antimicrobial activity against the two bacterial strains, therefore allowing
me to test my hypothesis effectively.”
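
The split-and-evaluate step described in the methodology above can be sketched roughly as follows. This is only an illustration, not the script used in this project: the data is synthetic (generated with make_classification as a stand-in for the copolymer dataset), and the model settings and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the copolymer features and binary activity labels
# (NOT the real dataset).
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, random_state=0
)

# 80% of the rows for training, 20% held back for testing.
# stratify=y keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = SVC(kernel="rbf")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Text report with precision, recall and F1 score per class, plus accuracy.
print(classification_report(y_test, y_pred))

# The same report as a dictionary, which is easier to compare across models.
report = classification_report(y_test, y_pred, output_dict=True)
```

Swapping SVC for another scikit-learn classifier leaves the rest of the sketch unchanged, which is what makes comparing classification reports across models straightforward.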

I will later on refine my methodology to fit my report, as well as look into random forest
regression and gradient boosting once I come to further developing my code.

7/3/2024

I asked a question on Discord regarding ensemble models and the random forest classifier,
as looking online I found that the random forest classifier is generally the best when it
comes to dealing with small datasets. I will take this into account and definitely refine my
code once I get to it.

One thing to note is that the dataset itself was not very variable. There were only specks
of results that had any significance, which in the future may become a huge limitation and
barrier that I must overcome.

I asked in the official python discord server and will look into gradient boosting trees as well.

According to IBM, https://www.ibm.com/topics/overfitting :


“Overfitting occurs when an algorithm fits too closely or even exactly to its training data,
resulting in a model that can’t make accurate predictions or conclusions from any data other
than the training data.”

“Underfitting occurs when the model has not trained for enough time, or the input variables
are not significant enough to determine a meaningful relationship between the input and
output variables”

- Detecting overfitting can be achieved by k-fold cross-validation


- Ways of avoiding overfitting:
o Early stopping
o Train with more data
o Data augmentation
o Feature selection
o Regularisation
o Ensemble methods
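
A minimal sketch of how k-fold cross-validation can flag overfitting is shown below. It uses synthetic data and an arbitrary RandomForestClassifier configuration, both of which are assumptions for illustration only. The idea is that a large gap between the average training score and the average validation score suggests the model fits the training folds too closely.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (NOT the copolymer dataset).
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, random_state=0
)

# 5 folds, each keeping roughly the same class ratio as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    scoring="recall",
    return_train_score=True,  # also score each model on its own training folds
)

train_mean = scores["train_score"].mean()
test_mean = scores["test_score"].mean()
gap = train_mean - test_mean

print(f"mean train recall: {train_mean:.2f}")
print(f"mean validation recall: {test_mean:.2f}")
print(f"gap (larger suggests more overfitting): {gap:.2f}")
```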

8/3/2024
Here is the timeline that I will try to abide by:

Term 1 (Already pretty much done):


- Attend UNSW summer school program during summer holidays and commence
scientific research proposal.
- Week 1 – 7 - Create a scientific research proposal, ensuring all necessary information
is obtained and developed for the proposal.
- Begin to carry out methodology by creating the necessary python scripts for the
machine learning aspect. (Week 7 – Start of term 2)

Term 2:
- Organise and formulate obtained data from the SHAP values and the performance of
the machine learning models (Week 1 – 2)
- Commence literature review
- Complete discussion, methodology and conclusion (Week 2 – End of July Holidays)
- Finalise literature review (July Holidays)

Term 3:
- Complete and finalise all components of the Scientific Research Report and Portfolio
before the due date
- Submit the report and portfolio.

12/4/2024
Today I finally got the chance to begin writing my code.

These images are the code where the dependent variable is the MIC for Pseudomonas
aeruginosa. It is important to note that the performance metrics under each piece of code
are the relevant model's performance. What stood out to me is that most models were very
good at detecting when X = 0, meaning that the MIC was >= 64, a threshold which I determined
in consultation with my mentors.

However, according to some papers that I went through at the start of the year, I may
consider counting MIC values that are <= 128 as active
( https://pubs.acs.org/doi/10.1021/acs.macromol.9b00290 ). This is because the purpose of
this project is to undertake a structure-activity analysis, so an MIC of 128, whilst not
ideal, can still be a viable direction in research that can be fine-tuned to become more
effective.

The best models by far are Support Vector Machine, Random Forest Classifier, Logistic
Regression and K-Nearest Neighbour (not in any order). However, I may have to look into
applying external techniques to identify and prevent overfitting, since some of the
performance results seemed suspicious. For example, SVM had a precision of 1 when X = 1 and
a recall of 1 when X = 0. This is very impressive, although the recall of 0.67 when X = 1
is disappointing.

I have also decided not to bother with the multi-layer perceptron. Its performance was
horrible, which is expected. Since MLP is a type of ANN (artificial neural network) model,
I would need an extremely large dataset to get the most out of it, which I do not have. My
dataset is very limited, and I am using a binary system, where the MIC value is good (1) or
bad (0), to simplify the problem and ease the computational requirements. So, for
Mycobacterium smegmatis, I will not test the multi-layer perceptron, as the dataset for this
bacterium has little to no variability, with only a small cluster with some potential in
research.

Also, I used a test split of 0.3 here and will do so for the other bacterium as well. I aim
to change this once I have fully understood and developed a plan to refine my code in
accordance with ensemble methods and gradient boosting models.

24/4/2024

As expected, the performance of all the models was not the best, due to the lack of
variability in the dataset. Such a small dataset with so little variability leads to
decreased model performance. However, this is fine, as it means the models are accurate
given the circumstances and limitations of the project. Furthermore, the performance of the
models is still valid and can be used. When X = 0, the recall and precision as a whole were
at acceptable values. However, when X = 1, the performance becomes questionable.

This is a hurdle but can be overcome with SHAP values. Since this is a classification test,
I can determine which features have the greatest influence when X = 0, and the features
which have the least influence would therefore have the greatest influence when X = 1.
This approach may not be fully accurate: since SHAP values work by credit allocation, some
features' quantitative values will not be fully indicative of the trends and clusters of
data the model identified, but may instead be indicative of outliers.

Hence, using ensemble methods as a second approach in my methodology will help me
cross-validate all the results and create a valuable discussion.

Furthermore, I will have to refine all the code and see the performance when the test split
is set to 0.2.

I did not mention this in the previous entry, but my hyperparameter tuning is done by
finding the optimal hyperparameters with GridSearchCV, where the search range is limited to
what the computational power of my laptop (M2 MacBook Air) can handle. With
RandomForestClassifier especially, it takes up to 20 minutes for the search to finish, and I
have to close all my other applications for this to happen. Hence, I will have to be wary of
using ensemble models such as gradient boosting, as mentioned in a paper outlined
previously, as they are very computationally expensive.
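
As a rough sketch of what a small grid search with GridSearchCV looks like (the grid values, cv=3, the recall scoring and the synthetic data are all assumptions chosen to keep the run short, not the settings used in this project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the copolymer dataset).
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, random_state=0
)

# A deliberately small grid: 2 x 3 = 6 candidate settings.
# Every extra value multiplies the number of models that must be trained.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 4, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,              # 3-fold cross-validation per candidate
    scoring="recall",  # optimise the metric that matters most here
    n_jobs=1,          # raise this to use more CPU cores
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean cross-validated recall: {search.best_score_:.2f}")
```

The total number of fits is the number of candidates multiplied by the number of folds (here 6 x 3 = 18, plus one final refit), which is why widening the grid quickly becomes expensive on a laptop.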

1/5/2024
I changed the test split to 0.2, and I have also made it so that X = 1 if the MIC <= 128,
otherwise X = 0.
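
The labelling rule above can be written as a small function. This is a sketch only: the function name and the example MIC values are hypothetical, not taken from the dataset.

```python
MIC_THRESHOLD = 128  # in ug/ml; copolymers at or below this count as active


def label_activity(mic_values, threshold=MIC_THRESHOLD):
    """Return 1 for each MIC at or below the threshold, otherwise 0."""
    return [1 if mic <= threshold else 0 for mic in mic_values]


# Hypothetical MIC values, for illustration only.
example_mics = [16, 64, 128, 256, 512]

labels = label_activity(example_mics)
labels_at_64 = label_activity(example_mics, threshold=64)

print(labels)        # labels with the 128 threshold
print(labels_at_64)  # labels with the stricter 64 threshold
```

Keeping the threshold as a parameter makes it easy to rerun the same script with 64 or 128 and compare the resulting class balance.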

The performance generally stayed the same for all models. Interestingly, logistic
regression has consistently had 0 false negatives, whilst the better performing models,
Support Vector Machine and Random Forest Classifier, have had some false negatives,
although only a small number.

I originally believed that increasing the threshold to 128 would decrease overfitting;
however, asking people on the Python Discord server, I discovered this.

I will discuss this with my science extension teacher to further clarify whether I should
stay with 64 μg/ml as the threshold or use 128 μg/ml.

I keep getting this error when trying to use XGBoost, so I have decided to use a different
gradient boosting model that scikit-learn offers: GradientBoostingClassifier.

The performance was very good, with only one false negative (this is for Pseudomonas
aeruginosa). In general, Support Vector Machine, RandomForestClassifier and now
GradientBoostingClassifier have proven to be the best performing models, with very similar
results.

The performance of GradientBoostingClassifier on Mycobacterium smegmatis was also very
good in comparison to the other models.

I changed all the test splits to 0.2 in the Mycobacterium smegmatis program file, which is
simply:

The performance of logistic regression was:

The performance of Randomforestclassifier:

The performance of Support Vector Machine:

The Performance of K – Nearest Neighbours:

The performance of GradientBoostingClassifier:

It is important to note that the code is derived from the documentation of the relevant
libraries. Since all of these models are from the scikit-learn library, I used the
following documentation and tutorials:
- https://scikit-learn.org/stable/
- https://www.datacamp.com/tutorial/random-forests-classifier-python
- https://www.datacamp.com/tutorial/what-is-a-confusion-matrix-in-machine-learning
- https://docs.python.org/3/library/venv.html
- https://docs.python.org/3.9/contents.html
- https://campus.datacamp.com/courses/hyperparameter-tuning-in-python/grid-search?ex=7

While logistic regression had no false negatives, it had a large number of false positives,
and its overall accuracy was below 0.5. Hence, I cannot use this model.
RandomForestClassifier and Support Vector Machine each had only 1 false negative, and their
overall F1 score, recall and accuracy were all above 0.8. Hence, these models can definitely
be used. K-Nearest Neighbour had three false negatives, and its overall performance was
mediocre, being worse than RandomForestClassifier and Support Vector Machine but better than
logistic regression. The performance of GradientBoostingClassifier is mediocre as well.
Similarly, it had 3 false negatives, meaning its recall when X = 1 is not ideal, with an
accuracy of only 0.83 and an F1 score of 0.78. It performed better than K-Nearest
Neighbour; however, I still should not use this model.

Hence, for the time being I can conclude that for Mycobacterium smegmatis the ML models to
conduct the analysis with are RandomForestClassifier and Support Vector Machine.

6/5/2024
I completely forgot to come up with a conclusion for Pseudomonas aeruginosa. The
performances are below:

Logistic regression

RandomForestClassifier

Support Vector Machine:

K – Nearest Neighbour

Gradient Boosting Classifier

Logistic regression had by far the worst results, despite having 0 false negatives. This is
due to the high number of false positives and the very low accuracy. Surprisingly,
RandomForestClassifier really struggled to identify when X = 1 but was very successful at
identifying when X = 0. Support Vector Machine performed very well, with an F1 score of 0.8
when X = 1 and an F1 score of 0.92 when X = 0. There were 2 false negatives, but the overall
performance, especially the accuracy, is very notable. K-Nearest Neighbour had the second
worst results, with an accuracy of 0.81 and 2 false negatives, and its recall for both X = 0
and X = 1 was not at the same level as Support Vector Machine. Like all the other models,
K-Nearest Neighbour particularly struggled when X = 1. This is to be expected, since the
majority of the data is classed as 0. Gradient Boosting Classifier had only 1 false
negative but 3 false positives, which is not an issue since false positives are acceptable
here. The overall accuracy of 0.87 is impressive, and the recall of 0.80 when X = 1 is also
very good. However, like all the other models, Gradient Boosting Classifier struggled with
X = 1 overall, especially with the precision being 0.57.
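
The Gradient Boosting Classifier figures for X = 1 can be checked by hand from the error counts. The sketch below takes the 1 false negative and 3 false positives from the entry above. The true positive count of 4 is not stated in the entry; it is inferred from the reported recall, since TP / (TP + 1) = 0.80 only holds for TP = 4.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are really positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of real positives that the model found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


false_negatives = 1  # stated in the entry
false_positives = 3  # stated in the entry
true_positives = 4   # inferred from recall = 0.80 with 1 false negative

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)

print(f"precision = {p:.2f}")
print(f"recall = {r:.2f}")
print(f"F1 score = {f1_score(p, r):.2f}")
```

The precision comes out at 4/7, which rounds to the reported 0.57, so the reported precision and recall are consistent with those error counts.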

Looking at the script where the target variable is the MIC of Pseudomonas aeruginosa and
the script where the target variable is the MIC of Mycobacterium smegmatis, it is probably
unwise to cut off any of the models, because their performance varies across the two
bacteria.

Hence, I will instead use all models except logistic regression, cross-check between them
all, and take their performance into account whilst discussing my findings. This method can
be linked back to the following paper: https://pubs.acs.org/doi/10.1021/jacs.1c05055?ref=pdf.

15/5/2024
I talked with my science extension teacher regarding the threshold requirement. Since the
following paper (https://pubs.acs.org/doi/10.1021/acs.macromol.9b00290) took account of
polymers up to 128 μg/ml, I will do the same. Hence, I have completed all my code, and today
I will start using the SHAP package and composing graphs of the machine learning models'
performance based on what previous literature has done.

Furthermore, I decided that using only SVM (support vector machine) as my ML model is the
best outcome from all my testing. This is because it generally performed well compared to
all the other models. Using only one model also helps the validity of the experiment, as it
makes the analysis easier and keeps the resulting data consistent. All the other models have
issues with outliers in the dataset, specifically with the mycobacterium.

This also prevents confirmation bias: since I am only discussing the results from one
machine learning model, I cannot pick out the data from whichever models happen to suit my
hypothesis.

The code overall remained the same; however, I created a system where all the plots from
the SHAP package are saved into a folder automatically. Furthermore, I had to tweak the code
for SHAP accordingly.

Since SVM is not natively supported by SHAP in the way RandomForestClassifier is, I had to
adapt it by using KernelExplainer() with predict_proba(). Furthermore, I had to ensure that
the SHAP values were a 2D matrix and not a 1D matrix, which is why I have the if statements.
I also made the outputs probability estimates of the predictions, since SVM by default gives
class labels, deciding whether a sample is 1 or 0 in this case. The probability=True
hyperparameter ensures that, through Platt scaling, the model's decisions are converted into
probabilities.
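
The sketch below illustrates the two points above without running SHAP itself. The first part shows how probability=True makes an SVC expose predict_proba, which is the kind of callable that would be handed to shap.KernelExplainer together with some background data. The second part is a hypothetical helper, positive_class_values, that reduces SHAP output to a 2D (samples x features) matrix. It assumes the output is either a list with one array per class or a single array with a trailing class axis, which depends on the installed SHAP version. The data is synthetic, and this is not the project's script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the copolymer dataset).
X, y = make_classification(
    n_samples=100, n_features=6, n_informative=4, random_state=0
)

# probability=True fits an extra calibration step (Platt scaling) so that
# the SVM can return class probabilities rather than only class labels.
svm = SVC(kernel="rbf", probability=True, random_state=0)
svm.fit(X, y)

labels = svm.predict(X[:5])       # hard 0/1 decisions
proba = svm.predict_proba(X[:5])  # one probability per class, per sample


def positive_class_values(shap_values, class_index=1):
    """Hypothetical helper: return a 2D (samples x features) array.

    Assumes shap_values is a list with one array per class, a 3D array
    shaped (samples, features, classes), a 2D array, or a 1D array for
    a single sample.
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        return values[:, :, class_index]
    if values.ndim == 1:
        return values.reshape(1, -1)
    return values


print(labels)
print(proba.round(2))
```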

Here are the SHAP value plots:

PAO1:

There seems to be a formatting issue, but this will not be an issue when it comes to the
report. This is due to the program itself automatically downloading the images. Here is an
image of my IDE. I will manually screenshot from the IDE if I have to, to ensure that the
formatting is correct in the report.

Here is the Mycobacterium:

1/9/2024
Due to trials and personal circumstances, I had to neglect the project for the time being.
With my project due on the tenth, and with all the code and data completed, all I have to do
is put everything into the report.

Here is my peer review:


Peer Review
The report is not fully complete yet so I will just go off what I see.

⁃ The Abstract is good and I assume once you finalise your conclusions you will get
everything done by then
⁃ The literature review is very insightful, and from your planning it seems it will turn
out really well.
⁃ Your aim of study is concise, well thought out, and gives the marker full
understanding of what your report will be about, this will also potentially solidify and
make your abstract stronger as well, and add more depth to the literature review.
⁃ The research question while simplistic meets everything, although I think you can
have a more intricate question that aligns with your aim of study more. Like turning
your aim of study into a question basically.
⁃ Your variables are all listed, and your methodology is really good, although there
should be justification of why you carried out your methodology in that way; I did not
see that. This can be done by just linking it back to previous papers.
⁃ Everything else seems fine overall

While composing my code I forgot to take note of many of the issues I faced. There was a
huge directory issue: packages were being installed everywhere. Usually when you install a
package it all ends up in the same environment, but for some reason some of my modules were
being installed locally (within the virtual environment) and other modules were being
installed globally (on my hard drive). I still cannot track down where half the packages I
installed went, but I resolved the issue by downloading a separate IDE, creating a virtual
environment in which I downloaded the IDE, and creating folders within the directory of the
virtual environment. I also used Anaconda instead of the default Python installation, and it
has killed my computer with memory usage, but in the end I managed to get everything done.
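
A small diagnostic like the following can help track down this kind of problem by showing which interpreter is running and where a given package would be imported from. It is a sketch using only the standard library. The two function names are hypothetical, and the venv check applies to environments made with venv or virtualenv. Conda environments work differently, although their path still shows up in sys.prefix.

```python
import importlib.util
import sys


def in_virtual_environment():
    """True when running inside a venv/virtualenv environment."""
    return sys.prefix != sys.base_prefix


def package_location(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("interpreter:", sys.executable)
print("environment prefix:", sys.prefix)
print("inside a venv:", in_virtual_environment())

# Packages that are not installed for this interpreter print None.
for package in ("json", "sklearn", "shap"):
    print(f"{package} -> {package_location(package)}")
```

Running the same script from the IDE and from the terminal shows whether both are using the same interpreter, which is often the cause of packages seeming to go missing.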

2/9/2024
I began writing my scientific research report. Essentially all I have done is copied from my
proposal, created comments where I need to change, and changed up bits of my hypothesis
and research question. The screenshots below are evidence of such:

3/9/2024

I performed a statistical test on my dataset with the dummy variables, since these are also
features that will be analysed. The code is below:

I have found videos on how to do statistical tests with python, which makes it a lot easier for
me since the actual dataset itself is huge.

https://www.youtube.com/watch?v=CIbJSX-biu0

https://www.youtube.com/watch?v=QE0v3HHcKbs

I also changed my null and alternative hypotheses:


Null hypothesis (H0):
• The presence of cationic monomers containing primary amine or quaternary
ammonium functional groups will not have any significant effect upon
antimicrobial activity against P. aeruginosa and M. smegmatis.
• The overall chain length, the global hydrophobicity/hydrophilicity, the compositional
ratios of the cationic/hydrophobic/hydrophilic monomers as well as the specific type
of monomers present will not have any significant effect upon antimicrobial activity
against P. aeruginosa and M. smegmatis.

Alternative hypothesis (H1):


• The presence of cationic monomers containing primary amine or quaternary
ammonium functional groups will have a significant effect upon antimicrobial
activity against P. aeruginosa and M. smegmatis.
• The overall chain length, the global hydrophobicity/hydrophilicity, the compositional
ratios of the cationic/hydrophobic/hydrophilic monomers as well as the specific type
of monomers present will have a significant effect upon antimicrobial activity against
P. aeruginosa and M. smegmatis.

Therefore, I clearly have mixed results: the features type_B1_PEAm, type_C_HEAm,
composition_A, composition_B2, composition_C and dPn have no significance and hence fall
under the null hypothesis, while the rest of the features have significance and hence fall
under the alternative hypothesis.

I further refined the code so that I get a table with more values for each statistical test.
I am doing a two-tailed t-test here, and because I am not assuming that the groups have
equal variance I am using Welch's t-test, with this video for assistance:
https://www.youtube.com/watch?v=YFJox8gZRYA
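
For reference, the Welch's t statistic and its degrees of freedom (the Welch-Satterthwaite approximation) can be computed by hand as in the sketch below. The function name and the two example groups are made up for illustration and are not from the dataset. The p-value is left out because it needs the t distribution, which the standard library does not provide. In practice a statistics library would supply it.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)  # sample variances

    # Squared standard error of the difference in means,
    # without pooling the two variances.
    se_squared = v1 / n1 + v2 / n2
    t = (mean(group1) - mean(group2)) / math.sqrt(se_squared)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_squared**2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df


# Made-up feature values for two groups (e.g. inactive vs active).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Note that the formula divides by the standard error, so it breaks down when both groups have zero variance, and the degrees of freedom need at least two samples per group.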

If |t-statistic| > t-critical then we reject the null hypothesis; equivalently, the p-value
must be < 0.05.

For Chi Square tests if the p value is less than 0.05, then we reject the null hypothesis.
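
As an illustration of the chi-square decision rule, the sketch below runs a chi-square test of independence on a 2 x 2 table (for example, a dummy feature being present or absent against the activity label being 1 or 0). The counts are made up and the function name is hypothetical. For a 2 x 2 table there is 1 degree of freedom, and in that case the p-value can be written exactly using the complementary error function, so only the standard library is needed. No continuity correction is applied here, whereas some libraries apply one to 2 x 2 tables by default, so their results can differ slightly.

```python
import math


def chi_square_2x2(table):
    """Chi-square test of independence for a 2 x 2 table of counts.

    Returns (chi-square statistic, p-value) with 1 degree of freedom
    and no continuity correction.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree
    # of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: rows = feature absent/present, columns = X = 0 / X = 1.
counts = [[10, 20], [20, 10]]

chi2_stat, p = chi_square_2x2(counts)
reject_null = p < 0.05

print(f"chi-square = {chi2_stat:.3f}, p = {p:.4f}, reject H0: {reject_null}")
```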

I am also going to refine the tables, since the variance and standard deviation are not
necessary for the chi_square table. Furthermore, the mean and variance are not necessary for
the t_test_results.

I got this error:

This indicates that one of the groups has zero variance, which here would be for
M. smegmatis. To test this I will add this code:

# Inside the loop over features: skip any feature that cannot be t-tested.
n1, n2 = len(group1), len(group2)
if var1 == 0 or var2 == 0:
    print(f"Warning: Zero variance in group for feature {col}.")
    continue
if n1 <= 1 or n2 <= 1:
    print(f"Warning: Insufficient sample size for feature {col}.")
    continue

I got this message

I discussed this error here and was told to report it in my science report regardless. The
reason for the error is that the copolymers tested against M. smegmatis all have the same
dPn, hence there is no variance and no significance can be tested here. I will have to
discuss this in my limitations and reference it in my appendix.

These are the results for M. smegmatis. I had to remove dPn from this and take its zero
variance into account.

The variance of dPn for P. aeruginosa is also very limited, hence it does not show any
significance according to the t-tests. I will account for this in my limitations and
discussion, since this is a limitation of the dataset itself and previous literature has
shown that dPn can be fine-tuned to achieve greater antimicrobial activity.

4/9/2024
All I have done today is finish off writing the results of the models, complete the
statistical tests and finalise them in the report. Here are screenshots of my progress.

I decided that I will use RFC for M.Smegmatis, since the performance could not be ignored.

5/9/2024
I got my classmate to peer review my report that I have done so far and here is what she said:
Review for your Science Extension Report:

- The Abstract is very informative with good use of scientific language, but still appropriate
for the audience, however NESA's prompt for an abstract is:
The abstract
The abstract is a one-paragraph (approximately 100–200 words) summary of the scientific
research investigation. It contains the question, the methods, key results and conclusions. It
should be accurate and precise. Referencing is not needed in the abstract.
However, you do say that a conclusion is supposed to go there with no more than 50 words,
however I think you should prioritise more words to explain your method and results.

- To make the first sentence a bit more concise, I believe you can just have the reference
instead of saying the World Health Organisation (the high A report did this as well)

Reread the first sentence, as the information is informative, however it doesn't make sense
grammatically.
- For the second sentence, to make it more concise - "Therefore, it is imperative"
- For the third sentence "shown to be a promising alternative" sounds better but the original is
not grammatically wrong.
- making it extremely unlikely pathogens are able to develop resistance towards AMPs.
to: "unlikely for pathogens to develop"
- I think the start of the words are supposed to be capitalised: '(photoinduced electron transfer
reversible addition fragmentation chain transfer)'
-' which are also less susceptible to proteolysis, do not exhibit haemolytic activity and have
lower manufacturing costs.'
to: "proteolysis, and do not exhibit"
- 'drastically reducing the time of research required to solving the
protein folding problem,'
to: "required to solve the "
- Love the last literature paragraph :D

- I like your research question, however are you able to falsify or prove a "greatest
influence"? You might want to be a bit more specific if possible, unless you prove it in that
way. However the high grade A report does the same, so it may be valid.

- You have justified your reasoning in the scientific research hypothesis, but I don't think that
is necessary for the hypothesis:
(High grade A):
(Scientific Hypothesis:
Null Hypothesis –
The enzyme xylose isomerase will have no effect on the Drosophila melanogaster’s epileptic
symptoms.
Alternate Hypothesis -

The enzyme xylose isomerase will have an effect on the Drosophila melanogaster’s epileptic
symptoms.)

- Your methodology is concise and well justified, however your tensing changes, and it is
supposed to be past tense
(i.e. 'all the selected machine learning models are then trained" instead of were) There are
more examples that you may want to fix.
- Risk assessment and ethical concerns are great! no problem with that. If you do want a risk
assessment, I can give you points that I had to use in my Design and Technology portfolio for
risk assessments as they were a requirement (pretty easy stuff).
- At the start of your results you explain the statistical calculations that you used, however
you can put this in the method as I have observed many journals to do this and I have never
seen these justifications within the results.
- I think you should label the models or images of the models after the stat tests.

Are you going to continue on after the discussion? In the criteria you have to write about the
limitations (which you have sort of covered) but also future directions for the experiment.
Headings are also required in this area, however the discussion is succinct and well written.

Does "Appendix A:
Refer to Appendix A_Dataset.xlsx for full access to dataset." mean that you are putting it in a
separate folder to submit? because It might have to go within the report, however videos are
allowed.

For the overall formatting of the portfolio you may also want to add a contents page which is
required for the HSC test I believe.

Your portfolio is great!


:)
- lily

So from this I am going to fix my methodology, add a contents page, fix my literature review
as it is still lacking, and then finish off my results and write the discussion.

I also decided to stick with my previous conclusion of using support vector machine,
because when trying to use SHAP I consistently got the same error.

The dataset for M. smegmatis has consistently caused me issues, which will definitely be
noted in the appendix and in the limitations section. Hence I had to change the methodology
again. It is important to note that only using SVM still ensures the validity of the
project, since the same algorithm is being applied to both datasets, so the way the data is
being categorised and predicted remains the same. One could argue that by using a different
ML model for each dataset, a bias may occur for a specific bacterial strain, since there is
more data available for P. aeruginosa than M. smegmatis.

Here are my refined literature review and methodology:

Literature Review
According to the World Health Organisation, antibiotic resistance is a top global threat to
public health and development, with pathogens resistant to antibiotics estimated to be
responsible for 1.27 million global deaths in 2019 (World Health Organisation, 2023).
Therefore, it is imperative for alternative drugs to be developed. Antimicrobial peptides
(AMPs) have been shown to be a promising alternative, with some AMPs already being FDA
approved. AMPs kill drug-resistant bacteria by disrupting the membrane of the pathogen,
resulting in the leakage of cellular components and ultimately leading to cellular necrosis
(Sowers, Wang, Xing, & Li, 2023). Furthermore, AMPs have been shown to bind to intracellular
targets (e.g. inhibiting protein synthesis), therefore exhibiting the ability to kill
pathogens at multiple target sites, making it extremely unlikely for pathogens to develop
resistance towards AMPs. However, AMPs have limitations: haemolytic activity, protease
degradation and high manufacturing cost (Chen & Lu, 2020).

As a result, emerging studies have utilised new and efficient polymerisation techniques such
as PET-RAFT (Photoinduced Electron Transfer Reversible Addition Fragmentation Chain
Transfer) (Zhao, Judzewitsch, Wong, & Boyer, 2019) to produce synthetic copolymers that
mimic the membrane-disrupting mechanism of AMPs, and which are also less susceptible to
proteolysis, do not exhibit haemolytic activity and have lower manufacturing costs. These
antimicrobial copolymers exhibit the same cell membrane disruptive properties due to the
presence of cationic, hydrophobic and hydrophilic monomers. The cationic monomer allows for
electrostatic attraction to the bacterial membrane, and through the correct ratio of
hydrophobicity, determined by the percentage composition of hydrophobic and hydrophilic
monomers, the copolymer is able to disrupt the cell membrane, resulting in membrane lysis
(Qiu, et al., 2020). This makes them superior to AMPs and antibiotics (Judzewitsch P. R.,
Nguyen, Shanmugam, Wong, & Boyer, 2018). Furthermore, for antimicrobial copolymers to be
successful, the structure of the copolymer is imperative in determining antimicrobial
potency, since the electrostatic interaction and cell membrane disruptive mechanisms of the
copolymer are only effective against certain bacterial cell membranes, revealing the narrow
spectrum of applications of antimicrobial copolymers (Judzewitsch, et al., 2020). Thus,
research within this field has been concentrated on determining the optimal structure of
copolymers against select bacterial genera (Judzewitsch P. R., Nguyen, Shanmugam, Wong, &
Boyer, Towards Sequence - Controlled Antimicrobial Polymers: Effect of Polymer Block Order
on Antimicrobial Activity, 2018) (Pham, Oliver, Wong, & Boyer, 2021) (Pham, Oliver, & Boyer,
2022). However, traditional methods of research are time inefficient and labour intensive,
where in general it takes around 10 – 15 years before a potent drug can be developed
(MatchTrial, 2020). This prolonged period is detrimental in tackling antibiotic resistance,
because bacteria continue to evolve resistance to antibiotics in the meantime.

Therefore, machine learning can be utilised as an analytical technique due to its predictive
properties, allowing for the discovery of key components in antimicrobial potency, as well as
the prediction of effective antimicrobial copolymers (Dara, Dhamercherla, Jadav, Babu, &
Ahsan, 2022). Machine learning is an emerging technique, most notably known for the use of
AlphaFold in predicting protein structure (Jumper, et al., 2021), which drastically reduced the
research time required to address the protein folding problem and in turn provided a new
pathway to understanding the relationship between structure and function without the
constraints of experimentation. In particular, a study on conjugated oligoelectrolytes, which
are molecules similar in nature to antimicrobial copolymers (Tiihonen, et al., 2021), revealed
the effectiveness of machine learning based approaches in drug discovery, specifically in
analysing the relationship between the structure and function of the molecules. Thus,
machine learning can reveal quantitative relationships between copolymer structure and
antimicrobial activity, allowing for a clearer understanding of the antimicrobial mechanisms
of copolymers. Ultimately, this is of great practical significance for researchers, as it
identifies aspects of copolymer structure to be tested further.

Methodology
This study was conducted by building machine learning models on a dataset compiled from
copolymers tested in recent studies by the Boyer Lab (Judzewitsch, Nguyen, Shanmugam,
Wong, & Boyer, 2018) (Judzewitsch, et al., 2020) (Zhao, Judzewitsch, Wong, & Boyer, 2019).
The selected machine learning models are: Random Forest Classifier (RFC), Support Vector
Machine (SVM), K-Nearest Neighbours (KNN), and Gradient Boosting Classifier (GBC),
which are prominent classification models used in previous studies relevant to this research
(Tiihonen, et al., 2021). The dataset was feature engineered so that it contained only variables
relevant to the hypothesis: the dependent variable, also known as the target variable, is the
MIC (Minimum Inhibitory Concentration) value against the relevant bacterial strain, and the
independent variables, also known as features, are specific copolymer characteristics (e.g.
hydrophobicity, degree of polymerisation). Furthermore, Pseudomonas aeruginosa and
Mycobacterium smegmatis were chosen within the dataset due to the difference in the cell
membranes of the two bacterial strains, allowing for further understanding of copolymer
interactions with the cell membrane. Two separate datasets were composed, where one had
the MIC against P. aeruginosa and the other had the MIC against M. smegmatis (refer to
appendix D for more information).

Since some of the features were continuous and others were discrete, a two-tailed unpaired
t-test (assuming unequal variances) was conducted for features with continuous variables,
whilst a chi-square test of independence was conducted for features with discrete variables,
to identify whether the features of the dataset had any significant association with the
Minimum Inhibitory Concentration (MIC) of the copolymers against the two bacterial strains.
A two-tailed test was used since the hypothesis did not specify a direction. The alpha value
chosen was 0.05, so a result was deemed statistically significant when its P value was below
0.05 (a 95% confidence level). Some of the features were deemed insignificant due to the
limited dataset, which was contrary to current literature within the same field (further
explained in the limitations and discussion).
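
The two statistical tests described above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration and not the code used in the study: the function names (welch_t, chi_square_2x2) and all sample values are hypothetical, the chi-square version handles only a 2x2 table without a continuity correction, and the t-test sketch stops at the test statistic and degrees of freedom, since the P value itself would normally be read from the t distribution with a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Unpaired t statistic assuming unequal variances (Welch's test).

    Returns (t, df), where df is the Welch-Satterthwaite approximation.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    sq_se_a = variance(sample_a) / n_a  # squared standard error of mean A
    sq_se_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(sq_se_a + sq_se_b)
    df = (sq_se_a + sq_se_b) ** 2 / (
        sq_se_a ** 2 / (n_a - 1) + sq_se_b ** 2 / (n_b - 1)
    )
    return t, df


def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability is erfc(sqrt(chi2 / 2)); no continuity correction is applied.
    """
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("this sketch only handles 2x2 tables")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical clogP values for active (X = 1) and inactive (X = 0) copolymers.
active_clogp = [1.2, 1.5, 1.7, 2.0]
inactive_clogp = [0.4, 0.6, 0.9, 1.1]
t_stat, dof = welch_t(active_clogp, inactive_clogp)

# Hypothetical counts: rows = monomer present / absent,
# columns = active / inactive.
counts = [[12, 3], [4, 11]]
chi2_stat, p_value = chi_square_2x2(counts)

ALPHA = 0.05  # significance level stated in the text
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}, significant = {p_value < ALPHA}")
```

With a dataset as small as the one described here, expected cell counts in the chi-square table can be low, which is one reason a feature may come out as insignificant even when the literature suggests it matters.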

All the selected machine learning models were then trained on 80% of the dataset and tested
by predicting the target variable for the remaining 20%, whose features the models had not
seen during training. The models were given a binary classification task in which the MIC
value against each bacterial strain was encoded as either X = 1 or X = 0: X = 1 when the MIC
value was less than or equal to 128 µg/mL, and X = 0 when the MIC value was greater than
128 µg/mL. This threshold was chosen because previous literature used the same value
(Judzewitsch, Nguyen, Shanmugam, Wong, & Boyer, Towards Sequence-Controlled
Antimicrobial Polymers: Effect of Polymer Block Order on Antimicrobial Activity, 2018). The
performance of the machine learning models was then evaluated by comparing the
classification reports and confusion matrices of all models, which show the accuracy,
precision, recall, and F1 score of each model. However, since the MIC values for the two
bacterial strains are distributed differently, the machine learning models may perform
differently depending on the bacterial strain. Hence I selected the model with the best overall
performance in predicting the MIC against both P. aeruginosa and M. smegmatis. This
supports the validity of the project, since the same algorithm is used to categorise and predict
the data for both strains, ensuring that the outputs are produced in the same way. The SHAP
(SHapley Additive exPlanations) package was then applied to the selected ML model. This
provided quantitative data on the influence each feature had on the model's predictions.
SHAP was applied only to the features deemed significant by the statistical tests conducted.
This allowed me to identify trends in the relationship between copolymer structure and
antimicrobial activity against the two bacterial strains, therefore allowing me to test my
hypothesis effectively.
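
The labelling, splitting and evaluation steps described above can be sketched as follows. This is a simplified, dependency-free illustration under stated assumptions rather than the project's actual code: the helper names (to_label, holdout_split, binary_metrics), the MIC values and the predictions are all hypothetical, and in practice a machine learning library would supply the classifiers and the classification report.

```python
import random

MIC_THRESHOLD = 128  # µg/mL, the threshold stated in the text


def to_label(mic):
    """Encode an MIC value as X = 1 (MIC <= 128 µg/mL) or X = 0 (MIC > 128 µg/mL)."""
    return 1 if mic <= MIC_THRESHOLD else 0


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def binary_metrics(y_true, y_pred):
    """Confusion matrix and summary metrics for the positive class (X = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        # Rows are true classes (0, 1); columns are predicted classes (0, 1).
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical MIC values (µg/mL) for ten copolymers.
mic_values = [32, 64, 128, 256, 512, 16, 1024, 128, 256, 64]
labels = [to_label(mic) for mic in mic_values]
train_rows, test_rows = holdout_split(list(zip(mic_values, labels)))

# Hypothetical true labels and model predictions for a held-out set.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
print(binary_metrics(y_true, y_pred))
```

Comparing these metrics across the four models, separately for each bacterial strain, is what allows one model to be chosen as the best overall performer.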

8/9/2024
Today I completed my report, finishing the discussion, limitations and future directions, and
abstract.

Abstract:
Emerging studies suggest that copolymers mimicking antimicrobial peptides show promise
as a solution to the growing trend of antibiotic resistance found in bacteria. However,
due to the early stage of this research, development of a potent drug may take decades. This
study aims to identify trends in copolymer structure that greatly influence activity against
Pseudomonas aeruginosa and Mycobacterium smegmatis through the use of machine
learning (ML), thereby revealing new pathways towards drug discovery that are more
time-efficient and less laborious. It was found that the Support Vector Machine had the best
overall performance against both bacteria. As indicated by the SHAP values applied to the
Support Vector Machine, together with the P value of 8.787 × 10⁻⁸ (3 d.p.) for the monomer
AAPTAC (quaternary ammonium) against Mycobacterium smegmatis and the P value of
5.530 × 10⁻⁷ (3 d.p.) for the monomer BOC-AEm (primary amine) against Pseudomonas
aeruginosa, the presence of a primary amine group and a quaternary ammonium group has
the greatest influence on antimicrobial activity against Pseudomonas aeruginosa and
Mycobacterium smegmatis respectively, due to their electrostatic attraction to the
bacterial membrane. Furthermore, the compositional ratio of hydrophobic/hydrophilic
monomers and overall hydrophobicity were additional significant factors in determining
antimicrobial potency, due to their importance in causing the membrane-disruptive
properties of the copolymers.

Discussion
Comparing the dependence plots for P. aeruginosa and M. smegmatis, there was a clear
distinction between the monomers that were shown to be effective in increasing
antimicrobial potency. The monomer BOC-AEm had a positive effect on antimicrobial
efficacy against P. aeruginosa, as shown by the cluster of points at approximately 1 where
the Y value, being the SHAP value, was greater than 0 for every point. Conversely, the
monomer AAPTAC had a negative effect on antimicrobial efficacy against the same
bacterium, as shown by a cluster of points at approximately 1.5 where the Y value was less
than 0 for every point. The inverse occurred for M. smegmatis, where the monomer
BOC-AEm had a negative effect on antimicrobial activity whilst AAPTAC had a positive
effect. This aligns with previous literature, where the presence of the BOC-AEm monomer
allows for an electrostatic interaction between the negatively charged peptidoglycan layer
and the positively charged primary amine group of the monomer. Furthermore, hydrogen
bonds can be formed between the primary amine group and hydroxyl groups on the sugar
residues of the cell membrane, or between the primary amine group and carboxyl groups in
the peptide chains present in the peptidoglycan structure (Zhao, Judzewitsch, Wong, &
Boyer, 2019).
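
To make the sign of a SHAP value concrete, the sketch below computes exact Shapley values for a single prediction of a toy scoring function. It is an illustration only: toy_model, its weights, the feature order and the baseline are hypothetical stand-ins chosen to mirror the P. aeruginosa pattern described above, not the fitted Support Vector Machine, and the SHAP package estimates such values over a background dataset rather than the single fixed baseline assumed here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value, so each
    value measures how much a feature shifts the prediction away from
    predict(baseline), averaged over all orders in which features can be added.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_model(x):
    """Hypothetical score: [BOC-AEm present, AAPTAC present, clogP_predicted]."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]


instance = [1, 1, 0.6]  # hypothetical copolymer containing both monomers
baseline = [0, 0, 0.2]  # hypothetical reference copolymer
phi = shapley_values(toy_model, instance, baseline)

for name, contribution in zip(["BOC-AEm", "AAPTAC", "clogP_predicted"], phi):
    direction = "raises" if contribution > 0 else "lowers"
    print(f"{name}: {contribution:+.2f} ({direction} the predicted score)")
```

A value above 0 means the feature pushes the prediction towards the active class and a value below 0 pushes it away, which is how the clusters above and below 0 on the plots are interpreted. The values also sum to the difference between the prediction for the instance and the prediction for the baseline.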

The monomer AAPTAC contains a quaternary ammonium functional group, which allows for
an electrostatic interaction between the positively charged functional group and the
negatively charged cell membrane. Moreover, what made AAPTAC more successful is its
hydrophobic interactions, where the hydrophobic alkyl chains penetrate the mycolic acids of
the cell membrane (Zhao, Judzewitsch, Wong, & Boyer, 2019). The monomer DEAm had
mixed results according to the dependence plot and the summary plot. This aligns with
previous literature (Zhao, Judzewitsch, Wong, & Boyer, 2019), where protonation can occur
between the tertiary amine group of DEAm and the cell membrane of Mycobacterium, but its
overall success is minimal compared to AAPTAC.

According to the dependence plot and summary plot for P. aeruginosa, composition_B and
clogP_predicted both have a positive effect on antimicrobial activity against the bacterium.
Both features showed an increasing trend, indicating that hydrophobicity is crucial for
antimicrobial activity against P. aeruginosa. This aligns with the literature, since the
presence of hydrophobic monomers is crucial in disrupting the cell membrane after the
copolymer binds to the bacterium (Pham, Oliver, & Boyer, 2022). The summary plot showed
that the individual hydrophobic monomers had little to no effect against the bacterium;
however, this is due to the lack of variability in the dataset. Conversely, against
M. smegmatis there is a decreasing trend in composition_C and clogP_predicted, whilst there
is an increasing trend in composition_B. This revealed that the copolymers effective against
M. smegmatis were less hydrophobic, whilst those effective against P. aeruginosa were more
hydrophobic. Furthermore, AAPTAC is both highly hydrophobic due to the acrylamide group
present and highly hydrophilic due to the quaternary ammonium group, meaning that a third
monomer under Type_C, which is inherently hydrophilic, would not be necessary (Pham,
Oliver, Wong, & Boyer, 2021).

Limitations and Future Directions


The dataset used to conduct this experiment was very small and had missing values (refer to
appendix D). This caused the dPn feature to be deemed insignificant for both bacterial
strains, contrary to current literature (Namivandi-Zangeneh, et al., 2018), where dPn has been
shown to be a factor in copolymer structure that can be fine-tuned to increase antimicrobial
potency.
Furthermore, the small dataset meant that the statistical tests were also constrained. Since
the dataset lacked size and variability, features regarded as significant in current literature
were deemed insignificant (Zhao, Judzewitsch, Wong, & Boyer, 2019) (Pham, Oliver, Wong,
& Boyer, 2021). Moreover, the lack of variability carried over into the dependence and
summary plots despite the removal of insignificant features.

Whilst the experiment provided insight into the relationship between copolymer structure
and antimicrobial activity, there are clear limitations restricting its full potential, preventing
it from standing on its own without support from previous literature. However, this project
has revealed the potential of machine learning guided approaches in drug discovery, which
can shorten the time required for research and development, as well as reduce labour
requirements. Antimicrobial copolymers have been shown, in previous literature and through
this project, to be a promising alternative to antibiotics. More experimental data is required,
which can then be used for machine learning guided investigations, whether for predicting
effective copolymers or for structure-function analysis.

Conclusion
Therefore, it can be concluded that the presence of primary amine groups and quaternary
ammonium groups in cationic monomers has the greatest influence on antimicrobial potency
against P. aeruginosa and M. smegmatis respectively. This is indicated by the P value of
8.787 × 10⁻⁸ (3 d.p.) for the monomer AAPTAC against M. smegmatis, and further supported
by the P value of 5.530 × 10⁻⁷ (3 d.p.) for the monomer BOC-AEm against P. aeruginosa.
Additionally, the percentage composition of hydrophobic/hydrophilic monomers and the
overall hydrophobicity of the copolymers are also significant factors in antimicrobial potency
against the two bacterial strains. However, the presence of the correct cationic monomer for
each bacterial strain is most significant, as it initiates copolymer interaction with the cell
membrane.
