Science Extension Portfolio
These are areas that interest me in general. Although they are extremely broad, they are still a
starting point.
26/10/23
Two videos came up in my YouTube recommendations:
https://www.youtube.com/watch?v=gg7WjuFs8F4&ab_channel=GoogleDeepMind – This
video by Google DeepMind, an artificial intelligence research laboratory under Google, goes
through the importance of understanding protein folding and how AI is an imperative tool in
predicting protein structure. It follows the DeepMind team and their process in
participating in the CASP competition and predicting the SARS-CoV-2 proteins, presenting
the AlphaFold system as a huge breakthrough for drug discovery and disease understanding.
These two videos sparked an area of interest I want to pursue, one that combines computer
science with biology and chemistry. Hence, I looked into machine learning and its possible
uses within scientific research:
https://vial.com/blog/articles/the-role-of-machine-learning-in-drug-design-advancements-and-challenges/?utm_source=organic
I came across the following paper, titled “Artificial intelligence: A powerful paradigm for
scientific research” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8633405/).
This paper essentially explained how machine learning, a subset of artificial intelligence, can
be utilised across multidisciplinary fields involving technology and science. ML (machine
learning) techniques allow scientists to examine and analyse experimental data more
efficiently and effectively. Furthermore, these ML models can be used for predicting
outcomes, categorising data and sorting it according to relevance.
27/10/23
In discussion with my Science Extension teacher, he advised me to look into mentoring. This
is recommended throughout the research project, and a mentor will be able to guide me and
help me find an area of research. With a Google search I narrowed my mentorship options
down to:
- UNSW SciX
- Sydney University
- UTS Mentor Science+
While I found all of them interesting, the first one, “Unravelling Antibacterial Polymers with
Machine Learning”, aligned with the research I had previously conducted and combined my
interest in computer science with biology and chemistry.
3/11/2023
I found an article of interest that was somewhat related to my research topic:
Sowers, A., Wang, G., Xing, M., & Li, B. (2023). Advances in Antimicrobial Peptide
Discovery via Machine Learning and Delivery via Nanotechnology. Microorganisms, 11(5),
1129. https://doi.org/10.3390/microorganisms11051129
8/11/2023
Unfortunately, my desired project was fully taken, so I emailed the organisers, who agreed to
see if they could get me a spot in it.
14/11/2023
Confirmation of entry
1/12/23 – 6/1/24
I have been vigorously learning Python every day with the DataCamp courses provided by
UNSW.
7/1/2024
Within Microsoft Teams there are three publications that all students doing the same project
as me must read through and understand.
The first publication is called: High-Throughput Synthesis of Antimicrobial Copolymers and
Rapid Evaluation of Their Bioactivity.
Judzewitsch, P. R., Zhao, L., Wong, E. H. H., & Boyer, C. (2019, May 17). High-Throughput
Synthesis of Antimicrobial Copolymers and Rapid Evaluation of Their Bioactivity.
Macromolecules, 52(11), 3975-3986. https://doi.org/10.1021/acs.macromol.9b00290
- First library copolymers were found to be most effective against P. aeruginosa, while
third library copolymers were found to be most effective against M. smegmatis.
- There was great success against M. smegmatis, and these copolymers were recommended
for further optimisation and research, because their functionality mimics the amino
acids in antimicrobial peptides.
- The more positive readings of the first library and third library copolymers made them
promising antimicrobial copolymers. This reinforces that the charge of the polymer and
the type of cell membrane of the bacteria are extremely important. Furthermore, the
hydrophobicity of the copolymer is important.
- The copolymers had no success against Gram-positive bacteria, due to the copolymers
being positively charged.
8/1/2024
The second publication is called: Towards Sequence-Controlled Antimicrobial Polymers:
Effect of Polymer Block Order on Antimicrobial Activity
Reference: Judzewitsch, P. R., Nguyen, T. K., Shanmugam, S., Wong, E. H. H., & Boyer, C.
(2018). Towards Sequence-Controlled Antimicrobial Polymers: Effect of Polymer Block
Order on Antimicrobial Activity. Angewandte Chemie (International ed. in English), 57(17),
4559–4564. https://doi.org/10.1002/anie.201713036
The third publication is called High-Throughput Process for the Discovery of Antimicrobial
Polymers and Their Upscaled Production via Flow Polymerisation.
This paper used PET-RAFT like the other papers, synthesising a library of potential
antimicrobial polymers against Pseudomonas aeruginosa, then performed a
structure-property analysis with the results.
Note: This structure-property analysis is the basis of my own research project, as I have
concluded for the time being.
Some statements quoted from the paper that are also key takeaways from the paper:
- “In these studies, the chain length, block order, and monomer structure for a library of
tailored copolymers was shown to systematically impact their antimicrobial efficacy.”
15/1/2024
Today was my first day at the UNSW SciX program. We went through the basics of polymers
and machine learning, and began our research project by cleaning the data, which was
produced by experimentalists. The data was a library of 160 polymers, each having a
different composition, specifically their cationic : hydrophobic : hydrophilic ratios and
their block sequences. In general, changes to the structure and functionalities of the
polymers were the feature variables, and the target variable was the MIC against the
different bacterial genera.
I talked to my supervisors, and they told me to get in contact with Cyrille Boyer, Edgar,
and other relevant researchers who were co-authors of the three papers discussed above.
However, I never got a reply to my emails, and when I tried going to their offices in the
chemical engineering faculty I could not find any of them.
I have decided to keep researching and leave this issue at the back of my mind for the time
being.
16/1/2024
Second day of the program, and today we started coming up with a hypothesis and research
question.
My process of coming up with a research question and hypothesis:
I first reviewed the three publications, and a specific passage from the second publication
caught my attention:
“High order multiblock copolymers are attractive for antimicrobial applications as one can
manipulate localised domain concentration within a polymer chain, potentially mimicking the
functional group spatial segregation endowed by the precise monomer sequence and
secondary structures in AMP.”
Potential independent variables: global hydrophobicity, polymer charge, dPn, amine
functional groups, block sequence
I realised that analysing dPn and block sequence was not possible with the provided dataset:
because synthesising polymers is so laborious, the dataset had very little variation in dPn
and block sequence, so I decided to remove them.
I also decided to determine global hydrophobicity specifically through the hydrophobic and
hydrophilic blocks of the polymer, combining them into one term called “global
hydrophobicity”.
Research question: “How do variations in chemical composition impact the antimicrobial
activity and bacterial genus specificity of polymers?”
17/1/2024
Code I made during this day → There really was no progression in the code, as I was just
copying what they said, and they all told me the values here are the best values for the code.
Later on, though, I may review this code and see how I can make it more efficient and
effective.
Randomforestclassifier:
This means my project may have a second research question which would be:
“How can we determine the optimal chemical composition of polymers to result in the most
antimicrobial activity without the constraints of experimenting?”
This question is still rough, but it covers the machine learning aspect of my research
project, which in itself could provide me with a hypothesis.
19/1/2024
There weren’t many lab sessions today. It was the end of the summer program, so there were
lab tours, Q and A sessions, and a seminar on exam technique held at the end.
30/1/2024
I sent an email to my teacher regarding my research question, hypothesis, and the direction
of my research project. This was his reply:
9/1/2024
I did not get a chance over the past few days to respond to the email and fix my project.
However, today I caught up with my teacher and have been thinking about how I can fix my
hypothesis and research question, as well as intertwine the machine learning aspect as a
subsection of my research project.
Independent Variables:
• The overall hydrophobicity of the copolymers (clogP)
• Composition of the cationic (type A), hydrophobic (type B1 and B2), hydrophilic
(Type C)
• The type of monomer for each type
• Length of polymer (dPn)
Hypothesis: The presence of cationic monomers containing primary amine and quaternary
ammonium functionalities has the greatest influence on the antimicrobial activity of
copolymers against Pseudomonas aeruginosa and Mycobacterium smegmatis respectively.
Furthermore, overall copolymer hydrophobicity and the specific compositional ratio of
cationic, hydrophobic, and hydrophilic monomers are additional significant factors in
determining antimicrobial potency. Moreover, polymer chain length has a less pronounced
impact on antimicrobial efficacy against both bacterial strains.
I may need a sub-hypothesis that answers the second part of the question, referring to how
machine learning models may be utilised to identify and prioritise these factors. Look into
this later.
Current objectives I may need to add to my timeline for the research proposal (very rough):
• Apply all the different machine learning models, with their hyperparameters optimised
through GridSearchCV.
• Using the set methodology for determining machine learning model performance,
graph the performance of all the different machine learning models.
• Then apply the best machine learning model to the dataset.
• Using SHAP or Lasso (to be determined later), analyse the data.
• With this analysis, begin writing the publication, comparing the findings against the
literature review.
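The first objective above could look roughly like the sketch below. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_classification (not the copolymer dataset), the hyperparameter grids are small made-up examples rather than the values used in the project, and the multi-layer perceptron is left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the dataset: 160 rows, 6 numeric features, binary target.
X, y = make_classification(n_samples=160, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each candidate is (estimator, illustrative hyperparameter grid).
# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1, 10]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search, scored on F1 for the positive class.
    search = GridSearchCV(estimator, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_f1": search.best_score_,
        # score() reuses the search's scoring metric, so this is a test-set F1.
        "test_f1": search.score(X_test, y_test),
    }

for name, summary in results.items():
    print(name, summary)
```

Keeping the test split out of the grid search means the test-set score is only looked at once per model, after the hyperparameters have been chosen.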
4/3/2024
I have come up with a new title:
These are the machine learning models I will test (I was told to do these by my
mentors):
- Logistic regression
- Random forest classifier
- Support vector classifier
- Multi-Layer Perceptron
- K-nearest neighbour
I changed the research question and hypothesis slightly:
Research question: “What are the key copolymer characteristics that have the greatest
influence over the antimicrobial activity against Pseudomonas aeruginosa and
Mycobacterium smegmatis, and how can machine learning models be utilised to identify and
prioritise these factors?”
Hypothesis: The presence of cationic monomers containing primary amine and quaternary
ammonium functionalities will have the greatest influence on the antimicrobial activity of
copolymers against Pseudomonas aeruginosa and Mycobacterium smegmatis respectively.
Furthermore, overall copolymer hydrophobicity/hydrophilicity and the specific compositional
ratio of cationic, hydrophobic, and hydrophilic monomers are additional significant factors in
determining antimicrobial potency. Moreover, polymer chain length has a less pronounced
impact on antimicrobial efficacy against both bacterial strains.
This hypothesis effectively conveys what I expect to find, which is consistent with the
literature within this research area.
The dependent variable: Antimicrobial potency. This will be measured through the Shapley
values, by seeing which independent variable (also known as a feature) has the greatest
quantitative influence on the ML model's predictions.
5/3/2024
Today I decided to develop my methodology, since my research proposal is coming up. My
methodology is going to be pretty simple, since it is just plugging the data into a program,
running the model, analysing the SHAP values, and then linking the results back to the
literature and discussing them.
I decided the best way to fully refine and develop my methodology was to read some
literature on it. Luckily, my mentor had provided us with additional publications of
interest.
What should be noted about this article is that it analyses conjugated oligoelectrolytes
(COEs), whilst I am studying antimicrobial copolymers. There is a distinction in structure,
but the antimicrobial mechanism is very similar. Unlike antimicrobial copolymers, COEs are
not primarily studied as alternatives to antibiotics.
I looked into model performance measurements, and this link encompassed everything I
needed to know:
https://medium.com/@abhishekjainindore24/a-comprehensive-guide-to-performance-metrics-in-machine-learning-4ae5bd8208ce
6/3/2024
In preparation for my scientific research proposal, I have developed this methodology:
“This study is conducted by building the machine learning models on a dataset with a
compilation of tested copolymers from recent studies by the Boyer Lab (Judzewitsch,
Nguyen, Shanmugam, Wong, & Boyer, 2018) (Judzewitsch, et al., 2020) (Zhao, Judzewitsch,
Wong, & Boyer, 2019). The selected machine learning models are: Logistic Regression,
Random Forest Classifier, Support Vector Machine, Multi-Layer Perceptron and K nearest
Neighbour, which are prominent classification models used in previous studies relevant to
this research (Tiihonen, et al., 2021). The dataset is feature engineered, making it only
relevant to my hypothesis, where the dependent variable, also known as the target variable, is
the MIC (Minimum Inhibitory Concentration) values of the relevant bacterial strain, the
independent variables, also known as features, are specific copolymer characteristics (e.g.
hydrophobicity, degree of polymerisation). Furthermore Pseudomonas Aeruginosa and
Mycobacterium Smegmatis were chosen within the dataset due to the difference in cell
membranes of the two bacterial strains, allowing for further understanding of copolymer
interactions with the cell membrane.
All the selected machine learning models are then trained with 80% of the dataset and tested
by making the models predict 20% of the dataset’s target variables by being fed a range of
untested features. The machine learning models performance would then be evaluated by
comparing the classification reports of all models, showing the accuracy, precision, recall,
and f1 score (combination of precision and recall) of the models. However, since the MIC of
both bacterial strains are dispersed differently, this may result in the machine learning models
performing differently depending on the bacterial strain. Hence, I will select the best
performing model in predicting the MIC of Pseudomonas Aeruginosa and the best model for
Mycobacterium Smegmatis, then applying the SHAP (Shapley Additive Explanation) values
packaging, providing quantitative data of the influence each feature had on the model’s
algorithm. This allows me to identify trends in the relationship between copolymer structure
and antimicrobial activity against the two bacterial strains, therefore allowing me to test my
hypothesis effectively.”
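The train/test step described in the methodology can be sketched as follows. This is a minimal illustration, assuming scikit-learn, a synthetic stand-in dataset with an imbalanced binary target, and a default SVC as the example model; none of the values come from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 160 rows, roughly 70% class 0 and 30% class 1.
X, y = make_classification(
    n_samples=160, n_features=6, n_informative=4, weights=[0.7, 0.3], random_state=1
)

# 80% of the rows train the model, 20% are held back for testing.
# stratify=y keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

model = SVC().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall and f1-score, plus overall accuracy.
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
print(classification_report(y_test, y_pred, zero_division=0))
```

With a dataset this small the 20% test set is only 32 rows, so a single misclassified polymer moves the reported scores noticeably.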
Later on I will refine my methodology to fit my report, as well as look into random forest
regression and gradient boosting once I come to further developing my code.
I asked a question on Discord regarding ensemble models and the random forest classifier, as
looking online I found that the random forest classifier is generally the best when it comes
to dealing with small datasets. I will take this into account and definitely refine my code
once I get to it.
One thing to note is that the dataset itself did not have much variability. Only specks of
the results have any significance, which in the future may become a huge limitation and
barrier that I must overcome.
“Underfitting occurs when the model has not trained for enough time, or the input variables
are not significant enough to determine a meaningful relationship between the input and
output variables.”
8/3/2024
Here is the timeline that I will try to abide by:
Term 2:
- Organise and formulate obtained data from the SHAP values and the performance of
the machine learning models (Week 1 – 2)
- Commence literature review
- Complete discussion, methodology and conclusion (Week 2 – End of July Holidays)
- Finalise literature review (July Holidays)
Term 3:
- Complete and finalise all components of the Scientific Research Report and Portfolio
before the due date
- Submit the report and portfolio.
However, according to some papers that I went through at the start of the year, I may consider
MIC values that are <= 128 (https://pubs.acs.org/doi/10.1021/acs.macromol.9b00290).
This is because the purpose of this project is to undertake a structure-activity analysis, so
an MIC of 128, whilst not ideal, can still be a viable research direction that can be
fine-tuned to become more effective.
The best models by far are Support Vector Machine, Random Forest Classifier, Logistic
Regression and K-Nearest Neighbour (not in any order). However, I may have to look into
applying external techniques to identify and prevent overfitting, since some of the
performance results seemed questionable. For example, SVM had a precision of 1 for X = 1
and a recall of 1 for X = 0. This is very impressive, although the recall of 0.67 for X = 1
is disappointing.
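To see how those three numbers relate, here is a small worked example. The counts are hypothetical, chosen only because they reproduce the pattern described (precision 1 and recall 0.67 for X = 1, recall 1 for X = 0); the real confusion matrix is not recorded here.

```python
def per_class_metrics(tp, fp, fn, tn):
    """Precision and recall for class 1 (positive) and class 0 (negative).

    tp/fn count true class-1 samples predicted as 1/0;
    fp/tn count true class-0 samples predicted as 1/0.
    """
    return {
        "precision_1": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall_1": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision_0": tn / (tn + fn) if (tn + fn) else 0.0,
        "recall_0": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical counts: 3 truly active polymers (2 found, 1 missed),
# 10 truly inactive polymers (all correctly predicted inactive).
metrics = per_class_metrics(tp=2, fp=0, fn=1, tn=10)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A precision of 1 for X = 1 and a recall of 1 for X = 0 both come from the same fact, zero false positives, so they are not two independent signs of a good model. The recall of 0.67 for X = 1 is the number that shows an active polymer being missed.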
I have also decided not to bother with the multi-layer perceptron. Its performance was
horrible, which is expected: since MLP is a type of ANN (artificial neural network) model, I
would need an extremely large dataset to get the most out of it, which I do not have. My
dataset is very limited, and I am using a binary system, where the MIC value is good (1) or
bad (0), to simplify the problem and ease the computational requirements. So, for
Mycobacterium smegmatis, I will not test the multi-layer perceptron, as the dataset for this
bacterium has little to no variability, with only a small cluster showing some research
potential.
Also, I used a test size of 0.3 here and will do so as well for the other bacterium. I aim to
change this once I have fully understood and developed a plan to refine my code in
accordance with ensemble methods and gradient boosting models.
This is a hurdle but can be overcome with SHAP values. Since this is a classification test, I
can determine which features have the greatest influence for X = 0, and the features which
have the least influence there would therefore have the greatest influence for X = 1.
This approach may not be fully accurate: since SHAP values work by credit allocation, some
features' quantitative values will not be fully indicative of the trends and clusters of data
the model identified, but may instead be indicative of outliers.
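The idea of credit allocation can be illustrated with an exact Shapley calculation on a toy example. Everything here is made up for illustration: the three feature names are only labels, and the numbers in toy_value are invented, not taken from any model in this project. SHAP itself approximates this kind of calculation for a real model.

```python
from itertools import permutations


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}


def toy_value(present):
    """Made-up model output given which features are 'known' to the model."""
    score = 0.0
    if "cationic" in present:
        score += 0.30
    if "hydrophobic" in present:
        score += 0.10
    if {"cationic", "hydrophobic"} <= present:
        score += 0.20  # extra effect that only appears when both are present
    return score


phi = shapley_values(["cationic", "hydrophobic", "dPn"], toy_value)
for feature, value in phi.items():
    print(f"{feature}: {value:.2f}")
```

The shared 0.20 effect is split evenly between the two interacting features, so each feature's value mixes its own effect with its share of the joint effect. This is one reason a single SHAP number does not map cleanly onto one trend in the data.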
Hence, using an ensemble method as a second approach in my methodology will help me
cross-validate all the results and create a valuable discussion.
Furthermore, I will have to refine all the code and see the performance when the test size is
set to 0.2.
1/5/2024
I changed the test size to 0.2, and I have also made it so that X = 1 if the MIC <= 128, else
X = 0.
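The labelling rule above can be written as a short function. This is a minimal sketch in plain Python; the example MIC values are hypothetical, and the function name is only for illustration.

```python
MIC_THRESHOLD = 128  # threshold from the rule above


def binarise_mic(mic_values, threshold=MIC_THRESHOLD):
    """Label each MIC as 1 (active, MIC <= threshold) or 0 (inactive)."""
    return [1 if mic <= threshold else 0 for mic in mic_values]


example_mic = [16, 64, 128, 256, 512]  # hypothetical MIC readings
labels = binarise_mic(example_mic)
print(labels)
```

Note that a value of exactly 128 counts as active, because the rule uses <= rather than <.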
I changed all the test sizes to 0.2 in the Mycobacterium smegmatis program file, which is
simply:
While logistic regression had no false negatives, it had a large number of false positives, and
its overall accuracy was below 0.5. Hence, we cannot use this model. RandomForestClassifier
and Support Vector Machine had only 1 false negative each, and their overall f1-score, recall
and accuracy were all above 0.8. Hence, these models can definitely be used. K-Nearest
Neighbour had three false negatives, and its overall performance was mediocre, being worse
than RandomForestClassifier and Support Vector Machine but better than logistic regression.
The performance of GradientBoostingClassifier is mediocre as well. Similarly, it had 3 false
negatives, meaning its recall for X = 1 is not ideal, and its accuracy was only 0.83, with an
f1-score of 0.78. It performed better than K-Nearest Neighbour; however, we still should not
use this model.
Hence, for the time being I can conclude that for Mycobacterium smegmatis the ML models to
conduct analysis with are RandomForestClassifier and Support Vector Machine.
6/5/2024
I completely forgot to come up with a conclusion for Pseudomonas aeruginosa. Here are the
performances below:
Logistic regression
Logistic regression had by far the worst results, despite having 0 false negatives. This is due
to the high number of false positives and the very low accuracy. Surprisingly,
RandomForestClassifier really struggled to identify X = 1 but was very successful at
identifying X = 0. Support Vector Machine performed very well, with an f1-score of 0.8 for
X = 1 and 0.92 for X = 0. There were 2 false negatives, but the overall performance,
especially accuracy, is very notable. K-Nearest Neighbour had the second worst results, with
an accuracy of 0.81 and 2 false negatives, and its recall for both X = 0 and X = 1 does not
match that of Support Vector Machine. Like all the other models, K-Nearest Neighbour
particularly struggled with X = 1. This is to be expected, since the majority of the data
would be classed as 0. Gradient Boosting Classifier has only 1 false negative but 3 false
positives. This is not an issue. The overall accuracy of 0.87 is impressive, and the recall of
0.80 for X = 1 is also very good. However, like all the other models, Gradient Boosting
Classifier struggled with X = 1 overall, especially with its precision of 0.57.
Hence, I will instead use all models except logistic regression, cross-check through them all,
and take their performances into account whilst discussing my findings. This method can be
linked back to the following paper: https://pubs.acs.org/doi/10.1021/jacs.1c05055?ref=pdf.
15/5/2024
I talked with my Science Extension teacher regarding the threshold requirement. Since the
following paper (https://pubs.acs.org/doi/10.1021/acs.macromol.9b00290) took account of
polymers up to 128 μg/ml, I will do the same. I have now completed all my code, and today I
will start using the SHAP package and composing graphs of the machine learning models'
performance, based on what previous literature has done.
Furthermore, I decided that using only SVM (support vector machine) as my ML model is the
best outcome from all my testing. This is because it generally performed well compared with
all the other models. Using only one model also helps ensure the validity of the experiment,
as it makes the analysis easier and keeps the resulting data consistent. All the other models
have issues with outliers in the dataset, specifically with the mycobacterium.
This also prevents confirmation bias: since I am only discussing the results from one
machine learning model, I cannot pick out the data from whichever models specifically suit
my hypothesis.
Since SVM is not inherently adapted to SHAP like RandomForestClassifier is, I had to tweak it
by using KernelExplainer() together with predict_proba(). Furthermore, I had to ensure that
the SHAP values were a 2D matrix, not a 1D matrix, which is why I have the if statements. I
also made the outputs probability estimates of the predictions, since SVM by default gives
class labels, deciding whether a sample is 1 or 0 in this case. The probability=True
hyperparameter ensures that, through Platt scaling, the model's decisions are converted into
probabilities.
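A minimal sketch of this set-up is shown below. It is an illustration, not the project code: the data is randomly generated, the helper name to_2d_positive_class is hypothetical, and the shape handling assumes that older shap releases return a list with one array per class while newer ones return a samples x features x classes array. The shap step only runs if the package is installed.

```python
import numpy as np
from sklearn.svm import SVC


def to_2d_positive_class(shap_values):
    """Return a 2D (samples x features) array of SHAP values for class 1."""
    if isinstance(shap_values, list):
        # Assumed older shap behaviour: one array per class.
        shap_values = shap_values[1]
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        # Assumed newer shap behaviour: samples x features x classes.
        arr = arr[:, :, 1]
    if arr.ndim == 1:
        # A single explained sample: promote it to one row.
        arr = arr.reshape(1, -1)
    return arr


# Randomly generated stand-in data: 60 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 4))
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

# probability=True makes SVC fit Platt scaling, which enables predict_proba.
model = SVC(kernel="rbf", probability=True, random_state=0).fit(features, labels)
proba = model.predict_proba(features[:5])  # columns: P(class 0), P(class 1)

try:
    import shap  # optional dependency
except ImportError:
    shap = None

if shap is not None:
    # KernelExplainer is model-agnostic: it only needs a prediction function
    # and a small background sample to compare against.
    explainer = shap.KernelExplainer(model.predict_proba, features[:20])
    raw_values = explainer.shap_values(features[:5], nsamples=100)
    shap_matrix = to_2d_positive_class(raw_values)
    print(shap_matrix.shape)
```

KernelExplainer is slow because it re-evaluates the model many times per sample, so a small background set and a handful of explained rows keep the run time manageable.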
PAO1:
⁃ The Abstract is good and I assume once you finalise your conclusions you will get
everything done by then
⁃ The literature review is very insightful, and from your planning it seems it will turn
out really well.
⁃ Your aim of study is concise, well thought out, and gives the marker full
understanding of what your report will be about, this will also potentially solidify and
make your abstract stronger as well, and add more depth to the literature review.
⁃ The research question while simplistic meets everything, although I think you can
have a more intricate question that aligns with your aim of study more. Like turning
your aim of study into a question basically.
⁃ Your variables are all listed, and your methodology is really good, although there
should be justification of why you did your methodology through such way, I did not
see that. This can be just linking it back to previous papers regarding it.
⁃ Everything else seems fine overall
While composing my code I forgot to take note of many of the issues I faced. There was a
huge directory issue, where everything was being downloaded everywhere. Usually when you
download a package it all ends up in the same environment, but for some reason some of my
modules were being downloaded locally (within the virtual environment) and other modules
were being downloaded globally (onto my hard drive). I still cannot track down where half
the packages I downloaded went, but I resolved the issue by downloading a separate IDE,
creating a virtual environment in which I installed the IDE, and creating folders within
the directory of the virtual environment. I also used Anaconda instead of the default Python
installation, and it has killed my computer with memory usage, but in the end I managed
to get everything done.
I performed statistical tests on my dataset including the dummy variables, since these are
also features that will be analysed. The code is below:
https://www.youtube.com/watch?v=CIbJSX-biu0
https://www.youtube.com/watch?v=QE0v3HHcKbs
Therefore, I clearly have mixed results. The features type_B1_PEAm, type_C_HEAm,
composition_A, composition_B2, composition_C and dPn are not significant, so for these I
fail to reject the null hypothesis. The rest of the features are significant, so for these I
reject the null hypothesis in favour of the alternative hypothesis.
For Chi Square tests, if the p value is less than 0.05, then we reject the null hypothesis.
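The decision rule can be illustrated with a hand-computed chi-square test on a 2x2 table. This is a sketch under stated assumptions: the counts are hypothetical, the function uses the plain Pearson statistic with no continuity correction (library functions may apply one by default for 2x2 tables), and the p value formula is the one for 1 degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Assumes every row and column total is non-zero.
    Returns (chi-square statistic, p value for 1 degree of freedom).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = dummy feature absent / present,
# columns = X = 0 / X = 1.
table = [[30, 5], [10, 15]]
chi2, p_value = chi_square_2x2(table)
reject_null = p_value < 0.05
print(f"chi2 = {chi2:.3f}, p = {p_value:.5f}, reject null: {reject_null}")
```

A p value at or above 0.05 does not show that a feature has no effect. It only means the test did not find enough evidence against the null hypothesis in this dataset.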
I am also going to refine the tables, since the variance and standard deviation are not
necessary for the chi_square table. Furthermore, the mean and variance are not necessary for
the t_test_results.
This indicates that one of the groups has zero variance, which is the case for M. smegmatis.
To test this I will put this code:
I discussed this error and was told to report it in my science report regardless. The
reason for it is that the tested copolymers against M. smegmatis all have the same dPn,
hence there is no significance here. I will have to discuss this in my limitations and
reference it in my appendix.
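Why identical dPn values break a t-test can be shown with a small sketch. The dPn values below are hypothetical, and Welch's t statistic is used here only as an example of a two-sample test; the test used in the project may differ.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic, or None when both groups have zero variance."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if standard_error == 0:
        # Both groups are constant, so the statistic would be a division by zero.
        return None
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


# Hypothetical case where every tested copolymer has the same chain length.
dpn_active = [40, 40, 40, 40]
dpn_inactive = [40, 40, 40, 40, 40]
print(welch_t(dpn_active, dpn_inactive))

# Hypothetical case with some spread in one group, where the test is defined.
print(welch_t([20, 30, 40], [60, 60, 60]))
```

When a feature never changes across the dataset, no model or statistical test can say anything about its influence, which is a limitation of the data rather than evidence that the feature does not matter.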
The variance in dPn for P. aeruginosa is also very limited, hence it has no significance
according to the t-tests. I will account for this in my limitations and discussion, since this
is a limitation of the dataset itself, and previous literature has shown that dPn can be
fine-tuned to achieve a greater antimicrobial effect.
4/9/2024
All I have done today is finish off writing the results of the models, complete the
statistical tests and finalise them in the report. Here are screenshots of my progress.
- The Abstract is very informative with good use of scientific language, but still appropriate
for the audience, however NESA's prompt for an abstract is:
The abstract
The abstract is a one-paragraph (approximately 100–200 words) summary of the scientific
research investigation. It contains the question, the methods, key results and conclusions. It
should be accurate and precise. Referencing is not needed in the abstract.
However, you do say that a conclusion is supposed to go there with no more than 50 words,
however I think you should prioritise more words to explain your method and results.
- To make the first sentence a bit more concise, I believe you can just have the reference
instead of saying the World Health Organisation (the high A report did this as well).
Reread the first sentence, as the information is informative, however it doesn't make sense
grammatically.
- For the second sentence, to make it more concise - "Therefore, it is imperative"
- For the third sentence "shown to be a promising alternative" sounds better but the original is
not grammatically wrong.
- making it extremely unlikely pathogens are able to develop resistance towards AMPs.
to: "unlikely for pathogens to develop"
- I think the start of the words are supposed to be capitalised: '(photoinduced electron transfer
reversible addition fragmentation chain transfer)'
-' which are also less susceptible to proteolysis, do not exhibit haemolytic activity and have
lower manufacturing costs.'
to: "proteolysis, and do not exhibit"
- 'drastically reducing the time of research required to solving the
protein folding problem,'
to: "required to solve the "
- Love the last literature paragraph :D
- I like your Research question, however are you able to falsify or prove a "greatest
influence"? You might want to be a bit more specific if possible, unless you prove it in that
way. However, the high grade A report does the same, so it may be valid.
- You have justified your reasoning in the scientific research hypothesis, but I don't think that
is necessary for the hypothesis:
(High grade A):
(Scientific Hypothesis:
Null Hypothesis –
The enzyme xylose isomerase will have no effect on the Drosophila melanogaster’s epileptic
symptoms.
Alternate Hypothesis -
- Your methodology is concise and well justified, however your tensing changes, and it is
supposed to be past tense
(i.e. 'all the selected machine learning models are then trained" instead of were) There are
more examples that you may want to fix.
- Risk assessment and ethical concerns are great! no problem with that. If you do want a risk
assessment, I can give you points that I had to use in my Design and Technology portfolio for
risk assessments as they were a requirement (pretty easy stuff).
- At the start of your results you explain the statistical calculations that you used, however
you can put this in the method as I have observed many journals to do this and I have never
seen these justifications within the results.
- I think you should label the models or images of the models after the stat tests.
Are you going to continue on after the discussion? In the criteria you have to write about the
limitations (which you have sort of covered) but also future directions for the experiment.
Headings are also required in this area; however, the discussion is succinct and well written.
Does "Appendix A:
Refer to Appendix A_Dataset.xlsx for full access to dataset." mean that you are putting it in a
separate folder to submit? because It might have to go within the report, however videos are
allowed.
For the overall formatting of the portfolio you may also want to add a contents page which is
required for the HSC test I believe.
So from this I am going to fix my methodology, add a contents page, fix my literature review
as it is still lacking, and then complete finishing of my results and then discussing.
I also decided to stick with my previous conclusion being to use support vector machine
because when trying to use SHAP I consistenly got the same error.
The dataset for M. smegmatis has consistently caused me issues, which will definitely
be noted in the appendix and in the limitations section. Hence, I had to change the
methodology again. It is important to note that using only SVM still ensures the validity of
the project, since the same algorithm is applied to the datasets, so the way the data is
categorised and predicted remains the same. One could argue that by using unique ML
Literature Review
According to the World Health Organisation, antibiotic resistance is a top global threat to
public health and development, with pathogens resistant to antibiotics estimated to be responsible
for 1.27 million global deaths in 2019 (World Health Organisation, 2023). It is therefore
imperative that alternative drugs are developed. Antimicrobial peptides
(AMPs) have shown promise as an alternative, with some AMPs already being FDA
approved. AMPs kill drug-resistant bacteria by disrupting the
membrane of the pathogen, resulting in the leakage of cellular components and ultimately leading
to cellular necrosis (Sowers, Wang, Xing, & Li, 2023). Furthermore, AMPs have been shown to
bind to intracellular targets (e.g. inhibiting protein synthesis) and can therefore
kill pathogens at multiple target sites, making it extremely unlikely that pathogens
will develop resistance towards AMPs. However, AMPs have limitations, including
haemolytic activity, susceptibility to protease degradation and high manufacturing cost (Chen & Lu, 2020).
As a result, emerging studies have utilised new and efficient polymerisation techniques such
as PET-RAFT (Photoinduced Electron Transfer Reversible Addition-Fragmentation Chain
Transfer) (Zhao, Judzewitsch, Wong, & Boyer, 2019) to produce synthetic copolymers that
mimic the membrane-disrupting mechanism of AMPs, are less susceptible to
proteolysis, do not exhibit haemolytic activity and have lower manufacturing costs. These
antimicrobial copolymers exhibit the same cell membrane disruptive properties due to the
presence of cationic, hydrophobic and hydrophilic monomers. The cationic monomer
allows for electrostatic attraction to the bacterial membrane, and with the correct
ratio of hydrophobicity, determined by the percentage composition of hydrophobic and
hydrophilic monomers, the copolymer is able to disrupt the cell membrane, resulting in
membrane lysis (Qiu, et al., 2020). This makes them a potentially superior alternative to AMPs and antibiotics
(Judzewitsch P. R., Nguyen, Shanmugam, Wong, & Boyer, 2018). Furthermore, the structure of
the copolymer is imperative in
determining antimicrobial potency, since the electrostatic interaction and cell membrane
disruptive mechanisms of the copolymer are selective upon its effectiveness on certain
Machine learning can therefore be utilised as an analytical technique due to its predictive
capability, allowing for the discovery of key components of antimicrobial potency, as well as
the prediction of effective antimicrobial copolymers (Dara, Dhamercherla, Jadav, Babu, & Ahsan,
2022). Machine learning is an emerging technique, most notably known for the use of
AlphaFold in predicting protein structure (Jumper, et al., 2021), which
drastically reduced the research time required to address the protein folding problem, in
turn providing a new pathway to understanding the relationship between structure and function
without the constraints of experimentation. In addition, an article on conjugated
oligoelectrolytes, which are molecules similar in nature to antimicrobial copolymers
(Tiihonen, et al., 2021), demonstrated the effectiveness of machine learning based
approaches in drug discovery, specifically in analysing the relationship between the structure and
function of the molecules. Thus, machine learning can reveal quantitative relationships
between copolymer structure and antimicrobial activity, allowing for a clearer
understanding of the antimicrobial mechanisms of copolymers. Ultimately, this is of great
practical significance for researchers, as it identifies aspects of copolymer structure to be further
tested.
Methodology
This study was conducted by building machine learning models on a dataset
compiled from copolymers tested in recent studies by the Boyer Lab (Judzewitsch,
Nguyen, Shanmugam, Wong, & Boyer, 2018) (Judzewitsch, et al., 2020) (Zhao, Judzewitsch,
Wong, & Boyer, 2019). The selected machine learning models were: Random Forest Classifier
(RFC), Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Gradient Boosting
Since some of the features were continuous and some were discrete, a two-tailed unpaired
t-test (assuming unequal variances) was conducted for features with continuous variables,
whilst a Chi-square test of independence was conducted for features with discrete variables, to
identify whether the features of the dataset had any significant association with the Minimum
Inhibitory Concentration (MIC) of the copolymers against the two bacterial strains. A two-tailed
test was used since the hypothesis did not specify a direction. The alpha value chosen was 0.05,
so results with a P value below 0.05 were considered statistically significant (95% confidence level). Some of the
features were deemed insignificant due to the limited dataset, which was contrary to current
literature within the same field (further explained in the limitations and discussion).
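The two tests described above can be sketched in Python using only the standard library. This is a minimal illustration, not the analysis actually run for the report: the function names, sample values and the 2x2 table are all made up for the example. The t-test function returns only the Welch t statistic and degrees of freedom; a P value would normally come from a statistics package (for example `scipy.stats.ttest_ind` with `equal_var=False`). The Chi-square function assumes a 2x2 table (feature present/absent against X = 1/X = 0), uses no continuity correction, and relies on the fact that with one degree of freedom the P value can be computed from the complementary error function.

```python
import math
from statistics import mean, variance

ALPHA = 0.05  # significance level stated in the methodology


def welch_t(sample_a, sample_b):
    """Return (t, df) for an unpaired t-test assuming unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of group A
    vb = variance(sample_b) / nb  # squared standard error of group B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (chi2, p). Assumes every row and column total is non-zero
    and applies no continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Hypothetical clogP values for active (X = 1) and inactive (X = 0) copolymers
active = [1.2, 1.5, 1.8, 2.0, 2.3]
inactive = [0.4, 0.6, 0.9, 1.1]
t_stat, df = welch_t(active, inactive)
print(f"Welch t = {t_stat:.3f}, df = {df:.2f}")

# Hypothetical counts: rows = monomer present/absent, columns = X = 1 / X = 0
counts = [[20, 5], [5, 20]]
chi2, p = chi_square_2x2(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, significant = {p < ALPHA}")
```

A feature would be treated as significantly associated with MIC only when its P value falls below the chosen alpha of 0.05.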
All the selected machine learning models were then trained with 80% of the dataset and tested
by predicting the target variables for the remaining 20% of the dataset from
features the models had not seen during training. Furthermore, the models were given a binary classification task in which
the MIC value against each bacterial strain was encoded as either X = 1 or X = 0. When X = 1, the
MIC value was less than or equal to 128 µg/ml, whilst when X = 0, the MIC value was
greater than 128 µg/ml. This threshold was chosen since previous literature used the
same value (Judzewitsch P. R., Nguyen, Shanmugam, Wong, & Boyer, Towards Sequence-
Controlled Antimicrobial Polymers: Effect of Polymer Block Order on Antimicrobial
Activity, 2018). The performance of the machine learning models was then evaluated by
comparing the classification reports and confusion matrices of all models, showing the
accuracy, precision, recall, and F1 score of the models. However, since the MIC values for the two
bacterial strains are distributed differently, this may result in the machine learning models
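The labelling, splitting and scoring steps in this part of the methodology can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the code used for the project: the MIC values and predictions are invented, the helper names are hypothetical, and no model is trained here (the predictions stand in for the output of any of the classifiers named above).

```python
import random

MIC_THRESHOLD = 128.0  # µg/ml, the threshold stated in the methodology


def label_active(mic):
    """Encode an MIC value as X = 1 (<= 128 µg/ml) or X = 0 (> 128 µg/ml)."""
    return 1 if mic <= MIC_THRESHOLD else 0


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating X = 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical MIC values (µg/ml); not taken from the dataset
mic_values = [32, 64, 128, 256, 512, 16, 1024, 128, 64, 256]
rows = [(mic, label_active(mic)) for mic in mic_values]
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")

# Hypothetical true labels and model predictions for a held-out set
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters when one class (X = 1 or X = 0) is much more common than the other, because accuracy alone can look high even if the rarer class is predicted poorly.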
8/9/2024
Today I completed all of my report, where I finished my discussion, limitations and future
directions, and abstract.
Abstract:
Emerging studies suggest that copolymers mimicking antimicrobial peptides show promise
as a solution to the growing trend of antibiotic resistance in bacteria. However,
due to the early stage of this research, development of a potent drug may take decades. This
study aims to identify trends in copolymer structure that greatly influence activity against
Pseudomonas aeruginosa and Mycobacterium smegmatis through the use of machine
learning (ML), thereby revealing new pathways towards drug discovery that are
more time efficient and less laborious. It was found that the Support Vector Machine
had the best overall performance for both bacteria. The SHAP
values applied to the Support Vector Machine, together with the P value of 8.787 x 10^-8 (3 d.p.)
for the monomer AAPTAC (quaternary ammonium) against Mycobacterium smegmatis and
the P value of 5.530 x 10^-7 (3 d.p.) for the monomer BOC – AEm (primary amine) against
Pseudomonas aeruginosa, indicate that the presence of a primary amine group and a quaternary ammonium
group has the greatest influence on antimicrobial activity against Pseudomonas aeruginosa
and Mycobacterium smegmatis respectively, due to their electrostatic attraction to the
bacterial membrane. Furthermore, the compositional ratio of hydrophobic/hydrophilic
monomers and overall hydrophobicity were additional significant factors in determining
antimicrobial potency, due to their importance in causing the cell membrane disruptive
properties of the copolymers.
The monomer AAPTAC contains a quaternary ammonium functional group, which allows for
an electrostatic interaction between the positively charged functional group and the negatively
charged cell membrane. Moreover, what made AAPTAC more successful was the hydrophobic
interactions in which the hydrophobic alkyl chains penetrate the mycolic acids of the cell
membrane (Zhao, Judzewitsch, Wong, & Boyer, 2019). The monomer DEAm had mixed
results according to the dependency plot and the summary plot. This aligns with previous
literature (Zhao, Judzewitsch, Wong, & Boyer, 2019), where protonation can occur between
the tertiary amine group of DEAm and the cell membrane of Mycobacterium, but its overall
success is minimal compared to AAPTAC.
According to the dependency plot and summary plot for P. aeruginosa, composition_B and
clogP_predicted both have a positive effect on antimicrobial activity against the bacterium.
Both features showed an increasing trend, indicating that hydrophobicity is crucial for
antimicrobial activity against P. aeruginosa. This aligns with previous literature as
well, since the presence of hydrophobic monomers is crucial in disrupting the cell
Whilst the experiment provided insight into the relationship between copolymer structure and
antimicrobial activity, there are many clear limitations restraining its full potential
and preventing it from standing on its own without support from
previous literature. However, this project has revealed the potential of machine learning
guided approaches in drug discovery, which can reduce the time required for research and
development, as well as labour requirements. Antimicrobial copolymers have been shown, in previous literature
and through this project, to be a promising alternative to antibiotics; more
experimental data is required, which can then be used for machine learning guided
investigations, whether for predicting copolymers or for structure-function analysis.