Interstellar Medium Clustering SEO
Data Mining and Machine Learning in Astronomy
Andrea Hidalgo
Summer Research Internship, University of Western Ontario
This research was carried out under the supervision of Dr. Pauline Barmby, with the financial support of
the MITACS Globalink Research Internship Award, over a total of 12 weeks, from June 16th to
September 5th, 2014.
1 Introduction
1.1 Motivation
1.2 Objective
1.3 A bit of context
1.3.1 References
2.2 Hypothesis
2.2.1 Topics you should review
2.2.2 Downloading
4 Experimenting
4.1 Methods Selected
4.1.1 ESOM, Evolving Self Organizing Maps
4.1.2 CSOM
4.2 Further work
4.2.1 Some interesting ideas
4.2.2 Links you should check out
1. Introduction
1.1 Motivation
When I applied for the summer research internship, the title of the project was The many colours of
nearby galaxies and the description was:
The different populations of stars in a galaxy carry the record of its past star formation
history, and also affect its future. The project involves analysing Hubble Space
Telescope images of nearby galaxies of different types. By measuring the brightness
and colours of millions of stars, we can understand the ages and compositions of the
stars, and learn how the galaxy formed stars in the past. The radiation emitted by stars
affects the gas in a galaxy, and thus how it will form stars in the future. We will use
multi-colour images of galaxies to gain new insights into both their past and future.
So, as an engineer without any astrophysics background, I thought I would be doing image
processing applied to astronomy, and I ended up doing so much more. But hey! You never know what
you will end up doing.
Before I came to Canada, Pauline and I exchanged some emails in which she shared some
interesting papers, web pages and an online astronomy course, which I later took. The
information was mainly a general introduction to astronomy and to what astronomy images are like. Yes,
astronomy images are completely different from any other normal images: they are made purely of
science data, and every image holds valuable knowledge you can learn from. And hey, you will soon
forget about pixels and start talking about sky coordinates.
So, in a few words, I (still) had no idea what I was going to do. I realized I didn’t have any
idea, and the only thing I understood was how CCD detectors work. I didn’t know I had a research
adventure awaiting me.
1.2 Objective
After I arrived and had my first meeting with Pauline, she explained the general idea of what she
wanted and shared some more papers with me (about multi-wavelength studies). I read the information
The (testable) assumption that the same physical laws that apply here and now also
apply everywhere and at all times, and that there are no special locations or directions
in the universe.
That’s how science is made: thinking, testing and thinking again, creating your own scientific
method, coming up with hypotheses, learning what might work and what might not, using your instincts.
Well, before coming here I didn’t think like that; it was all about being super productive
and thinking about building robots and all kinds of devices with sensors. I had some experience
programming in C/C++, no computer science background, and I had never taken an astronomy course.
This report was written to help someone continue researching data mining
techniques applied to astronomy. I explain how I came up with the clustering techniques, my
hypothesis, some tests and other ideas I have had. I hope this helps and that the research
is continued. If you need anything or have questions, do not hesitate to contact me; my e-mail address
is: [email protected]. Also, as part of my own documentation I created a GitHub page where
you can download all the code I wrote and find more information. The link to this page
is: https://github.com/LaurethTeX/Clustering. From the README file you can access all
the pages; take your time to surf.
1.3.1 References
Since I found so much good information about pretty much everything I wanted to know, I
will simply add a remark to let you know where you can find more specific information, just
like the one below.
R For more information about the cosmological principle, review Chapter 1: Why Learn Astronomy?,
page 10, from 21st Century Astronomy, Hester | Smith | Blumenthal | Kay | Voss,
Third Edition, 2010.
2. Discovering what to do...
So, here you have your first astronomy picture.1 What do you see? It is a monochrome image
with different levels of brightness, fairly big (8500 x 5000 pixels), and it looks like a lot of stars making a
spiral.
Figure 2.1: Picture of the M83 galaxy, image taken from the WFC3 ERS M83 Data Products,
http://archive.stsci.edu/prepds/wfc3ers/m83datalist.html
How can we learn something from this image, quantify it, and get useful information? In the next
subsections I will explain the first ideas.
1 For example purposes the image selected is a picture of M83 through a Wide H-alpha and [N II] filter.
they vary according to color dimensions, methods and number of required superpixels and whether
the algorithm is able to find borders and make pixel classifications.
R You can find some example tests I ran with Matlab and with Python on this web page:
https://github.com/LaurethTeX/Clustering/blob/master/Methods.md. There is also a huge
amount of information on the internet about this, but here are two pages you might find useful:
• Superpixel: Empirical Studies and Applications
http://ttic.uchicago.edu/~xren/research/superpixel/
• Segmentation Algorithms in scikits-image
http://peekaboo-vision.blogspot.ca/2012/09/segmentation-algorithms-in-scikits-image.html
There is also one article (from IEEE) that might interest you; it’s pure computer
science:
• Normalized Cuts and Image Segmentation
http://www.cs.berkeley.edu/~malik/papers/SM-ncut.pdf
2.1.2 PCA
Welcome to astronomy, where articles use more acronyms than words. Lots of fun! In this case
PCA stands for Principal Component Analysis. The objective
of this method is to reduce dimensionality: it transforms the data to another space where it can be
manipulated and reduced. There are multiple examples of work in astronomy
that applies this technique.
Therefore, the idea of applying this method is that if we have multiple-wavelength images of the
same target and transform them to PCA space, we will have lower dimensionality and it will be
easier to process all the data and find valuable information.2
Figure 2.3: A distribution of points drawn from a bivariate Gaussian and centred on the origin of x
and y. PCA defines a rotation such that the new axes (x′ and y′) are aligned along the directions of
maximal variance (the principal components) with zero covariance. This is equivalent to minimizing
the square of the perpendicular distances between the points and the principal components.
R An example article that explains how to apply PCA to multi-wavelength images and
also mentions the pros and cons of using it:
• Preserving Structure in Multi-wavelength Images of Extended Objects
http://arxiv.org/abs/1101.1679v1
There’s a whole section that talks about this subject with a machine learning approach as a
preprocessing step in this nice book,
• Ivezić, Ž. and Connolly, A.J. and Vanderplas, J.T. and Gray, A., Statistics, Data Mining
and Machine Learning in Astronomy, Princeton University Press, Princeton, NJ, 2014.
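To make the PCA idea concrete, here is a minimal sketch in Python with Numpy. It is not the code used in this project: the function name pca_across_bands and the synthetic 9-layer cube are assumptions made only for illustration. Each pixel is treated as one sample with one value per filter, and the singular value decomposition of the centred data gives the principal components.

```python
import numpy as np


def pca_across_bands(cube, n_components):
    """Project every pixel's multi-band values onto the leading principal components.

    cube: array of shape (n_bands, ny, nx), one aligned image per filter.
    Returns (component images of shape (n_components, ny, nx),
             fraction of the total variance explained by each kept component).
    """
    n_bands, ny, nx = cube.shape
    # One row per pixel, one column per band.
    pixels = cube.reshape(n_bands, ny * nx).T
    centred = pixels - pixels.mean(axis=0)
    # The rows of vt are the principal axes, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    scores = centred @ vt[:n_components].T
    return scores.T.reshape(n_components, ny, nx), ratio[:n_components]


# Synthetic stand-in for a 9-filter cube (hypothetical data, not M83):
# two hidden signals mixed into nine bands, plus a little noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 20 * 20))
mixing = rng.normal(size=(9, 2))
demo_cube = (mixing @ hidden + 0.01 * rng.normal(size=(9, 400))).reshape(9, 20, 20)

demo_components, demo_ratio = pca_across_bands(demo_cube, n_components=2)
print(demo_components.shape, demo_ratio)
```

In this toy case two components carry almost all of the variance because the data was built from two signals; with real observations the number of useful components has to be checked case by case.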
2.2 Hypothesis
Our data looks like the images in Fig. 2.4: it contains data from, let’s say, a given galaxy at
different wavelengths. If we assume that the galaxy contains various regions related to interstellar
objects that can tell us how and where stars are formed, how and where stars die, where a star used to be, and other
mysteries, then I guess we can also assume that those regions can be identified because they share
similar characteristics. The idea is to find what a galaxy is made of, its contents, by extending the
superpixel concept to 3D superpixels.
Take the time to think about this: what the data looks like in 3D, what a star looks like in the data
cube. Imagine it; this is where ideas for tackling this problem come from.
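One way to picture the data cube is as a 3D Numpy array. The sketch below is only an illustration under my own assumptions: the layer order (band, row, column) and the helper names cube_to_patterns and labels_to_image are hypothetical. It shows one possible arrangement, in which every sky position becomes one pattern with one feature per wavelength, and how a list of cluster labels can be turned back into an image.

```python
import numpy as np


def cube_to_patterns(cube):
    """Turn a (n_bands, ny, nx) cube into a (ny * nx, n_bands) table of patterns."""
    n_bands, ny, nx = cube.shape
    return cube.reshape(n_bands, ny * nx).T


def labels_to_image(labels, ny, nx):
    """Put one cluster label per pattern back on the (ny, nx) pixel grid."""
    return np.asarray(labels).reshape(ny, nx)


# Tiny hypothetical cube: 2 bands, 3 rows, 4 columns.
toy_cube = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
toy_patterns = cube_to_patterns(toy_cube)

# The pattern of pixel (row=1, col=1) holds that pixel's value in every band.
print(toy_patterns.shape)
print(toy_patterns[1 * 4 + 1], toy_cube[:, 1, 1])
```

A star then shows up as a small group of neighbouring patterns that are bright in several bands at once, which is the kind of structure a clustering method would have to pick up.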
2 Before I forget to mention it: later I discovered that PCA is not commonly used for data mining preprocessing because
it is hard to interpret the information in the output. Imagine clusters of data in PCA space: how do you make sense
of that?
This will require a lot of work, but hey, it will be worth it and fun!
• Astroinformatics and computer science
– Data mining
– Machine Learning
– Big Data Analysis
– Neural Networks
– Visualization Resources
• Statistics and Image Processing
– Probability Density Function
– Point Spread Function
– Full width at half maximum
– Convolution
• Interstellar medium and star formation
– HII regions
– Planetary Nebulae
– Supernova Remnants
– Molecular Gas
– All kinds of Nebulae (e.g. dark, reflection)
– AGNs (Active Galactic Nuclei)
• Astrophysics
– Units (light-years, parsecs)
– World coordinate system
– Light
– Telescopes
– Stars and Stellar Evolution
– Distance, Brightness, Luminosity
– Galaxies
The GitHub page will certainly help you to understand why you need to learn about that, and where
to find articles, web pages and books.
2.2.2 Downloading
First, let’s equip ourselves with the basic software you will need in order to start; later you will
probably find other cool programs and install them too. It is also possible that
your assigned computer already has them installed, but here is a brief description of what you
can do with them. Most of them are easy to use.
DS9: This is a program that displays astronomy images in FITS format (don’t worry if you don’t recognize
this format, it will be explained later). With it you can easily manipulate them, read their
headers, compare them, look at regions, see their characteristics, and make graphs and even videos.
You will discover the functions as you need them; the best
way is to click everywhere and find out what happens. You can also ask your astronomy
colleagues, who will tell you all the perks, or, if you like learning by yourself or need
something specific, check the documentation web page. It is fairly easy to install; just follow
the instructions.
Download: http://ds9.si.edu/site/Download.html
Documentation: http://ds9.si.edu/site/Documentation.html
The picture below (Fig. 2.6) shows something cool you can do in DS9.
Python and a user interface: The most flexible and user-friendly way to develop programs in
astronomy is to use Python; there are many packages, modules and functions now available to
help you with almost anything. As an undergrad engineer, I’m used to programming in a user
interface and not directly in a terminal. So, here I will explain my own way of doing
things.
I write my programs in the Canopy editor; it shows when and where you have programming
errors and warnings, and the interface is easy to learn. To run a program, I open a terminal, go to the
directory where my program is, type ipython, wait, and then type run myProgram.py, and
Figure 2.6: This is an RGB picture made from 3 independent FITS files, with a z scale and a region
file overlaid from the NED database. If you would like to learn more about this, or reproduce it, it is
all explained on this web page:
https://github.com/LaurethTeX/Clustering/blob/master/NEDtoREGION-FILE/KnownRegions.md
To learn how to use them, check the documentation page, user manuals or their APIs. If
you have experience in object-oriented programming it will be like riding a new bike,
and if you don’t, don’t worry too much: Python was designed to be easy to program in; just
learn the rules of the game.
• Astropy: this package is the must-have of every astronomer. It contains tools to handle
coordinate systems, units, convolution... well, it is better if you take a look at the web
page. http://www.astropy.org/
• Numpy: this package contains the math magic functions, linear algebra tools and
the array management variables. Make sure you learn all about Numpy arrays; you
will work with them all the time. http://www.numpy.org/
• SciPy: this package is the base of all the scikit modules, which contain the functions
you will use in image processing and machine learning. http://www.scipy.org/
– Scikit Image: contains image processing tools; it is the OpenCV for Python.
http://scikit-image.org/
– Scikit Learn: contains data mining algorithms, pretty much everything
that you will ever need. http://scikit-learn.org/
• Matplotlib: this package is probably one of the most powerful tools to visualize data;
you can draw almost anything you want, exactly how you want it. An example
of that are the images of the AstroML book; you can access the image library
code and learn how they are made on this website
(http://www.astroml.org/book_figures/index.html).3 You can download the package here:
http://matplotlib.org/.
• PyFITS: in this package you will find tools to manipulate FITS files, create new
ones, create image cubes and tables, and do all kinds of things with their headers.
Certainly this package is more than useful.
http://www.stsci.edu/institute/software_hardware/pyfits
Along the way I’m certain you will find more and newer packages, and by then you
will be prepared to install anything.
Montage: This is a toolkit for assembling astronomical images into mosaics, but it has more
functions that you may need in the future to prepare your data before processing it. There
are two ways of installing it, and I would say it is better to have both. One is to
install the toolkit and run the commands in the terminal any time you need it; the
other is to install a Python module and use it just like any other module. To install
Montage for the terminal, download the latest version from
http://montage.ipac.caltech.edu/docs/download.html, read the README file or go to
http://montage.ipac.caltech.edu/docs/build.html and follow the steps. If you don’t
have any problem installing it, you can test it with the example program found at
http://montage.ipac.caltech.edu/docs/pleiades_tutorial.html. In case
you are having trouble and your computer is a Mac, instead of doing step five (If you want to
be able to run the Montage executables from any directory), try this:
1. Open a file called .profile located in your user folder (e.g. /Users/Laureth).
$ vi .profile
3 This is the Statistics, Data Mining, and Machine Learning in Astronomy book that was mentioned before.
Then try testing the Montage commands, and I’m sure that it will magically work; just
remember to type source .profile any time you want to use a command.
The other way to install it involves only installing a Python module, but this module
contains fewer functions than the terminal application. In any case, check the website
http://www.astropy.org/montage-wrapper/; there you will find all the documentation you
may need and the instructions to install it (spoiler: pip install montage-wrapper).
For any questions you may have about how to install these tools, here is my GitHub page for software tools:
https://github.com/LaurethTeX/Clustering/blob/master/Tools.md
3. Understand your data
Before continuing, first and most importantly you must select the raw data you are going to process;
later, after you acquire experience with a specific dataset, the idea is to extend the algorithms to
any kind of dataset. The important things are to learn how to input the data correctly, set the
right learning parameters in the selected algorithm, and find the best way to visualize your results
and interpret them correctly.
Now let’s start with basic concepts that differ between an engineer’s and an astronomer’s point of view.
observation and a multidimensional array that could be a table, or an image, or an array of images
(data cube). These files can be managed in different ways: for an image preview use DS9; for handling
the data in a program use the Python package PyFITS.
Figure 3.1: In this image you can see how an observation looks before and after convolution.
This particular image corresponds to the B band filter and was convolved to a 0.083 arcsec FWHM.
Table 3.2: WFC3/UVIS PSF FWHM information for the selected dataset. As you can see, the largest
number here is 0.083, which means the poorest spatial resolution; this is the number used to calculate
the convolution kernel, because in order to process them all the images must have the same spatial resolution.
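The kernel calculation mentioned in the caption of Table 3.2 can be sketched as follows. This is a simplified illustration, not the procedure used for the figures: it assumes that both the native and the target PSF are Gaussian, so that the widths add in quadrature, and the pixel scale of 0.04 arcsec per pixel and the native FWHM of 0.070 arcsec are placeholder values chosen for the example. Only the target of 0.083 arcsec comes from the table.

```python
import math

# For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma.
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def matching_kernel_sigma_pixels(fwhm_native, fwhm_target, pixel_scale):
    """Sigma (in pixels) of the Gaussian kernel that degrades an image from
    fwhm_native to fwhm_target. Both FWHM values are in arcsec and
    pixel_scale is in arcsec per pixel.
    """
    if fwhm_native > fwhm_target:
        raise ValueError("an image cannot be sharpened by convolution")
    # Gaussian widths add in quadrature under convolution.
    fwhm_kernel = math.sqrt(fwhm_target ** 2 - fwhm_native ** 2)
    return fwhm_kernel * FWHM_TO_SIGMA / pixel_scale


# Placeholder inputs: native FWHM and pixel scale are assumptions.
sigma_pix = matching_kernel_sigma_pixels(
    fwhm_native=0.070, fwhm_target=0.083, pixel_scale=0.04
)
print(sigma_pix)
```

The resulting sigma could then be used to build a 2D Gaussian kernel, for example with the convolution tools in Astropy, and applied to every layer whose resolution is better than the target.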
output clusters relate information from all the wavelengths and the regions covered by them can be
interpreted more easily. Now, if you choose to create an image cube (just append the image arrays
in one FITS file), it is possible that your images have different conversions between their world
coordinate system and pixels, so you have to make sure all of your images are projected with a single
conversion; this means that you have to re-project them to a common WCS.
Well, what I wrote above is a brief summary of what I did. I’m sure that you can find a
better way to do your own data pre-processing, but here are some things that you should consider:
• Create a method as general as possible, with input parameters that can be adapted to any kind
of data; this will save you a lot of work in the future.
• First understand your algorithm and how the data is going to be processed, then design the best
way to input your data.
• Arrange your data according to the type of attributes that the algorithm can handle.
• Consider the size of your dataset; if it’s huge your program may never end.
• Find out if your algorithm can work with high-dimensional data (multi-wavelength), because
if not, you won’t be able to input data cubes.
• Find out if your selected clustering algorithm is able to find clusters of irregular shapes; this
will help you to devise the best way to arrange your patterns.
• Handle outliers: if you identify them and know where they are, try to eliminate as many as
possible; we don’t want them messing with our clusters.
• In case you come up with an artful mathematical method like PCA to reduce dimensionality,
make sure that what you input can still make sense once it is clustered, because you will be
working in another space.
• Remember that the most important goal is to find hidden knowledge; therefore, you must know
how to visualize and interpret your results.
• For the, let’s call it, astronomy image processing, make sure that your data is scientifically
approved; ask the people around you.
Figure 3.2: Look at the image: it is composed of two mosaics, and therefore there are some regions
with missing data. Now look at the borders of each mosaic: there is noise near the edges. This is data
that we don’t want messing with our clustering algorithm and it can be classified as outliers. It is very
important to reduce them as much as possible so the output clusters can be correctly classified and
correspond to the information that we are looking for.
This section is explained at length on the GitHub page, where you will find my code and some helpful
links: https://github.com/LaurethTeX/Clustering/blob/master/Preprocessing.md
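As an illustration of the outlier handling described above, here is a minimal Numpy sketch. It is not the code from the GitHub page: the function names valid_pixel_mask and bounding_box_crop are hypothetical, and it assumes that missing data is stored either as NaN or as a fixed fill value (0.0 by default) in a cube of shape (bands, rows, columns).

```python
import numpy as np


def valid_pixel_mask(cube, missing_value=0.0):
    """Boolean (ny, nx) mask, True where every layer has usable data."""
    finite = np.isfinite(cube).all(axis=0)
    covered = (cube != missing_value).all(axis=0)
    return finite & covered


def bounding_box_crop(cube, mask):
    """Crop the cube to the smallest rectangle that contains all valid pixels."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("the mask contains no valid pixels")
    return cube[:, rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Hypothetical 2-layer cube with an empty top row and one bad pixel.
toy = np.ones((2, 5, 5))
toy[:, 0, :] = 0.0      # region not covered by the mosaic
toy[0, 4, 4] = np.nan   # isolated bad pixel in one layer

toy_mask = valid_pixel_mask(toy)
toy_crop = bounding_box_crop(toy, toy_mask)
print(toy_mask.sum(), toy_crop.shape)
```

A rectangular crop removes the empty borders, but isolated bad pixels inside the rectangle stay in the cube, so the mask is still worth keeping to exclude those patterns before clustering.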
3.3 Software available
While surfing the internet I discovered a cloud computing platform that is free, has data mining
algorithms embedded, is specifically developed for astronomy and is programmed by Caltech,
University Federico II and the Astronomical Observatory of Capodimonte. Its homepage is
http://dame.dsf.unina.it/index.html. Well, the platform for testing is ready! Now what?
I requested an account, and the next day they sent me an acceptance with my approved user name and
password. I went through the documentation, the available clustering functions, the
manuals for every method and the blogs, and discovered that there was one method available that could
work with data cubes and do its clustering on every pattern (number in the multidimensional matrix),
which was exactly what I needed. The name of this method is ESOM (Evolving Self Organizing
Maps). I read its manual, did some foolish tests with my whole image and ... never got a result ...
the experiment ran forever (more than two weeks). When I realised that this wasn’t the best way to
tackle this problem, I started considering clustering only on the independent images and not on the
data cube, because the dimensionality was immense. So, in the end my selected methods
have some results but not all; here is where all the work has to be done, analysed and tested again.
method is Train. Here, the important variables to understand and look at are the learning rate, epsilon
and the pruning frequency. I highly recommend that you check the DAMEWARE manual for
this function, where the meaning of each of the mentioned variables is explained in detail.
Expected Results
This particular method, as I mentioned before, supports data cubes and considers every number
in the multi-dimensional array as an independent pattern. This means that our clusters are groups of
patterns with similar characteristics, which correspond to volumes of similar fluxes of electrons inside
the data cube.
The output files from the experiment that will show us our results are:
• E_SOM_TrainTestRun_Results.txt: File that, for each pattern, reports ID, features, BMU,
cluster and activation of winner node
• E_SOM_TrainTestRun_Histogram.png: Histogram of clusters found
• E_SOM_TrainTestRun_U_matrix.png: U-Matrix image
• E_SOM_TrainTestRun_Clusters.txt: File that, for each cluster, reports label, number of
patterns assigned, percentage of association with respect to the total number of patterns, and its centroids.
• E_SOM_Train_Datacube_image.zip: Archive that includes the clustered images of each slice
of a data cube.1
The file that you will be looking forward to seeing is the last one, the zip, where you will be able to see
the slices of the volume and how the final configuration of the clusters was arranged.
Failed and still running tests: What not to do and what is still running
The first tests I did included the complete data cube, including the areas where data was missing;
the images were only re-projected and convolved. That was before realising that outliers might affect
the ability of the algorithm to identify the clusters and distract it with noise and missing data. So,
the first thing you must NOT do is leave the outliers in when you are training your network. If
you ever get to have a well-trained network, then it might be interesting to learn how the network
reacts to noise and outliers, but for now we will help it a bit.
Table 4.1 lists the input parameters I used for the failed tests applied to the raw data cube, and
table 4.2 lists the input parameters used in experiments that have been running since August 7th, 2014.
(I wonder if they will ever end.)
Some of the failed experiments had histograms like the one you can see in figure 4.1, where
the clusters were created but the neural network reached a point where it could not
differentiate one cluster from another and failed.
Hey, if you were wondering why I always choose to normalize and to use one input node: well,
the normalization is because I know that the data has, depending on the filter, all kinds of
ranges of fluxes on every layer, which means that the distances between patterns might not be correct;
this is a topic you should look into. And for the input nodes I choose 1 because if I start with any
other number the experiment automatically fails, and of course we do not want that.
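To see why the ranges matter, here is a small sketch of per-layer normalization written with Numpy. It is only an assumption about what a sensible normalization could look like: I do not know how the DAME platform normalizes internally, and the function name normalise_layers is hypothetical. Each layer is rescaled to the range 0 to 1 so that a filter with large fluxes does not dominate the distances between patterns.

```python
import numpy as np


def normalise_layers(cube):
    """Rescale every layer of a (n_bands, ny, nx) cube to the range [0, 1].

    NaN values are ignored when looking for the minimum and maximum and
    stay NaN in the output. A constant layer is mapped to zeros.
    """
    flat = cube.reshape(cube.shape[0], -1)
    lowest = np.nanmin(flat, axis=1)[:, None, None]
    highest = np.nanmax(flat, axis=1)[:, None, None]
    # Avoid dividing by zero when a layer is constant.
    span = np.where(highest > lowest, highest - lowest, 1.0)
    return (cube - lowest) / span


# Two hypothetical layers with the same structure but very different fluxes.
faint = np.arange(6.0).reshape(2, 3)
bright = 1000.0 * faint
scaled = normalise_layers(np.stack([faint, bright]))
print(scaled[0])
print(scaled[1])
```

After rescaling, the two layers in the example become identical, which is the point: the structure is kept and only the flux scale is removed. Other choices, such as dividing by the standard deviation, are equally possible.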
As I progressed and saw the results and the log files of all the failed experiments, I decided to try
the algorithm on independent layers and see if I could get something. Therefore I selected the Hα
convolved observation (halpha_conv.fits) and did some tests on it; table 4.3 shows the parameters I
used for the failed experiments and table 4.4 shows the parameters of the still running experiments.
1 I have my doubts whether this file is produced or not; it was not produced in any of my tests. You might need to contact the
developers and ask about this.
Name     Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
Train2   1            1                0.3            0.001    5
Train3   1            1                0.7            10       100
Train4   1            1                0.95           1        10
Train5   1            1                0.99           0.1      10
Train6   1            1                0.01           0.01     1
Train7   1            1                0.5            0.7      5
Train8   1            1                0.5            0.5      7
Train11  1            1                0.25           0.00001  10
Table 4.1: This table describes all the failed experiments done in the workspace WFC3 with the raw
data cube as an input, using the ESOM method in the DAME platform selecting the number 3 as the
dataset type and without using a previous configuration file.
Name     Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
Train9   1            1                0.3            0.0001   5
Train10  1            1                0.99           0.0001   10
Train12  1            1                0.5            0.0001   5
Table 4.2: This table describes all the experiments done in the workspace WFC3 that are still running
since August 7th, 2014 with the raw data cube as an input, using the ESOM method in the DAME
platform selecting the number 3 as the dataset type and without using a previous configuration file.
Figure 4.1: In this particular experiment, the neural network failed due to a very low pruning
frequency, a high number of patterns and the inclusion of all the outliers.
Name      Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
TrainHa1  1            1                0.5            0.01     5
TrainHa2  1            1                0.5            0.001    5
Table 4.3: This table describes the failed experiments done in the workspace WFC3 for the
halpha_conv.fits file, using the ESOM method for one layer in the DAME platform selecting the number
3 as the dataset type and without using a previous configuration file.
Name      Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
TrainHa3  1            1                0.5            0.0001   5
Table 4.4: This table describes the still running experiments since August 10th, 2014 in the workspace
WFC3 for the halpha_conv.fits file, using the ESOM method for one layer in the DAME platform
selecting the number 3 as the dataset type and without using a previous configuration file.
My next mental step was to repeat the tests after eliminating as many outliers as I could. My
hypothesis here is that if I eliminate all the areas where there is missing data and noise, the neural
network will concentrate only on the patterns I’m interested in clustering and maybe identify
interesting regions that correspond to some known interstellar object. So, what I did was to try the
ESOM algorithm with, again, independent images; this time I decided to apply the same experiment
to three different layers: Hα, UV wide and i-band. In table 4.5 you can see the parameters of the
failed experiments and in figure 4.2 there are some of the output histograms. Also, in table 4.6 you
can see the input parameters of the still running experiments.
Name    Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
Train1  1            1                0.5            0.001    50
Train2  1            1                0.5            0.01     50
Train3  1            1                0.5            0.1      100
Train4  1            1                0.5            0.001    100
Table 4.5: These parameters were used in three different workspaces (halphaCrop, uvwidecrop,
ibandcrop), each with its own input file that corresponded to the convolved and cropped observation of
each filter (halpha_conv_crp.fits, uvwide_conv_crp.fits, iband_conv_crp.fits). All of the experiments
had no previous configuration file, the dataset type was 3, and all failed.
Figure 4.2: The histogram on the left corresponds to the halpha workspace in Train1, the one on the
center to the iband workspace in Train3 and the one on the right to the uvwide workspace in Train2,
all of them were failed experiments.
As you can see, I discovered that if I choose an epsilon of 0.0001 the experiments keep
running, and all of the other variables, like the learning rate and the pruning frequency, can be varied.
Name    Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
Train5  1            1                0.5            0.0001   100
Train6  1            1                0.99           0.0001   75
Table 4.6: These parameters were used in three different workspaces (halphaCrop, uvwidecrop,
ibandcrop), each with its own input file that corresponded to the convolved and cropped observation of
each filter (halpha_conv_crp.fits, uvwide_conv_crp.fits, iband_conv_crp.fits). All of the experiments
had no previous configuration file and the dataset type was 3. The experiments mentioned have been
running since August 11th, 2014.
fixed pruning frequency of 0.0001, hoping that this time I could get some interesting results. The
input parameters for the two experiments I tested can be seen in table 4.7.
Name        Input nodes  Normalized data  Learning rate  Epsilon  Pruning Frequency
ESOMtrain1  1            1                0.5/0.75       0.0001   100
ESOMtrain2  9            1                0.75           0.001    100
Table 4.7: These parameters were used in two different workspaces (Data Cube, RPDataCube); the
first experiment has been running since August 12th, 2014 and the second failed. The input for the
Data Cube workspace corresponds to a 9 layer data cube with no re-projection and the RPDataCube
input is the same data cube but re-projected.
As you can see, in the experiment ESOMtrain2 I tried to start the neural network with 9 nodes
(thinking, logically, of the 9 layers in the data cube) and the experiment failed immediately, so do
not try to input a number other than one.
I waited 17 days for the experiments to finish (I did some other stuff in the meantime, most
of the time learning new things) but I did not get any results, so I came up with a different strategy:
selecting small data cubes with regions already identified in the NED database. I randomly selected
a particular HII region located at RA 204.26971, DEC -29.84933 (see figure 4.3) and centred it in a
605x605 pixel sample.
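Cutting a small data cube around a known position can be sketched with plain Numpy slicing. The sketch below is an illustration and not the script used for the experiments: the function name cutout is hypothetical, and it assumes that the sky position has already been converted to a pixel row and column (for instance with the WCS tools in Astropy), which is not shown here.

```python
import numpy as np


def cutout(cube, row, col, size):
    """Return a (n_bands, size, size) sample of the cube centred on (row, col).

    For an odd size, such as 605, the requested pixel ends up exactly in the
    middle of the sample.
    """
    half = size // 2
    row_start, col_start = row - half, col - half
    row_stop, col_stop = row_start + size, col_start + size
    _, ny, nx = cube.shape
    if row_start < 0 or col_start < 0 or row_stop > ny or col_stop > nx:
        raise ValueError("the cut-out falls outside the image")
    return cube[:, row_start:row_stop, col_start:col_stop]


# Hypothetical 3-layer cube and centre pixel, much smaller than the real data.
toy = np.arange(3 * 20 * 20).reshape(3, 20, 20)
small = cutout(toy, row=10, col=10, size=5)
print(small.shape, small[0, 2, 2] == toy[0, 10, 10])
```

Raising an error when the sample does not fit is a design choice; padding with a fill value would also work, but it would bring back exactly the kind of missing data that the small cube is meant to avoid.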
This time, most of the experiments gave me immediate results, either failing or finishing. In table 4.8
you can see the input parameters and the status of the experiments I tested with the small data cube.
Table 4.8: All the mentioned experiments belong to the SmallDataCube workspace, have 3 as data
type and one input node, no previous configuration file, and the input file is rp_small_datacube.fits.
In this case three of the experiments ended and none of them failed (yet). Here I noticed that the
output file that contains the distribution of the clusters on every layer is missing, but we got some
interesting results; in the next figures (4.4, 4.5) you can better appreciate what I’m talking about.
Figure 4.3: Illustration of the randomly chosen HII region for the small sample from the M83
re-projected data cube.
Figure 4.4: All of the images correspond to histograms of the ended experiments mentioned above,
in order (Train2, Train3, Train6). As you can see, one of the clusters predominates, which
can mean either that it is detecting the HII region or that the experiment never started; to understand the
results further, a visualization of the clusters is needed.
Figure 4.5: All of the images correspond to U-matrices of the ended experiments mentioned above
in order (Train2, Train3, Train6)
There is work to be done for these cases, to understand what is going on and interpret the
results correctly, but at last we got some.
4.1.2 CSOM
Well, as I mentioned before, I did some tests using the ESOM method, but since I wasn’t getting any
results I thought of testing this method. As always, I strongly recommend reading its manual carefully,
http://dame.dsf.unina.it/documents/SOFM_UserManual_DAME-MAN-NA-0014-Rel1.1.pdf,
to fully understand what is going on behind the curtains. In the meantime, this is my
own explanation. This method uses FITS files and does not support data cubes. It uses a
neighbourhood function in order to preserve the topological properties of the input space; it is a type
of artificial neural network, trained mainly by unsupervised learning, that produces a low-dimensional discretized
representation of the input space of the training samples. In this case you can choose the number
of clusters/neurons in the first layer (neural network), the diameter, the number of layers (in the neural
network), and the learning rate and variance on each layer. Here you have more input parameters to control.
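To show what a neighbourhood function does, here is a minimal self-organizing map written with Numpy. It is a generic textbook-style sketch and not the CSOM implementation of the DAME platform; the function names train_som and assign_clusters, the default parameters and the linear decay of the learning rate and radius are all assumptions made for the example.

```python
import numpy as np


def train_som(patterns, grid_shape=(4, 4), n_epochs=20,
              learning_rate=0.5, sigma=1.5, seed=0):
    """Train a small self-organizing map and return its neuron weights.

    patterns: array of shape (n_patterns, n_features).
    Returns an array of shape (n_neurons, n_features).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid_shape
    # Position of every neuron on the 2D map.
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)],
                    dtype=float)
    weights = rng.uniform(patterns.min(), patterns.max(),
                          size=(n_rows * n_cols, patterns.shape[1]))
    n_steps = n_epochs * len(patterns)
    step = 0
    for _ in range(n_epochs):
        for index in rng.permutation(len(patterns)):
            x = patterns[index]
            # Learning rate and neighbourhood radius shrink linearly.
            decay = 1.0 - step / n_steps
            rate = learning_rate * decay
            radius = max(sigma * decay, 1e-3)
            # Best matching unit: the neuron closest to the pattern.
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood measured on the map, not in data space.
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
            weights += rate * influence[:, None] * (x - weights)
            step += 1
    return weights


def assign_clusters(patterns, weights):
    """Label every pattern with the index of its closest neuron."""
    dist2 = ((patterns[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)
```

Because neighbouring neurons on the map are pulled towards the same patterns, nearby neurons end up representing similar data, which is what is meant by preserving the topology. For an image, the labels returned by assign_clusters could be reshaped back to the pixel grid to plot the clusters.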
Expected Results
Well, in this case, since only FITS images are allowed, what we expect to find are areas identifying
the different objects in the interstellar medium.
The important results in this case are obtained in the Run and Test steps; in the Train step only the
network configuration is output. What we are interested in seeing are the plotted clusters.
Tests
In this case I did some tests on the CSOM workspace, but none of them were successful: too many
input variables to control and test. So, in this case I will leave these parameters free for you to try. I
do believe that this method could be very useful, and if you find a way to input the data cube in a
different configuration you will get some interesting results, because in this method the
preservation of the topology is one of the main principles.