DETERMINATION OF MEMBERSHIP, CLUSTER AND
CLUSTER CENTRE FOR M67 USING HDBSCAN, PLOTTING
COLOR MAGNITUDE DIAGRAM AND FITTING ISOCHRONE
1. Introduction
Open clusters have long been regarded as powerful tools for studies of the Galactic disk and
evolution of stars (Chen, 2003). Membership determination is the first step to study an open cluster,
which can directly influence estimation of physical parameters. Various methods have been used
for membership determination based on proper motions, radial velocities, photometric data and
their combination.
Various algorithms have been developed for the determination of star cluster membership.
Machine-learning methods applied to this problem include DBSCAN (Gao, 2014), the
Gaussian Mixture Model (Gao, 2020), k-means (El Aziz et al., 2016), k-th nearest neighbour (Gao,
2016), ML-MOC (Agarwal et al., 2021) and many more.
2. Objectives
Aims of the workshop are:
1. to determine the center of the open cluster; and
2. to determine the membership probability
3. Theoretical Background
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a natural
evolution of DBSCAN released in the past few years, almost 20 years after DBSCAN (Ester et al.
1996). DBSCAN identifies clusters as overdensities in a multidimensional space in which the
number of sources within a neighborhood of a particular linking length ε exceeds the required
minimum number of points (minPts).
Red: Core points.
Yellow: Border points. Still part of the cluster because they are within epsilon of a core point, but
they do not meet the min_points criterion.
Blue: Noise points. Not assigned to a cluster.
Fig 1. Illustration of DBSCAN (source: medium.com/@agarwalvibhor84)
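The three point types in Fig 1 can be illustrated with a minimal NumPy sketch. The function name `classify_points`, the toy coordinates and the convention that a point counts as its own neighbour are assumptions made for this illustration; it is not the implementation used by any clustering library.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' following the DBSCAN definitions."""
    # Pairwise Euclidean distances between all points (fine for small samples).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    neighbours = dist <= eps                  # each point counts as its own neighbour
    is_core = neighbours.sum(axis=1) >= min_pts
    # A border point is not core but lies within eps of at least one core point.
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Toy data: a tight clump, one point on its edge, and one isolated point.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                [0.3, 0.1], [5.0, 5.0]])
toy_labels = classify_points(toy, eps=0.25, min_pts=4)
print(toy_labels)
```

The four clump points come out as core points, the edge point as a border point and the isolated point as noise.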
HDBSCAN works in a similar way, except that the user only needs to set a minimum cluster size. It does
not depend on ε; instead it condenses the minimum spanning tree by pruning off the nodes that do
not meet the minimum number of sources in a cluster, and reanalyzing the nodes that do (Kounkel
& Covey, 2019). Not only does it determine a suitable density threshold automatically, it also does
so locally, meaning that clusters with different densities can be recovered in different areas
of a dataset.
4. Data
The data that will be used is from Gaia Early Data Release 3 (Gaia eDR3, Gaia Collaboration 2016b;
2020a). The third early data release (eDR3, Gaia Collaboration et al. 2018) of the ESA Gaia space
mission (Gaia Collaboration et al. 2016b) is by far the deepest and most precise astrometric
catalogue ever obtained, with proper motion nominal uncertainties a hundred times smaller than
UCAC4 and PPMXL.
We download sources from Gaia eDR3 in a cone around the cluster centre for a value of radius that
is greater than the tidal radius of the cluster. Though our algorithm is quite robust to the choice of
this initial radius, we download sources within a radius of 180 arcmin from the cluster centre. Next,
we select the sources that satisfy the following criteria (Agarwal et al. 2021):
1. Each source must have the five astrometric parameters, positions, proper motions, and
parallax as well as valid measurements in the three photometric passbands G, GBP, and GRP
in the Gaia eDR3 catalogue
2. Their parallax values must be non-negative.
3. To eliminate sources with high uncertainty while still retaining a fraction of sources down to G
~ 21 mag, the errors in their G-mag must be less than 0.005.
You can download the M67 data here.
5. Workshop Structure
The workshop is made up of two Jupyter Notebooks. The layout of the workshop is as follows:
1. Determine the center of the open cluster.
2. Determine the membership probability.
Our membership assignment relies on the astrometric solution, and we only used the Gaia eDR3
photometry to manually confirm that the groups identified matched the expected aspect of a
cluster in a color-magnitude diagram.
Part 1: Determine the Center of the Open Cluster
To determine the membership of the open cluster M67, we will use a Python module called
hdbscan (McInnes et al. 2017). If this notebook is run on Google Colab, we first need to install some
libraries.
Import the required packages
!pip install hdbscan
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import hdbscan
from sklearn.preprocessing import StandardScaler
from astropy.coordinates import SkyCoord
import astropy.units as u
from sklearn.mixture import GaussianMixture
import arviz as az
from patsy import dmatrix
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from __future__ import print_function
Set some plotting configurations
SMALL_SIZE = 12
MEDIUM_SIZE = 14
BIGGER_SIZE = 20
plt.rc('font', size=SMALL_SIZE)        # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)   # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)  # fontsize of the tick labels
plt.rc('legend', fontsize=10)          # legend fontsize
%matplotlib inline
Data Preparation
Import data file
FILENAME = "gaiaedr3_150_M67.csv"
datafile = pd.read_csv(FILENAME, delimiter="…
datafile
            source_id          ra  ra_error        dec  dec_error   parallax  …
0  603712422876575360  135.110808  0.179466  10.726941   0.144282   0.286789
1  603712427171528448  135.118378  0.223748  10.723134   0.181216   0.051501
2  603712457236324224  135.121338  0.523702  10.738075   0.395677  -0.012734
3  603712660315544576  135.130346  0.903004  10.744922   0.761642   0.781963
datafile.info()
RangeIndex: 115670 entries, 0 to 115669
Data columns (total 24 columns):
 #   Column                     Non-Null Count   Dtype
 0   source_id                  115670 non-null  int64
 1   ra                         115670 non-null  float64
 2   ra_error                   115670 non-null  float64
 3   dec                        115670 non-null  float64
 4   dec_error                  115670 non-null  float64
 5   parallax                   95584 non-null   float64
 6   parallax_error             95584 non-null   float64
 7   parallax_over_error        95584 non-null   float64
 8   pm                         95584 non-null   float64
 9   pmra                       95584 non-null   float64
 10  pmra_error                 95584 non-null   float64
 11  pmdec                      95584 non-null   float64
 12  pmdec_error                95584 non-null   float64
 13  phot_g_mean_flux           115551 non-null  float64
 14  phot_g_mean_flux_error     115551 non-null  float64
 15  phot_g_mean_mag            115551 non-null  float64
 16  phot_bp_mean_flux          114016 non-null  float64
 17  phot_bp_mean_flux_error    114016 non-null  float64
 18  phot_bp_mean_mag           114016 non-null  float64
 19  phot_rp_mean_flux          114350 non-null  float64
 20  phot_rp_mean_flux_error    114350 non-null  float64
 21  phot_rp_mean_mag           114350 non-null  float64
 22  dr2_radial_velocity        1571 non-null    float64
 23  dr2_radial_velocity_error  1571 non-null    float64
dtypes: float64(23), int64(1)
memory usage: 21.2 MB
Handling the missing values
datafile.dropna(subset=['pmra', 'pmdec', 'parallax']).reset_index()
        index           source_id          ra  ra_error        dec  dec_error   parallax  …
0           0  603712422876575360  135.110808  0.179466  10.726941   0.144282   0.286789
1           1  603712427171528448  135.118378  0.223748  10.723134   0.181216   0.051501
2           2  603712457236324224  135.121338  0.523702  10.738075   0.395677  -0.012734
3           3  603712660315544576  135.130346  0.903004  10.744922   0.761642   0.781963
4           4  603713041351907200  135.140797  0.336336  10.772432   0.245219   0.951278
...
95579  115665  603622950118859520  135.003350  1.049273  10.782234   0.879750   0.584927
95580  115666  603623018837269632  134.996200  0.473971  10.786565   0.302199  -0.134166
95581  115667  603623018837270016  134.986138  0.489632  10.790086   0.348764   1.001567
95582  115668  603623053197008512  134.998781  0.346450  10.791484   0.283039   0.518023
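To eliminate sources with high uncertainty while still retaining a fraction of sources down to G ~ 21
mag, we select the sources whose G-mag errors are less than 0.005. Calculate the errors of G
($|\sigma_G|$), $G_{BP}$ ($|\sigma_{BP}|$), and $G_{RP}$ ($|\sigma_{RP}|$) from the fluxes and flux errors:
$$|\sigma_G| = \left| -\frac{2.5}{\ln 10}\,\frac{\sigma_{F_G}}{F_G} \right|$$
$$|\sigma_{BP}| = \left| -\frac{2.5}{\ln 10}\,\frac{\sigma_{F_{BP}}}{F_{BP}} \right|$$
$$|\sigma_{RP}| = \left| -\frac{2.5}{\ln 10}\,\frac{\sigma_{F_{RP}}}{F_{RP}} \right|$$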
Adding 5 columns named e_Gmag, e_BPmag, e_RPmag, bp_rp (to plot the color-magnitude diagram
easily) and parallax_over_error
datafile['e_Gmag'] = abs(-2.5*datafile['phot_g_mean_flux_error']/math.log(10)/datafile['phot_…
datafile['e_BPmag'] = abs(-2.5*datafile['phot_bp_mean_flux_error']/math.log(10)/datafile['pho…
datafile['e_RPmag'] = abs(-2.5*datafile['phot_rp_mean_flux_error']/math.log(10)/datafile['pho…
datafile['bp_rp'] = datafile['phot_bp_mean_mag'] - datafile['phot_rp_mean_mag']
datafile['parallax_over_error'] = datafile['parallax'] / datafile['parallax_error']
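As a self-contained illustration of the error propagation used for the magnitude-error columns, here is a minimal sketch. The helper name `mag_error_from_flux` and the flux values are hypothetical and are not taken from the catalogue; only the formula (2.5 / ln 10 times the fractional flux error) comes from the text.

```python
import numpy as np

def mag_error_from_flux(flux, flux_error):
    """Magnitude uncertainty from a flux uncertainty: |sigma_m| = (2.5 / ln 10) * sigma_F / F."""
    flux = np.asarray(flux, dtype=float)
    flux_error = np.asarray(flux_error, dtype=float)
    return np.abs(-2.5 / np.log(10.0) * flux_error / flux)

# Hypothetical fluxes and flux errors (same units), for illustration only.
example_flux = np.array([1000.0, 200.0])
example_flux_error = np.array([1.0, 2.0])
e_mag = mag_error_from_flux(example_flux, example_flux_error)
passes_cut = e_mag < 0.005   # the G-mag error criterion used in this workshop
print(e_mag, passes_cut)
```

A fractional flux error of 0.1% gives a magnitude error of about 0.001 mag and passes the cut, while a 1% flux error gives about 0.011 mag and is rejected.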
Select data with positive parallax value ($\varpi$ > 0) and error of G magnitude ($|\sigma_G|$) < 0.005
pprocessdata = datafile[(datafile['parallax'] > 0) & (datafile['e_Gmag'] < 0.005)].reset_inde…
pprocessdata
                source_id          ra  ra_error        dec  dec_error  parallax  …
0      603712422876575360  135.110808  0.179466  10.726941   0.144282  0.286789
1      603712427171528448  135.118378  0.223748  10.723134   0.181216  0.051501
2      603715588267458176  135.105567  0.031968  10.754417   0.023291  0.644799
3      603715618332282624  135.119962  0.312882  10.770758   0.246001  1.489228
4      603715622627196288  135.120392  0.016787  10.761850   0.012744  2.166628
...
70360  603622847038577280  134.983284  0.238657  10.768340   0.175209  0.577788
70361  603622881398315520  134.968256  0.477930  10.767039   0.351956  0.068398
70362  603622881398315776  134.965021  0.359085  10.770651   0.256683  0.671293
70363  603623018837270016  134.986138  0.489632  10.790086   0.348764  1.001567
70364  603623053197008512  134.998781  0.346450  10.791484   0.283039  0.518023
70365 rows × 28 columns
Select pmra, pmdec and parallax for clustering
df = pprocessdata[["pmra", "pmdec", "parallax"]]
df = df.to_numpy().astype("float32", copy = False)
Visualization I
Spatial Distribution
fig = plt.figure(figsize=(6, 6))
plt.plot(pprocessdata['ra'], pprocessdata['dec'], ',')
plt.xlabel(r'$\alpha$ (deg) Right Ascension')
plt.ylabel(r'$\delta$ (deg) Declination')
plt.title('Spatial Distribution of all stars')
plt.show()
[Figure: Spatial Distribution of all stars]
Vector Point Diagram
fig = plt.figure(figsize=(6, 6))
plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], ',')
plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
plt.title('Plotting Proper Motion as Vector Point Diagram')
plt.ylim(-50, 50)
plt.xlim(-50, 50)
plt.show()
[Figure: Plotting Proper Motion as Vector Point Diagram]
Color Magnitude Diagram
fig = plt.figure(figsize=(6, 8))
plt.plot(pprocessdata['bp_rp'], pprocessdata['phot_g_mean_mag'], ',')
ax = plt.gca()
ax.invert_yaxis()
#plt.xlim(0., 3.)
plt.title("Color Magnitude Diagram")
plt.xlabel('bp - rp')
plt.ylabel('g')
plt.show()
[Figure: Color Magnitude Diagram]
Normalize the data and run HDBSCAN
stscaler_df = StandardScaler().fit(df)
df_ = stscaler_df.transform(df)
clus_size = 2 * df_.shape[1]  # minimum cluster size: twice the number of dimensions
clusterer = hdbscan.HDBSCAN(min_cluster_size=clus_size)
cluster_labels = clusterer.fit_predict(df_)
pprocessdata['hdbscan'] = cluster_labels
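The normalization step can be sketched with NumPy alone to show what the scaling does to the three clustering dimensions. The helper `standardize` and the random toy features are assumptions for illustration; the notebook itself uses scikit-learn's StandardScaler.

```python
import numpy as np

def standardize(features):
    """Z-score each column: subtract the column mean, divide by the column standard deviation."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)   # population standard deviation (ddof=0)
    return (features - mean) / std

rng = np.random.default_rng(0)
# Hypothetical (pmra, pmdec, parallax) rows with very different scales.
toy_features = np.column_stack([
    rng.normal(-11.0, 5.0, 500),
    rng.normal(-3.0, 5.0, 500),
    rng.normal(1.2, 0.3, 500),
])
scaled = standardize(toy_features)
min_cluster_size = 2 * scaled.shape[1]   # same rule as clus_size: 2 x number of dimensions
print(scaled.mean(axis=0), scaled.std(axis=0), min_cluster_size)
```

After scaling, every column has zero mean and unit variance, so proper motions (several mas/yr of scatter) and parallaxes (a few tenths of a mas) contribute comparably to the distances used by the clustering.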
Vector Point Diagram for every HDBSCAN cluster
fig, ax = plt.subplots()  #figsize=(6,6))
plot = ax.scatter(pprocessdata['pmra'], pprocessdata['pmdec'], s=5, c=pprocessdata['hdbscan']…
fig.colorbar(plot, ax=ax)
ax = plt.gca()
ax.invert_yaxis()
plt.xlim(-50, 50)
plt.ylim(-50, 50)
plt.title('Vector Point Diagram for every HDBSCAN cluster')
plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
plt.show()
[Figure: Vector Point Diagram for every HDBSCAN cluster; the colour bar shows the cluster label]
Distribution of stars inside each cluster and the number of members from each clustering result.
plt.figure(figsize=(6, 4))
plt.hist(pprocessdata['hdbscan'])
plt.xlabel('Label of Cluster')
plt.ylabel('Number of Sources')
plt.title('Distribution of stars in each HDBSCAN cluster')
plt.show()
plt.close()
pprocessdata['hdbscan'].value_counts()
[Figure: Distribution of stars in each HDBSCAN cluster; x-axis: Label of Cluster, y-axis: Number of Sources]
-1       54…
 537     1422
 677       84
 1143      70
 176       53
         ...
 …          6
 249        6
 491        6
 140        6
 569        6
Name: hdbscan, Length: 1188, dtype: int64
Remove the sources labelled as background (label = -1) and count the members of the remaining clusters.
result_hdbscan = pprocessdata[pprocessdata['hdbscan'] >= 0].reset_index(drop=True)
c = result_hdbscan['hdbscan'].value_counts()
print(c)
537     1422
677       84
1143      70
176       53
1080      51
        ...
221        6
259        6
285        6
293        6
1124       6
Find the cluster with the largest number of members, assuming that the data consist only of the
background and one stellar cluster.
n_max = c.index[np.argmax(c)]
result = result_hdbscan[result_hdbscan['hdbscan'] == n_max]
result
                source_id          ra  ra_error        dec  dec_error  parallax  …
119    603785987076155392  134.065048  0.079601  10.469566   0.045647  1.005484
372    603848521800034176  134.004906  0.020841  10.788134   0.011424  1.187326
889    604003037543393920  134.634072  0.294349  11.514641   0.211293  1.021565
970    604024585394575616  135.082808  0.582022  11.718627   0.392696  1.092838
1159   604612549237529600  133.692016  0.015389  11.018895   0.007980  1.097347
...
14532  597664700902078976  132.323039  0.021296   9.606395   0.012219  1.135177
14777  597712426578737792  133.589252  0.013260   9.631878   0.006607  1.151925
14825  597724757429410048  133.852332  0.481212   9.910795   0.264615  1.162087
14900  597743311687984768  133.498198  0.143220  10.020892   0.084263  1.389898
15297  597830722862488064  133.851594  0.097605  10.549505   0.047888  1.095713
1422 rows × 29 columns
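The choice of the most populated cluster can also be written directly on the label array. This is a minimal sketch with a hypothetical helper `largest_cluster_label` and toy labels; it assumes, as the notebook does, that noise is labelled -1 and that at least one source was assigned to a cluster.

```python
import numpy as np

def largest_cluster_label(labels):
    """Return the most populated non-noise label and its size (noise is labelled -1)."""
    labels = np.asarray(labels)
    clustered = labels[labels >= 0]            # drop the background / noise points
    values, counts = np.unique(clustered, return_counts=True)
    best = np.argmax(counts)
    return int(values[best]), int(counts[best])

# Toy labels: six noise points and three clusters of sizes 2, 4 and 3.
toy_cluster_labels = np.array([-1, -1, 0, 0, 1, 1, 1, 1, -1, 2, 2, 2, -1, -1, -1])
label, size = largest_cluster_label(toy_cluster_labels)
print(label, size)
```

Picking the largest group is only reasonable under the stated assumption of a single cluster in the field; with several clusters in the cone, the labels would need to be inspected individually.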
Visualization II (Result)
Spatial Distribution
fig = plt.figure(figsize=(6, 6))
ax = plt.subplot()
plt.plot(pprocessdata['ra'], pprocessdata['dec'], '.', mec='silver', mfc='darkgray', markersi…
plt.plot(result['ra'], result['dec'], 'o', mfc='tab:orange', markersize=2., label="HDBSCAN")
plt.xlabel(r'$\alpha$ (deg)')
plt.ylabel(r'$\delta$ (deg)')
plt.legend()
plt.show()
[Figure: spatial distribution of all sources with the HDBSCAN members overplotted]
Vector Point Diagram
fig = plt.figure(figsize=(6, 6))
plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], '.', mec='silver', mfc='darkgray', mark…
plt.plot(result['pmra'], result['pmdec'], 'o', mfc='tab:orange', mec='None', markersize…
plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
plt.xticks()
plt.yticks()
plt.xlim(-38, 38)
plt.ylim(-14, 38)
plt.legend()
plt.show()
[Figure: vector point diagram of all sources with the HDBSCAN members overplotted]
Color Magnitude Diagram
plt.figure(figsize=(6, 8))
plt.plot(pprocessdata['bp_rp'], pprocessdata['phot_g_mean_mag'], '.', mec='silver', mfc='dark…
plt.plot(result['bp_rp'], result['phot_g_mean_mag'], 'o', color='tab:orange', markersize=2.,…
plt.xlabel(r'$G_{BP}-G_{RP}$')
plt.ylabel(r'$G$ (mag)')
plt.xlim(0., 3.)
plt.gca().invert_yaxis()
plt.legend()
plt.show()
[Figure: color-magnitude diagram; legend: All sources, HDBSCAN]
Parallax Distribution
bins_all = np.arange(pprocessdata['parallax'].min(), pprocessdata['parallax'].max(), .01)
bins_sam = np.arange(result['parallax'].min(), result['parallax'].max(), .01)
plt.figure(figsize=(6, 4))
pprocessdata.parallax.hist(bins=bins_all, color='gray', label='All Sources')
result.parallax.hist(bins=bins_sam, color='orange', label='HDBSCAN')
plt.xlabel(r'$\omega$ (mas)')
plt.ylabel('Number of Sources')
plt.xlim(0, 5)
plt.xticks()
plt.yticks()
plt.legend()
plt.show()
[Figure: parallax distribution; x-axis: ω (mas), y-axis: Number of Sources]
1. Determine the center of the stellar cluster
ra_c = np.mean(result['ra'])
dec_c = np.mean(result['dec'])
pmra_c = np.mean(result['pmra'])
pmdec_c = np.mean(result['pmdec'])
parallax_mean = np.mean(result['parallax'])
distance = 1000/parallax_mean
print(ra_c, dec_c, pmra_c, pmdec_c, parallax_mean, distance)
fig = plt.figure(figsize=(6, 6))
ax = plt.subplot()
plt.plot(pprocessdata['ra'], pprocessdata['dec'], '.', mec='silver', mfc='darkgray', markersi…
plt.plot(result['ra'], result['dec'], 'o', mfc='tab:orange', markersize=2., label="HDBSCAN")
plt.plot(ra_c, dec_c, 'o', markersize=5, c='green', label='centre of cluster')
plt.xlabel(r'$\alpha$ (deg)')
plt.ylabel(r'$\delta$ (deg)')
plt.legend()
plt.show()
132.85333382840096 11.833583454686082 -10.960577721346255 -2.905743785149157 1.15488663…
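The centre and distance estimate above reduces to a few means and one inversion, sketched here on toy numbers. The function `cluster_centre` and the input values are hypothetical. Two simplifications are assumed: a plain mean of right ascension is only safe away from the 0°/360° wrap, and the distance is a naive inversion of the mean parallax (in mas) with no zero-point correction.

```python
import numpy as np

def cluster_centre(ra_deg, dec_deg, parallax_mas):
    """Mean position (deg), mean parallax (mas) and the naive distance 1000 / parallax in parsec."""
    ra_c = float(np.mean(ra_deg))
    dec_c = float(np.mean(dec_deg))
    parallax_mean = float(np.mean(parallax_mas))
    distance_pc = 1000.0 / parallax_mean
    return ra_c, dec_c, parallax_mean, distance_pc

# Toy members loosely resembling the values printed above (not real catalogue rows).
toy_ra = np.array([132.80, 132.85, 132.90])
toy_dec = np.array([11.80, 11.83, 11.86])
toy_parallax = np.array([1.10, 1.15, 1.20])
ra0, dec0, plx0, dist0 = cluster_centre(toy_ra, toy_dec, toy_parallax)
print(ra0, dec0, plx0, dist0)
```

A mean parallax of 1.15 mas corresponds to roughly 870 pc under this simple inversion.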
fig = plt.figure(figsize=(6, 6))
plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], '.', mec='silver', mfc='darkgray', mark…
plt.plot(result['pmra'], result['pmdec'], 'o', mfc='tab:orange', mec='None', markersize=5., l…
plt.plot(pmra_c, pmdec_c, 'o', markersize=5, c='green', label='centre of cluster')
plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
plt.xticks()
plt.yticks()
plt.xlim(-15,)
plt.ylim(-5, 0)
plt.legend()
plt.show()
len(result)
1422
Selecting some parameters to be calculated for all stars
allsource = pprocessdata[[
    'ra',
    'ra_error',
    'dec',
    'dec_error',
    'parallax',
    'parallax_error',
    'pmra',
    'pmra_error',
    'pmdec',
    'pmdec_error',
    'phot_g_mean_mag',
    'bp_rp'
]]
allsource.head()
           ra  ra_error        dec  dec_error  parallax  parallax_error       pmra  …
0  135.110808  0.179466  10.726941   0.144282  0.286789        0.189191  -0.675509
1  135.118378  0.223748  10.723134   0.181216  0.051501        0.257076  -6.506968
2  135.105567  0.031968  10.754417   0.023291  0.644799        0.041918  -0.661945
Sample Sources Selection
To select the sample sources, we choose a range of proper motions and parallaxes from all sources such
that the mean of the enclosed values is close to the mean proper motions ($\mu_{\alpha*}$, $\mu_{\delta}$) and the
mean parallax ($\varpi$) of the HDBSCAN cluster.
HDBSCAN_MEAN_PMRA = pmra_c
HDBSCAN_MEAN_PMDEC = pmdec_c
HDBSCAN_MEAN_PARALLAX = parallax_mean
PMRA_RANGE = 3.
PMDEC_RANGE = 3.
PARALLAX_RANGE = 0.4
samplesource = allsource[
    (allsource['pmra'] >= HDBSCAN_MEAN_PMRA-(PMRA_RANGE/2.)) & (allsource['pmra'] <= HDBSCAN_…
    (allsource['pmdec'] >= HDBSCAN_MEAN_PMDEC-(PMDEC_RANGE/2.)) & (allsource['pmdec'] <= HDBS…
    (allsource['parallax'] >= HDBSCAN_MEAN_PARALLAX-(PARALLAX_RANGE/2.)) & (allsource['paral…
].reset_index(drop=True)
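The box selection around the HDBSCAN means can be sketched as a single boolean mask. The helper `box_select` and the toy rows are hypothetical; the sketch assumes the selection is a symmetric window of total width RANGE (that is, ±RANGE/2) around each mean, which is what the visible part of the cell suggests.

```python
import numpy as np

def box_select(values, centres, widths):
    """Boolean mask for rows lying within +/- width/2 of the centre in every column."""
    values = np.asarray(values, dtype=float)
    lower = np.asarray(centres, dtype=float) - np.asarray(widths, dtype=float) / 2.0
    upper = np.asarray(centres, dtype=float) + np.asarray(widths, dtype=float) / 2.0
    return np.all((values >= lower) & (values <= upper), axis=1)

# Columns: pmra (mas/yr), pmdec (mas/yr), parallax (mas); toy rows only.
toy_sources = np.array([
    [-11.0, -2.9, 1.15],   # inside the box in all three dimensions
    [-11.0, -2.9, 1.50],   # parallax too large
    [-13.0, -2.9, 1.15],   # pmra too far from the mean
    [-10.0, -4.0, 1.00],   # still inside the box
])
mask = box_select(toy_sources, centres=(-10.96, -2.91, 1.155), widths=(3.0, 3.0, 0.4))
print(mask)
```

The mask can then index a DataFrame or array to produce the sample sources passed on to the mixture model.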
Vector Point Diagram
fig = plt.figure(figsize=(6, 6))
plt.plot(allsource['pmra'], allsource['pmdec'], '.', color="gray", markersize=2., label="All Sources")
plt.plot(samplesource['pmra'], samplesource['pmdec'], '.', color="blue", markersize=2., label="Sample Sources")
plt.xlabel(r"$\mu_{\alpha*}$ (mas/yr)")
plt.ylabel(r"$\mu_{\delta}$ (mas/yr)")
plt.title("Vector Point Diagram")
plt.xticks()
plt.yticks()
plt.xlim(-25,25)
plt.ylim(-25,25)
plt.legend()
plt.show()
[Figure: Vector Point Diagram; legend: All Sources, Sample Sources]
Parallax Distribution
bins_all = np.arange(allsource['parallax'].min(), allsource['parallax'].max(), .01)
bins_sam = np.arange(samplesource['parallax'].min(), samplesource['parallax'].max(), .01)
plt.figure(figsize=(6, 4))
allsource['parallax'].hist(bins=bins_all, color="gray", label="All Sources")
samplesource['parallax'].hist(bins=bins_sam, color="b", label="Sample Sources")
plt.xlabel(r"$\omega$ (mas)")
plt.ylabel("Number of Sources")
plt.xlim([0, 5])
plt.xticks()
plt.yticks()
plt.legend()
plt.show()
[Figure: parallax distribution; legend: All Sources, Sample Sources; y-axis: Number of Sources]
Color Magnitude Diagram
plt.figure(figsize=(6, 8))
plt.plot(allsource['bp_rp'], allsource['phot_g_mean_mag'], '.', color="gray", markersize…
plt.plot(samplesource['bp_rp'], samplesource['phot_g_mean_mag'], '.', color…, markersize…
plt.xlabel(r"$G_{BP}-G_{RP}$")
plt.ylabel(r"$G$ (mag)")
plt.xlim([0., 3.5])
plt.ylim(8,20)
plt.gca().invert_yaxis()
plt.legend()
plt.show()
print('All Sources = %d \nSample Sources = %d' % (len(allsource), len(samplesource)))
All Sources = 70365
Sample Sources = 1714
Normalize the data
df = samplesource[["pmra", "pmdec", "parallax"]]
df = df.to_numpy().astype("float32", copy = False)
stscaler_df = StandardScaler().fit(df)
df_ = stscaler_df.transform(df)
norm_pmra = df_[:,0]
norm_pmde = df_[:,1]
norm_para = df_[:,2]
Select some parameters to be calculated
sample_data_dict = {
    'norm_pmra' : norm_pmra,
    'norm_pmde' : norm_pmde,
    'norm_para' : norm_para,
}
sample_data = pd.DataFrame(sample_data_dict)
Train a Gaussian Mixture Model (GMM) on the whole sample with two Gaussian components (field and
cluster)
gmm = GaussianMixture(n_components=2, max_iter=1000, covariance_type='full', random_state=Non…
Calculate means, covariances and weights of trained/fitted models
gmm.means_, gmm.covariances_, gmm.weights_
(array([[ 0.04856147,  0.03984163,  0.00229148],
        [-0.01871317, -0.01535297, -0.00088304]]),
 array([[[ 2.99284569,  0.12718294, -0.09479567],
         [ 0.12718294,  3.024422  , -0.16097453],
         [-0.09479567, -0.16097453,  1.76752092]],
        [[ 0.23079895,  0.01806545,  0.02431149],
         [ 0.01806545,  0.21904246,  …         ],
         [ 0.02431149,  …         ,  0.70423423]]]),
 array([0.27816086, 0.72183914]))
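Because the order of the two mixture components is not guaranteed to be the same from run to run when no random seed is fixed, it can help to identify the cluster component explicitly before reading off probabilities. The sketch below is one possible heuristic, not part of the original notebook: it assumes the cluster is the more compact component and picks the one with the smallest covariance determinant. The helper name and the toy covariance matrices are hypothetical.

```python
import numpy as np

def cluster_component(covariances):
    """Index of the mixture component with the smallest covariance determinant (most compact)."""
    dets = np.linalg.det(np.asarray(covariances, dtype=float))
    return int(np.argmin(dets))

# Toy covariances shaped (n_components, n_features, n_features):
# a broad "field" component followed by a compact "cluster" component.
toy_covariances = np.array([
    np.diag([3.0, 3.0, 1.8]),
    np.diag([0.23, 0.22, 0.70]),
])
k_cluster = cluster_component(toy_covariances)
print(k_cluster)
```

With a fitted scikit-learn model using covariance_type='full', the same helper could be applied to gmm.covariances_, and the membership probability would then be the column k_cluster of gmm.predict_proba(...).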
Calculate the probabilities of the whole data
pred_data = gmm.predict_proba(sample_data)
pred_data
array([[8.51605686e-02, 9.14839439e-01],
       [1.00000000e+00, 1.22427721e-18],
       [1.00000000e+00, 1.06197402e-13],
       ...,
       [1.00000000e+00, 2.55563579e-10],
       [9.91619509e-01, 8.38049093e-03],
       [1.00000000e+00, 1.22142457e-12]])
Check the calculated probabilities
plt.hist(pred_data[:,0], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.])
plt.xlim([0., 1.])
plt.xlabel("Probability for mu_alpha (mas/yr)")
plt.ylabel("Number of sources")
plt.show()
[Figure: histogram of pred_data[:,0]; x-axis: Probability for mu_alpha (mas/yr), y-axis: Number of sources]
plt.hist(pred_data[:,1], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.])
plt.xlim([0., 1.])
plt.xlabel(r"Probability for $\mu_{\delta}$ (mas/yr)")
plt.ylabel("Number of sources")
plt.show()
[Figure: histogram of pred_data[:,1]; y-axis: Number of sources]
The Probabilities
samplesource['prob'] = pred_data[:,0]
print(samplesource['prob'])
NameError                                 Traceback (most recent call last)
in ()
----> 1 samplesource['prob'] = pred_data[:,0]
      2 print(samplesource['prob'])
NameError: name 'pred_data' is not defined
Determine the membership probability classes. According to Agarwal et al. (2021), there are three main
classes: member_high contains high probability members (P(x) > 0.6); member_moder contains moderate
probability members (0.2 < P(x) < 0.6); and member_low contains low probability members
(P(x) < 0.2). There is also one additional class: member_ultra contains ultra-high probability members
(P(x) > 0.8).
member_ultra = samplesource[samplesource['prob'] >= .8].reset_index(drop=True)
member_high = samplesource[samplesource['prob'] >= .6].reset_index(drop=True)
member_moder = samplesource[(samplesource['prob'] > .2) & (samplesource['prob'] < .6)].reset_…
member_low = samplesource[samplesource['prob'] <= .2].reset_index(drop=True)
print(member_ultra)
Stars with high probability values are automatically considered members of the cluster. Stars
with moderate probability values can be considered cluster members (member_incl) if their
parallax values lie within the parallax range of the ultra-high probability cluster members.
member_incl = member_moder[(member_moder['parallax'] >= member_ultra['parallax'].min()) &
                           (member_moder['parallax'] <= member_ultra['parallax'].max())].rese…
print('Sample Sources = %d \nHigh probability member sources (p >= 0.6) = %d \nModerate proba…
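The class definitions and the parallax rule for moderate-probability stars can be condensed into one mask, sketched below on toy values. The function `select_members` and the inputs are hypothetical; the thresholds follow the code cells above (high: p >= 0.6, moderate: 0.2 < p < 0.6, ultra-high: p >= 0.8).

```python
import numpy as np

def select_members(prob, parallax):
    """High-probability sources plus moderate ones whose parallax lies in the ultra-high range."""
    prob = np.asarray(prob, dtype=float)
    parallax = np.asarray(parallax, dtype=float)
    ultra = prob >= 0.8
    high = prob >= 0.6
    moderate = (prob > 0.2) & (prob < 0.6)
    if ultra.any():
        in_range = (parallax >= parallax[ultra].min()) & (parallax <= parallax[ultra].max())
    else:
        # Without ultra-high members there is no reference range, so nothing is included.
        in_range = np.zeros(prob.shape, dtype=bool)
    return high | (moderate & in_range)

toy_prob = np.array([0.95, 0.85, 0.70, 0.40, 0.40, 0.10])
toy_plx = np.array([1.10, 1.20, 1.50, 1.15, 1.40, 1.15])
is_member = select_members(toy_prob, toy_plx)
print(is_member)
```

Here the fourth source (p = 0.4, parallax 1.15 mas) is included because its parallax falls inside the 1.10 to 1.20 mas range of the ultra-high members, while the fifth (parallax 1.40 mas) is not.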
Combine member_high and member_incl to get all members.
member_all = pd.concat([member_high, member_incl]).sort_values(by=['prob'], ascending=False).…
len(member_all)
Calculate some important parameters
mean_para_val = np.mean(member_all['parallax'])
mean_para_std = np.std(member_all['parallax'])
member_dist = 1000./(member_all['parallax'])
mean_pmra_val = np.mean(member_all['pmra'])
mean_pmra_std = np.std(member_all['pmra'])
mean_pmde_val = np.mean(member_all['pmdec'])
mean_pmde_std = np.std(member_all['pmdec'])
mean_dist_val = np.mean(member_dist)
mean_dist_std = np.std(member_dist)
mean_pmra_val, mean_pmra_std, mean_pmde_val, mean_pmde_std, mean_para_val, mean_para_std, mea…
Visualization II (Result)
Probability Distribution
bins_samp = np.arange(samplesource['prob'].min(), samplesource['prob'].max(), .1)
bins_high = np.arange(samplesource['prob'][samplesource['prob'] >= .6].min(), samplesource['p…
bins_mode = np.arange(samplesource['prob'][(samplesource['prob'] >= .2) & (samplesource['prob…
                      (samplesource['parallax'] >= member_ultra['parallax'].…
                      (samplesource['parallax'] <= member_ultra['parallax'].…
                      samplesource['prob'][(samplesource['prob'] … .2) & (samplesource['prob…
                      (samplesource['parallax'] >= member_ultra['parallax'].…
                      (samplesource['parallax'] <= member_ultra['parallax'].…
bins = np.linspace(0., 1., 19)
plt.figure(figsize=(6, 4))
plt.hist(samplesource['prob'], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.], color="dark…
plt.hist(member_high['prob'], bins=[.6, .7, .8, .9, 1.], color='tab:orange', rwidth=.975, lab…
plt.hist(member_incl['prob'], bins=[.2, .3, .4, .5, .6], color='tab:green', rwidth=.975, labe…
plt.xlabel("Probability")
plt.ylabel("Number of Sources")
plt.xlim([0., 1.])
plt.xticks()
plt.yticks()
plt.legend()
plt.show()
[Figure: probability distribution; legend: Sample Sources, High Probability Members, Moderate Probability Members; x-axis: Probability, y-axis: Number of Sources]
Vector Point Diagram
fig = plt.figure(figsize=(6, 6))
plt.plot(samplesource['pmra'], samplesource['pmdec'], 'o', mec='silver', mfc='darkgray', mark…
plt.plot(member_high['pmra'], member_high['pmdec'], 'o', mfc='tab:orange', mec='None', marker…
plt.plot(member_incl['pmra'], member_incl['pmdec'], 'o', mfc='tab:green', mec='None', markers…
plt.xlabel(r"$\mu_{\alpha*}$ (mas/yr)")
plt.ylabel(r"$\mu_{\delta}$ (mas/yr)")
plt.xticks()
plt.yticks()
plt.title("Vector Point Diagram")
plt.legend()
plt.show()
[Figure: Vector Point Diagram; legend: Sample Sources, High probability (p >= 0.6), Moderate probability (0.2 <= p <= 0.6)]
Parallax and proper motions distribution
bins_samp = np.arange(samplesource['parallax'].min(), samplesource['parallax'].max(), .05)
bins_high = np.arange(member_high['parallax'].min(), member_high['parallax'].max(), .05)
bins_mode = np.arange(member_incl['parallax'].min(), member_incl['parallax'].max(), .05)
plt.figure()  #figsize=(6, 4))
samplesource['parallax'].hist(bins=bins_samp, color='silver', rwidth=.85, label="Sample Sourc…
member_high['parallax'].hist(bins=bins_high, color='tab:orange', rwidth=.85, label="High pro…
member_incl['parallax'].hist(bins=bins_mode, color…
plt.xlabel(r"$\omega$ (mas)")
plt.ylabel("Number of Sources")
plt.xticks()
plt.yticks()
plt.legend()
plt.show()
[Figure: parallax distribution; legend: Sample Sources, High probability (p >= 0.6), Moderate probability (0.2 <= p <= 0.6); x-axis: ω (mas), y-axis: Number of Sources]
Spatial distribution
fig = plt.figure(figsize=(6, 6))
plt.plot(samplesource['ra'], samplesource['dec'], 'o', mec='silver', mfc='darkgray', markersi…
plt.plot(member_high['ra'], member_high['dec'], 'o', … 'tab:orange', markersiz…
plt.plot(member_incl['ra'], member_incl['dec'], …
plt.xlabel(r'$\alpha$ (deg)')
plt.ylabel(r'$\delta$ (deg)')
plt.legend()
#ax.set_xticklabels([358.25, 358.5, 358.75, 359.0, 359.25, 359.5, 359.75, 0.00, 0.25], fontsi…
plt.show()
[Figure: spatial distribution; legend: Sample sources, High probability (p >= 0.6), Moderate probability (0.2 <= p <= 0.6)]
Color Magnitude Diagram
plt.figure(figsize=(6, 8))
plt.plot(samplesource['bp_rp'], samplesource['phot_g_mean_mag'], 'o', mec='silver', mfc='dark…
plt.plot(member_high['bp_rp'], member_high['phot_g_mean_mag'], 'o', color='tab:orange', marke…
plt.plot(member_incl['bp_rp'], member_incl['phot_g_mean_mag'], 'o', color='tab:green', marker…
plt.xlabel(r"$G_{BP}-G_{RP}$")
plt.ylabel(r"$G$ (mag)")
plt.xlim([0., 3.])
plt.gca().invert_yaxis()
plt.legend()
plt.show()
[Figure: color-magnitude diagram; legend: Sample Sources, High probability (p >= 0.6), Moderate probability (0.2 <= p <= 0.6); y-axis: G (mag)]
plt.figure(figsize=(6, 8))
plt.plot(member_all['bp_rp'], member_all['phot_g_mean_mag'], 'o', color='tab:blue', markersiz…
plt.xlabel(r"$G_{BP}-G_{RP}$")
plt.ylabel(r"$G$ (mag)")
plt.xlim([0., 3.])
plt.ylim([10,20])
plt.gca().invert_yaxis()
plt.legend()
plt.show()
[Figure: color-magnitude diagram of all members; legend: All members; y-axis: G (mag)]
y = member_all['phot_g_mean_mag']
x = member_all['bp_rp']
y = y[x > 0.5]
ya = x[x > 0.5]
xa = y + 5 - 5*np.log10(distance)
print(len(x), len(y))
ax = az.plot_kde(ya, rug=True)
plt.show()
plt.close()
ax = az.plot_kde(xa, rug=True)
plt.show()
plt.close()
ax = az.plot_kde(xa, values2=ya, contour=False, pcolormesh_kwargs={"cmap": "inferno"}, legend=…
ax.invert_yaxis()
plt.show()
plt.close()
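The term 5 - 5*log10(distance) in the cell above is the distance modulus that converts an apparent magnitude into an absolute one. A minimal sketch follows, with a hypothetical helper name; it assumes the distance is in parsec and ignores interstellar extinction.

```python
import math

def absolute_magnitude(apparent_mag, distance_pc):
    """M = m + 5 - 5*log10(d), with d in parsec and no extinction correction."""
    return apparent_mag + 5.0 - 5.0 * math.log10(distance_pc)

# A star of apparent magnitude 15 at 1000 pc, and one of magnitude 10 at the reference 10 pc.
m_abs_far = absolute_magnitude(15.0, 1000.0)
m_abs_ref = absolute_magnitude(10.0, 10.0)
print(m_abs_far, m_abs_ref)
```

At the reference distance of 10 pc the apparent and absolute magnitudes coincide, and every factor of 10 in distance shifts the magnitude by 5.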
X_train, X_test, y_train, y_test = train_test_split(xa, ya, …)
rmse_list = []
r2_list = []
for i in range(7, 17):
    for j in range(17):
        knots = 4
        degree = j  # try different knots and degree values
        try:
            X_spline = dmatrix("bs(x, df = "+str(knots)+", degree = "+str(degree)+", include_interce…
            spline_fit = sm.GLM(y_train, X_spline).fit()
            y_pred_train = spline_fit.predict(dmatrix('bs(test, df = '+str(knots)+', degree = '+str…
            rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
            print("root mean square error for training set ", rmse_train)
            print("r2 score for training set ", r2_score(y_train, y_pred_train))
            y_pred = spline_fit.predict(dmatrix('bs(test, df = '+str(knots)+', degree = '+str(degre…
            rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
            print("root mean square error for training set ", rmse_test)
            print("r2 score for training set ", r2_score(y_test, y_pred))
            rmse_list.append([rmse_train, rmse_test])
            r2_list.append([r2_score(y_train, y_pred_train), r2_score(y_test, y_pred)])
            range_pred = np.linspace(np.min(X_train), np.max(X_train), 50)
            prediction = spline_fit.predict(dmatrix('bs(xp, df = '+str(knots)+', degree = '+str(deg…
            plt.figure(figsize=(7,7))
            plt.plot(range_pred, prediction, color='r', label='Specifying degree = '+str(degree)+"…
            plt.scatter(xa, ya, color='blue', alpha=0.3, edgecolor='k')
            plt.xlabel('Color')
            plt.ylabel("G")
            plt.legend()
            #plt.scatter(member_all['bp_rp'].tolist(), member_all['phot_g_mean_mag'].tolist(), face…
            ax = plt.gca()
            ax.invert_yaxis()
            plt.show()
            plt.close()
        except Exception:
            print("fail")
print(rmse_list)
print(r2_list)
rmse_list = np.array(rmse_list)
r2_list = np.array(r2_list)
#print(np.max(range_pred), np.min(range_pred))
print(min(rmse_list[:,1]))
[Figure: spline fit over the member sequence; x-axis: Color]
root mean square error for training set 0.1359167441682956
r2 score for training set 0.9470239280007002
root mean square error for training set 0.17200044938890915
r2 score for training set 0.9199368439062416
[Figure: spline fit, "Specifying degree = 1 with 8 knots"; x-axis: Color]
root mean square error for training set 0.13177983847404448
r2 score for training set 0.9501997216524329
root mean square error for training set 0.1597061776926086
r2 score for training set 0.9309733217975995
[Figure: spline fit, "Specifying degree = 2 with 8 knots"]
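The train/test comparison used to choose the fit can be illustrated without the spline machinery. The sketch below is a stand-in and not the notebook's method: it fits ordinary polynomials with NumPy instead of B-splines, on synthetic data, and computes the same two scores (RMSE and R²) on a held-out subset. All names and numbers here are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(1)
# Synthetic curved sequence with noise, standing in for colour versus absolute magnitude.
x_toy = rng.uniform(0.0, 10.0, 200)
y_toy = 0.5 + 0.02 * x_toy ** 2 + rng.normal(0.0, 0.05, 200)

# Hold out 30% of the points for testing.
order = rng.permutation(x_toy.size)
train, test = order[:140], order[140:]

scores = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x_toy[train], y_toy[train], degree)
    pred = np.polyval(coeffs, x_toy[test])
    scores[degree] = (rmse(y_toy[test], pred), r2(y_toy[test], pred))
    print(degree, scores[degree])
```

As in the notebook, the preferred model is the one with the lowest test RMSE; a test score much worse than the training score would point to overfitting.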