Cluster HDBSCAN and GMM

DETERMINATION OF MEMBERSHIP, CLUSTER AND CLUSTER CENTRE FOR M67 USING HDBSCAN, PLOTTING COLOR MAGNITUDE DIAGRAM AND FITTING ISOCHRONE

1. Introduction

Open clusters have long been regarded as powerful tools for studying the Galactic disk and the evolution of stars (Chen, 2003). Membership determination is the first step in studying an open cluster, and it directly influences the estimation of physical parameters. Methods for membership determination are based on proper motions, radial velocities, photometric data, or combinations of these. Machine-learning algorithms applied to this problem include DBSCAN (Gao, 2014), the Gaussian Mixture Model (Gao, 2020), k-means (El Aziz et al., 2016), k-th nearest neighbour (Gao, 2016), ML-MOC (Agarwal et al. 2021), and many more.

2. Objectives

The aims of the workshop are:
1. to determine the center of the open cluster; and
2. to determine the membership probability.

3. Theoretical Background

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a natural evolution of DBSCAN, released in the past few years, almost 20 years after DBSCAN (Ester et al. 1996). DBSCAN identifies clusters as overdensities in a multidimensional space: regions in which the number of sources within a neighbourhood of linking length ε reaches a required minimum number of points (minPts).

Fig 1. Illustration of DBSCAN (source: medium.com/@agarwalvibhor84). Red: core points. Yellow: border points, still part of the cluster because they lie within epsilon of a core point, but they do not meet the min_points criterion. Blue: noise points, not assigned to a cluster.

HDBSCAN works in a similar way, except that the user only needs to set a minimum cluster size. It does not depend on ε; instead it condenses the minimum spanning tree by pruning off the nodes that do not meet the minimum number of sources in a cluster, and reanalyzing the nodes that do (Kounkel & Covey, 2019). It not only sets the density threshold automatically, it also does so locally, so clusters with different densities can be recovered in different areas of a dataset.

4. Data

The data used here come from Gaia Early Data Release 3 (Gaia eDR3, Gaia Collaboration 2016b; 2020a). The third early data release (eDR3, Gaia Collaboration et al. 2018) of the ESA Gaia space mission (Gaia Collaboration et al. 2016b) is by far the deepest and most precise astrometric catalogue ever obtained, with nominal proper-motion uncertainties a hundred times smaller than those of UCAC4 and PPMXL.

We download sources from Gaia eDR3 in a cone around the cluster centre, with a radius greater than the tidal radius of the cluster. Although our algorithm is quite robust to the choice of this initial radius, we download sources within a radius of 180 arcmin from the cluster centre. Next, we select the sources that satisfy the following criteria (Agarwal et al. 2021):
1. Each source must have the five astrometric parameters (positions, proper motions, and parallax) as well as valid measurements in the three photometric passbands G, GBP, and GRP in the Gaia eDR3 catalogue.
2. Its parallax must be non-negative.
3. To eliminate sources with high uncertainty while still retaining a fraction of sources down to G ~ 21 mag, the error in its G magnitude must be less than 0.005.

You can download the M67 data here.

5. Workshop Structure

The workshop is made up of two Jupyter Notebooks. The layout of the workshop is as follows:
1. Determine the center of the open cluster.
2. Determine the membership probability.
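To make the core, border, and noise categories from Section 3 concrete, here is a minimal pure-Python sketch of how DBSCAN classifies points for a given ε and minPts. The function name `classify_points` and the toy coordinates are hypothetical, and the sketch assumes that a point counts itself among its neighbours (the convention used by scikit-learn). It illustrates the definitions only; it is not the clustering algorithm used in the notebooks.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the DBSCAN definitions."""
    # Neighbourhood of each point: indices of all points within eps (itself included).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # A core point has at least min_pts points in its neighbourhood.
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = []
    for i, nb in enumerate(neighbours):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy data: a tight square of four points, one point on its edge, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.25, 0.1), (2.0, 2.0)]
labels = classify_points(points, eps=0.2, min_pts=4)
for p, lab in zip(points, labels):
    print(p, lab)
```

With these toy values the four square corners come out as core points, the point at (0.25, 0.1) as a border point, and the isolated point as noise. HDBSCAN removes the need to choose `eps` by effectively examining all linking lengths and keeping the clusters that persist.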
Our membership assignment relies on the astrometric solution; we used the Gaia eDR3 photometry only to confirm manually that the identified groups match the expected appearance of a cluster in a color-magnitude diagram.

Part 1: Determine the Center of the Open Cluster

To determine the membership of the open cluster M67, we use the Python module hdbscan (McInnes et al. 2017). If this notebook is run on Google Colab, we first need to install some libraries.

Import the required packages

    !pip install hdbscan
    import math
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import hdbscan
    from sklearn.preprocessing import StandardScaler
    from astropy.coordinates import SkyCoord
    import astropy.units as u
    from sklearn.mixture import GaussianMixture
    import arviz as az
    from patsy import dmatrix
    import statsmodels.formula.api as smf
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

Set some plotting configurations

    SMALL_SIZE = 12
    MEDIUM_SIZE = 14
    BIGGER_SIZE = 20
    plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
    plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
    plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
    plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
    # plt.rc('legend', fontsize=10)          # legend fontsize
    %matplotlib inline

Data Preparation

Import data file

    FILENAME = "gaiaedr3_150_M67.csv"
    datafile = pd.read_csv(FILENAME, delimiter=",")  # delimiter value is cut off in the source; a comma is assumed
    datafile
    datafile.info()

Output:

    RangeIndex: 115670 entries, 0 to 115669
    Data columns (total 24 columns):
     #   Column                     Non-Null Count   Dtype
     0   source_id                  115670 non-null  int64
     1   ra                         115670 non-null  float64
     2   ra_error                   115670 non-null  float64
     3   dec                        115670 non-null  float64
     4   dec_error                  115670 non-null  float64
     5   parallax                   95584 non-null   float64
     6   parallax_error             95584 non-null   float64
     7   parallax_over_error        95584 non-null   float64
     8   pm                         95584 non-null   float64
     9   pmra                       95584 non-null   float64
     10  pmra_error                 95584 non-null   float64
     11  pmdec                      95584 non-null   float64
     12  pmdec_error                95584 non-null   float64
     13  phot_g_mean_flux           115551 non-null  float64
     14  phot_g_mean_flux_error     115551 non-null  float64
     15  phot_g_mean_mag            115551 non-null  float64
     16  phot_bp_mean_flux          114016 non-null  float64
     17  phot_bp_mean_flux_error    114016 non-null  float64
     18  phot_bp_mean_mag           114016 non-null  float64
     19  phot_rp_mean_flux          114350 non-null  float64
     20  phot_rp_mean_flux_error    114350 non-null  float64
     21  phot_rp_mean_mag           114350 non-null  float64
     22  dr2_radial_velocity        1571 non-null    float64
     23  dr2_radial_velocity_error  1571 non-null    float64
    dtypes: float64(23), int64(1)
    memory usage: 21.2 MB

Handling the missing values

    datafile.dropna(subset=['pmra', 'pmdec', 'parallax']).reset_index()

Output (first columns only):

            index           source_id          ra  ra_error        dec  dec_error   parallax
    0           0  603712422876575360  135.110808  0.179466  10.726941   0.144282   0.286789
    1           1  603712427171528448  135.118378  0.223748  10.723134   0.181216   0.051501
    2           2  603712457236324224  135.121338  0.523702  10.738075   0.395677  -0.012734
    3           3  603712660315544576  135.130346  0.903004  10.744922   0.761642   0.781963
    4           4  603713041351907200  135.140797  0.336336  10.772432   0.245219   0.951278
    ...
    95579  115665  603622950118859520  135.003350  1.049273  10.782234   0.879750   0.584927
    95580  115666  603623018837269632  134.996200  0.473971  10.786565   0.302199  -0.134166
    95581  115667  603623018837270016  134.986138  0.489632  10.790086   0.348764   1.001567
    95582  115668  603623053197008512  134.998781  0.346450  10.791484   0.283039   0.518023

To eliminate sources with high uncertainty while still retaining a fraction of sources down to G ~ 21 mag, we keep only sources whose G-magnitude error is less than 0.005. Calculate the errors of G ($|\sigma_G|$), GBP ($|\sigma_{BP}|$), and GRP ($|\sigma_{RP}|$) from the flux $F$ and flux error $\sigma_F$ in each band:

$$|\sigma_G| = \left| -\frac{2.5}{\ln 10} \frac{\sigma_{F_G}}{F_G} \right|, \quad |\sigma_{BP}| = \left| -\frac{2.5}{\ln 10} \frac{\sigma_{F_{BP}}}{F_{BP}} \right|, \quad |\sigma_{RP}| = \left| -\frac{2.5}{\ln 10} \frac{\sigma_{F_{RP}}}{F_{RP}} \right|$$

Add the columns e_Gmag, e_BPmag, e_RPmag, and bp_rp (to plot the color-magnitude diagram easily), and recompute parallax_over_error.

    datafile['e_Gmag'] = abs(-2.5*datafile['phot_g_mean_flux_error']/math.log(10)/datafile['phot_g_mean_flux'])
    datafile['e_BPmag'] = abs(-2.5*datafile['phot_bp_mean_flux_error']/math.log(10)/datafile['phot_bp_mean_flux'])
    datafile['e_RPmag'] = abs(-2.5*datafile['phot_rp_mean_flux_error']/math.log(10)/datafile['phot_rp_mean_flux'])
    datafile['bp_rp'] = datafile['phot_bp_mean_mag'] - datafile['phot_rp_mean_mag']
    datafile['parallax_over_error'] = datafile['parallax'] / datafile['parallax_error']

Select data with positive parallax ($\omega > 0$) and G-magnitude error $|\sigma_G| < 0.005$

    pprocessdata = datafile[(datafile['parallax'] > 0) & (datafile['e_Gmag'] < 0.005)].reset_index(drop=True)
    pprocessdata

Output (first columns only):

                    source_id          ra  ra_error        dec  dec_error  parallax
    0      603712422876575360  135.110808  0.179466  10.726941   0.144282  0.286789
    1      603712427171528448  135.118378  0.223748  10.723134   0.181216  0.051501
    2      603715588267458176  135.105567  0.031968  10.754417   0.023291  0.644799
    3      603715618332282624  135.119962  0.312882  10.770758   0.246001  1.489228
    4      603715622627196288  135.120392  0.016787  10.761850   0.012744  2.166628
    ...
    70360  603622847038577280  134.983284  0.238657  10.768340   0.175209  0.577788
    70361  603622881398315520  134.968256  0.477930  10.767039   0.351956  0.068398
    70362  603622881398315776  134.965021  0.359085  10.770651   0.256683  0.671293
    70363  603623018837270016  134.986138  0.489632  10.790086   0.348764  1.001567
    70364  603623053197008512  134.998781  0.346450  10.791484   0.283039  0.518023
    70365 rows x 28 columns

Select pmra, pmdec and parallax for clustering

    df = pprocessdata[["pmra", "pmdec", "parallax"]]
    df = df.to_numpy().astype("float32", copy=False)

Visualization I

Spatial Distribution

    fig = plt.figure(figsize=(6, 6))
    plt.plot(pprocessdata['ra'], pprocessdata['dec'], ',')
    plt.xlabel(r'$\alpha$ (deg) Right Ascension')
    plt.ylabel(r'$\delta$ (deg) Declination')
    plt.title('Spatial Distribution of all stars')
    plt.show()

[Figure: Spatial Distribution of all stars]

Vector Point Diagram

    fig = plt.figure(figsize=(6, 6))
    plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], ',')
    plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
    plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
    plt.title('Plotting Proper Motion as Vector Point Diagram')
    plt.ylim(-50, 50)
    plt.xlim(-50, 50)
    plt.show()

[Figure: Plotting Proper Motion as Vector Point Diagram]

Color Magnitude Diagram

    fig = plt.figure(figsize=(6, 8))
    plt.plot(pprocessdata['bp_rp'], pprocessdata['phot_g_mean_mag'], ',')
    ax = plt.gca()
    ax.invert_yaxis()
    #plt.xlim(0., 3.)
    plt.title("Color Magnitude Diagram")
    plt.xlabel('bp - rp')
    plt.ylabel('g')
    plt.show()

[Figure: Color Magnitude Diagram]

Normalize the data and run HDBSCAN

    stscaler_df = StandardScaler().fit(df)
    df_ = stscaler_df.transform(df)
    clus_size = 2 * df_.shape[1]  # minimum cluster size: twice the number of dimensions
    clusterer = hdbscan.HDBSCAN(min_cluster_size=clus_size)
    cluster_labels = clusterer.fit_predict(df_)
    pprocessdata['hdbscan'] = cluster_labels

Vector Point Diagram for every HDBSCAN cluster

    fig, ax = plt.subplots()  # figsize=(6, 6)
    plot = ax.scatter(pprocessdata['pmra'], pprocessdata['pmdec'], s=5, c=pprocessdata['hdbscan'])  # any further arguments are cut off in the source
    fig.colorbar(plot, ax=ax)
    ax = plt.gca()
    ax.invert_yaxis()
    plt.xlim(-50, 50)
    plt.ylim(-50, 50)
    plt.title('Vector Point Diagram for every HDBSCAN cluster')
    plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
    plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
    plt.show()

[Figure: Vector Point Diagram for every HDBSCAN cluster, coloured by cluster label]

Distribution of stars inside each cluster and the number of members from each clustering result.

    plt.figure(figsize=(6, 4))
    plt.hist(pprocessdata['hdbscan'])
    plt.xlabel('Label of Cluster')
    plt.ylabel('Number of Sources')
    plt.title('Distribution of stars in each HDBSCAN cluster')
    plt.show()
    plt.close()
    pprocessdata['hdbscan'].value_counts()

[Figure: Distribution of stars in each HDBSCAN cluster]

Output:

    -1      (count illegible in source)
    537     1422
    677       84
    1143      70
    176       53
            ...
    249        6
    491        6
    140        6
    569        6
    Name: hdbscan, Length: 1188, dtype: int64

Separate out the data labelled as background (label = -1).
    result_hdbscan = pprocessdata[pprocessdata['hdbscan'] >= 0].reset_index(drop=True)
    c = result_hdbscan['hdbscan'].value_counts()
    print(c)

Output:

    537     1422
    677       84
    1143      70
    176       53
    1080      51
            ...
    221        6
    259        6
    285        6
    293        6
    1124       6

Find the cluster with the most members, assuming the data consist only of the background and one stellar cluster.

    n_max = c.index[np.argmax(c)]
    result = result_hdbscan[result_hdbscan['hdbscan'] == n_max]
    result

Output (first columns only):

                    source_id          ra  ra_error        dec  dec_error  parallax
    119    603785987076155392  134.065048  0.079601  10.469566   0.045647  1.005484
    372    603848521800034176  134.004906  0.020841  10.788134   0.011424  1.187326
    889    604003037543393920  134.634072  0.294349  11.514641   0.211293  1.021565
    970    604024585394575616  135.082808  0.582022  11.718627   0.392696  1.092838
    1159   604612549237529600  133.692016  0.015389  11.018895   0.007980  1.097347
    ...
    14532  597664700902078976  132.323039  0.021296   9.606395   0.012219  1.135177
    14777  597712426578737792  133.589252  0.013260   9.631878   0.006607  1.151925
    14825  597724757429410048  133.852332  0.481212   9.910795   0.264615  1.162087
    14900  597743311687984768  133.498198  0.143220  10.020892   0.084263  1.389898
    15297  597830722862488064  133.851594  0.097605  10.549505   0.047888  1.095713
    1422 rows x 29 columns

Visualization II (Result)

Spatial Distribution

    fig = plt.figure(figsize=(6, 6))
    ax = plt.subplot()
    plt.plot(pprocessdata['ra'], pprocessdata['dec'], '.', mec='silver', mfc='darkgray')  # remaining arguments truncated in source
    plt.plot(result['ra'], result['dec'], 'o', mfc='tab:orange', markersize=2., label="HDBSCAN")
    plt.xlabel(r'$\alpha$ (deg)')
    plt.ylabel(r'$\delta$ (deg)')
    plt.legend()
    plt.show()

[Figure: spatial distribution of all sources and the HDBSCAN cluster]

Vector Point Diagram

    fig = plt.figure(figsize=(6, 6))
    plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], '.', mec='silver', mfc='darkgray')  # remaining arguments truncated in source
    plt.plot(result['pmra'], result['pmdec'], 'o', mfc='tab:orange', mec='None', markersize=5.)  # remaining arguments truncated in source
    plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
    plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
    plt.xticks()
    plt.yticks()
    plt.xlim(-30, 30)
    plt.ylim(-14, 30)
    plt.legend()
    plt.show()

[Figure: vector point diagram of all sources and the HDBSCAN cluster]

Color Magnitude Diagram

    plt.figure(figsize=(6, 8))
    plt.plot(pprocessdata['bp_rp'], pprocessdata['phot_g_mean_mag'], '.', mec='silver', mfc='darkgray', label='All sources')  # partly truncated in source; label taken from the figure legend
    plt.plot(result['bp_rp'], result['phot_g_mean_mag'], 'o', color='tab:orange', markersize=2., label='HDBSCAN')  # partly truncated in source; label taken from the figure legend
    plt.xlabel(r'$G_{BP}-G_{RP}$')
    plt.ylabel(r'$G$ (mag)')
    plt.xlim(0., 3.)
    plt.gca().invert_yaxis()
    plt.legend()
    plt.show()

[Figure: color-magnitude diagram; legend: All sources, HDBSCAN]

Parallax Distribution

    bins_all = np.arange(pprocessdata['parallax'].min(), pprocessdata['parallax'].max(), .01)
    bins_sam = np.arange(result['parallax'].min(), result['parallax'].max(), .01)
    plt.figure(figsize=(6, 4))
    pprocessdata.parallax.hist(bins=bins_all, color='gray', label='All Sources')
    result.parallax.hist(bins=bins_sam, color='orange', label='HDBSCAN')
    plt.xlabel(r'$\omega$ (mas)')
    plt.ylabel('Number of Sources')
    plt.xlim(0, 5)
    plt.xticks()
    plt.yticks()
    plt.legend()
    plt.show()

[Figure: parallax histogram, Number of Sources against ω (mas)]

1. Determine the center of the stellar cluster

    ra_c = np.mean(result['ra'])
    dec_c = np.mean(result['dec'])
    pmra_c = np.mean(result['pmra'])
    pmdec_c = np.mean(result['pmdec'])
    parallax_mean = np.mean(result['parallax'])
    distance = 1000/parallax_mean  # distance in parsec for a parallax in mas
    print(ra_c, dec_c, pmra_c, pmdec_c, parallax_mean, distance)

Output (cut off in the source after the start of the parallax value):

    132.85333382840096 11.833583454686082 -10.960577721346255 -2.905743785149157 1.15488663...

    fig = plt.figure(figsize=(6, 6))
    ax = plt.subplot()
    plt.plot(pprocessdata['ra'], pprocessdata['dec'], '.', mec='silver', mfc='darkgray')  # remaining arguments truncated in source
    plt.plot(result['ra'], result['dec'], 'o', mfc='tab:orange', markersize=2., label="HDBSCAN")
    plt.plot(ra_c, dec_c, 'o', markersize=5, c='green', label='centre of cluster')
    plt.xlabel(r'$\alpha$ (deg)')
    plt.ylabel(r'$\delta$ (deg)')
    plt.legend()
    plt.show()

    fig = plt.figure(figsize=(6, 6))
    plt.plot(pprocessdata['pmra'], pprocessdata['pmdec'], '.', mec='silver', mfc='darkgray')  # remaining arguments truncated in source
    plt.plot(result['pmra'], result['pmdec'], 'o', mfc='tab:orange', mec='None', markersize=5.)  # remaining arguments truncated in source
    plt.plot(pmra_c, pmdec_c, 'o', markersize=5, c='green', label='centre of cluster')
    plt.xlabel(r'$\mu_{\alpha*}$ (mas/yr)')
    plt.ylabel(r'$\mu_{\delta}$ (mas/yr)')
    plt.xticks()
    plt.yticks()
    plt.xlim(-15,)
    plt.ylim(-5, 0)
    plt.legend()
    plt.show()

    len(result)

Output:

    1422

Select some parameters to be calculated for all stars

    allsource = pprocessdata[['ra', 'ra_error', 'dec', 'dec_error', 'parallax', 'parallax_error',
                              'pmra', 'pmra_error', 'pmdec', 'pmdec_error', 'phot_g_mean_mag', 'bp_rp']]
    allsource.head()

Output (first rows and columns only):

               ra  ra_error        dec  dec_error  parallax  parallax_error       pmra
    0  135.110808  0.179466  10.726941   0.144282  0.286789        0.189191  -0.675509
    1  135.118378  0.223748  10.723134   0.181216  0.051501        0.257076  -6.506968
    2  135.105567  0.031968  10.754417   0.023291  0.644799        0.041918  -0.661945

Sample Sources Selection

To select the sample sources, we keep all sources whose proper motions and parallax fall within a range centred on the HDBSCAN mean proper motions ($\mu_{\alpha*}$, $\mu_{\delta}$) and mean parallax ($\omega$), so that the mean of the enclosed values stays close to the HDBSCAN means.

    HDBSCAN_MEAN_PMRA = pmra_c
    HDBSCAN_MEAN_PMDEC = pmdec_c
    HDBSCAN_MEAN_PARALLAX = parallax_mean
    PMRA_RANGE = 3.
    PMDEC_RANGE = 3.
    PARALLAX_RANGE = 0.4
    # Keep sources within half a range on either side of each HDBSCAN mean.
    samplesource = allsource[
        (allsource['pmra'] >= HDBSCAN_MEAN_PMRA - (PMRA_RANGE/2.)) & (allsource['pmra'] <= HDBSCAN_MEAN_PMRA + (PMRA_RANGE/2.)) &
        (allsource['pmdec'] >= HDBSCAN_MEAN_PMDEC - (PMDEC_RANGE/2.)) & (allsource['pmdec'] <= HDBSCAN_MEAN_PMDEC + (PMDEC_RANGE/2.)) &
        (allsource['parallax'] >= HDBSCAN_MEAN_PARALLAX - (PARALLAX_RANGE/2.)) & (allsource['parallax'] <= HDBSCAN_MEAN_PARALLAX + (PARALLAX_RANGE/2.))
    ].reset_index(drop=True)

Vector Point Diagram

    fig = plt.figure(figsize=(6, 6))
    plt.plot(allsource['pmra'], allsource['pmdec'], '.', color='gray', markersize=2., label='All Sources')
    plt.plot(samplesource['pmra'], samplesource['pmdec'], '.', color='blue', markersize=2., label='Sample Sources')
    plt.xlabel(r"$\mu_{\alpha*}$ (mas/yr)")
    plt.ylabel(r"$\mu_{\delta}$ (mas/yr)")
    plt.title("Vector Point Diagram")
    plt.xticks()
    plt.yticks()
    plt.xlim(-25, 25)
    plt.ylim(-25, 25)
    plt.legend()
    plt.show()

[Figure: Vector Point Diagram; legend: All Sources, Sample Sources]

Parallax Distribution

    bins_all = np.arange(allsource['parallax'].min(), allsource['parallax'].max(), .01)
    bins_sam = np.arange(samplesource['parallax'].min(), samplesource['parallax'].max(), .01)
    plt.figure(figsize=(6, 4))
    allsource['parallax'].hist(bins=bins_all, color='gray', label='All Sources')
    samplesource['parallax'].hist(bins=bins_sam, color='b', label='Sample Sources')
    plt.xlabel(r"$\omega$ (mas)")
    plt.ylabel("Number of Sources")
    plt.xlim([0, 5])
    plt.xticks()
    plt.yticks()
    plt.legend()
    plt.show()

[Figure: parallax histogram; legend: All Sources, Sample Sources]

Color Magnitude Diagram

    plt.figure(figsize=(6, 8))
    plt.plot(allsource['bp_rp'], allsource['phot_g_mean_mag'], '.', color='gray')  # remaining arguments truncated in source
    plt.plot(samplesource['bp_rp'], samplesource['phot_g_mean_mag'], '.')  # remaining arguments truncated in source
    plt.xlabel(r"$G_{BP}-G_{RP}$")
    plt.ylabel(r"$G$ (mag)")
    plt.xlim([0., 3.5])
    plt.ylim(8, 20)
    plt.gca().invert_yaxis()
    plt.legend()
    plt.show()
    print('All Sources = %d \nSample Sources = %d' % (len(allsource), len(samplesource)))

Output:

    All Sources = 70365
    Sample Sources = 1714

Normalize the data

    df = samplesource[["pmra", "pmdec", "parallax"]]
    df = df.to_numpy().astype("float32", copy=False)
    stscaler_df = StandardScaler().fit(df)
    df_ = stscaler_df.transform(df)
    norm_pmra = df_[:,0]
    norm_pmde = df_[:,1]
    norm_para = df_[:,2]

Select some parameters to be calculated

    sample_data_dict = {
        'norm_pmra': norm_pmra,
        'norm_pmde': norm_pmde,
        'norm_para': norm_para,
    }
    sample_data = pd.DataFrame(sample_data_dict)

Train a Gaussian Mixture Model (GMM) on the whole sample with two Gaussian components (field and cluster)

    # The end of this call is truncated in the source; fitting on sample_data is assumed.
    gmm = GaussianMixture(n_components=2, max_iter=1000, covariance_type='full', random_state=None).fit(sample_data)

Inspect the means, covariances and weights of the fitted model

    gmm.means_, gmm.covariances_, gmm.weights_

Output (entries shown as ... are illegible in the source):

    (array([[ 0.04856147,  0.03984163,  0.00229148],
            [-0.01871317, -0.01535297, -0.00088304]]),
     array([[[ 2.99284569,  0.12718294, -0.09479567],
             [ 0.12718294,  3.024422  , -0.16097453],
             [-0.09479567, -0.16097453,  1.76752092]],
            [[ 0.23079895,  0.01806545,  0.02431149],
             [ 0.01806545,  0.21904246,  ...       ],
             [ 0.02431149,  ...       ,  0.70423423]]]),
     array([0.27816086, 0.72183914]))

Calculate the probabilities for the whole sample

    pred_data = gmm.predict_proba(sample_data)
    pred_data

Output:

    array([[8.51605686e-02, 9.14839439e-01],
           [1.00000000e+00, 1.22427721e-18],
           [1.00000000e+00, 1.06197402e-13],
           ...,
           [1.00000000e+00, 2.55563579e-10],
           [9.91619509e-01, 8.38049093e-03],
           [1.00000000e+00, 1.22142457e-12]])

Check the calculated probabilities

    plt.hist(pred_data[:,0], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.])
    plt.xlim([0., 1.])
    plt.xlabel("Probability for mu_alpha (mas/yr)")
    plt.ylabel("Number of sources")
    plt.show()

[Figure: histogram of pred_data[:,0], Number of sources against probability]

    plt.hist(pred_data[:,1], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.])
    plt.xlim([0., 1.])
    plt.xlabel(r"Probability for $\mu_{\delta}$ (mas/yr)")
    plt.ylabel("Number of sources")
    plt.show()

[Figure: histogram of pred_data[:,1], Number of sources against probability]

The Probabilities

    samplesource['prob'] = pred_data[:,0]
    print(samplesource['prob'])

Determine the membership probability classes. According to Agarwal et al. (2021), there are three main classes: member_high, the high-probability members (P(x) ≥ 0.6); member_moder, the moderate-probability members (0.2 < P(x) < 0.6); and member_low, the low-probability members (P(x) ≤ 0.2). There is also one additional class: member_ultra, the ultra-high-probability members (P(x) ≥ 0.8).

    member_ultra = samplesource[samplesource['prob'] >= .8].reset_index(drop=True)
    member_high = samplesource[samplesource['prob'] >= .6].reset_index(drop=True)
    member_moder = samplesource[(samplesource['prob'] > .2) & (samplesource['prob'] < .6)].reset_index(drop=True)
    member_low = samplesource[samplesource['prob'] <= .2].reset_index(drop=True)
    print(member_ultra)

Stars with high probability values are automatically considered members of the cluster. Stars with moderate probability values can be considered cluster members (member_incl) if their parallax lies within the parallax range of the ultra-high-probability members.

    member_incl = member_moder[(member_moder['parallax'] >= member_ultra['parallax'].min()) &
                               (member_moder['parallax'] <= member_ultra['parallax'].max())].reset_index(drop=True)
    # The rest of this message (moderate-probability counts) is truncated in the source.
    print('Sample Sources = %d \nHigh probability member sources (p >= 0.6) = %d' % (len(samplesource), len(member_high)))

Combine member_high and member_incl to get all members.

    member_all = pd.concat([member_high, member_incl]).sort_values(by=['prob'], ascending=False)  # any chained call after this is truncated in source
    len(member_all)

Calculate some important parameters

    mean_para_val = np.mean(member_all['parallax'])
    mean_para_std = np.std(member_all['parallax'])
    member_dist = 1000./(member_all['parallax'])  # distance in parsec for parallax in mas
    mean_pmra_val = np.mean(member_all['pmra'])
    mean_pmra_std = np.std(member_all['pmra'])
    mean_pmde_val = np.mean(member_all['pmdec'])
    mean_pmde_std = np.std(member_all['pmdec'])
    mean_dist_val = np.mean(member_dist)
    mean_dist_std = np.std(member_dist)
    mean_pmra_val, mean_pmra_std, mean_pmde_val, mean_pmde_std, mean_para_val, mean_para_std, mean_dist_val, mean_dist_std

Visualization II (Result)

Probability Distribution

    bins_samp = np.arange(samplesource['prob'].min(), samplesource['prob'].max(), .1)
    # bins_high and bins_mode are also defined at this point in the source, but their
    # expressions are truncated; neither is used by the histogram below.
    bins = np.linspace(0., 1., 19)
    plt.figure(figsize=(6, 4))
    # Colours and labels below are partly truncated in the source; labels follow the figure legend.
    plt.hist(samplesource['prob'], bins=[0., .1, .2, .3, .4, .5, .6, .7, .8, .9, 1.], color='darkgray', label='Sample Sources')
    plt.hist(member_high['prob'], bins=[.6, .7, .8, .9, 1.], color='tab:orange', rwidth=.975, label='High Probability Members')
    plt.hist(member_incl['prob'], bins=[.2, .3, .4, .5, .6], color='tab:green', rwidth=.975, label='Moderate Probability Members')
    plt.xlabel("Probability")
    plt.ylabel("Number of Sources")
    plt.xlim([0., 1.])
    plt.xticks()
    plt.yticks()
    plt.legend()
    plt.show()

[Figure: probability histogram; legend: Sample Sources, High Probability Members, Moderate Probability Members]

Vector Point Diagram

    fig = plt.figure(figsize=(6, 6))
    # Marker sizes are truncated in the source; labels follow the figure legend.
    plt.plot(samplesource['pmra'], samplesource['pmdec'], 'o', mec='silver', mfc='darkgray', label='Sample Sources')
    plt.plot(member_high['pmra'], member_high['pmdec'], 'o', mfc='tab:orange', mec='None', label='High probability (p >= 0.6)')
    plt.plot(member_incl['pmra'], member_incl['pmdec'], 'o', mfc='tab:green', mec='None', label='Moderate probability (0.2 <= p <= 0.6)')
    plt.xlabel(r"$\mu_{\alpha*}$ (mas/yr)")
    plt.ylabel(r"$\mu_{\delta}$ (mas/yr)")
    plt.xticks()
    plt.yticks()
    plt.title("Vector Point Diagram")
    plt.legend()
    plt.show()

[Figure: Vector Point Diagram of sample sources, high-probability and moderate-probability members]

Parallax and proper motions distribution

    bins_samp = np.arange(samplesource['parallax'].min(), samplesource['parallax'].max(), .05)
    bins_high = np.arange(member_high['parallax'].min(), member_high['parallax'].max(), .05)
    bins_mode = np.arange(member_incl['parallax'].min(), member_incl['parallax'].max(), .05)
    plt.figure()  # figsize=(6, 4)
    samplesource['parallax'].hist(bins=bins_samp, color='silver', rwidth=.85, label='Sample Sources')
    member_high['parallax'].hist(bins=bins_high, color='tab:orange', rwidth=.85, label='High probability (p >= 0.6)')
    member_incl['parallax'].hist(bins=bins_mode, label='Moderate probability (0.2 <= p <= 0.6)')  # colour and remaining arguments truncated in source
    plt.xlabel(r"$\omega$ (mas)")
    plt.ylabel("Number of Sources")
    plt.xticks()
    plt.yticks()
    plt.legend()
    plt.show()

[Figure: parallax histogram of sample sources, high-probability and moderate-probability members, ω from about 0.95 to 1.30 mas]

Spatial distribution

    fig = plt.figure(figsize=(6, 6))
    # Marker sizes and colours are partly truncated in the source; labels follow the figure legend.
    plt.plot(samplesource['ra'], samplesource['dec'], 'o', mec='silver', mfc='darkgray', label='Sample sources')
    plt.plot(member_high['ra'], member_high['dec'], 'o', mfc='tab:orange', label='High probability (p >= 0.6)')
    plt.plot(member_incl['ra'], member_incl['dec'], 'o', label='Moderate probability (0.2 <= p <= 0.6)')
    plt.xlabel(r'$\alpha$ (deg)')
    plt.ylabel(r'$\delta$ (deg)')
    plt.legend()
    #ax.set_xticklabels([358.25, 358.5, 358.75, 359.0, 359.25, 359.5, 359.75, 0.00, 0.25], ...)  (rest of this commented-out line is truncated in source)
    plt.show()

[Figure: spatial distribution of sample sources, high-probability and moderate-probability members]

Color Magnitude Diagram

    plt.figure(figsize=(6, 8))
    # Marker sizes are truncated in the source; labels follow the figure legend.
    plt.plot(samplesource['bp_rp'], samplesource['phot_g_mean_mag'], 'o', mec='silver', mfc='darkgray', label='Sample Sources')
    plt.plot(member_high['bp_rp'], member_high['phot_g_mean_mag'], 'o', color='tab:orange', label='High probability (p >= 0.6)')
    plt.plot(member_incl['bp_rp'], member_incl['phot_g_mean_mag'], 'o', color='tab:green', label='Moderate probability (0.2 <= p <= 0.6)')
    plt.xlabel(r"$G_{BP}-G_{RP}$")
    plt.ylabel(r"$G$ (mag)")
    plt.xlim([0., 3.])
    plt.gca().invert_yaxis()
    plt.legend()
    plt.show()

[Figure: color-magnitude diagram of sample sources, high-probability and moderate-probability members]

    plt.figure(figsize=(6, 8))
    plt.plot(member_all['bp_rp'], member_all['phot_g_mean_mag'], 'o', color='tab:blue', label='All members')  # markersize truncated in source
    plt.xlabel(r"$G_{BP}-G_{RP}$")
    plt.ylabel(r"$G$ (mag)")
    plt.xlim([0., 3.])
    plt.ylim([10, 20])
    plt.gca().invert_yaxis()
    plt.legend()
    plt.show()

[Figure: color-magnitude diagram of all members]

    # This cell is garbled in the source; the assignments are reconstructed from the surviving fragments.
    y = member_all['phot_g_mean_mag']
    x = member_all['bp_rp']
    y = y[x > 0.5]
    ya = x[x > 0.5]                     # colour of members redder than bp_rp = 0.5
    xa = y + 5 - 5*np.log10(distance)   # absolute G magnitude from the distance modulus
    print(len(x), len(y))

    ax = az.plot_kde(ya, rug=True)
    plt.show()
    plt.close()
    ax = az.plot_kde(xa, rug=True)
    plt.show()
    plt.close()
    ax = az.plot_kde(xa, values2=ya, contour=False, pcolormesh_kwargs={"cmap": "inferno"})  # remaining arguments truncated in source
    ax.invert_yaxis()
    plt.show()
    plt.close()

    X_train, X_test, y_train, y_test = train_test_split(xa, ya)  # the test_size argument is garbled in the source
    rmse_list = []
    r2_list = []
    for i in range(7, 17):
        for j in range(17):
            knots = 4  # value as printed in the source; the figure labels mention 8 knots
            degree = j
            # try different knots and degree values
            # The end of each dmatrix call is truncated in the source; include_intercept=False
            # and the data dictionaries are assumed.
            formula_args = 'df = ' + str(knots) + ', degree = ' + str(degree) + ', include_intercept=False)'
            try:
                X_spline = dmatrix('bs(x, ' + formula_args, {'x': X_train}, return_type='dataframe')
                spline_fit = sm.GLM(y_train, X_spline).fit()
                y_pred_train = spline_fit.predict(dmatrix('bs(test, ' + formula_args, {'test': X_train}, return_type='dataframe'))
                rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
                print("root mean square error for training set ", rmse_train)
                print("r2 score for training set ", r2_score(y_train, y_pred_train))
                y_pred = spline_fit.predict(dmatrix('bs(test, ' + formula_args, {'test': X_test}, return_type='dataframe'))
                rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
                print("root mean square error for test set ", rmse_test)
                print("r2 score for test set ", r2_score(y_test, y_pred))
                rmse_list.append([rmse_train, rmse_test])
                r2_list.append([r2_score(y_train, y_pred_train), r2_score(y_test, y_pred)])
                range_pred = np.linspace(np.min(X_train), np.max(X_train), 50)
                prediction = spline_fit.predict(dmatrix('bs(xp, ' + formula_args, {'xp': range_pred}, return_type='dataframe'))
                plt.figure(figsize=(7, 7))
                plt.plot(range_pred, prediction, color='r', label='Specifying degree = ' + str(degree))  # rest of the label truncated in source
                plt.scatter(xa, ya, color='blue', alpha=0.3, edgecolor='k')
                plt.xlabel('Absolute G (mag)')  # xa holds absolute magnitude
                plt.ylabel('Color')             # ya holds bp_rp colour
                plt.legend()
                #plt.scatter(member_all['bp_rp'].tolist(), member_all['phot_g_mean_mag'].tolist(), ...)  (truncated in source)
                ax = plt.gca()
                ax.invert_yaxis()
                plt.show()
                plt.close()
            except Exception:
                print("fail")
    print(rmse_list)
    print(r2_list)
    rmse_list = np.array(rmse_list)
    r2_list = np.array(r2_list)
    #print(np.max(range_pred), np.min(range_pred))
    print(min(rmse_list[:,1]))

Output:

    root mean square error for training set  0.1359167441682956
    r2 score for training set  0.9470239280007002
    root mean square error for test set  0.17200044938890915
    r2 score for test set  0.9199368439062416

[Figure: spline fit labelled "Specifying degree = 1 with 8 knots"; horizontal axis from 0 to 10, vertical axis from 0.5 to 2.5]

    root mean square error for training set  0.13177983847404448
    r2 score for training set  0.9501997216524329
    root mean square error for test set  0.1597061776926086
    r2 score for test set  0.9309733217975995

[Figure: spline fit labelled "Specifying degree = 2 with 8 knots"; horizontal axis from 0 to 10, vertical axis from 0.5 to 2.5]
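The degree scan above compares training and test errors for each candidate fit and then prints the smallest test RMSE. The selection logic can be sketched in isolation as follows. This is only an illustration under stated assumptions: it swaps the patsy B-spline and statsmodels GLM for `numpy.polyfit` polynomials, and it runs on synthetic data (a smooth curve plus noise) rather than on the M67 members. The names `rmse`, `results`, and `best_degree` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (absolute magnitude, colour) pairs: a smooth curve plus noise.
x = rng.uniform(0.0, 10.0, 400)
y = 0.5 + 0.02 * x**2 + rng.normal(0.0, 0.05, x.size)

# Hold out 30% of the points as a test set.
order = rng.permutation(x.size)
n_test = int(0.3 * x.size)
test_idx, train_idx = order[:n_test], order[n_test:]

def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Fit one polynomial per degree on the training set and score it on both sets.
results = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    results[degree] = (
        rmse(y[train_idx], np.polyval(coeffs, x[train_idx])),
        rmse(y[test_idx], np.polyval(coeffs, x[test_idx])),
    )

# Select the degree with the lowest test RMSE.
best_degree = min(results, key=lambda d: results[d][1])
for degree, (rmse_train, rmse_test) in results.items():
    print(f"degree {degree}: train RMSE {rmse_train:.3f}, test RMSE {rmse_test:.3f}")
print("degree with lowest test RMSE:", best_degree)
```

Selecting on the test RMSE rather than the training RMSE guards against picking an overly flexible curve, since the training error can only decrease as the degree grows.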
