Scientific Co-Authorship Social Networks: A Case Study of Computer Science Scenario in India
International Journal of Computer Applications (0975-8887)
Volume 52 No.12, August 2012
Science area of four of the IITs. The collaboration network is essentially a graph (G) where the vertices (V) represent authors and the link/edge (e) between them represents the fact that they are related by co-authorship. Each edge has a weight that reflects the number of papers written together by the pair of vertices (i.e. authors). In this paper, we use node, vertex and author interchangeably.

The paper is organized as follows. In section 2, we discuss background and related work in the area. We discuss our procedure for data collection in section 3. In section 4, we discuss various measures that can be used for analysis of the social networks. In section 5, we present the architecture of our social network extraction system. We present and discuss our experimental results in section 6. Finally, we conclude and give some future directions in section 7.

2. BACKGROUND AND RELATED WORK
Universities and other institutions of higher learning have long been known for providing solutions to various problems confronting society, and research has answered many such problems. Modern research faces both extraordinary opportunities and challenges. A fast-paced modern society turns to academics as public servants for immediate answers to an array of practical problems created by its own increasing needs and desires. Society is willing to invest in research as the basis of a knowledge economy as long as research proves to be responsive to its needs, productive and effective. Most of the questions science is required to answer are too complex to be addressed in the traditional disciplinary framework of academic research. Yet with the explosion of knowledge, research has become more fragmented than ever. The lack of communication and coordination even within the faculties of a single university, or even a single department, leads to lost opportunities every single day. Lack of collaboration has aggravated the problem, and even research groups within universities have become specialized: long gone are the days when every subdiscipline within a scientific domain was equally represented at a university or other research institution.

Social networks received a lot of attention from the research community long before the advent of the Web [5]. The social sciences made great strides in measuring and analyzing social networks between 1950 and 1980, at the same time as Vannevar Bush's proposed hypertext medium, Memex, was gaining acceptance [9]. There are numerous examples of social networks formed by social interactions: co-authoring, advising, supervising, and serving on committees between academics; directing, acting, and producing between movie personnel; composing and singing between musicians; trade and diplomatic relations between countries; sharing interests, making phone calls, and transmitting infections between people; hyperlinking between Web pages; and citations between papers.

The last decade has seen rapid growth of research interest in online social networks. Social network extraction is an emerging field of research, with the majority of the work concentrated in the late 2000s, and the focus has been on constructing efficient procedures and algorithms for identifying community structure in a generic network. Constructing the researcher network by automating information extraction from the Web can benefit many Web mining and social network applications [10]. For example, if all the profiles of researchers are correctly extracted, we will have a large collection of well-structured data about real-world researchers. The extracted profiles can help in finding experts for research guidance for new scholars, and potential speakers and contributors for conferences, journals, workshops, etc. The academic network so extracted can be used in many services, such as finding an appropriate person to introduce or negotiate with someone, or deciding whom one should talk to in order to expand one's network efficiently [11]. The extracted academic network may also be used for trend detection and prediction. Trend detection can help a researcher analyze the thrust areas of research in a particular field and what other researchers are doing in that or a related field. Trend prediction can give the research community an idea of the potential research topics in a particular field. The size, reach, growth and diversity of these networks are the characteristic features that challenge the research community.

Scientific social networks can be obtained by considering different scientific relations such as project participation, co-authorship, thesis supervision, conference participation, technical production, etc. A social network of researchers can be constructed using any one or a combination of these relations. However, collaboration is normally established on the basis of similar research interests. Of the various relations mentioned above, co-authorship is the most important measure [2] of collaboration among individuals and organizations.

Several studies have addressed the extraction of social networks from information sources such as the WWW, e-mail, instant messenger logs, search engines, etc. Referral Web [12] was the first attempt of this kind to develop an automated interactive tool for extracting a social network from a specified domain and finding the shortest referral chains to experts. It uses a search engine (Altavista) to extract social networks through the co-occurrence of names in close proximity in any document publicly available on the WWW, e.g. personal homepages, lists of co-authors in technical papers, citations, and organizational charts; such co-occurrence is taken as evidence of a direct relationship. The network obtained is an egocentric network, in that it is focused on a specific person. Referral Web [12] has influenced many studies on the automatic extraction of social networks. Tomobe et al. [11] proposed a system for social network extraction of conference participants from the Web. The idea behind this study is that, at academic conferences, a participant registers a brief profile with fields like Name, E-mail, Affiliation, etc. well before the conference, which means that there is enough time to gather information about the participants from the Web. P. Mika in 2005 developed Flink [13], a system for the extraction, aggregation, and visualization of the online Semantic Web community. The Web mining module of Flink obtains, as in [12], the hit count from a search engine (Google) for each of the persons X and Y individually as well as the hit count for the co-occurrence of the two names, with the target being the Semantic Web community.

Constructing a social network based on the co-authorship relation is straightforward, as the authors of a publication are explicitly stated in the publication itself. In Social Network Analysis (SNA), co-authorship networks are also called affiliation networks, where vertices represent authors and edges represent the relationship (i.e. a co-authored publication) between these authors. If two vertices are connected to each other, the authors represented by those two vertices have co-authored a paper. The strength of the relationship between two authors is directly proportional to the number of publications authored together.

Collaboration networks in scientific communities are a well-studied subject because of their inherent complexity and the motivation to predict or analyze certain features among the persons involved. Various studies [14, 15] have investigated network parameters such as small-world behaviour, betweenness centrality, vertex centrality, etc. to interpret the obtained data. The analysis of these parameters provides good insight into the health of the research community and the institution. Good health indicates
that the institution is alive, full of activity, publishing more papers, and attracting more projects, collaborations and grants. This can help one find potential researchers for collaboration, project funding, guidance, etc. and could prove beneficial for institutions, students, and funding agencies as well.

The co-authorship network can be considered a true proxy for the social collaboration network of researchers and has attracted research attention in recent years. The idea of co-authorship networks started with the Erdos number project (www.oakland.edu/enp/). The greater the distance between Paul Erdos and another researcher (r), the greater is that researcher's Erdos number, and vice versa. Newman carried out many studies [14, 15, 16, 17] on co-authorship network analysis. These studies answered a broad variety of questions about collaboration patterns by analyzing co-authorship networks across several domains: Biology, Computer Science, Mathematics, and Physics. The parameters analyzed in these studies were the number of papers authors write, how many people they write them with, the typical distance between researchers through the network, etc. A few other studies have also analyzed co-authorship networks. For example, social science collaboration networks were investigated in [18], the digital library community in [19, 20], and the database community in [21].

3. DATA COLLECTION
There are various online sources for the collection of public data. These include publication databases like DBLP, CiteSeer, Google Scholar, etc., journal homepages, conference proceedings (websites), homepages of individual researchers, as well as organizational/institutional websites. Since this work focuses on collaboration between faculty members of the institutions under consideration, institutional websites are the best and most authentic source of the publication data of their faculty members. The data used in this study was obtained from the websites of the four IITs under consideration, namely IIT Delhi, IIT Kanpur, IIT Kharagpur, and IIT Madras.

The data analyzed in this work pertains to the period 2005-2011. We obtained data about the faculty members and associated publications of the Computer Science departments of these four IITs. The data was highly unstructured, as is the case with any Web data. Data on web pages can be found in different formats. HTML is designed for unstructured data, which contains information in several formats, e.g., text, image, video and audio. It is known that web pages in HTML format are dirty because their contents are ill-formed and broken [22]. Moreover, different IITs maintain the required data (faculty list and publications list) in different formats.

For the purpose of social network extraction, we considered only those faculty members who were currently occupying a teaching position in the department. Altogether, we analyzed the publications of 107 researchers from the Computer Science area of the four IITs, resulting in 1017 co-authored publications, including journal papers, conference papers, invited papers and technical reports, and 2375 co-authorship relationships.

To analyze the data and extract social networks, it was necessary to extract the co-author relationships of each researcher under consideration from the publication data. The problem with publication data is that the author names of a particular publication are written in different formats in different publications. Moreover, the formats of the name lists and publication lists differed between IITs. In order to map names in the faculty list to the names in the publication data, we developed an algorithm to extract similar or identical names from the publication data. We used this algorithm for name resolution and extraction of relationships. The algorithm (not discussed in this paper) worked with great precision, and the relationships extracted and used in this work were obtained using it.

4. NETWORK ANALYSIS METRICS
Social Network Analysis is a branch of sociology that is formalized to a great extent on the basis of the mathematical notions of graph theory. This formal model captures the key observation of Social Network Analysis, namely that to a great extent social structure alone determines the opportunities and limitations of social units and ultimately affects the development of the network as well as the substantial outputs of the community.

The field of Social Network Analysis as it is today is the result of the convergence of several streams of applied research in fields like sociology, social psychology and anthropology. Many of the concepts and theories of network analysis have been developed independently by various researchers, often through empirical studies of various social settings. For example, many social psychologists of the 1940s found a formal description of social groups useful in depicting communication channels in the group when trying to explain processes of group communication. As early as the mid-1950s, anthropologists found network representations useful in generalizing actual field observations, for example when comparing the level of reciprocity in marriage and other social exchanges across different cultures.

An important insight into the structure of social networks came from a remarkable experiment conducted by the American psychologist Stanley Milgram [23]. In 1967, Milgram set out to test the common observation that no matter where we live, the world around us seems to be small: we routinely encounter persons not known to us who turn out to be the friends of our friends. Milgram's experiment set out not only to test whether we are in fact all connected but also to find the average distance between any two individuals in the social network of American society. Milgram calculated the average length of the chains and concluded that, on average, Americans are no more than six steps apart from each other.

Social network metrics such as degree, betweenness, closeness and network centrality are often the subject of academic research. Understanding social networks and their metrics is important, as these networks form the underlying structure that allows for rapid information distribution [24].

Several network analysis measures proposed in [25] can be used to identify influential nodes and discover community structures in the extracted social networks. We are interested in capturing the internal connectivity as well as the attributes of key nodes in the network. To identify the leaders in the network, the quantity of interest in many social network studies is the betweenness centrality of an actor i. Centrality is a measure of the relative importance of nodes and edges in a graph. Several centrality measures, such as betweenness centrality, closeness centrality, and degree centrality, have been proposed in [25] to identify the most important actors (leaders) in a social network. In addition to other aspects of analysis, we are also interested in answering the following four questions:

(i) Who are the hubs/leaders?
(ii) Who has more connections?
(iii) How strong are the collaboration ties?
(iv) How collaborative are the authors?
We may use four measures, namely (i) betweenness centrality, (ii) degree centrality, (iii) clustering coefficient, and (iv) average degree, to answer the above four questions.

Betweenness centrality measures the fraction of all shortest paths that pass through a given node. Nodes with high betweenness centrality play a crucial role in the information flow and cohesiveness of the network and are indispensable to the network because of the information flow they assist. Nodes with high betweenness act as gatekeepers [26].

The degree centrality of a node is the number of links incident on it and is used to identify the nodes that have the highest number of connections in the network. A more sophisticated version of degree centrality is eigenvector centrality, which depends not only on the number of incident links but also on the quality of those links [26]. We use eigenvector centrality in our experiments.

The clustering coefficient signifies how well a node's neighbourhood is connected. The more connected the neighbours are with one another, the higher the clustering coefficient, because the neighbourhood graph approaches a clique, i.e. a complete graph in which every node is connected to every other node [27]. The clustering coefficient of a network, as given in [28], is the average of the clustering coefficients of all the nodes in the network. It indicates the degree to which nodes in a network tend to cluster together and is therefore considered a good indicator of whether a network demonstrates small-world behaviour [28]. Stanley Milgram's [23] theory of the six degrees of separation utilises the average path length metric.

The average degree of all the nodes in the network is a measure of how collaborative the authors are.

the publication data which serves as input to the social network visualization engine.

We use NodeXL [29], an open source graph analysis and visualization tool that works as an add-on to Microsoft Excel, as the social network generation and visualization engine for visualizing and analyzing the extracted co-authorship relations. The engine generates the social (academic) graphs and other network statistics from the affiliation (co-authorship) list provided.

6. RESULTS AND DISCUSSION
Two types of co-authorship social (academic) network graphs, viz. affiliation networks and internal collaboration networks, were obtained using the graph generation and visualization engine. The affiliation networks show the general co-authorship relationships (internal as well as external), whereas the internal collaboration networks show the co-authorship collaboration among the faculty members within the department (of a particular IIT) itself. Figures 2(a), 3(a), 4(a) and 5(a) show the extracted affiliation networks, and Figures 2(b), 3(b), 4(b) and 5(b) show the extracted internal collaboration networks. In these graphs, we have coloured edges blue, green, and red for the highest, moderate, and weak collaborative strength, respectively. In addition to the colours, the width of an edge is directly proportional to the strength of the relationship: the wider an edge, the stronger the relationship, i.e. the more papers authored together. We extracted and analyzed various graphs and tried to come up with meaningful interpretations of the co-authorship networks from our study, which are given in the following subsections.
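As an illustration of how the two graph types and their edge strengths could be derived from raw author lists, the following Python sketch counts joint papers per author pair. It is not the NodeXL workflow used in this study. The function name build_coauthorship_graphs, the sample names, the rule that only papers with at least one faculty author are kept, and the choice to link every pair of authors on a paper (including two external co-authors) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_coauthorship_graphs(publications, faculty):
    """Count joint papers per author pair (hypothetical helper).

    publications: iterable of author-name lists, one list per paper.
    faculty: set of names of the department's faculty members.
    Returns (affiliation, internal) Counters keyed by sorted name pairs.
    """
    affiliation = Counter()  # internal and external co-authorships
    internal = Counter()     # faculty-to-faculty co-authorships only
    for authors in publications:
        names = sorted(set(authors))  # ignore duplicate names on one paper
        # Assumption: only papers with at least one faculty author are kept.
        if not any(name in faculty for name in names):
            continue
        for pair in combinations(names, 2):
            affiliation[pair] += 1  # edge weight = number of joint papers
            if pair[0] in faculty and pair[1] in faculty:
                internal[pair] += 1
    return affiliation, internal


if __name__ == "__main__":
    # Made-up data: A and B are faculty, X is an external co-author.
    papers = [["A", "B", "X"], ["A", "B"], ["B", "X"], ["Y", "Z"]]
    affiliation, internal = build_coauthorship_graphs(papers, {"A", "B"})
    print(dict(affiliation))
    print(dict(internal))
```

The resulting weights could then be mapped to edge widths or colour bands; the thresholds separating highest, moderate and weak strength are not stated in the text, so none are assumed here.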
although Amitabha Bagachi acts as gatekeeper in the collaboration network, Subhashis Banerjee is the most highly connected node in the network. This can be attributed to the fact that Subhashis Banerjee is connected to other highly ranked nodes in the network, like Prem Kalra (as evident from Figure 2-a), whereas Amitabha Bagachi, despite having the highest betweenness, has connections with low-ranked nodes in the network.

6.2. Strength of Collaboration Ties
As per the values of the network attributes obtained from the graph visualization engine (Table-1), IIT Kanpur has the lowest clustering coefficient. This highlights the lack of connectivity between the vertices (authors) in the affiliation graph of IIT Kanpur (Figure 3-a). Only a few of the nodes are tightly clustered in cliques, i.e. complete graphs. This is corroborated by the values of diameter and average path length for IIT Kanpur, as given in Table-1. IIT Kharagpur has the highest clustering coefficient of the four IITs and also the largest connected component of all four affiliation graphs. This highlights the strong connectivity between the vertices in the affiliation graph of IIT Kharagpur, where the majority of the nodes are clustered in complete graphs, i.e. cliques. This means that the flow of information is hard in IIT Kanpur, whereas it is easy in IIT Kharagpur.

All four affiliation graphs (Figures 2-a, 3-a, 4-a and 5-a) have an average path length of less than 6, which means that these networks, like other social network graphs, are small-world networks.
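The network attributes discussed above (clustering coefficient, average path length, diameter and largest connected component) can be computed for any undirected co-authorship graph. The sketch below is a minimal pure-Python illustration, not the NodeXL computation used for the reported values. The function names, the sample edge list, and the convention that a node with fewer than two neighbours has a clustering coefficient of 0 are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def network_statistics(edges):
    """Return APL, DI, CC and LC for an undirected edge list (no self-loops)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return {"APL": 0.0, "DI": 0, "CC": 0.0, "LC": 0.0}

    total = pairs = diameter = largest = 0
    for node in adj:
        dist = bfs_distances(adj, node)
        largest = max(largest, len(dist))  # size of this node's component
        for other, d in dist.items():
            if other != node:  # only connected pairs are counted
                total += d
                pairs += 1
                diameter = max(diameter, d)

    return {
        "APL": total / pairs if pairs else 0.0,  # mean shortest path length
        "DI": diameter,                          # longest shortest path
        "CC": sum(local_clustering(adj, n) for n in adj) / len(adj),
        "LC": 100.0 * largest / len(adj),        # % of nodes in largest component
    }


if __name__ == "__main__":
    # Made-up graph: a triangle A-B-C with a pendant D, plus a separate pair E-F.
    sample = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]
    print(network_statistics(sample))
```

An average path length below 6 on the largest component would be consistent with the small-world reading given above, although the cut-off itself is the paper's heuristic rather than something the code decides.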
Average path length (APL): The average of the shortest distances between every connected pair of nodes. It is a measure of the mean separation of the nodes in the network. Unconnected pairs are excluded.
Diameter (DI): The longest distance between any pair of nodes. It is a measure of how far apart the most distant pair is.
Collaborators (CL): The average number of collaborators per author.
Clustering Coefficient (CC): Clustering coefficient of the entire network.
Largest component (LC): The percentage of the nodes that belong to the largest connected component.

IIT        P*   A#   APL   DI  CL    CC     LC
Delhi      138  131  1.76  4   2.03  0.092  16.79
Kanpur     225  271  4.37  10  2.02  0.017  85.24
Kharagpur  247  252  3.48  6   3.22  0.329  93.25
Madras     407  419  3.56  7   2.14  0.085  74.70
*Papers
#Authors

6.4. Intra-Departmental (Internal) Collaborations
IIT Kharagpur (Figure 4-b) leads the four IITs in internal collaboration, with 85.16% of the total faculty members actually collaborating with each other, directly or indirectly, whereas in IIT Madras (Figure 5-b), Kanpur (Figure 3-b), and Delhi (Figure 2-b), the figure was 56.52%, 34.62%, and 16.13% respectively. The size of the largest connected component in IIT Madras,
Madras, Kharagpur, and Kanpur, Sukhendu Das, Sarkar S., and Amitabha Mukherjee respectively, were highly connected and act as gatekeepers as well in their respective departmental co-authorship networks. These nodes (authors) are very important for the flow of information in these networks.

Figure 5(b): IIT Madras - Internal Collaboration Graph

7. CONCLUSIONS AND FUTURE DIRECTIONS
Co-authorship social networks are based on the co-authorship relationship (i.e. jointly conducting research or participating in a research study and presenting the results together as a research publication) and are a result of people collaborating to become co-authors. Analyses of these networks reveal the collaboration pattern and structure of the scientific community. Publishing patterns and trends of a particular group or institution can be analyzed by studying these networks. In addition, co-authorship networks also provide a platform for studying network evolution and dynamics. In this paper, we studied co-authorship networks (both internal and external) of the computer science departments of the four IITs under consideration.

We described the experiments conducted with publication data for the analysis of various network parameters. We have shown the collaboration pattern in these institutions of higher and technical learning. Prominent researchers in all four IITs were identified, including gatekeepers, hubs and leaders. Detailed analysis of the generated social (academic) networks, i.e. graphs, and the values of various network attributes show that internal collaboration in the departments under consideration was highest in IIT Kharagpur and lowest in IIT Delhi. IIT Madras published the highest number of papers, whereas IIT Delhi had the lowest number of publications during the period under investigation.

A future extension of this work could be the analysis of collaboration at the institutional level. This can help in understanding the collaboration pattern and ties among these institutions. Another extension could be the temporal analysis of the evolution pattern of these collaboration networks and their impact on the research and development activities in these institutions. The analysis of the collaboration ties can be improved by considering other collaboration relationships like co-supervision, project participation, etc.

8. REFERENCES
[1] Merriam-Webster's Collegiate Dictionary (1999). Tenth Edition. Springfield, MA: Merriam-Webster, Incorporated.
[2] Stroele, V., Oliveira, J., Zimbrao, G. and Souza, M.J. Mining and Analyzing Multirelational Social Networks. In Proceedings of 12th International Conference on Computational Science and Engineering, Vancouver, Canada, 2009, pp. 711-716.
[3] Patel, N. Collaboration in the professional growth of American sociology. Social Science Information, 6, 1973, pp. 77-92.
[4] Glanzel, W. and Schubert, A. Analysing Scientific Networks Through Co-authorship. Handbook of Quantitative Science and Technology Research, 2004, Kluwer Academic Publishers.
[5] Wasserman, S. and Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, New York City, New York, U.S.A., 1994.
[6] Newman, MEJ. Co-authorship networks and patterns of scientific collaboration. In Proceedings of the National Academy of Sciences, 101(1), 2001, pp. 5200-5.
[7] Davis, G.F. and Greve, H.R. Corporate Elite Networks and Governance Changes in the 1980s. The American Journal of Sociology, 103(1), 1997, pp. 1-37.
[8] Stuart, T.E. Network Positions and Propensities to Collaborate: An Investigation of Strategic Alliance Formation in a High-technology Industry. Administrative Science Quarterly, 43(3), 1998, pp. 668-98.
[9] Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, USA, 2003.
[10] Tang, J., Zhang, D., and Yao, L. Social Network Extraction of Academic Researchers. In Proceedings of International Conference on Data Mining-ICDM'07, Nebraska, USA, October 2007, pp. 292-301.
[11] Tomobe, H., Matsuo, Y. and Hasida, K. Social Network Extraction of Conference Participants. In Proceedings of 12th International Conference on World Wide Web-WWW'03, Budapest, Hungary, May 2003.
[12] Kautz, H., Selman, B., and Shah, M. The Hidden Web. American Association for Artificial Intelligence magazine, 18(2), 1997, pp. 27-35.
[13] Mika, P. Flink: Semantic web technology for the extraction and analysis of social networks. Journal of Web Semantics, 3(2), 2005, pp. 211-223.
[14] Newman, MEJ. Scientific collaboration Networks-II. Shortest paths, weighted networks, and centrality. Physical Review E, Volume 64, 016132.
[15] Newman, MEJ. The structure of scientific collaboration networks. PNAS, January, 2001, 98(2), pp. 404-409.
[16] Newman, MEJ. Co-authorship networks and patterns of scientific collaboration. In Proceedings of the National Academy of Sciences, 101(90001), 2004, pp. 5200-5205.
[17] Newman, MEJ. Scientific Collaboration Networks-I. Network Construction and Fundamental Results. Physical Review E, 64(1):16131, 2001.
[18] Moody, J. The structure of a social science collaboration network: Disciplinary Cohesion from 1963 to 1999. American Sociological Review, 69(2), 2004, pp. 213-238.
[19] Liu, X., Bollen, J., Nelson, M.L. and Van de Sompel, H. Coauthorship networks in the digital library research community. Information Processing & Management, 41(6), 2005, pp. 1462-1480.
[20] Sharma, M. and Urs, S.R. Network Dynamics of Scholarship: A Social Network Analysis of Digital Library Community. In Proceeding of the 2nd PhD Workshop on Information and Knowledge Management, New York, NY, USA, 2008, pp. 101-104.
[21] Elmacioglu, E. and Lee, D. On six degrees of separation in dblp-db and more. SIGMOD Rec., 34(2), 2005, pp. 33-40.
[22] Ma, L., Goharian, G. and Chowdhury, A. Automatic data extraction from template generated web pages. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'03), Las Vegas, Nevada, USA, 2003, pp. 642-648.
[23] Milgram, S. The Small World Problem. Psychology Today, 2(1), 1967, pp. 60-67.
[24] Newman, M.E.J. Community centrality. Phys. Reviews 74 (2006).
[25] Chelmis, C. and Prasanna, V.K. Social Networking Analysis: A State of the Art and the Effect of Semantics. In Proceedings of 3rd IEEE Conference on Social Computing (SocialCom), Boston, MA, 2011, pp. 531-536.
[26] http://en.wikipedia.org/wiki/Centrality
[27] http://en.wikipedia.org/wiki/Clustering_coefficient
[28] Watts, D. J. and Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature, 393(6684), June 1998, pp. 440-442.
[29] Hansen, D., Shneiderman, B., and Smith, M. A. Analyzing Social Media Networks with NodeXL: Insights from a Connected World. Morgan Kaufmann, Burlington, MA, USA, 2009.