Let the Chart Spark: Embedding Semantic Context into Chart with Text-to-Image Generative Model
Fig. 1: Pictorial visualizations created by ChartSpark. (a) A line chart depicts the date of cherry blossom in High Park each year, embedded with the tree branch while also preserving the trend. (b) A pie chart shows the area of the Kubuqi Desert and its restoration in 2021, with the three types of land embedded in the corresponding sectors in a consistent style. (c) A line chart shows the amount of glacier mass balance per year, coherently presented with the background glacier structure in compliance with the data trend.
Abstract— Pictorial visualization seamlessly integrates data and semantic context into visual representation, conveying complex information in a manner that is both engaging and informative. Extensive studies have been devoted to developing authoring tools that simplify the creation of pictorial visualizations. However, mainstream works mostly follow a retrieving-and-editing pipeline that relies heavily on visual elements retrieved from a dedicated corpus, which often compromises data integrity. Text-guided generation methods are emerging, but they may have limited applicability because they operate on predefined recognized entities. In this work, we propose ChartSpark, a novel system that embeds semantic context into charts based on a text-to-image generative model. ChartSpark generates pictorial visualizations conditioned on both the semantic context conveyed in textual inputs and the data information embedded in plain charts. The method is generic to both foreground and background pictorial generation, satisfying the design practices identified in an empirical study of existing pictorial visualizations. We further develop an interactive visual interface that integrates a text analyzer, an editing module, and an evaluation module to enable users to generate, modify, and assess pictorial visualizations. We experimentally demonstrate the usability of our tool and conclude with a discussion of the potential of combining text-to-image generative models with interactive interfaces for visualization design.
Index Terms—pictorial visualization, generative model, authoring tool
integrate chart information into the textual guidance through the use of attention mechanisms.

[Figure 2 omitted. Panel (a) shows the five workflow stages (labels recovered from the figure: Raw Data, Context, Element, Binding, Estimation); panel (b) plots the share (0–100%) of the four design patterns (Foreground Internal Single, Foreground Internal Multiple, Foreground External, Background) for each chart type.]
Fig. 2: Analysis of the preliminary study. (a) The general workflow of creating pictorial visualizations contains five stages. Participants with different backgrounds encountered difficulties at the indicated stages. (b) Taxonomy of common design patterns for different chart types. We report the percentage of each design pattern for each chart type in the corpus. To illustrate the representation of each design pattern, we use several types of candy.

3 Preliminary Study
We conducted a preliminary study to comprehend the design process of crafting a pictorial representation. From the formative interviews, we gained a better understanding of the design workflow (Sect. 3.1) and collected concerns and expectations that are summarized as design requirements. We also collected a corpus of pictorial visualizations and examined the design patterns presented in them (Sect. 3.2).

3.1 Formative Interviews
To study a typical workflow and ensure that the authoring tool can be useful and accessible to a wide range of users, we conducted formative interviews with people from three different backgrounds: two artists (A1, A2) who mostly use Adobe Illustrator and P5.js in their daily work, two visualization experts (V1, V2) who have around three years of data analysis and dashboard design experience, and two individuals (P1, P2) without relevant training in art or visualization. Each participant was interviewed individually, and the length of each interview varied from 30 minutes to an hour. The interview consisted of three stages. First, we introduced the concept of pictorial visualization and presented several examples to the participants. During the second stage, participants were provided with a dataset of the global change in desert area, which included an x-axis for time, a y-axis for area, and a title. They were then instructed to describe their design workflow using rough sketches while vocalizing their thought process. Finally, we asked the participants the following questions: 1) What are the key steps involved in creating a pictorial visualization? 2) Which step in the pictorial visualization creation process do you find the most challenging, and why? 3) What expectations and concerns would you have if a generative model were involved in creating pictorial visualizations?

3.1.1 General Workflow
The participants' workflows in creating pictorial visualizations revealed that there was a general pattern to the process, as shown in Fig. 2 (a). Firstly, participants recognized the theme and data features by reading the raw data. Once the visual elements were established, participants attempted to bind the data with these elements using techniques such as rotation, scaling, deformation, and stretching. At the end of the creation process, some participants evaluated the final design to ensure the accuracy of the data binding and avoid errors introduced by manual manipulation.

Based on the interviews, we found that all participants mentioned that it was tedious and challenging to bind data with visual elements. "I need to repeat the same operation for each single element," A2 commented. "If I want to change visual materials, I have to start my work from scratch. It is not good for creation iteration." Meanwhile, P1 noted that "the image cutout and anti-aliasing need professional software that is hard to use. It is also complicated to deform the visual element to adjust the shape of charts." Furthermore, both P1 and P2 stated that it was difficult to find images that matched the theme. "I can barely find any concrete elements, especially for abstract vocabulary," P2 mentioned. However, neither the artists nor the visualization experts thought this would be a potential issue. As for evaluating visualization performance, the visualization experts raised concerns about visual distortion and data integrity compared with the original chart.

3.1.2 Design Requirements
Besides the discussion about the conventional workflow, participants also expressed their expectations and concerns about involving generative models to assist the process of creating pictorial visualizations. All participants believed that it was essential to capture appropriate semantic context and significant data features before starting to design, especially in the context of AI generation. They sketched a chart that reflected the data at the beginning of their design process, which provided a solid base for adding visual elements with relevant semantic context. When told that generative models would be involved in the creation process, they emphasized the necessity of a raw data preview so that they could evaluate whether the generative model's portrayal of the data was faithful. Participants hoped that the generative model would allow flexible customization in design. Specifically, some participants wanted a controllable generative result, including its shape and color, while others wanted a large variety of styles in visual elements. Moreover, participants expected an integration of visual attributes from charts while generating images. "Recent text-to-picture models work well in general," V2 acknowledged. "However, particularly for pictorial visualization, we must consider information encoded by visual channels, like the trend for a line plot and the height for a bar chart." All participants anticipated an evaluation module to validate the quality of their work once the visualization was completed, reporting the visual distortion from the original data in terms of height, area, and angle.

Following the design workflow, we derive four design requirements based on participants' concerns (R1 and R4) and expectations (R2 and R3), respectively.
R1. Preview data and theme. Visualizing the raw data and obtaining a semantic description before starting the visualization design.
R2. Personalize visual elements. Customizing the pictorial visualization, such as its color and shape, through controllable manipulation, while expanding the design space with various styles of visual elements.
R3. Embed semantic context into the chart. Integrating the visual element and the data automatically and naturally, while supporting flexible embedding methods for the semantic context.
R4. Evaluate the performance. Evaluating the visualization design in terms of visual distortion, which indicates the loss of data integrity.

3.2 Pictorial Visualization Corpus
Our preliminary study also unveils systematic patterns for integrating semantic context with data in pictorial visualizations. To collect the sample, we drew upon datasets provided by prior research [11, 26, 43] and manually selected typical charts, resulting in 587 charts that constitute part of our data. As some of the collected pictographs overly focused on a specific type, such as icon-based charts, we also retrieved additional examples from Pinterest and Google to supplement our corpus.
After removing redundant visualizations, our final corpus comprised 863 samples. We then classified the images based on embedding and representation. To minimize individual judgment bias, each visualization was appraised and categorized by two authors, with a double-check process. We analyzed the collected data and summarized the common design patterns into a taxonomy. As depicted in Fig. 2 (b), we employ various shapes of candies as content to represent this taxonomy. Overall, the design pattern concerning the embedded object can be divided into foreground and background. For the foreground, there are various forms of data representation, such as encoding as an internal element of the chart itself or locating it externally, and filling the element in a single or multiple manner. In contrast, the background refers to the overall visual context in which the data is presented.
• Foreground Internal Single (394, 45.6%). The semantic context is embedded in the foreground, encoding the visual element in a single manner as the chart itself. Examples include rectangular candies for bars in a bar chart and round candies for points in a scatter plot.
• Foreground Internal Multiple (299, 34.7%). The semantic context is embedded in the foreground, encoding the visual element in a multiple manner as the chart itself. For example, a stack of the same candies forms a bar in a bar chart.
• Foreground External (104, 12.1%). The semantic context is embedded in the foreground, encoding the visual element in a single manner and locating it externally. For instance, a candy is placed next to each sector in a pie chart.
• Background (66, 7.7%). The semantic context is embedded by visual elements in the background. Some backgrounds may also reflect the data information through the contours of the visual elements.
While the visual representation for embedding semantic context into charts can be diverse, we observe a significant phenomenon in both the foreground and the background: whether the data can be reflected within the displayed visual element. If so, the semantic information and the data information are encoded together within the element, and the element complies with the data's inherent trend or magnitude.
4 Method
4.1 Overview
Fig. 3: The 3-stage framework of ChartSpark. The extraction stage provides users with data features and semantic context. The generation stage produces visual elements from the input prompt and the selected method, with a final refinement. The evaluation stage assesses the generated visualization based on distortion.

The proposed ChartSpark framework's workflow is depicted in Fig. 3, comprising three primary stages. In the initial stage, data features are visualized and semantic context is extracted from the raw data, offering users a visual preview and a thematic topic to enhance their comprehension of the data (R1). Subsequently, users employ prompt-driven generation to acquire visual elements, embedding the semantic context into the foreground or background in either a conditional or unconditional manner (R2, R3). Ultimately, the evaluation module furnishes a suggestion mechanism to indicate data distortion (R4). In comparison to the workflow depicted in Fig. 2, ChartSpark streamlines the process by supplanting the retrieval and data-binding stages with generation, which mitigates the inconvenience of retrieving visual elements and integrating them with charts. Instead, users provide prompts regarding their preferred style and embedding technique to steer the automatic generation. The ChartSpark framework features a sandwich structure, with the first and last stages ensuring optimization of the middle stage's performance and augmenting faithfulness to the data and expressiveness of the visual representation. The preview presented in the initial stage enables users to intuitively discern the potential distortion of the generated chart, while the evaluation in the final stage delivers more accurate values and explicitly visualizes the error.
4.2 Feature Extraction
Since pictorial visualization fuses both numerical data and textual information, we extract the data feature and the semantic context, respectively.
Data feature. To facilitate users' comprehension of the overall appearance of the raw data, we offer a variety of chart types from which they can choose to visualize the data (Fig. 5 (A1)). Users can efficiently identify patterns and trends concealed within the data in the data preview (Fig. 5 (A2)). This data preview also functions as an indicator to detect any deviations in the subsequent generation process. Furthermore, we extract the data annotation, encompassing the x and y axes and the title, into an SVG file, making it available for subsequent user editing.
Semantic context. For semantic context extraction, we employ a two-step approach, namely keyword extraction and relevant word retrieval. Initially, we extract the keyword using MPNet [45], a pre-trained model featuring a sentence-transformer architecture. However, since the keyword may not be sufficiently explicit to inspire a concrete visual element, we also provide relevant words to stimulate users' creativity, particularly for individuals with limited design expertise, in accordance with Fig. 2. This objective is accomplished by estimating word similarity through Word2Vec [29], which converts words into vectors. To accomplish this, we employ the English Wikipedia dump from November 2021, comprising a corpus of 199,430 words, each represented by a 300-dimensional vector. In our experiments, we discovered that some of the most similar words are closely associated with proper nouns, which could decrease recognition capacity in the visualization. Under the assumption that a word's frequency of occurrence in the corpus reflects its usage in everyday life, we integrate frequency and similarity as two criteria to rank the relevant semantic context, yielding the top 7 words as a result (Fig. 5 (A3)).
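The two-step semantic-context extraction can be sketched as follows; this is a minimal illustration rather than the authors' implementation. It assumes the sentence-transformers package (the MPNet checkpoint name is a stand-in), a gensim Word2Vec vector file, and a word-frequency dictionary; the model name, file path, and the similarity/frequency weighting are all placeholders.

```python
# Hypothetical sketch of keyword extraction + relevant-word retrieval (Sect. 4.2).
# The checkpoint name, vector file, and weighting are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util
from gensim.models import KeyedVectors

encoder = SentenceTransformer("all-mpnet-base-v2")                    # MPNet-based encoder
word_vectors = KeyedVectors.load_word2vec_format("enwiki_300d.txt")   # placeholder path

def extract_keyword(title: str) -> str:
    """Pick the title word whose embedding is closest to the whole title."""
    candidates = [w.strip(",.") for w in title.lower().split() if len(w) > 3]
    title_emb = encoder.encode(title, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(title_emb, cand_embs)[0]
    return candidates[int(scores.argmax())]

def relevant_words(keyword: str, freq: dict, top_k: int = 7, alpha: float = 0.5):
    """Rank Word2Vec neighbours by a blend of similarity and corpus frequency."""
    neighbours = word_vectors.most_similar(keyword, topn=50)
    max_freq = max(freq.get(w, 1) for w, _ in neighbours)
    ranked = sorted(
        neighbours,
        key=lambda ws: alpha * ws[1] + (1 - alpha) * freq.get(ws[0], 0) / max_freq,
        reverse=True,
    )
    return [w for w, _ in ranked[:top_k]]
```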
4.3 Generation
Based on the preliminary study, we identify four embedding types, along with foreground and background as the two main layers for embedding objects. Our key observation is that some embedding types only require a visual embellishment containing semantic context, while others need to comply with the inherent data to become part of the chart itself. In light of this observation, we devise a generation methodology that employs both unconditional and conditional approaches. The fundamental distinction between these approaches hinges on whether the chart information is factored into the generation process. As illustrated in Fig. 4 (a), the generation stage consists of three core modules. The unconditional module adopts the fundamental text-to-image diffusion model, subsequently generating a corresponding visual element. For the conditional module, we inject the chart image into the attention map, serving as guidance for the ensuing generation process, as shown in Fig. 4 (b). Lastly, the modification module is tasked with replication and refinement to accommodate the four embedding types and enhance the details.

4.3.1 Unconditional Generation
Diffusion-based generation methods outperform previous generation methods in the quality of the generated images and in semantic comprehension. In this work, we develop our framework based on the frozen Latent Diffusion Model (LDM) [40]. Below, we outline the core structure and generation process of LDM to provide some preliminary knowledge. Similar to previous diffusion models [13, 20, 44], LDM follows a forward process that adds Gaussian noise through a Markov process, and a reverse process that denoises to recover the original distribution and reconstruct the image. However, LDM distinguishes itself from other diffusion models by employing a compressed, low-dimensional latent space instead of pixel-based diffusion, thereby reducing computational costs at inference. The LDM architecture includes an autoencoder that converts pixel images into latent representations and a UNet that predicts noise and performs denoising in the latent space. To enable text-guided generation, LDM enhances the underlying UNet backbone with a cross-attention mechanism, facilitating the integration of multimodal guidance.
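For reference, the unconditional path can be driven by an off-the-shelf latent-diffusion pipeline. The sketch below uses the Hugging Face diffusers API with a public Stable Diffusion checkpoint as a stand-in; it is not the exact backbone, weights, or sampler configuration used in ChartSpark.

```python
# Illustrative unconditional generation with an off-the-shelf LDM pipeline.
# The checkpoint name and prompt are examples, not the paper's configuration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cherry blossom branch, watercolor style"   # P_obj plus P_des
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("visual_element.png")
```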
Fig. 4: The process of unconditional and conditional generation for the foreground and background. (a) Given a prompt, the generative model produces relevant visual elements with either the unconditional or the conditional method; the generated visualization can then be edited through replication and refinement. (b) The internal mechanism for incorporating an image and a textual input into the attention map.
Foreground. In our preliminary analysis of existing pictorial visualizations, the foreground is the common object in which to embed the semantic context, and it can exhibit various representations. Unconditional generation can produce visual elements to embellish the chart; these closely match the semantics but do not contain information about the underlying data. We achieve this by utilizing the prompt-driven method. The text prompt P provided by the user consists of an object P_obj and its corresponding description P_des. In Fig. 4 (a), the underlined term represents P_obj, while the term without an underline represents P_des. Given the generated image I_g, our objective is to extract the visual element related to the semantic context P_obj from the image. To accomplish this, we use cross-attention between the object P_obj and I_g to locate the target region, and then remove the background to obtain I_obj. As shown at the top of Fig. 4 (b), we obtain the visual feature map V of the generated image from the autoencoder and the embedding of P_obj. Next, we use linear projections to transform them into Q and K. We then multiply Q and K to obtain the attention score, which is subsequently multiplied with V to generate the final attention map. In summary, the process can be described as follows:
A(Q, K, V) = Softmax(QKᵀ / √d) · V,   (1)

where d represents the dimension of the latent projections Q and K, and the Softmax function is utilized to normalize the attention score. As shown at the bottom of Fig. 4 (b), the attention score is directly proportional to the strength of the relevance between the image and the text. As a result, we can extract the object of interest from a cluttered background by comparing pixel differences. To accomplish this, we first calculate a threshold to distinguish the object from the background, obtaining a mask. Next, we perform a pixel-wise comparison at the corresponding positions in I_g to obtain a rough object region, denoted as I_obj. Lastly, to achieve a more refined result, we utilize ISNet [35], a state-of-the-art segmentation neural network, to eliminate redundant information. The process can be described through the following equations:

M = I[A_ij > (Σ_ij A_ij) / N²],   (2)
I_obj = f_upsample(M) ⊙ I_g,   (3)
I'_obj = R(I_obj),   (4)

where I[·] is the element-wise indicator function on the matrix, N² represents the total number of pixels in the attention map A, M is a matrix with a value of 1 at the object's location and 0 elsewhere, and ⊙ denotes element-wise multiplication. Since M has the same dimensions as A, we employ an upsampling operation to resize it to match the shape of I_g. R represents the redundant-information removal operation. The symmetric and hierarchical structure of the UNet involves both downsampling and upsampling, resulting in cross-attention layers being present at different resolutions. In our experiments, we observed that the middle layer exhibited better performance, and we empirically set N to 16.
Background. In unconditional background generation, the aim is to incorporate semantic context without extracting objects. To achieve this, we employ straightforward text-to-image generation with the fundamental diffusion model.

4.3.2 Conditional Generation
Compared with unconditional generation, conditional generation involves integrating the chart image I_c so that the generated visual element complies with the data information, such as its trend and contour. Two principal challenges require addressing: 1) Enhancing generative diversity. We introduce an augmentation module to expand the possible fusion directions. Nevertheless, we have discovered that conventional augmentation operations used in natural image domains, such as cropping and flipping, are inappropriate for charts and may ultimately jeopardize the data integrity of the chart. 2) Integrating the semantic context and the chart. This entails determining how to condition the generation process by merging the attention map containing the semantic context with the chart's data information.
Foreground. Conditional foreground generation emphasizes the integration of semantic context into the visual marks of the chart while adhering to the data represented within the chart. Intuitively, the semantic context needs to be integrated into the rectangle, line, sector, and bubble for the bar chart, line chart, pie chart, and scatter plot, respectively. Initially, we randomly augment I_c with various manipulations, including Gaussian blur, dynamic blur, and image warp, as depicted in Fig. 4 (a). The augmentation module aug is established based on the principle of enhancing the diversity of chart element shapes while maintaining data integrity. Then, we obtain the attention map A concerning P_obj from the generation process (Eq. 1). To infuse the data information from I_c into the attention map, we utilize I_c as a mask, ensuring the attention map possesses the same shape as the element in I_c. To maximize the fused image I_fuse by including as much semantic context as possible, we employ two common affine transformations, scaling and rotation. The optimization function can be expressed as:

I_fuse = max_{θ, s} [aug(I_c) ⊙ φ(f_upsample(A), θ, s)],   (5)

where φ is the affine transformation parameterized by the scaling parameter s and the rotation parameter θ. Finally, taking as input I_fuse, which integrates the semantic context and the chart information, we regenerate to obtain I_g.
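Eq. (5) can be read as a small search over the scale s and rotation θ that keeps as much semantic context as possible inside the (augmented) chart mask. The sketch below uses torchvision's affine transform and a plain overlap score as stand-ins for the paper's exact objective and search strategy.

```python
# Sketch of the Eq. (5) search: try a few scales/rotations of the upsampled
# attention map and keep the transform that overlaps the chart mask the most.
# The candidate grid and the overlap objective are illustrative choices.
import torch
import torchvision.transforms.functional as TF

def fuse(chart_mask: torch.Tensor, attention_up: torch.Tensor) -> torch.Tensor:
    """chart_mask, attention_up: (1, H, W) tensors in [0, 1] (aug(I_c) and f_upsample(A))."""
    best_score, best_fused = -1.0, None
    for angle in range(0, 360, 30):                       # rotation parameter theta
        for scale in (0.6, 0.8, 1.0, 1.2):                # scaling parameter s
            warped = TF.affine(attention_up, angle=float(angle), translate=[0, 0],
                               scale=scale, shear=[0.0])
            fused = chart_mask * warped                   # keep semantics inside the mark
            score = fused.sum().item()                    # how much semantics survives
            if score > best_score:
                best_score, best_fused = score, fused
    return best_fused
```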
Fig. 5: User interface of ChartSpark. It consists of a central canvas for the manipulation and composition of visual elements, and three main modules corresponding to the processes of feature extraction (A), generation (B), and evaluation and editing (C).
Background. The background serves not only as a container for semantic context but also as a part of the chart that conveys data information. To this end, we fuse the features of I_c into the background generation process. We also leverage the augmentation module to improve the diversity of generation. Unlike the augmentation for the foreground, which focuses on element distortion, the augmentation for the background necessitates the seamless integration of the chart's features with the background component. In practice, we define a set encompassing various interaction methods between the chart elements and the chart edges (the axes and border of the chart image) to establish a closed shape, particularly in the case of line charts, as the other three chart types possess ample space to encode information. In the process of merging the semantic context and the chart, we utilize a blending approach that calculates a weighted average of the attention map A and the augmented chart image aug(I_c) to facilitate this integration. This can be expressed as:

I_fuse = ρ · A + (1 − ρ) · aug(I_c),   (6)

where ρ ∈ [0, 1]; we set ρ to 0.6 empirically. We then inject I_fuse into the generation procedure to achieve reconstruction, yielding I_g.
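The background blend of Eq. (6) is a convex combination; a minimal sketch with ρ = 0.6, as in the text, and arrays standing in for the attention map and the augmented chart:

```python
# Eq. (6): convex blend of the attention map and the augmented chart image.
import numpy as np

def blend_background(attention: np.ndarray, aug_chart: np.ndarray, rho: float = 0.6) -> np.ndarray:
    """Both inputs are (H, W) arrays on the same scale; rho follows the paper's setting."""
    return rho * attention + (1.0 - rho) * aug_chart
```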
4.3.3 Modification
Upon producing the individual components, it is essential to adjust them to accomplish the ultimate composition. At the element level, we reuse the generated elements to encode other visual marks in the chart, enhancing reproducibility and adaptability. At the chart level, we refine the image details to create a more harmonious overall appearance, particularly when merging independently generated visual elements.
Replication. To apply the visual element to other visual marks in a chart, traditional tools present two challenges: the tedious task of copying each individual element one by one, and the risk of element distortion. We propose a warp-and-merge strategy to overcome these challenges, taking a bar chart as a case in point. Initially, we generate the fundamental visual component by employing the tallest bar as a reference. This reduces the problem to examining how to shrink the element to correspond with the shorter bars. To elaborate, we partition the visual component into five equally tall sections and compute the structural similarity (SSIM) between each pair. Based on the proportion of the bar, we cut the most similar part and concatenate the remaining parts together. However, the concatenated image might have artifacts and rigid junctions. To address this issue, we optimize these local details using the reconstruction capabilities of the generative model.
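The warp-and-merge replication can be sketched as follows: split the reference element built from the tallest bar into five horizontal sections, score adjacent sections with SSIM, and repeatedly drop the most redundant section until the element matches a shorter bar's height. It assumes scikit-image, grayscale float images in [0, 1] whose sections are at least 7×7 pixels, and a simplified stopping rule.

```python
# Sketch of warp-and-merge replication: repeatedly remove the horizontal
# section that is most similar to its neighbour until the element is short
# enough for the target bar. Assumes scikit-image; simplified stopping rule.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def shrink_to_bar(element: np.ndarray, target_height: int, n_sections: int = 5) -> np.ndarray:
    """element: (H, W) grayscale visual element built from the tallest bar, values in [0, 1]."""
    while element.shape[0] > target_height:
        sections = np.array_split(element, n_sections, axis=0)
        # SSIM between each pair of adjacent sections (cropped to a common height).
        scores = []
        for a, b in zip(sections[:-1], sections[1:]):
            h = min(a.shape[0], b.shape[0])
            scores.append(ssim(a[:h], b[:h], data_range=1.0))
        cut = int(np.argmax(scores)) + 1                   # most redundant section
        sections.pop(cut)                                  # cut it out ...
        element = np.vstack(sections)                      # ... and concatenate the rest
    return element
```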
Refinement. During the generation process, users may integrate multiple embedding methods and experiment with various generations, yielding several independent generation results. For instance, as depicted in Fig. 5, the tree branch and the cherry blossom are generated separately, employing the unconditional and conditional foreground modes, respectively. Distinctly generated visual elements can give rise to numerous issues, such as incoherent concatenation and inconsistent styles. Refinement based on image-to-image generation can supplement details to counteract incoherent concatenation and harmonize the style of the image, while preserving its layout and semantic context.

4.4 Evaluation
As stated in requirement R4, it is essential to provide an evaluation module that informs users of any potential distortions affecting data integrity when creating pictorial visualizations.
To assess distortion, we ascertain the disparity between the generated visual element and the original plain chart (Fig. 5 (C1)). Given that each chart employs distinct visual channels to encode data, we tailor our methodology accordingly to guarantee a precise assessment for each chart type. For bar charts, we concentrate primarily on the height of each bar as an indicator of distortion. For line charts, the portrayal of the trend is of paramount importance in the evaluation. For pie charts, we measure the angle of each sector to gauge distortion. For scatter plots, we estimate the distance of the centroid of each point to assess distortion. In contrast to the approach presented in [11], our system not only displays global numeric distortion values to users but also identifies local regions with high errors and presents them to users in their visual context to facilitate modifications (Fig. 5 (C1)).
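As a concrete example of the per-chart-type measures, a bar-chart distortion check can compare bar heights measured from the generated image against the original data. The height-measurement step is outside this sketch, and the relative-error metric and tolerance are illustrative choices rather than the paper's exact formula.

```python
# Illustrative bar-chart distortion: relative height error per bar, plus the
# indices of bars whose error exceeds a tolerance (for local highlighting).
# `measured` would come from the generated image; here it is passed in directly.
def bar_distortion(original, measured, tolerance=0.05):
    errors = [abs(m - o) / o for o, m in zip(original, measured)]
    flagged = [i for i, e in enumerate(errors) if e > tolerance]
    return sum(errors) / len(errors), flagged

# Example: the third bar is drawn noticeably too tall.
global_error, local_issues = bar_distortion([10, 25, 40, 55], [10.2, 24.5, 46.0, 54.0])
```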
[Figures omitted: a gallery of pictorial visualizations generated with ChartSpark (e.g., agricultural land use for wheat in New Zealand, goals scored by Lionel Messi for FC Barcelona in all competitions, juice composition by age group) and a summary chart of user-study ratings (e.g., effectiveness of ChartSpark for creating pictorial visualizations).]

embedding objects (foreground and background) and embedding techniques (conditional and unconditional). Next, we presented the interface to the participant and introduced the functionalities. We then guided the