Christian Garbin’s personal blog

Improve writing by learning how to read

2024-05-05T00:00:00-04:00

The advice in How to Read a Paper changed how I read scientific papers. I used to read them linearly, struggling through each section in order. The three-pass approach was liberating. It freed my mind from having to understand everything the first time.

Eventually, I realized I could turn the advice around to write a good paper.

Most of what follows applies to scientific papers. However, we can use the same principles for other types of writing.

What makes a good paper?

Good papers have, at a minimum, these qualities:

They are correct. Their claims are supported by evidence and are internally consistent.
They are easy to read. They are well-structured, with clear language, good grammar and attention to detail.

The first quality is exclusionary. An incorrect paper does not benefit from any other quality. We will spend the rest of this post discussing the second quality.

How to write a good paper

Here, we will apply the advice from How to Read a Paper to writing a good paper. We will use the three-pass approach and see how it applies to writing.

The first pass

The main point of the first pass is to convince the reader that the paper’s topic is worth reading and that it is well structured.

Carefully read the title, abstract, and introduction

Read the section and sub-section headings, but ignore everything else

Read the conclusions

S. Keshav — How to Read a Paper

How to turn that “read” advice into “write” advice:

Choose a title that communicates the paper’s main point. Don’t try to be clever. The title should be clear and concise. It should tell the reader what the paper is about.
Write a short, to-the-point abstract. This is the first opportunity to motivate the reader. A good abstract explains the research problem, why it is important to work on it, the challenges, and the main results in a few sentences. Note that these are fundamental questions. You may not be ready to write the paper if you can’t answer them.
Write a clear introduction that sets the stage for the rest of the paper. Use it to entice the reader to keep reading. The introduction should cover the same items as the abstract but in more detail.
Choose section and subsection titles that tell a story. The section titles for scientific papers often follow a rigid structure (introduction, related work, methodology, results, discussion, conclusion). However, we have more freedom for the subsections. Use precise and concise titles that tell the reader what to expect. For example, “Approach 1” and “Approach 2” are not good subsection titles. Better titles to help pique the reader’s interest are “Approach 1: Using a neural network” and “Approach 2: Using a decision tree”.

Here are a few lessons that took me a while to learn:

If I don’t have good answers for the fundamental items in the abstract, my research is not yet ready. I may even have a problem with the research topic and scope, a fundamental issue that needs to be addressed before spending more time on the research.
I do not need to write the abstract and introduction first. Writing them last may be easier (as long as the abstract items have been answered – this is about writing the abstract, not the research itself).
In fact, writing a paper is not a linear process. I may write the results first, then go back to methods, then write the introduction, the abstract, the discussion, and finally, the conclusion. As long as the research is solid, choose the order that makes it easier to write.
But having a roadmap helps. I write a table of contents before starting to write. It helps me see the big picture and how to break down the sections into logical and engaging subsections.

The second pass

The main point of the second pass is to help the reader understand the paper’s content.

Look carefully at the figures, diagrams and other illustrations in the paper. Pay special attention to graphs. Are the axes properly labeled? Are results shown with error bars, so that conclusions are statistically significant? Common mistakes like these will separate rushed, shoddy work from the truly excellent. …

After this pass, you should be able to grasp the content of the paper. You should be able to summarize the main thrust of the paper, with supporting evidence, to someone else.

S. Keshav — How to Read a Paper

How to turn that “read” advice into “write” advice:

Don’t rely on text alone to convey your ideas. Support the text with figures, tables, and diagrams.
1. Chris Olah’s blog has superb examples of figures and diagrams supporting the text explanations, such as the one on how LSTM works.
Use detailed captions for figures, tables, and diagrams. Assume the reader will read only the captions and none of the paper’s text. The captions should be self-contained and explain the main points. “Results for approach 1” is a bad caption. A better caption to keep the reader engaged is “Result from experiments for approach 1, using neural networks to solve the problem. The graph shows it performs well in the first ten epochs, then overfits. This problem sparked the idea for approach 2 (figure 2).”
Avoid trivial mistakes that undermine the paper’s credibility. Trivial mistakes indicate sloppiness and make the reader doubt the paper’s results.
1. Make sure that the axes are properly labeled, that the figures have legends, and that the color schemes are clear and accessible (a list of common errors and how to fix them).
2. Carefully review table columns and row labels.
3. Run the entire paper through a spell and grammar checker. In this day and age, there is no excuse for typos and grammar mistakes.

Here are a few lessons that took me a while to learn:

Have a “hero picture” (or diagram). This picture summarizes or explains the most important concepts in the paper and is placed in a prominent spot, such as the bottom of the first page or the top of the second page (and remember to write a self-contained caption). Readers are drawn to pictures. A good picture can make the reader want to read the paper. For example, the figure at the top of the second page of this paper explains at a glance the main contributions of the paper: a large dataset curated by experts (self-promotion warning: I’m one of the authors, but the idea to put this picture strategically came from another coauthor – I first thought it was a gimmick, but I now appreciate the value). As a bonus, this picture can be used as a graphical abstract.
Read the paper aloud, including captions (to another coauthor, if you have one). Seriously. This is the most effective of all the “how to make it easier to read” tips I know. If you can’t read it aloud without stumbling, the reader will have a hard time reading it silently.

The third pass

The main point of the third pass is to convince the reader that the paper’s results are significant and well-supported.

The key to the third pass is to attempt to virtually re-implement the paper: that is, making the same assumptions as the authors, re-create the work. By comparing this re-creation with the actual paper, you can easily identify not only a paper’s innovations, but also its hidden failings and assumptions.

S. Keshav — How to Read a Paper

How to turn that “read” advice into “write” advice:

Ensure that the results are statistically significant. Use error bars, p-values, or other statistical measures to show that the results are not due to chance.
Perform ablation studies to show the impact of different components of your approach and to prove that the results come from your approach, not other factors.

Here are a few lessons that took me a while to learn:

Anticipate questions. If you were a reviewer, what would you ask? Answer those questions in the paper. Ask a trusted colleague to read the paper and ask them what questions they have. Don’t disregard their questions as “obvious” or “unimportant.” If they have those questions, so will the reviewers.
At the same time, keep the main body brief and use appendices for additional information. Keep the main body focused on the main points, then use the appendices to provide additional information for the curious reader.

Above all, don’t lose the reader in the first pass

If you remember only one thing from this post: the first pass is so important that it may be the last. Don’t lose the reader in the first pass.

Or, as Keshav puts it:

“[W]hen you write a paper, you can expect most reviewers (and readers) to make only one pass over it. Take care to choose coherent section and sub-section titles and to write concise and comprehensive abstracts. If a reviewer cannot understand the gist after one pass, the paper will likely be rejected; if a reader cannot understand the highlights of the paper after five minutes, the paper will likely never be read.”

S. Keshav — How to Read a Paper

Using LLMs to summarize GitHub issues

2023-11-05T00:00:00-04:00

This project is a learning exercise on using large language models (LLMs) for summarization. It uses GitHub issues as a practical use case that we can relate to.

The goal is to allow developers to understand what is being reported and discussed in the issues without having to read each message in the thread. We will take the original GitHub issue with its comments and generate a summary like this one.

UPDATE 2024-07-21: With the announcement of GPT-4o mini, there are fewer and fewer reasons to use GPT-3.5 models. I updated the code to use the GPT-4o and GPT-4o mini models and to remove the GPT-4 Turbo models (they are listed under “older models we support”, hinting that they will eventually be removed).

We will review the following topics:

How to prepare data to use with an LLM.
How to build a prompt to summarize data.
How good LLMs are at summarizing text, and GitHub issues in particular.
Developing applications with LLMs: some of their limitations, such as the context window size.
The role of prompts in LLMs and how to create good prompts.
When not to use LLMs.

The code for these experiments is available on this GitHub repository. This YouTube video walks through the sections below, but note that it uses the first version of the code. The code has been updated since then.

Overview of the steps

Before we start, let’s review what happens behind the scenes when we use LLMs to summarize GitHub issues.

The following diagram shows the main steps:

Get the issue and its comments from GitHub: The application converts the issue URL the user entered in (1) to a GitHub API URL and requests the issue, then the comments (2). The GitHub API returns the issue and comments in JSON format (3).
Preprocess the data: The application converts the JSON data into a compact text format (4) that the LLM can process. This is important to reduce token usage and costs.
Build the prompt: The application builds a prompt (5) for the LLM. The prompt is a text that tells the LLM what to do.
Send the request to the LLM: The application sends the prompt to the LLM (6) and waits for the response.
Process the LLM response: The application receives the response from the LLM (7) and shows it to the user (8).

We will now review each step in more detail.

Quick get-started guide

This section describes the steps to go from a GitHub issue like this one (click to enlarge)…

…to an LLM-generated summary (click to enlarge):

Follow the “quick get-started guide” on the GitHub repository to start the application if you want to follow along.

Once the application is running, enter the URL for the issue above, https://github.com/microsoft/semantic-kernel/issues/2039, and click the Generate summary with <model> button to generate the summary. It will take a few seconds to complete.

Large language models are not deterministic and may be updated anytime. The results you get may be different from the ones shown here.
The GitHub issue may have been updated since the screenshots were taken.

In the following sections, we will go behind the scenes to see how the application works.

What happens behind the scenes

This section describes the steps to summarize a GitHub issue using LLMs. We will start by fetching the issue data, preprocessing it, building an appropriate prompt, sending it to the LLM, and finally, processing the response.

Step 1 - Get the GitHub issue and its comments

The first step is to get the raw data using the GitHub API. In this step we translate the URL the user entered into a GitHub API URL and request the issue and its comments. For example, the URL https://github.com/microsoft/semantic-kernel/issues/2039 is translated into https://api.github.com/repos/microsoft/semantic-kernel/issues/2039. The GitHub API returns a JSON object with the issue. Click here to see the JSON object for the issue.

The issue has a link to its comments:

"comments_url": "https://api.github.com/repos/microsoft/semantic-kernel/issues/2039/comments",

We use that URL to request the comments and get another JSON object. Click here to see the JSON object for the comments.
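The URL translation in this step can be sketched as follows. This is a minimal illustration, not the project's code: the function name `issue_url_to_api_url` and the validation rules are assumptions made for the example.

```python
from urllib.parse import urlparse

GITHUB_API_BASE = "https://api.github.com/repos"


def issue_url_to_api_url(issue_url: str) -> str:
    """Translate a GitHub issue web URL into the matching API URL.

    Hypothetical helper: the name and the checks are illustrative.
    """
    parsed = urlparse(issue_url)
    # Expected path: /<owner>/<repo>/issues/<number>
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "github.com" or len(parts) != 4 or parts[2] != "issues":
        raise ValueError(f"Not a GitHub issue URL: {issue_url}")
    owner, repo, _, number = parts
    return f"{GITHUB_API_BASE}/{owner}/{repo}/issues/{number}"


api_url = issue_url_to_api_url(
    "https://github.com/microsoft/semantic-kernel/issues/2039"
)
print(api_url)
```

Rejecting URLs that do not match the expected shape early gives the user a clearer error than a failed API request would.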

Step 2 - Translate the JSON data into a compact text format

The JSON objects have more information than we need. Before sending the request to the LLM, we need to extract the pieces we need for the following reasons:

Large objects cost more because most LLMs charge per token.
It takes longer to process large objects.
Large objects may not fit in the LLM’s context window (the context window is the number of tokens the LLM can process at a time).

In this step, we take the JSON objects and convert them into a compact text format. The text format is easier to process and takes less space than the JSON objects.

This is the start of the JSON object returned by the GitHub API for the issue.

{
  "url": "https://api.github.com/repos/microsoft/semantic-kernel/issues/2039",
  "repository_url": "https://api.github.com/repos/microsoft/semantic-kernel",
  "labels_url": "https://api.github.com/repos/microsoft/semantic-kernel/issues/2039/labels{/name}",
  "comments_url": "https://api.github.com/repos/microsoft/semantic-kernel/issues/2039/comments",
  "events_url": "https://api.github.com/repos/microsoft/semantic-kernel/issues/2039/events",
  "html_url": "https://github.com/microsoft/semantic-kernel/issues/2039",
  "id": 1808939848,
  "node_id": "I_kwDOJDJ_Yc5r0jtI",
  "number": 2039,
  "title": "Copilot Chat: [Copilot Chat App] Azure Cognitive Search: kernel.Memory.SearchAsync producing no   ...

  "body": "**Describe the bug**\r\nI'm trying to build out the Copilot Chat App as a RAG chat (without
           skills for now). Not sure if its an issue with Semantic Kernel or my cognitive search...
           ...many lines removed for brevity...
           package version 0.1.0, pip package version 0.1.0, main branch of repository]\r\n\r\n**Additional
           context**\r\n",
   ...

And this is the compact text format we create out of it.

Title: Copilot Chat: [Copilot Chat App] Azure Cognitive Search: kernel.Memory.SearchAsync producing no
results for queries
Body (between '''):
'''
**Describe the bug**
I'm trying to build out the Copilot Chat App as a RAG chat (without skills for now). Not sure if its an
issue with Semantic Kernel or my cognitive search setup. Looking for some guidance.
...many lines removed for brevity...

To get from the JSON object to the compact text format we do the following:

Remove all fields we don’t need for the summary. For example, repository_url, node_id, and many others.
Change from JSON to plain text format. For example, {"title": "Copilot Chat: [Copilot Chat App] Azure ... becomes Title: Copilot Chat: [Copilot Chat App] Azure ....
Remove spaces and quotes. They count as tokens, which increase costs and processing time.
Add a few hints to guide the LLM. For example, Body (between ''') tells the LLM that the body of the issue is between the ''' characters.

Click here to see the result of this step. Compare with the JSON object for the issue and comments to see how much smaller the text format is.
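As a rough sketch of this conversion, the function below keeps only the title and body of an issue and emits the plain-text layout shown above. The function name `issue_to_text` and the sample data are assumptions for illustration; the project's parser may handle more fields (comments, labels, dates) and differ in detail.

```python
def issue_to_text(issue: dict) -> str:
    """Keep only the fields needed for the summary and emit plain text.

    Hypothetical helper: it handles the title and body only.
    """
    # The GitHub API returns Windows-style line endings (\r\n) in the body.
    body = (issue.get("body") or "").replace("\r\n", "\n").strip()
    lines = [
        f"Title: {issue['title']}",
        "Body (between '''):",  # hint that tells the LLM where the body is
        "'''",
        body,
        "'''",
    ]
    return "\n".join(lines)


# Made-up sample in the shape of the API response (most fields omitted).
sample_issue = {
    "url": "https://api.github.com/repos/example/example/issues/1",
    "node_id": "not-needed-for-the-summary",
    "title": "Example title",
    "body": "**Describe the bug**\r\nSomething fails.\r\n",
}
print(issue_to_text(sample_issue))
```

Fields such as `url` and `node_id` never reach the output, which is where most of the token savings come from.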

Step 3 - Build the prompt

A prompt tells the LLM what to do, along with the data it needs.

Our prompt is stored in this file. The prompt instructs the LLM to summarize the issue and the comments in the format we want (the “Don’t waste…“ part comes from this example).

You are an experienced developer familiar with GitHub issues.
The following text was parsed from a GitHub issue and its comments.
Extract the following information from the issue and comments:
- Issue: A list with the following items: title, the submitter name, the submission date and
  time, labels, and status (whether the issue is still open or closed).
- Summary: A summary of the issue in precisely one short sentence of no more than 50 words.
- Details: A longer summary of the issue. If code has been provided, list the pieces of code
  that cause the issue in the summary.
- Comments: A table with a summary of each comment in chronological order with the columns:
  date/time, time since the issue was submitted, author, and a summary of the comment.
Don't waste words. Use short, clear, complete sentences. Use active voice. Maximize detail, meaning focus on the content. Quote code snippets if they are relevant.
Answer in markdown with section headers separating each of the parts above.

Step 4 - Send the request to the LLM

We now have all the pieces we need to send the request to the LLM. Different LLMs have different APIs, but most of them have a variation of the following parameters:

The model: The LLM to use. As a general rule, larger models are better but are also more expensive and take more time to build the response.
System prompt: The instructions we send to the LLM to tell it what to do, what format to use, and so on. This is usually not visible to the user.
The user input: The data the user enters in the application. In our case, the user enters the URL for the GitHub issue and we use it to create the actual user input (the parsed issue and comments).
The temperature: The higher the temperature, the more creative the LLM is. The lower the temperature, the more predictable it is. We use a temperature of 0.0 to get more precise and consistent results.

These are the main ones we use in this project. There are other parameters we can adjust for other use cases.

This is the relevant code in llm.py:

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0.0  # We want precise and repeatable results
    )

Step 5 - Show the response

The LLM returns a JSON object with the response and usage data. We show the response to the user and use the usage data to calculate the cost of the request.

This is a sample response from the LLM (using the OpenAI API):

ChatCompletion(..., choices=[Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content=
'<response removed to save space>', role='assistant', function_call=None))], created=1698528558,
model='gpt-3.5-turbo-0613', object='chat.completion', usage=CompletionUsage(completion_tokens=304,
prompt_tokens=1301, total_tokens=1605))

Besides the response, we get the token usage. The cost is not part of the response. We must calculate that ourselves following the published pricing rules.
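A minimal sketch of the cost calculation follows. The prices in the table are placeholders, not real pricing, and the names `PRICES_PER_1K` and `request_cost` are made up for this example. It assumes the provider prices input (prompt) and output (completion) tokens separately, which is common.

```python
# Placeholder prices in US dollars per 1,000 tokens.
# These are NOT real prices; take the actual values from the provider's pricing page.
PRICES_PER_1K = {
    "example-model": {"input": 0.0015, "output": 0.002},
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token usage."""
    price = PRICES_PER_1K[model]
    input_cost = prompt_tokens / 1000 * price["input"]
    output_cost = completion_tokens / 1000 * price["output"]
    return input_cost + output_cost


# Token counts taken from the sample response above.
cost = request_cost("example-model", prompt_tokens=1301, completion_tokens=304)
print(f"US ${cost:.4f}")
```

With a real response, the two counts would come from `completion.usage.prompt_tokens` and `completion.usage.completion_tokens`.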

At this point, we have everything we need to show the response to the user.

Developing applications with LLMs

In this section we will go through a few examples to see how to use LLMs in applications. We will start with simple cases that work well, then move on to cases where things don’t behave as expected and how to work around them.

This is a summary of what is covered in the following sections.

A simple GitHub issue first to see how LLMs can summarize.
A large GitHub issue that doesn’t fit in the context window of a basic LLM.
A more powerful model for a better summary.
The introduction of GPT-4o mini.
The importance of using a good prompt.
Sometimes we should not use an LLM.

A simple GitHub issue to get started

We will start with a simple case to see how well LLMs can summarize.

Start the application as described in the “quick get-started guide” on the GitHub repository to follow along. Then choose the first issue in the list of samples, <https://github.com/openai/openai-python/issues/488> (simple example) and click the “Generate summary with…“ button (click to enlarge).

After a few seconds we should get a summary like the picture below. At the top we can see the token count, the cost (derived from the token count), and how long it took for the LLM to generate the summary. After that we see the LLM’s response. Compared with the original GitHub issue, the LLM does a good job of summarizing the main points of the issue and the comments. We can see at a glance the main points of the issue and its comments (click to enlarge).

A large GitHub issue

Now choose the issue https://github.com/scikit-learn/scikit-learn/issues/26817 (large, requires GPT-3.5 16k or GPT-4) and click the “Generate summary with…“ button. Do not change the LLM model yet.

It will fail with this error:

Error code: 400 - {'error': {'message': "This model's maximum context length is 4097 tokens. However, your messages resulted in 4154 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}

Each LLM has a limit on the number of tokens it can process at a time. This limit is the context window size. The context window must fit the information we want to summarize and the summary itself. If the information we want to summarize is larger than the context window, as we saw in this case, the LLM will reject the request.

There are a few ways to work around this problem:

Break up the information into smaller pieces that fit in the context window. For example, we could ask for a summary of each comment separately, then combine them into a single summary to show to the user. This may not work well in all cases, for example, if one comment refers to another.
Use a model with a larger context window.
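The first option could be sketched as follows: group the comments, in order, into chunks that fit a token budget, then summarize each chunk separately. This is an illustration only. The function names are hypothetical, and the four-characters-per-token estimate is a crude assumption; a real implementation would count tokens with the model's tokenizer.

```python
from __future__ import annotations


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def chunk_comments(comments: list[str], max_tokens: int) -> list[list[str]]:
    """Group comments, in order, into chunks of at most max_tokens estimated tokens.

    A single comment larger than max_tokens still becomes its own (oversized) chunk.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for comment in comments:
        tokens = estimate_tokens(comment)
        # Start a new chunk when adding this comment would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)
            current = []
            current_tokens = 0
        current.append(comment)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


comments = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
for chunk in chunk_comments(comments, max_tokens=20):
    print(len(chunk), "comment(s) in this chunk")
```

Keeping the comments in order preserves as much of the conversation flow as possible, although a comment that refers to one in another chunk will still lose its context.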

We will use the second option. Click on “Click to configure the prompt and the model” at the top of the screen, select the GPT-4o model and click the “Generate summary with gpt-4o” button (click to enlarge).

Now we get a summary from the LLM.

Why don’t we start with GPT-4o to avoid such problems? Money. As a general rule, LLMs with larger context windows cost more. If we use an AI provider such as OpenAI, we must pay more per token. If we run the model ourselves, we need to buy more powerful hardware. Either way, using a larger context window costs more.

Better summaries with a more powerful model

As a result of using GPT-4o, we also get better summaries.

Why don’t we use GPT-4o from the start? In addition to the above reason (money), there is also higher latency. As a general rule, better models are also larger. They need more hardware to run, translating into higher costs per token and a longer time to generate a response.

We can see the difference comparing the token count, cost, and time to generate the summary between the gpt-3.5-turbo and the gpt-4o models.

How do we pick a model? It depends on the use case. Start with the smallest (and thus cheapest and fastest) model that produces good results. Create some heuristics to decide when to use a more powerful model. For example, switch to a larger model if the comments are larger than a certain size and if the users are willing to wait longer for better results (sometimes an average result delivered quickly is better than a perfect result delivered later).
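Such a heuristic could look like the sketch below. The context limit comes from the error message shown earlier; the summary reserve and the “large input” threshold are made-up values that would need tuning for a real application, and `pick_model` is a hypothetical name.

```python
SMALL_MODEL = "gpt-3.5-turbo"
LARGE_MODEL = "gpt-4o"

SMALL_MODEL_CONTEXT = 4097    # limit reported in the error message shown earlier
SUMMARY_RESERVE = 500         # hypothetical: tokens set aside for the summary
LARGE_INPUT_THRESHOLD = 2000  # hypothetical: size at which a larger model pays off


def pick_model(prompt_tokens: int, user_accepts_wait: bool = False) -> str:
    """Pick the smallest model that is likely to produce a good summary."""
    # The input and the summary must both fit in the context window.
    if prompt_tokens + SUMMARY_RESERVE > SMALL_MODEL_CONTEXT:
        return LARGE_MODEL
    # Large input and a patient user: trade latency and cost for quality.
    if prompt_tokens > LARGE_INPUT_THRESHOLD and user_accepts_wait:
        return LARGE_MODEL
    return SMALL_MODEL


print(pick_model(1000))
print(pick_model(3000, user_accepts_wait=True))
```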

The introduction of GPT-4o mini

The previous sections compared GPT-3.5 Turbo against GPT-4o to emphasize the differences between a smaller and a much larger model. However, in July 2024, OpenAI introduced the GPT-4o mini model. It comes with the same 128k-token context window as the GPT-4o model but at a much lower cost. It’s even cheaper than the GPT-3.5 models. See the OpenAI API pricing for details.

GPT-4o (not mini) is still a better model, but its price and latency may not justify the better results. For example, the following table shows the summary for a large issue (https://github.com/qjebbs/vscode-plantuml/issues/255). GPT-4o is on the left, and GPT-4o mini is on the right. The difference in costs is staggering, but the results are not that much different.

The message is that unless you have a specific reason for using GPT-3.5 Turbo, you should start a new project with the GPT-4o mini model. It will produce results comparable to GPT-4o for less than the GPT-3.5 Turbo cost.

GPT-4o summary	GPT-4o mini summary
3,859 tokens, US $0.0303	4,060 tokens, US $0.0012

The importance of using a good prompt

Precise instructions in the prompt are important to get good results. To illustrate what a difference a good prompt makes:

Select the “gpt-3.5” model.
Select the GitHub issue https://github.com/openai/openai-python/issues/488 from the sample list.
Click the “Generate summary with…“ button.

We get a summary of the comments like this one (click to enlarge).

If we remove from the prompt the line “Don’t waste words. Use short, clear, complete sentences. Use active voice. Maximize detail, meaning focus on the content. Quote code snippets if they are relevant.”, we get this summary. Note how the text is more verbose and is indeed “wasting words” (click to enlarge).

To remove the line, click on “Click to configure the prompt and the model” at the top of the screen and remove the line from the prompt, then click on the “Generate summary with…“ button again. Reload the page to restore the line.

Getting the prompt right is still an experimental process. It goes under the name of prompt engineering. These are some references to learn more about prompt engineering.

If all we have is a hammer…

Once we learn we can summarize texts with an LLM, we are tempted to use it for everything. Let’s say we also want to know the number of comments on the issue. We could ask the LLM by adding it to the prompt.

Click on “Click to configure the prompt and the model” at the top of the screen and add the line - Number of comments in the issue to the prompt as shown below. Leave all other lines unchanged.

You are an experienced developer familiar with GitHub issues.
The following text was parsed from a GitHub issue and its comments.
Extract the following information from the issue and comments:
- Issue: A list with the following items: title, the submitter name, the submission date and
time, labels, and status (whether the issue is still open or closed).
- Number of comments in the issue  <-- ** ADD THIS LINE **
...remainder of the lines...

The LLM will return a number of comments, but it will usually be wrong. Select, for example, the issue https://github.com/qjebbs/vscode-plantuml/issues/255 from the sample list. None of the models gets the number of comments right.

Why? Because LLMs are not “executing” instructions. They are simply generating one token at a time.

This is an important concept to keep in mind. LLMs do not understand what the text means. They just pick the next token based on the previous ones. They are not a replacement for code.

What to do instead? If we have easy access to the information we want, we should just use it. In this case, we can get the number of comments from the GitHub API response.

    issue, comments = get_github_data(st.session_state.issue_url)
    num_comments = len(comments)  # <--- This is all we need

What we learned in these experiments

LLMs are good at summarizing text if we use the right prompt.
Summarizing larger documents requires larger context windows or more sophisticated techniques.
Getting good results requires good prompts. Writing good prompts is still an experimental process.
Sometimes we should not use an LLM. If we can easily get the information we need from the data, we should do that instead of using an LLM.

This project lets you ask questions on a document and get answers from an LLM. It uses techniques similar to this project but with a significant difference: the LLM runs locally on your computer.

Writing good Jupyter notebooks

2022-09-19T00:00:00-04:00

Jupyter notebooks are an excellent tool for data scientists and machine learning practitioners. However, if not approached with a few techniques, they can turn into a pile of unintelligible, unmaintainable code.

This post will discuss some techniques I use to write good Jupyter notebooks. We will start with a notebook that is not wrong but is not well written. We will progressively change it until we arrive at a good notebook.

But first, what is a good Jupyter notebook? Good notebooks have the following properties:

They are organized logically, with sections clearly delineated and named.
They have important assumptions and decisions spelled out.
Their code is easy to understand.
Their code is flexible (easy to modify).
Their code is resilient (hard to break).

This post is adapted from a guest lecture I gave to Dr. Marques’ data science class. If you are pressed for time, check out the GitHub repository, starting with the presentation.

We will use as an example a notebook that attempts to answer the question “is there gender discrimination in the salaries of an organization?” Our dataset is a list of salaries and other attributes from that organization. We will start from the first step in any data project, exploratory data analysis (EDA), clean up the dataset, and finally, attempt to answer the question.

To illustrate how to go from a notebook that is not wrong but is also not good to a good one, we will go through the following steps:

Step 1: the original notebook, the one that lacks structure and proper coding practices.
Step 2: add a description, organize into sections, add exploratory data analysis.
Step 3: make data cleanup more explicit and explain why specific numbers were chosen (the assumptions behind them).
Step 4: make the code more flexible with constants and make the code more difficult to break.
Step 5: make the graphs easier to read.
Step 6: describe the limitations of the conclusion.

Step 1 - The original notebook

This is the original notebook. It is technically correct, but far from what is acceptable for a project of this importance.

The first hint of a problem is the structure of the notebook: it doesn’t have any. It’s a collection of cells, one after the other.

Step 2 - Add a description, organize it into sections, and add exploratory data analysis

Starting in this step, we will make incremental changes to the notebook. Each change will bring us closer to a good notebook. Changes from the previous step are highlighted with a “REWORK NOTE” comment and an explanation of what has changed. Here is an example:

In this step, we make the following improvements:

Add a clear “what is this notebook about?” description.
Add an exploratory data analysis (EDA) section.
Split the notebook into sections.

This is the reworked notebook. It is better, but we can still improve it:

Make the data cleanup more explicit.
Explain what the code blocks are doing.
Explain why specific numbers were chosen (the assumptions behind them).
Make the graphs easier to read.
Make the code more flexible with constants.
Make the code more resilient (harder to break).
Describe the limitations of the conclusion.

We will fix some of the issues in the next step.

Step 3 - Make data cleanup more explicit and explain why specific numbers were chosen

In this step, we make the following improvements:

Make the data cleanup more explicit.
Explain why specific numbers were chosen (the assumptions behind them).
Explain what the code blocks are doing.

The following figure shows how we explain why we are removing all employees that are 66 or older and add a reference to back up our decision (the hyperlink in the text). We also explain why we think this is a good decision.

Why should we document decisions to this level of detail? One reason is to remember why we made them. But, more importantly, we, the data scientists, may not be the domain experts. In this example, the domain experts are the HR and legal departments. We need to engage them to validate our decisions. Documenting to this level of detail invites that discussion with the domain experts.

This is the reworked notebook.

Step 4 - Make the code more flexible and more difficult to break

In this step, we make the following improvements:

Make the code more flexible with constants. If we need to change decisions, for example, the age cutoff, we have only one place to change.
Make the code more difficult to break. By following patterns, we reduce the chances of introducing bugs.

In this piece of code, we remove everyone who made less than the minimum wage working full time (see the notebook for details).

There are a few notable items in this code:

We use a constant if we need to make changes later (more flexible code).
We use a generic name for the constant (SALARY_CUTOFF), so we don’t need to change it later if we change the cutoff value. If we had named it something more specific, like MINIMUM_WAGE, we would need to change the constant name if we changed the value. This makes the code less flexible and less resilient.
We don’t modify the original data. We create a filter instead, so we can see the effect of each filter separately and backtrack one change at a time if we need to.
The filter variable also has a generic name (low_salaries), for the same reasons we used a generic name for the constant.
We print the results of the operation (the cutoff value and how many items it removed from the dataset), so we can discuss with the domain experts if our decision makes sense. For example, we could ask an HR representative if they expected to see this many employees removed when we set this salary cutoff. It may catch errors in the dataset or in the code.

Regarding the last item, printing the operation results: showing the effect of filtering data (how many employees were removed) helps validate the decisions with the domain experts.
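The items above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's code: the `employees` DataFrame, its column names, and the cutoff value are assumptions made up for the example. Only the names SALARY_CUTOFF and low_salaries come from the text.

```python
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
employees = pd.DataFrame({
    "age": [25, 41, 67, 33, 58],
    "salary": [52_000, 9_000, 88_000, 12_500, 71_000],
})

# Generic constant name: if the cutoff changes, only the value changes.
# The value is a placeholder, not the one used in the notebook.
SALARY_CUTOFF = 15_000

# Build a filter (a boolean mask) instead of modifying the original data.
low_salaries = employees["salary"] < SALARY_CUTOFF

# Report the effect of the filter so domain experts can validate it.
print(f"Salary cutoff: {SALARY_CUTOFF:,}")
print(f"Employees below the cutoff: {low_salaries.sum()} of {len(employees)}")
```

Because `employees` is untouched, we can inspect `employees[low_salaries]` to see exactly which rows the filter would remove before committing to the decision.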

When we clean up the age column, we keep using the same patterns:

We create a filter for the data we want to exclude, as we did for the salary filter.
We follow a pattern for the variable name. The salary one was named SALARY_CUTOFF, so this one is also suffixed with ..._CUTOFF.
We choose a generic variable name. If we name it something more specific, e.g. RETIRED_AGE and decide to change the age cutoff later, the RETIRED_ part may no longer make sense. A generic name (AGE_CUTOFF) requires only a change to the value, making the code more resilient.

With all the filters in place, we can clean up the data in one step. Because all the filters we created are to exclude data, we can confidently negate all of them to get the data we want to keep. If we use different types of filters (exclude and include), we have to carefully think about how to apply each of them, opening the door for bugs.
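As a rough sketch of that single cleanup step, assuming the same made-up `employees` data as an illustration (the DataFrame, the salary value, and the `older_employees` and `rows_to_exclude` names are hypothetical; the age cutoff of 66 comes from the text):

```python
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
employees = pd.DataFrame({
    "age": [25, 41, 67, 33, 58],
    "salary": [52_000, 9_000, 88_000, 12_500, 71_000],
})

SALARY_CUTOFF = 15_000  # placeholder value
AGE_CUTOFF = 66         # we remove employees that are 66 or older

# Both filters follow the same pattern: they mark rows to EXCLUDE.
low_salaries = employees["salary"] < SALARY_CUTOFF
older_employees = employees["age"] >= AGE_CUTOFF

# Since every filter excludes, we can combine them and negate once.
rows_to_exclude = low_salaries | older_employees
clean_employees = employees[~rows_to_exclude]

print(f"Kept {len(clean_employees)} of {len(employees)} employees")
```

Negating the combined mask is equivalent to negating each filter and keeping the rows that pass all of them. Adding a third exclusion filter later means adding one more term to `rows_to_exclude`, with nothing else to rethink.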

This is an important concept: don’t make your brain hold more information than it absolutely has to (don’t create extraneous cognitive load). If we follow a pattern, we have only one thing to remember, the pattern itself.

This is the reworked notebook.

Step 5 - Make the graphs easier to read

In this step, we make the graphs easier to read.

First, we add transparency when plotting multiple variables on the same graph.

This is the pairplot from the previous step, without transparency:

And this is the pairplot with transparency:

Adding transparency lets us see the clusters of data, the areas where we have many data points, as opposed to the places where we have few data points. It helps identify patterns in the data.

Another technique to make graphs readable is to bin the data. This is the graph from the previous step that plots age vs. salary:

It is impossible to see any pattern in such a graph. To make it more legible, we will bin the data. But the question is, “what bins make sense for this case?” Since we are analyzing salaries, we chose 22 as our first bin because this is usually the age of graduation. After that, we will bin every five years for the first years to account for rapid promotions and raises that happen at the start of a career, then bin every ten years for later stages in the career, where promotions are rarer. We also document those assumptions clearly to discuss them with the domain experts.
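One possible way to express these bins in pandas is with `pd.cut`. The exact bin edges below are my assumption based on the description above (a first edge at 22, five-year bins early on, ten-year bins later), and the data is made up. See the notebook for the bins actually used.

```python
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
employees = pd.DataFrame({
    "age": [21, 23, 26, 30, 35, 44, 52, 63],
    "salary": [30_000, 42_000, 48_000, 55_000, 61_000, 70_000, 74_000, 80_000],
})

# Assumed edges: up to graduation age (22), five-year bins for the early
# career, then ten-year bins for later stages.
AGE_BIN_EDGES = [0, 22, 27, 32, 37, 47, 57, 67]

# right=False makes each bin include its left edge: [22, 27), [27, 32), ...
age_bins = pd.cut(employees["age"], bins=AGE_BIN_EDGES, right=False)

# Summarize salaries per bin instead of plotting every single age.
median_salary_by_bin = employees.groupby(age_bins, observed=True)["salary"].median()
print(median_salary_by_bin)
```

Keeping the edges in one constant follows the same pattern as the cutoffs: if the domain experts disagree with the bins, there is only one place to change.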

This is the new graph:

This is the reworked notebook.

Step 6 - Describe the limitations of the conclusion

We now have a good notebook. It is organized in sections, uses constants to make the code more understandable and resilient, the graphs are well formatted, and we added explanations for all assumptions and decisions.

We are now at the last step, where we present the conclusion to the original question, “is there gender discrimination in the salaries of an organization?”.

In real life, the data we have is not perfect and complex questions don’t always have simple answers. And that’s the case here. We have a few limitations that prevent us from giving a definitive answer to the question. But we have enough to spur some action. Our job at this point is to document what we found and the limitations of our analysis.

In the conclusion section, we clearly document:

That we used proxy variables.
That, despite the dataset’s limitations, we have tentative conclusions.
That we need more precise data, but at the same time, we have enough to take action (and avoid analysis paralysis).

This is the final notebook.

Conclusion

We write notebooks for our stakeholders, not for ourselves.

To write good notebooks, we need to:

Organize them logically so that the stakeholders can follow the analysis.
Make the code easy to understand, easy to change (flexible), and hard to break (resilient), so we can modify it confidently as we review the results with the stakeholders.
Spell out critical assumptions and decisions so stakeholders can validate them (or challenge them).
Clearly document the limitations of the analysis so stakeholders can decide if they are acceptable or not.

Running the examples

The notebooks are available on this GitHub repository.

Vision transformer properties

2022-07-23T00:00:00-04:00

Transformers crossed over from natural language into computer vision in a few low-key steps until the An Image is Worth 16x16 Words paper exploded onto the machine learning scene late in 2020.

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) were the dominant architectures in computer vision tasks until transformers arrived on the scene. When I started studying vision transformers, I assumed they were a replacement for CNNs and RNNs. I learned that they are more than that. They are a fundamentally different approach to the problem, resulting in some interesting properties.

In this article we will review transformers’ properties in computer vision tasks that set them apart from CNNs and RNNs.

If you haven’t read the paper yet, start with the accompanying blog post Google AI Blog: Transformers for Image Recognition at Scale. It has a nice animation and covers the topics in the paper at a higher level. This six-minute video from AI Coffee Break with Letitia is an excellent introduction to the paper or a refresher if it has been a while since you read it.

If this is your first encounter with transformers, start with transformers in natural language processing to learn the fundamental concepts, then come back to vision transformers. Check out Understanding transformers in one morning if you are not yet familiar with the topic.

How transformers process images

First, a brief review of how transformers were adapted from natural language processing (NLP) to computer vision.

Transformers operate on sequences. In NLP the sequence is a piece of text. For example, in the sentence “the cat refused to eat the food because it was cold” we can correlate the word “it” to “food” (not “cat”) and use that to illustrate the concept of “attention.” It is easy to conceive text as a sequence of words and imagine transformer concepts that way.

But what is a “sequence” in computer vision? That is the first significant difference between transformers in computer vision and transformers in NLP.

A naive solution would be to treat an image as a sequence of pixels. The problem with this approach is that it generates humongous sequences. A 256 x 256 RGB image, commonly used to train models, results in a sequence of 196,608 values (256 x 256 pixels x 3 RGB channels). This large sequence would require too many computing resources for training and inference. To help visualize: a 400-page book has about 200,000 words. In this one-to-one mapping of pixel values to words, it would be the equivalent of feeding that book to a transformer network at once.

To make the problem tractable, An Image is Worth 16x16 Words partitions images into squares called patches. Each patch is the equivalent of a token in an NLP transformer. Back to the 256 x 256 image, partitioning it into 16 x 16 squares results in 256 patches (tokens). Each patch is still a large number of pixels, but the problem is more tractable now because the number of tokens is much smaller.
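To make the numbers concrete, here is a small NumPy sketch that compares the naive sequence length with the number of patches and cuts a stand-in image into flattened 16 x 16 patches. It is an illustration of the arithmetic, not the paper's implementation: the random image and all variable names are assumptions, and the step where the paper linearly projects each flattened patch is omitted.

```python
import numpy as np

IMAGE_SIZE = 256  # height and width of the example image
CHANNELS = 3      # RGB
PATCH_SIZE = 16   # 16 x 16 patches

# Stand-in for a real image: random pixel values (an assumption for the demo).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(IMAGE_SIZE, IMAGE_SIZE, CHANNELS), dtype=np.uint8)

# Naive approach: every value in the image is an element of the sequence.
naive_sequence_length = image.size

# Patch approach: split height and width into blocks of PATCH_SIZE,
# bring the two block indices together, then flatten each patch.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
patches = (
    image.reshape(patches_per_side, PATCH_SIZE, patches_per_side, PATCH_SIZE, CHANNELS)
    .transpose(0, 2, 1, 3, 4)
    .reshape(patches_per_side * patches_per_side, PATCH_SIZE * PATCH_SIZE * CHANNELS)
)

print(f"Naive sequence length: {naive_sequence_length:,}")
print(f"Patches (tokens): {patches.shape[0]}, values per patch: {patches.shape[1]}")
```

The sequence shrinks from 196,608 elements to 256 tokens, each one carrying 768 values (16 x 16 pixels x 3 channels).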

In addition to the patches, the network has one more token, the class token. The network’s output for this token is used as the image classification (“cat”, “dog”, …). Beyond that, the transformer network in An Image is Worth 16x16 Words is the same as the transformers used in natural language processing. In the words of the paper, “The “Base” and “Large” models are directly adopted from BERT”.

The picture below, from Google’s blog post, shows the network architecture. Token zero is the class token. The patches are extracted from the image and used as tokens. This transformer is known as ViT, the vision transformer. The term ViT is commonly used in the literature to refer to this architecture.

The vision transformer (ViT) architecture from Google’s blog post

How are transformers different from CNNs in computer vision?

Convolutional neural networks (CNN) work in small image areas. The learned weights are related to that small area, as shown in this picture from Stand-Alone Self-Attention in Vision Models.

CNN locality inductive bias Stand-Alone Self-Attention in Vision Models

In other words, the concept of “locality” (pixels closer to each other are related) is part of the CNN architecture as a prior, or inductive bias, a piece of knowledge that the network creators embedded into the network’s architecture. This piece of knowledge makes assumptions about what the best solution is for a specific problem. Perhaps there are better ways to solve the problem, but we are constraining the solution space to the inductive biases that are part of the network architecture.

On the other hand, a transformer network doesn’t have such inductive biases embedded into its architecture. For example, it has to learn that “locality” is a good thing in computer vision problems on its own.

This lack of inductive bias in the network architecture is a fundamental difference between transformers and CNNs. In more practical terms, a transformer network does not make assumptions about the structure of the problem. As a result of that, the network has to learn the concepts.

Eventually, the transformer network does learn convolutions and locality. The picture below (from An Image is Worth 16x16 Words) shows the size of the image area attended by each head in each layer. In the lower layers (left), some heads attend to pixels close to each other (bottom of the graph), and other heads attend to pixels further away (top of the graph). As we move up in the layers (right of the graph), heads attend to pixels farther out in the image area (top of the graph). In other words, lower layers have both local and global attention, while higher layers have global attention. The network was not told to behave this way. It learned this attention pattern on its own.

In the words of the authors:

This “attention distance” is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs.

An Image is Worth 16x16 Words

ViT head attention by layer – An Image is Worth 16x16 Words

Fewer assumptions → more interesting solutions

If, in the end, transformers learn convolutions and locality anyway, what have we gained by using transformers for computer vision? Why go through all the trouble of training transformers to do what CNNs do from the start?

In the words of Lucas Beyer (Stanford CS 25 lecture), one of the technical contributors to ViT:

[W]e want the model to have as little of our thinking built-in, because what we may think that is good to solve the task may actually not be the best to solve the task. … [W]e want to encode as little as possible into the model, such that if we just throw massive amounts of data in a difficult task at it, it might think things that are even better than [what we would have assumed]… Ideally, we want [a] model that is powerful enough to learn about this concept [locality] itself, if it’s useful to solve the task. If it’s not useful to solve the task, then if we had put it in, there is no way for the model not to do this.

Lucas Beyer – Stanford CS 25 lecture

What else do transformers learn on their own?

So, transformers learned to behave like CNNs. What else could they be learning on their own? By changing how a transformer model is trained, Emerging Properties in Self-Supervised Vision Transformers found out that:

[W]e make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers.

Emerging Properties in Self-Supervised Vision Transformers

More concretely, when trained with self-supervision (without labels), transformers also learn object segmentation on their own, as shown in the following picture from the paper (for a more lively demonstration, see their blog post).

Transformer segmentation – Emerging Properties in Self-Supervised Vision Transformers

Segmenting an image requires some understanding of what the objects are, i.e., understanding the semantics of an image and not just treating it as a collection of pixels. The fact that the transformer model is segmenting the image indicates that it is also extracting semantic meanings. From their blog post:

DINO learns a great deal about the visual world. By discovering object parts and shared characteristics across images, the model learns a feature space that exhibits a very interesting structure. If we embed ImageNet classes using the features computed using DINO, we see that they organize in an interpretable way, with similar categories landing near one another. This suggests that the model managed to connect categories based on visual properties, a bit like humans do.

Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training

Transformer class separation – Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training

Intriguing Properties of Vision Transformers explores more properties of vision transformers, such as dealing with occlusion better than CNNs.

Transformers deal with occlusion better than CNNs – Intriguing Properties of Vision Transformers

They introduce a “shape token” to train transformers to be more shape-biased than they naturally are to get automated object segmentation (rightmost column in the picture below).

The [results] show that properly trained ViT models offer shape-bias nearly as high as the human’s ability to recognize shapes. This leads us to question if positional encoding is the key that helps ViTs achieve high performance under severe occlusions (as it can potentially allow later layers to recover the missing information with just a few image patches given their spatial ordering).

Intriguing Properties of Vision Transformers

Transformer segmentation with shape token better than CNNs – Intriguing Properties of Vision Transformers

Vision Transformers are Robust Learners doesn’t have fancy pictures to illustrate what they found, but the results are no less interesting. They found out that without any specific training, vision transformers can cope with image perturbations better than CNNs.

[W]e study the robustness of the Vision Transformer … against common corruptions and perturbations, distribution shifts, and natural adversarial examples. … [W]ith fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on discrete cosine energy spectrum reveal intriguing properties of ViT attributing to improved robustness.

Vision Transformers are Robust Learners

Why do vision transformers perform better than CNNs?

We are still in the early stages of understanding the differences between CNNs and vision transformers.

Do Vision Transformers See Like Convolutional Neural Networks? found significant differences in the structure of vision transformers and CNNs.

[We use] CKA to study the internal representation structure of each model. … Figure 1 [below] shows the results as a heatmap, for multiple ViTs and ResNets. We observe clear differences between the internal representation structure between the two model architectures: (1) ViTs show a much more uniform similarity structure, with a clear grid like structure (2) lower and higher layers in ViT show much greater similarity than in the ResNet, where similarity is divided into different (lower/higher) stages.

Do Vision Transformers See Like Convolutional Neural Networks?

Transformer vs. ResNet internal representation – Do Vision Transformers See Like Convolutional Neural Networks?

This is the first significant difference between vision transformers and CNNs.

[T]hese results suggest that (i) ViT lower layers compute representations in a different way to lower layers in the ResNet, (ii) ViT also more strongly propagates representations between lower and higher layers (iii) the highest layers of ViT have quite different representations to ResNet.

Do Vision Transformers See Like Convolutional Neural Networks?

A possible explanation for this structural difference is how the transformer layers learn to aggregate spatial information. CNNs have fixed receptive fields (encoded in the kernel sizes and sequences of layers). Transformers do not have this prior knowledge of “receptive fields” for an image. They have to learn that spatial relations are important in image processing.

The experiments in the paper confirmed the observation in the original ViT paper that transformers eventually settle in a structure where lower layers learn to pay attention locally and globally, while higher layers learn to pay attention globally.

[E]ven in the lowest layers of ViT, self-attention layers have a mix of local heads (small distances) and global heads (large distances). This is in contrast to CNNs, which are hardcoded to attend only locally in the lower layers.

Do Vision Transformers See Like Convolutional Neural Networks?

Lower layers in vision transformers pay attention locally and globally – Do Vision Transformers See Like Convolutional Neural Networks?

What was not covered here

Vision transformers are barely a few years old. We are still learning more about how to train them and how they behave. This is a short list of active research areas.

More efficient training and inference

Networks with fewer priors embedded in their design need more data to learn the priors they don’t have. ViT was trained on a dataset of 300 million images. Large datasets are still private (for the most part) and require a huge amount of computing power to train the model.

New training methods, such as data-efficient image transformers (DeiT), manage to train vision transformers using only ImageNet (while still large, it’s within reach of more research teams and organizations). See Efficient Transformers: a Survey for more work in this area.

Is “attention” needed?

“Attention” is a central concept in transformer networks. But is it really necessary to achieve the same results? Some intriguing research questions whether we need attention at all.

FNet: Mixing Tokens with Fourier Transforms: “We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens. … [W]e find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.”
MLP-Mixer: An all-MLP Architecture for Vision: “[W]e show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs) … When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.”

Catching up with recent developments

Transformers are moving fast. These are some places I use to keep up with recent developments.

Yannic Kilcher reviews recent papers on his YouTube channel. It is a great place to go after reading a paper to check your understanding and pick up insights you may have missed on a first pass.
AI Coffee Break with Letitia distills papers into short videos (about ten minutes or so). It’s the ideal format to review the essence of papers.
For a slower pace but a broader view, the authors of A Survey on Vision Transformer and Transformers in Vision: A Survey publish new versions of their papers every few months.

Understanding transformers in one morning

2022-07-22T00:00:00-04:00

Transformers are (deservedly so) a hot topic in machine learning.

If you are new to transformers, the resources in this article will help you understand their fundamentals and applications. It will take about one morning (four hours, give or take) to go through all items.

I created the list after spending much longer than one morning wading through many articles and videos. I lost time going around in circles, wasting time with superficial sources, or stumbling on articles that were too deep for my level when I first encountered them but were great once I was more prepared.

This list is organized in a logical sequence, building up the knowledge from the first principles, then going deeper into the details. They are the videos and articles that helped me the most. I hope they help you as well.

Hour 1 - The paper

First, read Google AI Research’s blog post Google AI Blog: Transformer: A Novel Neural Network Architecture for Language Understanding. Don’t follow the links; just read the post. Then read the paper Attention Is All You Need. Don’t worry about understanding the details at this point. Get familiar with terminology and pictures.

The paper has about 6,000 words. It would take twenty minutes to read at the average reading pace of 300 words per minute. But it’s a scientific paper, so it will take longer. Using the three-pass approach, let’s reserve an hour to read it.

Hour 2 - Key concepts

The second hour is about understanding the key concepts in the paper with Rasa’s Algorithm Whiteboard video series.

Rasa Algorithm Whiteboard - Transformers & Attention 1: Self Attention (14 minutes): Explains attention first with a simple example using a time series, then with a text example. The video introduces word embedding, a key concept for NLP (natural language processing) models, including transformers. With these concepts explained, it defines self-attention.
Rasa Algorithm Whiteboard - Transformers & Attention 2: Keys, Values, Queries (13 minutes): Building on the previous video, it explains keys, queries, and values. First, it explains the operations that make up the attention layer conceptually, as a process to add context to a value (you can think of a “value” as a “word” in this context). Since we are trying to create a model, it describes where we need to add trainable parameters (weights). With the concepts and weights in place, it reintroduces the operations as matrix operations that create the stackable self-attention block.
Rasa Algorithm Whiteboard - Transformers & Attention 3: Multi Head Attention (11 minutes): Using a phrase as an example, it explains why we need more than one attention head to understand the context where words are used (multi-head attention). The fact that the attention heads are independent is a crucial concept in transformers. It allows matrix operations for each head to run in parallel, significantly speeding up the training process.
Rasa Algorithm Whiteboard: Transformers & Attention 4 - Transformers (15 minutes): With the foundational concepts explained, this video covers the pictures in the “Attention is All You Need” paper that make up the transformer architecture. The new concept introduced here is positional encoding. It ends by highlighting how the transformer architecture lends itself to parallelization in ways other attention architectures cannot.

We just finished the second hour of our morning of understanding transformers. Rasa’s videos are a great introduction but are still informal. That’s not a bug – it’s a feature. They introduce the key concepts in simple terms, making them easy to follow.

Hour 3 - Digging into details

Now we will switch to a more formal introduction with these two lectures from professor Peter Bloem, VU University in Amsterdam.

Lecture 12.1 Self-attention (23 minutes): Explains, with the help of illustrations, the matrix operations to calculate self-attention, then moves on to keys, queries, and values. With the basic concepts in place, it explains why we need multi-head attention.
Lecture 12.2 Transformers (18 minutes): Examines the pieces that make up the transformer model in the paper. The pictures from the paper are dissected with some math and code.

Hour 4 - Pick your adventure

Go wide with LSTM is dead, long live Transformers (30 minutes): This talk gives a sense of history, explaining how we approached natural language problems in the past, their limitations, and how transformers overcame those limitations. It shows how to implement the transformer calculations with Python code. If you are better at visualizing code than math (like me), this can help you understand the operations.
Go deep with The Annotated Transformer (30 to 60 minutes to read, hours and hours to experiment): This article by the Harvard NLP team annotates the transformer paper with modern (as of 2022) PyTorch code. Each section of the paper is supplemented by the code that implements it. Part 3, “A Real World Example”, implements a fully functional German-English translation example using a smaller dataset that makes it workable on smaller machines.

Where to go from here

It is a good time to reread the paper. It will make more sense now.

These are other articles and videos that helped me understand transformers. Some of them overlap with the ones above, and some are complementary.

Positional embedding (encoding) is a key concept in understanding transformers. The transformer paper assumes that the reader knows that concept and briefly explains the reasons to use sine and cosine. This video from AI Coffee Break with Letitia explains in under ten minutes the concepts and the reasons to use sine and cosine.
Transformers from scratch is the accompanying blog post to hour 3, “Digging into details.” Professor Bloem describes some concepts explored in the video and adds code to show how they are implemented.
Transformers from Scratch (same title, different article) takes more time than other articles to explain one-hot encoding, dot product, and matrix multiplication, among others, with illustrations. By the time it gets to “attention as matrix multiplication”, it’s easier to understand the math. This post can be a good refresher if you are rusty on the math side of machine learning.
Transformer model for language understanding is TensorFlow’s official implementation of the paper. It is not as annotated as the PyTorch code in The Annotated Transformer, but still helpful if you are in a TensorFlow shop.
The Transformer Model in Equations is exactly what the name says, transformers as mathematical operations. The “Discussion” section is an insightful explanation of the equations, valuable even if you don’t have a strong math background (like me).
The Illustrated Transformer is an often-cited source for understanding transformers. It is a good source if someone can read only one article beyond the paper.
Andrej Karpathy’s Let’s build GPT: from scratch, in code, spelled out walks through the code to build a transformer model from scratch. At just under two hours, it’s the best investment of time at the code level I have found. Andrej is a great teacher and knows what he is talking about.

For a sense of history, these two papers are highly cited as works that led to the transformer architecture.

Neural Machine Translation by Jointly Learning to Align and Translate is the paper credited with introducing the “attention” mechanism.
Effective Approaches to Attention-based Neural Machine Translation builds on the previous paper, introducing other important concepts, including dot-product attention. This official TensorFlow notebook implements a Spanish-to-English translation based on the paper.

Finally, Attention is all you need; Attentional Neural Network Models is a talk by Łukasz Kaiser, one of the paper’s authors. He builds up the solution, starting with how natural language translation used to be solved in the past, the limitations, and how transformers solve them. So far, it’s what I would expect from one of the authors. What makes this video interesting to me is how humble Łukasz is. He explains the trials and errors and, at one point, how they had to ask for help to train the model they created.

Reading a scientific paper makes it look like a linear story from problem to solution (“we had an idea and implemented it”). Watching Łukasz talk helps us understand how these great solutions don’t arrive out of thin air. Researchers build on top of previous work, try many variations, make mistakes, and ask for help to complete their work. Then they write the paper…

If your interests are in computer vision, it turns out transformers work quite well for that too.

Applications of transformers in computer vision

2021-12-01T00:00:00-05:00

This article describes the evolution of transformers, their application in natural language processing (NLP), their surprising effectiveness in computer vision, ending with applications in healthcare.

It starts with the motivation and origins of transformers, from the initial attempts to apply a specialized neural network architecture (recurrent neural network – RNN) to natural language processing (NLP), the evolution of such architectures (long short-term memory and the concept of attention), to the creation of transformers and what makes them perform well in NLP. Then it describes how transformers are applied to computer vision. The last section describes some of the applications of transformers in healthcare (an area of interest for my research).

Side note: It was originally written as a survey paper for a class I took. Hence the references are in bibliography format instead of embedded links.

If you are new to transformers, see Understanding transformers in one morning and Vision transformer properties.

The origins of transformers – natural language processing

When context matters

In some machine learning applications, we train models by feeding one input at a time. The trained model is then used in the same way: given one input, make a prediction. The typical example is image recognition and classification. We train the model by feeding one image at a time. Once trained, we feed one image and the model returns a prediction.

However, there are other classes of problems where a single input is not enough to make a prediction. Natural language processing is a prominent example. When translating a sentence, it is not enough to look at one word at a time. The context in which a word is used matters. For example, the Portuguese word legal is translated in different ways to English.

Isso é um argumento legal → This is a legal argument

Isso é um seriado legal → This is a nice TV series

In these applications of machine learning, context matters. The translation of “legal” depends on the word that came before it. If we represent the phrases as vectors (so a model can process them), we could, for example, represent the first phrase as the vector p1=[87,12,43,215,102] and the second phrase as the vector p2=[87,12,43,175,102].

A model attempting to translate the word “legal”, encoded as 102, must remember what came before it. The model must translate 102 one way if it was preceded by 215 (p1) and another way if it was preceded by 175 (p2).

The model must have a “memory” of what it has seen so far. Or, in other words, the model’s output is contextual: it is based not only on its current state (the current input – the current word) but also on previous states (what came before the current input – the words that came before). To understand the context, the model must “remember” what it has seen so far, instead of taking only one input at a time, i.e. the model must work with a sequence of input values.
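A toy sketch of this idea, using the token codes from the example above. The lookup tables and the function name are hypothetical, made up for illustration. A real model learns these associations instead of storing them in a table.

```python
# Token codes from the example: 102 = "legal", 215 = "argumento", 175 = "seriado".
# Hypothetical lookup tables, for illustration only.
CONTEXT_FREE = {102: "legal"}
CONTEXTUAL = {(215, 102): "legal", (175, 102): "nice"}


def translate_last_word(phrase):
    """Translate the last token, using the previous token as context."""
    previous, current = phrase[-2], phrase[-1]
    return CONTEXTUAL.get((previous, current), CONTEXT_FREE.get(current))


p1 = [87, 12, 43, 215, 102]  # Isso é um argumento legal
p2 = [87, 12, 43, 175, 102]  # Isso é um seriado legal
print(translate_last_word(p1))  # legal
print(translate_last_word(p2))  # nice
```

The same input (102) produces different outputs because the function also looks at what came before it.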

Remembering the past – recurrent neural networks

Recurrent neural networks (RNNs) are a class of networks that can model such problems. The figure below shows the standard representation of an RNN cell. The blue arrow indicates the “temporal loop” in the network: the result from a previous input, known as the state, is fed into the network when processing a new input. Using the state from a previous input when processing new input allows the network to “remember” what it has seen so far.

The temporal loop can be conceptually represented as passing the state from the past steps into the future steps. In the figure below, the RNN cell is unrolled (repeated) to represent the state from previous steps passed into the subsequent ones (this process is also called “unfolding” the network).
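The temporal loop can be sketched with a scalar RNN cell. This is a minimal illustration, assuming made-up weights (w_x, w_h, b) and a tanh activation. A real network uses weight matrices learned during training.

```python
import math


def rnn_step(x, state, w_x=0.5, w_h=0.8, b=0.0):
    """One RNN step: combine the current input with the previous state."""
    return math.tanh(w_x * x + w_h * state + b)


def run_rnn(inputs, initial_state=0.0):
    """Unroll the cell over a sequence, passing the state from step to step."""
    state = initial_state
    states = []
    for x in inputs:
        state = rnn_step(x, state)
        states.append(state)
    return states


# Same final input (1.0), different history: the final states differ.
print(run_rnn([0.0, 0.0, 1.0])[-1])
print(run_rnn([1.0, 1.0, 1.0])[-1])
```

The loop in run_rnn is the unrolled network: each iteration is one copy of the cell, receiving the state produced by the previous copy.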

Forgetting the past – vanishing and exploding gradients

RNNs are trained with a variation of back-propagation, similar to how we train other types of neural networks. First, we choose how many steps we will unroll the network, and then we apply a specialized version of back-propagation (Ian et al., 2016).

Ideally, we would like to create an RNN with as many unrolled steps as possible, to have as much context as possible (i.e. remember very long sentences or even entire pieces of text). However, a large number of unrolled steps has an unfortunate effect: vanishing and exploding gradients, which limit the size of the network we can build (Bengio et al., 1994) (Pascanu et al., 2013).

In practice, the result is that we have to limit the number of unrolled steps of an RNN, thus limiting how far back the network can “remember” information.
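A simplified way to see the problem, assuming a linear RNN where each state is the previous state multiplied by a single recurrent weight w_h (a deliberate simplification, not the general case). By the chain rule, the gradient that flows back through the unrolled steps picks up one factor of w_h per step.

```python
def gradient_through_time(w_h, steps):
    """Gradient of the last state with respect to the first, for h_t = w_h * h_(t-1)."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= w_h  # chain rule: one factor of w_h per unrolled step
    return gradient


for steps in (5, 50):
    vanishing = gradient_through_time(0.5, steps)
    exploding = gradient_through_time(1.5, steps)
    print(steps, vanishing, exploding)
```

With a weight below one, the gradient shrinks towards zero as the number of steps grows. With a weight above one, it grows without bound.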

Going further into the past – long short-term memory

Long short-term memory (LSTM) is a recurrent network architecture created to deal with the vanishing and exploding gradient problem of the classical RNN architecture (Hochreiter & Schmidhuber, 1997). It does so with a more complex cell design. In this design, the gradients are all contained within the LSTM cell, making them more stable because they no longer have to traverse the entire network.

The figure below, from (Greff et al., 2017), compares an RNN cell (left) with a typical LSTM cell (right), including the “forget gate” that enables it to learn long sequences that are not partitioned into subsequences (Gers et al., 1999).
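A minimal sketch of one LSTM step with scalar values, to show what the forget gate does. It assumes all weights are fixed at 1.0 and uses a hypothetical forget_bias parameter to control the gate. A real LSTM learns separate weight matrices for each gate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, forget_bias):
    """One scalar LSTM step. All weights are fixed at 1.0 for simplicity."""
    z = x + h_prev                # shared pre-activation (weights = 1.0)
    f = sigmoid(z + forget_bias)  # forget gate: how much old cell state to keep
    i = sigmoid(z)                # input gate: how much new information to admit
    o = sigmoid(z)                # output gate: how much cell state to expose
    g = math.tanh(z)              # candidate value for the cell state
    c = f * c_prev + i * g        # new cell state
    h = o * math.tanh(c)          # new hidden state
    return h, c


# With no new input, an open forget gate keeps the cell state and a closed one erases it.
_, kept = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, forget_bias=5.0)
_, erased = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, forget_bias=-5.0)
print(kept, erased)
```

The cell state c is carried from step to step mostly unchanged when the forget gate is open, which is what lets the cell hold information over long sequences.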

Deciding where to look – attention

With LSTM we have a solution to look further into the past and process larger sentences. Now we need to decide where to look when processing a sentence because the order of the words is important for language processing. A model cannot mindlessly translate one word at a time.

A typical example where the order of words matters is the placement of adjectives. Back to the first example, we can see that the placement of “legal” varies in each language.

Isso é um argumento legal → This is a legal argument

How does a model know that “legal” goes to a different position in the translated phrase? The solution has two parts. First, the model needs to process the entire sentence, not each word separately. Then, the model needs to learn that it has to pay more attention to some parts of the phrases than others, at different times (in the example above, although “legal” comes last in the input, the model has to learn that in the output it must come before “argument”).

RNN encoder/decoder networks (Cho et al., 2014) are used for the first part, processing the entire sentence. An encoder/decoder has two neural networks: one that converts (encodes) a sequence of words into a representation suitable to train a network, and another network that takes the encoded representation and translates (decodes) it. The decoder, armed with a full sequence of words and not just one word, implements the second part of the solution: decide in which sequence it must process the words (which may not be in the same order they were received, as in this case).

This process is known as attention (Bahdanau et al., 2014) (Luong et al., 2015), as in “where should the decoder look to produce the next output”.
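A minimal sketch of one attention step, assuming dot-product scoring (one of the variants in (Luong et al., 2015)) and made-up two-dimensional vectors. The decoder state is compared with every encoder state, the scores are normalized with softmax, and the result tells the decoder where to look.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoding step."""
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    size = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(size)]
    return weights, context


# Made-up encoder states for the five input words of "Isso é um argumento legal".
encoder_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.2], [0.2, 1.0], [0.0, 3.0]]
decoder_state = [0.0, 2.0]  # made-up state that happens to align with the last word
weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

In this made-up example, the largest weight falls on the last input word, even though the decoder may be producing an earlier word of the output.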

“Attention is all we need” – transformers

Adding the concept of attention significantly improved the accuracy of the networks, but it is still part of a time-consuming process, the training of the encoder and decoder RNNs.

If what we want is the information to calculate attention, can we do that in a faster way? It turns out we can. Transformer networks dispense with RNNs and directly compute the important piece of information we want, attention. They achieve better accuracy for a fraction of the training time (Jakob, 2017) (Vaswani et al., 2017).

Instead of using RNNs, transformers use stacks of feed-forward layers (a simple layer of neurons, without cycles, unlike RNNs). The figure below, from the original paper (Vaswani et al., 2017), shows the network architecture.

Dispensing with RNNs has two effects: the training process can be parallelized (RNNs are sequential by definition: the state of a previous step is fed into the next step) and computations are much faster. The following table, from (Vaswani et al., 2017), shows the smaller computational complexity of the transformer model compared to RNNs and convolutional neural networks (CNNs).

The rightmost columns of the following table, also from (Vaswani et al., 2017), compare the training cost (in FLOPs). The transformer models are two to three orders of magnitude less expensive to train.
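To make the per-layer complexity comparison concrete, here is a small sketch using the orders of growth reported in (Vaswani et al., 2017): n² · d for self-attention and n · d² for a recurrent layer, where n is the sequence length and d is the representation size. The function names and the sample values are mine, and constant factors are ignored.

```python
def self_attention_ops(n, d):
    """Order of growth of a self-attention layer: n^2 * d."""
    return n * n * d


def recurrent_ops(n, d):
    """Order of growth of a recurrent layer: n * d^2."""
    return n * d * d


d = 512  # representation size
for n in (50, 512, 2048):  # sequence lengths
    print(n, self_attention_ops(n, d), recurrent_ops(n, d))
```

Self-attention is cheaper when the sequence is shorter than the representation size, which is the common case for sentences. The relationship reverses for very long sequences.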

The best performing language models today, BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and GPT-3 (Brown et al., 2020), are based on the transformer architecture. The combination of a simpler network and parallelization allowed the creation of these large, sophisticated models.

A key concept of the transformer architecture is the “multi-head self-attention” layer. “Multi” refers to the fact that instead of having one attention layer, transformers have multiple attention layers running in parallel. In addition, the layers employ self-attention (Cheng et al., 2016) (Lin et al., 2017). With such a construct, transformers can efficiently weigh the contribution of multiple parts of a sentence simultaneously. Each self-attention layer can encode longer-range dependencies, capturing the relationship between words that are further apart (compared to RNNs and CNNs).

The ability to pay attention to multiple parts of the input and the encoding of longer-range dependencies results in better accuracy. The figure below (Alammar, 2018b) shows how self-attention allows a model to learn that “it” refers more strongly to “The animal” in the sentence.
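A minimal sketch of multi-head self-attention in plain Python. It is a simplification: each head works on a slice of the input vectors, and queries, keys, and values are the vectors themselves. A real transformer applies learned projection matrices to produce them. The token vectors are made up.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs


def multi_head_self_attention(vectors, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    head_size = len(vectors[0]) // num_heads
    heads = []
    for h in range(num_heads):
        head_slice = [v[h * head_size:(h + 1) * head_size] for v in vectors]
        heads.append(self_attention(head_slice))
    # Concatenate the per-head outputs for each position.
    return [sum((head[pos] for head in heads), []) for pos in range(len(vectors))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_self_attention(tokens, num_heads=2)
print(len(output), len(output[0]))  # 3 4
```

Every position attends to every other position in a single step, and the heads are independent of each other, so they can run in parallel.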

Research continues to create larger transformer models. A recent advancement in the architecture of transformers is Big Bird (Zaheer et al., 2021). It removes the original model’s quadratic computational and memory dependency on the sequence length by introducing sparse attention. By removing the quadratic dependency, larger models can be built, capable of processing larger sequences.
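A rough sketch of why sparse attention helps, counting how many attention scores must be computed. It covers only a sliding window, where each token attends to its close neighbors. Big Bird combines a window with other attention patterns, so this is an illustration of the idea, not of that architecture.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n scores."""
    return n * n


def windowed_attention_pairs(n, window):
    """Each token attends only to tokens at most `window` positions away."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))


for n in (512, 4096):
    print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=3))
```

The full count grows with the square of the sequence length, while the windowed count grows linearly with it.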

Transformers in computer vision

The concepts of “sequence” and “attention” can also be applied to computer vision. The original applications of attention in image processing used RNNs, like the NLP counterparts. Neural networks with attention were used for image classification (Mnih et al., 2014), multiple object recognition (Ba et al., 2015), and image caption generation (Xu et al., 2016). These applications of attention to computer vision experienced the same issues that afflicted NLP architectures based on RNN: vanishing or exploding gradients and long times to train the model.

And, just like in NLP, the solution was to apply self-attention, using the transformer architecture. One of the first applications of transformers in computer vision was in image generation (Parmar et al., 2018). (Carion et al., 2020) applied transformers to object detection and segmentation using a hybrid architecture, with a CNN used to extract image features.

Then (Dosovitskiy et al., 2020, which includes references to earlier works they built upon) dropped all other types of networks, creating a “pure” transformer architecture for image recognition. In the figure below, from that paper, we can see the same elements of the NLP transformer architecture, now applied to computer vision: the lack of more complex networks (like RNN or CNN) that results in fast training time, the concept of sequences (created by splitting the image into patches), and the multi-headed attention. This architecture is known as ViT (Vision Transformer).
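A minimal sketch of how an image becomes a sequence, assuming a single-channel image stored as a list of rows whose dimensions are multiples of the patch size. The function name is hypothetical. ViT also projects each patch into an embedding and adds position information, which this sketch leaves out.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            sequence.append(patch)
    return sequence


# A made-up 4x4 single-channel "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patch_sequence(image, patch_size=2)
print(patches)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each flattened patch plays the role that a word plays in a sentence.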

The resulting transformer models are more accurate than the convolutional neural network (CNN) models typically used in computer vision and, more importantly, significantly faster to train. In the table below, from (Dosovitskiy et al., 2020), the first three columns are three versions of the transformer model. The last row shows how the transformer-based networks (first three columns) use substantially fewer computational resources for training than CNN-based networks (last two columns).

Transformers in computer vision is still an active area of research. At the time of this writing (November of 2021), the recently-published Swin Transformer architecture (Liu et al., 2021) used a shifted windows approach (figure below, from the paper) to achieve state-of-the-art results in image classification, object detection, and image segmentation. The shifted window architecture allows a transformer network to cope with the “…large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.”
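A sketch of how I read the shifted window idea, not the paper's implementation. Attention is computed only among patches in the same window. Shifting the window grid by half a window in alternate layers lets patches that were separated by a window border attend to each other. The function name and the grid sizes are made up.

```python
def window_ids(size, window, shift=0):
    """Map each (row, col) patch of a size x size grid to a window identifier."""
    offset = (window - shift) % window
    return {
        (r, c): ((r + offset) // window, (c + offset) // window)
        for r in range(size)
        for c in range(size)
    }


regular = window_ids(size=8, window=4)
shifted = window_ids(size=8, window=4, shift=2)

# Neighboring patches (3, 3) and (4, 4) sit on opposite sides of a border
# in the regular grid, but share a window in the shifted grid.
print(regular[(3, 3)] == regular[(4, 4)])  # False
print(shifted[(3, 3)] == shifted[(4, 4)])  # True
```

Because attention stays inside each window, the cost grows with the number of windows instead of with the square of the number of patches.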

Transformers in healthcare

Applications of Transformers in healthcare fall, in general terms, into the following categories:

Natural language processing (NLP): extract information from medical records to make predictions.
Genomics and proteomics: process the large sequences found in genetic and proteomic data.
Computer vision: image classification, segmentation, augmentation, and generation.

The following sections describe some of these applications. Note from the dates of the references that this is a recent and active area of research. Many of the current applications of CNNs and RNNs in the same areas have evolved over the years until they reached their current performance. It is expected that these early (and promising) applications of transformers will improve over time as research continues.

NLP applications

The healthcare industry has been accumulating written records for many years. There is a wealth of information stored in these records from consultation notes, lab exam summaries, and radiologists’ reports. Most of them are already stored in electronic health records (EHR), ready to be consumed by computers. Transformers’ success with NLP makes them a good fit to process EHR. Some of the applications include:

BEHRT (Li et al., 2020), as the name indicates, was inspired by BERT (Devlin et al., 2019). Trained on medical records, BEHRT can predict 301 diseases in a future visit of a patient. It improved the state-of-the-art in this task by “8.0–13.2% (in terms of average precision scores for different tasks)”. In addition to the improvements in prediction, the attention mechanism has the potential to make the model more interpretable, an important feature for healthcare applications.
(Kodialam et al., 2020) introduces SARD (self-attention with reverse distillation), where the input to the model is not the raw text from medical records but a summary of a medical visit. While BEHRT can handle 301 conditions, SARD can handle “…a much larger set of 37,004 codes, spanning conditions, medications, procedures, and physician specialty.”

Genomics and proteomics applications

Transformers’ ability to process sequences makes them natural candidates for genomics and proteomics applications, where large, complex sequences abound.

AlphaFold2 (Jumper et al., 2021) is an evolution of the first AlphaFold. It decisively won the 14th Critical Assessment of Structure Prediction (CASP), a competition to predict the structure (“folds”) of proteins. Understanding the structure of proteins is important because the function of a protein is directly related to its structure. Given that the structure of a protein is determined by its amino acid sequence, it is not surprising to learn that one of the most important changes in AlphaFold2 was the addition of attention via transformers (Rubiera, 2021). AlphaFold2’s transformer architecture has been named EvoFormer.
(Avsec et al., 2021) applied transformers to gene expression. They named the architecture Enformer (“a portmanteau of enhancer and transformer”). Gene expression is a fundamental building block in biology. It is “the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, protein or non-coding RNA, and ultimately affect a phenotype, as the final effect.” (Wikipedia, 2021).

With these applications in mind, the figure below (Avsec, 2021) illustrates why the ability to process larger sequences makes transformers an effective architecture for genomics and proteomics applications. The dark blue area shows how far the Enformer architecture can look for interactions between DNA base pairs (200,000), compared with the previous state-of-the-art Basenji2 architecture (40,000 base pairs).

Computer vision applications

Transformers are improving the following areas of healthcare computer vision:

Label generation: extract accurate labels from medical records to train image classification networks.
Large image analysis: process the large images generated in some medical areas.
Improvements to explainability: produce explanations that are easier to interpret for medical professionals.

The following sections expand on those areas.

Label generation

Medical image applications that identify diseases and other features in images are trained with supervised or semi-supervised learning, which means they need many images with accurate labels. Labeling medical images requires experts that are few, expensive, or both.

On the other hand, there are many images with accompanying medical reports, for example, the radiological reports from x-rays. An application capable of reliably extracting labels from the reports can boost the number of images in medical image datasets. However, medical reports are created by human experts for other human experts. The reports contain complex sentences that record not only the expert’s certainty about findings but also other potential findings and exclusions. Telling apart positive, potential, and negated (excluded) findings is a complex task.

CheXpert (Irvin et al., 2019) made available over 100,000 chest x-ray images with labels extracted from the medical reports using a rule-based NLP parser. The same team later developed CheXbert (Smit et al., 2020) based on (as the name implies) BERT (Devlin et al., 2019). CheXbert performed better than the CheXpert labeler and, crucially, it performed better in uncertainty (“potential”, “unremarkable”, and similar words) and negation cases, which are notoriously difficult to analyze.
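To illustrate why rules struggle with this task, here is a toy rule-based labeler. The cue phrases and the function are hypothetical, written for this example. They are not the rules used by the CheXpert labeler.

```python
# Hypothetical cue phrases, chosen for illustration only.
NEGATION_CUES = ("no ", "without ", "negative for ")
UNCERTAINTY_CUES = ("possible ", "may represent ", "cannot exclude ")


def label_finding(sentence, finding):
    """Label one finding in one sentence: positive, uncertain, negative, or not mentioned."""
    text = sentence.lower()
    if finding not in text:
        return "not mentioned"
    before = text[: text.index(finding)]  # only look at the words before the finding
    if any(cue in before for cue in UNCERTAINTY_CUES):
        return "uncertain"
    if any(cue in before for cue in NEGATION_CUES):
        return "negative"
    return "positive"


print(label_finding("No pneumothorax is seen.", "pneumothorax"))         # negative
print(label_finding("Possible pneumonia on the left.", "pneumonia"))     # uncertain
print(label_finding("Small pleural effusion.", "pleural effusion"))      # positive
print(label_finding("Pneumothorax is not seen.", "pneumothorax"))        # positive (wrong)
```

The last example shows the weakness of this toy rule set: the negation comes after the finding, so the rules miss it. Each new phrasing needs a new rule, while a model that reads the whole sentence in context can learn these patterns from examples.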

These results indicate that transformer-based labeling extraction can improve the labels of existing datasets and help create more trustworthy labeled medical images, which are necessary to advance research in healthcare computer vision.

Large image analysis

Some medical diagnosis images, such as those used in histopathology, are large, in the hundreds of megabytes to the gigapixel range. Traditional neural networks cannot handle such images in one piece. Before transformers, a common solution was to split the image into multiple patches and process them separately with a CNN-based network (Komura & Ishikawa, 2018). Dividing an image into arbitrary patches may lose context information about the overall image structure and features.

Holistic Attention Network – HATNet (Mehta et al., 2020) is a transformer-based architecture that takes a different approach, borrowing concepts from NLP. Instead of analyzing each patch separately, it considers each patch a “word” and combines them into bags of words. The bags of words are then processed by a transformer network that aggregates information from the different patches into a global image representation. HATNet is “8% more accurate and about 2× faster than the previous best network”.

More important than the immediate results of HATNet is the innovative approach that opens up the door to more research into processing large medical images. For example, TransUNet (Chen et al., 2021) takes a similar approach for medical image segmentation. As in image classification, CNNs have been traditionally applied to medical image segmentation. Using CNNs for segmentation has a problem similar to the one in classification: the CNNs lose global context. TransUNet resolves that problem with a hybrid architecture: a CNN is used to extract features from the large-dimensional images, which are then passed to a transformer network. It improved the state of the art on the Synapse multi-organ CT segmentation dataset by several percentage points.

Improvements in interpretability

In high-stakes applications, such as healthcare, interpretable results help improve “auditability, system verification, enhance trust, and user adoption” (Reyes et al., 2020). Specifically for medical images, interpretability is related to explaining what pieces of an image the model considered for inference.

Although still a new field, the interpretability of image classification with transformers shows early signs that it can result in more precise, and thus more helpful, interpretations of what a model is “looking” at. In the figure below, from (Chefer et al., 2021), the rightmost column shows their new method to extract interpretability from a transformer multi-class image classification task. It generates class-specific visualizations with better-defined activations. The closest alternative method is Grad-CAM (Selvaraju et al., 2020) (other methods cannot even generate class-specific visualizations), but it has significantly more extraneous artifacts in the visualization.

The transformer’s attention map also shows promising results for interpretability. In the figure below, from (Matsoukas et al., 2021), the top row shows the original images: a dermoscopic image (left), an eye fundus image (center), and a mammogram (right). The middle row is a Grad-CAM saliency map, traditionally used to interpret the classification from CNNs. The bottom row is a saliency map from a transformer attention layer. The attention layer saliency shows a more well-defined saliency area, making the results easier to interpret (although the paper notes that this assumption has to be tested with medical professionals).

Conclusions

Transformers were first used in NLP applications, resulting in impressive language models like BERT, GPT-2, and GPT-3. Their ability to learn the association between pieces of a large sequence of data (attention) is now being used in computer vision. The resulting models are faster to train and more accurate than CNNs for image classification.

From the literature references, we notice that applying transformers to computer vision is still a new area. CNN- and RNN-based solutions evolved over many years of research. We should expect transformers also to evolve. In fact, several approaches are already being tried to create more efficient transformer architectures by, for example, reducing the quadratic complexity of the attention mechanism (May, 2020), (Tay et al., 2020), (Choromanski & Colwell, 2020).

Efficient transformer architectures will have two effects. On the one hand, larger and larger sequences will be handled, improving the results in applications where the size of the sequence is critical for the results (for example, large resolution images used in healthcare). On the other hand, for the same sequence length, it will become faster, and thus cheaper, to train transformers, democratizing their use.

And, as a final benefit, we may end up with one unified network architecture that can be applied to two important fields: natural language processing and computer vision.

Appendix A - A reading list for RNN, LSTM, attention, and transformers in NLP

While researching this paper, I started with the original application of the networks, natural language processing (NLP). After researching the applications for image processing, it became clear that starting with NLP was indeed a good choice. The concepts of sequence and attention are easier to illustrate and follow in that area. Once learned in that context, they can be transferred to computer vision.

This appendix is a reading list in the context of NLP to help other readers, and the author’s future self, who will (inevitably) have forgotten some of the concepts.

The seminal paper on encoder/decoder combined with RNN for natural language processing is (Cho et al., 2014). (Sutskever et al., 2014) introduced sequence-to-sequence using long short-term memory (LSTM) networks. (Bahdanau et al., 2014) and (Luong et al., 2015) are credited with developing the attention mechanism. (Vaswani et al., 2017) is the original paper on transformers (reading the accompanying Google’s blog post (Jakob, 2017) makes it easier to follow the paper).

The explanations of RNN and LSTM in this paper are simplified because I wanted to focus on transformers. I did not discuss the different types of RNNs and the inner workings of the LSTM cell. For a step-by-step, illustrated explanation of how LSTMs work and why they are an effective RNN architecture, see (Olah, 2015). For other RNN architectures, see (Olah & Carter, 2016).

(Alammar, 2018a) describes step-by-step, with the help of animated visualizations, the sequence-to-sequence, encoder/decoder, RNN, and attention concepts, including details of how they work. (Alammar, 2018b) builds on that to explain how transformers use the important concept of self-attention, with detailed illustrations.

Finally, as a historical note: the original paper on recurrent neural networks (RNNs) proved elusive. Like many ideas, it evolved over time. (Rumelhart et al., 1987) is credited in several places as the first mention and description of a “recurrent network”, although it did not describe the back-propagation through time (BPTT) method used to train RNNs nowadays.

Appendix B - The quadratic bottleneck

As a general rule, the longer the sequence a transformer can process, the better its results. However, longer sequences come at the cost of large amounts of memory and processing power required for training and inference. The cost of the self-attention mechanism in the standard transformer architecture grows quadratically with the sequence length (figure below, from (Tay et al., 2020)).
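For reference, the scaled dot-product attention from (Vaswani et al., 2017) shows where the quadratic term comes from. For a sequence of n tokens, Q and K each have n rows, so the product of Q and the transpose of K has n × n entries, one score for every pair of tokens:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K \in \mathbb{R}^{n \times d_k},
\quad Q K^{\top} \in \mathbb{R}^{n \times n}
```

Doubling the sequence length therefore quadruples the number of scores to compute and store.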

Several approaches are being tried to reduce the quadratic complexity, creating more efficient transformer architectures (May, 2020), (Tay et al., 2020), (Choromanski & Colwell, 2020).

References

Alammar, J. (2018a, May 9) Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
Alammar, J. (2018b, June 27) The Illustrated Transformer
Avsec, Ž. (2021, October 4) Predicting gene expression with AI. Deepmind
Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J. R., Grabska-Barwinska, A., Taylor, K. R., Assael, Y., Jumper, J., Kohli, P., & Kelley, D. R. (2021) Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10), 1196–1203
Ba, J., Mnih, V., & Kavukcuoglu, K. (2015) Multiple Object Recognition with Visual Attention
Bahdanau, D., Cho, K., & Bengio, Y. (2014) Neural Machine Translation by Jointly Learning to Align and Translate
Bengio, Y., Simard, P., & Frasconi, P. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020) Language Models are Few-Shot Learners
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020) End-to-End Object Detection with Transformers
Chefer, H., Gur, S., & Wolf, L. (2021) Transformer Interpretability Beyond Attention Visualization
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., & Zhou, Y. (2021) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
Cheng, J., Dong, L., & Lapata, M. (2016) Long Short-Term Memory-Networks for Machine Reading
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Choromanski, K., & Colwell, L. (2020, October 23) Rethinking Attention with Performers
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999) Learning to forget: Continual prediction with LSTM. 1999 Ninth International Conference on Artificial Neural Networks ICANN 99. (Conf. Publ. No. 470), 2, 850–855 vol.2
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017) LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232
Hochreiter, S., & Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780
Ian, G., Yoshua, B., & Aaron, C. (2016) Deep Learning
Jakob, U. (2017) Transformer: A Novel Neural Network Architecture for Language Understanding
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589
Kodialam, R. S., Boiarsky, R., Lim, J., Dixit, N., Sai, A., & Sontag, D. (2020) Deep Contextual Clinical Prediction with Reverse Distillation
Komura, D., & Ishikawa, S. (2018) Machine Learning Methods for Histopathological Image Analysis. Computational and Structural Biotechnology Journal, 16, 34–42
Li, Y., Rao, S., Solares, J. R. A., Hassaine, A., Ramakrishnan, R., Canoy, D., Zhu, Y., Rahimi, K., & Salimi-Khorshidi, G. (2020) BEHRT: Transformer for Electronic Health Records. Scientific Reports, 10(1), 7155
Lin, Z., Feng, M., Santos, C. N. dos, Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017) A Structured Self-attentive Sentence Embedding
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Luong, M.-T., Pham, H., & Manning, C. D. (2015) Effective Approaches to Attention-based Neural Machine Translation
Matsoukas, C., Haslum, J. F., Söderberg, M., & Smith, K. (2021) Is it Time to Replace CNNs with Transformers for Medical Images?
May, M. (2020, March 14) A Survey of Long-Term Context in Transformers. Machine Learning Musings
Mehta, S., Lu, X., Weaver, D., Elmore, J. G., Hajishirzi, H., & Shapiro, L. (2020) HATNet: An End-to-End Holistic Attention Network for Diagnosis of Breast Biopsy Images
Mnih, V., Heess, N., Graves, A., & Kavukcuoglu, K. (2014) Recurrent Models of Visual Attention. Advances in Neural Information Processing Systems, 27
Olah, C. (2015, August 27) Understanding LSTM Networks. Colah’s Blog
Olah, C., & Carter, S. (2016) Attention and Augmented Recurrent Neural Networks. Distill, 1(9), e1
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., & Tran, D. (2018) Image Transformer
Pascanu, R., Mikolov, T., & Bengio, Y. (2013) On the difficulty of training Recurrent Neural Networks
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019) Language Models are Unsupervised Multitask Learners
Reyes, M., Meier, R., Pereira, S., Silva, C. A., Dahlweid, F.-M., Tengg-Kobligk, H. von, Summers, R. M., & Wiest, R. (2020) On the Interpretability of Artificial Intelligence in Radiology: Challenges and Opportunities. Radiology: Artificial Intelligence, 2(3), e190043
Rubiera, C. O. (2021) AlphaFold 2 is here: What’s behind the structure prediction miracle - Oxford Protein Informatics Group. Oxford Protein Informatics Group
Rumelhart, D. E., Hinton, G., & Williams, R. (1987) Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations (pp. 318–362). MIT Press
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020) Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2), 336–359
Sutskever, I., Vinyals, O., & Le, Q. V. (2014) Sequence to Sequence Learning with Neural Networks
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020) Efficient Transformers: A Survey
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017) Attention Is All You Need
Wikipedia. (2021) Gene expression
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2016) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2021) Big Bird: Transformers for Longer Sequences

Machine learning interpretability with feature attribution

2021-04-26T00:00:00-04:00

There are many discussions in the machine learning (ML) community about model interpretability and explainability. The discussions take place in several contexts, ranging from using interpretability and explainability techniques to increase the robustness of a model, all the way to increasing end-user trust in a model.

This article reviews feature attribution, a technique to interpret model predictions. First, it reviews commonly used feature attribution methods, then demonstrates feature attribution with SHAP, one of these methods.

Feature attribution methods “indicate how much each feature in your model contributed to the predictions for each given instance.” They work with tabular data, text, and images. The following pictures show an example for each case.

An example of feature attribution for text (from Explainable AI: A Review of Machine Learning Interpretability Methods):

An example of feature attribution for tabular data (from SHAP tutorial - official documentation):

An example of feature attribution for a model that identifies a cat in a picture (from LIME’s GitHub):

What feature attributions are used for

The prominent use cases for feature attribution are:

Debug models: verify that models make predictions for the right reasons. For example, in the first picture below, a model predicts diseases in X-ray images based on the metal tags the X-ray technicians place on patients, not the actual disease marks (an example of spurious correlation).
Audit models: verify that models are not looking at attributes that encode bias (gender, race, among others) when making decisions. For example, in the second picture below, the middle column shows a gender-biased model that predicts professions by looking at the face in the image. The rightmost column shows where a debiased model looks to make predictions.
Optimize models: simplify correlated features and remove features that do not contribute to predictions.

The figure below (source) is an example of feature attribution to debug a model (verify what the model uses to predict diseases). In this case, the model is looking at the wrong place to make predictions (using the X-ray markers instead of the pathology).

The figure below (source) is an example of feature attribution to audit a model. The middle column shows how the model predicts all women as “nurse”, never as “doctor” – an example of gender bias. The rightmost column shows a corrected model.

Where feature attribution is in relation to other interpretability methods

Explainability fact sheets defines the following explanation families (borrowed from Explanation facilities and interactive systems):

Association between antecedent and consequent: “model internals such as its parameters, feature(s)-prediction relations such as explanations based on feature attribution or importance and item(s)-prediction relations, such as influential training instances”.
Contrast and differences: “prototypes and criticisms (similarities and dissimilarities) and class-contrastive counterfactual statements”.
Causal mechanism: “a full causal model”.

Feature attribution is part of the first family, the association between antecedent and consequent.

Using the framework in the taxonomy of interpretable models, we can further narrow down feature attribution methods as:

Post-hoc: They are usually used after the model is trained and usually with black-box models. Therefore, we are interpreting the results of the model, not the model itself (creating interpretable models is yet another area of research). The typical application for feature attribution is to interpret the predictions of black-box models, such as deep neural networks (DNNs) and random forests, which are too complex to be interpreted directly.
Result of the interpretation method: They result in feature summary statistics (and visualization - most summary statistics can be visualized in one way or another).
Model-agnostic or model-specific: Shapley-value-based feature attribution methods can be used with different model architectures - they are model agnostic. Gradient-based feature attribution methods are based on gradients; therefore, they can be used only with models trained with gradient descent (neural networks, logistic regression, support vector machines, for example) - they are model specific.
Local: They explain an individual prediction of the model, not the entire model (that would be “global” interpretability).

Putting it all together, feature attribution methods are post-hoc, local interpretation methods. They can be model-agnostic (e.g., SHAP) or model-specific (e.g., Grad-CAM).

Limitations and traps of feature attribution

Feature attributions are approximations

In their typical application, explanations have a fundamental limitation when applied to black-box models: they are approximations of how the model behaves.

“[Explanations] cannot have perfect fidelity with respect to the original model. If the explanation was completely faithful to what the original model computes, the explanation would equal the original model, and one would not need the original model in the first place, only the explanation.”

Cynthia Rudin — Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

More succinctly:

“Explanations must be wrong.”

Cynthia Rudin — Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

As we explore the feature attributions, we must keep in mind that we are analyzing two items at the same time:

What the model predicted.
How feature attribution approximates what the model considers to make the prediction.

Therefore, never mistake the explanation for the actual behavior of the model. This is a critical conceptual limitation to keep in mind.

Because the explanations are approximations, they may disagree with each other. For example, in the figure below, LIME (left) and SHAP (right) disagree not only in the magnitude of features’ contributions but also in the direction (sign). This disagreement is more common than we may think. Refer to the excellent paper The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective for more details and how practitioners deal with this issue (the figure comes from the paper).

Feature attribution may not make sense

Feature attributions do not have any understanding of the model they are explaining. They simply explain what the model predicts, not caring if the prediction is right or wrong.

Therefore, never confuse “explaining” with “understanding”.

Feature attributions are sensitive to the baseline

Another conceptual limitation is the choice of a baseline. The attributions are not absolute values. They are the contributions compared to a baseline. To better understand why baselines are important, see how Shapley values are calculated in the Shapley values section, then the section on baselines right after it.

Feature attributions are slow to calculate

Moving on to practical limitations, an important one is performance. Calculating feature attributions for large images is time-consuming.

When used to help explain the predictions of a model to end-users, consider that it may make the user interface look unresponsive. You may have to compute the attributions offline or, at a minimum, indicate to the user that there is a task in progress and how long it will take.

User interactions are complex

The attributions we get from the feature attribution algorithms are just numbers. To make sense of them, we need to apply visualization techniques.

For example, simply overlaying the raw attribution values on an image may leave out important pixels that contributed to the prediction, as illustrated in figure 2 of this paper. Compare the number of pixels highlighted in the top-right picture with the one below it, adjusted to show more contributing pixels.
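One way to picture the adjustment is to clip the attribution magnitudes at a percentile before scaling them for display. The sketch below is only an illustration under that assumption, not the method used in the paper. The function name `normalize_for_display`, the `percentile` parameter, and the sample values are all hypothetical.

```python
def normalize_for_display(attributions, percentile=90):
    """Scale attributions to [-1, 1], clipping magnitudes above a percentile.

    Clipping keeps a few very large values from pushing every other
    attribution close to zero (and therefore out of sight) in the overlay.
    """
    magnitudes = sorted(abs(a) for a in attributions)
    # Index of the magnitude below which roughly `percentile`% of values fall.
    index = min(len(magnitudes) - 1, len(magnitudes) * percentile // 100)
    ceiling = magnitudes[index] or 1.0  # avoid dividing by zero
    return [max(-1.0, min(1.0, a / ceiling)) for a in attributions]


# Hypothetical attributions: one dominant pixel and several small ones.
raw = [0.01, 0.02, -0.015, 0.03, 5.0]
largest = max(abs(a) for a in raw)
scaled_by_max = [a / largest for a in raw]  # small values almost vanish
clipped = normalize_for_display(raw, percentile=60)
print(scaled_by_max)
print(clipped)
```

Scaling by the maximum leaves the four small attributions below 1% of the color range, while the clipped version spreads them over the visible range. The price is that the dominant pixel is no longer distinguishable from the next largest one, which is itself a visualization choice to make consciously.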

Showing all information at once to the users may also induce them to make more mistakes. For example, this paper found that showing the feature attributions overlaid on a medical image increased overdiagnosis of a medical condition. It points to the fact that just because we can explain something, we shouldn’t necessarily put that explanation in front of users without considering how it will change their behavior.

Well-known feature attribution methods

The following table was compiled from the article A Visual History of Interpretation for Image Recognition and the paper Explainable AI: A Review of Machine Learning Interpretability Methods.

Each row has an explanation method, when it was introduced, a link to the paper that introduced it, and an example of how the method attributes features. The entries are in chronological order.

Method	Introduced	Introductory paper
CAM (class activation maps)	2015-12	Learning Deep Features for Discriminative Localization
LIME (local interpretable model-agnostic explanations)	2016-08	“Why Should I Trust You?”: Explaining the Predictions of Any Classifier
Grad-CAM	2016-10	Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Integrated gradients	2017-03	Axiomatic Attribution for Deep Networks
DeepLIFT (Deep Learning Important FeaTures)	2017-04	Learning Important Features Through Propagating Activation Differences
SHAP (SHapley Additive exPlanations)	2017-05	A Unified Approach to Interpreting Model Predictions
SmoothGrad	2017-06	SmoothGrad: removing noise by adding noise
Anchors	2018	Anchors: High Precision Model-Agnostic Explanations
CEM (contrastive explanations method)	2018-02	Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives
This looks like that	2018-06	This Looks Like That: Deep Learning for Interpretable Image Recognition
XRAI	2019-06	XRAI: Better Attributions Through Regions
Contrastive Explanations	2021-09	Contrastive Explanations for Model Interpretability

A feature attribution example with SHAP

SHAP (SHapley Additive exPlanations) was introduced in the paper A Unified Approach to Interpreting Model Predictions. As the title indicates, SHAP unifies LIME, Shapley sampling values, DeepLIFT, QII, layer-wise relevance propagation, Shapley regression values, and tree interpreter.

Because of SHAP’s claim to unify several methods, in this section we review how it works. It starts with an example of SHAP for image classification, then explains the theory behind it. For a more detailed review of SHAP, including code, please see this article.

Example with MNIST

The code for the examples described in this section is available on this GitHub repository.

The following figure shows the SHAP feature attributions for a convolutional neural network that classifies digits from the MNIST dataset.

The leftmost digit is the sample from the MNIST dataset. The text at the top shows the actual label from the dataset (8) and the label the network predicted (also 8, thus a correct prediction). The next ten digits are the SHAP feature attributions for each class (the digits zero to nine, from left to right). At the top of each class we see the probability assigned by the network. In this case, the network gave the probability 99.54% to the digit 8, so it’s correct and very confident about the prediction.

SHAP uses colors to explain attributions:

Red pixels increase the probability of a class being predicted
Blue pixels decrease the probability of a class being predicted

We can see that the contours of the digit 8 are assigned high probability. We can also see that the empty space inside the top loop is relevant to detecting a digit 8. The empty spaces to the left and right of the middle, where the bottom and top halves of the digit meet, are also important. In other words, it’s not only what is present that is important to decide what digit an image is, but also what is absent.

Looking at digits 2 and 3, we can see in blue the reasons why the network assigned lower probabilities to them.

Shapley values

SHAP uses an approximation of Shapley value for feature attribution. The Shapley value determines the contribution of individuals in interactions that involve multiple participants.

For example (based on this article), a company has three employees, Anne, Bob, and Charlie. The company has ended a month with a profit of 100 (the monetary unit is not essential). The company wants to distribute the profit to the employees according to their contribution.

We have so far two pieces of information, the profit when the company had no employee (zero) and the profit with all three employees on board.

Employees	Profit
None	0
Anne, Bob, Charlie	100

Going through historical records, the company determined the profit when different combinations of employees were working in the past months. They are added to the table below, between the two lines of the previous table.

	Employees	Profit
1	None	0
2	Anne	10
3	Bob	20
4	Charlie	30
5	Anne, Bob	60
6	Bob, Charlie	70
7	Anne, Charlie	90
8	Anne, Bob, Charlie	100

At first glance, it looks like Bob contributes 50 to the profit: in line 2 we see that Anne contributes 10 to the profit and in line 5 the profit of Anne and Bob together is 60. The conclusion would be that Bob contributed 50. However, when we look at line 4 (only Charlie) and line 6 (Bob and Charlie), we now conclude that Bob contributes 40 to the profit, contradicting the first conclusion.

Which one is correct? Both. We are interested in each employee’s contribution when they are working together. This is a collaborative game.

To understand the individual contributions, we start by analyzing all possible paths from “no employee” to “all three employees”.

Path	Combination to get to all employees
1	Anne → Anne, Bob → Anne, Bob, Charlie
2	Anne → Anne, Charlie → Anne, Bob, Charlie
3	Bob → Anne, Bob → Anne, Bob, Charlie
4	Bob → Bob, Charlie → Anne, Bob, Charlie
5	Charlie → Anne, Charlie → Anne, Bob, Charlie
6	Charlie → Bob, Charlie → Anne, Bob, Charlie

We then calculate each employee’s contribution in that path (this part is important). For example, in the first path, Anne contributes 10 (line 2 in the profit table), Bob contributes 50 (line 5, minus Anne’s contribution of 10), and Charlie contributes 40 (line 8 in the profit table, minus line 5). The total contribution must add to the total profit (this part is also important): Anne = 10 + Bob = 50 + Charlie = 40 → 100.

Repeating the process above, we calculate each employee’s contribution for each path. Finally, we average the contributions — this is the Shapley value for each employee (last line in the table).

Path	Combination to get to all employees	Anne	Bob	Charlie
1	Anne → Anne, Bob → Anne, Bob, Charlie	10	50	40
2	Anne → Anne, Charlie → Anne, Bob, Charlie	10	10	80
3	Bob → Anne, Bob → Anne, Bob, Charlie	40	20	40
4	Bob → Bob, Charlie → Anne, Bob, Charlie	30	20	50
5	Charlie → Anne, Charlie → Anne, Bob, Charlie	60	10	30
6	Charlie → Bob, Charlie → Anne, Bob, Charlie	30	40	30
	Average (Shapley value)	30	25	45
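The path-by-path calculation above can be sketched in a few lines of Python. This is a minimal illustration of the exact Shapley value for the three-employee example, not how SHAP computes its values. The names `PROFIT` and `shapley_values` are ours, and the profits are copied from the profit table.

```python
from itertools import permutations

# Profit for each combination of employees, copied from the profit table.
PROFIT = {
    frozenset(): 0,
    frozenset({"Anne"}): 10,
    frozenset({"Bob"}): 20,
    frozenset({"Charlie"}): 30,
    frozenset({"Anne", "Bob"}): 60,
    frozenset({"Bob", "Charlie"}): 70,
    frozenset({"Anne", "Charlie"}): 90,
    frozenset({"Anne", "Bob", "Charlie"}): 100,
}


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {player: 0.0 for player in players}
    orders = list(permutations(players))  # the six paths in the table
    for order in orders:
        team = frozenset()
        for player in order:
            joined = team | {player}
            # Contribution in this path: profit after joining minus before.
            totals[player] += value[joined] - value[team]
            team = joined
    return {player: total / len(orders) for player, total in totals.items()}


values = shapley_values(["Anne", "Bob", "Charlie"], PROFIT)
print(values)  # {'Anne': 30.0, 'Bob': 25.0, 'Charlie': 45.0}
```

The three values add up to the total profit of 100, the property that the text flags as important.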

In this example we managed to calculate each individual’s contribution for all possible paths in a reasonable time. In machine learning, the “individuals” are the features in the dataset. There may be thousands or even millions of features in a dataset. For example, in image classification, each pixel in the image is a feature.

SHAP uses a similar method to explain the contribution of features to a model’s prediction. However, calculating the contribution of each feature is not feasible in some cases (e.g. images and their millions of pixels). The combination of paths to try is exponential (factorial, to be precise). SHAP makes simplifications to calculate the features’ contributions. It is crucial to remember that SHAP is an approximation, not the actual contribution value.
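To get a feeling for how fast the number of paths grows, the short sketch below counts the orderings (N!) and the subsets (2^N) for a few feature counts. The function name is ours. The last case, 784, is the number of pixels in a 28 x 28 MNIST image. The numbers are so large that we print only how many digits they have.

```python
import math


def count_orderings_and_subsets(n_features):
    """Return (number of orderings, number of subsets) for n_features."""
    return math.factorial(n_features), 2 ** n_features


for n in (3, 10, 784):  # 784 = 28 x 28 pixels in one MNIST image
    orderings, subsets = count_orderings_and_subsets(n)
    print(
        f"{n} features: orderings has {len(str(orderings))} digits, "
        f"subsets has {len(str(subsets))} digits"
    )
```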

The importance of the baseline

In the example above, we asked “what is each employee’s contribution to the profit?”. Our baseline was the company with zero employees and no profit.

We could have asked a different question: “what is the contribution of Bob and Charlie, given that Anne is already an employee?”. In this case, our baseline is 10, the profit that Anne adds to the company by herself. Only paths 1 and 2 would apply, with the corresponding changes to the average contribution.
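A small sketch makes the effect of the baseline concrete. It repeats the averaging from the employee example, but starts every path from a given set of employees instead of from the empty company. The function `contributions` and its parameters are hypothetical names used only for this illustration.

```python
from itertools import permutations

# Profit for each combination of employees, copied from the profit table.
PROFIT = {
    frozenset(): 0,
    frozenset({"Anne"}): 10,
    frozenset({"Bob"}): 20,
    frozenset({"Charlie"}): 30,
    frozenset({"Anne", "Bob"}): 60,
    frozenset({"Bob", "Charlie"}): 70,
    frozenset({"Anne", "Charlie"}): 90,
    frozenset({"Anne", "Bob", "Charlie"}): 100,
}


def contributions(new_players, baseline_team, value):
    """Average marginal contributions of new_players, starting from a baseline."""
    start = frozenset(baseline_team)
    totals = {player: 0.0 for player in new_players}
    orders = list(permutations(new_players))
    for order in orders:
        team = start
        for player in order:
            joined = team | {player}
            totals[player] += value[joined] - value[team]
            team = joined
    return {player: total / len(orders) for player, total in totals.items()}


# Baseline: empty company (the original question).
from_zero = contributions(["Anne", "Bob", "Charlie"], [], PROFIT)
# Baseline: Anne is already an employee (only paths 1 and 2 apply).
from_anne = contributions(["Bob", "Charlie"], ["Anne"], PROFIT)
print(from_zero)
print(from_anne)
```

With the empty company as the baseline, Bob and Charlie get 25 and 45. With Anne as the baseline, they get 30 and 60, which add up to 90, the profit above Anne’s 10. Same employees, same profit table, different attributions.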

SHAP (and other feature attribution methods) calculate the feature contribution compared to a baseline. For example, in feature attribution for image classification, the baseline is an image or a set of images.

The choice of the baseline affects the calculations. Visualizing the Impact of Feature Attribution Baselines discussed the problem and its effect on feature attribution.

Appendix - interpretability vs. explainability

Ajay Thampi’s Interpretable AI book distinguishes between interpretability and explainability this way:

Interpretability: “It is the degree to which we can consistently estimate what a model will predict given an input, understand how the model came up with the prediction, understand how the prediction changes with changes in the input or algorithmic parameters and finally understand when the model has made a mistake. Interpretability is mostly discernible by experts who are either building, deploying or using the AI system and these techniques are building blocks that will help you get to explainability.”
Explainability: “[G]oes beyond interpretability in that it helps us understand in a human-readable form how and why a model came up with a prediction. It explains the internal mechanics of the system in human terms with the intent to reach a much wider audience. Explainability requires interpretability as building blocks and also looks to other fields and areas such as Human-Computer Interaction (HCI), law and ethics.”

Other sources treat interpretability and explainability as equivalent terms (for example, Miller’s work and Molnar’s online book on the topic).

This article uses “interpretability” as defined in Ajay Thampi’s book. We distinguish between interpretability and explainability so that we can leave out the aspects of displaying the interpretation of a model’s prediction to end-users, which would bring topics such as user interface and user interaction into the discussion. While important for the overall discussion of ML interpretability and explainability, these topics are outside the scope of this work. However, we preserve the original term when quoting a source. If the source chose “explainability”, we quote it so.

Therefore, when we discuss “interpretability” here, we mean the interpretation that is shown to a machine learning practitioner, someone familiar with model training and evaluation. We discuss interpretability in a more technical format with this definition in place, assuming that the consumer of the interpretability results has enough technical background to understand it.

An overview of deep learning for image processing

2021-04-26T00:00:00-04:00

Deep learning revolutionized image processing. It made previous techniques, based on manual feature extraction, obsolete. This article reviews the progress of deep learning, with ever-growing networks and the new developments in the field.

Deep learning is a sub-area of machine learning, which in turn is a sub-area of artificial intelligence (picture source).

The best way I found to explain deep learning is in contrast to traditional methods. Yann LeCun, one of the founders of deep learning, gave an informative talk on the evolution of learning techniques, starting with the traditional ones and ending with deep learning. He focuses on image recognition in that talk.

It is a worthwhile investment of one hour of our time to listen to someone who was not only present but actively driving the evolution of deep learning. The two pictures immediately below are from his speech.

Traditional image recognition vs. deep learning

In traditional image recognition, we use hand-crafted rules to extract features from an image (source).

In contrast, deep learning image recognition is done with trainable, multi-layer neural networks. Instead of hand-crafting the rules, we feed labeled images to the network. The neural network, through the training process, extracts the features needed to identify the images (source).

“Deep” comes from the fact that neural networks (in this application) use several layers. For example, LeNet-5, named after Yann LeCun (of the presentation above) and shown in the (historic) picture below (source), has seven layers.

What deep learning networks “learn”

Each layer “learns” (“extracts” is a better technical term) different aspects (“features” in the pictures above) of the images. Lower layers extract basic features (such as edges), and higher layers extract more complex concepts (that frankly, we don’t quite know how to explain yet).

The picture below (source) shows the features that each layer of a deep learning network extracts. On the left, we have the first layers of the network. They extract basic features, such as edges. As we move to the right, we see the upper layers of the network and the features they extract.

Unlike traditional image processing, a deep learning network is not manually configured to extract these features. It learns them through the training process.

The evolution of deep learning

Deep learning for image processing entered the mainstream in the late 1990s when convolutional neural networks were applied to image processing. After stalling a bit in the early 2000s, deep learning took off in the early 2010s. In a short span of a few years, bigger and bigger network architectures were developed. Over time, what “deep” meant was stretched even further.

The table below shows the evolution of deep learning network architectures.

When/What	Notable features	Canonical depiction
1990s LeNet	Trainable network for image recognition. - Gradient-based learning - Convolutional neural network
2012 AlexNet	One network outperformed, by a large margin, model ensembling (best in class at the time) in ImageNet. - Deep convolutional neural network - Overcame overfitting with data augmentation and dropout
2014 Inception (GoogLeNet)	Very deep network (over 20 layers), composed of building blocks, resulting in a “network in a network” (inception).	Partial depiction
2014 VGGNet	Stacks of small convolution filters (as opposed to one large filter) to reduce the number of parameters in the network.
2015 ResNet	Introduced skip connections (residual learning) to train very deep networks (152 layers). At the same time, the network is compact (few parameters for its size).	Partial depiction
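The VGGNet row in the table mentions that stacking small filters reduces the number of parameters. A back-of-the-envelope calculation shows why. The sketch below assumes that the input and output of each layer have the same number of channels and ignores biases. The function name and the channel count of 64 are hypothetical. Three stacked 3 x 3 layers cover the same 7 x 7 receptive field as one 7 x 7 layer.

```python
def conv_weights(kernel_size, channels, layers=1):
    """Weights in `layers` stacked square convolutions (biases ignored).

    Assumes every layer has `channels` input and `channels` output channels.
    """
    return layers * kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical channel count, for illustration only
stacked_small = conv_weights(3, channels, layers=3)  # 3 * (3 * 3) * C^2 = 27 C^2
single_large = conv_weights(7, channels)             # 7 * 7 * C^2 = 49 C^2
print(stacked_small, single_large)
```

Under these assumptions, the stack of small filters needs 27/49 of the weights of the single large filter, a little over half.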

Network architectures continue to evolve today. So many architectures have been put into practice that we now need a taxonomy to categorize them.

Recent trends

Efficiently scaling CNNs: There are different ways to scale CNN-based networks. The EfficientNet family of networks shows that we don’t always need large CNN networks to get good results.
Back to basics: The MLP-Mixer network does away with CNN layers altogether. It uses only simpler multi-layer perceptron (MLP) layers, resulting in networks with faster throughput, predicting more images per second than other network architectures.
Transformers: Transformer-based networks, after their success with natural language processing (NLP), are being applied to image processing.
Learning concepts: by training with images and their textual descriptions (multimodal learning), OpenAI created CLIP, a network that seems to have learned the concepts of images. Traditional image classification relied on extracting features from the images. It works well on images with the same characteristics but fails when they are different. For example, it identifies the picture of a banana but not the sketch of a banana. On the other hand, CLIP seems to have learned the concept of the images. It identifies pictures and sketches of bananas (see the illustration in the article).

Keeping up with new developments

Papers with Code maintains a leaderboard of the state of the art, including links to the papers that describe the network used to achieve each result.

Exploring SHAP explanations for image classification

2021-04-25T00:00:00-04:00

This article explores how to interpret predictions of an image classification neural network using SHAP (SHapley Additive exPlanations).

The goals of the experiments are to:

Explore how SHAP explains the predictions. This experiment uses a (fairly) accurate network to understand how SHAP attributes the predictions.
Explore how SHAP behaves with inaccurate predictions. This experiment uses a network with lower accuracy and prediction probabilities that are less robust (more spread among the classes) to understand how SHAP behaves when the predictions are not reliable (a hat tip to Dr. Rudin’s work).

Why use SHAP instead of another method?

This project is my first opportunity to delve into model interpretability at the code level. I picked SHAP (SHapley Additive exPlanations) to get started because of its promise to unify various methods (emphasis ours):

“…various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures. … The new class unifies six existing methods, …”

Overview of SHAP feature attribution for image classification

How SHAP works

SHAP is based on the Shapley value, a method to calculate the contributions of each player to the outcome of a game. See this article for a simple, illustrated example of how to calculate the Shapley value and this article by Samuele Mazzanti for a more detailed explanation.

The Shapley value is calculated with all possible combinations of players. Given N players, it has to calculate outcomes for 2^N combinations of players. In the case of machine learning, the “players” are the features (e.g. pixels in an image) and the “outcome of a game” is the model’s prediction. Calculating the contribution of each feature is not feasible for large values of N. For example, for images, N is the number of pixels.

Therefore, SHAP does not attempt to calculate the actual Shapley value. Instead, it uses sampling and approximations to calculate the SHAP value. See chapter 4 of the SHAP paper for details.
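To illustrate the general idea of sampling (and only the general idea, since SHAP’s explainers use their own, more elaborate approximations), the sketch below estimates Shapley values by averaging marginal contributions over randomly shuffled player orders. The toy game, its weights, the bonus, and the function names are all made up for this example.

```python
import random

WEIGHTS = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical payoff of each player alone
BONUS = 10  # hypothetical extra payoff when players 0 and 1 are both present


def toy_value(team):
    """Outcome of the toy game for a set of player indices."""
    total = sum(WEIGHTS[player] for player in team)
    if 0 in team and 1 in team:
        total += BONUS
    return total


def sampled_shapley(n_players, value_fn, n_samples, seed=0):
    """Estimate Shapley values from n_samples random join orders."""
    rng = random.Random(seed)
    totals = [0.0] * n_players
    for _ in range(n_samples):
        order = list(range(n_players))
        rng.shuffle(order)
        team = set()
        previous = value_fn(team)
        for player in order:
            team.add(player)
            current = value_fn(team)
            totals[player] += current - previous  # marginal contribution
            previous = current
    return [total / n_samples for total in totals]


# 2,000 sampled orders instead of all 8! = 40,320 possible orders.
estimates = sampled_shapley(len(WEIGHTS), toy_value, n_samples=2000)
print(estimates)
```

In this toy game the exact Shapley values are known: each player gets its own weight, and players 0 and 1 split the bonus evenly (6 and 7). The estimates for players 0 and 1 land close to those values, but not exactly on them, which is the trade-off of any sampling approach.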

Visualizing SHAP attributions

SHAP uses colors to explain attributions:

Red pixels increase the probability of a class being predicted
Blue pixels decrease the probability of a class being predicted

The following picture and text come from the SHAP README.

“The plot above explains ten outputs (digits 0-9) for four different images. Red pixels increase the model’s output while blue pixels decrease the output. The input images are shown on the left, and as nearly transparent grayscale backings behind each explanation. The sum of the SHAP values equals the difference between the expected model output (averaged over the background dataset) and the current model output. Note that for the ‘zero’ image the blank middle is important, while for the ‘four’ image the lack of a connection on top makes it a four instead of a nine.”

This is an essential part of the explanation: “Note that for the ‘zero’ image the blank middle is important, while for the ‘four’ image the lack of a connection on top makes it a four instead of a nine.” In other words, it’s not only what is present that is important to decide what digit an image is, but also what is absent.
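The README also states that the SHAP values add up to the difference between the expected model output and the current model output. We can check that property on a case simple enough to do by hand: a linear model with independent features, for which the SHAP paper gives a closed form (weight times the feature’s deviation from its background mean). The weights, bias, background data, and sample below are made up for illustration.

```python
# Hypothetical linear model: f(x) = bias + sum(w_i * x_i)
weights = [2.0, -1.0, 0.5]
bias = 0.25


def model(x):
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical background dataset and the sample we want to explain.
background = [[0.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 2.0, 4.0]]
sample = [3.0, 0.0, 1.0]

# Closed form for a linear model with independent features:
# attribution_i = w_i * (x_i - mean of feature i in the background).
feature_means = [sum(column) / len(column) for column in zip(*background)]
attributions = [
    w * (xi - mean) for w, xi, mean in zip(weights, sample, feature_means)
]

expected_output = sum(model(x) for x in background) / len(background)
print(attributions)
print(sum(attributions), model(sample) - expected_output)
```

Both numbers on the last line are 5.5: the attributions account for exactly the gap between this prediction and the average prediction over the background data. This is also where the baseline enters the picture, because a different background dataset changes the expected output and, with it, every attribution.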

Experiments

This Jupyter notebook shows how to use SHAP’s DeepExplainer to visualize feature attribution in image classification with neural networks. See the instructions to run the code for more details.

SHAP has multiple explainers. The notebook uses the DeepExplainer explainer because it is the one used in the image classification SHAP sample code.

The code is based on the SHAP MNIST example, available as a Jupyter notebook on GitHub. This notebook uses the PyTorch sample code because at this time (April 2021), SHAP does not support TensorFlow 2.0. This GitHub issue tracks the work to support TensorFlow 2.0 in SHAP.

The experiments are as follows:

Train a CNN to classify the MNIST dataset.
Show the feature attributions for a subset of the training set using SHAP DeepExplainer.
Review and annotate some of the attributions to better understand what they reveal about the model and the explanation itself.
Repeat the steps above with a CNN that is significantly less accurate.

An important caveat

“Explanations must be wrong.”

Cynthia Rudin — Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead

As we explore the feature attributions, we must keep in mind that we are analyzing two items at the same time:

What the model predicted.
How the feature attribution explainer approximates what the model considers to make the prediction.

The explainer approximates the model and sometimes (as in this case) also uses an approximation of the input. Therefore, some of the attributions that may not make much sense may result from these approximations, not necessarily the model’s behavior.

Therefore, never mistake the explanation for the actual behavior of the model. This is a critical conceptual limitation to keep in mind.

See more on this post about feature attribution.

Some results from the experiments

This section explores some of the feature attributions resulting from the experiments (see the notebook).

Before reading further: this is my first foray into the details of feature attribution with SHAP (or any other method). Some of the items reported below are questions I need to investigate further to understand better how feature attribution in general, and SHAP in particular, work.

Some candidates for research questions are noted in the explanations.

Accurate network

This section explores feature attribution using the (fairly) accurate network. This network achieves 97% overall accuracy.

Each picture below shows these pieces of information:

The leftmost digit is the example from the MNIST dataset that the network predicted. The text at the top of the picture shows the actual and predicted values. The predicted value is the largest of all probabilities (without applying a threshold).
Following that digit, there are ten digits, one for each class (from left to right: zero to nine), with the feature attributions overlaid on each digit. The text at the top shows the probability that the network assigned for that class.

Some of the feature attributions are easy to interpret. For example, this is the attribution for a digit “1”.

We can see that the presence of the vertical pixels at the center of the image increases the probability of predicting a digit “1”, as we would expect. The absence of pixels around that vertical line also increases the probability.

The two examples for the digit “8” below are also easy to interpret. We can see that the blank space in the top loop and the blank spaces on both sides of the middle part of the image are important to define an “8”.

In the two examples for the digit “2” below, on the other hand, the first one is easy to interpret, but the attributions for the second make less sense. While reviewing them, note that the scale for the SHAP values is different for each example. The range of values in the second example is an order of magnitude larger. It does not affect a comparative analysis but it may be important in other cases to note the scale before judging the attributions.

In the first example we can see which pixels are more relevant (red) to predict the digit “2”. We can also see which pixels were used to reduce the probability of predicting the digit “7” (blue), the second-highest predicted probability.

In the second picture, the more salient attributions are on the second-highest probability, the digit “7”. It’s almost as if the network “worked harder” to reject that digit than to predict the digit “2”. Although the probability of the digit “7” is higher in this second example (compared to the digit “7” in the first example), it’s still far away from the probability assigned to the digit “2”.

RESEARCH QUESTION 1: What causes SHAP sometimes to highlight the attributions of a class that was not assigned the highest probability?

Inaccurate network

This section explores feature attribution using the inaccurate network. This network achieves 87% overall accuracy. Besides the low overall accuracy, each prediction has a larger probability spread. In some cases, the difference between the largest and the second-largest probability is very small, as we will soon see.

In the example for the digit “0” below, the network incorrectly predicted it as “5”. But it didn’t miss by much. The difference in probability between “5” (incorrect) and “0” (correct) is barely 1%. Also, the two probabilities add up to 54%. In other words, the two top probabilities add up to about half of the total probability. The prediction for this example is not only wrong but uncertain across several classes (labels).
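The two numbers used above (the gap between the top two classes and their combined probability) are easy to compute for any prediction. This is a sketch with made-up probabilities chosen only to mimic the example; the function name is hypothetical.

```python
import numpy as np

def top_two_summary(probabilities):
    """Return (best class, runner-up class, margin between them, their combined probability)."""
    order = np.argsort(probabilities)[::-1]  # class indices, highest probability first
    first, second = int(order[0]), int(order[1])
    margin = float(probabilities[first] - probabilities[second])
    combined = float(probabilities[first] + probabilities[second])
    return first, second, margin, combined

# Hypothetical probabilities for classes zero to nine (they add up to 1).
probs = np.array([0.265, 0.02, 0.03, 0.05, 0.02, 0.275, 0.04, 0.03, 0.25, 0.02])

first, second, margin, combined = top_two_summary(probs)
print(first, second, round(margin, 2), round(combined, 2))  # 5 0 0.01 0.54
```

A small margin together with a low combined probability is a sign that the prediction is uncertain, whether or not it happens to be correct.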

SHAP still does what we ask: it shows the feature attributions for each class. For the three classes with the highest probability, we can see that:

Digit “0”: The empty middle is the important part, as we have seen in other cases for this digit.
Digit “8”: The top and bottom parts look like the top and bottom loops of the digit “8”, resulting in the red areas we see in the attribution. The empty middle is now a detractor for this class (blue). An actual digit “8” would have something here, where the bottom and top loops meet.
Digit “5”: I left this one for last because it has the highest probability (but not by much) and is also the hardest to explain. It is almost as if just a few pixels (in red) were enough to assign a probability higher than that of the correct digit “0”.

This example shows an important concept about explanations for black-box models: they explain what the model is predicting, but they do not attempt to explain if the predictions are correct.

Hence the call to stop explaining black-box models (at least for some applications). But this is a story for another day…

Aggregate attributions for accurate vs. inaccurate networks

Instead of plotting attributions one by one, as we saw in the previous examples, SHAP can also plot multiple images in the same plot. One advantage of this plot is that all images share the same SHAP scale.

The plots below show all the attributions for all test digits. The accurate network is on the left and the inaccurate network is on the right.

In the plot for the accurate network we can see that all samples have at least one class (digit) with favorable attributions (red). The plot is dotted with red areas. In the inaccurate network we don’t see the same pattern. The plot is mainly gray.


RESEARCH QUESTION 2: Given this pattern, is it possible to use the distribution of attributions across samples to determine if a network is accurate (or not)? In other words, if all we have is the feature attributions for a reasonable number of cases but don’t have the actual vs. predicted labels, could we use that to determine whether a network is accurate (or not)?

Limitations of these experiments

SHAP attributes features based on a baseline input. This is the line of code in the Jupyter notebook that sets the baseline:

expl = shap.DeepExplainer(model, background_images)

The baseline images are extracted from the test set here:

images, targets = next(iter(m.test_loader))
...
BACKGROUND_SIZE = 100
background_images = images[:BACKGROUND_SIZE]

The choice of baseline images can significantly affect the SHAP results (the results of any method that relies on baseline images, to be precise), as demonstrated in Visualizing the Impact of Feature Attribution Baseline.

In the experiments we conducted here we used a relatively small set of images for the baseline and we didn’t attempt to get an equal distribution of the digits in that baseline (other than a simple manual check of distributions - see the notebook).

RESEARCH QUESTION 3: Would a larger number of baseline images, with equal distribution of digits, significantly affect the results? More generally, what is a reasonable number of baseline images to start trusting the results?
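A possible starting point for that experiment is to pick the same number of baseline images for each digit. The sketch below is an assumption about how that could be done, not code from the notebook: the helper name is hypothetical and random labels stand in for the real targets.

```python
import numpy as np

def balanced_indices(labels, per_class, num_classes=10):
    """Pick the first `per_class` indices of each class; fail if a class has too few."""
    labels = np.asarray(labels)
    chosen = []
    for cls in range(num_classes):
        idx = np.flatnonzero(labels == cls)
        if len(idx) < per_class:
            raise ValueError(f"class {cls} has only {len(idx)} examples")
        chosen.extend(idx[:per_class].tolist())
    return np.array(chosen)

# Stand-in labels; in the notebook these would come from the test loader.
rng = np.random.default_rng(0)
targets = rng.integers(0, 10, size=1000)

idx = balanced_indices(targets, per_class=10)
counts = np.bincount(targets[idx], minlength=10)
print(len(idx), counts)  # 100 baseline images, 10 per digit

# Assumption: with PyTorch tensors, the baseline could then be selected with
# background_images = images[torch.as_tensor(idx)]
```

Changing per_class gives a simple way to vary the baseline size while keeping the distribution of digits equal.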

Code

See instructions here to prepare the environment and run the code.

Machine learning, but not understanding

2021-04-10T00:00:00-04:00

In the expression machine learning, are the machines actually learning anything?

In the book “Artificial Intelligence, a guide for thinking humans” Melanie Mitchell explains that

“Learning in neural networks simply consists in gradually modifying the weights on connections so that each output’s error gets as close to 0 as possible on all training examples.”

Melanie Mitchell — Artificial Intelligence, a guide for thinking humans

Let’s explore what “learning” means for machine learning, guided by Mitchell’s book. More specifically, we will concentrate on “deep learning”, a branch of machine learning that has powered most of the recent advances in artificial intelligence.

All quoted text in this article is from Dr. Mitchell’s book “Artificial Intelligence, a guide for thinking humans”.

An extremely short explanation of deep learning

Deep learning uses layers of “units” (also called neurons, but some people, including Mitchell and me, prefer the more generic term “units”, to avoid confusion with biological neurons) to extract patterns from labeled data. The internal layers are called “hidden layers”. The last layer is called the “output layer”, or the classification layer.

In the following figure (from Mitchell’s book), a neural network composed of several hidden layers (only one shown) was trained to classify handwritten digits. The output layer has ten units, one for each possible digit.

How does a neural network learn? Back to Mitchell’s quote:

“Learning in neural networks simply consists in gradually modifying the weights on connections so that each output’s error gets as close to 0 as possible on all training examples.”

Going through the sentence pieces:

training examples: The labeled examples we present to the network to train it. For example, we present a picture of a square or a triangle and its corresponding label, “square” or “triangle”.
output’s error: How far the network’s prediction is from the correct label of the example picture.
weights on connections: High-precision decimal numbers, one per connection, that scale the output of a unit in one layer before it becomes the input of a unit in the next layer. The weights are where the “knowledge” of the neural network is encoded.
gradually modifying: This is the neural network learning process. An algorithm carefully modifies the weights on the connections to get closer to the expected output. Repeating the adjustment step over time (many, many times) allows the network to learn from the training examples.
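To make "gradually modifying the weights" concrete, here is a deliberately tiny sketch: a single weight adjusted by gradient descent until the output error is close to zero. It is a simplification, not how the networks in this article were trained. Real networks have many weights and use backpropagation; the function name, learning rate, and examples are made up.

```python
def train_single_weight(examples, learning_rate=0.1, steps=100):
    """Fit y = w * x by repeatedly nudging w to reduce the mean squared error."""
    w = 0.0  # start with no "knowledge"
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= learning_rate * grad  # small step in the direction that reduces the error
    return w

# Training examples (input, expected output) that follow y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_single_weight(examples)
print(round(w, 4))  # 2.0
```

The weight ends up encoding the pattern in the examples (output is twice the input) without any notion of what the numbers mean.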

An important consequence of this process

“The machine learns what it observes in the data rather than what you (the human) might observe. If there are statistical associations in the training data, even if irrelevant to the task at hand, the machine will happily learn those instead of what you wanted it to learn.”

Thus, neural networks are not “learning” in the sense that we would understand the term. They are not learning higher-level concepts from the samples used to train them. They are extracting patterns from the data presented to them during training (and they assume that the labels are correct). That’s all.

Or, as Mitchell puts it more eloquently:

“The phrase “barrier of meaning” perfectly captures an idea that has permeated this book: humans, in some deep and essential way, understand the situations they encounter, whereas no AI system yet possesses such understanding. While state-of-the-art AI systems have nearly equaled (and in some cases surpassed) humans on certain narrowly defined tasks, these systems all lack a grasp of the rich meanings humans bring to bear in perception, language, and reasoning. This lack of understanding is clearly revealed by the un-humanlike errors these systems can make; by their difficulties with abstracting and transferring what they have learned; by their lack of commonsense knowledge; … The barrier of meaning between AI and human-level intelligence still stands today.”

Should we be concerned that deep learning is not “learning”? We should, if we don’t understand what it implies for real-life applications.

In the next sections we will explore how neural networks lack the grasp of “rich meanings we humans bring to bear in perception”, illustrating it with some “un-humanlike errors these systems can make; by their difficulties with abstracting and transferring what they have learned; by their lack of commonsense knowledge”.

You can run the examples used in the text with the Jupyter notebook on this GitHub repository. The examples use small pictures to run quickly on any computer.

Telling squares and triangles apart

We will see how a neural network trained to tell squares and triangles apart behaves.

For human beings, the pictures below show squares and triangles. Some are small, some are large, some are on a light background, some are on a darker background. But they are all clearly either a square or a triangle in a frame.

In this section we will go through the typical process of training a neural network to classify squares and triangles:

Get a dataset with labeled pictures of squares and triangles
Split the dataset into a training set and a test set
Train the network with the training set
Validate the neural network accuracy with the test set

After we are done with that, we will predict similar images to see how the network handles them.
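The split step in the process above can be sketched as follows. This is an illustration, not the notebook's code: the function name is hypothetical, and the dataset size of 670 is an assumption derived from the test set being 67 pictures, or 10% of the data.

```python
import numpy as np

def split_indices(num_examples, test_fraction=0.1, seed=42):
    """Shuffle the example indices and set aside a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_examples)
    num_test = int(round(num_examples * test_fraction))
    return shuffled[num_test:], shuffled[:num_test]  # (train, test)

train_idx, test_idx = split_indices(670)  # 670 is an assumed dataset size
print(len(train_idx), len(test_idx))  # 603 67
```

Shuffling before splitting keeps the test set from inheriting any ordering in the dataset, such as all squares stored before all triangles.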

The “squares vs. triangles” training examples

This is what some of the training images look like. Each picture is a square or a triangle in different positions. The dataset has hundreds of these pictures.

The “squares vs. triangles” neural network

We train a convolutional neural network (CNN) to classify a picture as a “square” or as a “triangle”, using the training examples. We chose a CNN architecture because it is well suited to image classification.

If you would like to see the details of the training process, see the Jupyter notebook on this GitHub repository.

How does the neural network perform?

Before we started the training process, we set aside 10% of the pictures to use later (67 pictures). They are pictures that the neural network was not trained on. This is the test set. We use the test set to measure the performance of the neural network.

A traditional measure of performance is “accuracy”. It measures the percentage of pictures in the test set that were correctly classified.

First, we ask the neural network to predict what the pictures are (more details on how that happens here), then we compare with the actual labels and calculate the accuracy.

Our neural network classified 65 out of 67 pictures correctly, for an accuracy of 97%. This is a pretty good accuracy for a relatively small neural network that can be trained quickly.
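The accuracy calculation itself is a simple ratio. The sketch below reproduces the 65-out-of-67 arithmetic with stand-in labels. The label lists are made up; only the counts come from the text.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Stand-in labels: 67 pictures, two of them classified incorrectly.
actual = ["square"] * 34 + ["triangle"] * 33
predicted = list(actual)
predicted[0] = "triangle"  # a square classified as a triangle
predicted[-1] = "square"   # a triangle classified as a square

print(f"{accuracy(predicted, actual):.1%}")  # 97.0%
```

Note that accuracy says nothing about which pictures were wrong or why, which is what the visualization of the mistakes is for.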

Let’s visualize where the neural network made the mistakes. The picture below shows the mistakes with a red border. All other pictures were classified correctly. Below each picture is the neural network’s classification.

Despite the good accuracy, does the neural network understand the concept of what it is learning?

When are squares not squares?

When they are larger. At least for this neural network.

In this section we will use the neural network we just trained to classify a set of squares. But there is a twist to these squares: they are larger than the ones we used in the training set.

This is what they look like.

Using the neural network, we classify the large squares and calculate the accuracy, just like we did with the test set.

But this time, out of 77 large squares, only 43 are classified as squares. The other 34 are classified as triangles. With an accuracy of 55.8%, the neural network is barely better than flipping a coin.

Below are all the squares in this set and how the neural network classified them. The ones with the red border were incorrectly classified as triangles (there are many of them).

Why does this experiment matter?

The simplest and fastest way to improve this neural network is to increase the size of the training and test sets. In this case, we should add larger squares to the training set and retrain the neural network. It will very likely perform better.

But this does not address the fundamental problem: the neural network does not understand the concept of “square”.

Quoting Mitchell again (emphasis added):

“The phrase “barrier of meaning” perfectly captures an idea that has permeated this book: humans, in some deep and essential way, understand the situations they encounter, whereas no AI system yet possesses such understanding. While state-of-the-art AI systems have nearly equaled (and in some cases surpassed) humans on certain narrowly defined tasks, these systems all lack a grasp of the rich meanings humans bring to bear in perception, language, and reasoning. This lack of understanding is clearly revealed by the un-humanlike errors these systems can make; by their difficulties with abstracting and transferring what they have learned; by their lack of commonsense knowledge; … The barrier of meaning between AI and human-level intelligence still stands today.”

Even if we collect lots and lots and lots of examples, we are confronted with the long-tail problem:

“[T]he vast range of possible unexpected situations an AI system could be faced with.”

For example, let’s say we trained our autonomous driving system to recognize a school zone by the warning sign painted on the road (source):

Then, one day our autonomous driving system comes across these real-life examples (source 1, source 2):

Any (well, most) human being would still identify them as warning signs for school zones (presumably, the human would chuckle, then - hopefully - slow down).

Would the autonomous driving system identify them correctly? The honest answer is “we don’t know”. It depends on how it was trained. Was it given these examples in the training set? In enough quantities to identify the pattern? Did the test set have examples? Were they classified correctly?

But no matter how comprehensive we make the training and test sets and how methodically we inspect the classification results, we are faced with the fundamental problem: the neural network does not understand the concept of “school zone warning”.

The autonomous driving system lacks common sense.

“…humans also have a fundamental competence lacking in all current AI systems: common sense. We have vast background knowledge of the world, both its physical and its social aspects.”

The neural network may be learning, but it is definitely not understanding.

Not understanding “squares” - part 2

In the first section we changed the size of an object. In this section we will not change the object. We will change the environment instead.

We will train a neural network to classify squares and triangles again. This time they are in different environments, represented by different background colors. The squares are on a lighter background and the triangles are on a dark(er) background (we can think of the backgrounds as “twilight” and “night”).

The picture below shows what they look like.

Following the same steps we used in the first section, we train a neural network to classify the squares and triangles.

Once the network is trained, we use the test set to calculate the neural network accuracy and find out that it is a perfect 100% accuracy score. All squares and triangles in the test set were classified correctly.

If you would like to see the details of the training process, see the Jupyter notebook on this GitHub repository.

So far, so good, but…

In the dark, all squares are triangles

What happens if the squares are now in the same environment as the triangles (all squares are in the “night” environment)?

This is what the squares look like in the darker environment.

When we ask the neural network to classify these squares, we find out that the performance is now abysmal. The accuracy is 0%. All squares are misclassified as triangles.

To confirm, we can visualize the predictions. The wrong predictions have a red frame around them (all of them are wrong in this case).

Why does this experiment matter?

The neural network we just trained fails in the same way the first neural network failed: it doesn’t understand the concepts of “square” and “triangle”. It is just looking for any sort of pattern in the training data. It doesn’t know whether a pattern makes sense. It just knows there is a pattern there.

In this case, the neural network is very likely learning not from the shape, but from the background (a case of spurious correlation). It is assuming that a darker background means “triangle” because it doesn’t really understand the concept of what makes a triangle a triangle.
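The effect of such a shortcut can be reproduced without a neural network at all. In the sketch below, a hand-written rule that looks only at the background stands in for what the network may have picked up. It is an illustration, not the trained model: the picture size, brightness values, and function names are all made up.

```python
import numpy as np

SIZE = 16  # hypothetical picture size, not the one used in the notebook
LIGHT, DARK = 0.7, 0.2  # "twilight" and "night" background brightness

def make_image(shape, background, rng):
    """Draw a bright square or triangle at a random position on a uniform background."""
    img = np.full((SIZE, SIZE), background, dtype=float)
    top, left = rng.integers(2, 6, size=2)
    for row in range(6):
        width = 6 if shape == "square" else row + 1
        img[top + row, left:left + width] = 1.0
    return img

def background_rule(img):
    """A 'model' that looks only at the background brightness, never at the shape."""
    return "square" if np.median(img) > 0.5 else "triangle"

def accuracy(images, labels):
    hits = sum(background_rule(img) == label for img, label in zip(images, labels))
    return hits / len(labels)

rng = np.random.default_rng(0)

# Test set drawn the same way as the training data: light squares, dark triangles.
test_images = [make_image("square", LIGHT, rng) for _ in range(20)]
test_images += [make_image("triangle", DARK, rng) for _ in range(20)]
test_labels = ["square"] * 20 + ["triangle"] * 20

# Squares moved to the "night" environment.
night_squares = [make_image("square", DARK, rng) for _ in range(20)]

print(accuracy(test_images, test_labels))        # 1.0
print(accuracy(night_squares, ["square"] * 20))  # 0.0
```

The rule scores 100% on the test set and 0% on the dark squares, the same pattern we saw with the neural network. A perfect test score cannot tell apart a model that learned the shape from one that learned the background when the two are perfectly correlated in the data.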

Sometimes this leads to some funny examples, like the neural network that “learned” to classify land vs. water birds based on the background. The duck on the right was misclassified as a land bird, simply because it was not in its usual water environment (source).

Other times the mistakes are more consequential, for example, when neural networks misclassify X-rays based on markings left by radiologists in the images. Instead of learning actual attributes of a disease, the neural network “learned” from the marks left behind in the images. Images without such marks may be classified as “healthy”. The consequences can be catastrophic (source).

Should we be concerned that deep “learning” is not “understanding”?

Mitchell asks the following question in her book:

“but the question remains: Will the fact that these systems lack humanlike understanding inevitably render them fragile, unreliable, and vulnerable to attacks? And how should this factor into our decisions about applying AI systems in the real world?”

Until we achieve humanlike understanding, we should be concerned that neural networks do not generalize well.

Does it mean we need to stop using neural networks until then? No.

“I think the most worrisome aspect of AI systems in the short term is that we will give them too much autonomy without being fully aware of their limitations and vulnerabilities.”

Deep learning has successfully improved our lives. It’s “just” a matter of understanding its limitations and applying it judiciously, to the tasks for which it is well suited.

To do that we need to educate the general public and, more importantly, the technical community. Too often we hype the next “AI has achieved humanlike performance in [some task here]”, when in fact we should say “under these specific circumstances, for this specific application, AI has performed well”.

Source code for the experiments

The source code for the experiments described here is on this GitHub repository. It uses small pictures to run quickly on a regular computer.

Feel free to modify the pictures, the neural network model, and other parameters that affect the results.

But remember that when the results improve, it’s not the neural network that is learning more all of a sudden. You are improving it.

“Because of the open-ended nature of designing these networks, in general it is not possible to automatically set all the parameters and designs, even with automated search. Often it takes a kind of cabalistic knowledge that students of machine learning gain both from their apprenticeships with experts and from hard-won experience.”

Melanie Mitchell — Artificial Intelligence, a guide for thinking humans
