Azure AI Fundamentals
Machine learning - This is often the foundation for an AI system; it's the way we
"teach" a computer model to make predictions and draw conclusions from data.
Computer vision - Capabilities within AI to interpret the world visually through
cameras, video, and images.
Natural language processing - Capabilities within AI for a computer to interpret
written or spoken language, and respond in kind.
Document intelligence - Capabilities within AI that deal with managing, processing,
and using high volumes of data found in forms and documents.
Knowledge mining - Capabilities within AI to extract information from large volumes
of often unstructured data to create a searchable knowledge store.
Generative AI - Capabilities within AI that create original content in a variety of
formats including natural language, image, code, and more.
So how do machines learn? The answer is: from data. In today's world, we create huge volumes of data as we go
about our everyday lives. From the text messages, emails, and social media posts we
send to the photographs and videos we take on our phones, we generate massive
amounts of information. More data still is created by millions of sensors in our
homes, cars, cities, public transport infrastructure, and factories.
Data scientists can use all of that data to train machine learning models that can
make predictions and inferences based on the relationships they find in the data.
Machine learning models try to capture the relationships within data. For example,
suppose an environmental conservation organization wants volunteers to identify and
catalog different species of wildflower using a phone app. Machine learning can
enable this scenario: a model trained on labeled photographs of known wildflower
species can identify the species in new photos that volunteers take with the app.
Analyze and interpret text in documents, email messages, and other sources.
Interpret spoken language, and synthesize speech responses.
Automatically translate spoken or written phrases between languages.
Interpret commands and determine appropriate actions.
You can use Microsoft's Azure AI Language to build natural language processing
solutions. Some features of Azure AI Language include understanding and analyzing
text, training conversational language models that can understand spoken or text-
based commands, and building intelligent applications.
Microsoft's Azure AI Speech is another service that can be used to build natural
language processing solutions. Azure AI Speech features include speech recognition
and synthesis, real-time translations, conversation transcriptions, and more.
You can explore Azure AI Language features in the Azure Language Studio and Azure
AI Speech features in the Azure Speech Studio. The service features are available
for use and testing in the studios, and programmatically through SDKs and REST APIs.
Azure AI Search can utilize the built-in AI capabilities of Azure AI services such
as image processing, document intelligence, and natural language processing to
extract data. The product's AI capabilities make it possible to index previously
unsearchable documents and to extract and surface insights from large amounts of
data quickly.
You can use Microsoft's Azure AI Document Intelligence to build solutions that
manage and accelerate data collection from scanned documents. Features of Azure AI
Document Intelligence help automate document processing in applications and
workflows, enhance data-driven strategies, and enrich document search capabilities.
You can use prebuilt models to add intelligent document processing for invoices,
receipts, health insurance cards, tax forms, and more. You can also use Azure AI
Document Intelligence to create custom models with your own labeled datasets. The
service features are available for use and testing in the Document Intelligence
Studio, and programmatically through SDKs and REST APIs.
Azure OpenAI Service is Microsoft's cloud solution for deploying, customizing, and
hosting generative AI models. It brings together the best of OpenAI's cutting edge
models and APIs with the security and scalability of the Azure cloud platform.
Azure OpenAI Service supports many generative model choices that can serve
different needs. You can use Azure AI Studio to create generative AI solutions,
such as custom copilot chat-based assistants that use Azure OpenAI Service models.
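A hedged sketch of calling a deployed Azure OpenAI chat model from Python with the openai package (version 1 or later); the endpoint, key, API version, and deployment name below are placeholders to replace with your own values.

```python
# Minimal sketch (not an official sample): calling an Azure OpenAI chat model.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-key>",                                        # placeholder
    api_version="2024-02-01",                                    # example API version
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # the name you gave your model deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Azure OpenAI Service is."},
    ],
)
print(response.choices[0].message.content)
```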
Machine learning is in many ways the intersection of two disciplines - data science
and software engineering. The goal of machine learning is to use data to create a
predictive model that can be incorporated into a software application or service.
To achieve this goal requires collaboration between data scientists who explore and
prepare the data before using it to train a machine learning model, and software
developers who integrate the models into applications where they're used to predict
new data values (a process known as inferencing).
The number of ice creams sold on a given day, based on the temperature, rainfall,
and windspeed.
The selling price of a property based on its size in square feet, the number of
bedrooms it contains, and socio-economic metrics for its location.
The fuel efficiency (in miles-per-gallon) of a car based on its engine size,
weight, width, height, and length.
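As a concrete illustration of the first scenario above, here is a minimal regression sketch using scikit-learn's LinearRegression; all data values are invented purely for illustration.

```python
# Predict ice cream sales (label y) from temperature, rainfall, and windspeed
# (features X). The numbers are invented.
from sklearn.linear_model import LinearRegression

# Each row: [temperature (°C), rainfall (mm), windspeed (km/h)]
X = [[28, 0, 5], [31, 0, 8], [22, 12, 15], [18, 20, 20], [25, 2, 10]]
y = [410, 460, 210, 95, 330]  # ice creams sold (invented values)

model = LinearRegression().fit(X, y)

# Inferencing: predict sales for a new day's forecast
print(model.predict([[27, 1, 7]]))
```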
Classification
Classification is a form of supervised machine learning in which the label
represents a categorization, or class. There are two common classification
scenarios.
The arrangement of the confusion matrix is such that correct (true) predictions are
shown in a diagonal line from top-left to bottom-right. Often, color-intensity is
used to indicate the number of predictions in each cell, so a quick glance at a
model that predicts well should reveal a deeply shaded diagonal trend.
The simplest metric you can calculate from the confusion matrix is accuracy - the
proportion of predictions that the model got right. Accuracy is calculated as:
(TN+TP) ÷ (TN+FN+FP+TP)
Recall is a metric that measures the proportion of positive cases that the model
identified correctly. In other words, compared to the number of patients who have
diabetes, how many did the model predict to have diabetes?
The formula for recall is:
TP ÷ (TP+FN)
F1-score is an overall metric that combines recall and precision. The formula for
F1-score is:
(2 x Precision x Recall) ÷ (Precision + Recall)
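The following sketch computes these metrics with scikit-learn. The labels and predictions are invented, but chosen so recall comes out to 0.75 with two true negatives, matching the threshold discussion that follows.

```python
# Computing the confusion-matrix metrics above (1 = has diabetes, 0 = does not).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0]  # actual labels (invented)
y_pred = [1, 1, 1, 0, 0, 0]  # model predictions (invented)

print(confusion_matrix(y_true, y_pred))               # [[TN FP], [FN TP]]
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TN+TP) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP+FN) = 0.75 here
print("F1       :", f1_score(y_true, y_pred))         # 2PR / (P+R)
```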
Another name for recall is the true positive rate (TPR), and there's an equivalent
metric called the false positive rate (FPR) that is calculated as FP÷(FP+TN). We
already know that the TPR for our model when using a threshold of 0.5 is 0.75, and
we can use the formula for FPR to calculate a value of 0÷2 = 0.
Of course, if we were to change the threshold above which the model predicts true
(1), it would affect the number of positive and negative predictions; and therefore
change the TPR and FPR metrics. These metrics are often used to evaluate a model by
plotting a receiver operating characteristic (ROC) curve that compares the TPR and
FPR for every possible threshold value between 0.0 and 1.0.
The ROC curve for a perfect model would go straight up the TPR axis on the left and
then across the FPR axis at the top. Since the plot area for the curve measures
1x1, the area under this perfect curve would be 1.0 (meaning that the model is
correct 100% of the time). In contrast, a diagonal line from the bottom-left to the
top-right represents the results that would be achieved by randomly guessing a
binary label; producing an area under the curve of 0.5. In other words, given two
possible class labels, you could reasonably expect to guess correctly 50% of the
time.
In the case of our diabetes model, the area under the curve (AUC) metric is 0.875.
Since the AUC is higher than 0.5, we can conclude
the model performs better at predicting whether or not a patient has diabetes than
randomly guessing.
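Here is a short sketch of producing a ROC curve and AUC with scikit-learn and matplotlib. The probability scores are invented, but chosen so the AUC comes out to 0.875, as in the example above.

```python
# Plot a ROC curve and compute AUC from predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 0, 0]              # actual labels (invented)
y_score = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]  # model's P(y=1) (invented)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))  # 0.875 for this data

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```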
One-vs-Rest algorithms train a binary classification function for each class, with
each function calculating the probability that the observation belongs to that
class as opposed to any of the other classes.
Multinomial algorithms
An alternative approach is to use a multinomial algorithm, which creates a single
function that returns a multi-valued output. The output is a vector (an array of
values) containing the probability distribution for all possible classes, with a
probability score for each class that together total 1.0:
f(x) = [P(y=0|x), P(y=1|x), P(y=2|x)]
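A softmax output layer is one common way to produce a probability vector like the one above (multinomial logistic regression works this way). This minimal sketch uses invented scores.

```python
# Turn one raw score per class into a probability distribution over classes.
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

raw_scores = np.array([2.0, 0.5, -1.0])  # one score per class (invented)
probs = softmax(raw_scores)

print(probs)        # e.g. [P(y=0|x), P(y=1|x), P(y=2|x)]
print(probs.sum())  # totals 1 (up to floating-point rounding)
```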
There are multiple metrics that you can use to evaluate cluster separation,
including:
Average distance to cluster center: How close, on average, each point in the
cluster is to the centroid of the cluster.
Average distance to other center: How close, on average, each point in the cluster
is to the centroid of all other clusters.
Maximum distance to cluster center: The furthest distance between a point in the
cluster and its centroid.
Silhouette: A value between -1 and 1 that summarizes the ratio of distance between
points in the same cluster and points in different clusters (The closer to 1, the
better the cluster separation).
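A minimal sketch of evaluating cluster separation with scikit-learn, using invented two-dimensional points and the silhouette metric described above.

```python
# Cluster invented 2-D points with k-means and check separation with the
# silhouette score (closer to 1 = better separation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # invented cluster A
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # invented cluster B

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster centers:", kmeans.cluster_centers_)
print("Silhouette:", silhouette_score(X, kmeans.labels_))
```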
Deep learning is an advanced form of machine learning that tries to emulate the way
the human brain learns. The key to deep learning is the creation of an artificial
neural network that simulates electrochemical activity in biological neurons by
using mathematical functions, as shown here.
Biological neural network: Neurons fire in response to electrochemical stimuli. When
fired, the signal is passed to connected neurons.
Artificial neural network: Each neuron is a function that operates on an input value
(x) and a weight (w). The function is wrapped in an activation function that
determines whether to pass the output on.
Artificial neural networks are made up of multiple layers of neurons - essentially
defining a deeply nested function. This architecture is the reason the technique is
referred to as deep learning and the models produced by it are often referred to as
deep neural networks (DNNs). You can use deep neural networks for many kinds of
machine learning problem, including regression and classification, as well as more
specialized models for natural language processing and computer vision.
Just like other machine learning techniques discussed in this module, deep learning
involves fitting training data to a function that can predict a label (y) based on
the value of one or more features (x). The function (f(x)) is the outer layer of a
nested function in which each layer of the neural network encapsulates functions
that operate on x and the weight (w) values associated with them. The algorithm
used to train the model involves iteratively feeding the feature values (x) in the
training data forward through the layers to calculate output values for ŷ,
validating the model to evaluate how far off the calculated ŷ values are from the
known y values (which quantifies the level of error, or loss, in the model), and
then modifying the weights (w) to reduce the loss. The trained model includes the
final weight values that result in the most accurate predictions.
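The loop described above (feed forward, measure loss, adjust weights) can be sketched in a drastically simplified form for a single neuron with a sigmoid activation; real deep neural networks stack many layers of these and are trained with frameworks. The data values are invented.

```python
# Highly simplified training loop: forward pass, loss, weight adjustment.
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]])  # features
y = np.array([0, 0, 1, 1])                                      # known labels

w = np.zeros(2)  # weights
b = 0.0          # bias
lr = 0.5         # learning rate

for epoch in range(1000):
    # forward pass: weighted sum wrapped in a sigmoid activation
    y_hat = 1 / (1 + np.exp(-(X @ w + b)))
    loss = np.mean((y_hat - y) ** 2)          # how far off the predictions are
    # adjust the weights to reduce the loss (gradient descent)
    grad = (y_hat - y) * y_hat * (1 - y_hat)  # loss gradient (up to a constant)
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

print("Final weights:", w, "bias:", b)
print("Predictions:", np.round(1 / (1 + np.exp(-(X @ w + b))), 2))
```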
In Azure Machine Learning studio, you can (among other things):
In this module we’ve used several different terms relating to AI services. Here's a
recap:
Transformers
Most advances in computer vision over the decades have been driven by improvements
in CNN-based models. However, in another AI discipline, natural language processing
(NLP), a different type of neural network architecture, called a transformer, has
enabled the development of sophisticated models for language. Transformers work
by processing huge volumes of data, and encoding language tokens (representing
individual words or phrases) as vector-based embeddings (arrays of numeric values).
You can think of an embedding as representing a set of dimensions that each
represent some semantic attribute of the token. The embeddings are created such
that tokens that are commonly used in the same context are closer together
dimensionally than unrelated words.
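The following toy sketch illustrates the idea of embeddings with invented three-dimensional vectors; real transformer embeddings have hundreds or thousands of dimensions and are learned from data rather than written by hand.

```python
# Related tokens sit closer together (higher cosine similarity) than unrelated ones.
import numpy as np

embeddings = {
    "dog":        np.array([0.90, 0.10, 0.20]),  # invented vectors
    "puppy":      np.array([0.85, 0.15, 0.25]),
    "skateboard": np.array([0.10, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))       # high
print(cosine_similarity(embeddings["dog"], embeddings["skateboard"]))  # lower
```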
Multi-modal models
The success of transformers as a way to build language models has led AI
researchers to consider whether the same approach would be effective for image
data. The result is the development of multi-modal models, in which the model is
trained using a large volume of captioned images, with no fixed labels. An image
encoder extracts features from images based on pixel values and combines them with
text embeddings created by a language encoder. The overall model encapsulates
relationships between natural language token embeddings and image features.
The Microsoft Florence model is just such a model. Trained with huge volumes of
captioned images from the Internet, it includes both a language encoder and an
image encoder. Florence is an example of a foundation model, that is, a pre-trained
general model on which you can build multiple adaptive models for specialist tasks.
For example, you can use Florence as a foundation model for
adaptive models that perform:
Microsoft Azure provides multiple Azure AI services that you can use to detect and
analyze faces, including:
Azure AI Vision, which offers face detection and some basic face analysis, such as
returning the bounding box coordinates for the region of an image that contains a
face.
Azure AI Video Indexer, which you can use to detect and identify faces in a video.
Azure AI Face, which offers pre-built algorithms that can detect, recognize, and
analyze faces.
Of these, Face offers the widest range of facial analysis capabilities.
Face service
The Azure Face service can return the rectangle coordinates for any human faces
that are found in an image, as well as a series of attributes related to those
faces, such as:
Accessories: indicates whether the given face has accessories. This attribute
returns possible accessories including headwear, glasses, and mask, with a
confidence score between zero and one for each accessory.
Blur: how blurred the face is, which can be an indication of how likely the face is
to be the main focus of the image.
Exposure: such as whether the image is underexposed or overexposed. This applies
to the face in the image and not the overall image exposure.
Glasses: whether or not the person is wearing glasses.
Head pose: the face's orientation in a 3D space.
Mask: indicates whether the face is wearing a mask.
Noise: refers to visual noise in the image. If you have taken a photo with a high
ISO setting for darker settings, you would notice this noise in the image. The
image looks grainy or full of tiny dots that make the image less clear.
Occlusion: determines if there might be objects blocking the face in the image.
For example, consider the following restaurant reviews, which are already labeled
as 0 (negative) or 1 (positive):
The food and service were both great: 1
A really terrible experience: 0
Mmm! tasty food and a fun vibe: 1
Slow service and substandard food: 0
With enough labeled reviews, you can train a classification model using the
tokenized text as features and the sentiment (0 or 1) as a label. The model will
encapsulate a relationship between tokens and sentiment - for example, reviews with
tokens for words like "great", "tasty", or "fun" are more likely to return a
sentiment of 1 (positive), while reviews with words like "terrible", "slow", and
"substandard" are more likely to return 0 (negative).
Azure AI Language is a part of the Azure AI services offerings that can perform
advanced natural language processing over unstructured text. Azure AI Language's
text analysis features include:
Named entity recognition identifies people, places, events, and more. This feature
can also be customized to extract custom categories.
Entity linking identifies known entities together with a link to Wikipedia.
Personal identifying information (PII) detection identifies personally sensitive
information, including personal health information (PHI).
Language detection identifies the language of the text and returns a language code
such as "en" for English.
Sentiment analysis and opinion mining identifies whether text is positive or
negative.
Summarization summarizes text by identifying the most important information.
Key phrase extraction lists the main concepts from unstructured text.
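A hedged sketch of calling a few of these features with the azure-ai-textanalytics client library; the endpoint and key are placeholders for your own Azure AI Language resource, and result attributes may vary by library version.

```python
# Language detection, sentiment, key phrases, and entities for a short document.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

docs = ["The food and service were both great."]

print(client.detect_language(docs)[0].primary_language.iso6391_name)  # e.g. "en"
print(client.analyze_sentiment(docs)[0].sentiment)                    # e.g. "positive"
print(client.extract_key_phrases(docs)[0].key_phrases)
print([e.text for e in client.recognize_entities(docs)[0].entities])
```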
To work with conversational language understanding, you need to take into account
three core concepts: utterances, entities, and intents.
Utterances
An utterance is an example of something a user might say, and which your
application must interpret. For example, when using a home automation system, a
user might use the following utterances:
Entities
An entity is an item to which an utterance refers. For example, fan and light in
the following utterances:
You can think of the fan and light entities as being specific instances of a
general device entity.
Intents
An intent represents the purpose, or goal, expressed in a user's utterance. For
example, for both of the previously considered utterances, the intent is to turn a
device on; so in your conversational language understanding application, you might
define a TurnOn intent that is related to these utterances.
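To make these three concepts concrete, here is a purely illustrative sketch (not the service's actual response schema) of how a client might represent a predicted intent and its entities for the utterance "Turn the light on".

```python
# Illustrative structure only: an utterance mapped to an intent and entities.
prediction = {
    "utterance": "Turn the light on",
    "topIntent": "TurnOn",
    "entities": [
        {"category": "device", "text": "light"},
    ],
}

if prediction["topIntent"] == "TurnOn":
    for entity in prediction["entities"]:
        print(f"Switching on the {entity['text']}")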
Azure AI Language: A resource that enables you to build apps with industry-leading
natural language understanding capabilities without machine learning expertise. You
can use a language resource for authoring and prediction.
Azure AI services: A general resource that includes conversational language
understanding along with many other Azure AI services. You can only use this type
of resource for prediction.
Keeping the resources separate is useful when you want to track utilization of
Azure AI Language separately from client applications that use a general Azure AI
services resource.
Authoring
After you've created an authoring resource, you can use it to train a
conversational language understanding model. To train a model, start by defining
the entities and intents that your application will predict as well as utterances
for each intent that can be used to train the predictive model.
When you create entities and intents, you can do so in any order. You can create an
intent, and select words in the sample utterances you define for it to create
entities for them; or you can create the entities ahead of time and then map them
to words in utterances as you're creating the intents.
You can write code to define the elements of your model, but in most cases it's
easiest to author your model using the Language studio - a web-based interface for
creating and managing Conversational Language Understanding applications.
After training the model, you can test it by submitting text and reviewing the
predicted intents. Training and testing is an iterative process. After you train
your model, you test it with sample utterances to see if the intents and entities
are recognized correctly. If they're not, make updates, retrain, and test again.
Predicting
When you are satisfied with the results from the training and testing, you can
publish your Conversational Language Understanding application to a prediction
resource for consumption.
Client applications can use the model by connecting to the endpoint for the
prediction resource, specifying the appropriate authentication key, and submitting
user input to get predicted intents and entities. The predictions are returned to the
client application, which can then take appropriate action based on the predicted
intent.
You can use the output of speech synthesis for many purposes, including:
The model that is used by the Speech to text API is based on the Universal
Language Model that was trained by Microsoft. The data for the model is Microsoft-
owned and deployed to Microsoft Azure. The model is optimized for two scenarios,
conversational and dictation. You can also create and train your own custom models
including acoustics, language, and pronunciation if the pre-built models from
Microsoft do not provide what you need.
Real-time transcription
Real-time speech to text allows you to transcribe audio streams into text as the
speech occurs. You can
use real-time transcription for presentations, demos, or any other scenario where a
person is speaking.
Batch transcription
Not all speech to text scenarios are real time. You might have audio recordings
stored on a file share, a remote server, or even on Azure storage. You can point to
audio files with a shared access signature (SAS) URI and asynchronously receive
transcription results.
Batch transcription should be run in an asynchronous manner because the batch jobs
are scheduled on a best-effort basis. Normally a job will start executing within
minutes of the request but there is no estimate for when a job changes into the
running state.
The text to speech API
The text to speech API enables you to convert text input to audible speech, which
can either be played directly through a computer speaker or written to an audio
file.
The service includes multiple pre-defined voices with support for multiple
languages and regional pronunciation, including neural voices that leverage neural
networks to overcome common limitations in speech synthesis with regard to
intonation, resulting in a more natural-sounding voice. You can also develop custom
voices and use them with the text to speech API.
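A hedged sketch of speech synthesis with the Azure Speech SDK (the azure-cognitiveservices-speech package); the key, region, and voice name are placeholders or examples, and the synthesized speech plays through the default speaker.

```python
# Convert text to audible speech with a pre-defined neural voice.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<your-key>",   # placeholder
                                       region="<your-region>")      # placeholder
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"     # example voice

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Welcome to Azure AI Speech.").get()
print(result.reason)  # indicates whether synthesis completed successfully
```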
Using document intelligence, a company can take a scanned image of a receipt,
digitize the text with OCR, and pair the field items with their field names in a
database. Document intelligence can identify specific data such as the merchant's
name, merchant's address, total value, and tax value.
Azure AI Document Intelligence supports features that can analyze documents and
forms with prebuilt and custom models. In this module, you explore how Azure AI
services provide access to document intelligence capabilities.
Prebuilt models - pretrained models that have been built to process common document
types such as invoices, business cards, ID documents, and more. These models are
designed to recognize and extract specific fields that are important for each
document type.
Custom models - can be trained to identify specific fields that are not included in
the existing pretrained models.
Document analysis - general document analysis that returns structured data
representations, including regions of interest and their inter-relationships.
Prebuilt models
The prebuilt models apply advanced machine learning to accurately identify and
extract text, key-value pairs, tables, and structures from forms and documents.
These capabilities include extracting:
The model has been trained to recognize several different languages, depending on
the receipt type. For best results when using the prebuilt receipt model, images
should be:
After the resource has been created, you can create client applications that use
its key and endpoint to submit forms for analysis, or use the resource in Document
Intelligence Studio.
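A hedged sketch of the kind of client application described above, using the azure-ai-formrecognizer library and the prebuilt receipt model; the endpoint, key, and file path are placeholders, and the field names shown are examples to adapt to your scenario.

```python
# Analyze a scanned receipt with the prebuilt receipt model.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

with open("receipt.jpg", "rb") as f:  # placeholder file
    poller = client.begin_analyze_document("prebuilt-receipt", document=f)
result = poller.result()

for receipt in result.documents:
    merchant = receipt.fields.get("MerchantName")  # example field
    total = receipt.fields.get("Total")            # example field
    if merchant:
        print("Merchant:", merchant.value, "confidence:", merchant.confidence)
    if total:
        print("Total:", total.value, "confidence:", total.confidence)
```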
Data from any source: accepts data from any source provided in JSON format, with
auto crawling support for selected data sources in Azure.
Full text search and analysis: offers full text search capabilities supporting both
simple query and full Lucene query syntax.
AI powered search: has Azure AI capabilities built in for image and text analysis
from raw content.
Multi-lingual: offers linguistic analysis for 56 languages to intelligently handle
phonetic matching or language-specific linguistics. Natural language processors
available in Azure AI Search are also used by Bing and Office.
Geo-enabled: supports geo-search filtering based on proximity to a physical
location.
Configurable user experience: has several features to improve the user experience
including autocomplete, autosuggest, pagination, and hit highlighting.
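A hedged sketch of querying an existing Azure AI Search index with the azure-search-documents client library; the endpoint, index name, key, and search text are placeholders.

```python
# Run a full text query against an Azure AI Search index.
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",  # placeholder
    index_name="<your-index>",                              # placeholder
    credential=AzureKeyCredential("<your-query-key>"),      # placeholder
)

results = search_client.search(search_text="luxury hotel near the beach")
for doc in results:
    print(doc)  # each result is a dictionary of the indexed fields
```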