
22CS512 - SECURITY AND PRIVACY IN CLOUD

What is System Design for Cloud Computing?


Systems design for cloud computing refers to architecting and
planning the structure of applications and systems that leverage cloud services
and resources. It encompasses various aspects such as scalability, reliability,
security, and cost-efficiency to optimize the performance and effectiveness of
applications in cloud environments.
Core Design Principles for Cloud Computing:
1. Scalability :
Scalability is the capability of a system to enhance its capacity as
the size of the load rises by putting up more resources in the system. In cloud
computing this means being able to think of architectures, that can scale out, that
is add more instances at a time and scale up, that is increasing the power of each
of the instances. Key techniques include:
 Horizontal Scaling: Adding more instances to balance the load of how
frequently a certain page is accessed by the users.
 Vertical Scaling: This particular pattern involves expanding the number
of existing instances, which are considered as a means of raising the
capability of a given establishment.
 Auto-Scaling: that are allocated to meet the users’ needs are likely to be
varied.
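As an illustration of these techniques, here is a minimal, hypothetical Python sketch of a threshold-based auto-scaling decision. The metric, thresholds, and instance limits are illustrative assumptions, not tied to any particular cloud provider's API.

# Hypothetical threshold-based auto-scaling decision (illustrative only).
def desired_instance_count(current_instances, avg_cpu_percent,
                           scale_out_at=70, scale_in_at=30,
                           min_instances=2, max_instances=20):
    """Return how many instances the group should run next."""
    if avg_cpu_percent > scale_out_at:      # load too high: scale out
        target = current_instances + 1
    elif avg_cpu_percent < scale_in_at:     # load low: scale in
        target = current_instances - 1
    else:                                   # within the comfort band
        target = current_instances
    return max(min_instances, min(max_instances, target))

print(desired_instance_count(4, 85))   # -> 5 (scale out)
print(desired_instance_count(4, 20))   # -> 3 (scale in)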
2. Reliability and Fault Tolerance:
Reliability is the guarantee that a system delivers its required
functionality correctly, while fault tolerance is the ability of a system to keep
providing that functionality despite the failure of some of its parts. Strategies
include the following (a brief failover sketch follows the list):
 Redundancy: Replicating vital components and services so that no single
failure brings the system down.
 Failover: Automatically switching to standby resources when a component
fails.
 Data Replication: Copying data to different sites so that it can be restored
after a loss.
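The sketch below illustrates the failover idea in Python: try a primary endpoint first and automatically switch to a standby replica if it fails. The endpoint URLs and the health-check style are hypothetical placeholders, not part of any real deployment.

import urllib.request

# Hypothetical endpoints; replace with real primary/standby services.
PRIMARY = "https://primary.example.com/health"
STANDBY = "https://standby.example.com/health"

def fetch_with_failover(primary_url, standby_url, timeout=2):
    """Try the primary resource; on any failure, fall back to the standby."""
    for url in (primary_url, standby_url):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except Exception as exc:            # primary failed: try the standby
            print(f"{url} unavailable ({exc}); failing over")
    raise RuntimeError("both primary and standby are unavailable")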
3. Performance Optimization:
Performance optimization is the process of minimizing response
time and maximizing the amount of data that can be transferred (throughput).
Techniques include the following (a brief caching sketch follows the list):
 Load Balancing: Distributing incoming traffic across multiple servers.
 Caching: Keeping frequently used data in memory for fast access.
 Content Delivery Networks (CDNs): Delivering content from locations
closer to the users.
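For example, the standard-library memoization decorator below keeps results of an expensive lookup in memory so that repeated requests skip the slow call. This is a minimal sketch; real cloud systems would typically use a shared cache such as Redis or Memcached, and the lookup function here is invented for illustration.

from functools import lru_cache

@lru_cache(maxsize=1024)          # keep up to 1024 recent results in memory
def load_profile(user_id):
    # Stand-in for a slow database or API call.
    print(f"fetching profile {user_id} from the database...")
    return {"id": user_id, "name": f"user-{user_id}"}

load_profile(42)   # first call hits the "database"
load_profile(42)   # second call is served from the in-memory cache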
4. Security:
Security in cloud systems is the protection of data and
applications in the cloud against unauthorized access and other threats. Best
practices include:
 Encryption: Protecting data at rest on storage devices and in transit over
the network.
 Identity and Access Management (IAM): Restricting which people and
services can access which resources.
 Network Security: Setting up firewalls and virtual private networks.
5. Cost Efficiency:
Cost efficiency is about using cloud resources economically,
keeping spending as low as possible while still meeting requirements.
Techniques include the following (a brief cost-monitoring sketch follows the list):
 Spot Instances: Using lower-cost surplus capacity offered by the cloud
provider.
 Resource Provisioning: Allocating resources to match the expected level
of demand as precisely as possible.
 Cost Monitoring and Analysis: Tracking utilization and spend to identify
possible areas of cost reduction.
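A minimal sketch of cost monitoring in Python: given utilization records (the field names and numbers are assumptions made for illustration), flag resources whose average utilization is low enough that they are candidates for downsizing or termination.

# Illustrative cost-monitoring check; the record fields are assumed, not a real API.
resources = [
    {"name": "web-1",   "monthly_cost": 120.0, "avg_cpu_percent": 55},
    {"name": "batch-2",  "monthly_cost": 300.0, "avg_cpu_percent": 4},
    {"name": "db-1",    "monthly_cost": 450.0, "avg_cpu_percent": 35},
]

def underutilized(resources, cpu_threshold=10):
    """Return resources that cost money but sit mostly idle."""
    return [r for r in resources if r["avg_cpu_percent"] < cpu_threshold]

for r in underutilized(resources):
    print(f"{r['name']}: {r['avg_cpu_percent']}% CPU, "
          f"${r['monthly_cost']:.2f}/month - consider downsizing")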
Comprehensive data protection:
Comprehensive data protection involves a multi-faceted approach encompassing
various measures to safeguard sensitive information. This includes employing
firewalls and encrypted systems, using VPNs, implementing strong passwords,
and being cautious about public networks and external authorizations.
Furthermore, it is crucial to regularly assess and update security measures,
including encryption, backup and disaster recovery planning, access control,
network security, and physical security, to ensure the confidentiality of
sensitive information.
Here's a more detailed breakdown:
Key Aspects of Data Protection:
 Data Analysis and Classification: Understanding the type
and sensitivity of data is crucial for implementing appropriate
security measures.
 Access Control: Limiting access to sensitive data based on
roles and permissions is essential.
 Encryption: Protecting data at rest and in transit using strong
encryption algorithms.
 Multi-factor Authentication (MFA): Adding an extra layer
of security to user logins.
 Strong Passwords: Requiring users to create and regularly
update strong, complex passwords.
 Physical Security: Protecting data centers and storage
devices from unauthorized access.
 Endpoint Security: Implementing measures to secure devices
that access or store data.
 Security Policies: Establishing clear policies and procedures
for data handling and security.
 Security Awareness Training: Educating users about data
security threats and best practices.
 VPNs: Using Virtual Private Networks to create secure
connections for data transmission.
 Regular Backups: Creating regular backups of important
data to prevent data loss.
 Data Retention Policies: Establishing clear policies for how
long data is stored and how it's securely deleted.
 Monitoring and Logging: Implementing systems to monitor
network activity and log events for security analysis.
End-to-end access control:
End-to-end access control ensures secure access to resources
by verifying user identity (authentication) and granting appropriate permissions
(authorization), protecting data from unauthorized access and breaches. This
involves implementing policies that define who can access what, and employing
mechanisms like multi-factor authentication, least privilege access, and regular
audits.
Key Concepts:
 Authentication: Verifying the identity of a user, often through passwords,
biometrics, or tokens.
 Authorization: Determining what resources a user is permitted to access
based on their identity and predefined policies.
 Access Control Policies: Rules that define who can access what resources
and under what conditions.
 Least Privilege: Granting users only the minimum necessary access
required to perform their tasks.
 Multi-Factor Authentication (MFA): Requiring multiple forms of
verification to enhance security.
 Auditing: Regularly reviewing access logs to identify and address
potential security issues.
 Network Access Control (NAC): A security approach that unifies
endpoint security, user authentication, and network security enforcement.
 Database Access Control: Managing access to sensitive data within
databases to prevent breaches.
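To make authorization and least privilege concrete, here is a minimal role-based access control (RBAC) check in Python. The role names and permission strings are invented for illustration; real systems would typically rely on an IAM service or policy engine.

# Minimal RBAC sketch: roles map to the smallest set of permissions they need.
ROLE_PERMISSIONS = {
    "viewer":  {"report:read"},
    "analyst": {"report:read", "report:export"},
    "admin":   {"report:read", "report:export", "user:manage"},
}

def is_authorized(user_roles, required_permission):
    """Allow the action only if some assigned role explicitly grants it."""
    granted = set().union(*(ROLE_PERMISSIONS.get(r, set()) for r in user_roles))
    return required_permission in granted

print(is_authorized(["viewer"], "report:export"))   # False - least privilege
print(is_authorized(["analyst"], "report:export"))  # True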
Principles of End-to-End Access Control:
1. Principle of Least Privilege:
Grant users only the necessary permissions to perform their tasks.
2. Principle of Separation of Duties:
Divide critical tasks among multiple users to prevent any single individual from
having excessive control.
3. Principle of Defense in Depth:
Implement multiple layers of security controls to provide redundancy and
increase the difficulty for attackers.
4. Principle of Least Astonishment:
Make access control mechanisms intuitive and predictable to minimize user
confusion and errors.
5. Principle of Regular Auditing:
Continuously monitor and review access logs to identify and address security
vulnerabilities.
Benefits of End-to-End Access Control:
 Reduced Risk of Data Breaches: Protects sensitive information from
unauthorized access and misuse.
 Improved Compliance: Helps organizations meet regulatory
requirements for data security.
 Enhanced Security Posture: Strengthens overall security by
implementing a layered approach to access control.
 Streamlined Access Management: Simplifies the process of managing
user access to resources.
 Increased User Productivity: Allows users to access the resources they
need while preventing access to unauthorized resources.

Common attack vectors and threats:


Common attack vectors in cybersecurity include phishing, malware (including
ransomware), social engineering, weak or stolen credentials, unpatched
vulnerabilities, insider threats, and man-in-the-middle (MITM) attacks. These
methods allow attackers to gain unauthorized access to systems and data.

 Phishing: Deceiving individuals into revealing sensitive information, often
through fraudulent emails or websites.
 Malware: Malicious software designed to damage or disable computers and
networks. Examples include viruses, worms, and ransomware.
 Social Engineering: Manipulating individuals into performing actions or
divulging information that compromises security.
 Weak or Stolen Credentials: Attackers can exploit weak passwords or gain
access through compromised accounts, allowing them to bypass security
measures.
 Unpatched Vulnerabilities: Exploiting known security flaws in software or
systems that have not been updated with security patches.
 Insider Threats: Attacks originating from within an organization, either
malicious or unintentional.
 Man-in-the-Middle (MITM) Attacks: Attackers intercept communication
between two parties, potentially eavesdropping on or altering the data
exchanged.
 Cross-Site Scripting (XSS): Injecting malicious scripts into trusted
websites, impacting users who visit those sites.
 SQL Injection: Exploiting vulnerabilities in database queries to access or
manipulate data (a parameterized-query example follows this list).
 Denial of Service (DoS) and Distributed Denial of Service (DDoS)
Attacks: Flooding a system or network with traffic so that legitimate users
cannot access it.
 Session Hijacking: Taking over a user's active session to gain unauthorized
access to their account or system.
 Supply Chain Attacks: Exploiting vulnerabilities in third-party vendors or
suppliers to gain access to a target organization's systems.
 AI and GenAI Attacks: Using AI and generative AI technologies to create
more sophisticated attacks or to exploit vulnerabilities in AI systems
themselves.
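As an example of defending against one of these vectors, the Python snippet below contrasts a SQL-injection-prone query with a parameterized query, using the built-in sqlite3 module. The table and data are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cr3t')")

user_input = "' OR '1'='1"   # a classic injection payload

# Vulnerable: user input is concatenated straight into the SQL string.
unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(unsafe).fetchall())     # returns every row

# Safer: a parameterized query treats the input purely as data.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing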
Network and Storage:
Network-attached storage (NAS) is a file-based storage
architecture that makes stored data accessible to networked devices: multiple
users or client devices retrieve data from a single storage system. Because the
clients connect over a Local Area Network, typically Ethernet, and access data
from a centralized pool of disk capacity, the approach is referred to as
network-attached storage. NAS gives all the connected devices on the network a
single access point for storage, called the NAS storage server.
What is Network Attached Storage Used for?
 Storing and exchanging files
 Data backup and disaster recovery, including active data archives
 Providing a virtual desktop environment
 Testing and developing server-side and web apps
 Streaming torrents and media files
 Saving pictures and movies that need to be accessed frequently
 Establishing a printing repository within the company
Components of NAS:
 Processor: Every NAS has a processor (CPU) at its core, which, together
with the memory, provides the computing power that manages the device
and serves files to clients.
 Network interface: Small NAS devices intended for desktop or single-user
use may support direct connections such as USB or Wi-Fi; larger systems
connect over the network, typically via Ethernet.
 Physical storage: Every NAS requires physical storage, usually in the form
of disk drives. The drives, which frequently accommodate a variety of
storage devices, may be conventional magnetic HDDs, SSDs, or other
non-volatile memory devices.
 Operating System: The OS organizes and controls the NAS hardware and
makes storage accessible to clients, such as users and other applications,
much as it does on a traditional computer.
Key Considerations in Selecting NAS:
 Storage Capacity: Determine the amount of storage you need now and in
the future.
 Scalability: Choose a NAS that offers scalability to deal with future
needs. Look for devices that support expansion options.
 Performance: Assess the performance requirements of your NAS,
including data transfer speeds, read/write speeds, and support for RAID
configurations.
 Data Redundancy and Protection: Ensure that the NAS device offers
robust data protection capabilities, including RAID (Redundant Array of
Independent Disks) levels for data redundancy, snapshots for point-in-time
recovery, and built-in backup and replication abilities to guard against
data loss.
 Data Accessibility and Sharing: Evaluate the NAS device's capabilities for
data accessibility and sharing across multiple devices and platforms.
Types of Network Attached Storage Solutions:
 Home NAS: A personal or small home office (SOHO) class of device.
Such devices are generally cheaper and are commonly characterized by
lower capacity and capability. Examples include the Synology DiskStation
and QNAP TS series.
 Small Business NAS: Has more features and capacity than home-use NAS
and is typically aimed at micro- to small businesses. These devices are
often more capable and may include options such as remote access,
backup solutions, and better throughput. Examples include the Synology
RackStation and the QNAP Turbo NAS series.
 Enterprise NAS: Designed for large organizations that need to process
large volumes of information. These solutions emphasize performance and
add enhancements such as redundant power supplies, high availability,
and improved data management. Examples include Dell PowerStore and
the NetApp FAS series.
 High-Performance NAS: Delivers very high speed and performance,
typically found in environments with heavy data processing or media
workloads; such systems exist at both the business and enterprise level.
Examples include the Synology FS series and QNAP TS-h series.
 RAID-enabled NAS: Uses RAID (Redundant Array of Independent Disks)
to provide redundancy and improve disk performance. Different RAID
levels (for example, RAID 1, RAID 5, and RAID 6) balance data
protection and speed differently.
 Hybrid NAS: Combines NAS capabilities with cloud storage features. This
type of NAS allows data to be copied to off-site cloud storage for backup
and remote access.
 JBOD (Just a Bunch of Disks): A NAS configuration in which each drive
is treated independently of the others, with no redundancy between them.
This is suitable for simpler cases where redundancy is not a concern.
 iSCSI NAS: Connects to external storage devices over a network using
the iSCSI protocol. This type of NAS can be used to expand storage for
servers and is frequently used in environments where block-level storage
is needed.

Benefits of Network Attached Storage:


 Centralized Storage: NAS enables multiple users and endpoints to store
and retrieve data from a central location, which makes sharing data easy.
 Ease of Access: Files stored on a NAS are available over the network;
they can be reached from computers, tablets, and smartphones, locally
and over the internet (depending on the configuration).
 Scalability: Most NAS enclosures are flexible and can have more and
larger drives added to them if the need arises.
 Data Redundancy and Backup: NAS systems on the market today
commonly support RAID (Redundant Array of Independent Disks), which
protects the stored data against the failure of individual disks.
 User Management and Permissions: A great feature of the NAS devices
is the users’ account and permission management; the administrators can
define who should have access and which users can edit files or folders,
which boosts security and organization.
Secure Isolation Strategies:
Secure isolation strategies involve various techniques to protect
systems, networks, and data by creating boundaries and limiting
access. These strategies include physical security, network segmentation,
endpoint security, and application security, among others. Implementing
strong authentication, encryption, and regular monitoring are also crucial
for maintaining a secure environment.

1. Physical Security:
 Restricting physical access to servers, data centers, and other critical
infrastructure is the first line of defense.
 Implementing measures like surveillance, access control systems, and
environmental controls (e.g., temperature, humidity) helps prevent
unauthorized access and physical damage.
2. Network Segmentation:
 Dividing a network into smaller, isolated segments helps limit the impact
of security breaches.
 If one segment is compromised, the attacker's access is restricted to that
segment, preventing them from reaching other critical parts of the
network.
 Techniques include using VLANs, firewalls, and routers to create separate
network zones.
3. Endpoint Security:
 Protecting individual devices (laptops, desktops, smartphones) from
malware and unauthorized access is crucial.
 This includes using firewalls, antivirus software, intrusion prevention
systems (IPS), and endpoint detection and response (EDR) tools.
 Regular software updates and security patching are also vital.
4. Application Security:
 Securing applications involves protecting them from vulnerabilities and
attacks.
 This includes using secure coding practices, input validation, output
encoding, and regular security audits.
 Using web application firewalls (WAFs) to filter malicious traffic and
protect against common web application attacks.
5. Authentication and Authorization:
 Strong authentication mechanisms like multi-factor authentication (MFA)
are essential to verify user identities.
 Implementing role-based access control (RBAC) ensures that users only
have access to the resources they need.
6. Data Encryption:
 Encrypting data both in transit and at rest protects sensitive information
from unauthorized access.
 Use encryption protocols like TLS/SSL for data in transit and encryption
at the database or file level for data at rest.
7. Regular Monitoring and Auditing:
 Continuously monitoring network traffic, system logs, and application
activity helps detect suspicious behavior.
 Regularly auditing systems and applications helps identify vulnerabilities
and security gaps.
8. Browser Isolation:
 This technique isolates web browsing activity in a secure environment,
preventing malware and phishing attacks from affecting the user's device.
 The browser runs in a separate container or virtual machine, and only
sanitized data is sent back to the user's device.
9. Virtual Private Networks (VPNs):
 VPNs create a secure connection over a public network, encrypting traffic
and masking the user's IP address.
 This is particularly useful for remote users connecting to a network over
an insecure public Wi-Fi connection.
VIRTUALIZATION STRATEGIES
Virtualization is the process of creating a virtual representation of hardware,
such as servers, storage, networks, or other physical machines. It allows
multiple virtual machines (VMs) to run on one physical machine, each with its
own operating system and programs. This optimizes hardware efficiency and
flexibility and enables resources to be shared between multiple customers or
organizations.
Virtualization is a key to providing Infrastructure as a Service (IaaS) solutions
for cloud computing, whereby the user has access to remote computing
resources.

Why is Virtualization Important?


Virtualization is important because it lets you get the most out of your computer
or server resources. Think of it as using one physical box as many smaller,
independent "virtual" boxes: each virtual box runs its own programs and stores
its own data, but they all share the same physical hardware.
1. Better use of Resources
Instead of leaving numerous machines underused, virtualization enables
you to host multiple programs or systems on one computer, which is more
efficient.
2. Cost Savings
Companies can save their money on hardware, power, and maintenance by
using less physical equipment.
3. Flexibility
Virtual machines can be easily installed, relocated and resized to suit changing
requirements. If a virtual machine requires more power, it can obtain it rapidly
without requiring new hardware.
What is Data Retention?

Data retention is the practice of keeping information for a particular period of time.


People and businesses do this for different reasons, such as following the law,
ensuring things keep running smoothly, and studying the data. In the age of
digitization, data plays a significant role in making decisions and setting policies
inside any industry. It is also used to analyze customer behavior.

Why is Data Retention Necessary?


Now that we know what data retention is, let's explore why it matters. Data
retention is important for the following reasons:

Legal compliance:
Many industries and educational institutions are legally bound to keep the
records for a certain period. Maintaining and organizing data is essential to
ensure an organization follows the rules and laws that apply to it. This helps the
organization avoid getting into trouble with the law.
Audit and Accountability:
Organizations retaining required data go through the audit process smoothly. It
creates a record of information that can be checked to see if the company is
doing what it's supposed to and following the rules.
Business and Decision-making:
Data plays a significant role in learning from past mistakes and analyzing
future market trends; it strengthens an organization's decision-making
process.
Customer Relationship Management:
Retained customer data enables organizations to understand their clients
better. This, in turn, facilitates personalized services and helps identify
valuable customers, improving customer satisfaction and loyalty.
Resilience to Data Loss:
Retaining essential operational data ensures businesses can recover quickly
in the face of data loss or system failures. It supports business continuity
and minimizes loss.
Security incident investigation:
In a security breach or cyber attack, retained data can be crucial in
identifying potential sources and helping strengthen the system.
Compliance with Financial Regulations:
Organizations, especially in the financial sector, must retain financial
records for audit purposes. This ensures transparency and compliance with
financial regulations.
Core Elements of Data Retention
A strong data retention policy hinges on three crucial elements: data
classification, retention periods, and storage considerations. Let's explore these
core elements of data retention:

A. Data Classification
The first step in effective data retention is understanding what information you
possess. Data classification involves categorizing data based on its sensitivity
and legal requirements. This is vital for several reasons:

Prioritization: Sensitive data, such as financial records or personal information
(e.g., Social Security numbers), requires stricter security measures and may
have shorter retention periods due to privacy regulations.
Compliance: Classifying data helps identify which information falls under
specific regulations, like HIPAA for healthcare data or GDPR for EU user data.
This ensures you comply with mandated retention periods for certain data types.
Risk Management: Understanding the sensitivity of data allows you to
implement appropriate security protocols and minimize the risks associated with
data breaches or unauthorized access.
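A minimal sketch of rule-based data classification in Python: simple patterns (assumed for illustration, not a complete policy) tag records that look like Social Security numbers or email addresses as sensitive, so that stricter retention and security rules can be applied to them.

import re

# Illustrative classification rules; real policies would be far richer.
PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text):
    """Return 'sensitive' if the text contains regulated identifiers."""
    found = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    return ("sensitive", found) if found else ("internal", [])

print(classify("Customer SSN is 123-45-6789"))   # ('sensitive', ['ssn'])
print(classify("Meeting notes from Tuesday"))     # ('internal', [])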
B. Retention Periods
Data retention periods define how long different types of data need to be stored.
These periods can vary significantly depending on:

1. Legal Requirements: Many regulations specify minimum retention periods for
certain data types. For example, tax laws might dictate how long financial
records must be kept.
2. Business Needs: Organizations may need to retain data for internal purposes
like business continuity (disaster recovery) or data analysis. Customer purchase
history can be valuable for understanding buying trends, but may not require
long-term storage.
3. Data Lifecycle: Consider the "natural lifespan" of the data. For instance,
social media posts might only be relevant for a short period, while historical
sales data might be valuable for long-term analysis.
It's crucial to strike a balance between legal compliance, business needs, and
data minimization. Holding onto unnecessary data for extended periods not only
increases storage costs but also creates a larger security footprint.
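The sketch below shows how such retention periods might be applied in practice. The record categories and periods are invented examples for illustration, not regulatory advice.

from datetime import date, timedelta

# Illustrative retention periods per data category (in days).
RETENTION_DAYS = {
    "financial_record": 7 * 365,   # e.g., long retention driven by tax law
    "customer_order":   2 * 365,
    "web_log":          90,
}

def retention_action(category, created_on, today=None):
    """Decide whether a record should be kept, deleted, or reviewed."""
    today = today or date.today()
    limit = RETENTION_DAYS.get(category)
    if limit is None:
        return "review"                      # unclassified data needs review
    expires = created_on + timedelta(days=limit)
    return "delete" if today > expires else "keep"

print(retention_action("web_log", date(2023, 1, 1)))           # likely "delete"
print(retention_action("financial_record", date(2023, 1, 1)))  # "keep"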

C. Storage Considerations
Where you store your data is an essential aspect of data retention. The chosen
method needs to be secure, cost-effective, and scalable to accommodate future
growth. Here's a breakdown of common storage options:
1. On-premise Storage: Data is physically stored on servers located within your
organization's infrastructure. This offers greater control but can be expensive and
require significant IT expertise for maintenance and security.
2. Cloud Storage: Data is stored on remote servers managed by a cloud service
provider. This offers scalability, flexibility, and often lower upfront costs.
However, it's crucial to choose a reputable provider with robust security
measures.
3. Hybrid Storage: A combination of on-premise and cloud storage provides a
balance between control, security, and scalability. You can store sensitive data
on-site and less critical data in the cloud.

Data Deletion Procedures

1. Deletion Protocols: Establish protocols for securely deleting tenant data that
is no longer needed. This includes ensuring that data is irrecoverable after
deletion to protect against unauthorized access.
2. Grace Periods: For services like Microsoft 365, a grace period is often
provided after subscription termination, during which data can still be accessed.
For example, Microsoft retains customer data for 90 days after a subscription
ends, after which it is deleted.
3. Compliance with Regulations: Ensure that deletion practices comply with
relevant regulations, which may dictate specific requirements for data disposal
and retention.

Data Archiving Procedures

1. Archiving Policies: Develop archiving policies that specify when data
becomes inactive and should be moved to secondary storage. This helps manage
storage resources effectively while retaining necessary data for compliance and
historical reference.
2. Secure Storage: Ensure that archived data is stored securely, with access
controls in place to prevent unauthorized access. This is particularly important
for sensitive tenant data.
3. Periodic Review: Regularly review archived data to determine whether it can
be deleted or whether it still needs to be retained based on current business
needs and regulatory requirements.

What is Data Encryption?


Data encryption is the process of converting readable information (plaintext)
into an unreadable format (ciphertext) to protect it from unauthorized access. It
is a method of preserving data confidentiality by transforming it into ciphertext,
which can only be decoded using a unique decryption key produced at the time
of the encryption or before it. The conversion of plaintext into ciphertext is
known as encryption. By using encryption keys and mathematical algorithms,
the data is scrambled so that anyone intercepting it without the proper key
cannot understand the contents. When the intended recipient receives the
encrypted data, they use the matching decryption key to return it to its original,
readable form. This approach ensures that sensitive information such as
personal details, financial data, or confidential communications remains secure
as it travels over networks or is stored on devices.

How Does Encryption Work?

When data or information is shared over the internet, it passes
through a number of network devices around the world that form part of the
public internet. Data transmitted over the open internet therefore runs the risk
of being stolen or intercepted by hackers. To guarantee the safe transfer of
data, users can install particular hardware or software; in network security, this
is achieved through encryption. The process of transforming plaintext into
ciphertext is called encryption.

Consider an original, readable message (plaintext) such as
"GeeksforGeeks." Before sending it over a network, the sender uses an
encryption key and an encryption process to convert this readable message into
a scrambled, unreadable format known as ciphertext (something like
"KGifuT+us0="). This ciphertext travels across the internet, so if someone
intercepts it, they cannot understand it without the key. When the ciphertext
reaches the intended recipient, they use the matching decryption key and a
decryption process to turn the unreadable ciphertext back into the original,
readable message "GeeksforGeeks." In short, encryption and decryption ensure
that only authorized parties with the correct keys can access the information in
its original form.
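The same plaintext-to-ciphertext round trip can be demonstrated with the third-party Python cryptography package (a minimal sketch, assuming the package is installed). It uses symmetric encryption, so a single shared key both encrypts and decrypts.

from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the shared secret key
cipher = Fernet(key)

plaintext = b"GeeksforGeeks"
ciphertext = cipher.encrypt(plaintext)      # unreadable without the key
print(ciphertext)

recovered = cipher.decrypt(ciphertext)      # only key holders can do this
print(recovered.decode())                   # -> GeeksforGeeks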

What Is Data Redaction?

Data redaction is a text analysis technique that helps you safeguard sensitive
data and keep it from being compromised. You can remove selected information
from documents to prevent data exposure. This is usually done manually by
people in an office. However, when the number of documents is large, say one
million, it becomes extremely difficult for a person to handle all of it.

In such cases, advanced analytics techniques such as Named Entity


Recognition can automate the complete redaction of data from documents.

Redaction is commonly understood as blacking out information.


However, it is easier said than done, especially when uploading documents
online. One famous example is the debacle by the New South Wales Medical
Council in 2016.

The staff at the institution blacked out the person’s name before uploading the
document. However, the person’s identity remained in the underlying data
linked with the search engine results. Removing information that had already
gone out was not easy. The medical council team had to contact Google to fix
the issue.

Data Redaction Examples


Data redaction examples can be plentiful, depending on the masked
information. Let us look at them in detail.

 Complete Redaction: It involves redacting the entire content of a
document. Character data may be reduced to a single space, and numerical
values usually get redacted to zero.
 Half Redaction: You can redact a small portion of the data in the
document. For example, you can mask the last six digits of customers'
mobile numbers, so a number appears as 7023XXXXXX.
 Random Redaction: It displays random values to users each time they
view a document. The values would depend on the type of underlying
information in the record.
 Regular Expressions: It identifies patterns to redact data. Redacting
email addresses that can have varying character lengths is a typical
example.
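As a small illustration of the regular-expression approach, the Python sketch below masks email addresses entirely and keeps only the first four digits of ten-digit phone numbers. The patterns are simplified examples, not production-grade rules.

import re

def redact(text):
    """Mask emails and all but the first four digits of 10-digit phone numbers."""
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[REDACTED EMAIL]", text)
    text = re.sub(r"\b(\d{4})\d{6}\b", lambda m: m.group(1) + "XXXXXX", text)
    return text

print(redact("Reach Priya at priya@example.com or 7023456789."))
# -> Reach Priya at [REDACTED EMAIL] or 7023XXXXXX.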

What is Tokenization?
Tokenization is a fundamental process in NLP where text is broken into smaller
units called tokens — such as words, characters, or sub-words — making it
easier for machines to analyze and understand language.
For example:
Sentence: "Chatbots are helpful."
Word Tokenization: ["Chatbots", "are", "helpful"]
Character Tokenization: ["C", "h", "a", "t", "b", "o", "t", "s", " ", "a", "r",
"e", " ", "h", "e", "l", "p", "f", "u", "l"]

Types of Tokenization
 Word Tokenization: Splits text into individual words.
 Character Tokenization: Splits text into characters; used in fine-grained
analysis.
 Sub-word Tokenization: Breaks words into smaller units; balances word
and character level.
 Sentence Tokenization: Splits large text into separate sentences.
 N-gram Tokenization: Splits text into fixed-size sequences of tokens
(n-grams).

Use Cases of Tokenization


- Information Retrieval: Indexing and fast search matching.
- Search Engines: Enhances query understanding and response accuracy.
- Machine Translation: Translates text across languages while preserving
meaning.
- Speech Recognition: Breaks voice commands into tokens for processing.
Challenges in Tokenization
- Ambiguity: Example: “I saw her duck” – multiple meanings.
- No Word Boundaries: Languages like Chinese, Japanese make tokenization
harder.
- Special Characters: URLs, emails, punctuation (e.g., [email protected])
complicate token interpretation.
Tools & Libraries for Tokenization
 NLTK: Offers sentence and word tokenizers; beginner-friendly.
 SpaCy: Fast and multilingual; great for production-grade NLP.
 BERT Tokenizer: Context-aware; suitable for deep learning models.
 Byte-Pair Encoding (BPE): Efficient for morphologically rich languages.
 SentencePiece: Unsupervised and language-independent sub-word
tokenizer.

Types of Tokenization:

Word Tokenization: This is the most common method where text is divided
into individual words. It works well for languages with clear word boundaries,
like English.

Character Tokenization: In this method, text is split into individual characters.


This is particularly useful for languages without clear word boundaries or for
tasks that require a detailed analysis, such as spelling correction.

Sub-word Tokenization: Sub-word tokenization strikes a balance between


word and character tokenization by breaking down text into units that are larger
than a single character but smaller than a full word.

Sentence Tokenization: Sentence tokenization is a common technique used to
divide paragraphs or large bodies of text into separate sentences as tokens.

N-gram Tokenization: N-gram tokenization splits text into fixed-size sequences
of n consecutive tokens.

What is obfuscation and how does it work?

Obfuscation means to make something difficult to understand. Programming


code is often obfuscated to protect intellectual property or trade secrets, and to
prevent an attacker from reverse engineering a proprietary software program.

Encrypting some or all of a program's code is one obfuscation method. Other


approaches include stripping out potentially revealing metadata,
replacing class and variable names with meaningless labels, and adding unused
or meaningless code to an application script. A tool called an obfuscator will
automatically convert straightforward source code into a program that works as
intended, but is more difficult to read, understand and, therefore, compromise
by potentially malicious parties.

Unfortunately, malicious code writers also use these methods to prevent their
attack mechanisms from being detected by antimalware or antivirus tools.
The 2020 SolarWinds attack is an example of hackers using obfuscation to
evade defenses and launch a successful cyberattack.

Why use code obfuscation


Obfuscation in computer code uses complex roundabout phrases and redundant
logic to make the code difficult for the reader to understand, while maintaining
its inherent functionality. The reader might be a person (a genuine user or
a cyberthreat actor), a computing device or another program (e.g., malware).
The goal is to distract the reader with the complicated syntax and make it
difficult for them to parse the message and determine its true content. If the
code is too complex to understand, it becomes harder to reverse engineer the
application.

How does obfuscation work?


Code obfuscation is about making the code's delivery method and presentation
more confusing. In doing so, it prevents unauthorized parties and cybercriminals
from getting into the code to modify it for their own purposes.

Obfuscation might involve changing the content of the original code by adding
dummy code, renaming variables, changing the logical structure, replacing
simple arithmetic expressions with their complex equivalents and so on. But
even with these changes, obfuscation does not alter how the program works,
nor does it modify its end output. Rather, its main purpose is to make
reverse engineering difficult.

The following is an example snippet of normal JavaScript code:

var greeting = 'Hello World';

greeting = 10;
var product = greeting * greeting;

That same snippet in obfuscated form looks like this:


var
_0x154f=['98303fgKsLC','9koptJz','1LFqeWV','13XCjYtB','6990QlzuJn','8726
0lXoUxl','2HvrLBZ','15619aDPIAh','1kfyliT','80232AOCrXj','2jZAgwY','1825
93oBiMFy','1lNvUId','131791JfrpUY'];var
_0x52df=function(_0x159d61,_0x12b953){_0x159d61=_0x159d61-0x122;var
_0x154f4b=_0x154f[_0x159d61];return _0x154f4b;};
(function(_0x19e682,_0x2b7215){var _0x5e377c=_0x52df;while(!![]){try{var
_0x2d3a87=-parseInt(_0x5e377c(0x129))*parseInt(_0x5e377c(0x123))+-
parseInt(_0x5e377c(0x125))*parseInt(_0x5e377c(0x12e))
+parseInt(_0x5e377c(0x127))*-parseInt(_0x5e377c(0x126))+-
parseInt(_0x5e377c(0x124))*-parseInt(_0x5e377c(0x12f))+-
parseInt(_0x5e377c(0x128))*-parseInt(_0x5e377c(0x12b))
+parseInt(_0x5e377c(0x12a))*parseInt(_0x5e377c(0x12d))
+parseInt(_0x5e377c(0x12c))*parseInt(_0x5e377c(0x122));if(_0x2d3a87===_
0x2b7215)break;else _0x19e682['push'](_0x19e682['shift']
());}catch(_0x22c179){_0x19e682['push'](_0x19e682['shift']());}}}
(_0x154f,0x1918c));var greeting='Hello\x20World';greeting=0xa;var
product=greeting*greeting;

The obfuscated version is nearly impossible for the human eye to follow.

Obfuscation techniques
Obfuscation involves several different methods. Often, multiple techniques are
used to create a layered effect. In fact, it is recommended to use more than one
obfuscation technique because there is no single "silver bullet" to avert
cyberattacks involving application reverse engineering or code theft. Using
multiple methods better hardens the code and provides a higher level of
protection to safeguard sensitive data and prevent application reverse
engineering.

Some of the most common obfuscation techniques are:

 Renaming. The obfuscator alters the methods and names of variables. The
new names might include undecipherable, unprintable or invisible characters.
This method is commonly used by Java, iOS, Android and .NET obfuscators.
 Packing. This compresses the entire program to make the code unreadable.
 Control flow. The code's logical structure is changed to make it less
traceable. The decompiled code yields nondeterministic semantic results and
is made to look like spaghetti logic. This logic is unstructured and therefore
hard for a hacker to understand or take advantage of.
 Instruction pattern transformation. This approach takes common
instructions created by the compiler and swaps them for more complex, less
common instructions that effectively perform the same operations while also
hardening the code.
 Arithmetic and logical expression transformation. Simple arithmetic and
logical expressions are replaced with complex equivalents that are hard to
understand.
 Dummy code insertion. Dummy or ancillary code can be added to a
program to make it harder to read and reverse engineer. Like the other
obfuscation methods, doing this does not affect the program's logic or
outcome.
 Metadata or unused code removal. Metadata provides extra information
about the program, much like annotations on a Word document that can help
readers to understand its content, history, creator and so forth. Removing
metadata as well as unused code leaves a hacker with less information about
the program and its code, reducing the likelihood that they will be able to
understand its logic for reverse engineering purposes. This technique can also
improve the application's runtime performance.
 Binary linking. Combining multiple input executables or libraries into one
(or more) output binaries reduces the amount of information available to
cybercriminals for possible application exploitation. It also makes the
application smaller and simplifies its deployment.
 Opaque predicate insertion. A predicate in code is a logical expression that
is either true or false. Opaque predicates are conditional branches -- or if-then
statements -- where the results cannot easily be determined with statistical
analysis. Inserting an opaque predicate introduces unnecessary code that is
never executed but might be puzzling to someone trying to understand the
decompiled output.
 String encryption. This method uses encryption to hide the strings in the
executable and only restores their values when they are needed to run the
program. This makes it difficult to go through a program and search for
particular strings. That said, decrypting strings at runtime can adversely
impact runtime performance, although the effect is usually quite small.
 Code transposition. This is the reordering of routines and branches in the
code without having a visible effect on its behavior.

What is PKI? A Public Key Infrastructure Definitive Guide


Public key infrastructure (PKI) governs the issuance of digital certificates to
protect sensitive data, provide unique digital identities for users, devices and
applications and secure end-to-end communications.

Today, organizations rely on PKI to manage security through encryption.
Specifically, the most common form of encryption used today involves a public
key, which anyone can use to encrypt a message, and a private key (also known
as a secret key), which only one person should be able to use to decrypt those
messages. These keys can be used by people, devices, and applications.

PKI security first emerged in the 1990s to help govern encryption keys through the
issuance and management of digital certificates. These PKI certificates verify
the owner of a private key and the authenticity of that relationship going
forward to help maintain security. The certificates are akin to a driver’s license
or passport for the digital world.

How Does PKI Work?

To understand how PKI works, it’s important to go back to the basics that govern
encryption in the first place. With that in mind, let’s dive into cryptographic
algorithms and digital certificates.

Building Blocks of Public Key Cryptography

Cryptographic algorithms are defined, highly complex mathematical formulas used to


encrypt and decrypt messages. They are also the building blocks of PKI
authentication. These algorithms range in complexity and the earliest ones pre-
date modern technology.

Symmetric Encryption

Symmetric encryption is a simple cryptographic algorithm by today’s standards,


however, it was once considered state of the art. In fact, the German army used
it to send private communications during World War II. The movie The
Imitation Game actually does quite a good job of explaining how symmetric
encryption works and the role it played during the war.

With symmetric encryption, a message that gets typed in plain text goes through
mathematical permutations to become encrypted. The encrypted message is
difficult to break because the same plain text letter does not always come out
the same in the encrypted message. For example, the message “HHH” would
not encrypt to three of the same characters.

To both encrypt and decrypt the message, you need the same key, hence the name
symmetric encryption. While decrypting messages is exceedingly difficult
without the key, the fact that the same key must be used to encrypt and decrypt
the message carries significant risk. That’s because if the distribution channel
used to share the key gets compromised, the whole system for secure messages
is broken.

Asymmetric Encryption

Asymmetric encryption, or asymmetrical cryptography, solves the exchange problem
that plagued symmetric encryption. It does so by creating two different
cryptographic keys (hence the name asymmetric encryption) — a private key
and a public key.

With asymmetric encryption, a message still goes through mathematical permutations


to become encrypted but requires a private key (which should be known only to
the recipient) to decrypt and a public key (which can be shared with anyone) to
encrypt a message.

Here’s how this works in action:

 Alice wants to send a private message to Bob, so she uses Bob’s public key to
generate encrypted ciphertext that only Bob’s private key can decrypt.
 Because only Bob’s private key can decrypt the message, Alice can send it
knowing that no one else can read it — not even an eavesdropper — so long as
Bob is careful that no one else has his private key.

Asymmetric encryption also makes it possible to take other actions that are harder to
do with symmetric encryption, like digital signatures, which work as follows:

 Bob can send a message to Alice and encrypt a signature at the end using his
private key.
 When Alice receives the message, she can use Bob’s public key to verify two
things:
o Bob, or someone with Bob’s private key, sent the message
o The message was not modified in transit, because if it does get modified the
verification will fail

In both of these examples, Alice has not generated her own key. Just with a public key
exchange, Alice can send encrypted messages to Bob and verify documents that
Bob has signed. Importantly, these actions are only one-way. To reverse the
actions so Bob can send private messages to Alice and verify her signature,
Alice would have to generate her own private key and share the corresponding
public key.

Today, there are three popular mathematical properties used to generate private
and public keys: RSA, ECC, and Diffie-Hellman. Each uses different
algorithms to generate encryption keys but they all rely on the same basic
principles as far as the relationship between the public key and private key.

Let’s look at the RSA 2048 bit algorithm as an example. This algorithm
randomly generates two prime numbers that are each 1024 bits long and then
multiplies them together. The answer to that equation is the public key, while
the two prime numbers that created the answer are the private key.

This approach works because it’s extremely difficult to reverse the computation
when it involves two prime numbers of that size, making it relatively easy to
compute the public key from the private key but nearly impossible to compute
the private key from the public key.
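A minimal sketch of this key relationship using the third-party Python cryptography package (assuming it is installed): generate an RSA-2048 key pair, encrypt with the public key, and decrypt with the private key.

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Bob generates a key pair; the public half can be shared with anyone.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Alice encrypts with Bob's public key...
ciphertext = public_key.encrypt(b"hello Bob", oaep)

# ...and only Bob's private key can recover the plaintext.
print(private_key.decrypt(ciphertext, oaep).decode())   # -> hello Bob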
