Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
2 views4 pages

1

Uploaded by

dhaanesh.suresh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views4 pages

1

Uploaded by

dhaanesh.suresh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

SQL DATA STANDARDIZATION DOCUMENTATION

DNA LEARNING (WEEKEND - 1)​

STRING FUNCTIONS

1.​ TRIM() - removes trailing and leading spaces from the string
2.​ LTRIM() - removes trailing or leading space from the beginning of the string
3.​ RTRIM() - removes trailing or leading spaces from the end of the string
4.​ UPPER() - Capitalizes the string
5.​ LOWER() - Makes the string characters into small letters
6.​ LENGTH() - finds length of string
7.​ SUBSTRING() - Extracts the part of the string

SUBSTRING(‘john’, 1, 3) - (string, start position, no of characters to be extracted)

Ans: joh

8.​ INSTR() - Gives the position of character in the string.

INSTR([email protected], ‘@’)

Ans : 16

9.​ CHAR_LENGTH() - Finds no of characters in the string


10.​CONCAT() - Adds two strings

11.​REGEXP_REPLACE() - Removes unwanted characters

​ REGEXP_REPLACE(728973jh89u23x534, ‘^[0-9]’, ‘’)



​ Ans: 7289738923534

12.​LPAD - Adds any character from beginning of string to make it to a specific length

​ LPAD(555, 6, 0) - Given number is 555, and result length should be 6 padded up


​ With 0

​ Ans : 000555

13.​RPAD - Adds any character from the end of string

​ RPAD(555, 6, 9) - Given number is 555, and result length should be 6 padded up

​ ​ ​ ​ ​ ​ ​ ​ ​ ​ S DHAANESH
​ With 9

​ Ans : 555999

SQL DATA ANALYSIS (CUSTOMER DATASET)

Dataset - https://github.com/Dhaanesh26/data_standardisation/blob/main/datasets/customer.csv

Q1. Remove extra spaces and fix casing in names

SELECT TRIM(UPPER(CustomerName))

Q2. Standardize phone number formats

SELECT PhoneNumber,
CONCAT(‘+1’,
RIGHT(REGEXP_REPLACE(SUBSTRING_INDEX(PhoneNumber,‘x’,1),‘[^0-9]’, ‘’)), 10)

Q3. Check email if @ is present

SELECT Email,
​ CASE
​ ​ WHEN INSTR(Email, ‘@’) > 0 THEN ‘Valid’
​ ​ ELSE ‘Inavalid’
​ END AS email_status
FROM customer

Q4. Query the domain name from email

SELECT Email, SUBSTRING(Email, INSTR(Email, ‘@’) + 1) AS domain_name


FROM customer

Q5. Convert string to proper date format

SELECT STR_TO_DATE(CreatedDate, ‘%Y-%m-%d’) AS standardised_signup_date


FROM customer

Q6. Standardize categorical data using CASE statements

SELECT isActive,
​ CASE
​ ​ WHEN UPPER(isActive) IN (‘TRUE’, ‘T’) THEN ‘True’
​ ​ ELSE ‘False’

​ ​ ​ ​ ​ ​ ​ ​ ​ ​ S DHAANESH
​ END AS customer_status
FROM customer;

Q7. Consistent granularity for analysis (Extra)

SELECT DATE_FORMAT(CreateDate, ‘%Y-%m’) AS month, SUM(sales) AS total_sales


FROM customer
GROUP BY DATE_FORMAT(CreateDate, ‘%Y-%m’)
ORDER BY month

Note : Why not use AS name in group by as well?. It can be used because, GROUP BY clause
is performed before the select statement, so the month Alias is not yet created to be used.
Hence full expression is used in the group by clause.

THEORY

PARTITION BY

It is used in window functions to split the result into groups of rows and perform calculations
within the group, without collapsing like how GROUP BY does.

EXAMPLE TABLE

Q. Employee sales by city and region

SELECT EmpName, City, Region, Sales,


​ SUM(sales) OVER (PARTITION BY City) AS total_city_sales,
​ SUM(sales) OVER(PARTITION BY Region) AS total_region_sales,
​ RANK() OVER(PARTITION BY City ORDER BY Sales) AS city_sales_rank
FROM employees;

​ ​ ​ ​ ​ ​ ​ ​ ​ ​ S DHAANESH
Difference between GROUP BY & PARTITION BY

GROUP BY PARTITION BY

Aggregate data into summary results Apply window functions without collapsing

Rows are collapsed, returns one row per Rows are not collapsed, returns one row per
group row group

Aggregate functions are used (SUM, MIN, Windows functions are used (RANK(),
MAX, AVG) ROW_NUMBER()

​ ​ ​ ​ ​ ​ ​ ​ ​ ​ S DHAANESH

You might also like