Column Methods
Column.__getattr__(item) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.__getitem__(k) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.alias(*alias, **kwargs) Returns this column aliased with a new name or
names (in the case of expressions that return
more than one column, such as explode).
Column.asc() Returns a sort expression based on the
ascending order of the column.
Column.asc_nulls_first() Returns a sort expression based on the
ascending order of the column, and null values
return before non-null values.
Column.asc_nulls_last() Returns a sort expression based on the
ascending order of the column, and null values
appear after non-null values.
Column.astype(dataType) astype() is an alias for cast().
Column.between(lowerBound, upperBound) True if the current column is between the lower bound and upper bound, inclusive.
Column.bitwiseAND(other) Compute bitwise AND of this expression with
another expression.
Column.bitwiseOR(other) Compute bitwise OR of this expression with
another expression.
Column.bitwiseXOR(other) Compute bitwise XOR of this expression with
another expression.
Column.cast(dataType) Casts the column into type dataType.
Column.contains(other) Contains the other element.
Column.desc() Returns a sort expression based on the
descending order of the column.
Column.desc_nulls_first() Returns a sort expression based on the
descending order of the column, and null values
appear before non-null values.
Column.desc_nulls_last() Returns a sort expression based on the
descending order of the column, and null values
appear after non-null values.
Column.dropFields(*fieldNames) An expression that drops fields in StructType by
name.
Column.endswith(other) String ends with.
Column.eqNullSafe(other) Equality test that is safe for null values.
Column.getField(name) An expression that gets a field by name in a
StructType.
Column.getItem(key) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.ilike(other) SQL ILIKE expression (case insensitive LIKE).
Column.isNotNull() True if the current expression is NOT null.
Column.isNull() True if the current expression is null.
Column.isin(*cols) A boolean expression that is evaluated to true if
the value of this expression is contained by the
evaluated values of the arguments.
Column.like(other) SQL like expression.
Column.name(*alias, **kwargs) name() is an alias for alias().
Column.otherwise(value) Evaluates a list of conditions and returns one of
multiple possible result expressions.
Column.over(window) Define a windowing column.
Column.rlike(other) SQL RLIKE expression (LIKE with Regex).
Column.startswith(other) String starts with.
Column.substr(startPos, length) Return a Column which is a substring of the
column.
Column.when(condition, value) Evaluates a list of conditions and returns one of
multiple possible result expressions.
Column.withField(fieldName, col) An expression that adds/replaces a field in StructType by name.
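A minimal sketch showing several of these Column methods together (the DataFrame, data, and column names are illustrative; a running SparkSession is assumed):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "tag"])
    df.select(
        df.id.cast("double").alias("id_dbl"),      # cast() + alias()
        df.tag.isNull().alias("tag_missing"),      # null test
        df.id.between(1, 2).alias("in_range"),     # inclusive bounds
        F.when(df.id > 1, "big").otherwise("small").alias("size"),
    ).orderBy(df.id.desc_nulls_last()).show()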
Functions
Normal Functions
col(col) Returns a Column based on the given column name.
column(col) Returns a Column based on the given column name.
lit(col) Creates a Column of literal value.
broadcast(df) Marks a DataFrame as small enough for use in broadcast
joins.
coalesce(*cols) Returns the first column that is not null.
input_file_name() Creates a string column for the file name of the current
Spark task.
isnan(col) An expression that returns true if the column is NaN.
isnull(col) An expression that returns true if the column is null.
monotonically_increasing_id() A column that generates monotonically increasing 64-bit integers.
nanvl(col1, col2) Returns col1 if it is not NaN, or col2 if col1 is NaN.
rand([seed]) Generates a random column with independent and
identically distributed (i.i.d.) samples uniformly distributed
in [0.0, 1.0).
randn([seed]) Generates a column with independent and identically
distributed (i.i.d.) samples from the standard normal
distribution.
spark_partition_id() A column for partition ID.
when(condition, value) Evaluates a list of conditions and returns one of multiple
possible result expressions.
bitwise_not(col) Computes bitwise not.
bitwiseNOT(col) Computes bitwise not.
expr(str) Parses the expression string into the column that it represents.
greatest(*cols) Returns the greatest value of the list of column names,
skipping null values.
least(*cols) Returns the least value of the list of column names,
skipping null values.
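A brief sketch of a few of these functions in use (illustrative data; coalesce() picks the first non-null argument, lit() wraps a literal):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "tag"])
    df.select(
        F.coalesce("tag", F.lit("unknown")).alias("tag_filled"),
        F.when(F.col("id") > 1, "big").otherwise("small").alias("size"),
        F.greatest(F.col("id"), F.lit(0)).alias("non_neg"),
        F.monotonically_increasing_id().alias("row_id"),
    ).show()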
Math Functions
sqrt(col) Computes the square root of the specified float value.
abs(col) Computes the absolute value.
acos(col) Computes inverse cosine of the input column.
acosh(col) Computes inverse hyperbolic cosine of the input column.
asin(col) Computes inverse sine of the input column.
asinh(col) Computes inverse hyperbolic sine of the input column.
atan(col) Computes inverse tangent of the input column.
atanh(col) Computes inverse hyperbolic tangent of the input column.
atan2(col1, col2) Computes the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
bin(col) Returns the string representation of the binary value of the given
column.
cbrt(col) Computes the cube-root of the given value.
ceil(col) Computes the ceiling of the given value.
conv(col, fromBase, toBase) Convert a number in a string column from one base to another.
cos(col) Computes cosine of the input column.
cosh(col) Computes hyperbolic cosine of the input column.
cot(col) Computes cotangent of the input column.
csc(col) Computes cosecant of the input column.
exp(col) Computes the exponential of the given value.
expm1(col) Computes the exponential of the given value minus one.
factorial(col) Computes the factorial of the given value.
floor(col) Computes the floor of the given value.
hex(col) Computes hex value of the given column, which could be
pyspark.sql.types.StringType,
pyspark.sql.types.BinaryType,
pyspark.sql.types.IntegerType or
pyspark.sql.types.LongType.
unhex(col) Inverse of hex.
hypot(col1, col2) Computes sqrt(a^2 + b^2) without intermediate overflow or
underflow.
log(arg1[, arg2]) Returns the first argument-based logarithm of the second argument; with only one argument, returns the natural logarithm of that argument.
log10(col) Computes the logarithm of the given value in Base 10.
log1p(col) Computes the natural logarithm of the “given value plus one”.
log2(col) Returns the base-2 logarithm of the argument.
pmod(dividend, divisor) Returns the positive value of dividend mod divisor.
pow(col1, col2) Returns the value of the first argument raised to the power of the
second argument.
rint(col) Returns the double value that is closest in value to the argument
and is equal to a mathematical integer.
round(col[, scale]) Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
bround(col[, scale]) Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
sec(col) Computes secant of the input column.
shiftleft(col, numBits) Shift the given value numBits left.
shiftright(col, numBits) (Signed) shift the given value numBits right.
shiftrightunsigned(col, numBits) Unsigned shift the given value numBits right.
signum(col) Computes the signum of the given value.
sin(col) Computes sine of the input column.
sinh(col) Computes hyperbolic sine of the input column.
tan(col) Computes tangent of the input column.
tanh(col) Computes hyperbolic tangent of the input column.
toDegrees(col) Deprecated alias for degrees().
degrees(col) Converts an angle measured in radians to an approximately
equivalent angle measured in degrees.
toRadians(col) Deprecated alias for radians().
radians(col) Converts an angle measured in degrees to an approximately
equivalent angle measured in radians.
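For example (illustrative values):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(4.0,), (9.0,)], ["x"])
    df.select(
        F.sqrt("x").alias("root"),
        F.pow("x", 2).alias("squared"),
        F.round(F.log("x"), 3).alias("ln_3dp"),   # natural log, 3 decimals
        F.degrees(F.atan2(F.lit(1.0), "x")).alias("angle_deg"),
    ).show()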
Datetime Functions
add_months(start, months) Returns the date that is months months after
start.
current_date() Returns the current date at the start of query
evaluation as a DateType column.
current_timestamp() Returns the current timestamp at the start of
query evaluation as a TimestampType column.
date_add(start, days) Returns the date that is days days after start.
date_format(date, format) Converts a date/timestamp/string to a value of
string in the format specified by the date format
given by the second argument.
date_sub(start, days) Returns the date that is days days before start.
date_trunc(format, timestamp) Returns timestamp truncated to the unit
specified by the format.
datediff(end, start) Returns the number of days from start to end.
dayofmonth(col) Extract the day of the month of a given
date/timestamp as integer.
dayofweek(col) Extract the day of the week of a given
date/timestamp as integer.
dayofyear(col) Extract the day of the year of a given
date/timestamp as integer.
second(col) Extract the seconds of a given date as integer.
weekofyear(col) Extract the week number of a given date as
integer.
year(col) Extract the year of a given date/timestamp as
integer.
quarter(col) Extract the quarter of a given date/timestamp as
integer.
month(col) Extract the month of a given date/timestamp as
integer.
last_day(date) Returns the last day of the month which the
given date belongs to.
localtimestamp() Returns the current timestamp without time
zone at the start of query evaluation as a
timestamp without time zone column.
minute(col) Extract the minutes of a given timestamp as
integer.
months_between(date1, date2[, roundOff]) Returns number of months between dates date1 and date2.
next_day(date, dayOfWeek) Returns the first date which is later than the value of the date column, based on the second dayOfWeek argument.
hour(col) Extract the hours of a given timestamp as
integer.
make_date(year, month, day) Returns a column with a date built from the year,
month and day columns.
from_unixtime(timestamp[, format]) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp([timestamp, format]) Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix timestamp (in seconds), using the default timezone and the default locale; returns null if failed.
to_timestamp(col[, format]) Converts a Column into
pyspark.sql.types.TimestampType using the
optionally specified format.
to_date(col[, format]) Converts a Column into
pyspark.sql.types.DateType using the
optionally specified format.
trunc(date, format) Returns date truncated to the unit specified by
the format.
from_utc_timestamp(timestamp, tz) This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
to_utc_timestamp(timestamp, tz) This is a common function for databases
supporting TIMESTAMP WITHOUT TIMEZONE.
window(timeColumn, windowDuration[, …]) Bucketize rows into one or more time windows given a timestamp specifying column.
session_window(timeColumn, gapDuration) Generates session window given a timestamp specifying column.
timestamp_seconds(col) Converts the number of seconds from the Unix
epoch (1970-01-01T00:00:00Z) to a timestamp.
window_time(windowColumn) Computes the event time from a window
column.
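A short sketch combining a few of the date helpers (illustrative date):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2024-01-31",)], ["d"]).select(F.to_date("d").alias("d"))
    df.select(
        F.add_months("d", 1).alias("next_month"),
        F.date_format("d", "yyyy-MM").alias("ym"),
        F.dayofweek("d").alias("dow"),
        F.datediff(F.current_date(), "d").alias("age_days"),
    ).show()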
Collection Functions
array(*cols) Creates a new array column.
array_contains(col, value) Collection function: returns null if the array is
null, true if the array contains the given value,
and false otherwise.
arrays_overlap(a1, a2) Collection function: returns true if the arrays
contain any common non-null element; if not,
returns null if both the arrays are non-empty
and any of them contains a null element;
returns false otherwise.
array_join(col, delimiter[, null_replacement]) Concatenates the elements of column using the delimiter.
create_map(*cols) Creates a new map column.
slice(x, start, length) Collection function: returns an array containing
all the elements in x from index start (array
indices start at 1, or from the end if start is
negative) with the specified length.
concat(*cols) Concatenates multiple input columns together
into a single column.
array_position(col, value) Collection function: Locates the position of the
first occurrence of the given value in the given
array.
element_at(col, extraction) Collection function: Returns element of array at given index in extraction if col is array, or value for the given key in extraction if col is map.
array_append(col, value) Collection function: returns an array of the elements in col with value appended at the end of the array.
array_sort(col[, comparator]) Collection function: sorts the input array in
ascending order.
array_insert(arr, pos, value) Collection function: adds an item into a given
array at a specified array index.
array_remove(col, element) Collection function: Remove all elements that
equal to element from the given array.
array_distinct(col) Collection function: removes duplicate values
from the array.
array_intersect(col1, col2) Collection function: returns an array of the
elements in the intersection of col1 and col2,
without duplicates.
array_union(col1, col2) Collection function: returns an array of the
elements in the union of col1 and col2, without
duplicates.
array_except(col1, col2) Collection function: returns an array of the
elements in col1 but not in col2, without
duplicates.
array_compact(col) Collection function: removes null values from
the array.
transform(col, f) Returns an array of elements after applying a
transformation to each element in the input
array.
exists(col, f) Returns whether a predicate holds for one or
more elements in the array.
forall(col, f) Returns whether a predicate holds for every
element in the array.
filter(col, f) Returns an array of elements for which a
predicate holds in a given array.
aggregate(col, initialValue, merge[, finish]) Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
zip_with(left, right, f) Merge two given arrays, element-wise, into a
single array using a function.
transform_keys(col, f) Applies a function to every key-value pair in a
map and returns a map with the results of
those applications as the new keys for the
pairs.
transform_values(col, f) Applies a function to every key-value pair in a
map and returns a map with the results of
those applications as the new values for the
pairs.
map_filter(col, f) Returns a map whose key-value pairs satisfy a
predicate.
map_from_arrays(col1, col2) Creates a new map from two arrays.
map_zip_with(col1, col2, f) Merge two given maps, key-wise into a single
map using a function.
explode(col) Returns a new row for each element in the
given array or map.
explode_outer(col) Returns a new row for each element in the
given array or map.
posexplode(col) Returns a new row for each element with
position in the given array or map.
posexplode_outer(col) Returns a new row for each element with
position in the given array or map.
inline(col) Explodes an array of structs into a table.
inline_outer(col) Explodes an array of structs into a table.
get(col, index) Collection function: Returns element of array
at given (0-based) index.
get_json_object(col, path) Extracts json object from a json string based
on json path specified, and returns json string
of the extracted json object.
json_tuple(col, *fields) Creates a new row for a json column
according to the given field names.
from_json(col, schema[, options]) Parses a column containing a JSON string into
a MapType with StringType as keys type,
StructType or ArrayType with the specified
schema.
schema_of_json(json[, options]) Parses a JSON string and infers its schema in
DDL format.
to_json(col[, options]) Converts a column containing a StructType,
ArrayType or a MapType into a JSON string.
size(col) Collection function: returns the length of the
array or map stored in the column.
struct(*cols) Creates a new struct column.
sort_array(col[, asc]) Collection function: sorts the input array in
ascending or descending order according to
the natural ordering of the array elements.
array_max(col) Collection function: returns the maximum
value of the array.
array_min(col) Collection function: returns the minimum value
of the array.
shuffle(col) Collection function: Generates a random
permutation of the given array.
reverse(col) Collection function: returns a reversed string
or an array with reverse order of elements.
flatten(col) Collection function: creates a single array from
an array of arrays.
sequence(start, stop[, step]) Generate a sequence of integers from start to
stop, incrementing by step.
array_repeat(col, count) Collection function: creates an array
containing a column repeated count times.
map_contains_key(col, value) Returns true if the map contains the key.
map_keys(col) Collection function: Returns an unordered
array containing the keys of the map.
map_values(col) Collection function: Returns an unordered
array containing the values of the map.
map_entries(col) Collection function: Returns an unordered
array of all entries in the given map.
map_from_entries(col) Collection function: Converts an array of
entries (key value struct types) to a map of
values.
arrays_zip(*cols) Collection function: Returns a merged array of
structs in which the N-th struct contains all
N-th values of input arrays.
map_concat(*cols) Returns the union of all the given maps.
from_csv(col, schema[, options]) Parses a column containing a CSV string to a
row with the specified schema.
schema_of_csv(csv[, options]) Parses a CSV string and infers its schema in
DDL format.
to_csv(col[, options]) Converts a column containing a StructType
into a CSV string.
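A sketch touching arrays, higher-order functions, and explode (illustrative data; transform() accepts a Python lambda in Spark 3.1+):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 2, 3],)], ["xs"])
    df.select(
        F.array_distinct("xs").alias("uniq"),
        F.array_contains("xs", 2).alias("has_two"),
        F.transform("xs", lambda x: x * 10).alias("scaled"),
        F.explode("xs").alias("x"),    # one output row per element
    ).show()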
Partition Transformation Functions
years(col) Partition transform function: A transform for timestamps and
dates to partition data into years.
months(col) Partition transform function: A transform for timestamps and
dates to partition data into months.
days(col) Partition transform function: A transform for timestamps and
dates to partition data into days.
hours(col) Partition transform function: A transform for timestamps to
partition data into hours.
bucket(numBuckets, col) Partition transform function: A transform for any type that partitions by a hash of the input column.
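These transforms are intended for the DataFrameWriterV2 API rather than ordinary selects. A hedged sketch, assuming hypothetical table names and a catalog that supports partition transforms (e.g. Apache Iceberg):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("catalog.db.raw_events")      # hypothetical source table
    df.writeTo("catalog.db.events").partitionedBy(
        F.years("event_ts"),       # one partition per year of the timestamp
        F.bucket(16, "user_id"),   # plus a 16-bucket hash of user_id
    ).create()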
Aggregate Functions
approxCountDistinct(col[, rsd]) Deprecated alias for approx_count_distinct().
approx_count_distinct(col[, rsd]) Aggregate function: returns a new Column for
approximate distinct count of column col.
avg(col) Aggregate function: returns the average of
the values in a group.
collect_list(col) Aggregate function: returns a list of objects
with duplicates.
collect_set(col) Aggregate function: returns a set of objects
with duplicate elements eliminated.
corr(col1, col2) Returns a new Column for the Pearson
Correlation Coefficient for col1 and col2.
count(col) Aggregate function: returns the number of
items in a group.
count_distinct(col, *cols) Returns a new Column for distinct count of
col or cols.
countDistinct(col, *cols) Returns a new Column for distinct count of
col or cols.
covar_pop(col1, col2) Returns a new Column for the population
covariance of col1 and col2.
covar_samp(col1, col2) Returns a new Column for the sample
covariance of col1 and col2.
first(col[, ignorenulls]) Aggregate function: returns the first value in
a group.
grouping(col) Aggregate function: indicates whether a
specified column in a GROUP BY list is
aggregated or not, returns 1 for aggregated
or 0 for not aggregated in the result set.
grouping_id(*cols) Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + … + grouping(cn).
kurtosis(col) Aggregate function: returns the kurtosis of
the values in a group.
last(col[, ignorenulls]) Aggregate function: returns the last value in
a group.
max(col) Aggregate function: returns the maximum
value of the expression in a group.
max_by(col, ord) Returns the value associated with the
maximum value of ord.
mean(col) Aggregate function: returns the average of
the values in a group.
median(col) Returns the median of the values in a group.
min(col) Aggregate function: returns the minimum
value of the expression in a group.
min_by(col, ord) Returns the value associated with the
minimum value of ord.
mode(col) Returns the most frequent value in a group.
percentile_approx(col, percentage[, accuracy]) Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
product(col) Aggregate function: returns the product of
the values in a group.
skewness(col) Aggregate function: returns the skewness of
the values in a group.
stddev(col) Aggregate function: alias for stddev_samp.
stddev_pop(col) Aggregate function: returns population
standard deviation of the expression in a
group.
stddev_samp(col) Aggregate function: returns the unbiased
sample standard deviation of the expression
in a group.
sum(col) Aggregate function: returns the sum of all
values in the expression.
sum_distinct(col) Aggregate function: returns the sum of
distinct values in the expression.
sumDistinct(col) Aggregate function: returns the sum of
distinct values in the expression.
var_pop(col) Aggregate function: returns the population
variance of the values in a group.
var_samp(col) Aggregate function: returns the unbiased
sample variance of the values in a group.
variance(col) Aggregate function: alias for var_samp.
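A typical groupBy/agg sketch (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["k", "v"])
    df.groupBy("k").agg(
        F.count("v").alias("n"),
        F.avg("v").alias("mean"),
        F.max("v").alias("peak"),
        F.collect_list("v").alias("values"),
    ).show()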
Window Functions
cume_dist() Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
dense_rank() Window function: returns the rank of rows within a
window partition, without any gaps.
lag(col[, offset, default]) Window function: returns the value that is offset rows before the current row, and default if there are fewer than offset rows before the current row.
lead(col[, offset, default]) Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row.
nth_value(col, offset[, ignoreNulls]) Window function: returns the value that is the offset-th row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows.
ntile(n) Window function: returns the ntile group id (from 1 to n
inclusive) in an ordered window partition.
percent_rank() Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
rank() Window function: returns the rank of rows within a
window partition.
row_number() Window function: returns a sequential number starting
at 1 within a window partition.
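Window functions are applied with Column.over() and a Window spec; a minimal sketch with illustrative data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["k", "v"])
    w = Window.partitionBy("k").orderBy("v")
    df.select(
        "k", "v",
        F.row_number().over(w).alias("rn"),
        F.lag("v", 1).over(w).alias("prev_v"),   # null on the first row per key
    ).show()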
Sort Functions
asc(col) Returns a sort expression based on the ascending order of the
given column name.
asc_nulls_first(col) Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
asc_nulls_last(col) Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
desc(col) Returns a sort expression based on the descending order of the given column name.
desc_nulls_first(col) Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last(col) Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
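For example (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(None,), (2,), (1,)], ["v"])
    df.orderBy(F.asc_nulls_last("v")).show()    # 1, 2, null
    df.orderBy(F.desc_nulls_first("v")).show()  # null, 2, 1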
String Functions
ascii(col) Computes the numeric value of the first
character of the string column.
base64(col) Computes the BASE64 encoding of a binary
column and returns it as a string column.
bit_length(col) Calculates the bit length for the specified string
column.
concat_ws(sep, *cols) Concatenates multiple input string columns
together into a single string column, using the
given separator.
decode(col, charset) Computes the first argument into a string from
a binary using the provided character set (one
of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’,
‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(col, charset) Computes the first argument into a binary from
a string using the provided character set (one
of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’,
‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
format_number(col, d) Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string(format, *cols) Formats the arguments in printf-style and
returns the result as a string column.
initcap(col) Translate the first letter of each word to upper
case in the sentence.
instr(str, substr) Locate the position of the first occurrence of
substr column in the given string.
length(col) Computes the character length of string data or
number of bytes of binary data.
lower(col) Converts a string expression to lower case.
levenshtein(left, right) Computes the Levenshtein distance of the two
given strings.
locate(substr, str[, pos]) Locate the position of the first occurrence of
substr in a string column, after position pos.
lpad(col, len, pad) Left-pad the string column to width len with
pad.
ltrim(col) Trim the spaces from left end for the specified
string value.
octet_length(col) Calculates the byte length for the specified
string column.
regexp_extract(str, pattern, idx) Extract a specific group matched by a Java
regex, from the specified string column.
regexp_replace(string, pattern, replacement) Replace all substrings of the specified string value that match regexp with replacement.
unbase64(col) Decodes a BASE64 encoded string column
and returns it as a binary column.
rpad(col, len, pad) Right-pad the string column to width len with
pad.
repeat(col, n) Repeats a string column n times, and returns it
as a new string column.
rtrim(col) Trim the spaces from right end for the specified
string value.
soundex(col) Returns the SoundEx encoding for a string.
split(str, pattern[, limit]) Splits str around matches of the given pattern.
substring(str, pos, len) Substring starts at pos and is of length len
when str is String type or returns the slice of
byte array that starts at pos in byte and is of
length len when str is Binary type.
substring_index(str, delim, count) Returns the substring from string str before
count occurrences of the delimiter delim.
overlay(src, replace, pos[, len]) Overlay the specified portion of src with
replace, starting from byte position pos of src
and proceeding for len bytes.
sentences(string[, language, country]) Splits a string into arrays of sentences, where each sentence is an array of words.
translate(srcCol, matching, replace) Translates any character in srcCol that appears in matching to the corresponding character in replace.
trim(col) Trim the spaces from both ends for the
specified string column.
upper(col) Converts a string expression to uppercase.
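A short sketch of a few string helpers (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  spark sql  ",)], ["s"])
    df.select(
        F.upper(F.trim("s")).alias("shout"),
        F.split(F.trim("s"), " ").alias("words"),
        F.regexp_replace("s", r"\s+", "_").alias("snake"),
    ).show(truncate=False)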
UDF
call_udf(udfName, *cols) Call a user-defined function.
pandas_udf([f, returnType, functionType]) Creates a pandas user defined function (a.k.a. vectorized user defined function).
udf([f, returnType]) Creates a user defined function (UDF).
unwrap_udt(col) Unwrap UDT data type column into its underlying
type.
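A hedged sketch of both UDF styles (illustrative data; pandas_udf additionally requires pandas and pyarrow to be installed):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["v"])

    @udf(returnType=IntegerType())
    def plus_one(x):                  # row-at-a-time Python UDF
        return x + 1

    @pandas_udf("int")
    def times_two(s: pd.Series) -> pd.Series:   # vectorized pandas UDF
        return s * 2

    df.select(plus_one("v").alias("p1"), times_two("v").alias("t2")).show()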
Misc Functions
md5(col) Calculates the MD5 digest and returns the value as a 32
character hex string.
sha1(col) Returns the hex string result of SHA-1.
sha2(col, numBits) Returns the hex string result of SHA-2 family of hash
functions (SHA-224, SHA-256, SHA-384, and SHA-512).
crc32(col) Calculates the cyclic redundancy check value (CRC32) of a
binary column and returns the value as a bigint.
hash(*cols) Calculates the hash code of given columns, and returns the
result as an int column.
xxhash64(*cols) Calculates the hash code of given columns using the 64-bit
variant of the xxHash algorithm, and returns the result as a
long column.
assert_true(col[, errMsg]) Returns null if the input column is true; throws an exception with the provided error message otherwise.
raise_error(errMsg) Throws an exception with the provided error message.
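For example (illustrative data; sha2’s second argument selects the bit length, here 256):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    df.select(
        F.md5("name").alias("md5_hex"),
        F.sha2("name", 256).alias("sha256_hex"),
        F.hash("name").alias("int_hash"),        # 32-bit Murmur3-based hash
        F.xxhash64("name").alias("long_hash"),
    ).show(truncate=False)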
DataFrame
DataFrame.__getattr__(name) Returns the Column denoted by name.
DataFrame.__getitem__(item) Returns the column as a Column.
DataFrame.agg(*exprs) Aggregate on the entire DataFrame without
groups (shorthand for
df.groupBy().agg()).
DataFrame.alias(alias) Returns a new DataFrame with an alias set.
DataFrame.approxQuantile(col, probabilities, …) Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrame.cache() Persists the DataFrame with the default
storage level (MEMORY_AND_DISK).
DataFrame.checkpoint([eager]) Returns a checkpointed version of this
DataFrame.
DataFrame.coalesce(numPartitions) Returns a new DataFrame that has exactly
numPartitions partitions.
DataFrame.colRegex(colName) Selects column based on the column
name specified as a regex and returns it
as Column.
DataFrame.collect() Returns all the records as a list of Row.
DataFrame.columns Retrieves the names of all columns in the
DataFrame as a list.
DataFrame.corr(col1, col2[, method]) Calculates the correlation of two columns
of a DataFrame as a double value.
DataFrame.count() Returns the number of rows in this
DataFrame.
DataFrame.cov(col1, col2) Calculate the sample covariance for the
given columns, specified by their names,
as a double value.
DataFrame.createGlobalTempView(name) Creates a global temporary view with this DataFrame.
DataFrame.createOrReplaceGlobalTempView(name) Creates or replaces a global temporary view using the given name.
DataFrame.createOrReplaceTempView(name) Creates or replaces a local temporary view with this DataFrame.
DataFrame.createTempView(name) Creates a local temporary view with this
DataFrame.
DataFrame.crossJoin(other) Returns the cartesian product with another
DataFrame.
DataFrame.crosstab(col1, col2) Computes a pair-wise frequency table of
the given columns.
DataFrame.cube(*cols) Create a multi-dimensional cube for the
current DataFrame using the specified
columns, so we can run aggregations on
them.
DataFrame.describe(*cols) Computes basic statistics for numeric and
string columns.
DataFrame.distinct() Returns a new DataFrame containing the
distinct rows in this DataFrame.
DataFrame.drop(*cols) Returns a new DataFrame without specified
columns.
DataFrame.dropDuplicates([subset]) Return a new DataFrame with duplicate
rows removed, optionally only considering
certain columns.
DataFrame.dropDuplicatesWithinWatermark([subset]) Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark.
DataFrame.drop_duplicates([subset]) drop_duplicates() is an alias for
dropDuplicates().
DataFrame.dropna([how, thresh, subset]) Returns a new DataFrame omitting rows
with null values.
DataFrame.dtypes Returns all column names and their data
types as a list.
DataFrame.exceptAll(other) Return a new DataFrame containing rows
in this DataFrame but not in another
DataFrame while preserving duplicates.
DataFrame.explain([extended, mode]) Prints the (logical and physical) plans to
the console for debugging purposes.
DataFrame.fillna(value[, subset]) Replace null values, alias for na.fill().
DataFrame.filter(condition) Filters rows using the given condition.
DataFrame.first() Returns the first row as a Row.
DataFrame.foreach(f) Applies the f function to each Row of this DataFrame.
DataFrame.foreachPartition(f) Applies the f function to each partition of
this DataFrame.
DataFrame.freqItems(cols[, support]) Finding frequent items for columns,
possibly with false positives.
DataFrame.groupBy(*cols) Groups the DataFrame using the specified
columns, so we can run aggregation on
them.
DataFrame.head([n]) Returns the first n rows.
DataFrame.hint(name, *parameters) Specifies some hint on the current DataFrame.
DataFrame.inputFiles() Returns a best-effort snapshot of the files
that compose this DataFrame.
DataFrame.intersect(other) Return a new DataFrame containing rows
only in both this DataFrame and another
DataFrame.
DataFrame.intersectAll(other) Return a new DataFrame containing rows
in both this DataFrame and another
DataFrame while preserving duplicates.
DataFrame.isEmpty() Checks if the DataFrame is empty and
returns a boolean value.
DataFrame.isLocal() Returns True if the collect() and take()
methods can be run locally (without any
Spark executors).
DataFrame.isStreaming Returns True if this DataFrame contains
one or more sources that continuously
return data as it arrives.
DataFrame.join(other[, on, how]) Joins with another DataFrame, using the
given join expression.
DataFrame.limit(num) Limits the result count to the number
specified.
DataFrame.localCheckpoint([eager]) Returns a locally checkpointed version of
this DataFrame.
DataFrame.mapInPandas(func, schema[, barrier]) Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.mapInArrow(func, schema[, barrier]) Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a DataFrame.
DataFrame.melt(ids, values, …) Unpivot a DataFrame from wide format to
long format, optionally leaving identifier
columns set.
DataFrame.na Returns a DataFrameNaFunctions for
handling missing values.
DataFrame.observe(observation, *exprs) Define (named) metrics to observe on the
DataFrame.
DataFrame.offset(num) Returns a new DataFrame by skipping the first num rows.
DataFrame.orderBy(*cols, **kwargs) Returns a new DataFrame sorted by the
specified column(s).
DataFrame.persist([storageLevel]) Sets the storage level to persist the
contents of the DataFrame across
operations after the first time it is
computed.
DataFrame.printSchema([level]) Prints out the schema in the tree format.
DataFrame.randomSplit(weights[, seed]) Randomly splits this DataFrame with the
provided weights.
DataFrame.rdd Returns the content as a pyspark.RDD of Row.
DataFrame.registerTempTable(name) Registers this DataFrame as a temporary
table using the given name.
DataFrame.repartition(numPartitions, *cols) Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.repartitionByRange(numPartitions, …) Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.replace(to_replace[, value, subset]) Returns a new DataFrame replacing a value with another value.
DataFrame.rollup(*cols) Create a multi-dimensional rollup for the
current DataFrame using the specified
columns, so we can run aggregation on
them.
DataFrame.sameSemantics(other) Returns True when the logical query plans
inside both DataFrames are equal and
therefore return the same results.
DataFrame.sample([withReplacement, …]) Returns a sampled subset of this DataFrame.
DataFrame.sampleBy(col, fractions[, seed]) Returns a stratified sample without replacement based on the fraction given on each stratum.
DataFrame.schema Returns the schema of this DataFrame as a
pyspark.sql.types.StructType.
DataFrame.select(*cols) Projects a set of expressions and returns a
new DataFrame.
DataFrame.selectExpr(*expr) Projects a set of SQL expressions and
returns a new DataFrame.
DataFrame.semanticHash() Returns a hash code of the logical query
plan against this DataFrame.
DataFrame.show([n, truncate, vertical]) Prints the first n rows to the console.
DataFrame.sort(*cols, **kwargs) Returns a new DataFrame sorted by the
specified column(s).
DataFrame.sortWithinPartitions(*cols, **kwargs) Returns a new DataFrame with each partition sorted by the specified column(s).
DataFrame.sparkSession Returns the Spark session that created this DataFrame.
DataFrame.stat Returns a DataFrameStatFunctions for
statistic functions.
DataFrame.storageLevel Get the DataFrame’s current storage level.
DataFrame.subtract(other) Return a new DataFrame containing rows
in this DataFrame but not in another
DataFrame.
DataFrame.summary(*statistics) Computes specified statistics for numeric
and string columns.
DataFrame.tail(num) Returns the last num rows as a list of Row.
DataFrame.take(num) Returns the first num rows as a list of Row.
DataFrame.to(schema) Returns a new DataFrame where each row
is reconciled to match the specified
schema.
DataFrame.toDF(*cols) Returns a new DataFrame with the new specified column names.
DataFrame.toJSON([use_unicode]) Converts a DataFrame into an RDD of strings.
DataFrame.toLocalIterator([prefetchPartitions]) Returns an iterator that contains all of the rows in this DataFrame.
DataFrame.toPandas() Returns the contents of this DataFrame as
Pandas pandas.DataFrame.
DataFrame.to_pandas_on_spark([index_col]) Converts the existing DataFrame into a pandas-on-Spark DataFrame (deprecated alias of pandas_api()).
DataFrame.transform(func, *args, **kwargs) Returns a new DataFrame; concise syntax for chaining custom transformations.
DataFrame.union(other) Return a new DataFrame containing the
union of rows in this and another
DataFrame.
DataFrame.unionAll(other) Return a new DataFrame containing the
union of rows in this and another
DataFrame.
DataFrame.unionByName(other[, …]) Returns a new DataFrame containing a
union of rows in this and another
DataFrame.
DataFrame.unpersist([blocking]) Marks the DataFrame as non-persistent,
and removes all blocks for it from memory
and disk.
DataFrame.unpivot(ids, values, …) Unpivot a DataFrame from wide format to
long format, optionally leaving identifier
columns set.
DataFrame.where(condition) where() is an alias for filter().
DataFrame.withColumn(colName, col) Returns a new DataFrame by adding a
column or replacing the existing column
that has the same name.
DataFrame.withColumns(*colsMap) Returns a new DataFrame by adding
multiple columns or replacing the existing
columns that have the same names.
DataFrame.withColumnRenamed(existing, new) Returns a new DataFrame by renaming an existing column.
DataFrame.withColumnsRenamed(colsMap) Returns a new DataFrame by renaming multiple columns.
DataFrame.withMetadata(columnName, metadata) Returns a new DataFrame by updating an existing column with metadata.
DataFrame.withWatermark(eventTime, …) Defines an event time watermark for this DataFrame.
DataFrame.write Interface for saving the content of the
non-streaming DataFrame out into external
storage.
DataFrame.writeStream Interface for saving the content of the
streaming DataFrame out into external
storage.
DataFrame.writeTo(table) Create a write configuration builder for v2
sources.
DataFrame.pandas_api([index_col]) Converts the existing DataFrame into a
pandas-on-Spark DataFrame.
DataFrameNaFunctions.drop([how, thresh, subset]) Returns a new DataFrame omitting rows with null values.
DataFrameNaFunctions.fill(value[, subset]) Replace null values, alias for na.fill().
DataFrameNaFunctions.replace(to_replace[, …]) Returns a new DataFrame replacing a value with another value.
DataFrameStatFunctions.approxQuantile(col, …) Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrameStatFunctions.corr(col1, col2[, method]) Calculates the correlation of two columns of a DataFrame as a double value.
DataFrameStatFunctions.cov(col1, col2) Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrameStatFunctions.crosstab(col1, col2) Computes a pair-wise frequency table of the given columns.
DataFrameStatFunctions.freqItems(cols[, support]) Finding frequent items for columns, possibly with false positives.
DataFrameStatFunctions.sampleBy(col, fractions) Returns a stratified sample without replacement based on the fraction given on each stratum.
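A closing sketch chaining several of the DataFrame methods above, including the na accessor (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", None)], ["k", "v"])
    result = (
        df.na.fill({"v": 0})                 # DataFrameNaFunctions via df.na
          .withColumn("v2", F.col("v") * 2)
          .filter(F.col("v2") >= 0)
          .dropDuplicates(["k"])
          .orderBy("k")
    )
    result.show()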