Column Methods
Column.__getattr__(item) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.__getitem__(k) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.alias(*alias, **kwargs) Returns this column aliased with a new name or
names (in the case of expressions that return
more than one column, such as explode).
Column.asc() Returns a sort expression based on the
ascending order of the column.
Column.asc_nulls_first() Returns a sort expression based on the
ascending order of the column, and null values
return before non-null values.
Column.asc_nulls_last() Returns a sort expression based on the
ascending order of the column, and null values
appear after non-null values.
Column.astype(dataType) astype() is an alias for cast().
Column.between(lowerBound, upperBound) True if the current column is between the lower bound and upper bound, inclusive.
Column.bitwiseAND(other) Compute bitwise AND of this expression with
another expression.
Column.bitwiseOR(other) Compute bitwise OR of this expression with
another expression.
Column.bitwiseXOR(other) Compute bitwise XOR of this expression with
another expression.
Column.cast(dataType) Casts the column into type dataType.
Column.contains(other) Contains the other element.
Column.desc() Returns a sort expression based on the
descending order of the column.
Column.desc_nulls_first() Returns a sort expression based on the
descending order of the column, and null values
appear before non-null values.
Column.desc_nulls_last() Returns a sort expression based on the
descending order of the column, and null values
appear after non-null values.
Column.dropFields(*fieldNames) An expression that drops fields in StructType by
name.
Column.endswith(other) String ends with.
Column.eqNullSafe(other) Equality test that is safe for null values.
Column.getField(name) An expression that gets a field by name in a
StructType.
Column.getItem(key) An expression that gets an item at position
ordinal out of a list, or gets an item by key out of
a dict.
Column.ilike(other) SQL ILIKE expression (case insensitive LIKE).
Column.isNotNull() True if the current expression is NOT null.
Column.isNull() True if the current expression is null.
Column.isin(*cols) A boolean expression that is evaluated to true if
the value of this expression is contained by the
evaluated values of the arguments.
Column.like(other) SQL like expression.
Column.name(*alias, **kwargs) name() is an alias for alias().
Column.otherwise(value) Evaluates a list of conditions and returns one of
multiple possible result expressions.
Column.over(window) Define a windowing column.
Column.rlike(other) SQL RLIKE expression (LIKE with Regex).
Column.startswith(other) String starts with.
Column.substr(startPos, length) Return a Column which is a substring of the
column.
Column.when(condition, value) Evaluates a list of conditions and returns one of
multiple possible result expressions.
Column.withField(fieldName, col) An expression that adds/replaces a field in StructType by name.
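A minimal sketch showing several of these Column methods together (the DataFrame, data, and column names are illustrative; a running SparkSession is assumed):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "tag"])
    df.select(
        df.id.cast("double").alias("id_dbl"),      # cast() + alias()
        df.tag.isNull().alias("tag_missing"),      # null test
        df.id.between(1, 2).alias("in_range"),     # inclusive bounds
        F.when(df.id > 1, "big").otherwise("small").alias("size"),
    ).orderBy(df.id.desc_nulls_last()).show()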
Functions
Normal Functions
col(col) Returns a Column based on the given column name.
column(col) Returns a Column based on the given column name.
lit(col) Creates a Column of literal value.
broadcast(df) Marks a DataFrame as small enough for use in broadcast
joins.
coalesce(*cols) Returns the first column that is not null.
input_file_name() Creates a string column for the file name of the current
Spark task.
isnan(col) An expression that returns true if the column is NaN.
isnull(col) An expression that returns true if the column is null.
monotonically_increasing_id() A column that generates monotonically increasing 64-bit integers.
nanvl(col1, col2) Returns col1 if it is not NaN, or col2 if col1 is NaN.
rand([seed]) Generates a random column with independent and
identically distributed (i.i.d.) samples uniformly distributed
in [0.0, 1.0).
randn([seed]) Generates a column with independent and identically
distributed (i.i.d.) samples from the standard normal
distribution.
spark_partition_id() A column for partition ID.
when(condition, value) Evaluates a list of conditions and returns one of multiple
possible result expressions.
bitwise_not(col) Computes bitwise not.
bitwiseNOT(col) Computes bitwise not.
expr(str) Parses the expression string into the column that it represents.
greatest(*cols) Returns the greatest value of the list of column names,
skipping null values.
least(*cols) Returns the least value of the list of column names,
skipping null values.
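A brief sketch of a few of these functions in use (illustrative data; coalesce() picks the first non-null argument, lit() wraps a literal):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "tag"])
    df.select(
        F.coalesce("tag", F.lit("unknown")).alias("tag_filled"),
        F.when(F.col("id") > 1, "big").otherwise("small").alias("size"),
        F.greatest(F.col("id"), F.lit(0)).alias("non_neg"),
        F.monotonically_increasing_id().alias("row_id"),
    ).show()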
Math Functions
sqrt(col) Computes the square root of the specified float value.
abs(col) Computes the absolute value.
acos(col) Computes inverse cosine of the input column.
acosh(col) Computes inverse hyperbolic cosine of the input column.
asin(col) Computes inverse sine of the input column.
asinh(col) Computes inverse hyperbolic sine of the input column.
atan(col) Computes inverse tangent of the input column.
atanh(col) Computes inverse hyperbolic tangent of the input column.
atan2(col1, col2) Computes the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
bin(col) Returns the string representation of the binary value of the given
column.
cbrt(col) Computes the cube-root of the given value.
ceil(col) Computes the ceiling of the given value.
conv(col, fromBase, toBase) Convert a number in a string column from one base to another.
cos(col) Computes cosine of the input column.
cosh(col) Computes hyperbolic cosine of the input column.
cot(col) Computes cotangent of the input column.
csc(col) Computes cosecant of the input column.
exp(col) Computes the exponential of the given value.
expm1(col) Computes the exponential of the given value minus one.
factorial(col) Computes the factorial of the given value.
floor(col) Computes the floor of the given value.
hex(col) Computes hex value of the given column, which could be
pyspark.sql.types.StringType,
pyspark.sql.types.BinaryType,
pyspark.sql.types.IntegerType or
pyspark.sql.types.LongType.
unhex(col) Inverse of hex.
hypot(col1, col2) Computes sqrt(a^2 + b^2) without intermediate overflow or
underflow.
log(arg1[, arg2]) Returns the first argument-based logarithm of the second argument; with only one argument, returns the natural logarithm of that argument.
log10(col) Computes the logarithm of the given value in Base 10.
log1p(col) Computes the natural logarithm of the “given value plus one”.
log2(col) Returns the base-2 logarithm of the argument.
pmod(dividend, divisor) Returns the positive value of dividend mod divisor.
pow(col1, col2) Returns the value of the first argument raised to the power of the
second argument.
rint(col) Returns the double value that is closest in value to the argument
and is equal to a mathematical integer.
round(col[, scale]) Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
bround(col[, scale]) Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
sec(col) Computes secant of the input column.
shiftleft(col, numBits) Shift the given value numBits left.
shiftright(col, numBits) (Signed) shift the given value numBits right.
shiftrightunsigned(col, numBits) Unsigned shift the given value numBits right.
signum(col) Computes the signum of the given value.
sin(col) Computes sine of the input column.
sinh(col) Computes hyperbolic sine of the input column.
tan(col) Computes tangent of the input column.
tanh(col) Computes hyperbolic tangent of the input column.
toDegrees(col) Deprecated alias for degrees().
degrees(col) Converts an angle measured in radians to an approximately
equivalent angle measured in degrees.
toRadians(col) Deprecated alias for radians().
radians(col) Converts an angle measured in degrees to an approximately
equivalent angle measured in radians.
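For example (illustrative values):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(4.0,), (9.0,)], ["x"])
    df.select(
        F.sqrt("x").alias("root"),
        F.pow("x", 2).alias("squared"),
        F.round(F.log("x"), 3).alias("ln_3dp"),   # natural log, 3 decimals
        F.degrees(F.atan2(F.lit(1.0), "x")).alias("angle_deg"),
    ).show()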
Datetime Functions
add_months(start, months) Returns the date that is months months after
start.
current_date() Returns the current date at the start of query
evaluation as a DateType column.
current_timestamp() Returns the current timestamp at the start of
query evaluation as a TimestampType column.
date_add(start, days) Returns the date that is days days after start.
date_format(date, format) Converts a date/timestamp/string to a value of
string in the format specified by the date format
given by the second argument.
date_sub(start, days) Returns the date that is days days before start.
date_trunc(format, timestamp) Returns timestamp truncated to the unit
specified by the format.
datediff(end, start) Returns the number of days from start to end.
dayofmonth(col) Extract the day of the month of a given
date/timestamp as integer.
dayofweek(col) Extract the day of the week of a given
date/timestamp as integer.
dayofyear(col) Extract the day of the year of a given
date/timestamp as integer.
second(col) Extract the seconds of a given date as integer.
weekofyear(col) Extract the week number of a given date as
integer.
year(col) Extract the year of a given date/timestamp as
integer.
quarter(col) Extract the quarter of a given date/timestamp as
integer.
month(col) Extract the month of a given date/timestamp as
integer.
last_day(date) Returns the last day of the month which the
given date belongs to.
localtimestamp() Returns the current timestamp without time
zone at the start of query evaluation as a
timestamp without time zone column.
minute(col) Extract the minutes of a given timestamp as
integer.
months_between(date1, date2[, roundOff]) Returns number of months between dates date1 and date2.
next_day(date, dayOfWeek) Returns the first date which is later than the value of the date column, based on the second dayOfWeek argument.
hour(col) Extract the hours of a given timestamp as
integer.
make_date(year, month, day) Returns a column with a date built from the year,
month and day columns.
from_unixtime(timestamp[, format]) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp([timestamp, format]) Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix timestamp (in seconds), using the default timezone and the default locale; returns null if failed.
to_timestamp(col[, format]) Converts a Column into
pyspark.sql.types.TimestampType using the
optionally specified format.
to_date(col[, format]) Converts a Column into
pyspark.sql.types.DateType using the
optionally specified format.
trunc(date, format) Returns date truncated to the unit specified by
the format.
from_utc_timestamp(timestamp, tz) This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
to_utc_timestamp(timestamp, tz) This is a common function for databases
supporting TIMESTAMP WITHOUT TIMEZONE.
window(timeColumn, windowDuration[, …]) Bucketize rows into one or more time windows given a timestamp specifying column.
session_window(timeColumn, gapDuration) Generates session window given a timestamp specifying column.
timestamp_seconds(col) Converts the number of seconds from the Unix
epoch (1970-01-01T00:00:00Z) to a timestamp.
window_time(windowColumn) Computes the event time from a window
column.
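A short sketch combining a few of the date helpers (illustrative date):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2024-01-31",)], ["d"]).select(F.to_date("d").alias("d"))
    df.select(
        F.add_months("d", 1).alias("next_month"),
        F.date_format("d", "yyyy-MM").alias("ym"),
        F.dayofweek("d").alias("dow"),
        F.datediff(F.current_date(), "d").alias("age_days"),
    ).show()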
Collection Functions
array(*cols) Creates a new array column.
array_contains(col, value) Collection function: returns null if the array is
null, true if the array contains the given value,
and false otherwise.
arrays_overlap(a1, a2) Collection function: returns true if the arrays
contain any common non-null element; if not,
returns null if both the arrays are non-empty
and any of them contains a null element;
returns false otherwise.
array_join(col, delimiter[, null_replacement]) Concatenates the elements of column using the delimiter.
create_map(*cols) Creates a new map column.
slice(x, start, length) Collection function: returns an array containing
all the elements in x from index start (array
indices start at 1, or from the end if start is
negative) with the specified length.
concat(*cols) Concatenates multiple input columns together
into a single column.
array_position(col, value) Collection function: Locates the position of the
first occurrence of the given value in the given
array.
element_at(col, extraction) Collection function: Returns element of array at given index in extraction if col is array, or value for the given key in extraction if col is map.
array_append(col, value) Collection function: returns an array of the elements in col with value appended at the end of the array.
array_sort(col[, comparator]) Collection function: sorts the input array in
ascending order.
array_insert(arr, pos, value) Collection function: adds an item into a given
array at a specified array index.
array_remove(col, element) Collection function: Remove all elements that
equal to element from the given array.
array_distinct(col) Collection function: removes duplicate values
from the array.
array_intersect(col1, col2) Collection function: returns an array of the
elements in the intersection of col1 and col2,
without duplicates.
array_union(col1, col2) Collection function: returns an array of the
elements in the union of col1 and col2, without
duplicates.
array_except(col1, col2) Collection function: returns an array of the
elements in col1 but not in col2, without
duplicates.
array_compact(col) Collection function: removes null values from
the array.
transform(col, f) Returns an array of elements after applying a
transformation to each element in the input
array.
exists(col, f) Returns whether a predicate holds for one or
more elements in the array.
forall(col, f) Returns whether a predicate holds for every
element in the array.
filter(col, f) Returns an array of elements for which a
predicate holds in a given array.
aggregate(col, initialValue, merge[, finish]) Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
zip_with(left, right, f) Merge two given arrays, element-wise, into a
single array using a function.
transform_keys(col, f) Applies a function to every key-value pair in a
map and returns a map with the results of
those applications as the new keys for the
pairs.
transform_values(col, f) Applies a function to every key-value pair in a
map and returns a map with the results of
those applications as the new values for the
pairs.
map_filter(col, f) Returns a map whose key-value pairs satisfy a
predicate.
map_from_arrays(col1, col2) Creates a new map from two arrays.
map_zip_with(col1, col2, f) Merge two given maps, key-wise into a single
map using a function.
explode(col) Returns a new row for each element in the
given array or map.
explode_outer(col) Returns a new row for each element in the
given array or map.
posexplode(col) Returns a new row for each element with
position in the given array or map.
posexplode_outer(col) Returns a new row for each element with
position in the given array or map.
inline(col) Explodes an array of structs into a table.
inline_outer(col) Explodes an array of structs into a table.
get(col, index) Collection function: Returns element of array
at given (0-based) index.
get_json_object(col, path) Extracts json object from a json string based
on json path specified, and returns json string
of the extracted json object.
json_tuple(col, *fields) Creates a new row for a json column
according to the given field names.
from_json(col, schema[, options]) Parses a column containing a JSON string into
a MapType with StringType as keys type,
StructType or ArrayType with the specified
schema.
schema_of_json(json[, options]) Parses a JSON string and infers its schema in
DDL format.
to_json(col[, options]) Converts a column containing a StructType,
ArrayType or a MapType into a JSON string.
size(col) Collection function: returns the length of the
array or map stored in the column.
struct(*cols) Creates a new struct column.
sort_array(col[, asc]) Collection function: sorts the input array in
ascending or descending order according to
the natural ordering of the array elements.
array_max(col) Collection function: returns the maximum
value of the array.
array_min(col) Collection function: returns the minimum value
of the array.
shuffle(col) Collection function: Generates a random
permutation of the given array.
reverse(col) Collection function: returns a reversed string
or an array with reverse order of elements.
flatten(col) Collection function: creates a single array from
an array of arrays.
sequence(start, stop[, step]) Generate a sequence of integers from start to
stop, incrementing by step.
array_repeat(col, count) Collection function: creates an array
containing a column repeated count times.
map_contains_key(col, value) Returns true if the map contains the key.
map_keys(col) Collection function: Returns an unordered
array containing the keys of the map.
map_values(col) Collection function: Returns an unordered
array containing the values of the map.
map_entries(col) Collection function: Returns an unordered
array of all entries in the given map.
map_from_entries(col) Collection function: Converts an array of
entries (key value struct types) to a map of
values.
arrays_zip(*cols) Collection function: Returns a merged array of
structs in which the N-th struct contains all
N-th values of input arrays.
map_concat(*cols) Returns the union of all the given maps.
from_csv(col, schema[, options]) Parses a column containing a CSV string to a
row with the specified schema.
schema_of_csv(csv[, options]) Parses a CSV string and infers its schema in
DDL format.
to_csv(col[, options]) Converts a column containing a StructType
into a CSV string.
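A sketch touching arrays, higher-order functions, and explode (illustrative data; transform() accepts a Python lambda in Spark 3.1+):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1, 2, 2, 3],)], ["xs"])
    df.select(
        F.array_distinct("xs").alias("uniq"),
        F.array_contains("xs", 2).alias("has_two"),
        F.transform("xs", lambda x: x * 10).alias("scaled"),
        F.explode("xs").alias("x"),    # one output row per element
    ).show()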
Partition Transformation Functions
years(col) Partition transform function: A transform for timestamps and
dates to partition data into years.
months(col) Partition transform function: A transform for timestamps and
dates to partition data into months.
days(col) Partition transform function: A transform for timestamps and
dates to partition data into days.
hours(col) Partition transform function: A transform for timestamps to
partition data into hours.
bucket(numBuckets, col) Partition transform function: A transform for any type that partitions by a hash of the input column.
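These transforms are intended for the DataFrameWriterV2 API rather than ordinary selects. A hedged sketch, assuming hypothetical table names and a catalog that supports partition transforms (e.g. Apache Iceberg):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("catalog.db.raw_events")      # hypothetical source table
    df.writeTo("catalog.db.events").partitionedBy(
        F.years("event_ts"),       # one partition per year of the timestamp
        F.bucket(16, "user_id"),   # plus a 16-bucket hash of user_id
    ).create()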
Aggregate Functions
approxCountDistinct(col[, rsd]) Deprecated alias for approx_count_distinct().
approx_count_distinct(col[, rsd]) Aggregate function: returns a new Column for
approximate distinct count of column col.
avg(col) Aggregate function: returns the average of
the values in a group.
collect_list(col) Aggregate function: returns a list of objects
with duplicates.
collect_set(col) Aggregate function: returns a set of objects
with duplicate elements eliminated.
corr(col1, col2) Returns a new Column for the Pearson
Correlation Coefficient for col1 and col2.
count(col) Aggregate function: returns the number of
items in a group.
count_distinct(col, *cols) Returns a new Column for distinct count of
col or cols.
countDistinct(col, *cols) Returns a new Column for distinct count of
col or cols.
covar_pop(col1, col2) Returns a new Column for the population
covariance of col1 and col2.
covar_samp(col1, col2) Returns a new Column for the sample
covariance of col1 and col2.
first(col[, ignorenulls]) Aggregate function: returns the first value in
a group.
grouping(col) Aggregate function: indicates whether a
specified column in a GROUP BY list is
aggregated or not, returns 1 for aggregated
or 0 for not aggregated in the result set.
grouping_id(*cols) Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + … + grouping(cn).
kurtosis(col) Aggregate function: returns the kurtosis of
the values in a group.
last(col[, ignorenulls]) Aggregate function: returns the last value in
a group.
max(col) Aggregate function: returns the maximum
value of the expression in a group.
max_by(col, ord) Returns the value associated with the
maximum value of ord.
mean(col) Aggregate function: returns the average of
the values in a group.
median(col) Returns the median of the values in a group.
min(col) Aggregate function: returns the minimum
value of the expression in a group.
min_by(col, ord) Returns the value associated with the
minimum value of ord.
mode(col) Returns the most frequent value in a group.
percentile_approx(col, percentage[, accuracy]) Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
product(col) Aggregate function: returns the product of
the values in a group.
skewness(col) Aggregate function: returns the skewness of
the values in a group.
stddev(col) Aggregate function: alias for stddev_samp.
stddev_pop(col) Aggregate function: returns population
standard deviation of the expression in a
group.
stddev_samp(col) Aggregate function: returns the unbiased
sample standard deviation of the expression
in a group.
sum(col) Aggregate function: returns the sum of all
values in the expression.
sum_distinct(col) Aggregate function: returns the sum of
distinct values in the expression.
sumDistinct(col) Aggregate function: returns the sum of
distinct values in the expression.
var_pop(col) Aggregate function: returns the population
variance of the values in a group.
var_samp(col) Aggregate function: returns the unbiased
sample variance of the values in a group.
variance(col) Aggregate function: alias for var_samp.
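A typical groupBy/agg sketch (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["k", "v"])
    df.groupBy("k").agg(
        F.count("v").alias("n"),
        F.avg("v").alias("mean"),
        F.max("v").alias("peak"),
        F.collect_list("v").alias("values"),
    ).show()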
Window Functions
cume_dist() Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
dense_rank() Window function: returns the rank of rows within a
window partition, without any gaps.
lag(col[, offset, default]) Window function: returns the value that is offset rows before the current row, and default if there are fewer than offset rows before the current row.
lead(col[, offset, default]) Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row.
nth_value(col, offset[, ignoreNulls]) Window function: returns the value that is the offset-th row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows.
ntile(n) Window function: returns the ntile group id (from 1 to n
inclusive) in an ordered window partition.
percent_rank() Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
rank() Window function: returns the rank of rows within a
window partition.
row_number() Window function: returns a sequential number starting
at 1 within a window partition.
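Window functions are applied with Column.over() and a Window spec; a minimal sketch with illustrative data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["k", "v"])
    w = Window.partitionBy("k").orderBy("v")
    df.select(
        "k", "v",
        F.row_number().over(w).alias("rn"),
        F.lag("v", 1).over(w).alias("prev_v"),   # null on the first row per key
    ).show()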
Sort Functions
asc(col) Returns a sort expression based on the ascending order of the
given column name.
asc_nulls_first(col) Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
asc_nulls_last(col) Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
desc(col) Returns a sort expression based on the descending order of the given column name.
desc_nulls_first(col) Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last(col) Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
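For example (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(None,), (2,), (1,)], ["v"])
    df.orderBy(F.asc_nulls_last("v")).show()    # 1, 2, null
    df.orderBy(F.desc_nulls_first("v")).show()  # null, 2, 1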
String Functions
ascii(col) Computes the numeric value of the first
character of the string column.
base64(col) Computes the BASE64 encoding of a binary
column and returns it as a string column.
bit_length(col) Calculates the bit length for the specified string
column.
concat_ws(sep, *cols) Concatenates multiple input string columns
together into a single string column, using the
given separator.
decode(col, charset) Computes the first argument into a string from
a binary using the provided character set (one
of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’,
‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(col, charset) Computes the first argument into a binary from
a string using the provided character set (one
of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’,
‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
format_number(col, d) Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string(format, *cols) Formats the arguments in printf-style and
returns the result as a string column.
initcap(col) Translate the first letter of each word to upper
case in the sentence.
instr(str, substr) Locate the position of the first occurrence of
substr column in the given string.
length(col) Computes the character length of string data or
number of bytes of binary data.
lower(col) Converts a string expression to lower case.
levenshtein(left, right) Computes the Levenshtein distance of the two
given strings.
locate(substr, str[, pos]) Locate the position of the first occurrence of
substr in a string column, after position pos.
lpad(col, len, pad) Left-pad the string column to width len with
pad.
ltrim(col) Trim the spaces from left end for the specified
string value.
octet_length(col) Calculates the byte length for the specified
string column.
regexp_extract(str, pattern, idx) Extract a specific group matched by a Java
regex, from the specified string column.
regexp_replace(string, pattern, replacement) Replace all substrings of the specified string value that match regexp with replacement.
unbase64(col) Decodes a BASE64 encoded string column
and returns it as a binary column.
rpad(col, len, pad) Right-pad the string column to width len with
pad.
repeat(col, n) Repeats a string column n times, and returns it
as a new string column.
rtrim(col) Trim the spaces from right end for the specified
string value.
soundex(col) Returns the SoundEx encoding for a string.
split(str, pattern[, limit]) Splits str around matches of the given pattern.
substring(str, pos, len) Substring starts at pos and is of length len
when str is String type or returns the slice of
byte array that starts at pos in byte and is of
length len when str is Binary type.
substring_index(str, delim, count) Returns the substring from string str before
count occurrences of the delimiter delim.
overlay(src, replace, pos[, len]) Overlay the specified portion of src with
replace, starting from byte position pos of src
and proceeding for len bytes.
sentences(string[, language, country]) Splits a string into arrays of sentences, where each sentence is an array of words.
translate(srcCol, matching, replace) Translates any character in srcCol that appears in matching to the corresponding character in replace.
trim(col) Trim the spaces from both ends for the
specified string column.
upper(col) Converts a string expression to uppercase.
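A short sketch of a few string helpers (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  spark sql  ",)], ["s"])
    df.select(
        F.upper(F.trim("s")).alias("shout"),
        F.split(F.trim("s"), " ").alias("words"),
        F.regexp_replace("s", r"\s+", "_").alias("snake"),
    ).show(truncate=False)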
UDF
call_udf(udfName, *cols) Call a user-defined function.
pandas_udf([f, returnType, functionType]) Creates a pandas user defined function (a.k.a. vectorized user defined function).
udf([f, returnType]) Creates a user defined function (UDF).
unwrap_udt(col) Unwrap UDT data type column into its underlying
type.
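A hedged sketch of both UDF styles (illustrative data; pandas_udf additionally requires pandas and pyarrow to be installed):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["v"])

    @udf(returnType=IntegerType())
    def plus_one(x):                  # row-at-a-time Python UDF
        return x + 1

    @pandas_udf("int")
    def times_two(s: pd.Series) -> pd.Series:   # vectorized pandas UDF
        return s * 2

    df.select(plus_one("v").alias("p1"), times_two("v").alias("t2")).show()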
Misc Functions
md5(col) Calculates the MD5 digest and returns the value as a 32
character hex string.
sha1(col) Returns the hex string result of SHA-1.
sha2(col, numBits) Returns the hex string result of SHA-2 family of hash
functions (SHA-224, SHA-256, SHA-384, and SHA-512).
crc32(col) Calculates the cyclic redundancy check value (CRC32) of a
binary column and returns the value as a bigint.
hash(*cols) Calculates the hash code of given columns, and returns the
result as an int column.
xxhash64(*cols) Calculates the hash code of given columns using the 64-bit
variant of the xxHash algorithm, and returns the result as a
long column.
assert_true(col[, errMsg]) Returns null if the input column is true; throws an exception with the provided error message otherwise.
raise_error(errMsg) Throws an exception with the provided error message.
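For example (illustrative data; sha2’s second argument selects the bit length, here 256):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    df.select(
        F.md5("name").alias("md5_hex"),
        F.sha2("name", 256).alias("sha256_hex"),
        F.hash("name").alias("int_hash"),        # 32-bit Murmur3-based hash
        F.xxhash64("name").alias("long_hash"),
    ).show(truncate=False)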
DataFrame
DataFrame.__getattr__(name) Returns the Column denoted by name.
DataFrame.__getitem__(item) Returns the column as a Column.
DataFrame.agg(*exprs) Aggregate on the entire DataFrame without
groups (shorthand for
df.groupBy().agg()).
DataFrame.alias(alias) Returns a new DataFrame with an alias set.
DataFrame.approxQuantile(col, probabilities, …) Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrame.cache() Persists the DataFrame with the default
storage level (MEMORY_AND_DISK).
DataFrame.checkpoint([eager]) Returns a checkpointed version of this
DataFrame.
DataFrame.coalesce(numPartitions) Returns a new DataFrame that has exactly
numPartitions partitions.
DataFrame.colRegex(colName) Selects column based on the column
name specified as a regex and returns it
as Column.
DataFrame.collect() Returns all the records as a list of Row.
DataFrame.columns Retrieves the names of all columns in the
DataFrame as a list.
DataFrame.corr(col1, col2[, method]) Calculates the correlation of two columns
of a DataFrame as a double value.
DataFrame.count() Returns the number of rows in this
DataFrame.
DataFrame.cov(col1, col2) Calculate the sample covariance for the
given columns, specified by their names,
as a double value.
DataFrame.createGlobalTempView(name) Creates a global temporary view with this DataFrame.
DataFrame.createOrReplaceGlobalTempView(name) Creates or replaces a global temporary view using the given name.
DataFrame.createOrReplaceTempView(name) Creates or replaces a local temporary view with this DataFrame.
DataFrame.createTempView(name) Creates a local temporary view with this
DataFrame.
DataFrame.crossJoin(other) Returns the cartesian product with another
DataFrame.
DataFrame.crosstab(col1, col2) Computes a pair-wise frequency table of
the given columns.
DataFrame.cube(*cols) Create a multi-dimensional cube for the
current DataFrame using the specified
columns, so we can run aggregations on
them.
DataFrame.describe(*cols) Computes basic statistics for numeric and
string columns.
DataFrame.distinct() Returns a new DataFrame containing the
distinct rows in this DataFrame.
DataFrame.drop(*cols) Returns a new DataFrame without specified
columns.
DataFrame.dropDuplicates([subset]) Return a new DataFrame with duplicate
rows removed, optionally only considering
certain columns.
DataFrame.dropDuplicatesWithinWatermark([subset]) Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark.
DataFrame.drop_duplicates([subset]) drop_duplicates() is an alias for
dropDuplicates().
DataFrame.dropna([how, thresh, subset]) Returns a new DataFrame omitting rows
with null values.
DataFrame.dtypes Returns all column names and their data
types as a list.
DataFrame.exceptAll(other) Return a new DataFrame containing rows
in this DataFrame but not in another
DataFrame while preserving duplicates.
DataFrame.explain([extended, mode]) Prints the (logical and physical) plans to
the console for debugging purposes.
DataFrame.fillna(value[, subset]) Replace null values, alias for na.fill().
DataFrame.filter(condition) Filters rows using the given condition.
DataFrame.first() Returns the first row as a Row.
DataFrame.foreach(f) Applies the f function to each Row of this DataFrame.
DataFrame.foreachPartition(f) Applies the f function to each partition of
this DataFrame.
DataFrame.freqItems(cols[, support]) Finding frequent items for columns,
possibly with false positives.
DataFrame.groupBy(*cols) Groups the DataFrame using the specified
columns, so we can run aggregation on
them.
DataFrame.head([n]) Returns the first n rows.
DataFrame.hint(name, *parameters) Specifies some hint on the current DataFrame.
DataFrame.inputFiles() Returns a best-effort snapshot of the files
that compose this DataFrame.
DataFrame.intersect(other) Return a new DataFrame containing rows
only in both this DataFrame and another
DataFrame.
DataFrame.intersectAll(other) Return a new DataFrame containing rows
in both this DataFrame and another
DataFrame while preserving duplicates.
DataFrame.isEmpty() Checks if the DataFrame is empty and
returns a boolean value.
DataFrame.isLocal() Returns True if the collect() and take()
methods can be run locally (without any
Spark executors).
DataFrame.isStreaming Returns True if this DataFrame contains
one or more sources that continuously
return data as it arrives.
DataFrame.join(other[, on, how]) Joins with another DataFrame, using the
given join expression.
DataFrame.limit(num) Limits the result count to the number
specified.
DataFrame.localCheckpoint([eager]) Returns a locally checkpointed version of
this DataFrame.
DataFrame.mapInPandas(func, schema[, barrier]) Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.mapInArrow(func, schema[, barrier]) Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a DataFrame.
DataFrame.melt(ids, values, …) Unpivot a DataFrame from wide format to
long format, optionally leaving identifier
columns set.
DataFrame.na Returns a DataFrameNaFunctions for
handling missing values.
DataFrame.observe(observation, *exprs) Define (named) metrics to observe on the
DataFrame.
DataFrame.offset(num) Returns a new DataFrame by skipping the first num rows.
DataFrame.orderBy(*cols, **kwargs) Returns a new DataFrame sorted by the
specified column(s).
DataFrame.persist([storageLevel]) Sets the storage level to persist the
contents of the DataFrame across
operations after the first time it is
computed.
DataFrame.printSchema([level]) Prints out the schema in the tree format.
DataFrame.randomSplit(weights[, seed]) Randomly splits this DataFrame with the
provided weights.
DataFrame.rdd Returns the content as a pyspark.RDD of Row.
DataFrame.registerTempTable(name) Registers this DataFrame as a temporary
table using the given name.
DataFrame.repartition(numPartitions, *cols) Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.repartitionByRange(numPartitions, …) Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.replace(to_replace[, value, subset]) Returns a new DataFrame replacing a value with another value.
DataFrame.rollup(*cols) Create a multi-dimensional rollup for the
current DataFrame using the specified
columns, so we can run aggregation on
them.
DataFrame.sameSemantics(other) Returns True when the logical query plans
inside both DataFrames are equal and
therefore return the same results.
DataFrame.sample([withReplacement, …]) Returns a sampled subset of this DataFrame.
DataFrame.sampleBy(col, fractions[, seed]) Returns a stratified sample without replacement based on the fraction given on each stratum.
DataFrame.schema Returns the schema of this DataFrame as a
pyspark.sql.types.StructType.
DataFrame.select(*cols) Projects a set of expressions and returns a
new DataFrame.
DataFrame.selectExpr(*expr) Projects a set of SQL expressions and
returns a new DataFrame.
DataFrame.semanticHash() Returns a hash code of the logical query
plan against this DataFrame.
DataFrame.show([n, truncate, vertical]) Prints the first n rows to the console.
DataFrame.sort(*cols, **kwargs) Returns a new DataFrame sorted by the
specified column(s).
DataFrame.sortWithinPartitions(*cols, **kwargs) Returns a new DataFrame with each partition sorted by the specified column(s).
DataFrame.sparkSession Returns the Spark session that created this DataFrame.
DataFrame.stat Returns a DataFrameStatFunctions for
statistic functions.
DataFrame.storageLevel Get the DataFrame’s current storage level.
DataFrame.subtract(other) Return a new DataFrame containing rows
in this DataFrame but not in another
DataFrame.
DataFrame.summary(*statistics) Computes specified statistics for numeric
and string columns.
DataFrame.tail(num) Returns the last num rows as a list of Row.
DataFrame.take(num) Returns the first num rows as a list of Row.
DataFrame.to(schema) Returns a new DataFrame where each row
is reconciled to match the specified
schema.
DataFrame.toDF(*cols) Returns a new DataFrame with the new specified column names.
DataFrame.toJSON([use_unicode]) Converts a DataFrame into an RDD of strings.
DataFrame.toLocalIterator([prefetchPartitions]) Returns an iterator that contains all of the rows in this DataFrame.
DataFrame.toPandas() Returns the contents of this DataFrame as
Pandas pandas.DataFrame.
DataFrame.to_pandas_on_spark([index_col]) Converts the existing DataFrame into a pandas-on-Spark DataFrame (deprecated alias of pandas_api()).
DataFrame.transform(func, *args, **kwargs) Returns a new DataFrame; concise syntax for chaining custom transformations.
DataFrame.union(other) Return a new DataFrame containing the
union of rows in this and another
DataFrame.
DataFrame.unionAll(other) Return a new DataFrame containing the
union of rows in this and another
DataFrame.
DataFrame.unionByName(other[, …]) Returns a new DataFrame containing a
union of rows in this and another
DataFrame.
DataFrame.unpersist([blocking]) Marks the DataFrame as non-persistent,
and removes all blocks for it from memory
and disk.
DataFrame.unpivot(ids, values, …) Unpivot a DataFrame from wide format to
long format, optionally leaving identifier
columns set.
DataFrame.where(condition) where() is an alias for filter().
DataFrame.withColumn(colName, col) Returns a new DataFrame by adding a
column or replacing the existing column
that has the same name.
DataFrame.withColumns(*colsMap) Returns a new DataFrame by adding
multiple columns or replacing the existing
columns that have the same names.
DataFrame.withColumnRenamed(existing, new) Returns a new DataFrame by renaming an existing column.
DataFrame.withColumnsRenamed(colsMap) Returns a new DataFrame by renaming multiple columns.
DataFrame.withMetadata(columnName, metadata) Returns a new DataFrame by updating an existing column with metadata.
DataFrame.withWatermark(eventTime, …) Defines an event time watermark for this DataFrame.
DataFrame.write Interface for saving the content of the
non-streaming DataFrame out into external
storage.
DataFrame.writeStream Interface for saving the content of the
streaming DataFrame out into external
storage.
DataFrame.writeTo(table) Create a write configuration builder for v2
sources.
DataFrame.pandas_api([index_col]) Converts the existing DataFrame into a
pandas-on-Spark DataFrame.
DataFrameNaFunctions.drop([how, thresh, subset]) Returns a new DataFrame omitting rows with null values.
DataFrameNaFunctions.fill(value[, subset]) Replace null values, alias for na.fill().
DataFrameNaFunctions.replace(to_replace[, …]) Returns a new DataFrame replacing a value with another value.
DataFrameStatFunctions.approxQuantile(col, …) Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrameStatFunctions.corr(col1, col2[, method]) Calculates the correlation of two columns of a DataFrame as a double value.
DataFrameStatFunctions.cov(col1, col2) Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrameStatFunctions.crosstab(col1, col2) Computes a pair-wise frequency table of the given columns.
DataFrameStatFunctions.freqItems(cols[, support]) Finding frequent items for columns, possibly with false positives.
DataFrameStatFunctions.sampleBy(col, fractions) Returns a stratified sample without replacement based on the fraction given on each stratum.
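A closing sketch chaining several of the DataFrame methods above, including the na accessor (illustrative data):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", None)], ["k", "v"])
    result = (
        df.na.fill({"v": 0})                 # DataFrameNaFunctions via df.na
          .withColumn("v2", F.col("v") * 2)
          .filter(F.col("v2") >= 0)
          .dropDuplicates(["k"])
          .orderBy("k")
    )
    result.show()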