DataFrame Reference
- class odps.df.DataFrame(*args, **kwargs)[source]
Main entrance of PyODPS DataFrame.
Users can initialize a DataFrame from an odps.models.Table or a pandas DataFrame.
- Parameters:
data (odps.models.Table or pandas DataFrame) – ODPS table or pandas DataFrame
- Example:
>>> df = DataFrame(o.get_table('my_example_table'))
>>> df.dtypes
odps.Schema {
  movie_id            int64
  title               string
  release_date        string
  video_release_date  string
  imdb_url            string
  user_id             int64
  rating              int64
  unix_timestamp      int64
  age                 int64
  sex                 string
  occupation          string
  zip_code            string
}
>>> df.count()
100000
>>>
>>> # Do the `groupby`, aggregate the `movie_id` by count, then sort the count in a reversed order
>>> # Finally we get the top 25 results
>>> df.groupby('title').agg(count=df.movie_id.count()).sort('count', ascending=False)[:25]
>>>
>>> # We can use the `value_counts` to reach the same goal
>>> df.movie_id.value_counts()[:25]
- static batch_persist(dfs, tables, *args, **kwargs)[source]
Persist multiple DataFrames into ODPS.
- Parameters:
dfs – DataFrames to persist.
tables – Table names to persist to. Use (table, partition) tuple to store to a table partition.
args – args for Expr.persist
kwargs – kwargs for Expr.persist
- Examples:
>>> DataFrame.batch_persist([df1, df2], ['table_name1', ('table_name2', 'partition_name2')], lifecycle=1)
- class odps.df.CollectionExpr(*args, **kwargs)[source]
Collection represents two-dimensional data.
- Example:
>>> # projection
>>> df = DataFrame(o.get_table('my_table'))  # DataFrame is actually a CollectionExpr
>>> df['name', 'id']  # project some columns
>>> df[[df.name, df.id]]  # projection
>>> df[df]  # means nothing, but get all the columns
>>> df[df, df.name.lower().rename('name2')]  # project a new column `name2` besides all the original columns
>>> df.select(df, name2=df.name.lower())  # projection by `select`
>>> df.exclude('name')  # project all columns but `name`
>>> df[df.exclude('name'), df.name.lower()]  # `name` will no longer conflict
>>>
>>> # filter
>>> df[(df.id < 3) & (df.name != 'test')]
>>> df.filter(df.id < 3, df.name != 'test')
>>>
>>> # slice
>>> df[:10]
>>> df.limit(10)
>>>
>>> # Sequence
>>> df.name  # an instance of :class:`odps.df.expr.expressions.SequenceExpr`
>>>
>>> # schema or dtypes
>>> df.dtypes
odps.Schema {
  name  string
  id    int64
}
>>> df.schema
odps.Schema {
  name  string
  id    int64
}
- all()
Return whether all values are True.
- Parameters:
expr
- Returns:
- any()
Return whether any value is True.
- Parameters:
expr
- Returns:
- append_id(id_col='append_id')
Append an ID column to the current DataFrame to form a new DataFrame.
- Parameters:
id_col (str) – name of the appended ID field.
- Returns:
DataFrame with ID field
- Return type:
- apply(func, axis=0, names=None, types=None, reduce=False, resources=None, keep_nulls=False, args=(), **kwargs)
Apply a function to each row when axis=1, or to each column when axis=0.
- Parameters:
expr
func – function to apply
axis – row when axis=1 else column
names – output names
types – output types
reduce – if True, returns a sequence, else returns a collection
resources – resources to read
keep_nulls – if True, keep rows producing empty results; only works in lateral views
args – args for function
kwargs – kwargs for function
- Returns:
- Example:
Apply a function to a row:
>>> from odps.df import output
>>>
>>> @output(['iris_add', 'iris_sub'], ['float', 'float'])
>>> def handle(row):
>>>     yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth
>>>     yield row.petallength - row.petalwidth, row.petallength + row.petalwidth
>>>
>>> iris.apply(handle, axis=1).count()
Apply a function to a column:
>>> class Agg(object):
>>>
>>>     def buffer(self):
>>>         return [0.0, 0]
>>>
>>>     def __call__(self, buffer, val):
>>>         buffer[0] += val
>>>         buffer[1] += 1
>>>
>>>     def merge(self, buffer, pbuffer):
>>>         buffer[0] += pbuffer[0]
>>>         buffer[1] += pbuffer[1]
>>>
>>>     def getvalue(self, buffer):
>>>         if buffer[1] == 0:
>>>             return 0.0
>>>         return buffer[0] / buffer[1]
>>>
>>> iris.exclude('name').apply(Agg)
- applymap(func, rtype=None, resources=None, columns=None, excludes=None, args=(), **kwargs)
Call func on each element of this collection.
- Parameters:
func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function
rtype – if not provided, will be the dtype of this sequence
columns – columns to apply this function on
excludes – columns to skip when applying the function
- Returns:
a new collection
- Example:
>>> df.applymap(lambda x: x + 1)
- ast()
Return the AST string.
- Returns:
AST tree
- Return type:
str
- bfill(subset=None)
Fill NA/NaN values with the backward method. Equivalent to fillna(method='bfill').
- Parameters:
expr (DataFrame) – input DataFrame.
subset – labels along the other axis to consider.
- Returns:
DataFrame
- bloom_filter(on, column, capacity=3000, error_rate=0.01)
Filter collection on the on sequence by BloomFilter built by column
- Parameters:
collection
on – sequence or column name
column – instance of Column
capacity (int) – capacity of the bloom filter
error_rate (float) – error rate
- Returns:
collection
- Example:
>>> df1 = DataFrame(pd.DataFrame({'a': ['name1', 'name2', 'name3', 'name1'], 'b': [1, 2, 3, 4]}))
>>> df2 = DataFrame(pd.DataFrame({'a': ['name1']}))
>>> df1.bloom_filter('a', df2.a)
       a  b
0  name1  1
1  name1  4
- property columns
- Returns:
columns
- Return type:
list, in which each element is a Column
- compile()
Compile this expression into ODPS SQL.
- Returns:
compiled DAG
- Return type:
str
- concat(rights, distinct=False, axis=0)
Concat collections.
- Parameters:
left – left collection
rights – right collections, can be a DataFrame object or a list of DataFrames
distinct – whether to remove duplicate entries; only available when axis == 0
axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.
- Returns:
collection
Note that axis==1 can only be used under Pandas DataFrames or XFlow.
- Example:
>>> df['name', 'id'].concat(df2['score'], axis=1)
- continuous(*args)
Set fields to be continuous.
- Return type:
- Example:
>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
- count()
Number of values (rows).
- Parameters:
expr
- Returns:
- discrete(*args)
Set fields to be discrete.
- Return type:
- Example:
>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` into discrete
>>> new_ds = df.discrete('f1 f2')
- distinct(on=None, *ons)
Get collection with duplicate rows removed, optionally only considering certain columns
- Parameters:
expr – collection
on – sequence or sequences
- Returns:
distinct collection
- Example:
>>> df.distinct(['name', 'id'])
>>> df['name', 'id'].distinct()
- drop(data, axis=0, columns=None)
Drop data from a DataFrame.
- Parameters:
expr – collection to drop data from
data – data to be removed
axis – 0 for deleting rows, 1 for columns.
columns – columns of data to select, only useful when axis == 0
- Returns:
collection
- Example:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
- dropna(how='any', thresh=None, subset=None)
Return an object with labels on the given axis omitted where any or all of the data are missing.
- Parameters:
expr (DataFrame) – input DataFrame
how – can be 'any' or 'all'. If 'any' is specified, drop a label if any NA values are present. If 'all' is specified, drop a label only if all values are NA.
thresh – require that many non-NA values
subset – labels along the other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
- Returns:
DataFrame
- erase_key_value(*args)
Erase key-value represented fields.
- Return type:
- Example:
>>> new_ds = df.erase_key_value('f1 f2')
- except_(*rights, **kwargs)
Exclude data from a collection, like the except clause in SQL. All collections involved should have the same schema.
- Parameters:
left – collection to drop data from
rights – collection or list of collections
distinct – whether to remove duplicate entries in the result
- Returns:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
- exclude(*fields)[source]
Project columns which are not included in the fields.
- Parameters:
fields – field names
- Returns:
new collection
- Return type:
odps.df.expr.expression.CollectionExpr
- execute(**kwargs)
- Parameters:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
- Returns:
execution result
- Return type:
odps.df.backends.frame.ResultFrame
- extract_kv(columns=None, kv_delim=':', item_delim=',', dtype='float', fill_value=None)
Extract values in key-value represented columns into standalone columns. New column names will be the name of the key-value column followed by an underscore and the key.
- Parameters:
expr (DataFrame) – input DataFrame
columns – the key-value columns to be extracted.
kv_delim (str) – delimiter between key and value.
item_delim (str) – delimiter between key-value pairs.
dtype (str) – type of value columns to generate.
fill_value – default value for missing key-value pairs.
- Returns:
extracted data frame
- Return type:
- Example:
>>> df
    name                     kv
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2          k2=3.0,k3=5.1
2  name3          k1=7.1,k7=8.2
3  name4          k2=1.2,k3=1.5
4  name5          k2=1.0,k9=1.1
>>> table = df.extract_kv(columns=['kv'], kv_delim='=')
>>> table
    name  kv_k1  kv_k2  kv_k3  kv_k5  kv_k7  kv_k9
0  name1    1.0    3.0    NaN   10.0    NaN    NaN
1  name2    NaN    3.0    5.1    NaN    NaN    NaN
2  name3    7.1    NaN    NaN    NaN    8.2    NaN
3  name4    NaN    1.2    1.5    NaN    NaN    NaN
4  name5    NaN    1.0    NaN    NaN    NaN    1.1
- ffill(subset=None)
Fill NA/NaN values with the forward method. Equivalent to fillna(method='ffill').
- Parameters:
expr (DataFrame) – input DataFrame.
subset – labels along the other axis to consider.
- Returns:
DataFrame
- fillna(value=None, method=None, subset=None)
Fill NA/NaN values using the specified method
- Parameters:
expr (DataFrame) – input DataFrame
method – can be 'backfill', 'bfill', 'pad', 'ffill' or None
value – value to fill into
subset – labels along the other axis to consider.
- Returns:
DataFrame
- filter(*predicates)[source]
Filter the data by predicates
- Parameters:
predicates – the conditions to filter
- Returns:
new collection
- Return type:
- filter_parts(predicate='', exclude=True)[source]
Filter the data by a partition string. A partition string looks like pt1=1,pt2=2/pt1=2,pt2=1, where comma (,) denotes 'and', while slash (/) denotes 'or'.
- Parameters:
predicate (str|Partition) – predicate string of partition filter
exclude (bool) – True if you want to exclude partition fields, otherwise False. True by default.
- Returns:
new collection
- Return type:
- groupby(by, *bys)
Group collection by a series of sequences.
- Parameters:
expr – collection
by – columns to group
bys – columns to group
- Returns:
GroupBy instance
- Return type:
- head(n=None, **kwargs)[source]
Return the first n rows. Execute at once.
- Parameters:
n
- Returns:
result frame
- Return type:
odps.df.backends.frame.ResultFrame
- inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Inner join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – when names conflict, the suffixes will be added to both columns.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- intersect(*rights, **kwargs)
Calculate the intersection of datasets.
- Parameters:
left – collection
rights – collection or list of collections
distinct – whether to remove duplicate entries in the result
- Returns:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
- join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
- Parameters:
left – left collection
right – right collection
on – fields to join on
how – 'inner', 'left', 'right', or 'outer'
suffixes – when names conflict, the suffixes will be added to both columns.
mapjoin – whether to use mapjoin, default value False.
skewjoin – whether to use skewjoin, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
- key_value(*args, **kwargs)
Set fields to be key-value represented.
- Return type:
- Example:
>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
- kurt()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- kurtosis()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- label_field(f)
Select one field as the label field.
Note that this field will be excluded from feature fields.
- Parameters:
f (str) – selected label field
- Return type:
- left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Left join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – when names conflict, the suffixes will be added to both columns.
mapjoin – whether to use mapjoin, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, with non-null value. If the value is 'left' or 'right', the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- map_reduce(mapper=None, reducer=None, group=None, sort=None, ascending=True, combiner=None, combiner_buffer_size=1024, mapper_output_names=None, mapper_output_types=None, mapper_resources=None, reducer_output_names=None, reducer_output_types=None, reducer_resources=None)
MapReduce API, mapper or reducer should be provided.
- Parameters:
expr
mapper – mapper function or class
reducer – reducer function or class
group – the keys to group by after the mapper
sort – the keys to sort by after the mapper
ascending – True if ascending else False
combiner – combiner function or class; the combiner's output should be the same as the mapper's
combiner_buffer_size – combiner's buffer size, 1024 as default
mapper_output_names – mapper's output names
mapper_output_types – mapper's output types
mapper_resources – the resources for the mapper
reducer_output_names – reducer's output names
reducer_output_types – reducer's output types
reducer_resources – the resources for the reducer
- Returns:
- Example:
>>> from odps.df import output
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def mapper(row):
>>>     for word in row[0].split():
>>>         yield word.lower(), 1
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def reducer(keys):
>>>     cnt = [0]
>>>     def h(row, done):  # done illustrates that all the rows of the keys are processed
>>>         cnt[0] += row.cnt
>>>         if done:
>>>             yield keys.word, cnt[0]
>>>     return h
>>>
>>> words_df.map_reduce(mapper, reducer, group='word')
- max()
Max value
- Parameters:
expr
- Returns:
- mean()
Arithmetic mean.
- Parameters:
expr
- Returns:
- median()
Median value.
- Parameters:
expr
- Returns:
- melt(id_vars=None, value_vars=None, var_name='variable', value_name='value', ignore_nan=False)
“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
expr – collection
id_vars – column(s) to use as identifier variables.
value_vars – column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name – name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.
value_name – name to use for the 'value' column.
ignore_nan – whether to ignore NaN values in data.
- Returns:
collection
- Example:
>>> df.melt(id_vars='id', value_vars=['col1', 'col2'])
>>> df.melt(id_vars=['id', 'id2'], value_vars=['col1', 'col2'], var_name='variable')
- min()
Min value
- Parameters:
expr
- Returns:
- min_max_scale(columns=None, feature_range=(0, 1), preserve=False, suffix='_scaled', group=None)
Resize a data frame by max / min values, i.e., (X - min(X)) / (max(X) - min(X))
- Parameters:
expr (DataFrame) – input DataFrame
feature_range – the target range to resize the value into, i.e., v * (b - a) + a
preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix
columns – column names to resize. If set to None, float- or int-typed columns will be normalized if the column is not specified as a group column.
group – determine scale groups. Scaling will be done in each group separately.
suffix (str) – column suffix to be appended to the scaled columns.
- Returns:
resized data frame
- Return type:
- moment(order, central=False)
Calculate the n-th order moment of the sequence
- Parameters:
expr
order – moment order, must be an integer
central – if central moments are to be computed.
- Returns:
- nunique()
The distinct count.
- Parameters:
expr
- Returns:
- outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Outer join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – when names conflict, the suffixes will be added to both columns.
mapjoin – whether to use mapjoin, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, with non-null value. If the value is 'left' or 'right', the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name, if not provided, will be the default project
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True as default
drop_table (bool) – drop the table if it exists, False as default
create_table (bool) – create the table first if it does not exist, True as default
drop_partition (bool) – drop the partition if it exists, False as default
create_partition (bool) – create the partition if it does not exist, None as default
cast (bool) – cast all columns' types to those of the existing table, False as default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- pivot(rows, columns, values=None)
Produce ‘pivot’ table based on 3 columns of this DataFrame. Uses unique values from rows / columns and fills with values.
- Parameters:
expr – collection
rows – used to make the new collection's grouped rows
columns – used to make the new collection's columns
values – values to use for populating the new collection's values
- Returns:
collection
- Example:
>>> df.pivot(rows='id', columns='category')
>>> df.pivot(rows='id', columns='category', values='sale')
>>> df.pivot(rows=['id', 'id2'], columns='category', values='sale')
- pivot_table(values=None, rows=None, columns=None, aggfunc='mean', fill_value=None)
Create a spreadsheet-style pivot table as a DataFrame.
- Parameters:
expr – collection
values (optional) – column to aggregate
rows – rows to group
columns – keys to group by on the pivot table column
aggfunc – aggregate function or functions
fill_value (optional) – value to replace missing values with, default None
- Returns:
collection
- Example:
>>> df
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
>>> table = df.pivot_table(values='D', rows=['A', 'B'], columns='C', aggfunc='sum')
>>> table
     A    B  large_D_sum  small_D_sum
0  bar  one          4.0          5.0
1  bar  two          7.0          6.0
2  foo  one          4.0          1.0
3  foo  two          NaN          6.0
- quantile(prob=None, **kw)
Percentile value.
- Parameters:
expr
prob – probability or list of probabilities, in [0, 1]
- Returns:
- query(expr)[source]
Query the data with a boolean expression.
- Parameters:
expr – the query string; you can use the '@' character to refer to environment variables.
- Returns:
new collection
- Return type:
- reshuffle(by=None, sort=None, ascending=True)
Reshuffle data.
- Parameters:
expr
by – the sequence or scalar to shuffle by. RandomScalar as default
sort – the sequence or scalar to sort by.
ascending – True if ascending else False
- Returns:
collection
- right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Right join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – when names conflict, the suffixes will be added to both columns.
mapjoin – whether to use mapjoin, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, with non-null value. If the value is 'left' or 'right', the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- roles(clear_features=True, **field_roles)
Set roles of fields
- Parameters:
clear_features – clear feature roles on fields
field_roles
- Returns:
- sample(parts=None, columns=None, i=None, n=None, frac=None, replace=False, weights=None, strata=None, random_state=None)
Sample collection.
- Parameters:
expr – collection
parts – how many parts to hash into
columns – the columns to sample
i – the part to sample out, can be a list of parts, must be from 0 to parts-1
n – how many rows to sample. If strata is specified, n should be a dict with values in the strata column as dictionary keys and corresponding sample sizes as values
frac – fraction to sample. If strata is specified, frac should be a dict with values in the strata column as dictionary keys and corresponding sample weights as values
replace – whether to perform sampling with replacement
weights – the column name of weights
strata – the name of the strata column
random_state – the random seed when performing sampling
- Returns:
collection
Note that n, frac, replace, weights, strata and random_state can only be used under Pandas DataFrames or XFlow.
- Example:
Sampling with parts:
>>> df.sample(parts=1)
>>> df.sample(parts=5, i=0)
>>> df.sample(parts=10, columns=['name'])
Sampling with fraction or weights, replacement option can be specified:
>>> df.sample(n=100)
>>> df.sample(frac=0.1)
>>> df.sample(frac=0.1, replace=True)
Sampling with weight column:
>>> df.sample(n=100, weights='weight_col')
>>> df.sample(n=100, weights='weight_col', replace=True)
Stratified sampling. Note that currently we do not support stratified sampling with replacement.
>>> df.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
- select(*fields, **kw)[source]
Project columns. Remember to avoid conflicts between column names.
- Parameters:
fields – columns to project
kw – columns and their names to project
- Returns:
new collection
- Return type:
odps.df.expr.expression.CollectionExpr
- setdiff(*rights, **kwargs)
Exclude data from a collection, like the except clause in SQL. All collections involved should have the same schema.
- Parameters:
left – collection to drop data from
rights – collection or list of collections
distinct – whether to remove duplicate entries in the result
- Returns:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
- size()
Number of values (rows).
- Parameters:
expr
- Returns:
- skew()
Calculate skewness of the sequence
- Parameters:
expr
- Returns:
- sort(by, ascending=True)
Sort the collection by values. sort is an alias for sort_values.
- Parameters:
expr – collection
by – the sequence or sequences to sort by
ascending – sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by.
- Returns:
Sorted collection
- Example:
>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
- sort_values(by, ascending=True)
Sort the collection by values. sort is an alias for sort_values.
- Parameters:
expr – collection
by – the sequence or sequences to sort by
ascending – sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by.
- Returns:
Sorted collection
- Example:
>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
- split(frac, seed=None)
Split the current collection into two DataFrame objects with a certain ratio.
- Parameters:
frac (float) – split ratio
- Returns:
two split DataFrame objects
- std(**kw)
Standard deviation.
- Parameters:
expr
kw
- Returns:
- std_scale(columns=None, with_means=True, with_std=True, preserve=False, suffix='_scaled', group=None)
Resize a data frame by mean and standard error.
- Parameters:
expr (DataFrame) – input DataFrame
with_means (bool) – determine whether means will be subtracted from the output
with_std (bool) – determine whether the output will be divided by standard deviations
preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix
columns – column names to resize. If set to None, float- or int-typed columns will be normalized if the column is not specified as a group column.
group – determine scale groups. Scaling will be done in each group separately.
suffix (str) – column suffix to be appended to the scaled columns.
- Returns:
resized data frame
- Return type:
- sum()
Sum value
- Parameters:
expr
- Returns:
- switch(*args, **kw)
Similar to the CASE ... WHEN clause in SQL. Refer to the example below.
- Parameters:
expr
args
kw
- Returns:
sequence or scalar
- Example:
>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
- tail(n=None, **kwargs)[source]
Return the last n rows. Execute at once.
- Parameters:
n
- Returns:
result frame
- Return type:
odps.df.backends.frame.ResultFrame
- to_kv(columns=None, kv_delim=':', item_delim=',', kv_name='kv_col')
Merge values in specified columns into a key-value represented column.
- Parameters:
expr (DataFrame) – input DataFrame
columns – the columns to be merged.
kv_delim (str) – delimiter between key and value.
item_delim (str) – delimiter between key-value pairs.
kv_name (str) – name of the new key-value column
- Returns:
converted data frame
- Return type:
- Example:
>>> df
    name   k1   k2   k3    k5   k7   k9
0  name1  1.0  3.0  NaN  10.0  NaN  NaN
1  name2  NaN  3.0  5.1   NaN  NaN  NaN
2  name3  7.1  NaN  NaN   NaN  8.2  NaN
3  name4  NaN  1.2  1.5   NaN  NaN  NaN
4  name5  NaN  1.0  NaN   NaN  NaN  1.1
>>> table = df.to_kv(columns=['k1', 'k2', 'k3', 'k5', 'k7', 'k9'], kv_delim='=')
>>> table
    name                 kv_col
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2          k2=3.0,k3=5.1
2  name3          k1=7.1,k7=8.2
3  name4          k2=1.2,k3=1.5
4  name5          k2=1.0,k9=1.1
- to_pandas(wrap=False, **kwargs)[source]
Convert to a pandas DataFrame. Execute at once.
- Parameters:
wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame
- Returns:
pandas DataFrame
- tolist(**kwargs)
Pack all data in the sequence into a list.
- Parameters:
expr
unique – make every element in the sequence unique
- Returns:
- union(right, distinct=False)
Union two collections.
- Parameters:
left – left collection
right – right collection
distinct
- Returns:
collection
- Example:
>>> df['name', 'id'].union(df2['id', 'name'])
- var(**kw)
Variance
- Parameters:
expr
ddof – degrees of freedom
kw
- Returns:
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds, else False
- Return type:
bool
- class odps.df.SequenceExpr(*args, **kwargs)[源代码]
Sequence represents one-dimensional data.
- ast()
Return the AST string.
- 返回:
AST tree
- 返回类型:
str
- astype(data_type)[源代码]
Cast to a new data type.
- 参数:
data_type – the new data type
- 返回:
casted sequence
- Example:
>>> df.id.astype('float')
- between(left, right, inclusive=True)
Return a boolean sequence or scalar showing whether each element is between left and right.
- 参数:
expr – sequence or scalar
left – left value
right – right value
inclusive – if True, the test is left <= expr <= right; otherwise left < expr < right
- 返回:
boolean sequence or scalar
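The inclusive flag maps directly onto the two comparison forms. A minimal plain-Python sketch of the semantics (the `between` helper below is hypothetical, not part of the PyODPS API):

```python
def between(value, left, right, inclusive=True):
    # Mirrors seq.between(left, right): bounds are inclusive by default.
    if inclusive:
        return left <= value <= right
    return left < value < right

# Element-wise over a plain list, as the sequence version behaves.
print([between(x, 2, 4) for x in [1, 2, 3, 4, 5]])  # [False, True, True, True, False]
```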
- compile()
Compile this expression into ODPS SQL.
- 返回:
compiled DAG
- 返回类型:
str
- concat(rights, distinct=False, axis=0)
Concat collections.
- 参数:
left – left collection
rights – right collections, can be a DataFrame object or a list of DataFrames
distinct – whether to remove duplicate entries. only available when axis == 0
axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.
- 返回:
collection
Note that axis==1 can only be used under Pandas DataFrames or XFlow.
- Example:
>>> df['name', 'id'].concat(df2['score'], axis=1)
- continuous()
Set sequence to be continuous.
- 返回类型:
Column
- Example:
>>> # Table schema: create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Set ``f1`` and ``f2`` to be continuous
>>> new_ds = df.continuous('f1 f2')
- count()
Value counts
- 参数:
expr
- 返回:
- cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)
Return indices of half-open bins to which each value of expr belongs.
- 参数:
expr – sequence or scalar
bins – list of scalars
right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.
include_lowest – Whether the first interval should be left-inclusive or not.
include_under – include the bin below the leftmost edge or not
include_over – include the bin above the rightmost edge or not
- 返回:
sequence or scalar
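The binning rule can be sketched in plain Python; the `cut_one` helper below is hypothetical (PyODPS applies this element-wise over the sequence), and it covers only the `right` and `include_lowest` options:

```python
def cut_one(value, bins, right=True, include_lowest=False):
    # Return the 0-based index of the half-open bin containing value, or
    # None when it falls outside every bin. With right=True, the bins
    # [b0, b1, ..., bn] form intervals (b0, b1], (b1, b2], ..., (b_{n-1}, bn].
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if right:
            lo_ok = value >= lo if (include_lowest and i == 0) else value > lo
            if lo_ok and value <= hi:
                return i
        else:
            if lo <= value < hi:
                return i
    return None

print([cut_one(v, [1, 2, 3, 4]) for v in [1, 2, 2.5, 4, 5]])  # [None, 0, 1, 2, None]
```

Note how the leftmost edge is excluded by default, which is why `include_lowest` exists.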
- discrete()
Set sequence to be discrete.
- 返回类型:
Column
- Example:
>>> # Table schema: create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Set ``f1`` and ``f2`` to be discrete
>>> new_ds = df.discrete('f1 f2')
- drop(data, axis=0, columns=None)
Drop data from a DataFrame.
- 参数:
expr – collection to drop data from
data – data to be removed
axis – 0 for deleting rows, 1 for columns.
columns – columns of data to select, only useful when axis == 0
- 返回:
collection
- Example:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
- property dtype
Return the data type. Available types: int8, int16, int32, int64, float32, float64, boolean, string, decimal, datetime
- 返回:
the data type
- erase_key_value()
Erase key-value represented fields.
- 返回类型:
Column
- Example:
>>> new_ds = df.erase_key_value('f1 f2')
- except_(*rights, **kwargs)
Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.
- 参数:
left – collection to drop data from
rights – collection or list of collections
distinct – whether to remove duplicate entries
- 返回:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
- execute(**kwargs)
- 参数:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
- 返回:
execution result
- 返回类型:
odps.df.backends.frame.ResultFrame
- fillna(value)
Fill null with value.
- 参数:
expr – sequence or scalar
value – value to fill into
- 返回:
sequence or scalar
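The behavior is simple to sketch in plain Python; the list-based `fillna` helper below is hypothetical, standing in for the element-wise operation PyODPS applies:

```python
def fillna(values, fill):
    # Mirrors seq.fillna(value): replace nulls (None here) with the fill value.
    return [fill if v is None else v for v in values]

print(fillna([1, None, 3], 0))  # [1, 0, 3]
```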
- hash(func=None)
Calculate the hash value.
- 参数:
expr
func – hash function
- 返回:
- hll_count(error_rate=0.01, splitter=None)
Calculate HyperLogLog count
- 参数:
expr
error_rate (float) – error rate
splitter – the splitter to split the column value
- 返回:
sequence or scalar
- Example:
>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
- inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Inner join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- intersect(*rights, **kwargs)
Calculate the intersection among datasets.
- 参数:
left – collection
rights – collection or list of collections
distinct – whether to remove duplicate entries
- 返回:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
- isin(values)
Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
- 参数:
expr – sequence or scalar
values – list object or sequence
- 返回:
boolean sequence or scalar
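The membership test can be sketched in plain Python; the `isin` helper below is hypothetical, and `notin` is simply its negation:

```python
def isin(values, candidates):
    # Mirrors seq.isin(values): element-wise exact membership test.
    cand = set(candidates)
    return [v in cand for v in values]

print(isin([1, 2, 3], [2, 3, 5]))  # [False, True, True]
```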
- isna()
Return a sequence or scalar according to the input indicating if the values are null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- isnull()
Return a sequence or scalar according to the input indicating if the values are null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
how – ‘inner’, ‘left’, ‘right’, or ‘outer’
suffixes – when name conflict, the suffix will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
- key_value(**kwargs)
Set fields to be key-value represented.
- 返回类型:
Column
- Example:
>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
- left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Left join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- map(func, rtype=None, resources=None, args=(), **kwargs)
Call func on each element of this sequence.
- 参数:
func – lambda, function,
odps.models.Function
, or str which is the name of odps.models.Function
rtype – if not provided, will be the dtype of this sequence
- 返回:
a new sequence
- Example:
>>> df.id.map(lambda x: x + 1)
- max()
Max value
- 参数:
expr
- 返回:
- min()
Min value
- 参数:
expr
- 返回:
- notin(values)
Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
- 参数:
expr – sequence or scalar
values – list object or sequence
- 返回:
boolean sequence or scalar
- notna()
Return a sequence or scalar according to the input indicating if the values are not null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- notnull()
Return a sequence or scalar according to the input indicating if the values are not null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- nunique()
The distinct count.
- 参数:
expr
- 返回:
- outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Outer join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.
- 参数:
name – table name
partitions (list) – list of string, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name, if not provided, will be the default project
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True as default
drop_table (bool) – drop table if exists, False as default
create_table (bool) – create table first if not exists, True as default
drop_partition (bool) – drop partition if exists, False as default
create_partition (bool) – create partition if not exists, None as default
cast (bool) – cast all columns’ types to those of the existing table, False as default
- 返回:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Right join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- role(role_name)
Set role of current column
- 参数:
role_name – name of the role to be selected.
- 返回:
- setdiff(*rights, **kwargs)
Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.
- 参数:
left – collection to drop data from
rights – collection or list of collections
distinct – whether to remove duplicate entries
- 返回:
collection
- Examples:
>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
- size()
Value counts
- 参数:
expr
- 返回:
- switch(*args, **kw)
Similar to the case-when in SQL. Refer to the example below
- 参数:
expr
args
kw
- 返回:
sequence or scalar
- Example:
>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
- to_pandas(wrap=False, **kwargs)[源代码]
Convert to pandas Series. Execute at once.
- 参数:
wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame
- 返回:
pandas Series
- tolist(**kwargs)
Pack all data in the sequence into a list.
- 参数:
expr
unique – if True, deduplicate the elements in the resulting list
- 返回:
list
- union(right, distinct=False)
Union two collections.
- 参数:
left – left collection
right – right collection
distinct
- 返回:
collection
- Example:
>>> df['name', 'id'].union(df2['id', 'name'])
- value_counts(sort=True, ascending=False, dropna=False)
Return object containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently occurring element. NA values are included by default; pass dropna=True to exclude them.
- 参数:
expr – sequence
sort (bool) – if sort
dropna – Don’t include counts of None, default False
- 返回:
collection with two columns
- 返回类型:
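The counting-and-sorting behavior can be sketched in plain Python; the list-based `value_counts` helper below is hypothetical, returning (value, count) pairs instead of a two-column collection:

```python
from collections import Counter

def value_counts(values, sort=True, ascending=False, dropna=False):
    # Mirrors seq.value_counts(): count each distinct value; most frequent
    # first by default; None is kept unless dropna=True.
    if dropna:
        values = [v for v in values if v is not None]
    pairs = list(Counter(values).items())
    if sort:
        pairs.sort(key=lambda kv: kv[1], reverse=not ascending)
    return pairs

print(value_counts(['a', 'b', 'a', 'a', None]))  # [('a', 3), ('b', 1), (None, 1)]
```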
- verify()
Verify if this expression can be compiled into ODPS SQL.
- 返回:
True if compilation succeed else False
- 返回类型:
bool
- class odps.df.expr.expressions.Int64SequenceExpr(*args, **kwargs)[源代码]
- kurt()
Calculate kurtosis of the sequence
- 参数:
expr
- 返回:
- kurtosis()
Calculate kurtosis of the sequence
- 参数:
expr
- 返回:
- mean()
Arithmetic mean.
- 参数:
expr
- 返回:
- median()
Median value.
- 参数:
expr
- 返回:
- moment(order, central=False)
Calculate the n-th order moment of the sequence
- 参数:
expr
order – moment order, must be an integer
central – if central moments are to be computed.
- 返回:
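The raw vs. central distinction can be written out directly; the list-based `moment` helper below is a hypothetical sketch of the formula, not the PyODPS implementation:

```python
def moment(values, order, central=False):
    # Mirrors seq.moment(order, central): the raw moment E[x^n], or the
    # central moment E[(x - mean)^n] when central=True.
    n = len(values)
    if central:
        mu = sum(values) / n
        return sum((v - mu) ** order for v in values) / n
    return sum(v ** order for v in values) / n

print(moment([1, 2, 3], 2))                # ≈ 4.667  (raw second moment)
print(moment([1, 2, 3], 2, central=True))  # ≈ 0.667  (population variance)
```

Note that the second central moment equals the population variance (ddof=0).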
- quantile(prob=None, **kw)
Percentile value.
- 参数:
expr
prob – probability or list of probabilities, in [0, 1]
- 返回:
- skew()
Calculate skewness of the sequence
- 参数:
expr
- 返回:
- std(**kw)
Standard deviation.
- 参数:
expr
kw
- 返回:
- sum()
Sum value
- 参数:
expr
- 返回:
- to_datetime()
Return a sequence or scalar that is the datetime value of the current numeric sequence or scalar.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- var(**kw)
Variance
- 参数:
expr
ddof – degree of freedom
kw
- 返回:
- class odps.df.expr.expressions.StringSequenceExpr(*args, **kwargs)[源代码]
- capitalize()
Convert strings in the Sequence or string scalar to be capitalized. Equivalent to str.capitalize().
- 参数:
expr
- 返回:
sequence or scalar
- cat(others=None, sep=None, na_rep=None)
Concatenate strings in sequence with given separator
- 参数:
expr
others – other sequences
sep – string or None, default None
na_rep – string or None default None, if None, NA in the sequence are ignored
- 返回:
- contains(pat, case=True, flags=0, regex=True)
Return boolean sequence whether given pattern/regex is contained in each string in the sequence
- 参数:
expr – sequence or scalar
pat – Character sequence or regular expression
case (bool) – If True, case sensitive
flags – re module flags, e.g. re.IGNORECASE
regex – If True use regex, otherwise use string finder
- 返回:
sequence or scalar
- count(*args, **kwargs)
Value counts
- 参数:
expr
- 返回:
- endswith(pat)
Return boolean sequence or scalar indicating whether each string in the sequence or scalar ends with passed pattern. Equivalent to str.endswith().
- 参数:
expr
pat – Character sequence
- 返回:
sequence or scalar
- extract(pat, flags=0, group=0)
Find group in each string in the Series using passed regular expression.
- 参数:
expr
pat – Pattern or regular expression
flags – re module, e.g. re.IGNORECASE
group – if None as group 0
- 返回:
sequence or scalar
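The pattern/group behavior can be sketched with the standard re module; the list-based `extract` helper below is hypothetical, applying the search element-wise as the sequence version would:

```python
import re

def extract(values, pat, flags=0, group=0):
    # Mirrors seq.extract(pat, group=...): return the given group of the
    # first regex match in each string, or None when nothing matches.
    rx = re.compile(pat, flags)
    out = []
    for s in values:
        m = rx.search(s)
        out.append(m.group(group) if m else None)
    return out

print(extract(['a1', 'b22', 'c'], r'\d+'))  # ['1', '22', None]
```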
- find(sub, start=0, end=None)
Return the lowest index in each string in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.find().
- 参数:
expr
sub – substring being searched
start – left edge index
end – right edge index
- 返回:
sequence or scalar
- get(index)
Extract element from lists, tuples, or strings in each element in the sequence or scalar
- 参数:
expr
index – Integer index(location)
- 返回:
sequence or scalar
- isalnum()
Check whether all characters in each string in the sequence or scalar are alphanumeric. Equivalent to str.isalnum().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isalpha()
Check whether all characters in each string in the sequence or scalar are alphabetic. Equivalent to str.isalpha().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isdecimal()
Check whether all characters in each string in the sequence or scalar are decimal. Equivalent to str.isdecimal().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isdigit()
Check whether all characters in each string in the sequence or scalar are digits. Equivalent to str.isdigit().
- 参数:
expr
- 返回:
boolean sequence or scalar
- islower()
Check whether all characters in each string in the sequence or scalar are lowercase. Equivalent to str.islower().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isnumeric()
Check whether all characters in each string in the sequence or scalar are numeric. Equivalent to str.isnumeric().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isspace()
Check whether all characters in each string in the sequence or scalar are whitespace. Equivalent to str.isspace().
- 参数:
expr
- 返回:
boolean sequence or scalar
- istitle()
Check whether all characters in each string in the sequence or scalar are titlecase. Equivalent to str.istitle().
- 参数:
expr
- 返回:
boolean sequence or scalar
- isupper()
Check whether all characters in each string in the sequence or scalar are uppercase. Equivalent to str.isupper().
- 参数:
expr
- 返回:
boolean sequence or scalar
- len()
Compute length of each string in the sequence or scalar
- 参数:
expr
- 返回:
lengths
- ljust(width, fillchar=' ')
Filling right side of strings in the sequence or scalar with an additional character. Equivalent to str.ljust().
- 参数:
expr
width – Minimum width of resulting string; additional characters will be filled with fillchar
fillchar – Additional character for filling, default is whitespace.
- 返回:
sequence or scalar
- lower()
Convert strings in the sequence or scalar to lowercase. Equivalent to str.lower().
- 参数:
expr
- 返回:
sequence or scalar
- lstrip(to_strip=None)
Strip whitespace (including newlines) from each string in the sequence or scalar from left side. Equivalent to str.lstrip().
- 参数:
expr
to_strip
- 返回:
sequence or scalar
- pad(width, side='left', fillchar=' ')
Pad strings in the sequence or scalar with an additional character to specified side.
- 参数:
expr
width – Minimum width of resulting string; additional characters will be filled with spaces
side – {‘left’, ‘right’, ‘both’}, default ‘left’
fillchar – Additional character for filling, default is whitespace
- 返回:
sequence or scalar
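The side parameter maps onto the three standard-library padding methods; the scalar `pad` helper below is a hypothetical sketch of the per-element behavior:

```python
def pad(s, width, side='left', fillchar=' '):
    # Mirrors seq.pad(): pad each string to `width` on the chosen side(s).
    if side == 'left':
        return s.rjust(width, fillchar)
    if side == 'right':
        return s.ljust(width, fillchar)
    return s.center(width, fillchar)  # side == 'both'

print(pad('7', 3, fillchar='0'))  # '007'
```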
- repeat(repeats)
Duplicate each string in the sequence or scalar by indicated number of times.
- 参数:
expr
repeats – times
- 返回:
sequence or scalar
- replace(pat, repl, n=-1, case=True, flags=0, regex=True)
Replace occurrence of pattern/regex in the sequence or scalar with some other string. Equivalent to str.replace()
- 参数:
expr
pat – Character sequence or regular expression
repl – Replacement
n – Number of replacements to make from start
case – if True, case sensitive
flags – re module flag, e.g. re.IGNORECASE
- 返回:
sequence or scalar
- rfind(sub, start=0, end=None)
Return the highest index in each string in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.rfind().
- 参数:
expr
sub
start
end
- 返回:
sequence or scalar
- rjust(width, fillchar=' ')
Filling left side of strings in the sequence or scalar with an additional character. Equivalent to str.rjust().
- 参数:
expr
width – Minimum width of resulting string; additional characters will be filled with fillchar
fillchar – Additional character for filling, default is whitespace.
- 返回:
sequence or scalar
- rstrip(to_strip=None)
Strip whitespace (including newlines) from each string in the sequence or scalar from right side. Equivalent to str.rstrip().
- 参数:
expr
to_strip
- 返回:
sequence or scalar
- slice(start=None, stop=None, step=None)
Slice substrings from each element in the sequence or scalar
- 参数:
expr
start – int or None
stop – int or None
step – int or None
- 返回:
sliced
- split(pat=None, n=-1)
Split each string (a la re.split) in the Series/Index by given pattern, propagating NA values. Equivalent to str.split().
- 参数:
expr
pat – Separator to split on. If None, splits on whitespace
n – not supported right now
- 返回:
list sequence or scalar
- startswith(pat)
Return boolean sequence or scalar indicating whether each string in the sequence or scalar starts with passed pattern. Equivalent to str.startswith().
- 参数:
expr
pat – Character sequence
- 返回:
sequence or scalar
- strip(to_strip=None)
Strip whitespace (including newlines) from each string in the sequence or scalar from left and right sides. Equivalent to str.strip().
- 参数:
expr
to_strip
- 返回:
sequence or scalar
- strptime(date_format)
Return datetimes specified by date_format, which supports the same format directives as the Python standard library. Details can be found in the Python strftime/strptime format documentation.
- 参数:
expr
date_format (str) – date format string (e.g. “%Y-%m-%d”)
- 返回:
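Since the directives match the standard library, the accepted format strings can be checked locally with datetime.strptime; for example, a value parsed the way df.col.strptime('%Y-%m-%d %H:%M:%S') would parse it:

```python
from datetime import datetime

# Parse a timestamp string with standard strptime directives.
dt = datetime.strptime('2024-03-15 08:30:00', '%Y-%m-%d %H:%M:%S')
print(dt.year, dt.month, dt.hour)  # 2024 3 8
```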
- sum()
Sum value
- 参数:
expr
- 返回:
- swapcase()
Convert strings in the sequence or scalar to be swapcased. Equivalent to str.swapcase().
- 参数:
expr
- 返回:
converted
- title()
Convert strings in the sequence or scalar to titlecase. Equivalent to str.title().
- 参数:
expr
- 返回:
converted
- todict(item_delim=',', kv_delim='=')
Convert the string sequence / expr into a string dict given item and key-value delimiters.
- 参数:
expr
item_delim – delimiter between data items
kv_delim – delimiter between keys and values
- 返回:
dict sequence or scalar
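The parsing rule can be sketched for a single string; the scalar `todict` helper below is hypothetical, using the same default delimiters as the method signature:

```python
def todict(s, item_delim=',', kv_delim='='):
    # Mirrors seq.todict(): split into items, then split each item once
    # into a key and a value.
    return dict(item.split(kv_delim, 1) for item in s.split(item_delim))

print(todict('k1=1.0,k2=3.0'))  # {'k1': '1.0', 'k2': '3.0'}
```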
- upper()
Convert strings in the sequence or scalar to uppercase. Equivalent to str.upper().
- 参数:
expr
- 返回:
sequence or scalar
- zfill(width)
Filling left side of strings in the sequence or scalar with 0. Equivalent to str.zfill().
- 参数:
expr
width – Minimum width of resulting string; additional characters will be filled with 0
- 返回:
filled
- class odps.df.Scalar(*args, **kwargs)[源代码]
Represents a scalar value.
- 参数:
_value – value of the scalar
_value_type – value type of the scalar
- Example:
>>> df[df, Scalar(4).rename('append_const')]
- ast()
Return the AST string.
- 返回:
AST tree
- 返回类型:
str
- between(left, right, inclusive=True)
Return a boolean sequence or scalar showing whether each element is between left and right.
- 参数:
expr – sequence or scalar
left – left value
right – right value
inclusive – if True, the test is left <= expr <= right; otherwise left < expr < right
- 返回:
boolean sequence or scalar
- compile()
Compile this expression into ODPS SQL.
- 返回:
compiled DAG
- 返回类型:
str
- cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)
Return indices of half-open bins to which each value of expr belongs.
- 参数:
expr – sequence or scalar
bins – list of scalars
right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.
include_lowest – Whether the first interval should be left-inclusive or not.
include_under – include the bin below the leftmost edge or not
include_over – include the bin above the rightmost edge or not
- 返回:
sequence or scalar
- execute(**kwargs)
- 参数:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
- 返回:
execution result
- 返回类型:
odps.df.backends.frame.ResultFrame
- fillna(value)
Fill null with value.
- 参数:
expr – sequence or scalar
value – value to fill into
- 返回:
sequence or scalar
- hash(func=None)
Calculate the hash value.
- 参数:
expr
func – hash function
- 返回:
- inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Inner join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- isin(values)
Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
- 参数:
expr – sequence or scalar
values – list object or sequence
- 返回:
boolean sequence or scalar
- isna()
Return a sequence or scalar according to the input indicating if the values are null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- isnull()
Return a sequence or scalar according to the input indicating if the values are null.
- 参数:
expr – sequence or scalar
- 返回:
sequence or scalar
- join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
how – ‘inner’, ‘left’, ‘right’, or ‘outer’
suffixes – when name conflict, the suffix will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
- left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Left join two collections.
If on is not specified, the common fields of the left and right collections are used. suffixes controls how conflicting column names are renamed: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- 参数:
left – left collection
right – right collection
on – fields to join on
suffixes – when name conflict, the suffixes will be added to both columns.
mapjoin – set use mapjoin or not, default value False.
merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
- 返回:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- map(func, rtype=None, resources=None, args=(), **kwargs)
Call func on each element of this sequence.
- 参数:
func – lambda, function,
odps.models.Function
, or str which is the name of odps.models.Function
rtype – if not provided, will be the dtype of this sequence
- 返回:
a new sequence
- Example:
>>> df.id.map(lambda x: x + 1)
- notin(values)
Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
- Parameters:
expr – sequence or scalar
values – list object or sequence
- Returns:
boolean sequence or scalar
- notna()
Return a sequence or scalar according to the input indicating if the values are not null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- notnull()
Return a sequence or scalar according to the input indicating if the values are not null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Outer join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
merge_columns – whether to merge columns with the same name into one column without a suffix. If True, same-named columns in the predicate are merged, taking the non-null value. If 'left' or 'right', the value from the left / right collection is taken. You can also pass a dictionary describing the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions is not specified, a new table without partitions is created if it does not exist, and the SQL result is inserted into it. If partitions is specified, those fields become the partition fields of the new table. If partition is specified, the data is inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name; if not provided, the default project is used
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True by default
drop_table (bool) – drop the table if it exists, False by default
create_table (bool) – create the table first if it does not exist, True by default
drop_partition (bool) – drop the partition if it exists, False by default
create_partition (bool) – create the partition if it does not exist, None by default
cast (bool) – cast all columns' types to those of the existing table, False by default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Right join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
merge_columns – whether to merge columns with the same name into one column without a suffix. If True, same-named columns in the predicate are merged, taking the non-null value. If 'left' or 'right', the value from the left / right collection is taken. You can also pass a dictionary describing the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- switch(*args, **kw)
Similar to the case-when construct in SQL. Refer to the example below.
- Parameters:
expr
args
kw
- Returns:
sequence or scalar
- Example:
>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds, otherwise False
- Return type:
bool
- odps.df.NullScalar(tp)[source]
Creates a Scalar representing typed None values.
- Parameters:
tp – type of the scalar
- Returns:
Scalar with None value
- class odps.df.RandomScalar(seed=None, **kw)[source]
Represents a random scalar.
- Parameters:
seed – random seed, None by default
- Example:
>>> df[df, RandomScalar().rename('append_random')]
- ast()
Return the AST string.
- Returns:
AST tree
- Return type:
str
- between(left, right, inclusive=True)
Return a boolean sequence or scalar showing whether each element is between left and right.
- Parameters:
expr – sequence or scalar
left – left value
right – right value
inclusive – if True, the test is left <= expr <= right; otherwise left < expr < right
- Returns:
boolean sequence or scalar
- compile()
Compile this expression into ODPS SQL.
- Returns:
compiled DAG
- Return type:
str
- cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)
Return indices of half-open bins to which each value of expr belongs.
- Parameters:
expr – sequence or scalar
bins – list of scalars
right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
labels – used as labels for the resulting bins. Must be of the same length as the resulting bins.
include_lowest – whether the first interval should be left-inclusive or not
include_under – include the bin below the leftmost edge or not
include_over – include the bin above the rightmost edge or not
- Returns:
sequence or scalar
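The bin semantics mirror those of pandas.cut, which can serve as a local sketch of the right and include_lowest parameters (pandas is used here only for illustration, with made-up data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# right=True: bins [1, 2, 3, 4] denote (1, 2], (2, 3], (3, 4];
# include_lowest=True makes the first interval [1, 2] so that 1 is binned too.
codes = pd.cut(s, bins=[1, 2, 3, 4], right=True, labels=False, include_lowest=True)
print(codes.tolist())  # [0, 0, 1, 2]
```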
- execute(**kwargs)
- Parameters:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
- Returns:
execution result
- Return type:
odps.df.backends.frame.ResultFrame
- fillna(value)
Fill null with value.
- Parameters:
expr – sequence or scalar
value – value to fill into
- Returns:
sequence or scalar
- hash(func=None)
Calculate the hash value.
- Parameters:
expr
func – hash function
- Returns:
- inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Inner join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- isin(values)
Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
- Parameters:
expr – sequence or scalar
values – list object or sequence
- Returns:
boolean sequence or scalar
- isna()
Return a sequence or scalar according to the input indicating if the values are null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- isnull()
Return a sequence or scalar according to the input indicating if the values are null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)
Join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
how – 'inner', 'left', 'right', or 'outer'
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
skewjoin – whether to use skewjoin, False by default. Can be True if the data is skewed, a list of columns with skewed values, or a list of dicts specifying skewed value combinations.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
- left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Left join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
merge_columns – whether to merge columns with the same name into one column without a suffix. If True, same-named columns in the predicate are merged, taking the non-null value. If 'left' or 'right', the value from the left / right collection is taken. You can also pass a dictionary describing the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- map(func, rtype=None, resources=None, args=(), **kwargs)
Call func on each element of this sequence.
- Parameters:
func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function
rtype – if not provided, will be the dtype of this sequence
- Returns:
a new sequence
- Example:
>>> df.id.map(lambda x: x + 1)
- notin(values)
Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
- Parameters:
expr – sequence or scalar
values – list object or sequence
- Returns:
boolean sequence or scalar
- notna()
Return a sequence or scalar according to the input indicating if the values are not null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- notnull()
Return a sequence or scalar according to the input indicating if the values are not null.
- Parameters:
expr – sequence or scalar
- Returns:
sequence or scalar
- outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Outer join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
merge_columns – whether to merge columns with the same name into one column without a suffix. If True, same-named columns in the predicate are merged, taking the non-null value. If 'left' or 'right', the value from the left / right collection is taken. You can also pass a dictionary describing the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions is not specified, a new table without partitions is created if it does not exist, and the SQL result is inserted into it. If partitions is specified, those fields become the partition fields of the new table. If partition is specified, the data is inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name; if not provided, the default project is used
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True by default
drop_table (bool) – drop the table if it exists, False by default
create_table (bool) – create the table first if it does not exist, True by default
drop_partition (bool) – drop the partition if it exists, False by default
create_partition (bool) – create the partition if it does not exist, None by default
cast (bool) – cast all columns' types to those of the existing table, False by default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)
Right join two collections.
If on is not specified, the common fields of the left and right collections are used. When column names conflict, the given suffixes are appended automatically: for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.
- Parameters:
left – left collection
right – right collection
on – fields to join on
suffixes – suffixes appended to both columns when names conflict.
mapjoin – whether to use mapjoin, False by default.
merge_columns – whether to merge columns with the same name into one column without a suffix. If True, same-named columns in the predicate are merged, taking the non-null value. If 'left' or 'right', the value from the left / right collection is taken. You can also pass a dictionary describing the behavior of each column, such as {'a': 'auto', 'b': 'left'}.
- Returns:
collection
- Example:
>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
- switch(*args, **kw)
Similar to the case-when construct in SQL. Refer to the example below.
- Parameters:
expr
args
kw
- Returns:
sequence or scalar
- Example:
>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds, otherwise False
- Return type:
bool
- class odps.df.expr.groupby.GroupBy(*args, **kwargs)[source]
- all()
Whether all elements are True.
- Parameters:
expr
- Returns:
- any()
Whether any element is True.
- Parameters:
expr
- Returns:
- ast()
Return the AST string.
- Returns:
AST tree
- Return type:
str
- compile()
Compile this expression into ODPS SQL.
- Returns:
compiled DAG
- Return type:
str
- count()
Value counts
- Parameters:
expr
- Returns:
- cume_dist(sort=None, ascending=True)
Calculate cumulative ratio of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- dense_rank(sort=None, ascending=True)
Calculate dense rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- execute(**kwargs)
- Parameters:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
- Returns:
execution result
- Return type:
odps.df.backends.frame.ResultFrame
- kurt()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- kurtosis()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- max()
Max value
- Parameters:
expr
- Returns:
- mean()
Arithmetic mean.
- Parameters:
expr
- Returns:
- median()
Median value.
- Parameters:
expr
- Returns:
- min()
Min value
- Parameters:
expr
- Returns:
- min_rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- moment(order, central=False)
Calculate the n-th order moment of the sequence
- Parameters:
expr
order – moment order, must be an integer
central – whether central moments are to be computed
- Returns:
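The distinction between raw and central moments can be sketched in plain Python. This is a hypothetical helper for illustration, not the PyODPS implementation:

```python
def moment(values, order, central=False):
    # Raw n-th moment: mean(x ** n); central n-th moment: mean((x - mean) ** n).
    m = sum(values) / len(values) if central else 0.0
    return sum((x - m) ** order for x in values) / len(values)

data = [1, 2, 3, 4]
print(moment(data, 2))                # raw second moment: 7.5
print(moment(data, 2, central=True))  # central second moment (population variance): 1.25
```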
- nth_value(nth, skip_nulls=False, sort=None, ascending=True)
Get nth value of a grouped and sorted expression.
- Parameters:
expr – expression for calculation
nth – integer position
skip_nulls – whether to skip null values, False by default
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
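The grouped-and-sorted semantics can be approximated with pandas (illustrative only; the frame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b'], 'val': [3, 1, 2, 5]})
# Roughly nth_value(0) with a sort column: the first value of each group
# after sorting, broadcast back to every row of that group.
first_sorted = df.sort_values('val').groupby('grp')['val'].transform('first')
print(first_sorted.sort_index().tolist())  # [1, 1, 1, 5]
```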
- nunique()
The distinct count.
- Parameters:
expr
- Returns:
- percent_rank(sort=None, ascending=True)
Calculate percentage rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions is not specified, a new table without partitions is created if it does not exist, and the SQL result is inserted into it. If partitions is specified, those fields become the partition fields of the new table. If partition is specified, the data is inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name; if not provided, the default project is used
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True by default
drop_table (bool) – drop the table if it exists, False by default
create_table (bool) – create the table first if it does not exist, True by default
drop_partition (bool) – drop the partition if it exists, False by default
create_partition (bool) – create the partition if it does not exist, None by default
cast (bool) – cast all columns' types to those of the existing table, False by default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- qcut(bins, labels=False, sort=None, ascending=True)
Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
- Parameters:
expr – expression for calculation
bins – number of bins
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
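pandas.qcut gives a local approximation of the quantile-bin indices; with labels=False it returns the bin index, as described above (illustrative only):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
# Four quantile-based bins of equal size; indices start from 0.
codes = pd.qcut(s, q=4, labels=False)
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```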
- quantile(prob=None, **kw)
Percentile value.
- Parameters:
expr
prob – probability or list of probabilities, in [0, 1]
- Returns:
- rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- row_number(sort=None, ascending=True)
Calculate row number of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- size()
Value counts
- Parameters:
expr
- Returns:
- skew()
Calculate skewness of the sequence
- Parameters:
expr
- Returns:
- std(**kw)
Standard deviation.
- Parameters:
expr
kw
- Returns:
- sum()
Sum value
- Parameters:
expr
- Returns:
- tolist(**kwargs)
Pack all data in the sequence into a list.
- Parameters:
expr
unique – make every element in the sequence unique
- Returns:
- var(**kw)
Variance
- Parameters:
expr
ddof – degree of freedom
kw
- Returns:
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds, otherwise False
- Return type:
bool
- class odps.df.expr.groupby.SequenceGroupBy(*args, **kwargs)[source]
- ast()
Return the AST string.
- Returns:
AST tree
- Return type:
str
- compile()
Compile this expression into ODPS SQL.
- Returns:
compiled DAG
- Return type:
str
- count()
Value counts
- Parameters:
expr
- Returns:
- cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative count of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cume_dist(sort=None, ascending=True)
Calculate cumulative ratio of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative maximum of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative minimum of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- dense_rank(sort=None, ascending=True)
Calculate dense rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- execute(**kwargs)
- Parameters:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
- Returns:
execution result
- Return type:
odps.df.backends.frame.ResultFrame
- hll_count(error_rate=0.01, splitter=None)
Calculate HyperLogLog count
- Parameters:
expr
error_rate (float) – error rate
splitter – the splitter to split the column value
- Returns:
sequence or scalar
- Example:
>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
- lag(offset, default=None, sort=None, ascending=True)
Get value in the row offset rows prior to the current row.
- Parameters:
offset – the offset value
default – default value for the function, when there are no rows satisfying the offset
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- lead(offset, default=None, sort=None, ascending=True)
Get value in the row offset rows after the current row.
- Parameters:
offset – the offset value
default – default value for the function, when there are no rows satisfying the offset
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
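The per-group offset semantics of lag and lead resemble a grouped shift in pandas (a sketch with made-up data; the default argument corresponds to filling the positions that have no offset row):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b', 'b'],
                   'val': [1, 2, 3, 10, 20]})
# lag(1, default=0): the previous row's value within each group, 0 where none exists.
lagged = df.groupby('grp')['val'].shift(1).fillna(0).astype(int)
print(lagged.tolist())  # [0, 1, 2, 0, 10]
# lead(1): the next row's value within each group (missing at group ends).
led = df.groupby('grp')['val'].shift(-1)
print(led.tolist())
```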
- max()
Max value
- Parameters:
expr
- Returns:
- min()
Min value
- Parameters:
expr
- Returns:
- min_rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- nth_value(nth, skip_nulls=False, sort=None, ascending=True)
Get nth value of a grouped and sorted expression.
- Parameters:
expr – expression for calculation
nth – integer position
skip_nulls – whether to skip null values, False by default
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- nunique()
The distinct count.
- Parameters:
expr
- Returns:
- percent_rank(sort=None, ascending=True)
Calculate percentage rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution into a new table. If partitions is not specified, a new table without partitions is created if it does not exist, and the SQL result is inserted into it. If partitions is specified, those fields become the partition fields of the new table. If partition is specified, the data is inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name; if not provided, the default project is used
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 by default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True by default
drop_table (bool) – drop the table if it exists, False by default
create_table (bool) – create the table first if it does not exist, True by default
drop_partition (bool) – drop the partition if it exists, False by default
create_partition (bool) – create the partition if it does not exist, None by default
cast (bool) – cast all columns' types to those of the existing table, False by default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- qcut(bins, labels=False, sort=None, ascending=True)
Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
- Parameters:
expr – expression for calculation
bins – number of bins
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- row_number(sort=None, ascending=True)
Calculate row number of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- size()
Value counts
- Parameters:
expr
- Returns:
- tolist(**kwargs)
Pack all data in the sequence into a list.
- Parameters:
expr
unique – make every element in the sequence unique
- Returns:
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds, otherwise False
- Return type:
bool
- class odps.df.expr.groupby.Int64SequenceGroupBy(*args, **kwargs)[source]
- ast()
Return the AST string.
- Returns:
AST tree
- Return type:
str
- compile()
Compile this expression into ODPS SQL.
- Returns:
compiled DAG
- Return type:
str
- count()
Value counts
- Parameters:
expr
- Returns:
- cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative count of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cume_dist(sort=None, ascending=True)
Calculate cumulative ratio of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative maximum of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cummean(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative mean of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cummedian(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative median of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative minimum of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cumstd(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative standard deviation of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
- cumsum(sort=None, ascending=True, unique=False, preceding=None, following=None)
Calculate cumulative summation of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
unique – whether to eliminate duplicate entries
preceding – the start point of a window
following – the end point of a window
- Returns:
calculated column
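The preceding and following bounds describe a window frame around the current row; a pandas rolling sum sketches the idea (illustrative only, not the PyODPS implementation):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# No bounds: a plain running total.
print(s.cumsum().tolist())  # [1, 3, 6, 10]
# preceding=1 is akin to a frame of the previous row plus the current one.
print(s.rolling(window=2, min_periods=1).sum().tolist())  # [1.0, 3.0, 5.0, 7.0]
```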
- dense_rank(sort=None, ascending=True)
Calculate dense rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- execute(**kwargs)
- Parameters:
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
- Returns:
execution result
- Return type:
odps.df.backends.frame.ResultFrame
- hll_count(error_rate=0.01, splitter=None)
Calculate HyperLogLog count
- Parameters:
expr
error_rate (float) – error rate
splitter – the splitter to split the column value
- Returns:
sequence or scalar
- Example:
>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
- kurt()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- kurtosis()
Calculate kurtosis of the sequence
- Parameters:
expr
- Returns:
- lag(offset, default=None, sort=None, ascending=True)
Get the value in the row `offset` rows prior to the current row.
- Parameters:
offset – the offset value
default – default value for the function, when there are no rows satisfying the offset
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- lead(offset, default=None, sort=None, ascending=True)
Get the value in the row `offset` rows after the current row.
- Parameters:
offset – the offset value
default – default value for the function, when there are no rows satisfying the offset
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
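`lag` and `lead` are mirror images: one looks backwards within the sorted partition, the other forwards, and both fall back to `default` where the offset runs off the end. A pure-Python sketch of the semantics (hypothetical data, not the PyODPS API):

```python
def lag(values, offset, default=None):
    """Value `offset` rows before the current row; `default` when out of range."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset, default=None):
    """Value `offset` rows after the current row; `default` when out of range."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

vals = [3, 1, 4, 1, 5]
print(lag(vals, 1, default=0))  # [0, 3, 1, 4, 1]
print(lead(vals, 2))            # [4, 1, 5, None, None]
```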
- max()
Max value
- Parameters:
expr
- Returns:
- mean()
Arithmetic mean.
- Parameters:
expr
- Returns:
- median()
Median value.
- Parameters:
expr
- Returns:
- min()
Min value
- Parameters:
expr
- Returns:
- min_rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- moment(order, central=False)
Calculate the n-th order moment of the sequence
- Parameters:
expr
order – moment order, must be an integer
central – whether to compute central moments
- Returns:
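The n-th raw moment is the mean of the n-th powers, E[x^n]; with `central=True` the values are shifted by the mean first, E[(x − mean)^n], so the 2nd central moment is the population variance. A short sketch of that formula (hypothetical data, not the PyODPS implementation):

```python
def moment(values, order, central=False):
    """n-th raw moment E[x^n]; central=True gives E[(x - mean)^n]."""
    n = len(values)
    shift = sum(values) / n if central else 0.0
    return sum((v - shift) ** order for v in values) / n

data = [1, 2, 3, 4]
print(moment(data, 2))                # raw 2nd moment: (1+4+9+16)/4 = 7.5
print(moment(data, 2, central=True))  # population variance: 1.25
```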
- nth_value(nth, skip_nulls=False, sort=None, ascending=True)
Get nth value of a grouped and sorted expression.
- Parameters:
expr – expression for calculation
nth – integer position
skip_nulls – whether to skip null values, False by default
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- nunique()
The distinct count.
- Parameters:
expr
- Returns:
- percent_rank(sort=None, ascending=True)
Calculate percentage rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
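The ranking functions differ only in how they treat ties and scale. A pure-Python sketch comparing them on one sorted input (hypothetical data; the conventions below — ties share the lowest rank for `min_rank`, `percent_rank = (rank − 1) / (n − 1)` — are the usual SQL window-function definitions and are assumed here):

```python
def ranks(values):
    """Compare min_rank, dense_rank, percent_rank and row_number on sorted input."""
    sorted_vals = sorted(values)
    min_rank, dense = [], []
    for pos, v in enumerate(sorted_vals):
        if pos and v == sorted_vals[pos - 1]:
            min_rank.append(min_rank[-1])  # ties share the lowest rank
            dense.append(dense[-1])        # ties share one rank, no gaps after
        else:
            min_rank.append(pos + 1)
            dense.append((dense[-1] + 1) if dense else 1)
    n = len(values)
    pct = [(r - 1) / (n - 1) for r in min_rank]  # percent_rank scales to [0, 1]
    row_number = list(range(1, n + 1))           # unique, ties broken by order
    return min_rank, dense, pct, row_number

mr, dr, pr, rn = ranks([10, 20, 20, 30])
print(mr)  # [1, 2, 2, 4]
print(dr)  # [1, 2, 2, 3]
print(rn)  # [1, 2, 3, 4]
```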
- persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)
Persist the execution result into a new table. If `partitions` is not specified and the table does not exist, a new non-partitioned table is created and the SQL result is inserted into it. If `partitions` is specified, the listed fields become the partition fields of the new table. If `partition` is specified, the data is inserted into that exact partition of the table.
- Parameters:
name – table name
partitions (list) – list of strings, the partition fields
partition (string or PartitionSpec) – persist to a specified partition
lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
project – project name; if not provided, the default project is used
hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
priority (int) – instance priority, 9 as default
running_cluster – cluster to run this instance
overwrite (bool) – overwrite the table, True as default
drop_table (bool) – drop the table if it exists, False as default
create_table (bool) – create the table first if it does not exist, True as default
drop_partition (bool) – drop the partition if it exists, False as default
create_partition (bool) – create the partition if it does not exist, None as default
cast (bool) – cast all columns' types to those of the existing table, False as default
- Returns:
- Example:
>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
- qcut(bins, labels=False, sort=None, ascending=True)
Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
- Parameters:
expr – expression for calculation
bins – number of bins
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
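Reading the description literally — 0-based bin indices over the sorted rows, with any leftover rows from an uneven split appended to the first group — the bin assignment can be sketched in pure Python (an illustration of the documented behavior, not the PyODPS implementation):

```python
def qcut_indices(n_rows, bins):
    """0-based quantile bin index for each of n_rows sorted rows.

    When n_rows does not divide evenly by bins, the extra rows go
    into the first group, as the docstring above describes.
    """
    base, extra = divmod(n_rows, bins)
    out = [0] * (base + extra)       # first bin absorbs the remainder
    for b in range(1, bins):
        out.extend([b] * base)
    return out

print(qcut_indices(8, 3))  # [0, 0, 0, 0, 1, 1, 2, 2]
print(qcut_indices(6, 3))  # [0, 0, 1, 1, 2, 2]
```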
- quantile(prob=None, **kw)
Percentile value.
- Parameters:
expr
prob – probability or list of probabilities, in [0, 1]
- Returns:
- rank(sort=None, ascending=True)
Calculate rank of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- row_number(sort=None, ascending=True)
Calculate row number of a sequence expression.
- Parameters:
expr – expression for calculation
sort – name of the sort column
ascending – whether to sort in ascending order
- Returns:
calculated column
- size()
Value counts
- Parameters:
expr
- Returns:
- skew()
Calculate skewness of the sequence
- Parameters:
expr
- Returns:
- std(**kw)
Standard deviation.
- Parameters:
expr
kw
- Returns:
- sum()
Sum value
- Parameters:
expr
- Returns:
- tolist(**kwargs)
Pack all data in the sequence into a list.
- Parameters:
expr
unique – make every element in the sequence unique
- Returns:
- var(**kw)
Variance
- Parameters:
expr
ddof – degree of freedom
kw
- Returns:
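The `ddof` (delta degrees of freedom) parameter chooses the divisor `n - ddof`: `ddof=0` gives the population variance, `ddof=1` the unbiased sample variance, with `std` as its square root. A short sketch of the formula (hypothetical data, not the PyODPS implementation):

```python
import math

def var(values, ddof=0):
    """Variance with `ddof` degrees of freedom (ddof=1: sample variance)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)

def std(values, ddof=0):
    """Standard deviation: square root of the variance."""
    return math.sqrt(var(values, ddof))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(var(data))           # population variance: 4.0
print(std(data))           # 2.0
print(var(data, ddof=1))   # sample variance: 32/7 ≈ 4.571
```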
- verify()
Verify if this expression can be compiled into ODPS SQL.
- Returns:
True if compilation succeeds else False
- Return type:
bool