6 mins read

Working with rows of Tables.jl tables – Beragampengetahuan

By: Blog by Bogumił Kamiński

Re-posted from:

Three weeks ago I wrote a post about getting a schema of Tables.jl tables.
Therefore today, to complement, I thought to discuss how one can get rows of such tables.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, DataAPI.jl 1.15.0, and DataFrames.jl 1.6.1.

Many Julia users are happy with using DataFrames.jl to work with their tables.
However, this is only one of the available options.
This means that, especially package creators, prefer not to hardcode DataFrame
as a specific type that their package supports, but allow for generic Tables.jl tables.

An example of such need is, for example, a function that could take a generic table and
split it into train-validation-test subsets. To achieve this you need to be able
to take a subset of its rows.

There are two functions that, in combination, can be used to generically subset a Tables.jl table:

  • the DataAPI.nrow function that returns a number of rows in a table;
  • the Tables.subset function that allows you to get a subset of rows of a table.

Before I turn to showing you how they work let me highlight one issue. Most of Tables.jl tables
support these functions. However, their support is not guaranteed. The reason is that some tables
are never materialized in memory, e.g. are only a stream of rows that can be read only once.
In such a case we will not know the number of rows in such a table (as it is dynamic) and, similarly,
to get a subset of its rows you would need to scan the whole stream anyway.

The DataAPI.nrow function is easy to understand. You pass it a table and in return you get the number of its rows.
Let us see it in practice:

julia> using DataAPI

julia> using Tables

julia> table = (a=1:10, b=11:20, c=21:30)
(a = 1:10, b = 11:20, c = 21:30)

julia> DataAPI.nrow(table)
10

The Tables.subset accepts two positional arguments. The first is a table, and the second
are 1-based row indices that should be picked. You have two options for passing indices.
You can pass a single integer index like this:

julia> Tables.subset(table, 2)
(a = 2, b = 12, c = 22)

In which case you get a single row of a table.
The other option is to pass a collection of indices, in which case, you get a table (not a single row):

julia> Tables.subset(table, 2:3)
(a = 2:3, b = 12:13, c = 22:23)

To see that indeed it works for other tables, let us check a DataFrame from DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(table)
10×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     2     12     22
   3 │     3     13     23
   4 │     4     14     24
   5 │     5     15     25
   6 │     6     16     26
   7 │     7     17     27
   8 │     8     18     28
   9 │     9     19     29
  10 │    10     20     30

julia> nrow(df)
10

julia> Tables.subset(df, 2)
DataFrameRow
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   2 │     2     12     22

julia> Tables.subset(df, 2:3)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

Again, note that Tables.subset(df, 2) returned DataFrameRow (a single row of a table),
while Tables.subset(df, 2:3) returned a DataFrame (a table).

If you work with large tables you often hit performance and memory consumption considerations.
In terms of Tables.subset this is related to the question if this function copies data
or just makes a view of the source table. This option is handled by the viewhint keyword argument.

Let us first see how it works:

julia> Tables.subset(df, 2:3, viewhint=true)
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

julia> Tables.subset(df, 2:3, viewhint=false)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

As you can see viewhint=true returned a view (a SubDataFrame), while viewhint=false produced a copy.

Let us see another example:

julia> table2 = Tables.rowtable(df)
10-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 1, b = 11, c = 21)
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)
 (a = 4, b = 14, c = 24)
 (a = 5, b = 15, c = 25)
 (a = 6, b = 16, c = 26)
 (a = 7, b = 17, c = 27)
 (a = 8, b = 18, c = 28)
 (a = 9, b = 19, c = 29)
 (a = 10, b = 20, c = 30)

julia> Tables.subset(table2, 2:3, viewhint=true)
2-element view(::Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}, 2:3) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

julia> Tables.subset(table2, 2:3, viewhint=false)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

As you can see viewhint=true produced a view of a vector, while viewhint=false made a copy of source data.

Now you might ask why the keyword argument is called viewhint? The reason is that not all Tables.jl tables allow
for flexibility of making a view or a copy. Therefore the rules are as follows:

  • if viewhint is not passed then table decides on its side if it returns a copy or a view (depending on what is possible);
  • if viewhint=true then table should return a view, but if it is not possible this can be a copy;
  • if viewhint=false then table should return a copy, but if it is not possible this can be a view.

In other words viewhint should be considered as a performance hint only.
It does not guarantee to produce what you ask for (as for some tables satisfying this request might be impossible).

Summarizing our post. If you want to write a generic function that subsets a Tables.jl table then you can use:

  • the DataAPI.nrow function to learn how many rows it has;
  • the Tables.subset function to get a subset of its rows using 1-based indexing.

I hope these examples are useful for your work.

Contents

Software Terbaru Saat Ini



Aplikasi yang sedang trend saat ini

object oriented programming, programming language, programming adalah, web programming, belajar programming, tournament software, software, software adalah, contoh software, apa itu software, pengertian software, aplikasi, aplikasi penghasil uang, aplikasi bokep, aplikasi video, programming

#Working #rows #Tables.jl #tables

Tinggalkan Balasan

Alamat email Anda tidak akan dipublikasikan. Ruas yang wajib ditandai *