Julia Programming - Working with Datasets



In this chapter, we discuss in detail how to work with datasets in Julia.

CSV files

A CSV (Comma Separated Values) file is a plain text file that uses commas to separate the fields of each record. These files usually carry the .csv extension. Julia's CSV package provides various functions to perform operations on CSV files.

Import a .CSV file in Julia

To import a .csv file, we need to install the CSV package. Use the following commands to do so −

using Pkg
Pkg.add("CSV")

Reading data

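To read data from a CSV file in Julia, we use the read() function from the CSV package. The session below was recorded with an older CSV.jl release, in which CSV.read(path) returned a DataFrame; recent releases require the sink type as a second argument, for example CSV.read(path, DataFrame) after using DataFrames −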

julia> using CSV
julia> CSV.read("C://Users//Leekha//Desktop//Iris.csv")
150×6 DataFrame
│ Row │  Id   │ SepalLengthCm │ SepalWidthCm │ PetalLengthCm │ PetalWidthCm │ Species        │
│     │ Int64 │      Float64  │     Float64  │     Float64   │    Float64   │ String         │
├─────┼───────┼───────────────┼──────────────┼───────────────┼──────────────┼────────────────┤
│  1  │   1   │      5.1      │     3.5      │     1.4       │    0.2       │ Iris-setosa    │
│  2  │   2   │      4.9      │     3.0      │     1.4       │    0.2       │ Iris-setosa    │
│  3  │   3   │      4.7      │     3.2      │     1.3       │    0.2       │ Iris-setosa    │
│  4  │   4   │      4.6      │     3.1      │     1.5       │    0.2       │ Iris-setosa    │
│  5  │   5   │      5.0      │     3.6      │     1.4       │    0.2       │ Iris-setosa    │
│  6  │   6   │      5.4      │     3.9      │     1.7       │    0.4       │ Iris-setosa    │
│  7  │   7   │      4.6      │     3.4      │     1.4       │    0.3       │ Iris-setosa    │
│  8  │   8   │      5.0      │     3.4      │     1.5       │    0.2       │ Iris-setosa    │
│  9  │   9   │      4.4      │     2.9      │     1.4       │    0.2       │ Iris-setosa    │
│  10 │   10  │      4.9      │     3.1      │     1.5       │    0.1       │ Iris-setosa    │
⋮
│ 140 │ 140   │      6.9      │     3.1      │     5.4       │    2.1       │ Iris-virginica │
│ 141 │ 141   │      6.7      │     3.1      │     5.6       │    2.4       │ Iris-virginica │
│ 142 │ 142   │      6.9      │     3.1      │     5.1       │    2.3       │ Iris-virginica │
│ 143 │ 143   │      5.8      │     2.7      │     5.1       │    1.9       │ Iris-virginica │
│ 144 │ 144   │      6.8      │     3.2      │     5.9       │    2.3       │ Iris-virginica │
│ 145 │ 145   │      6.7      │     3.3      │     5.7       │    2.5       │ Iris-virginica │
│ 146 │ 146   │      6.7      │     3.0      │     5.2       │    2.3       │ Iris-virginica │
│ 147 │ 147   │      6.3      │     2.5      │     5.0       │    1.9       │ Iris-virginica │
│ 148 │ 148   │      6.5      │     3.0      │     5.2       │    2.0       │ Iris-virginica │
│ 149 │ 149   │      6.2      │     3.4      │     5.4       │    2.3       │ Iris-virginica │
│ 150 │ 150   │      5.9      │     3.0      │     5.1       │    1.8       │ Iris-virginica │

Creating new CSV file

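To create a new, empty CSV file, we can use the touch() function, which belongs to Base Julia rather than to the CSV package. We then use the DataFrames package to build the content and CSV.write() to write it to the file. CSV.write() creates the file if it does not exist, so the touch() and open() calls below are optional −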

julia> using DataFrames
julia> using CSV
julia> touch("1234.csv")
"1234.csv"

julia> new = open("1234.csv", "w")
IOStream(<file 1234.csv>)

julia> new_data = DataFrame(Name = ["Gaurav", "Rahul", "Aarav", "Raman", "Ravinder"],
                  RollNo = [1, 2, 3, 4, 5],
                  Marks = [54, 67, 90, 23, 95])
                  
5×3 DataFrame
│ Row │  Name    │ RollNo │ Marks │
│     │  String  │ Int64  │ Int64 │
├─────┼──────────┼────────┼───────┤
│ 1   │  Gaurav  │   1    │   54  │
│ 2   │  Rahul   │   2    │   67  │
│ 3   │   Aarav  │   3    │   90  │
│ 4   │   Raman  │   4    │   23  │
│ 5   │ Ravinder │   5    │   95  │

julia> CSV.write("1234.csv", new_data)
"1234.csv"

julia> CSV.read("1234.csv")
5×3 DataFrame
│ Row │    Name  │ RollNo │ Marks │
│     │  String  │ Int64  │ Int64 │
├─────┼──────────┼────────┼───────┤
│   1 │   Gaurav │   1    │   54  │
│   2 │   Rahul  │   2    │   67  │
│   3 │   Aarav  │   3    │   90  │
│   4 │   Raman  │   4    │   23  │
│   5 │ Ravinder │   5    │   95  │
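
The write-and-read round trip can also be sketched for recent package versions. The following is a minimal sketch, assuming CSV.jl 0.8 or later and a recent DataFrames.jl are installed; the file name marks.csv and the temporary directory are illustrative choices that do not appear in the session above. It relies on CSV.write() creating the file itself and on CSV.read() taking the sink type DataFrame as its second argument.

```julia
using CSV, DataFrames

# Small table mirroring the example above (three rows for brevity).
marks = DataFrame(Name = ["Gaurav", "Rahul", "Aarav"],
                  RollNo = [1, 2, 3],
                  Marks = [54, 67, 90])

# Hypothetical location: a temporary directory cleaned up when Julia exits.
path = joinpath(mktempdir(), "marks.csv")

CSV.write(path, marks)                  # creates or overwrites the file
roundtrip = CSV.read(path, DataFrame)   # the sink type is the second argument

println(roundtrip)
```

No touch() or open() call is needed in this sketch, because CSV.write() handles file creation on its own.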

HDF5

HDF5 stands for Hierarchical Data Format version 5. Following are some of its properties −

  • A "group" is similar to a directory, a "dataset" is like a file.

  • Attributes associate metadata with a particular group or dataset.

  • It uses ASCII names for different objects.

  • Language wrappers are often described as either "low level" or "high level".

Opening HDF5 files

After loading the package with using HDF5, HDF5 files can be opened with the h5open command as follows −

fid = h5open(filename, mode)

The following table describes the modes −

Sl.No   Mode    Meaning
1       "r"     read-only
2       "r+"    read-write − preserves any existing contents
3       "cw"    read-write − creates the file if it does not exist and preserves existing contents
4       "w"     read-write − destroys any existing contents

The above command returns a file object. In the HDF5.jl release this chapter describes, its type is HDF5File, a subtype of the abstract type DataFile; newer releases name the type HDF5.File.

Closing HDF5 files

Once finished with a file, we should close it as follows −

close(fid)

It will also close all the objects in the file.

Opening HDF5 objects

Suppose we have a file object named fid that contains a group called object1. It can be opened as follows −

obj1 = fid["object1"]

Closing HDF5 objects

close(obj1)

Reading data

Suppose a group g contains a dataset with path "dtset", and we have opened the dataset as dset1 = g["dtset"]. We can read the information in the following ways: the first two read the whole dataset, and the third reads only a slice of it −

ABC = read(dset1)
ABC = read(g, "dtset")
Asub = dset1[2:3, 1:3]

Writing data

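We can create a dataset in either of the following two equivalent ways. Use one or the other, since a dataset named "dset1" can only be created once in a group −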

g["dset1"] = rand(3,5)
write(g, "dset1", rand(3,5))
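
The open, write, read, and close steps described above can be combined into one short script. This is a minimal sketch, assuming a recent HDF5.jl release in which groups are created with create_group(); the file name example.h5 and the group name mygroup are hypothetical. It uses the do-block form of h5open, which closes the file automatically.

```julia
using HDF5

# Hypothetical file in a temporary directory.
path = joinpath(mktempdir(), "example.h5")

# "w" creates the file and destroys any existing contents.
# The do-block closes the file when it finishes, even after an error.
h5open(path, "w") do fid
    g = create_group(fid, "mygroup")
    g["dset1"] = reshape(collect(1.0:15.0), 3, 5)
end

# Reopen read-only, then read the whole dataset and a slice of it.
full, part = h5open(path, "r") do fid
    dset = fid["mygroup"]["dset1"]
    (read(dset), dset[2:3, 1:3])
end

println(size(full))
println(part)
```

With the do-block form there is no separate close(fid) call to forget.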

XML files

Here we will discuss the LightXML.jl package, which is a light-weight Julia wrapper for libxml2. It provides the following functionalities −

  • Parsing an XML file

  • Accessing XML tree structure

  • Creating an XML tree

  • Exporting an XML tree to a string

Example

Suppose we have an XML file named new.xml as follows −

<Hello>
      <to>Gaurav</to>
      <from>Rahul</from>
      <heading>Reminder to meet</heading>
      <body>Friend, Don't forget to meet this weekend!</body>
</Hello>

Now, we can parse this file by using LightXML as follows −

julia> using LightXML
#below code will parse this xml file
julia> xdoc = parse_file("C://Users//Leekha//Desktop//new.xml")
<?xml version="1.0" encoding="utf-8"?>
<Hello>
<to>Gaurav</to>
<from>Rahul</from>
<heading>Reminder to meet</heading>
<body>Friend, Don't forget to meet this weekend!</body>
</Hello>

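The following example explains how to get the root element and traverse its child nodes. In the output, node type 1 denotes an element node and node type 3 denotes a text node (here, the whitespace between the elements) −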

julia> xroot = root(xdoc);
julia> println(name(xroot))
Hello
#Traversing all the child nodes and also print element names
julia> for c in child_nodes(xroot) # c is an instance of XMLNode
            println(nodetype(c))
            if is_elementnode(c)
               e = XMLElement(c) # this makes an XMLElement instance
               println(name(e))
            end
         end
3
1
to
3
1
from
3
1
heading
3
1
body
3
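
To read the text inside each element rather than only its name, the traversal can be restricted to element nodes. The following is a minimal sketch, assuming LightXML.jl is installed; it parses the same document from a string with parse_string() so that it does not depend on a file path, and the dictionary name fields is an illustrative choice.

```julia
using LightXML

# Same document as new.xml above, given inline.
xml_text = """
<Hello>
  <to>Gaurav</to>
  <from>Rahul</from>
  <heading>Reminder to meet</heading>
  <body>Friend, Don't forget to meet this weekend!</body>
</Hello>
"""

xdoc = parse_string(xml_text)
xroot = root(xdoc)

# child_elements skips the whitespace text nodes and yields only elements.
fields = Dict{String,String}()
for e in child_elements(xroot)
    fields[name(e)] = content(e)
end

println(fields["to"])
println(fields["heading"])

# Release the memory held by the underlying libxml2 document.
free(xdoc)
```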

RDatasets

Julia has the RDatasets.jl package, which provides an easy way to use and experiment with most of the standard datasets that are available in the core of R. To load and work with one of the datasets included in the RDatasets package, we need to install RDatasets as follows −

julia> using Pkg
julia> Pkg.add("RDatasets")

Subsetting the data

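For example, we will use the Gcsemv dataset from the mlmRev group as follows. The call first(GetData, 6) shows the first six rows; older DataFrames releases provided head() for this −

julia> using RDatasets
julia> GetData = dataset("mlmRev","Gcsemv");
julia> summary(GetData);
julia> first(GetData, 6)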
6×5 DataFrame
│ Row │     School   │     Student  │     Gender   │  Written │   Course │
│     │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├─────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│   1 │     20920    │      16      │      M       │  23.0    │  missing │
│   2 │     20920    │      25      │      F       │  missing │   71.2   │
│   3 │     20920    │      27      │      F       │  39.0    │   76.8   │
│   4 │     20920    │      31      │      F       │  36.0    │   87.9   │
│   5 │     20920    │      42      │      M       │  16.0    │   44.4   │
│   6 │     20920    │      62      │      F       │  36.0    │  missing │

We can select the data for a particular school as follows −

julia> GetData[GetData.School .== "68137", :]
104×5 DataFrame
│ Row │     School   │     Student  │     Gender   │  Written │   Course │
│     │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├─────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│  1  │     68137    │      1       │     F        │   18.0   │   56.4   │
│  2  │     68137    │      2       │     F        │   23.0   │   55.5   │
│  3  │     68137    │      3       │     F        │   25.0   │  missing │
│  4  │     68137    │      4       │     F        │   29.0   │   73.1   │
│  5  │     68137    │      5       │     F        │  missing │   66.6   │
│  6  │     68137    │      9       │     F        │   20.0   │   60.1   │
│  7  │     68137    │     11       │     F        │   34.0   │   63.8   │
│  8  │     68137    │     12       │     F        │   60.0   │   89.8   │
│  9  │     68137    │     13       │     F        │   44.0   │   76.8   │
│  10 │     68137    │     14       │     F        │   20.0   │   58.3   │
⋮
│ 94  │     68137    │     252      │     M        │  missing │   75.9   │
│ 95  │     68137    │     254      │     M        │     35.0 │ missing  │
│ 96  │     68137    │     255      │     M        │     36.0 │   62.0   │
│ 97  │     68137    │     258      │     M        │     23.0 │   61.1   │
│ 98  │     68137    │     260      │     M        │     25.0 │ missing  │
│ 99  │     68137    │     261      │     M        │     46.0 │    89.8  │
│ 100 │     68137    │     264      │     M        │     50.0 │    70.3  │
│ 101 │     68137    │     268      │     M        │     15.0 │    43.5  │
│ 102 │     68137    │     270      │     M        │  missing │    73.1  │
│ 103 │     68137    │     272      │     M        │     43.0 │    78.7  │
│ 104 │     68137    │     273      │     M        │     35.0 │    60.1  │

Sorting the data

With the help of the sort!() function, we can sort the data in place. For example, here we sort the dataset by ascending written examination score; rows with a missing score are placed last −

julia> sort!(GetData, :Written)
1905×5 DataFrame
│ Row  │       School │      Student │       Gender │  Written │   Course │
│      │ Categorical… │ Categorical… │ Categorical… │ Float64⍰ │ Float64⍰ │
├──────┼──────────────┼──────────────┼──────────────┼──────────┼──────────┤
│  1   │    22710     │       77     │     F        │    0.6   │   41.6   │
│  2   │    68137     │       65     │     F        │    2.5   │   50.0   │
│  3   │    22520     │       115    │     M        │    3.1   │   9.25   │
│  4   │    68137     │       80     │     F        │    4.3   │   50.9   │
│  5   │    68137     │       79     │     F        │    7.5   │   27.7   │
│  6   │    22710     │       57     │     F        │    11.0  │   73.1   │
│  7   │    64327     │       19     │     F        │    11.0  │   87.0   │
│  8   │    68137     │       85     │     F        │    11.0  │   27.7   │
│  9   │    68137     │       97     │     F        │    11.0  │   57.4   │
│ 10   │    68137     │       100    │     F        │    11.0  │   61.1   │
⋮
│ 1895 │    74874     │       83     │     F        │ missing  │    81.4  │
│ 1896 │    74874     │       86     │     F        │ missing  │    92.5  │
│ 1897 │    76631     │       79     │     F        │ missing  │    84.2  │
│ 1898 │    76631     │       193    │     M        │ missing  │    72.2  │
│ 1899 │    76631     │       221    │     F        │ missing  │    76.8  │
│ 1900 │    77207     │       5001   │     F        │ missing  │    82.4  │
│ 1901 │    77207     │       5062   │     M        │ missing  │    75.0  │
│ 1902 │    77207     │       5063   │     F        │ missing  │    79.6  │
│ 1903 │    84772     │       17     │     M        │ missing  │    88.8  │
│ 1904 │    84772     │       49     │     M        │ missing  │    74.0  │
│ 1905 │    84772     │       85     │     F        │ missing  │    90.7  │
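
Subsetting and sorting can be tried on a small table without downloading a dataset. The following is a minimal sketch, assuming a recent DataFrames.jl; the table scores and its values are hypothetical and only mimic the shape of the Gcsemv data.

```julia
using DataFrames

# Hypothetical scores; `missing` marks an absent result.
scores = DataFrame(School  = ["A", "A", "B", "B"],
                   Student = [1, 2, 1, 2],
                   Written = [23.0, missing, 39.0, 16.0])

# Subsetting: keep only the rows whose School equals "B".
school_b = scores[scores.School .== "B", :]

# Sorting: `sort` returns a sorted copy, while `sort!` would modify `scores`.
# Rows with a missing Written score end up last.
by_written = sort(scores, :Written)

println(school_b)
println(by_written)
```

Using sort() instead of sort!() keeps the original row order available, which is convenient while exploring the data.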

Statistics in Julia

For basic statistics, Julia has the StatsBase.jl package. We need to install the StatsBase package as follows −

julia> using Pkg
julia> Pkg.add("StatsBase")

Simple Statistics

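StatsBase provides functions to define weights and calculate means. The package has to be loaded with using StatsBase first.

We can use the Weights constructor (or the equivalent weights() function) to define weight vectors as follows −

julia> using StatsBase
julia> WV = Weights([10.,11.,12.])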
3-element Weights{Float64,Float64,Array{Float64,1}}:
 10.0
 11.0
 12.0

You can use the isempty() function to check whether the weight vector is empty or not −

julia> isempty(WV)
false

We can check the element type of the weight vector with the help of the eltype() function as follows −

julia> eltype(WV)
Float64

We can check the length of the weight vectors with the help of length() function as follows −

julia> length(WV)
3

There are different ways to calculate the mean −

  • Harmonic mean − We can use harmmean() function to calculate the harmonic mean.

julia> A = [3, 5, 6, 7, 8, 2, 9, 10]
8-element Array{Int64,1}:
 3
 5
 6
 7
 8
 2
 9
 10
julia> harmmean(A)
4.764831009217679

  • Geometric mean − We can use geomean() function to calculate the geometric mean.

julia> geomean(A)
5.555368605381863

  • Arithmetic mean − We can use mean() function to calculate the arithmetic (ordinary) mean.

julia> mean(A)
6.25
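
The three means differ only in how the values are combined. The following is a minimal sketch that recomputes them from their definitions using only the Statistics standard library, so it does not call harmmean() or geomean(); the variable names are illustrative.

```julia
using Statistics

A = [3, 5, 6, 7, 8, 2, 9, 10]

# Harmonic mean: the number of values divided by the sum of reciprocals.
harmonic = length(A) / sum(inv, A)

# Geometric mean: the exponential of the mean of the logarithms.
geometric = exp(mean(log, A))

# Arithmetic mean: the sum of the values divided by their number.
arithmetic = mean(A)

println((harmonic, geometric, arithmetic))
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean, as the values above illustrate.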

Descriptive Statistics

Descriptive statistics is the discipline of statistics that summarizes a dataset with a few quantities, such as its variance and standard deviation. These quantities explain the essence of the data.

Calculating variance

We can use var() function to calculate the variance of a vector as follows −

julia> B = [1., 2., 3., 4., 5.];
julia> var(B)
2.5

Calculating weighted variance

We can calculate the weighted variance of a vector with respect to a weight vector as follows −

julia> B = [1., 2., 3., 4., 5.];
julia> a = aweights([4., 2., 1., 3., 1.])
5-element AnalyticWeights{Float64,Float64,Array{Float64,1}}:
 4.0
 2.0
 1.0
 3.0
 1.0
julia> var(B, a)
2.066115702479339

Calculating standard deviation

We can use the std() function to calculate the standard deviation of a vector as follows −

julia> std(B)
1.5811388300841898

Calculating weighted standard deviation

We can calculate the weighted standard deviation of a vector with respect to a weight vector as follows −

julia> std(B,a)
1.4373989364401725

Calculating mean and standard deviation

We can calculate the mean and standard deviation in a single command as follows −

julia> mean_and_std(B,a)
(2.5454545454545454, 1.4373989364401725)

Calculating mean and variance

We can calculate the mean and variance in a single command as follows −

julia> mean_and_var(B,a)
(2.5454545454545454, 2.066115702479339)
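
The weighted results above can be reproduced by hand. The following is a minimal sketch in plain Julia, assuming the uncorrected weighted variance, that is, the weighted average of squared deviations from the weighted mean, which is what the printed values correspond to. The names x, w, wmean, wvar and wstd are illustrative.

```julia
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # data, same as B above
w = [4.0, 2.0, 1.0, 3.0, 1.0]   # weights, same as a above

# Weighted mean: sum of weight times value, divided by the total weight.
wmean = sum(w .* x) / sum(w)

# Uncorrected weighted variance: weighted average of squared deviations.
wvar = sum(w .* (x .- wmean) .^ 2) / sum(w)

# Weighted standard deviation: square root of the weighted variance.
wstd = sqrt(wvar)

println((wmean, wvar, wstd))
```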

Samples and Estimations

Sampling is the part of statistics in which, for analysis, sample units are selected from a larger population set.

Following are the ways in which we can do sampling −

Taking a random sample is the simplest way of sampling: we draw a random element from the array, i.e., the population set. The function for this purpose is sample().

Example

julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
 8.0
 12.0
 23.0
 54.5
julia> sample(A)
12.0

Next, we can take "n" elements as a random sample. By default, sample() draws with replacement, so the same element may appear more than once.

Example

julia> A = [8.,12.,23.,54.5]
4-element Array{Float64,1}:
 8.0
 12.0
 23.0
 54.5
julia> sample(A, 2)
2-element Array{Float64,1}:
 23.0
 54.5

We can also write the sampled elements into a pre-allocated array of length "n". The function to do this task is sample!(): the call sample!(B, X) draws length(X) elements from B and overwrites X with them.

Example

julia> B = [1., 2., 3., 4., 5.];
julia> X = [2., 1., 3., 2., 5.];
julia> sample!(B,X)
5-element Array{Float64,1}:
 2.0
 2.0
 4.0
 1.0
 3.0

Another way is direct sampling, which randomly picks elements from a population set, with replacement, and stores them in another array. The function to do this task is direct_sample!().

Example

julia> StatsBase.direct_sample!(B, X)
5-element Array{Float64,1}:
 1.0
 4.0
 4.0
 4.0
 5.0

Knuth's algorithm is another way of random sampling, in which the sampling is done without replacement, so no element is picked twice. The function to do this task is knuths_sample!().

Example

julia> StatsBase.knuths_sample!(B, X)
5-element Array{Float64,1}:
 5.0
 3.0
 4.0
 2.0
 1.0
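
The difference between sampling with and without replacement can be seen without StatsBase. The following is a minimal sketch that uses only the Random standard library instead of the StatsBase functions shown above; the seed 2024 is arbitrary and is fixed only so that repeated runs give the same draws.

```julia
using Random

population = [8.0, 12.0, 23.0, 54.5]
rng = MersenneTwister(2024)   # arbitrary seed, fixed only for repeatability

# With replacement: each draw is independent, so values may repeat.
with_repl = rand(rng, population, 3)

# Without replacement: shuffle a copy and keep the first n elements,
# so every element is picked at most once.
n = 3
without_repl = shuffle(rng, population)[1:n]

println(with_repl)
println(without_repl)
```

Sampling without replacement can return at most as many elements as the population holds, whereas sampling with replacement has no such limit.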