Getting Started

  1. Launch a cluster of VMs or bare-metal nodes, choose one of them as the client node, and clone the repository on it:

git clone https://github.com/uccross/skyhookdm
cd skyhookdm/scripts/deploy/
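
The deployment scripts used in the following steps live in this directory. As a quick sanity check (the exact file listing in the repository may differ), you can confirm they are present:

ls deploy_ceph.sh deploy_skyhook_upstream.sh deploy_data.sh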
  2. Execute the deploy_ceph.sh script to deploy a Ceph cluster on a set of nodes and mount CephFS on the client node. On the client node, execute:

./deploy_ceph.sh mon1,mon2,mon3 osd1,osd2,osd3 mds1 mgr1 /dev/sdb 3

where mon1, mon2, osd1, etc. are the internal hostnames of the nodes. On CloudLab, for example, you can run:

./deploy_ceph.sh node1,node2,node3 node4,node5,node6,node7 node1 node2 /dev/nvme0n1p4 3
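
Once deploy_ceph.sh completes, you can optionally confirm that the cluster is healthy and that CephFS is mounted on the client node. This assumes the ceph CLI and an admin keyring are available on the client:

sudo ceph -s
df -h /mnt/cephfs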
  3. Build and install Arrow along with the SkyhookDM object class plugins:

./deploy_skyhook_upstream.sh osd1,osd2,osd3

This builds the plugins as shared libraries and deploys them to the OSD nodes; a way to verify the deployment is sketched after this step. On CloudLab, you can run:

./deploy_skyhook_upstream.sh node4,node5,node6,node7
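
To verify that the object class libraries reached the OSD nodes, you can look inside Ceph's rados-classes directory on one of them. The directory location and library names vary by distribution and build, so treat the paths below as illustrative:

ssh osd1 'ls /usr/lib/rados-classes /usr/lib64/rados-classes /usr/lib/x86_64-linux-gnu/rados-classes 2>/dev/null'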
  4. Download an example Parquet file with NYC Taxi data:

wget https://skyhook-ucsc.s3.us-west-1.amazonaws.com/128MB.uncompressed.parquet
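
Before copying the file into CephFS, you can check that the download completed and is roughly 128 MB in size:

ls -lh 128MB.uncompressed.parquet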
  5. Write a sample dataset to the CephFS mount by replicating the 128 MB Parquet file downloaded in the previous step:

./deploy_data.sh [source file] [destination dir] [no. of copies] [stripe unit]

For example,

./deploy_data.sh 128MB.uncompressed.parquet /mnt/cephfs/dataset 10 134217728

This writes 10 copies of the ~128 MB Parquet file to /mnt/cephfs/dataset with a CephFS stripe unit of 128 MB (134217728 bytes).
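
To check the result, list the dataset directory and, if the attr tools are installed, inspect the CephFS layout that was applied. The file names below are illustrative; deploy_data.sh may name the copies differently:

ls -lh /mnt/cephfs/dataset
getfattr -n ceph.file.layout /mnt/cephfs/dataset/*.parquet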

  6. Build and run the example client code:

g++ -std=c++17 ../example.cc -larrow_skyhook -larrow_dataset -larrow -o example
export LD_LIBRARY_PATH=/usr/local/lib
./example file:///mnt/cephfs/dataset

You should get a stringified Arrow table as the output.
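
If the compiler cannot find the Arrow headers or libraries, you may need to pass the include and library paths explicitly. The flags below assume everything was installed under the /usr/local prefix, matching the LD_LIBRARY_PATH used above:

g++ -std=c++17 ../example.cc -I/usr/local/include -L/usr/local/lib -larrow_skyhook -larrow_dataset -larrow -o example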