Launch a cluster of VMs/bare-metal nodes, choose one of them as the client node, and clone the repository there:
```bash
git clone https://github.com/uccross/skyhookdm
cd skyhookdm/scripts/deploy/
```
Run the `deploy_ceph.sh` script to deploy a Ceph cluster on a set of nodes and to mount CephFS on the client node. On the client node, execute:
```bash
./deploy_ceph.sh mon1,mon2,mon3 osd1,osd2,osd3 mds1 mgr1 /dev/sdb 3
```
where mon1, mon2, osd1, etc. are the internal hostnames of the nodes. Similarly, on CloudLab you can execute:
```bash
./deploy_ceph.sh node1,node2,node3 node4,node5,node6,node7 node1 node2 /dev/nvme0n1p4 3
```
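Before moving on, you can optionally verify that the cluster is healthy and that CephFS is mounted on the client node. These are standard Ceph and Linux commands; `/mnt/cephfs` is the mount point used by the later steps.

```bash
# Check overall cluster health.
sudo ceph -s

# Confirm the CephFS mount on the client node.
df -h /mnt/cephfs
```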
Build and install Arrow along with the SkyhookDM object class plugins.
This builds the plugins as shared libraries and deploys them to the OSD nodes; a hedged sketch of the invocation is given below.
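The exact command is not reproduced here. As an illustrative sketch only, assuming the repository ships a `deploy_skyhook.sh` script that, like `deploy_ceph.sh`, takes the comma-separated OSD hostnames:

```bash
# Hypothetical invocation: the script name and arguments are assumptions;
# check scripts/deploy/ in the repository for the actual usage.
./deploy_skyhook.sh osd1,osd2,osd3

# On CloudLab, with the OSD nodes from the earlier example:
./deploy_skyhook.sh node4,node5,node6,node7
```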
Download an example Parquet file with NYC Taxi data.
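The download location is not given here; as a placeholder illustration only (substitute the real URL), the step looks like:

```bash
# The URL below is a placeholder, not the actual location of the file;
# the file name matches the one used by deploy_data.sh in the next step.
wget https://example.com/128MB.uncompressed.parquet
```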
Write a sample dataset to the CephFS mount by replicating the 128 MB Parquet file downloaded in the previous step, using the `deploy_data.sh` script:
```bash
./deploy_data.sh [source file] [destination dir] [no. of copies] [stripe unit]
./deploy_data.sh 128MB.uncompressed.parquet /mnt/cephfs/dataset 10 134217728
```
This writes 10 copies of the ~128 MB Parquet file to /mnt/cephfs/dataset using a CephFS stripe unit of 128 MB (134217728 bytes).
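To sanity-check the result, list the contents of the dataset directory (standard commands; the exact file names depend on how deploy_data.sh names the copies):

```bash
# Expect 10 Parquet files of roughly 128 MB each.
ls -lh /mnt/cephfs/dataset
ls /mnt/cephfs/dataset | wc -l   # should print 10
```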
Build and run the example client code.
```bash
g++ ../example.cc -larrow_skyhook_client -larrow_dataset -larrow -o example
export LD_LIBRARY_PATH=/usr/local/lib
./example file:///mnt/cephfs/dataset
```
You should get a stringified Arrow table as the output.
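If the run fails because shared libraries cannot be found, confirm that the Arrow and Skyhook client libraries resolve from /usr/local/lib:

```bash
# Verify that the example binary resolves the Arrow/Skyhook shared libraries.
ldd ./example | grep -i arrow
```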