Partitioned Hive Table Insert Syntax


create table empl(id int, name string) partitioned by (State string);


Insert records into a Partitioned table using VALUES clause.


insert into empl  partition(State='Ohio' ) values (1,'Jr Donald');
insert into empl  partition(State='Ohio' ) values (2,'Jr Erica');
insert into empl  partition(State='Ohio' ) values (3,'Mela');
insert into empl  partition(State='Nevada' ) values (10,'Haris');
insert into empl  partition(State='Nevada' ) values (11,'Jake');
insert into empl  partition(State='Nevada' ) values (12,'Trevor');

Inserting data into a Hive partitioned table using a SELECT clause (dynamic partitioning). Dynamic partition inserts need these settings first:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO empl PARTITION(State) select id,name,state from empl_w;

 
 
Insert Overwrite (replaces any existing data in the affected partitions)
insert overwrite table empl partition(state) select id,name,state from empl_w;
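Conceptually, a dynamic-partition insert routes each row into the partition directory named after its partition-column value. A minimal Python sketch of that routing (the rows mirror the six employees inserted above; empl_w is assumed to hold the same data):

```python
from collections import defaultdict

# Rows as they would appear in empl_w: (id, name, state)
rows = [
    (1, "Jr Donald", "Ohio"), (2, "Jr Erica", "Ohio"), (3, "Mela", "Ohio"),
    (10, "Haris", "Nevada"), (11, "Jake", "Nevada"), (12, "Trevor", "Nevada"),
]

# Hive stores each partition under a directory like State=Ohio/
partitions = defaultdict(list)
for emp_id, name, state in rows:
    partitions[f"State={state}"].append((emp_id, name))

print(sorted(partitions))  # ['State=Nevada', 'State=Ohio']
```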

AWS CLI - S3 API Recap

Command to list objects filtered on the key. Note the single quotes around the query: the JMESPath backtick literal would otherwise be interpreted by the shell as command substitution.

aws s3api list-objects --bucket dataos-customer-surveys --query 'Contents[?contains(Key, `input`)]'
 
Command to list object keys based on a prefix
aws s3api list-objects --bucket dataos-customer-surveys --prefix Archive/ --query "Contents[].{Key: Key}"

Command to list object keys and sizes
aws s3api list-objects --bucket dataos-medallia-surveys-prod --query "Contents[].{Key: Key, Size: Size}"
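The --query filters above are JMESPath expressions evaluated client-side over the Contents array of the list-objects response. A dependency-free Python sketch of their semantics (the sample keys are made up for illustration):

```python
# Simulate the JMESPath filters the AWS CLI applies to list-objects output.
contents = [  # hypothetical bucket listing
    {"Key": "Archive/input_2022.txt", "Size": 1024},
    {"Key": "Archive/output_2022.txt", "Size": 2048},
    {"Key": "Staging/input_2023.txt", "Size": 512},
]

# Contents[?contains(Key, `input`)]
with_input = [obj for obj in contents if "input" in obj["Key"]]

# --prefix Archive/ filters server-side; Contents[].{Key: Key} projects just the key
archive_keys = [{"Key": obj["Key"]} for obj in contents
                if obj["Key"].startswith("Archive/")]

# Contents[].{Key: Key, Size: Size}
keys_and_sizes = [{"Key": obj["Key"], "Size": obj["Size"]} for obj in contents]

print(len(with_input), len(archive_keys), len(keys_and_sizes))
```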

Copying a local file to S3

aws s3 cp D:/Users/AnwarBA/Desktop/HPS_DigtalApp_Responses_EDL_Exp_20221028.txt s3://dataos-medallia-surveys-prod

Git SSH key Logic



ssh-keygen -t rsa -m PEM -b 4096 -C "aashif03@gmail.com"
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa

Testing your SSH connection (against GitHub):

ssh -T git@github.com

Reference: https://docs.oracle.com/en/cloud/paas/integration-cloud/ftp-adapter/generate-keys-pem-format.html

DataLake Vs Datawarehouse

Data lakes and data warehouses are both used to store data, but they are not interchangeable. In simple terms, a data lake stores raw data, whereas a data warehouse stores processed data (cleansed, filtered, aggregated, and so on).

Key Differences

Data Lake                              Vs       Data Warehouse
Raw data                                        Processed data
More storage needed                             Less storage than a data lake
Usage not yet determined                        Currently in use
Data scientists                                 Business professionals
Data prep / additional tools needed             Reports & dashboards
Not structured                                  Structured, with a well-defined schema


To be continued with an example soon.

Cumulative Addition in SQL


create table product(id int, value int);

insert into product values(1,500);
insert into product values(2,200);
insert into product values(3,400);


Method 1 : Self Join
Select a.id,a.value,sum(b.value) cvalue from product a ,product b
where
a.id>=b.id
group by a.id,a.value order by 1 ;


Method 2: Analytic Function

select id, value,sum(value) over (order by id) from product order by 1 ;

Method 3: Correlated Column Query

Select id, value , (select sum(value) from product where id<=p.id) cvalue
from product p ;
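The methods above all produce the same running total over the three rows inserted into product. A quick Python check of the expected output with itertools.accumulate:

```python
from itertools import accumulate

# Rows from the product table, ordered by id: (id, value)
rows = [(1, 500), (2, 200), (3, 400)]

# Running total of value, matching SUM(value) OVER (ORDER BY id)
cumulative = list(accumulate(v for _, v in rows))
print(cumulative)  # [500, 700, 1100]
```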



Method 4: CTE (Common Table Expression)

with cte as
(select * from product)
select a.id, a.value, sum(b.value) cvalue
from cte a, cte b
where a.id >= b.id
group by a.id, a.value
order by 1;

Azure SQL DW External Table - Simple Example



CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'some strong password ' ;

-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
  IDENTITY = '$$$ azstoreaccount $$$' ,
  SECRET = '$$$ azure storage secret key $$$' ;

-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
  ( LOCATION = 'wasbs://azcontainer@azstoreaccount.blob.core.windows.net/' ,
    CREDENTIAL = AzureStorageCredential ,
    TYPE = HADOOP
  ) ;
 
 
 CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = FALSE
    )
);


CREATE EXTERNAL TABLE employee_d(
id int,
name varchar(120)
)
WITH (LOCATION='/',
    DATA_SOURCE = MyAzureStorage,  
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);  


Select * from employee_d;
 

Git Essentials


A version control system (VCS) can be either centralized or distributed.


Distributed VCS: Git, Mercurial.
Centralized VCS: CVS, Perforce, SVN, TFS


The main difference between the two classes is that Centralized VCSs keep the history of changes on a central server from which everyone requests the latest version of the work and pushes the latest changes to. This means that everyone sharing the server also shares everyone's work.

On the other hand, on a Distributed VCS, everyone has a local copy of the entire work's history. This means that it is not necessary to be online to change revisions or add changes to the work. "Distributed" comes from the fact that there isn't a central entity in charge of the work's history, so that anyone can sync with any other team member. This helps avoid failure due to a crash of the central versioning server.

This post covers the key concepts and commands of Git (GitHub is used as the remote repository):

Commands  : 


git clone -- clones a remote repository to the local system


git status -- shows the status of the repository, including branch details



  1. Adding a file to the working directory and checking the status

Untracked files - files that exist in the working directory but have not yet been added to Git

  2. Adding the file to the staging area and checking the status

  3. Committing the file and checking the status

  4. Pushing the changes to the remote repository (GitHub) and checking the status

 git push origin master

   origin - remote reference
   master - branch name


Tracked and Untracked Files 

    Each file in your working directory can be in one of two states: tracked or untracked.

Tracked files are files that were in the last snapshot; they can be unmodified, modified, or staged. In short, tracked files are files that Git knows about.

git ls-files -- gives the list of tracked files

Changes to tracked files can be committed directly, skipping the staging area:

 git commit -am "comments"

Untracked files are everything else — any files in your working directory that were not in your last snapshot and are not in your staging area. When you first clone a repository, all of your files will be tracked and unmodified because Git just checked them out and you haven't edited anything.
Recursive add  - adds the files under subdirectories to staging area recursively
git  add . 

Backing out changes from the staging area and rolling back to the previous commit

git reset HEAD <file> -- unstages a file, i.e. moves it from the staging area back to the working directory

Rolling back changes in the working directory to the last commit:

git checkout -- <file> -- discards changes in the working directory


Commit History of a file :

git  log --follow <file_name> 

git show <commit_id>

Displaying all the commits

git log  --all --graph --decorate --oneline

Git Alias

Git doesn't automatically infer your command if you type it in partially. If you don't want to type the entire text of each of the Git commands, you can easily set up an alias for each command using
git config.

 Here are a couple of examples you may want to set up:

$ git config --global alias.hist "log  --all --graph --decorate --oneline"
$ git config --global alias.ci commit
$ git config --global alias.st status

Aliases set with --global are stored in ~/.gitconfig; without --global, they go in the "config" file of the repository's .git directory.

Ignoring Files

Often there are files that you don't want Git to automatically add or even show as untracked. These are generally automatically generated files such as log files or files produced by your build system. In such cases, you can create a file named .gitignore listing patterns to match them. Here is an example .gitignore file:

$ cat .gitignore
*.[oa]
*~

The first line tells Git to ignore any files ending in ".o" or ".a" -- object and archive files that may be the product of building your code. The second line tells Git to ignore all files whose names end with a tilde (~), which is used by many text editors such as Emacs to mark temporary files. You may also include a log, tmp, or pid directory; automatically generated documentation; and so on. Setting up a .gitignore file for your new repository before you get going is generally a good idea, so you don't accidentally commit files that you really don't want in your Git repository.
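For simple patterns like the two above, .gitignore's globs behave like shell globs, which can be checked with Python's fnmatch module (a close approximation of gitignore matching for patterns without directory separators; the file names are made up):

```python
from fnmatch import fnmatch

# Patterns from the example .gitignore above
patterns = ["*.[oa]", "*~"]

files = ["main.o", "libfoo.a", "main.c", "notes.txt~", "README.md"]

# A file is ignored if it matches any pattern
ignored = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(ignored)  # ['main.o', 'libfoo.a', 'notes.txt~']
```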

Remote Repository Reference 

git  remote add training_link https://github.com/alexa/git-training.git

root@ludovico:/Projects/git-training# git remote
origin
root@ludovico:/Projects/git-training# git remote show origin
* remote origin
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (fast-forwardable)



Configuration of external diff and merge tools:

My choice of external tool is P4Merge from Perforce. After installing P4Merge, the following Git configuration changes are required:

git config --global merge.tool p4merge
git config --global mergetool.p4merge.path "C:\Program Files\Perforce\p4merge.exe"
git config --global diff.tool p4merge
git config --global difftool.p4merge.path "C:\Program Files\Perforce\p4merge.exe"



Comparison 

Working Directory Vs Staging Area

git diff 

git  difftool   (In case of p4merge,  staging area on left side and working directory on the right) 

Working Directory Vs Last Commit 

git  diff HEAD
git  difftool HEAD  (In case of p4merge,  HEAD/Last Commit on left side and working directory on the right)

Staging Area vs Last Commit

git diff --staged HEAD 

git  difftool --staged HEAD  (In case of p4merge,  HEAD/Last Commit on left side and staging area on the right)

Note: These commands report all files that differ between the source and target. To compare a particular file, mention its path, e.g. git diff -- foo.txt


Branching :

git branch  -- lists all the local branches  (branch name with * is the current branch ) 

git branch -a   -- lists both remote and local branches 

git branch <branch_name> -- creates the new branch 

git checkout <branch_name> -- checkout is used to switch from one branch to other


Creating a new branch and switching to it in a single step:

git checkout -b <branch_name>



Merge conflict:

Can be resolved manually or by using the configured mergetool (git mergetool).


Simple YAML for Beginners


Employee:
  Name: Alexander
  Sex: Male
  Age: 30
  Title: Devops Engineer
  Projects:
    - Automation
    - Support
  Payslips:
    - Month: June
      Wage: 4000
    - Month: July
      Wage: 4500
    - Month: August
      Wage: 4000
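Parsed (for example with PyYAML's yaml.safe_load, omitted here to stay dependency-free), the document above maps to this Python structure -- indentation expresses nesting, and "-" introduces a list item:

```python
# Python equivalent of the YAML document above
employee = {
    "Employee": {
        "Name": "Alexander",
        "Sex": "Male",
        "Age": 30,
        "Title": "Devops Engineer",
        "Projects": ["Automation", "Support"],  # YAML sequence -> list
        "Payslips": [                           # sequence of mappings -> list of dicts
            {"Month": "June", "Wage": 4000},
            {"Month": "July", "Wage": 4500},
            {"Month": "August", "Wage": 4000},
        ],
    }
}

print(employee["Employee"]["Payslips"][1]["Wage"])  # 4500
```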

Docker Basic Commands

1. Launching a container : 

docker run <image-name>
e.g docker run ubuntu

2. Checking the list of docker containers:

For running containers :  docker ps 

For running & shutdown list : docker ps -a 

3. Stopping the docker container (the command below only stops the container; it does not delete the files the container created)

docker stop <container id / container_name > 
e.g docker stop 0uy1212121d19839121

4. Removing the stopped container

docker rm  <container id / container_name > 
e.g docker rm smart_world

Note: It's possible to stop or remove multiple containers at once, by container IDs or container names. When using container IDs, it is not necessary to type the entire ID; the first 3-4 characters are enough, provided they are unambiguous:

docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
fa7d61b970f0        ubuntu              "sleep 20"          11 minutes ago      Exited (0) 11 minutes ago                       peaceful_boyd
aa45e8d98ee9        ubuntu              "/bin/bash"         12 minutes ago      Exited (0) 12 minutes ago                       friendly_engelbart

$ docker rm fa7 aa4
fa7
aa4

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

5.  Removing the docker image

docker rmi <image name > 

6. Downloading Image 

docker pull <image name >

7. Running a command inside a running container

docker exec <container id / container_name > <command>

8. Detach / Attach

To run in the background (detached mode):
docker run -d <image name> additional command if any
e.g docker run -d centos sleep 200

To bring the container to the foreground:

docker attach <container-id>

9. Accessing the bash of container interactively 

docker run -it <image-id> bash
e.g docker run -it ubuntu bash 

Things to remember: 

Port Mapping:

By default, containers are launched with their own network layer. Applications installed in a container can be accessed in two ways:

1. Via the IP address of the container; this way only the docker host can access the applications.
2. Via the IP address of the docker host; this works only if the port-mapping parameter is specified when launching the container
    docker run -p <dockerhost-portnumber>:<dockerapplication-portnumber> <image-id>
    e.g docker run -p 8090:8080 jenkins

Volume Mapping:

By default, data stored in a container lives only as long as the container itself: once the container is removed, its data is gone. To persist a container application's data, specify the volume-mapping parameter when launching the container.

  docker run -v <location of docker host>:<location of container> <image id> 
  e.g docker run -p 3306:3306 -v /opt/mysqldata:/var/mysql  mysql

A Dockerfile together with docker build is used to create a custom docker image as per requirements.

Note: a Dockerfile always begins with a base image (an OS- or runtime-level image, via FROM).

Sample Dockerfile
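A minimal illustrative Dockerfile (a sketch; the app file and packages are assumptions, chosen to match the mybicommunityapp image built below):

```dockerfile
# Base image: every Dockerfile starts FROM a base (OS-level) image
FROM ubuntu:22.04

# Install a runtime dependency (hypothetical)
RUN apt-get update && apt-get install -y python3

# Copy the application code into the image (app.py is a placeholder)
COPY app.py /opt/app/app.py

# Command executed when a container is launched from this image
CMD ["python3", "/opt/app/app.py"]
```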



Building the image

docker build .

docker build . -t mybicommunityapp

Pushing an image to a registry is a two-step process: docker login, then docker push.

Docker Compose :  (docker-compose.yml)

A Compose file is a YAML file defining services, networks, and volumes; it helps launch a whole stack of containers.
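A minimal illustrative docker-compose.yml (a sketch; service names are assumptions, reusing the port and volume mappings from the examples above):

```yaml
version: "3"
services:
  web:                  # hypothetical web service
    image: jenkins
    ports:
      - "8090:8080"     # host:container, as with docker run -p
  db:
    image: mysql
    volumes:
      - /opt/mysqldata:/var/lib/mysql   # host path : container path
    environment:
      MYSQL_ROOT_PASSWORD: example
```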



Launching the containers defined in the docker-compose file:

docker-compose up

Docker network:

Three network options are available when launching a container: bridge (the default), none, and host.




Docker Swarm: clustering of docker hosts to run containers at scale and avoid a single point of failure.




DynamoDB Filter Expression


Filter expressions (via boto3's boto3.dynamodb.conditions.Attr) are applied to the items read by a Scan or Query before the results are returned:

Attr('lockname').not_exists() -- matches items that do not have the lockname attribute

Attr('lockname').contains('mybicommunity') -- matches items whose lockname contains the substring 'mybicommunity'
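Running these against a real table requires boto3 and AWS credentials; as a dependency-free sketch, the two conditions above have the following semantics over plain Python dicts (the sample items are made up):

```python
# Simulate DynamoDB filter-expression semantics over in-memory items
items = [
    {"id": 1},                                   # no lockname attribute
    {"id": 2, "lockname": "mybicommunity-etl"},
    {"id": 3, "lockname": "other-lock"},
]

# Attr('lockname').not_exists()
not_exists = [i for i in items if "lockname" not in i]

# Attr('lockname').contains('mybicommunity')
contains = [i for i in items if "mybicommunity" in i.get("lockname", "")]

print([i["id"] for i in not_exists], [i["id"] for i in contains])
```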