Partitioned Hive Table Insert Syntax


create table empl(id int, name string) partitioned by (State string);


Insert records into a Partitioned table using VALUES clause.


insert into empl  partition(State='Ohio' ) values (1,'Jr Donald');
insert into empl  partition(State='Ohio' ) values (2,'Jr Erica');
insert into empl  partition(State='Ohio' ) values (3,'Mela');
insert into empl  partition(State='Nevada' ) values (10,'Haris');
insert into empl  partition(State='Nevada' ) values (11,'Jake');
insert into empl  partition(State='Nevada' ) values (12,'Trevor');

Inserting data into a Hive partitioned table using a SELECT clause (dynamic partitioning). Dynamic partition inserts need these settings first:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO empl PARTITION(State) select id,name,state from empl_w;

 
 
Insert Overwrite (replaces any existing data in the affected partitions)
insert overwrite table empl partition(state) select id,name,state from empl_w;
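Conceptually, a dynamic-partition insert routes each row into the partition directory named after its partition-column value. A minimal Python sketch of that routing (the rows mirror the six employees inserted above; empl_w is assumed to hold the same data):

```python
from collections import defaultdict

# Rows as they would appear in empl_w: (id, name, state)
rows = [
    (1, "Jr Donald", "Ohio"), (2, "Jr Erica", "Ohio"), (3, "Mela", "Ohio"),
    (10, "Haris", "Nevada"), (11, "Jake", "Nevada"), (12, "Trevor", "Nevada"),
]

# Hive stores each partition under a directory like State=Ohio/
partitions = defaultdict(list)
for emp_id, name, state in rows:
    partitions[f"State={state}"].append((emp_id, name))

print(sorted(partitions))  # ['State=Nevada', 'State=Ohio']
```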

AWS CLI - S3 API Recap

Command to list objects filtered on the key. Note the single quotes around the query: the JMESPath backtick literal would otherwise be interpreted by the shell as command substitution.

aws s3api list-objects --bucket dataos-customer-surveys --query 'Contents[?contains(Key, `input`)]'
 
Command to list object keys based on a prefix
aws s3api list-objects --bucket dataos-customer-surveys --prefix Archive/ --query "Contents[].{Key: Key}"

Command to list object keys and sizes
aws s3api list-objects --bucket dataos-medallia-surveys-prod --query "Contents[].{Key: Key, Size: Size}"
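The --query filters above are JMESPath expressions evaluated client-side over the Contents array of the list-objects response. A dependency-free Python sketch of their semantics (the sample keys are made up for illustration):

```python
# Simulate the JMESPath filters the AWS CLI applies to list-objects output.
contents = [  # hypothetical bucket listing
    {"Key": "Archive/input_2022.txt", "Size": 1024},
    {"Key": "Archive/output_2022.txt", "Size": 2048},
    {"Key": "Staging/input_2023.txt", "Size": 512},
]

# Contents[?contains(Key, `input`)]
with_input = [obj for obj in contents if "input" in obj["Key"]]

# --prefix Archive/ filters server-side; Contents[].{Key: Key} projects just the key
archive_keys = [{"Key": obj["Key"]} for obj in contents
                if obj["Key"].startswith("Archive/")]

# Contents[].{Key: Key, Size: Size}
keys_and_sizes = [{"Key": obj["Key"], "Size": obj["Size"]} for obj in contents]

print(len(with_input), len(archive_keys), len(keys_and_sizes))
```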

Copying a local file to S3

aws s3 cp D:/Users/AnwarBA/Desktop/HPS_DigtalApp_Responses_EDL_Exp_20221028.txt s3://dataos-medallia-surveys-prod

Git SSH key Logic



ssh-keygen -t rsa -m PEM -b 4096 -C "aashif03@gmail.com"
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa

Testing your SSH connection (against GitHub):

ssh -T git@github.com

Reference: https://docs.oracle.com/en/cloud/paas/integration-cloud/ftp-adapter/generate-keys-pem-format.html

DataLake Vs Datawarehouse

Data lakes and data warehouses are both used to store data, but they are not interchangeable. In simple terms, a data lake stores raw data, whereas a data warehouse stores processed data (cleansed, filtered, aggregated, and so on).

Key Differences

Data Lake                              Vs       Data Warehouse
Raw data                                        Processed data
More storage needed                             Less storage than a data lake
Usage not yet determined                        Currently in use
Data scientists                                 Business professionals
Data prep / additional tools needed             Reports & dashboards
Not structured                                  Structured, with a well-defined schema


To be continued with an example soon.

Cumulative Addition in SQL


create table product(id int, value int);

insert into product values(1,500);
insert into product values(2,200);
insert into product values(3,400);


Method 1 : Self Join
Select a.id,a.value,sum(b.value) cvalue from product a ,product b
where
a.id>=b.id
group by a.id,a.value order by 1 ;


Method 2: Analytic Function

select id, value,sum(value) over (order by id) from product order by 1 ;

Method 3: Correlated Column Query

Select id, value , (select sum(value) from product where id<=p.id) cvalue
from product p ;
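The methods above all produce the same running total over the three rows inserted into product. A quick Python check of the expected output with itertools.accumulate:

```python
from itertools import accumulate

# Rows from the product table, ordered by id: (id, value)
rows = [(1, 500), (2, 200), (3, 400)]

# Running total of value, matching SUM(value) OVER (ORDER BY id)
cumulative = list(accumulate(v for _, v in rows))
print(cumulative)  # [500, 700, 1100]
```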



Method 4: CTE (Common Table Expression)

with cte as
(select * from product)
select a.id, a.value, sum(b.value) cvalue
from cte a, cte b
where a.id >= b.id
group by a.id, a.value
order by 1;

Azure SQL DW External Table - Simple Example



CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'some strong password ' ;

-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
  IDENTITY = '$$$ azstoreaccount $$$' ,
  SECRET = '$$$ azure storage secret key $$$' ;

-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
  ( LOCATION = 'wasbs://azcontainer@azstoreaccount.blob.core.windows.net/' ,
    CREDENTIAL = AzureStorageCredential ,
    TYPE = HADOOP
  ) ;
 
 
 CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = FALSE
    )
);


CREATE EXTERNAL TABLE employee_d(
id int,
name varchar(120)
)
WITH (LOCATION='/',
    DATA_SOURCE = MyAzureStorage,  
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);  


Select * from employee_d;
 

Git Essentials


A version control system (VCS) can be either centralized or distributed.


Distributed VCS: Git, Mercurial.
Centralized VCS: CVS, Perforce, SVN, TFS


The main difference between the two classes is that Centralized VCSs keep the history of changes on a central server from which everyone requests the latest version of the work and pushes the latest changes to. This means that everyone sharing the server also shares everyone's work.

On the other hand, on a Distributed VCS, everyone has a local copy of the entire work's history. This means that it is not necessary to be online to change revisions or add changes to the work. "Distributed" comes from the fact that there isn't a central entity in charge of the work's history, so that anyone can sync with any other team member. This helps avoid failure due to a crash of the central versioning server.

This post covers the key concepts and commands of Git (GitHub is used as the remote repository):

Commands  : 


git clone -- clones a remote repository to the local system


git status -- shows the status of the repository, including branch details



  1. Adding a file to the working directory and checking the status

Untracked files - files that exist in the working directory but have not yet been added to Git

  2. Adding the file to the staging area and checking the status

  3. Committing the file and checking the status

  4. Pushing the changes to the remote repository (GitHub) and checking the status

 git push origin master

   origin - remote reference
   master - branch name


Tracked and Untracked Files 

    Each file in your working directory can be in one of two states: tracked or untracked.

Tracked files are files that were in the last snapshot; they can be unmodified, modified, or staged. In short, tracked files are files that Git knows about.

git ls-files -- gives the list of tracked files

Changes to tracked files can be committed directly, skipping the staging area:

 git commit -am "comments"

Untracked files are everything else — any files in your working directory that were not in your last snapshot and are not in your staging area. When you first clone a repository, all of your files will be tracked and unmodified because Git just checked them out and you haven't edited anything.
Recursive add  - adds the files under subdirectories to staging area recursively
git  add . 

Backing out changes from the staging area and rolling back to the previous commit

git reset HEAD <file> -- unstages a file, i.e. moves it from the staging area back to the working directory

Rolling back changes in the working directory to the last commit:

git checkout -- <file> -- discards changes in the working directory


Commit History of a file :

git  log --follow <file_name> 

git show <commit_id>

Displaying all the commits

git log  --all --graph --decorate --oneline

Git Alias

Git doesn't automatically infer your command if you type it in partially. If you don't want to type the entire text of each of the Git commands, you can easily set up an alias for each command using
git config.

 Here are a couple of examples you may want to set up:

$ git config --global alias.hist "log  --all --graph --decorate --oneline"
$ git config --global alias.ci commit
$ git config --global alias.st status

Aliases set with --global are stored in ~/.gitconfig; without --global, they go in the "config" file of the repository's .git directory.

Ignoring Files

Often there are files that you don't want Git to automatically add or even show as untracked. These are generally automatically generated files such as log files or files produced by your build system. In such cases, you can create a file named .gitignore listing patterns to match them. Here is an example .gitignore file:

$ cat .gitignore
*.[oa]
*~

The first line tells Git to ignore any files ending in ".o" or ".a" -- object and archive files that may be the product of building your code. The second line tells Git to ignore all files whose names end with a tilde (~), which is used by many text editors such as Emacs to mark temporary files. You may also include a log, tmp, or pid directory; automatically generated documentation; and so on. Setting up a .gitignore file for your new repository before you get going is generally a good idea, so you don't accidentally commit files that you really don't want in your Git repository.
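For simple patterns like the two above, .gitignore's globs behave like shell globs, which can be checked with Python's fnmatch module (a close approximation of gitignore matching for patterns without directory separators; the file names are made up):

```python
from fnmatch import fnmatch

# Patterns from the example .gitignore above
patterns = ["*.[oa]", "*~"]

files = ["main.o", "libfoo.a", "main.c", "notes.txt~", "README.md"]

# A file is ignored if it matches any pattern
ignored = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(ignored)  # ['main.o', 'libfoo.a', 'notes.txt~']
```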

Remote Repository Reference 

git  remote add training_link https://github.com/alexa/git-training.git

root@ludovico:/Projects/git-training# git remote
origin
root@ludovico:/Projects/git-training# git remote show origin
* remote origin
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (fast-forwardable)



Configuration of external diff and merge tools:

My choice of external tool is P4Merge from Perforce. After installing P4Merge, the following Git configuration changes are required:

git config --global merge.tool p4merge
git config --global mergetool.p4merge.path "C:\Program Files\Perforce\p4merge.exe"
git config --global diff.tool p4merge
git config --global difftool.p4merge.path "C:\Program Files\Perforce\p4merge.exe"



Comparison 

Working Directory Vs Staging Area

git diff 

git  difftool   (In case of p4merge,  staging area on left side and working directory on the right) 

Working Directory Vs Last Commit 

git  diff HEAD
git  difftool HEAD  (In case of p4merge,  HEAD/Last Commit on left side and working directory on the right)

Staging Area vs Last Commit

git diff --staged HEAD 

git  difftool --staged HEAD  (In case of p4merge,  HEAD/Last Commit on left side and staging area on the right)

Note: These commands report all files that differ between the source and target. To compare a particular file, mention its path, e.g. git diff -- foo.txt


Branching :

git branch  -- lists all the local branches  (branch name with * is the current branch ) 

git branch -a   -- lists both remote and local branches 

git branch <branch_name> -- creates the new branch 

git checkout <branch_name> -- checkout is used to switch from one branch to other


Creating a new branch and switching to it in a single step:

git checkout -b <branch_name>



Merge conflict:

Can be resolved manually or by using the configured mergetool (git mergetool).


Simple YAML for Beginners


Employee:
  Name: Alexander
  Sex: Male
  Age: 30
  Title: Devops Engineer
  Projects:
    - Automation
    - Support
  Payslips:
    - Month: June
      Wage: 4000
    - Month: July
      Wage: 4500
    - Month: August
      Wage: 4000
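Parsed (for example with PyYAML's yaml.safe_load, omitted here to stay dependency-free), the document above maps to this Python structure -- indentation expresses nesting, and "-" introduces a list item:

```python
# Python equivalent of the YAML document above
employee = {
    "Employee": {
        "Name": "Alexander",
        "Sex": "Male",
        "Age": 30,
        "Title": "Devops Engineer",
        "Projects": ["Automation", "Support"],  # YAML sequence -> list
        "Payslips": [                           # sequence of mappings -> list of dicts
            {"Month": "June", "Wage": 4000},
            {"Month": "July", "Wage": 4500},
            {"Month": "August", "Wage": 4000},
        ],
    }
}

print(employee["Employee"]["Payslips"][1]["Wage"])  # 4500
```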

Docker Basic Commands

1. Launching a container : 

docker run <image-name>
e.g docker run ubuntu

2. Checking the list of docker containers:

For running containers :  docker ps 

For running & shutdown list : docker ps -a 

3. Stopping the docker container (the command below only stops the container; it does not delete the files the container created)

docker stop <container id / container_name > 
e.g docker stop 0uy1212121d19839121

4. Removing the stopped container

docker rm  <container id / container_name > 
e.g docker rm smart_world

Note: It's possible to stop or remove multiple containers at once, by container IDs or container names. When using container IDs, it is not necessary to type the entire ID; the first 3-4 characters are enough, provided they are unambiguous:

docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
fa7d61b970f0        ubuntu              "sleep 20"          11 minutes ago      Exited (0) 11 minutes ago                       peaceful_boyd
aa45e8d98ee9        ubuntu              "/bin/bash"         12 minutes ago      Exited (0) 12 minutes ago                       friendly_engelbart

$ docker rm fa7 aa4
fa7
aa4

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

5.  Removing the docker image

docker rmi <image name > 

6. Downloading Image 

docker pull <image name >

7. Running a command inside a running container

docker exec <container id / container_name > <command>

8. Detach / Attach

To run in the background (detached mode):
docker run -d <image name> additional command if any
e.g docker run -d centos sleep 200

To bring the container to the foreground:

docker attach <container-id>

9. Accessing the bash of container interactively 

docker run -it <image-id> bash
e.g docker run -it ubuntu bash 

Things to remember: 

Port Mapping:

By default, containers are launched with their own network layer. Applications installed in a container can be accessed in two ways:

1. Via the IP address of the container; this way only the docker host can access the applications.
2. Via the IP address of the docker host; this works only if the port-mapping parameter is specified when launching the container
    docker run -p <dockerhost-portnumber>:<dockerapplication-portnumber> <image-id>
    e.g docker run -p 8090:8080 jenkins

Volume Mapping:

By default, data stored in a container lives only as long as the container itself: once the container is removed, its data is gone. To persist a container application's data, specify the volume-mapping parameter when launching the container.

  docker run -v <location of docker host>:<location of container> <image id> 
  e.g docker run -p 3306:3306 -v /opt/mysqldata:/var/mysql  mysql

A Dockerfile together with docker build is used to create a custom docker image as per requirements.

Note: a Dockerfile always begins with a base image (an OS- or runtime-level image, via FROM).

Sample Dockerfile
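A minimal illustrative Dockerfile (a sketch; the app file and packages are assumptions, chosen to match the mybicommunityapp image built below):

```dockerfile
# Base image: every Dockerfile starts FROM a base (OS-level) image
FROM ubuntu:22.04

# Install a runtime dependency (hypothetical)
RUN apt-get update && apt-get install -y python3

# Copy the application code into the image (app.py is a placeholder)
COPY app.py /opt/app/app.py

# Command executed when a container is launched from this image
CMD ["python3", "/opt/app/app.py"]
```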



Building the image

docker build .

docker build . -t mybicommunityapp

Pushing an image to a registry is a two-step process: docker login, then docker push.

Docker Compose :  (docker-compose.yml)

A Compose file is a YAML file defining services, networks, and volumes; it helps launch a whole stack of containers.
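A minimal illustrative docker-compose.yml (a sketch; service names are assumptions, reusing the port and volume mappings from the examples above):

```yaml
version: "3"
services:
  web:                  # hypothetical web service
    image: jenkins
    ports:
      - "8090:8080"     # host:container, as with docker run -p
  db:
    image: mysql
    volumes:
      - /opt/mysqldata:/var/lib/mysql   # host path : container path
    environment:
      MYSQL_ROOT_PASSWORD: example
```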



Launching the containers defined in the docker-compose file:

docker-compose up

Docker network:

Three network options are available when launching a container: bridge (the default), none, and host.




Docker Swarm: clustering of docker hosts to run containers at scale and avoid a single point of failure.




DynamoDB Filter Expression


Filter expressions (via boto3's boto3.dynamodb.conditions.Attr) are applied to the items read by a Scan or Query before the results are returned:

Attr('lockname').not_exists() -- matches items that do not have the lockname attribute

Attr('lockname').contains('mybicommunity') -- matches items whose lockname contains the substring 'mybicommunity'
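Running these against a real table requires boto3 and AWS credentials; as a dependency-free sketch, the two conditions above have the following semantics over plain Python dicts (the sample items are made up):

```python
# Simulate DynamoDB filter-expression semantics over in-memory items
items = [
    {"id": 1},                                   # no lockname attribute
    {"id": 2, "lockname": "mybicommunity-etl"},
    {"id": 3, "lockname": "other-lock"},
]

# Attr('lockname').not_exists()
not_exists = [i for i in items if "lockname" not in i]

# Attr('lockname').contains('mybicommunity')
contains = [i for i in items if "mybicommunity" in i.get("lockname", "")]

print([i["id"] for i in not_exists], [i["id"] for i in contains])
```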