Big Data Summary
Business Intelligence (BI)
Business intelligence (BI) refers to a set of methods, resources, and software
tools used to steer a company and support decision-making: dashboards,
analytical and forward-looking reports.
Processing vs. storage
HADOOP ECOSYSTEM
Sqoop
-Imports/exports data from a database automatically
• RDBMS ↔ HDFS
-Example: a web application / MySQL
Flume
-Collects data from sources and imports it into HDFS
-e.g. logs.
HBase
-A NoSQL (key/value) database
-Distributed
-No practical limit on table size
-Integrates with Hadoop
Oozie
Orchestrates sequences of MapReduce tasks
An Oozie job: a directed acyclic graph of actions
Can be triggered by events or at a given time
• When a file is added, do ...
• Every day at 3:00 AM, do ...
Chukwa
-Distributed data collection system
-Optimizes Hadoop for processing logs
-Display, monitor and analyze log files
JobTracker
JobTracker is the service within Hadoop that is responsible for taking client requests. It
assigns them to TaskTrackers on DataNodes where the required data is present locally. If that
is not possible, JobTracker tries to assign the tasks to TaskTrackers in the same rack where the
data is locally present. If, for some reason, this also fails, JobTracker assigns the task to a
TaskTracker where a replica of the data exists. In Hadoop, data blocks are replicated across
DataNodes to guarantee redundancy, so that if a node in the cluster fails, the job does not
fail as well.
JobTracker process:
TaskTracker
Architecture
HDFS: Hadoop Distributed File System
stores data in a cluster, distributed across the different nodes
replicates the data
MapReduce:
map: transforms the data in parallel across the cluster
reduce: aggregates that data together
Data Ingestion:
- how we get data into your cluster and onto HDFS from external sources
-Sqoop: ties your Hadoop storage to a relational database;
is basically a connector between Hadoop (HDFS) and a database
-Flume: transfers logs from web applications in real time by listening to them
-Kafka: transfers data from any source to the cluster in real time
Query Engines:
Apache DRILL:
write SQL queries that work across a wide range of NoSQL databases
i.e. run a single SQL query over several databases at the same time
Hadoop Cluster
Architecture
Hadoop clusters are composed of a network of master and
worker nodes that orchestrate and execute the various jobs
across the Hadoop distributed file system. The master nodes
typically utilize higher quality hardware and include a
NameNode, Secondary NameNode, and JobTracker, with each
running on a separate machine. The workers consist of virtual
machines, running both DataNode and TaskTracker services on
commodity hardware, and do the actual work of storing and
processing the jobs as directed by the master nodes. The final
part of the system is the Client Nodes, which are responsible
for loading the data and fetching the results.
Master nodes oversee how data is stored in HDFS and
coordinate key operations, such as running parallel
computations on the data using MapReduce.
The worker nodes comprise most of the virtual machines
in a Hadoop cluster, and perform the job of storing the
data and running computations. Each worker node runs
the DataNode and TaskTracker services, which are used
to receive the instructions from the master nodes.
Client nodes are in charge of loading the data into the
cluster. Client nodes first submit MapReduce jobs
describing how data needs to be processed and then
fetch the results once the processing is finished.
----------------------------
----------------------------
The masters are the nodes that host the core, more specialized Hadoop
roles that usually orchestrate/coordinate/oversee processes and roles
on the other nodes — think HDFS NameNodes (of which there can be at
most two), Hive Metastore Server (only one at the time of writing
this answer), YARN ResourceManager (just the one), HBase
Masters, Impala StateStore and Catalog Server (one of each). All
master roles need not necessarily have a fixed number of instances
(you can have many Zookeeper Servers) but they all have associated
roles within the same service that rely on them to function. A typical
enterprise production cluster has 2–3 master nodes, scaling up as
per size and services installed on the cluster.
Contrary to this, the workers are the actual nodes doing the real
work of storing data or performing compute or other operations. Roles
like HDFS DataNode, YARN NodeManager, HBase RegionServer,
Impala Daemons etc — they need the master roles to coordinate the
work and total instances of each of these roles usually scale more
linearly with the size of the cluster. A typical cluster has about
80-90% of its nodes dedicated to hosting worker roles.
Put simply, edge nodes are the nodes that are neither masters, nor
workers. They usually act as gateways/connection-portals for end-
users to reach the worker nodes better. Roles like HiveServer2
servers, Impala LoadBalancer (Proxy server for Impala Daemons),
Flume agents, config files and web interfaces like HttpFS, Oozie
servers, Hue servers etc — they all fall under this category. Most of
each of these roles can be installed on multiple nodes (assigning
more nodes for each role helps prevent everybody from connecting to
one instance and overwhelming that node).
That said, there is no formal rule that forces cluster admins to adhere
to strict distinction between node types, and most Hadoop service
roles can be assigned to any node which further blurs these
boundaries. But following certain role-co-location guidelines can
significantly boost cluster performance and availability, and some
might be vendor-mandated.
----------------------------
----------------------------
HDFS
HDFS Commands
List a directory
# hadoop fs -ls /
Delete a file
# hadoop fs -rm /user/cloudera/d1/file.txt
Delete a directory
# hadoop fs -rm -r /user/cloudera/d1
Copy a file
# hadoop fs -cp /user/cloudera/d1/* /user/cloudera/d2
Move a file
# hadoop fs -mv /user/cloudera/d1/* /user/cloudera/d2
Create a directory
# hadoop fs -mkdir /user/rep1
HDFS Java
private static final String NAME_NODE = "hdfs://localhost:9000";
FileSystem hdfs = FileSystem.get(URI.create("hdfs://0.0.0.0:9000"), new Configuration());
CreateFolder(hdfs, FolderPath);
DeleteFolder(hdfs, FolderPath);
CopieLocalFileToHDFS(hdfs, FolderPath);
CreatFile(hdfs, FolderPath);
WriteIntoFile(hdfs, FolderPath);
ReadFile(hdfs, FolderPath);
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
hdfs.mkdirs(FolderPath);
++++++++++++++++++++++++++++++++++++++++++++++
Path localFilePath = new Path("/Users/aymanehinane/Desktop/Home/BigData/PrepaExam/hdfs/src/main/resources/data.txt");
hdfs.copyFromLocalFile(localFilePath, FolderPath);
++++++++++++++++++++++++++++++++++++++++++++++
Path FilePath = new Path(FolderPath + "/" + "data2.txt");
hdfs.createNewFile(FilePath);
++++++++++++++++++++++++++++++++++++++++++++++
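The WriteIntoFile helper has no snippet here; a minimal sketch, assuming the same hdfs handle and the data2.txt FilePath created just above:
FSDataOutputStream out = hdfs.create(FilePath, true); // overwrite data2.txt if it already exists
out.writeBytes("hello HDFS\n");                       // write some text into the file
out.close();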
++++++++++++++++++++++++++++++++++++++++++++++
Path FilePath = new Path(FolderPath + "/" + "data3.txt");
BufferedReader bfr = new BufferedReader(new InputStreamReader(hdfs.open(FilePath)));
String str = null;
while ((str = bfr.readLine()) != null) {
    System.out.println(str);
}
}
Hive
quit;
/home/hdoop/apache-hive-3.1.2-bin/bin/beeline -u jdbc:hive2://0.0.0.0:10000
--------------------------------------------------------------------------------------
hadoop fs -ls /
ratings table: user_id, movie_id, rating, rating_time
Data file: /opt/shared-folder/ml-100k/u.data
NameNode: hdfs://localhost:9000
Hive warehouse: /user/hive/warehouse
Exit;
\\s+
CREATE EXTERNAL TABLE IF NOT EXISTS ratings
(user_id INT,movie_id INT,rating INT,rating_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/ratings';
SHOW TABLES;
drop table rating;
root@c7556f98e9b1:/opt/shared-folder/ml-100k# cd $HIVE_HOME/bin
root@c7556f98e9b1:/home/hdoop/apache-hive-3.1.2-bin/bin# schematool -initSchema -dbType derby
Hive Java
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
// get connection
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
stmt.executeQuery("CREATE DATABASE userdb");
con.close();
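The driver and URL above target the old HiveServer1. With HiveServer2 (which is what the beeline jdbc:hive2:// connection earlier in these notes uses), the driver class is org.apache.hive.jdbc.HiveDriver and the URL scheme is jdbc:hive2://. A minimal sketch, reusing the ratings table defined above:
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection con2 = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
Statement st = con2.createStatement();
ResultSet rs = st.executeQuery("SELECT movie_id, rating FROM ratings LIMIT 5");
while (rs.next()) {
    System.out.println(rs.getInt(1) + "\t" + rs.getInt(2)); // print each returned row
}
con2.close();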
MapReduce
YARN
Hadoop 2 structure
The main components of the YARN architecture include: the ResourceManager, the NodeManagers, the ApplicationMasters, and Containers.
Advantages :
Disadvantages :
wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
For Linux
For Mac
python -m pip install pathlib
python -m pip install mrjob
python -m pip install PyYAML
--search for pa
And again we need to tell it exactly where the Hadoop streaming JAR is.
This is just the bit of code that integrates MapReduce with Python or
anything else, really.
Movie ID   Rating
1005 4
1006 2
1007 3
1008 4
1009 3
Map --> (4,1) (2,1) (3,1) (4,1) (3,1) ---> shuffle & sort --> (4,(1,1)) (2,1) (3,(1,1)) ---> Reduce --> (4,2) (2,1) (3,2)
++++++++++++++++++++++++++++++++++++++++++++
# MRJob: basically a way to very quickly write MapReduce jobs in Python.
# It abstracts away a lot of the complexity of dealing with the streaming interface to MapReduce.
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    # Map
    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    # Reduce
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
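To run it (the streaming JAR path below is a placeholder and depends on your Hadoop installation):
python RatingsBreakdown.py u.data
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /path/to/hadoop-streaming.jar u.data
The first command runs locally; the second submits the job to Hadoop via the streaming JAR mentioned above.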
Movie ID   Rating
1005 4
1006 2
1007 3
1008 4
1007 3
1005 2
1007 3
1008 4
1005 4
1006 2
1005 5
1007 2
Exercise 2: MapReduce
Explanation example:
MapReduce is written in Java, but it can be executed from different
languages: C++, Python, ...
In this example we will use Python with the MRJob package.
Step 1:
The mapper receives lines of text from a file and converts them into
(key, value) pairs.
Step 2:
Shuffle and sort is the stage where MapReduce groups together all the
values that share the same key: (key, list(values)).
Step 3:
Reduce computes a result over the list of values for each key.
Example application 1:
We have a file u.data that contains a list of reviews for each
movie
1005 4
1006 2
1007 3
1008 4
1009 3
Algorithm
Map --> (4,1) (2,1) (3,1) (4,1) (3,1) ---> shuffle & sort
--> (4,(1,1)) (2,1) (3,(1,1)) ---> Reduce --> (4,2) (2,1) (3,2)
Python code
MRJob lets you write a MapReduce job in Python and run it on
different environments.
Example application 2:
For this example we want to group together the movies that have the same
number of reviews.
Algorithm: first count the ratings per movie (map movieID -> 1, reduce by summing), then re-key each movie by its count so that movies with the same number of reviews end up under the same key.
Execution:
Architecture Summary
Client ----> HDFS client (hdfs dfs client) (where is the data?) ----> NameNode (HDFS)
The ResourceManager manages resource availability across the NodeManagers of my cluster
The MapReduce ApplicationMaster tracks the execution of the MapReduce tasks
The client node submits MapReduce jobs to the ResourceManager
(Hadoop YARN).
Hadoop YARN manages and monitors the cluster resources, such as
keeping track of the available capacity and the available nodes.
YARN will copy the needed data from the Hadoop Distributed File
System (HDFS) in parallel.
Next, the NodeManagers manage the MapReduce containers. The
MapReduce ApplicationMaster, which runs inside a NodeManager, keeps
track of each of the Map and Reduce tasks and distributes them across
the cluster with the help of YARN.
Map and Reduce tasks connect to the HDFS cluster to read the data
they need and to write their output.
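To make the submission flow concrete, here is a minimal sketch of the same ratings count written directly against the Hadoop Java MapReduce API (class names and paths are illustrative, not taken from these notes):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RatingCount {

    // Map: for each line "userID \t movieID \t rating \t timestamp", emit (rating, 1)
    public static class RatingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            context.write(new Text(fields[2]), ONE);
        }
    }

    // Reduce: sum the 1s for each rating value
    public static class RatingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rating count");
        job.setJarByClass(RatingCount.class);
        job.setMapperClass(RatingMapper.class);
        job.setReducerClass(RatingReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. u.data on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job to the YARN ResourceManager and waits for the result
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}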
Pig
data1.txt
1|aymane|hinane|22|casa
2|yohane|hinane|23|rabat
3|alie|ahmed|22|casa
4|noor|midawi|21|casa
tuple.txt
1|(aymane,hinane)|22|(casa,mdina)
2|(yohane,hinane)|23|(rabat,m6)
3|(noor,alaoui)|21|(casa,mdina)
4|(ahmed,masour)|22|(rabat,h5)
A tuple can perfectly well contain other tuples, bags, or other simple
and complex types.
For example:
( { (1, 2), (3, John) }, 3, [qui#23] )
mots.txt
hello world
hello world
hello aymane
DUMP ratings;
(721,262,3,877137285)
(913,209,2,881367150)
(378,78,3,880056976)
(880,476,3,880175444)
(716,204,5,879795543)
(276,1090,1,874795795)
(13,225,2,882399156)
(12,203,3,879959583)
DESCRIBE ratings;
-----------------------------------------------------------------------------
DUMP ratingsByMovie;
(1656,{(713,1656,2,888882085),(883,1656,5,891692168)})
(1658,{(733,1658,3,879535780),(894,1658,4,882404137),(782,1658,2,891500230)})
(1659,{(747,1659,1,888733313)})
(1662,{(762,1662,1,878719324),(782,1662,4,891500110)})
(1664,{(870,1664,4,890057322),(880,1664,4,892958799),(839,1664,1,875752902),
(782,1664,4,891499699)})
DESCRIBE ratingsByMovie;
-----------------------------------------------------------------------------
DUMP avgRatings;
(1673,3.0)
(1674,4.0)
(1675,3.0)
(1676,2.0)
(1677,3.0)
(1678,1.0)
DESCRIBE avgRatings;
-----------------------------------------------------------------------------
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
DUMP fiveStarMovies;
(1039,4.011111111111111)
(1064,4.25)
(1122,5.0)
(1125,4.25)
(1142,4.045454545454546)
(1169,4.1)
(1189,5.0)
(1191,4.333333333333333)
(1194,4.064516129032258)
(1201,5.0)
(1203,4.0476190476190474)
DESCRIBE fiveStarMovies;
-----------------------------------------------------------------------------
DUMP ratings;
(880,476,3,880175444)
(716,204,5,879795543)
(276,1090,1,874795795)
(13,225,2,882399156)
(12,203,3,879959583)
DESCRIBE metadata;
-----------------------------------------------------------------------------
DUMP nameLookup;
DESCRIBE nameLookup;
-----------------------------------------------------------------------------
DUMP fiveStarsWithData;
(1594,4.5,1594,Everest (1998),889488000)
(1599,5.0,1599,Someone Else's America (1995),831686400)
(1639,4.333333333333333,1639,Bitter Sugar (Azucar Amargo) (1996),848620800)
(1642,4.5,1642,Some Mother's Son (1996),851644800)
(1653,5.0,1653,Entertaining Angels: The Dorothy Day Story (1996),843782400)
DESCRIBE fiveStarsWithData;
-----------------------------------------------------------------------------
(100,4.155511811023622,100,Fargo (1996),855878400)
(181,4.007889546351085,181,Return of the Jedi (1983),858297600)
(515,4.203980099502488,515,Boot, Das (1981),860112000)
(1251,4.125,1251,A Chef in Love (1996),861926400)
(251,4.260869565217392,251,Shall We Dance? (1996),868579200)
(316,4.196428571428571,316,As Good As It Gets (1997),882835200)
(1293,5.0,1293,Star Kid (1997),884908800)
(1191,4.333333333333333,1191,Letter From Death Row, A (1998),886291200)
(1594,4.5,1594,Everest (1998),889488000)
(315,4.1,315,Apt Pupil (1998),909100800)
####################################################################
The most criticized movie:
the one with an average rating < 2 that received the most ratings
--DUMP ratings;
--DUMP ratingByMovie;
--DUMP limit_data;
--DUMP movieData;
####################################################################
Pig integrated with Tez takes less time and runs jobs in a very quick and efficient manner.
Tez uses a directed acyclic graph (DAG) of tasks instead of chaining separate MapReduce jobs.
Pig Note
Startup
=======
run pig locally:
$ magicpig -x local script.pig # doesn't work
Commands
========
Loading
-------
grunt> A = LOAD '/path/to/file' USING PigStorage(':') AS (field1:float, field2:int, …);
Load data from /path/to/file and name the fields field1, field2, … sequentially. Fields are split by ':'.
field1 is loaded as a float, field2 as an integer. The 'AS' part is optional. Other basic types are int,
long, float, double, bytearray, boolean, chararray.
Pig also supports complex types: tuple, bag, map. For example,
grunt> A = LOAD '/path/to/file' USING PigStorage(':') AS (field1:tuple(t1a:int, t1b:int,t1c:int),
field2:chararray);
This will load field1 as a tuple with 3 values that can be referenced as field1.t1a, field1.t1b, etc. Don't
worry about bags and maps.
Saving
------
grunt> STORE A INTO '/path/out' USING AvroStorage();
Save all of A's fields into /path/out in the format defined by AvroStorage
Generating
----------
grunt> B = FOREACH A GENERATE $0 + 1, org.apache.pig.tutorial.ExtractHour(field2) as field2hour;
Use this to select the first field and add one to it, and select the hour part of field2 from A and rename
it as field2hour. You have your basic arithmetic operations and %. You can also use "*" to select all
fields.
Here's an example of a nested foreach. Only distinct, filter, limit, and order are supported inside a
nested foreach.
daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields
grpd = group daily by exchange;
uniqcnt = foreach grpd {
sym = daily.symbol;
uniq_sym = distinct sym;
generate group, COUNT(uniq_sym); };
prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open
middle = foreach prices generate open..close; -- produces open, high, low, close
end = foreach prices generate volume..; -- produces volume, adj_close
Here's how to use the "<condition> ? <if true> : <if false>" construct
Filtering
---------
grunt> B = FILTER A by $1 > 2;
Remove entries from A whose second field is <= 2. You can use basic numerical comparisons, along
with "is null" and "matches" (for glob matching). e.g,
Grouping
--------
grunt> C = GROUP A BY (field1, field2), B BY (fieldA, fieldB);
Match entries in A to entries in B if (field1, field2) == (fieldA, fieldB), and return values of both grouped
by their respective keys. If A and B share the same field names, use A.field1 or B.field1. The key will
have alias "group".
Joining
-------
grunt> C = JOIN A BY field1 LEFT, B BY field2 USING 'replicated' PARALLEL 5;
Join two relations by the given fields. Uses the 'replicated' method to join, alternatives being
'skewed', 'merge', and normal (hash). Uses 5 reduce tasks. Does a left join (all rows in A will be
kept, but the right-hand side might be null).
There exist multiple types of joins with their own performance characteristics. They are listed below in
order of preference.
Flattening
----------
grunt> B = FOREACH A GENERATE flatten(field1) as field1flat, field2;
If the argument to flatten is a tuple, then flatten it like a list: e.g., ((a,b), c) -> (a, b, c).
If the argument is a bag, then make the cross product: e.g. a:{(b,c), (d,e)} -> {(a,b,c), (a,d,e)}
Unique
------
grunt> B = DISTINCT A;
Use this to turn A into a set. All fields in A must match to be grouped together
Sorting
-------
grunt> B = ORDER A by (field1, field2);
Cross Product
-------------
grunt> C = CROSS A, B;
Take the cross product of A and B's elements. Your machine will shit itself if you do this. You cannot
cross an alias with itself due to namespace conflicts.
Splitting
---------
grunt> SPLIT B INTO A1 IF field1 < 3, A2 IF field2 > 7;
Cut B into two parts based on conditions. Entries can end up in both A1 and A2.
Subsampling
-----------
grunt > B = SAMPLE A 0.01;
Viewing interactively
---------------------
grunt> B = limit A 500;
grunt> DUMP B;
grunt> DESCRIBE B;
Take the first 500 entries in A, and print them to screen, then print out the name of all of B's fields.
Built in Functions
==================
* Eval Functions
* AVG
* CONCAT
* COUNT
* COUNT_STAR
* DIFF
* IsEmpty
* MAX
* MIN
* SIZE
* SUM
* TOKENIZE
* Math Functions
* ABS
* ACOS
* ASIN
* ATAN
* CBRT
* CEIL
* COS
* COSH
* EXP
* FLOOR
* LOG
* LOG10
* RANDOM
* ROUND
* SIN
* SINH
* SQRT
* TAN
* TANH
* String Functions
* INDEXOF
* LAST_INDEX_OF
* LCFIRST
* LOWER
* REGEX_EXTRACT
* REGEX_EXTRACT_ALL
* REPLACE
* STRSPLIT
* SUBSTRING
* TRIM
* UCFIRST
* UPPER
* Tuple, Bag, Map Functions
* TOTUPLE
* TOBAG
* TOMAP
* TOP
Macros in Pig
=============
Basic usage example. Prefix arguments with "$". You can use aliases and literals, as they're literally
subbing text in.
DEFINE group_and_count (A, group_key, reducers) RETURNS B {
D = GROUP $A BY '$group_key' PARALLEL $reducers;
$B = FOREACH D GENERATE group, COUNT($A);
};
X = LOAD 'users' AS (user, age, zip);
Y = group_and_count (X, 'user', 20);
Z = group_and_count (X, 'age', 30);
Here's an example of one macro calling another.
DEFINE foreach_count(A, C) RETURNS B {
$B = FOREACH $A GENERATE group, COUNT($C);
};
DEFINE group_with_parallel (A, group_key, reducers) RETURNS B {
C = GROUP $A BY $group_key PARALLEL $reducers;
$B = foreach_count(C, $A);
};
Here's an example where a string is transformed into interpreted pig.
/* Get a count of records; return the name of the relation and the count. */
DEFINE total_count(relation) RETURNS total {
$total = FOREACH (group $relation all) generate '$relation' as label, COUNT_STAR($relation) as total;
};
/* Get totals on 2 relations, union and return them with labels */
DEFINE compare_totals(r1, r2) RETURNS totals {
total1 = total_count($r1);
total2 = total_count($r2);
$totals = union total1, total2;
};
/* See how many records from a relation are removed by a filter, given a condition */
DEFINE test_filter(original, condition) RETURNS result {
filtered = filter $original by $condition;
$result = compare_totals($original, filtered);
};
Parameters
==========
Here's how to do direct substitutions with variables in Pig. Note the "%" preprocessor statements (%declare, %default).
There exist a boatload of ways to write [U]ser [D]efined [F]unctions. Below is a simple
EvalFunc<ReturnType> for doing a map operation, but there are also FilterFunc.
EvalFunc<ReturnType> is also used for aggregation operations, but can be more efficient using the
Algebraic and Accumulator interfaces.
Java
----
.. code::
/**
* A simple UDF that takes a value and raises it to the power of a second
* value. It can be used in a Pig Latin script as Pow(x, y), where x and y
* are both expected to be ints.
*/
public class Pow extends EvalFunc<Long> {
public Schema outputSchema(Schema input) { // Check that we were passed two fields
if (input.size() != 2) {
throw new RuntimeException(
"Expected (int, int), input does not have 2 fields");
}
try {
// Get the types for both columns and check them. If they are
// wrong figure out what types were passed and give a good error
// message.
if (input.getField(0).type != DataType.INTEGER ||
input.getField(1).type != DataType.INTEGER) {
String msg = "Expected input (int, int), received schema (";
msg += DataType.findTypeName(input.getField(0).type);
msg += ", ";
msg += DataType.findTypeName(input.getField(1).type);
msg += ")";
throw new RuntimeException(msg);
}
} catch (Exception e) {
throw new RuntimeException(e);
}
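// (The snippet above is truncated; what follows is a minimal sketch of the remaining
//  methods, assuming the standard Pig EvalFunc API -- not the original course code.)
        return new Schema(new Schema.FieldSchema(null, DataType.LONG)); // the UDF outputs a long
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() != 2) {
            return null;
        }
        int base = (Integer) input.get(0);
        int exponent = (Integer) input.get(1);
        long result = 1;
        for (int i = 0; i < exponent; i++) {
            result *= base;   // integer power, as Pow(x, y) promises
        }
        return result;
    }
}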
Python
------
- in Pig,
.. code::
grunt> Register 'test.py' using jython as myfuncs;
grunt> b = foreach a generate myfuncs.helloworld(), myfuncs.square(3);
- in Python
.. code::
@outputSchemaFunction("squareSchema")
def pow(n1, n2):
return n1**n2
@schemaFunction("squareSchema")
def squareSchema(input): # this says int -> int, long -> long, float -> float, etc
return input
.. code::
@outputSchema("production:float")
def production(slugging_pct, onbase_pct):
return slugging_pct + onbase_pct
Pig doesn't have control structures (if, for, etc), so jobs that inherently have time-varying file names or
are repeated until convergence can't be controlled from within Pig. You can fix this by embedding Pig
in Python via Jython.
.. code::
# pass parameters
P1_bound = P1.bind(
{
'lower_bound': 500,
'outpath': '/path/to/save',
'count_path': '/tmp/count'
}
)
# run script
stats = P1_bound.runSingle()
# check if successful?
if stats.isSuccessful():
print 'Yay! it succeeded!'
# extract alias 'n_entries' from the script and get its first element
count = float(str(stats.result("n_entries").iterator().next().get(0)))
print 'Output %d rows' % (count,)
Pig Exercise
A tuple can contain other tuples, bags, or primitive types
+++++++++++++++++++++++++++++++++++++++++++++++++++
(13,225,2,882399156)
(12,203,3,879959583)
ratings: {userID: int,movieID: int,rating: int,ratingTime: int}
+++++++++++++++++++++++++++++++++++++++++++++++++++
DUMP metadata;
DESCRIBE metadata;
+++++++++++++++++++++++++++++++++++++++++++++++++++
(1,aymane,hinane,22,casa)
(2,yohane,hinane,23,rabat)
(3,alie,ahmed,22,casa)
(4,noor,midawi,21,casa)
user2: {id: int,lastName: chararray,firstName: chararray,age: int,city: chararray}
+++++++++++++++++++++++++++++++++++++++++++++++++++
(2,yohane,hinane,23,rabat)
(3,alie,ahmed,22,casa)
+++++++++++++++++++++++++++++++++++++++++++++++++++
(2,yohane,hinane,23,rabat)
+++++++++++++++++++++++++++++++++++++++++++++++++++
(4,noor,midawi,21,casa)
(3,alie,ahmed,22,casa)
(1,aymane,hinane,22,casa)
(2,yohane,hinane,23,rabat)
+++++++++++++++++++++++++++++++++++++++++++++++++++
(21, {(4,noor,midawi,21,casa)} )
(22, { (3,alie,ahmed,22,casa) , (1,aymane,hinane,22,casa) } )
(23,{ (2,yohane,hinane,23,rabat) } )
+++++++++++++++++++++++++++++++++++++++++++++++++++
(4,noor,midawi,21,casa)
(3,alie,ahmed,22,casa)
(1,aymane,hinane,22,casa)
(2,yohane,hinane,23,rabat)
z: {user2::id: int,user2::lastName: chararray,user2::firstName: chararray,user2::age: int,user2::city:
chararray}
+++++++++++++++++++++++++++++++++++++++++++++++++++
(22)
(23)
(22)
(21)
A: {age: int}
+++++++++++++++++++++++++++++++++++++++++++++++++++
((hinane,aymane),22)
((hinane,yohane),23)
((ahmed,alie),22)
((midawi,noor),21)
B: {fullName: (firstName: chararray,lastName: chararray),age: int}
+++++++++++++++++++++++++++++++++++++++++++++++++++
(1,(aymane,hinane),22,(casa,mdina))
(2,(yohane,hinane),23,(rabat,m6))
(3,(noor,alaoui),21,(casa,mdina))
(4,(ahmed,masour),22,(rabat,h5))
+++++++++++++++++++++++++++++++++++++++++++++++++++
(1,aymane,hinane,22,casa)
(2,yohane,hinane,23,rabat)
(3,alie,ahmed,22,casa)
(4,noor,midawi,21,casa)
user3: {id: int,prenom: chararray,nom: chararray,age: int,city: chararray}
+++++++++++++++++++++++++++++++++++++++++++++++++++
(21, { (4,noor,midawi,21,casa) } )
(22, { (3,alie,ahmed,22,casa),(1,aymane,hinane,22,casa) } )
(23, { (2,yohane,hinane,23,rabat) } )
+++++++++++++++++++++++++++++++++++++++++++++++++++
((hinane,aymane),22)
((hinane,yohane),23)
((ahmed,alie),22)
((midawi,noor),21)
D: {name: (nom: chararray,prenom: chararray),age: int}
+++++++++++++++++++++++++++++++++++++++++++++++++++
groupD = GROUP D BY age;
DUMP groupD;
DESCRIBE groupD;
(21,{ ((midawi,noor),21) } )
(22,{ ((ahmed,alie),22) , ((hinane,aymane),22) } )
(23,{ ((hinane,yohane),23) } )
groupD: {group: int,D: {(name: (nom: chararray,prenom: chararray),age: int)}}
+++++++++++++++++++++++++++++++++++++++++++++++++++
((midawi,noor),21)
((ahmed,alie),22)
((hinane,aymane),22)
((hinane,yohane),23)
flatternD: {D::name: (nom: chararray,prenom: chararray),D::age: int}
+++++++++++++++++++++++++++++++++++++++++++++++++++
(hello world)
(hello world)
(hello aymane)
mots: {ligne: chararray}
+++++++++++++++++++++++++
({(hello),(world)})
({(hello),(world)})
({(hello),(aymane)})
E: {mots: {tuple_of_tokens: (token: chararray)}}
+++++++++++++++++++++++++
+++++++++++++++++++++++++
G = GROUP F BY mot;
DUMP G;
DESCRIBE G;
(hello,{(hello),(hello),(hello)})
(world,{(world),(world)})
(aymane,{(aymane)})
G: {group: chararray,F: {(mot: chararray)}}
+++++++++++++++++++++++++
(hello,3)
(world,2)
(aymane,1)
J: {mot: chararray,occurence: long}
HBase
start-hbase.sh
HBase
• HBase is a column-oriented NoSQL DBMS.
HBase uses the Hadoop file system to store its data. It has a
master server and region servers. Data is stored in the form of
regions (tables). These regions are split up and stored on
region servers.
ALTER TABLE my_table ADD COLUMNS (new_column1 int COMMENT 'New integer column',
new_column2 string COMMENT 'New string column');
put '<table name>','row1','<colfamily:colname>','<value>'
Get g = new Get(Bytes.toBytes("row1"));
// Instantiating Get class
Result result = table.get(g);
// Reading the data
// Reading values from Result class object
byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
byte[] value1 = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Printing the values
String name = Bytes.toString(value);
String city = Bytes.toString(value1);
System.out.println("name: " + name + " city: " + city);
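For writing, the Java counterpart of the shell put above is the Put class; a minimal sketch reusing the 'personal' column family and 'row1' from the Get example (addColumn is the current HBase client call; the values are just examples):
Put p = new Put(Bytes.toBytes("row1"));  // row key
p.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("aymane"));
p.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("casa"));
table.put(p);                            // write the row into the table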
Test Revision
The 5 Vs of Big Data
Volume, Variety, Veracity, Value, Velocity
hadoop fs -mkdir
hadoop fs -mkdir -p
++++++++++++++++++++++++++++++++++++++
hdfs.mkdirs(folder);
hdfs.delete(folder, true);
hdfs.copyFromLocalFile(local, desti);
hdfs.createNewFile(path);
for (...)
st.append(...)
StringBuilder st = new StringBuilder();
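These last fragments hint at building up text in a loop with append() and then writing it out; a minimal sketch of that pattern using the HDFS handle from earlier (path and content are made up for illustration):
StringBuilder st = new StringBuilder();
for (int i = 1; i <= 5; i++) {
    st.append("line ").append(i).append("\n");           // accumulate the content
}
FSDataOutputStream out = hdfs.create(new Path("/user/rep1/data.txt"), true); // overwrite if present
out.writeBytes(st.toString());                            // write the accumulated text to HDFS
out.close();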