DATA SHEET
Pentaho™ Data Integration
Ingest, Blend, Cleanse and Prepare Diverse Data From
Any Source, in Any Environment — Without Code.
With Pentaho Data Integration, managing the enormous volumes and
increased variety and velocity of data entering organizations is simplified.
By allowing data preparation from any source and automating your data pipeline, Pentaho Data Integration allows you to curate
data better for your business user. This software delivers business analytics to end users faster with visual tools that reduce
time and complexity — without writing SQL or coding in Java or Python. Organizations immediately gain real value from their
various data sources in the cloud or on premises, including files, relational databases, Hadoop and more.
Turn Big Data Into Actionable Analytics
Pentaho Data Integration’s adaptive big data layer allows you to plug into popular big data stores with flexibility and insulation from change. Data can be accessed once, then processed, combined and consumed anywhere. The adaptive big data layer includes plug-ins for Hadoop distributions and object stores from Cloudera, Hortonworks, MapR (HPE Ezmeral Data Fabric), Amazon Web Services, Google Cloud and Microsoft Azure, object stores such as Hitachi Content Platform, as well as popular NoSQL databases like MongoDB and Cassandra.

Integrate and Blend Big Data With Existing Enterprise Data
With broad connectivity to any data type and high-performance Spark and MapReduce execution, Pentaho technology simplifies and speeds the process of integrating existing databases with new sources of data. Pentaho Data Integration’s graphical designer includes:
● Intuitive, drag-and-drop designer to simplify the creation of analytics data pipelines (see Figure 1).
● Rich library of prebuilt components to access, prepare and blend data from relational sources, big data stores on premises or in the cloud, enterprise applications and more.
● Ability to spot check data in flight with immediate access to analytics, including charts, visualizations and reporting, from any data prep step.
● Powerful orchestration capabilities to coordinate and combine transformations, including notifications and alerts.
● Integrated enterprise scheduler for coordinating workflows and debugger for testing and tuning job execution.

Data Processing Performance and Productivity
Pentaho Data Integration speeds performance time, reduces the complexity of integrating big data sources, and provides:
● Code-free data transformation design that empowers 15 times faster productivity versus hand-coding and executes in-cluster for high performance.
● Template-based approach to rapidly onboard data sources into Hadoop via metadata injection feature set.
● Ability to seamlessly switch between execution engines, such as Spark and the Pentaho native engine, to fit data volume and transformation complexity (see Figure 2).
● Support for advanced analytics models from R, Python, Scala and Weka to operationalize predictive intelligence while reducing data prep time.

“Moving data across a business is an art. Pentaho transforms art into better business value.”
Warren Chang, VP of Engineering, Borderfree

Figure 1: Drag-and-Drop Data Transformation in Pentaho Data Integration
Figure 2: Adaptive Execution With Spark and Visually Designed Hadoop MapReduce Jobs in Pentaho Data Integration
Broad Connectivity and Data Delivery
Pentaho Data Integration offers broad connectivity to a variety of diverse data, including all popular structured, unstructured and semi-structured data sources. Some examples include:
● Relational database management system (RDBMS): Oracle, IBM DB2, MySQL, Microsoft SQL Server, Postgres, IBM MQ.
● Spark and Hadoop: Cloudera, Hortonworks, Amazon EMR, MapR (HPE Ezmeral Data Fabric), Microsoft Azure HDInsight and Elasticsearch.
● NoSQL databases and object stores: MongoDB, Cassandra, HBase, Hitachi Content Platform, AWS S3, Google Cloud Storage, Microsoft Azure ADLS Gen 2.
● Analytic databases: Amazon Redshift, Snowflake, Vertica, Greenplum, Teradata, SAP HANA, Google BigQuery.
● Business applications: SAP, Salesforce, Google Analytics.
● Files: XML, JSON, Microsoft Excel, CSV, txt, Avro, Parquet, ORC, EBCDIC (mainframe), unstructured files with metadata, including audio, video and visual files.

To increase the performance of data extraction, loading and delivery processes, Pentaho offers the following capabilities:
● Native connectivity and bulk-loading to most common data sources, including Amazon Redshift and Snowflake.
● Data services to virtualize transformations without staging, making data sets immediately available to reports and applications.
● Automatic creation and publishing of metadata models to drive faster analytic results.
● Real-time processing of streaming data.

Data Profiling and Data Quality
Pentaho technology provides data profiling capabilities, such as row counts, mathematical functions and identification of null values, as well as data quality operators, such as string manipulators, mapping functions, filtering and sorting. For name and address verification capabilities, Pentaho technology integrates with leading data quality vendors, such as Human Inference and Melissa Data. Pentaho data profiling and data quality capabilities help:
● Identify data that fails to comply with business rules and standards.
● Deduplicate and cleanse inconsistent and redundant data.
● Validate, standardize and correct name, address, email and telephone data.
● Replace file names and locations with simple business names by integrating with Pentaho Data Catalog.

Powerful Administration and Management
Pentaho Data Integration provides out-of-the-box capabilities for managing operations for data integration projects. These capabilities include:
● Shared repository for collaboration among data analysts, developers and data stewards.
● Content management, versioning and locking to easily version jobs for rollback to prior versions.
● Control over security privileges for users and roles and integration with third-party security systems; ability to set permissions for creating, reading or executing jobs and transformations.
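
For readers who want a concrete picture of the data profiling and data quality checks named in this datasheet (row counts, mathematical functions, identification of null values, deduplication), the following is a minimal illustrative sketch in plain Python. It is not Pentaho Data Integration code and does not use any Pentaho API; the function names `profile` and `deduplicate`, the field names and the sample rows are all hypothetical, chosen only to show what such checks compute.

```python
from statistics import mean

# Illustrative only: hypothetical helpers, not part of Pentaho Data Integration.

def profile(rows, numeric_field):
    """Return the row count, per-field null counts and basic stats for one numeric field."""
    fields = sorted({key for row in rows for key in row})
    # Treat None and empty strings as null values.
    null_counts = {
        f: sum(1 for row in rows if row.get(f) in (None, "")) for f in fields
    }
    values = [
        row[numeric_field] for row in rows if row.get(numeric_field) not in (None, "")
    ]
    stats = {"min": min(values), "max": max(values), "mean": mean(values)} if values else {}
    return {"row_count": len(rows), "null_counts": null_counts, "stats": stats}

def deduplicate(rows, key_field):
    """Keep the first row for each key after trimming whitespace and lowercasing."""
    seen = set()
    unique = []
    for row in rows:
        key = (row.get(key_field) or "").strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample data: the second row duplicates the first after cleansing.
SAMPLE = [
    {"email": "ann@example.com", "amount": 10.0},
    {"email": " Ann@Example.com ", "amount": 30.0},
    {"email": None, "amount": None},
]

if __name__ == "__main__":
    print(profile(SAMPLE, "amount"))
    print(deduplicate(SAMPLE, "email"))
```

In the product itself these checks are assembled visually from prebuilt components rather than written by hand; the sketch only makes explicit the kind of result each check produces.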
Corporate Headquarters
2535 Augustine Drive
Santa Clara, CA 95054 USA
hitachivantara.com | community.hitachivantara.com

Contact Information
USA: 1-800-446-0744
Global: 1-858-547-4526
hitachivantara.com/contact
© Hitachi Vantara LLC 2023. All Rights Reserved. HITACHI and Pentaho are trademarks or registered trademarks of Hitachi, Ltd.
All other trademarks, service marks and company names are properties of their respective owners.
HV-CBE-DS-Pentaho-Data-Integration-7Jul23-J