NetBackup Training
KEEPING PEOPLE AND INFORMATION CONNECTED.
Module 1: Brief Overview, Client/Policy Configuration, Troubleshooting
For Internal SunGard Use Only
Agenda
Introduction
Purpose & Assumptions
History
Terminology and Concepts
Architecture
Standards
How Backups Work
Managing NetBackup
Client Implementation and Configuration
Policies
Troubleshooting
Reporting
Monitoring Overall Environment
Shutdown/Restart NetBackup
Tips and Tricks
Education/Further Reading
Q&A
Purpose and Assumptions
Purpose
Increase knowledge of NetBackup product
Assumptions
Presentation assumes NetBackup 6.5.3
Basic familiarity with NetBackup
Know how to access environments
Windows and/or Unix admin experience
Please write down your questions for the Q&A session at the end
History
Corporate
  1987 - proprietary software solution written by engineers at Control Data for Chrysler Corp.
  1993 - renamed to BackupPlus (bp prefix)
  Late 1993 - OpenVision acquisition (/usr/openv/ install path); product rebranded as NetBackup
  1997 - Veritas acquired OpenVision
  2005 - Symantec acquired Veritas
Version
  1993 - BackupPlus 1.0 (Control Data)
  1994 - NetBackup 1.6 (OpenVision)
  1996 - NetBackup 2.0
  1997 - NetBackup 3.0 (Veritas)
  2000 - NetBackup 3.4
  2002 - NetBackup 4.5
  2003 - NetBackup 5.0
  2005 - NetBackup 6.0 (Symantec)
  2007 - NetBackup 6.5
Terminology and Concepts
Master Server - brains of the operation; houses the catalog
Media Server - where storage units exist; pushes data
Client - device providing data to be backed up
Enterprise Media Manager (EMM) - manages device and media information; typically installed on the Master
Catalog - database of backup images and other information
Metadata - info about files backed up (name, path, size, date, image location, etc.)
Duration - time it takes to perform the backup
Exit Code - final status of job
  0 = Successful with NO files missed
  1 = Successful with files missed
  2+ = Backup failed
Start Window - time when a backup can START
Frequency - how often the backup should execute
Retention - length of time backups are valid
Policy - grouping of like clients sharing similar attributes
Terminology and Concepts continued
Schedule - subset of a Policy; defines Start Window, retention, storage unit, etc.
Storage Unit - location defined to store backups; can be disk or tape; exists only on a Media Server
Backup Image - one backup job comprised of all files backed up; job must complete
Disk Storage - primary landing zone for jobs; destaged to tape later; removes older images as needed; can be configured many ways, current standard is Basic Disk; optional
Multiplexing - interleaving of multiple jobs on tape to prevent shoeshining
Long Term Data Retention - utilizes media and marginally increases catalog size; non-issue
NetBackup is dependent on proper forward and reverse lookups
Scales horizontally by adding more media servers
Terminology and Concepts continued
Files to Backup - part of policy config
Exclude List - files to skip
Include List - files to include after processing excludes
  Additional config on client; granular to policy or schedule level; no stacking
Backup Type
  Full - all files captured
  Differential Incremental - all changes since last backup
  Cumulative Incremental - all changes since last full
  User - allows user to run backups from client side; most used for child jobs of DB Agents (Default-Application-Backup)
Database Agents - Exchange, Notes, Oracle, SQL, SAP, etc.
Options - NDMP, Off-site Management (Vault), Tape/Disk Sharing, Bare Metal Restore, Snapshot, VMware, etc.
Licensing - gold key with many options; SunGard pays for Protected Data
Recovery - restore catalog or import all images manually
NetBackup Tiered Architecture
Master Server (Top Tier)
  Scheduler
  Stores Catalog (Metadata, Images), Volume Information
  Vaulting Management
Media Server(s) (Mid Tier)
  Data Mover
  Sends Metadata to Master
  Can be located on Master
Clients (Lower Tier)
  Configured via GUI/Registry (Win) or config files (*nix)
Example NetBackup Architecture Diagram
[Diagram components: Master Server; Media Servers 1 through N; Client Hosts; Meta-data Network; Backup Network; FC Switch Fabric A; FC Switch Fabric B; Disk Storage; Enterprise Class Tape Library]
Standards
Infrastructure
  Server/OS Types
    Unix - Solaris 10 T2000
  Naming Standards Defined
  Network Configuration Standards (metadata, backup, mgmt)
  Robot Types
    Quantum Scalar i2000
    STK SL8500
    Small Robots for legacy restores
  LTO3/4 Tape Drives / Media Types
  Volume Serial Numbers (VolSers/bar codes)
  SAN connectivity
  Disk Array Standards
  DSSU Configuration
Application/Configuration
Documented on LiveLink
How Backups Work (simplified)
Scheduler on Master tells Media server to back up its client
Media server is granted storage unit resource (disk or tape)
Media server connects to client software and tells it to start backing up
Client creates list of files to back up
  Full - everything
  Differential - changes since last backup
  Cumulative - changes since last full
Copies of files are sent to buffer
Buffer contents sent to Media server
Media server writes buffer contents to storage unit
Media server sends metadata to Master server to update catalog
Backup completes
  Storage unit resource released
  Backup image is completed and closed
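The file-list step above (Full, Differential, Cumulative) can be sketched in a few lines of Python. This is an illustration only, not how the NetBackup client is implemented: it assumes changes are detected by comparing modification times, and the names FileEntry and select_files are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    mtime: float  # last-modified time, seconds since the epoch


def select_files(files, backup_type, last_full, last_backup):
    """Return the paths a backup of the given type would capture.

    last_full   - time of the most recent full backup
    last_backup - time of the most recent backup of any type
    """
    if backup_type == "full":
        return [f.path for f in files]  # everything
    if backup_type == "differential":
        cutoff = last_backup  # changes since last backup
    elif backup_type == "cumulative":
        cutoff = last_full  # changes since last full
    else:
        raise ValueError(f"unknown backup type: {backup_type}")
    return [f.path for f in files if f.mtime > cutoff]
```

With a full at time 20 and an incremental at time 30, a file changed at time 25 is picked up by a cumulative backup but not by a differential one, which is why cumulatives grow until the next full.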
Managing NetBackup (Demonstration)
NBU Administration Console - 99.9% of daily administration occurs here
Activity Monitor - overall job status
  Jobs tab
    Job details
      State - Queued, Active, Partial, Failed
      Type - Backup, Restore, Catalog, Duplicate, Vault
      Status - Exit Code of job
        0 = All files backed up, no problems
        1 = Some files skipped (open/locked)
        >1 = Failure
      Additional info
    Suspend/kill jobs
    Sorting/Filtering - be aware of any filters you have set
    Exporting
  Daemons tab
  Processes tab
Help
Managing NetBackup (contd)
Storage
  Storage Units - defined target for backups (similar to storage pool in TSM)
    Disk or Tape
  Storage Unit Groups
Media
  Volume Pools - logical grouping of tapes
    Various defined pools
      Scratch
      SG_SHARED_xxx
    Policy defines Volume Pool
  Volume Groups - locational grouping of tapes
    Robot groups
    Onsite group
    Offsite groups
    Vault moves media between volume pools
  Robots - media currently in robot
  Standalone - tapes no longer associated with robot/volume group
  Inventory Robot
  Ejecting media
  States - Active, Full, Frozen, Suspended, Imported
Managing NetBackup (contd)
Device Monitor
  Up/Down/Reset drive
Devices
  Drives
  Robots
    SCSI Robots have a single Control Host
    ACS - any server can control
  Media Servers
  Topology
Managing NetBackup (contd)
Backup, Archive, and Restore
  Used for restoring files
Host Properties
  Master Server
  Media Servers
  Clients
    Include/Exclude Lists
    Server authorization
Catalog
  Offline backup (legacy method)
  Import images
  Verify images
  Duplicate images
Reports
Vault - option that processes and tracks volumes sent offsite
Client Implementation and Configuration
All systems
  Install client binaries
    Agents included for Windows, not for Unix
  Verify network communication
Client configuration
  Unix
    Configuration files
      bp.conf (Master must be listed first!)
        SERVER = backup01-dal
        SERVER = backup02-dal
        SERVER = backup03-dal
        SERVER = backup0N-dal
        CLIENT_NAME = jumpstart01-dal
      exclude_list and include_list
        exclude_list.policyName.scheduleName
        include_list.policyName.scheduleName
        Exclude/Include lists do not stack
  Windows
    Backup, Archive, Restore GUI or Registry
    Some configuration available from Admin Console > Host Properties > Clients
    Changing open file backup for Windows
Demonstration of Windows client configuration
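The bp.conf rules shown for Unix clients (SERVER entries with the Master listed first, plus a CLIENT_NAME) lend themselves to a quick sanity check. The sketch below is a hypothetical helper, not a NetBackup tool; it assumes the simple "KEY = value" layout shown on the slide and ignores any other bp.conf settings.

```python
def parse_bp_conf(text):
    """Return (key, value) pairs in file order from 'KEY = value' lines."""
    entries = []
    for raw in text.splitlines():
        if "=" not in raw:
            continue  # skip blank or unrecognised lines
        key, value = raw.split("=", 1)
        entries.append((key.strip(), value.strip()))
    return entries


def check_bp_conf(text, expected_master):
    """Return a list of problems found; an empty list means the checks passed."""
    entries = parse_bp_conf(text)
    servers = [value for key, value in entries if key == "SERVER"]
    clients = [value for key, value in entries if key == "CLIENT_NAME"]
    problems = []
    if not servers:
        problems.append("no SERVER entries")
    elif servers[0] != expected_master:
        # The Master must be the first SERVER line in the file.
        problems.append(
            f"first SERVER is {servers[0]}, expected {expected_master}"
        )
    if len(clients) != 1:
        problems.append(
            f"expected exactly one CLIENT_NAME, found {len(clients)}"
        )
    return problems
```

Calling check_bp_conf on the contents of a client's bp.conf with expected_master set to backup01-dal would return an empty list for the example configuration above, and a message naming the offending host if a Media server were listed first.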
Policies (Demonstration)
Policies - A backup policy lets the admin configure how and when backups are performed for a group of clients. The clients in the group share similar backup requirements (type, backup window, retention, etc.).
Attributes
  Policy Type
  Destination
    Classification
    Storage Unit
    Volume Pool
  Check Points
  Limit Jobs per Policy
  Job Priority
  Media Owner
  Active/Inactive
  Follow NFS
  Cross mount points
  Compression
  Encryption
  Collect DR Info
  Allow Multiple Data Streams
  Keyword Phrase
  Snapshot Client
Policies (contd)
Schedules
  Attributes Tab
    Name
    Type of Backup - Full, Incremental, Differential, Cumulative, User
    Synthetic
    Schedule Type
      Calendar Based
      Frequency Based
    Destination
      Multiple Copies
      Override Policy Storage
      Override Policy Vol Pool
      Override Media Owner
    Retention
    Media Multiplexing
  Start Window Tab
    Defines when backup can START
  Exclude Dates Tab
    Defines when backup cannot run
  Calendar Schedule
    Only available when calendar sched type chosen
    Retries allowed after runday
    Specific Days or Recurring Days
Summary of All Policies
Policies (contd)
Clients
  Know hardware/OS type
Backup Selections - what to back up
  ALL_LOCAL_DRIVES
  System_State:\ or Shadow Copy Components:\
  NEW_STREAM for multistreaming
Manual backups
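To illustrate how NEW_STREAM divides a Backup Selections list for multistreaming, here is a small Python sketch. It assumes the list begins with NEW_STREAM and that each directive starts a new stream containing the entries that follow it; the function name split_streams is hypothetical, and other cases (such as automatically generated streams) are deliberately left out.

```python
def split_streams(selections):
    """Group backup selections into streams at each NEW_STREAM directive."""
    entries = [s.strip() for s in selections if s.strip()]
    if not entries or entries[0] != "NEW_STREAM":
        # This sketch only covers lists that start with the directive.
        raise ValueError("selection list must begin with NEW_STREAM")
    streams = []
    for entry in entries:
        if entry == "NEW_STREAM":
            streams.append([])  # open a new stream
        else:
            streams[-1].append(entry)
    # Drop streams left empty by back-to-back directives.
    return [stream for stream in streams if stream]
```

For example, the list NEW_STREAM, /home, NEW_STREAM, /var, /opt yields two streams: one for /home and one covering /var and /opt. Each stream runs as its own job, which is one way to break up a long-running backup.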
Troubleshooting
MSS Document
When in doubt, ASK!
Windows client Troubleshooting
Windows Clients
  Over 3000 servers across all environments
  77% of all servers
  85% of all failures
Error Codes
  Media related (8x)
  Network Communication related (4x)
  Configuration/Hardware related (5x)
Most Common Codes: 41, 196, 5x, 219, 13, 14, 2x
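As a triage aid, the code families above (8x, 4x, 5x) and the 0 / 1 / greater-than-1 exit-code convention can be combined into a small lookup. This is a sketch for narrowing effort only; the function name describe_status and the wording of the returned strings are hypothetical, and codes outside the three listed families are simply reported as "other".

```python
# Families named on the slide, keyed by the tens digit of a two-digit code.
FAMILIES = {
    4: "network communication",
    5: "configuration/hardware",
    8: "media",
}


def describe_status(code):
    """Return (outcome, family) for a job exit code."""
    if code < 0:
        raise ValueError("exit codes are non-negative")
    if code == 0:
        return ("successful, no files missed", None)
    if code == 1:
        return ("successful, some files missed", None)
    # Only two-digit codes fall into the 4x / 5x / 8x families.
    family = FAMILIES.get(code // 10, "other") if 10 <= code <= 99 else "other"
    return ("failed", family)
```

Under these assumptions, 41 is reported as a network communication failure and 84 as a media failure, while 196 and 219 fall through to "other" and need to be looked up individually.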
Check the Simple Stuff
Is Server On and Cabled?
  Decommissioned
  Maintenance
Hosts Files or DNS correct?
  Host
  All backup servers
  All backup interfaces on backup servers
Network Functional?
  Routing
Library/Media Problem?
Server Hardware?
Windows Event Log
Correlation
Telnet
  To Master/Media from Client
  To Client from Master/Media
  telnet <hostname> bpcd (or 13782)
  telnet <hostname> vnetd (or 13724)
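Where telnet is not installed, the same reachability test against the bpcd (13782) and vnetd (13724) ports can be approximated with a short Python script. This is a sketch using only the standard library; the function names and the five-second timeout are arbitrary choices, and a successful TCP connect shows only that the port is reachable, not that the daemon is healthy.

```python
import socket

# Port numbers as given on the slide.
NETBACKUP_PORTS = {"bpcd": 13782, "vnetd": 13724}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_client(host, ports=NETBACKUP_PORTS, timeout=5.0):
    """Return a {service name: reachable} mapping for one host."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Run check_client with the client's hostname from the Master or Media server, and with the Master or Media hostname from the client, to cover both directions as the checklist suggests.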
Check the Simple Stuff (contd)
BPCLNTCMD Command Options
  -sv - returns version of Master
    5.1
  -pn - communicates back to Master
    expecting response from server backup01-dal
    backup03-dal backup03-dal 10.229.133.233 56618
  -self - returns info about local system
    gethostname() returned: backup03-dal
    host backup03-dal: backup03-dal at 10.229.133.233 (0xae585e9)
    checkhname: aliases:
  -hn <hostname> - returns info resolved from hostname
    host backup01-dal: backup01-dal at 10.229.133.229 (0xae585e5)
    checkhname: aliases:
  -ip <IP address> - returns info resolved from IP
    checkhaddr: host : backup01-dal: backup01-dal at 10.229.133.229 (0xae585e5)
    checkhaddr: aliases:
  -server <Master> - see -hn option
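The -hn and -ip options above check name resolution in each direction. A rough stand-in for that round trip, useful on a host without NetBackup installed, is sketched below in Python. It does not reproduce bpclntcmd; it assumes the system resolver (hosts file or DNS) is what matters, compares short host names only, and the function name check_lookups is hypothetical.

```python
import socket


def short_name(name):
    """Lower-case a host name and drop any domain suffix."""
    return name.lower().split(".")[0]


def check_lookups(hostname,
                  forward=socket.gethostbyname,
                  reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name and report whether they agree."""
    try:
        address = forward(hostname)
        primary, aliases, _ = reverse(address)
    except OSError as exc:
        # gaierror and herror are both subclasses of OSError.
        return {"hostname": hostname, "ok": False, "error": str(exc)}
    names = {short_name(primary)} | {short_name(alias) for alias in aliases}
    return {
        "hostname": hostname,
        "address": address,
        "reverse_name": primary,
        "ok": short_name(hostname) in names,
    }
```

The forward and reverse parameters default to the standard resolver calls and exist mainly so the logic can be exercised without a live DNS. Running check_lookups for the Master, each Media server, and the client on every involved host gives a quick picture of where forward and reverse lookups disagree.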
In Depth Client Troubleshooting
Turn up logging on client
  Host properties or client BAR GUI
  Must have <install>\netbackup\logs\* dirs created
Client Logs and Directories:
  bpbkar\<date>.log - Backup/Archive process (BPBKAR32)
  bpcd\<date>.log - Client Daemon (BPCDW32)
  tar\<date>.log - Restores (TAR32)
In Depth Client Troubleshooting (contd)
Run test backup/restore
Examine logs after failure
Logs structured as such:
  00:00:03.125 [3652] <2> bpcd exit_bpcd: exit status 0 ------>exiting
  09:55:33.941 [6092] <16> bpfsmap: ERR - open_snapdisk: NBU snapshot failed
Search for <#> entries:
  <2>, <4>, <8>, <16>, <32>: <2> = informational and <32> = critical failure
Search error message on Google and Symantec
Test recommended solution
Lather, rinse, repeat
Last resort/time sensitive - open case with Symantec
  (800) 342-0652
  Customer Number 3680-5196-9875
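Searching the client logs for the higher <#> severities can be scripted. The Python sketch below is based only on the log line layout shown in this deck (time, bracketed process id, severity in angle brackets, then the message, with an optional "PM:" style suffix after the time); other log formats may not match, and the names LINE_RE and find_problems are hypothetical.

```python
import re

# Layout taken from the sample log lines in this deck.
LINE_RE = re.compile(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2}\.\d{3})(?:\s+[AP]M:)?\s+"
    r"\[(?P<pid>[\d.]+)\]\s+<(?P<severity>\d+)>\s+(?P<message>.*)$"
)


def find_problems(lines, min_severity=4):
    """Return (severity, time, message) for entries at or above min_severity."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a recognised log entry
        severity = int(match.group("severity"))
        if severity >= min_severity:
            hits.append((severity, match.group("time"), match.group("message")))
    return hits
```

Passing the lines of a bpbkar or bpcd log to find_problems skips the <2> informational entries and leaves the <4> through <32> entries, whose messages are the ones worth searching for.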
Example Log Error 41
5:20:55.454 PM: [1656.2600] <16> dtcp_write: TCP - failure: send socket (904) (TCP 10053: Software caused connection abort)
5:20:55.454 PM: [1656.2600] <16> dtcp_write: TCP - failure: attempted to send 6 bytes
5:20:55.486 PM: [1656.2600] <16> dtcp_write: TCP - failure: send socket (904) (TCP 10053: Software caused connection abort)
The connection is being reset internally to the host. Recommendation is to reload the NIC driver or replace the NIC.
Error 41 can also produce TCP 10054 errors in the logs, but this is an external closing of the connection. These can be caused by loss of network connectivity, crashes, or reboots.
Error 41 has also been the result of corrupted VSS. Check the Event Log for any related error messages and consult with Systems Engineers, if necessary.
Windows Client Troubleshooting Checklist
Narrow your effort based on error code
Check the simple stuff:
  Is server cabled, decommed, under maint.?
  Verify hosts file(s) or DNS on all involved servers
  Network functional? Verify routing
  Library or Media problem?
  Server hardware problem?
  Check Windows event log
  Correlate any issues
Run BPCLNTCMD on all involved servers using each option:
  -sv
  -pn
  -self
  -hn <hostname>
  -ip <ip address>
  -server <name of Master>
Maximize logging values for client
Verify log dirs created in <install>\netbackup\logs\*
  bpbkar
  bpcd
  tar
Start backup/restore
Review logs searching for errors (look for <4> <8> <16> <32>)
Search error message on Google and Symantec sites
Test solution
Repeat until resolved
Open case with Symantec
  (800) 342-0652
  Cust. #: 3680-5196-9875
Reporting
NetBackup Reports
Aptare
  In depth historical reporting and trending
  Supports several backup products, incl. TSM
  Command Center Dashboard
  Job Reports
    The Dot Report - Don't agitate the Dots
    Billing - yes, we can be a profit center IF we are successful
  Media Reports
Keeping Tabs on the Infrastructure
Use Aptare
Check for down drives/stuck tapes regularly
Verify Drive Configuration
Scratch
Destaging
Balance Jobs
Tape Injects/Ejects
How To Shutdown/Restart NetBackup
Shutdown
  Suspend/Cancel jobs
  Stop Aptare
  netbackup stop
  bpps -a to see what's running
  kill -9 <pid> to kill hung processes
  Optionally rename startup script
  Use init 6 to restart server if processes will not die
  Ensure drives are empty
    robtest
    ACSLS server
Startup
  netbackup start
  Resume/Restart all jobs
  Start Aptare
  Verify environment functions
Management Tips and Tricks
Use Activity Monitor, Restore, Policies, Device Monitor, Clients Properties most often
Policies
  Use Summary of All Policies
Sorting/Filtering
  Sort by State - long running jobs?
Export to Excel
  Selected rows or all rows
Column Fields
  Move, Hide, Show
Built-in NetBackup Reports
Help
Use multiple windows
Break up long running jobs
  Multiple streams per policy
  Multiple policies
  Watch jobs per policy and client settings
Don't forget about Aptare!
It isn't always clear: look at it, correlate it, think about it
Education and Further Reading
Google
Symantec
  Detailed PDFs on EC troubleshooting
  Manuals/Troubleshooting Guide
  Technotes
NetBackup Mailing List/Forums
  List: http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
  Forums
    Backup Central (mirrors the mail lists): http://www.backupcentral.com/phpBB2/
    Symantec: https://forums.symantec.com/syment/board?board.id=21
    Tek-Tips: http://www.tek-tips.com/threadminder.cfm?pid=776
Questions and Answers
Altered lyrics to the tune of the Beatles' "Yesterday"

Yesterday,
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.

Suddenly,
There's not half the files there used to be.
And there's a milestone hanging over me.
The system crashed, so suddenly.

I pushed something wrong,
What it was, I could not say.
Now all my data's gone,
And I long for yesterday-ay-ay-ay.

Yesterday,
The need for back-ups seemed so far away.
I knew my data was all here to stay,
Now I believe in yesterday.
Thanks for attending!