
Sqoop User Guide (v1.4.2)
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.
See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.
The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

1. Introduction

Sqoop is a tool designed to transfer data between Hadoop and
relational databases. You can use Sqoop to import data from a
relational database management system (RDBMS) such as MySQL or Oracle
into the Hadoop Distributed File System (HDFS),
transform the data in Hadoop MapReduce, and then export the data back
into an RDBMS.

Sqoop automates most of this process, relying on the database to
describe the schema for the data to be imported. Sqoop uses MapReduce
to import and export the data, which provides parallel operation as
well as fault tolerance.

This document describes how to get started using Sqoop to move data
between databases and Hadoop and provides reference information for
the operation of the Sqoop command-line tool suite. This document is
intended for:
System and application programmers
System administrators
Database administrators
Data analysts
Data engineers
2. Supported Releases

This documentation applies to Sqoop v1.4.2.

3. Sqoop Releases

Sqoop is an open source software product of the Apache Software Foundation.
Software development for Sqoop occurs at http://sqoop.apache.org/.
At that site you can obtain:
New releases of Sqoop as well as its most recent source code
An issue tracker
A wiki that contains Sqoop documentation
Sqoop is compatible with Apache Hadoop 0.21 and Cloudera’s
Distribution of Hadoop version 3.

4. Prerequisites

The following prerequisite knowledge is required for this product:
Basic computer technology and terminology
Familiarity with command-line interfaces such as bash
Relational database management systems
Basic familiarity with the purpose and operation of Hadoop
Before you can use Sqoop, a release of Hadoop must be installed and
configured. We recommend that you download Cloudera’s Distribution
for Hadoop (CDH3) from the Cloudera Software Archive at
http://archive.cloudera.com for straightforward installation of Hadoop
on Linux systems.

This document assumes you are using a Linux or Linux-like environment.
If you are using Windows, you may be able to use cygwin to accomplish
most of the following tasks. If you are using Mac OS X, you should see
few (if any) compatibility errors. Sqoop is predominantly operated and
tested on Linux.

5. Basic Usage

With Sqoop, you can import data from a relational database system into
HDFS. The input to the import process is a database table. Sqoop
will read the table row-by-row into HDFS. The output of this import
process is a set of files containing a copy of the imported table.
The import process is performed in parallel. For this reason, the
output will be in multiple files. These files may be delimited text
files (for example, with commas or tabs separating each field), or
binary Avro or SequenceFiles containing serialized record data.

A by-product of the import process is a generated Java class which
can encapsulate one row of the imported table. This class is used
during the import process by Sqoop itself. The Java source code for
this class is also provided to you, for use in subsequent MapReduce
processing of the data. This class can serialize and deserialize data
to and from the SequenceFile format. It can also parse the
delimited-text form of a record. These abilities allow you to quickly
develop MapReduce applications that use the HDFS-stored records in
your processing pipeline. You are also free to parse the delimited
record data yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce
or Hive) you may have a result data set which you can then export
back to the relational database. Sqoop’s export process will read
a set of delimited text files from HDFS in parallel, parse them into
records, and insert them as new rows in a target database table, for
consumption by external applications or users.

Sqoop includes some other commands which allow you to inspect the
database you are working with. For example, you can list the available
database schemas (with the sqoop-list-databases tool) and tables
within a schema (with the sqoop-list-tables tool). Sqoop also
includes a primitive SQL execution shell (the sqoop-eval tool).

Most aspects of the import, code generation, and export processes can
be customized. You can control the specific row range or columns imported.
You can specify particular delimiters and escape characters for the
file-based representation of the data, as well as the file format used.
You can also control the class or package names used in
generated code. Subsequent sections of this document explain how to
specify these and other arguments to Sqoop.

6. Sqoop Tools

Sqoop is a collection of related tools. To use Sqoop, you specify the
tool you want to use and the arguments that control the tool.

If Sqoop is compiled from its own source, you can run Sqoop without a formal
installation process by running the bin/sqoop program. Users
of a packaged deployment of Sqoop (such as an RPM shipped with Cloudera’s
Distribution for Hadoop) will see this program installed as /usr/bin/sqoop.
The remainder of this documentation will refer to this program as
sqoop. For example:

$ sqoop tool-name [tool-arguments]

Note: The examples in this document that begin with a $ character indicate
that the commands must be entered at a terminal prompt (such as
bash). The $ character represents the prompt itself; you should
not start these commands by typing a $. You can also enter commands
inline in the text of a paragraph; for example, sqoop help. These
examples do not show a $ prefix, but you should enter them the same
way. Don’t confuse the $ shell prompt in the examples with the $
that precedes an environment variable name. For example, the string
literal $HADOOP_HOME includes a "$".

Sqoop ships with a help tool. To display a list of all available
tools, type the following command:

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

You can display help for a specific tool by entering: sqoop help
(tool-name); for example, sqoop help import.

You can also add the --help argument to any command: sqoop import
--help.

6.1. Using Command Aliases

In addition to typing the sqoop (toolname) syntax, you can use alias
scripts that specify the sqoop-(toolname) syntax. For example, the
scripts sqoop-import, sqoop-export, etc. each select a specific
tool.

6.2. Controlling the Hadoop Installation

You invoke Sqoop through the program launch capability provided by
Hadoop. The sqoop command-line program is a wrapper which runs the
bin/hadoop script shipped with Hadoop. If you have multiple
installations of Hadoop present on your machine, you can select the
Hadoop installation by setting the $HADOOP_HOME environment
variable.

For example:

$ HADOOP_HOME=/path/to/some/hadoop sqoop import --arguments...

or:

$ export HADOOP_HOME=/some/path/to/hadoop
$ sqoop import --arguments...

If $HADOOP_HOME is not set, Sqoop will use the default installation
location for Cloudera’s Distribution for Hadoop, /usr/lib/hadoop.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/,
unless the $HADOOP_CONF_DIR environment variable is set.

6.3. Using Generic and Specific Arguments

To control the operation of each Sqoop tool, you use generic and
specific arguments.

For example:

$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                 Specify JDBC connect string
   --connection-manager <class-name>    Specify connection manager class to use
   --driver <class-name>                Manually specify JDBC driver class to use
   --hadoop-home <dir>                  Override $HADOOP_HOME
   --help                               Print usage instructions
-P                                      Read password from console
   --password <password>                Set authentication password
   --username <username>                Set authentication username
   --verbose                            Print more information while working

Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are
-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|jobtracker:port>                   specify a job tracker
-files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

You must supply the generic arguments -conf, -D, and so on after the
tool name but before any tool-specific arguments (such as
--connect). Note that generic Hadoop arguments are preceded by a
single dash character (-), whereas tool-specific arguments start
with two dashes (--), unless they are single character arguments such as -P.

The -conf, -D, -fs and -jt arguments control the configuration
and Hadoop server settings. For example, -D mapred.job.name=<job_name> can
be used to set the name of the MapReduce job that Sqoop launches; if not
specified, the name defaults to the jar name for the job, which is derived
from the table name used.

The -files, -libjars, and -archives arguments are not typically used with
Sqoop, but they are included as part of Hadoop’s internal argument-parsing
system.
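To make the ordering concrete, here is an illustrative sketch (the job name, connect string, and table are placeholder values, not taken from this guide):

$ sqoop import -D mapred.job.name=nightly-import \
    --connect jdbc:mysql://db.example.com/corp --table WIDGETS

The -D option sits between the tool name and the tool-specific --connect and --table arguments, as required.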
6.4. Using Options Files to Pass Arguments

When using Sqoop, the command line options that do not change from
invocation to invocation can be put in an options file for convenience.
An options file is a text file where each line identifies an option in
the order that it appears otherwise on the command line. Option files
allow specifying a single option on multiple lines by using the
back-slash character at the end of intermediate lines. Also supported
are comments within option files that begin with the hash character.
Comments must be specified on a new line and may not be mixed with
option text. All comments and empty lines are ignored when option
files are expanded. Unless options appear as quoted strings, any
leading or trailing spaces are ignored. Quoted strings if used must
not extend beyond the line on which they are specified.

Option files can be specified anywhere in the command line as long as
the options within them follow the otherwise prescribed rules of
options ordering. For instance, regardless of where the options are
loaded from, they must follow the ordering such that generic options
appear first, tool specific options next, finally followed by options
that are intended to be passed to child programs.

To specify an options file, simply create an options file in a
convenient location and pass it to the command line via the
--options-file argument.

Whenever an options file is specified, it is expanded on the
command line before the tool is invoked. You can specify more than
one options file within the same invocation if needed.

For example, the following Sqoop invocation for import can
be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

$ sqoop --options-file /users/homer/work/import.txt --table TEST

where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo

The options file can have empty lines and comments for readability purposes.
So the above example would work exactly the same if the options file
/users/homer/work/import.txt contained the following:

#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

# Remaining options should be specified in the command line.
#

6.5. Using Tools

The following sections will describe each tool’s operation. The
tools are listed in the most likely order you will find them useful.

7. sqoop-import

7.1. Purpose

The import tool imports an individual table from an RDBMS to HDFS.
Each row from a table is represented as a separate record in HDFS.
Records can be stored as text files (one record per line), or in
binary representation as Avro or SequenceFiles.

7.2. Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

While the Hadoop generic arguments must precede any import arguments,
you can type the import arguments in any order with respect to one
another.

Note: In this document, arguments are grouped into collections
organized by function. Some collections are present in several tools
(for example, the "common" arguments). An extended description of their
functionality is given only on the first presentation in this
document.

Table 1. Common arguments

Argument                               Description
--connect <jdbc-uri>                   Specify JDBC connect string
--connection-manager <class-name>      Specify connection manager class to use
--driver <class-name>                  Manually specify JDBC driver class to use
--hadoop-home <dir>                    Override $HADOOP_HOME
--help                                 Print usage instructions
-P                                     Read password from console
--password <password>                  Set authentication password
--username <username>                  Set authentication username
--verbose                              Print more information while working
--connection-param-file <filename>     Optional properties file that provides connection parameters
7.2.1. Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do
so, you must specify a connect string that describes how to connect to the
database. The connect string is similar to a URL, and is communicated to
Sqoop with the --connect argument. This describes the server and
database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the
host database.example.com. It’s important that you do not use the URL
localhost if you intend to use Sqoop with a distributed Hadoop
cluster. The connect string you supply will be used on TaskTracker nodes
throughout your MapReduce cluster; if you specify the
literal name localhost, each node will connect to a different
database (or more likely, no database at all). Instead, you should use
the full hostname or IP address of the database host that can be seen
by all your remote nodes.

You might need to authenticate against the database before you can
access it. You can use the --username and --password or -P parameters
to supply a username and a password to the database. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345

Warning: The --password parameter is insecure, as other users may
be able to read your password from the command-line arguments via
the output of programs such as ps. The -P argument will read
a password from a console prompt, and is the preferred method of
entering credentials. Credentials may still be transferred between
nodes of the MapReduce cluster using insecure means.

Sqoop automatically supports several databases, including MySQL. Connect
strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A
full list of databases with built-in support is provided in the "Supported
Databases" section. For some, you may need to install the JDBC driver
yourself.)

You can use Sqoop with any other
JDBC-compliant database. First, download the appropriate JDBC
driver for the type of database you want to import, and install the .jar
file in the $SQOOP_HOME/lib directory on your client machine. (This will
be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.)
Each driver .jar file also has a specific driver class which defines
the entry-point to the driver. For example, MySQL’s Connector/J library has
a driver class of com.mysql.jdbc.Driver. Refer to your database
vendor-specific documentation to determine the main driver class.
This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver
and install it in your Sqoop lib path.

Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

When connecting to a database using JDBC, you can optionally specify extra
JDBC parameters via a property file using the option
--connection-param-file. The contents of this file are parsed as standard
Java properties and passed into the driver while creating a connection.

Note: The parameters specified via the optional property file are only
applicable to JDBC connections. Any fastpath connectors that use connections
other than JDBC will ignore these parameters.
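As a hedged illustration only (the file name and the properties shown are hypothetical; consult your JDBC driver’s documentation for the parameters it actually accepts), a property file might be used like this:

$ cat mysql-params.properties
characterEncoding=UTF-8
autoReconnect=true

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --connection-param-file mysql-params.properties ...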
Table 2. Import control arguments:

Argument                            Description
--append                            Append data to an existing dataset
--as-avrodatafile                   Imports data to Avro Data Files
--as-sequencefile                   Imports data to SequenceFiles
--as-textfile                       Imports data as plain text (default)
--boundary-query <statement>        Boundary query to use for creating splits
--columns <col,col,col…>            Columns to import from table
--direct                            Use direct import fast path
--direct-split-size <n>             Split the input stream every n bytes when importing in direct mode
--inline-lob-limit <n>              Set the maximum size for an inline LOB
-m,--num-mappers <n>                Use n map tasks to import in parallel
-e,--query <statement>              Import the results of statement.
--split-by <column-name>            Column of the table used to split work units
--table <table-name>                Table to read
--target-dir <dir>                  HDFS destination dir
--warehouse-dir <dir>               HDFS parent for table destination
--where <where clause>              WHERE clause to use during import
-z,--compress                       Enable compression
--compression-codec <c>             Use Hadoop codec (default gzip)
--null-string <null-string>         The string to be written for a null value for string columns
--null-non-string <null-string>     The string to be written for a null value for non-string columns
The --null-string and --null-non-string arguments are optional.
If not specified, then the string "null" will be used.
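For instance, a sketch of overriding both (the \N value is one common choice, shown here as an assumption rather than a requirement):

$ sqoop import ... --null-string '\\N' --null-non-string '\\N'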
7.2.2. Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the
--table argument to select the table to import. For example, --table
employees. This argument can also identify a VIEW or other table-like
entity in a database.

By default, all columns within a table are selected for import.
Imported data is written to HDFS in its "natural order;" that is, a
table containing columns A, B, and C results in an import of data such
as:

A1,B1,C1
A2,B2,C2
...

You can select a subset of columns and control their ordering by using
the --columns argument. This should include a comma-delimited list
of columns to import. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause
to the import statement. By default, Sqoop generates statements of the
form SELECT <column list> FROM <table name>. You can append a
WHERE clause to this with the --where argument. For example: --where
"id > 400". Only rows where the id column has a value greater than
400 will be imported.

By default Sqoop will use the query select min(<split-by>), max(<split-by>) from
<table name> to find out boundaries for creating splits. In some cases this query
is not the most optimal, so you can specify any arbitrary query returning two
numeric columns using the --boundary-query argument.
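A hypothetical sketch (the table and column names are invented for illustration) of supplying such a query:

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --split-by employee_id \
    --boundary-query "SELECT MIN(employee_id), MAX(employee_id) FROM EMPLOYEES"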
7.2.3. Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of
using the --table, --columns and --where arguments, you can specify
a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory
with --target-dir.

If you want to import the results of a query in parallel, then each map task
will need to execute a copy of the query, with results partitioned by bounding
conditions inferred by Sqoop. Your query must include the token $CONDITIONS
which each Sqoop process will replace with a unique condition expression.
You must also select a splitting column with --split-by.

For example:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by
specifying a single map task with -m 1:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults

Note: If you are issuing the query wrapped with double quotes ("),
you will have to use \$CONDITIONS instead of just $CONDITIONS
to disallow your shell from treating it as a shell variable.
For example, a double quoted query may look like:
"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"NoteThe facility of using free-form query in the current version of Sqoop
is limited to simple queries where there are no ambiguous projections and
no OR conditions in the WHERE clause. Use of complex queries such as
queries that have sub-queries or joins leading to ambiguous projections can
lead to unexpected results.

7.2.4. Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can
specify the number
of map tasks (parallel processes) to use to perform the import by
using the -m or --num-mappers argument. Each of these arguments
takes an integer value which corresponds to the degree of parallelism
to employ. By default, four tasks are used. Some databases may see
improved performance by increasing this value to 8 or 16. Do not
increase the degree of parallelism greater than that available within
your MapReduce cluster; tasks will run serially and will likely
increase the amount of time required to perform the import. Likewise,
do not increase the degree of parallelism higher than that which your
database can reasonably support. Connecting 100 concurrent clients to
your database may increase the load on the database server to a point
where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it
can split the workload. Sqoop uses a splitting column to split the
workload. By default, Sqoop will identify the primary key column (if
present) in a table and use it as the splitting column. The low and
high values for the splitting column are retrieved from the database,
and the map tasks operate on evenly-sized components of the total
range. For example, if you had a table with a primary key column of
id whose minimum value was 0 and maximum value was 1000, and Sqoop
was directed to use 4 tasks, Sqoop would run four processes which each
execute SQL statements of the form SELECT * FROM sometable WHERE id
>= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500),
(500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed
across its range, then this can result in unbalanced tasks. You should
explicitly choose a different column with the --split-by argument.
For example, --split-by employee_id. Sqoop cannot currently split on
multi-column indices. If your table has no index column, or has a
multi-column key, then you must also manually choose a splitting
column.
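As a brief, hypothetical sketch combining these controls (the connect string, table, and column are placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --split-by employee_id -m 8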
7.2.5. Controlling the Import Process

By default, the import process will use JDBC which provides a
reasonable cross-vendor import channel. Some databases can perform
imports in a more high-performance fashion by using database-specific
data movement tools. For example, MySQL provides the mysqldump tool
which can export data from MySQL to other systems very quickly. By
supplying the --direct argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC. Currently, direct mode does not
support imports of large object columns.

When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the --direct-split-size argument.

By default, Sqoop will import a table named foo to a directory named
foo inside your home directory in HDFS. For example, if your
username is someuser, then the import tool will write to
/user/someuser/foo/(files). You can adjust the parent directory of
the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...

This will import the files into the /dest directory. --target-dir is
incompatible with --warehouse-dir.

When using direct mode, you can specify additional arguments which
should be passed to the underlying tool. If the argument
-- is given on the command-line, then subsequent arguments are sent
directly to the underlying tool. For example, the following adjusts
the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql:///db --table bar \
    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory
already exists in HDFS, Sqoop will refuse to import and overwrite that
directory’s contents. If you use the --append argument, Sqoop will import
data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.

Note: When using the direct mode of import, certain database client utilities
are expected to be present in the shell path of the task process. For MySQL
the utilities mysqldump and mysqlimport are required, whereas for
PostgreSQL the utility psql is required.

7.2.6. Controlling type mapping

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive
representatives. However, the default mapping might not be suitable for
everyone and might be overridden by --map-column-java (for changing
mapping to Java) or --map-column-hive (for changing Hive mapping).

Table 3. Parameters for overriding mapping

Argument                       Description
--map-column-java <mapping>    Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping>    Override mapping from SQL to Hive type for configured columns.

Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will raise an exception if a configured mapping is not used.

7.2.7. Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve
only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 4. Incremental import arguments:

Argument               Description
--check-column (col)   Specifies the column to be examined when determining which rows to import.
--incremental (mode)   Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)   Specifies the maximum value of the check column from the previous import.
Sqoop supports two types of incremental imports: append and lastmodified.
You can use the --incremental argument to specify the type of incremental
import to perform.

You should specify append mode when importing a table where new rows are
continually being added with increasing row id values. You specify the column
containing the row’s id with --check-column. Sqoop imports rows where the
check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and
each such update will set the value of a last-modified column to the current
timestamp.
Rows where the check column holds a timestamp more recent than the
timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as
--last-value for a subsequent import is printed to the screen. When running
a subsequent import, you should specify --last-value in this way to ensure
you import only the new or updated data. This is handled automatically by
creating an incremental import as a saved job, which is the preferred
mechanism for performing a recurring incremental import. See the section on
saved jobs later in this document for more information.
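As an illustrative sketch (the connect string, table, check column, and last value are placeholders), an append-mode incremental import might be invoked as:

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --incremental append \
    --check-column id --last-value 100000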
7.2.8. File Formats

You can import data in one of two file formats: delimited text or
SequenceFiles.

Delimited text is the default import format. You can also specify it
explicitly by using the --as-textfile argument. This argument will write
string-based representations of each record to the output files, with
delimiter characters between individual columns and rows. These
delimiters may be commas, tabs, or other characters. (The delimiters
can be selected; see "Output line formatting arguments.") The
following is the result of an example text-based import:

1,here is a message,
2,happy new year!,
3,another message,

Delimited text is appropriate for most non-binary data types. It also
readily supports further manipulation by other tools, such as Hive.

SequenceFiles are a binary format that store individual records in
custom record-specific data types. These data types are manifested as
Java classes. Sqoop will automatically generate these data types for
you. This format supports exact storage of all data in binary
representations, and is appropriate for storing binary data
(for example, VARBINARY columns), or data that will be principally
manipulated by custom MapReduce programs (reading from SequenceFiles
is higher-performance than reading from text files, as records do not
need to be parsed).

Avro data files are a compact, efficient binary format that provides
interoperability with applications written in other programming
languages.
Avro also supports versioning, so that when, e.g., columns
are added or removed from a table, previously imported data files can
be processed along with new ones.

By default, data is not compressed. You can compress your data by
using the deflate (gzip) algorithm with the -z or --compress
argument, or specify any Hadoop compression codec using the
--compression-codec argument. This applies to SequenceFile, text,
and Avro files.

7.2.9. Large Objects

Sqoop handles large objects (BLOB and CLOB columns) in particular
ways. If this data is truly large, then these columns should not be
fully materialized in memory for manipulation, as most columns are.
Instead, their data is handled in a streaming fashion. Large objects
can be stored inline with the rest of the data, in which case they are
fully materialized in memory on every access, or they can be stored in
a secondary storage file linked to the primary data storage. By
default, large objects less than 16 MB in size are stored inline with
the rest of the data. At a larger size, they are stored in files in
the _lobs subdirectory of the import target directory. These files
are stored in a separate format optimized for large record storage,
which can accommodate records of up to 2^63 bytes each. The size at
which lobs spill into separate files is controlled by the
--inline-lob-limit argument, which takes a parameter specifying the
largest lob size to keep inline, in bytes. If you set the inline LOB
limit to 0, all large objects will be placed in external
storage.
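For example, the following sketch (the connect string and table are placeholders) forces every large object into external storage by setting the inline limit to 0:

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --inline-lob-limit 0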
Table 5. Output line formatting arguments:

Argument                          Description
--enclosed-by <char>              Sets a required field enclosing character
--escaped-by <char>               Sets the escape character
--fields-terminated-by <char>     Sets the field separator character
--lines-terminated-by <char>      Sets the end-of-line character
--mysql-delimiters                Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>   Sets a field enclosing character
When importing to delimited files, the choice of delimiter is
important. Delimiters which appear inside string-based fields may
cause ambiguous parsing of the imported data by subsequent analysis
passes. For example, the string "Hello, pleased to meet you" should
not be imported with the end-of-field delimiter set to a comma.

Delimiters may be specified as:
a character (--fields-terminated-by X)
an escape character (--fields-terminated-by \t). Supported escape
characters are:
\b (backspace)
\n (newline)
\r (carriage return)
\t (tab)
\" (double-quote)
\' (single-quote)
\\ (backslash)
\0 (NUL) - This will insert NUL characters between fields or lines,
or will disable enclosing/escaping if used for one of the --enclosed-by,
--optionally-enclosed-by, or --escaped-by arguments.
The octal representation of a UTF-8 character’s code point. This
should be of the form \0ooo, where ooo is the octal value.
For example, --fields-terminated-by \001 would yield the ^A character.
The hexadecimal representation of a UTF-8 character’s code point. This
should be of the form \0xhhh, where hhh is the hex value.
For example, --fields-terminated-by \0x0D would yield the carriage
return character.
The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote
character, and no escape character. Note that this can lead to
ambiguous/unparsable records if you import database records containing
commas or newlines in the field data. For unambiguous parsing, both must
be enabled. For example, via --mysql-delimiters.

If unambiguous delimiters cannot be presented, then use enclosing and
escaping characters. The combination of (optional)
enclosing and escaping characters will allow unambiguous parsing of
lines. For example, suppose one column of a dataset contained the
following values:

Some string, with a comma.
Another "string with quotes"

The following arguments would provide delimiters which can be
unambiguously parsed:

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...

(Note that to prevent the shell from mangling the enclosing character,
we have enclosed that argument itself in single-quotes.)

The result of the above arguments applied to the above dataset would
be:

"Some string, with a comma.","1","2","3"...
"Another \"string with quotes\"","4","5","6"...Here the imported strings are shown in the context of additional
columns ("1","2","3", etc.) to demonstrate the full effect of enclosing
and escaping. The enclosing character is only strictly necessary when
delimiter characters appear in the imported text. The enclosing
character can therefore be specified as optional:

$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...

Which would result in the following import:

"Some string, with a comma.",1,2,3...
"Another \"string with quotes\"",4,5,6...

Note: Even though Hive supports escaping characters, it does not
handle escaping of new-line character. Also, it does not support
the notion of enclosing characters that may include field delimiters
in the enclosed string.
It is therefore recommended that you choose
unambiguous field and record-terminating delimiters without the help
of escaping and enclosing characters when working with Hive; this is
due to limitations of Hive’s input parsing abilities.

The --mysql-delimiters argument is a shorthand argument which uses
the default delimiters for the mysqldump program.
If you use the mysqldump delimiters in conjunction with a
direct-mode import (with --direct), very fast imports can be
achieved.

While the choice of delimiters is most important for a text-mode
import, it is still relevant if you import to SequenceFiles with
--as-sequencefile. The generated class' toString() method
will use the delimiters you specify, so subsequent formatting of
the output data will rely on the delimiters you choose.

Table 6. Input parsing arguments:

Argument                                Description
--input-enclosed-by <char>              Sets a required field encloser
--input-escaped-by <char>               Sets the input escape character
--input-fields-terminated-by <char>     Sets the input field separator
--input-lines-terminated-by <char>      Sets the input end-of-line character
--input-optionally-enclosed-by <char>   Sets a field enclosing character
When Sqoop imports data to HDFS, it generates a Java class which can
reinterpret the text files that it creates when doing a
delimited-format import. The delimiters are chosen with arguments such
as --fields-terminated-by; this controls both how the data is
written to disk, and how the generated parse() method reinterprets
this data. The delimiters used by the parse() method can be chosen
independently of the output arguments, by using
--input-fields-terminated-by, and so on. This is useful, for example, to
generate classes which can parse records created with one set of
delimiters, and emit the records to a different set of files using a
separate set of delimiters.

Table 7. Hive arguments:

Argument                     Description
--hive-home <dir>            Override $HIVE_HOME
--hive-import                Import tables into Hive (Uses Hive’s default delimiters if none are set.)
--hive-overwrite             Overwrite existing data in the Hive table.
--create-hive-table          If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>    Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replace \n, \r, and \01 from string fields with a user-defined string when importing to Hive.
--hive-partition-key         Name of the Hive field on which partitions are sharded.
--hive-partition-value <v>   String value that serves as the partition key for data imported into Hive in this job.
--map-column-hive <map>      Override default mapping from SQL type to Hive type for configured columns.
7.2.10. Importing Data Into Hive

Sqoop’s import tool’s main function is to upload your data into files
in HDFS. If you have a Hive metastore associated with your HDFS
cluster, Sqoop can also import the data into Hive by generating and
executing a CREATE TABLE statement to define the data’s layout in
Hive. Importing data into Hive is as simple as adding the
--hive-import option to your Sqoop command line.

If the Hive table already exists, you can specify the
--hive-overwrite option to indicate that the existing table in Hive must
be replaced. After your data is imported into HDFS or this step is
omitted, Sqoop will generate a Hive script containing a CREATE TABLE
operation defining your columns using Hive’s types, and a LOAD DATA INPATH
statement to move the data files into Hive’s warehouse directory.

The script will be executed by calling
the installed copy of hive on the machine where Sqoop is run. If you have
multiple Hive installations, or hive is not in your $PATH, use the
--hive-home option to identify the Hive installation directory.
Sqoop will use $HIVE_HOME/bin/hive from here.

Note: This function is incompatible with --as-avrodatafile and
--as-sequencefile.

Even though Hive supports escaping characters, it does not
handle escaping of new-line character. Also, it does not support
the notion of enclosing characters that may include field delimiters
in the enclosed string.
It is therefore recommended that you choose
unambiguous field and record-terminating delimiters without the help
of escaping and enclosing characters when working with Hive; this is
due to limitations of Hive’s input parsing abilities. If you do use
--escaped-by, --enclosed-by, or --optionally-enclosed-by when
importing data into Hive, Sqoop will print a warning message.

Hive will have problems using Sqoop-imported data if your database’s
rows contain string fields that have Hive’s default row delimiters
(\n and \r characters) or column delimiters (\01 characters)
present in them.
You can use the --hive-drop-import-delims option
to drop those characters on import to give Hive-compatible text data.
Alternatively, you can use the --hive-delims-replacement option
to replace those characters with a user-defined string on import to give
Hive-compatible text data.
These options should only be used if you use
Hive’s default delimiters and should not be used if different delimiters
are specified.

Sqoop will pass the field and record delimiters through to Hive. If you do
not set any delimiters and do use --hive-import, the field delimiter will
be set to ^A and the record delimiter will be set to \n to be consistent
with Hive’s defaults.

The table name used in Hive is, by default, the same as that of the
source table. You can control the output table name with the --hive-table
option.

Hive can put data into partitions for more efficient query
performance.
You can tell a Sqoop job to import data for Hive into a
particular partition by specifying the --hive-partition-key and
--hive-partition-value arguments.
The partition value must be a string. Please see the Hive documentation
for more details on partitioning.

You can import compressed tables into Hive using the --compress and
--compression-codec options. One downside to compressing tables imported
into Hive is that many codecs cannot be split for processing by parallel map
tasks. The lzop codec, however, does support splitting. When importing tables
with this codec, Sqoop will automatically index the files for splitting and
configure a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.
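As a hedged sketch (the connect string, table, partition key, and partition value are invented for illustration), a partitioned Hive import could look like:

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --hive-import \
    --hive-partition-key region --hive-partition-value 'EMEA'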
Table 8. HBase arguments:

Argument                      Description
--column-family <family>      Sets the target column family for the import
--hbase-create-table          If specified, create missing HBase tables
--hbase-row-key <col>         Specifies which input column to use as the row key
--hbase-table <table-name>    Specifies an HBase table to use as the target instead of HDFS
7.2.11. Importing Data Into HBase

Sqoop supports additional import targets beyond HDFS and Hive. Sqoop
can also import records into a table in HBase.

By specifying --hbase-table, you instruct Sqoop to import
to a table in HBase rather than a directory in HDFS. Sqoop will
import data to the table specified as the argument to --hbase-table.
Each row of the input table will be transformed into an HBase
Put operation to a row of the output table. The key for each row is
taken from a column of the input. By default Sqoop will use the split-by
column as the row key column. If that is not specified, it will try to
identify the primary key column, if any, of the source table. You can
manually specify the row key column with --hbase-row-key. Each output
column will be placed in the same column family, which must be specified
with --column-family.

Note: This function is incompatible with direct import (parameter
--direct).

If the target table and column family do not exist, the Sqoop job will
exit with an error. You should create the target table and column family
before running an import. If you specify --hbase-create-table, Sqoop
will create the target table and column family if they do not exist,
using the default parameters from your HBase configuration.

Sqoop currently serializes all values to HBase by converting each field
to its string representation (as if you were importing to HDFS in text
mode), and then inserts the UTF-8 bytes of this string in the target
cell.
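A hypothetical sketch of an HBase-targeted import (the connect string, HBase table name, column family, and row key column are placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/corp \
    --table EMPLOYEES --hbase-table employees --column-family info \
    --hbase-row-key employee_id --hbase-create-table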
Table 9. Code generation arguments:

Argument                 Description
--bindir <dir>           Output directory for compiled objects
--class-name <name>      Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
--map-column-java <m>    Override default mapping from SQL type to Java type for configured columns.
As mentioned earlier, a byproduct of importing a table to HDFS is a
class which can manipulate the imported data. If the data is stored in
SequenceFiles, this class will be used for the data’s serialization
container. Therefore, you should use this class in your subsequent
MapReduce processing of the data.

The class is typically named after the table; a table named foo will
generate a class named foo. You may want to override this class
name. For example, if your table is named EMPLOYEES, you may want to
specify --class-name Employee instead. Similarly, you can specify
just the package name with --package-name. The following import
generates a class named com.foocorp.SomeTable:

$ sqoop import --connect <connect-str> --table SomeTable --package-name com.foocorp

The .java source file for your class will be written to the current
working directory when you run sqoop. You can control the output
directory with --outdir. For example, --outdir src/generated/.

The import process compiles the source into .class and .jar files;
these are ordinarily stored under /tmp. You can select an alternate
target directory with --bindir. For example, --bindir /scratch.

If you already have a compiled class that can be used to perform the
import and want to suppress the code-generation aspect of the import
process, you can use an existing jar and class by
providing the --jar-file and --class-name options. For example:

$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
    --class-name SomeTableType

This command will load the SomeTableType class out of mydatatypes.jar.

7.3. Example Invocations

The following examples illustrate how to use the import tool in a variety
of situations.

A basic import of a table named EMPLOYEES in the corp database:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES

A basic import requiring a login:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --username SomeUser -P
Enter password: (hidden)

Selecting specific columns from the EMPLOYEES table:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --columns "employee_id,first_name,last_name,job_title"

Controlling the import parallelism (using 8 parallel tasks):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8

Enabling the MySQL "direct mode" fast path:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct

Storing data in SequenceFiles, and setting the generated class name to
com.foocorp.Employee:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --class-name com.foocorp.Employee --as-sequencefile

Specifying the delimiters to use in a text-mode import:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --optionally-enclosed-by '\"'

Importing the data to Hive:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import

Importing only new employees:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --where "start_date > '2010-01-01'"

Changing the splitting column from the default:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --split-by dept_id

Verifying that an import was successful:

$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x   - someuser somegrp   2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r--   1 someuser somegrp   2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r--   1 someuser somegrp   2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r--   1 someuser somegrp   2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r--   1 someuser somegrp   2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003

$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...

Performing an incremental import of new data, after having already
imported the first 100,000 rows of a table:

$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
    --where "id > 100000" --target-dir /incremental_dataset --append

8. sqoop-import-all-tables

8.1. Purpose

The import-all-tables tool imports a set of tables from an RDBMS to HDFS.
Data from each table is stored in a separate directory in HDFS.

For the import-all-tables tool to be useful, the following conditions
must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use a non-default splitting column, nor impose
any conditions via a WHERE clause.
8.2. Syntax

$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)

Although the Hadoop generic arguments must precede any import arguments,
the import arguments can be entered in any order with respect to one
another.

Table 10. Common arguments

Argument                               Description
--connect <jdbc-uri>                   Specify JDBC connect string
--connection-manager <class-name>      Specify connection manager class to use
--driver <class-name>                  Manually specify JDBC driver class to use
--hadoop-home <dir>                    Override $HADOOP_HOME
--help                                 Print usage instructions
-P                                     Read password from console
--password <password>                  Set authentication password
--username <username>                  Set authentication username
--verbose                              Print more information while working
--connection-param-file <filename>     Optional properties file that provides connection parameters
Table 11. Import control arguments:

Argument                     Description
--as-avrodatafile            Imports data to Avro Data Files
--as-sequencefile            Imports data to SequenceFiles
--as-textfile                Imports data as plain text (default)
--direct                     Use direct import fast path
--direct-split-size <n>      Split the input stream every n bytes when importing in direct mode
--inline-lob-limit <n>       Set the maximum size for an inline LOB
-m,--num-mappers <n>         Use n map tasks to import in parallel
--warehouse-dir <dir>        HDFS parent for table destination
-z,--compress                Enable compression
--compression-codec <c>      Use Hadoop codec (default gzip)
These arguments behave in the same manner as they do when used for the
sqoop-import tool, but the --table, --split-by, --columns,
and --where arguments are invalid for sqoop-import-all-tables.

Table 12. Output line formatting arguments:

Argument                          Description
--enclosed-by <char>              Sets a required field enclosing character
--escaped-by <char>               Sets the escape character
--fields-terminated-by <char>     Sets the field separator character
--lines-terminated-by <char>      Sets the end-of-line character
--mysql-delimiters                Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>   Sets a field enclosing character
Table 13. Input parsing arguments:

Argument                                Description
--input-enclosed-by <char>              Sets a required field encloser
--input-escaped-by <char>               Sets the input escape character
--input-fields-terminated-by <char>     Sets the input field separator
--input-lines-terminated-by <char>      Sets the input end-of-line character
--input-optionally-enclosed-by <char>   Sets a field enclosing character
Table 14. Hive arguments:

Argument                     Description
--hive-home <dir>            Override $HIVE_HOME
--hive-import                Import tables into Hive (Uses Hive’s default delimiters if none are set.)
--hive-overwrite             Overwrite existing data in the Hive table.
--create-hive-table          If set, then the job will fail if the target Hive table exists. By default this property is false.
--hive-table <table-name>    Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replace \n, \r, and \01 from string fields with a user-defined string when importing to Hive.
--hive-partition-key         Name of the Hive field on which partitions are sharded.
--hive-partition-value <v>   String value that serves as the partition key for data imported into Hive in this job.
--map-column-hive <map>      Override default mapping from SQL type to Hive type for configured columns.
Table 15. Code generation arguments:

Argument                 Description
--bindir <dir>           Output directory for compiled objects
--jar-file <file>        Disable code generation; use specified jar
--outdir <dir>           Output directory for generated code
--package-name <name>    Put auto-generated classes in this package
The import-all-tables tool does not support the --class-name argument.
You may, however, specify a package with --package-name in which all
generated classes will be placed.

8.3. Example Invocations

Import all tables from the corp database:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

Verifying that it worked:

$ hadoop fs -ls
Found 4 items
drwxr-xr-x   - someuser somegrp   17:15 /user/someuser/EMPLOYEES
drwxr-xr-x   - someuser somegrp   17:15 /user/someuser/PAYCHECKS
drwxr-xr-x   - someuser somegrp   17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x   - someuser somegrp   17:15 /user/someuser/OFFICE_SUPPLIES

9. sqoop-export

9.1. Purpose

The export tool exports a set of files from HDFS back to an RDBMS.
The target table must already exist in the database. The input files
are read and parsed into a set of records according to the
user-specified delimiters.

The default operation is to transform these into a set of INSERT
statements that inject the records into the database. In "update mode,"
Sqoop will generate UPDATE statements that replace existing records
in the database.

9.2. Syntax

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Although the Hadoop generic arguments must precede any export arguments,
the export arguments can be entered in any order with respect to one
another.

Table 16. Common arguments

Argument                               Description
--connect <jdbc-uri>                   Specify JDBC connect string
--connection-manager <class-name>      Specify connection manager class to use
--driver <class-name>                  Manually specify JDBC driver class to use
--hadoop-home <dir>                    Override $HADOOP_HOME
--help                                 Print usage instructions
-P                                     Read password from console
--password <password>                  Set authentication password
--username <username>                  Set authentication username
--verbose                              Print more information while working
--connection-param-file <filename>     Optional properties file that provides connection parameters
Table 17. Export control arguments:

Argument                                 Description
--direct                                 Use direct export fast path
--export-dir <dir>                       HDFS source path for the export
-m,--num-mappers <n>                     Use n map tasks to export in parallel
--table <table-name>                     Table to populate
--update-key <col-name>                  Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
--update-mode <mode>                     Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>        The string to be interpreted as null for string columns
--input-null-non-string <null-string>    The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>     The table in which data will be staged before being inserted into the destination table.
--clear-staging-table                    Indicates that any data present in the staging table can be deleted.
--batch                                  Use batch mode for underlying statement execution.
The --table and --export-dir arguments are required. These
specify the table to populate in the database, and the
directory in HDFS that contains the source data.
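A minimal sketch of an export (the connect string, table, and HDFS path are placeholders; the bar table must already exist in the database):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data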
You can control the number of mappers independently from the number of
files present in the directory. Export performance depends on the
degree of parallelism. By default, Sqoop will use four tasks in
parallel for the export process. This is not always optimal; you will
need to experiment with your own particular setup. Additional tasks
may offer better concurrency, but if the database is already
bottlenecked on updating indices, invoking triggers, and so on, then
additional load may decrease performance. The --num-mappers or -m
arguments control the number of map tasks, which is the degree of
parallelism used.

MySQL provides a direct mode for exports as well, using the
mysqlimport tool. When exporting to MySQL, use the --direct argument
to specify this codepath. This may be
higher-performance than the standard JDBC codepath.

Note: When using export in direct mode with MySQL, the MySQL bulk utility
mysqlimport must be available in the shell path of the task process.

The --input-null-string and --input-null-non-string arguments are
optional. If --input-null-string is not specified, then the string
"null" will be interpreted as null for string-type columns.
If --input-null-non-string is not specified, then both the string
"null" and the empty string will be interpreted as null for non-string
columns. Note that the empty string will always be interpreted as null
for non-string columns, in addition to any other string specified by
--input-null-non-string.

Since Sqoop breaks down the export process into multiple transactions, it
Since Sqoop breaks the export process down into multiple transactions, it
is possible that a failed export job may result in partial data being
committed to the database. This can further lead to subsequent jobs
failing due to insert collisions in some cases, or lead to duplicated data
in others. You can overcome this problem by specifying a staging table via
the --staging-table option, which acts as an auxiliary table that is used
to stage exported data. The staged data is finally moved to the destination
table in a single transaction.

In order to use the staging facility, you must create the staging table
prior to running the export job. This table must be structurally
identical to the target table. This table should either be empty before
the export job runs, or the --clear-staging-table option must be specified.
If the staging table contains data and the --clear-staging-table option is
specified, Sqoop will delete all of the data before starting the export job.

Note: Support for staging data prior to pushing it into the destination
table is not available for --direct exports. It is also not available when
export is invoked using the --update-key option for updating existing data.
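As a sketch, assuming a staging table named bar_stage has already been
created with the same column definitions as the bar table, a staged export
might look like this (the staging table name and connection details are
illustrative):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --staging-table bar_stage --clear-staging-table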
By default, sqoop-export appends new rows to a table; each input
record is transformed into an INSERT statement that adds a row to the
target database table. If your table has constraints (e.g., a primary
key column whose values must be unique) and already contains data, you
must take care to avoid inserting records that violate these
constraints. The export process will fail if an INSERT statement
fails. This mode is primarily intended for exporting records to a new,
empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify
an existing dataset in the database. Each input record is treated as
an UPDATE statement that modifies an existing row. The row a
statement modifies is determined by the column name(s) specified with
--update-key. For example, consider the following table definition:

CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100
...

Running sqoop-export --table foo --update-key id --export-dir
/path/to/data --connect … will run an export job that executes SQL
statements based on the data like so:

UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...

If an UPDATE statement modifies no rows, this is not considered an
error; the export will silently continue. (In effect, this means that
an update-based export will not insert new rows into the database.)
Likewise, if the column specified with --update-key does not
uniquely identify rows and multiple rows are updated by a single
statement, this condition is also undetected.

The argument --update-key can also be given a comma-separated list of
column names, in which case Sqoop will match all keys from this list before
updating any existing record.

Depending on the target database, you may also specify the --update-mode
argument with allowinsert mode if you want to update rows if they exist
in the database already or insert rows if they do not exist yet.
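A sketch of an upsert-style export against the foo table above, updating
matching rows and inserting the rest (supported only where the database
connector implements allowinsert mode; the connect string is elided as in
the earlier example):

$ sqoop export --table foo --update-key id --update-mode allowinsert \
    --export-dir /path/to/data --connect …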
Table 18. Input parsing arguments:
Argument                               Description
--input-enclosed-by <char>             Sets a required field encloser
--input-escaped-by <char>              Sets the input escape character
--input-fields-terminated-by <char>    Sets the input field separator
--input-lines-terminated-by <char>     Sets the input end-of-line character
--input-optionally-enclosed-by <char>  Sets a field enclosing character
Table 19. Output line formatting arguments:
Argument                               Description
--enclosed-by <char>                   Sets a required field enclosing character
--escaped-by <char>                    Sets the escape character
--fields-terminated-by <char>          Sets the field separator character
--lines-terminated-by <char>           Sets the end-of-line character
--mysql-delimiters                     Uses MySQL's default delimiter set:
                                       fields: ,  lines: \n  escaped-by: \
                                       optionally-enclosed-by: '
--optionally-enclosed-by <char>        Sets a field enclosing character
Sqoop automatically generates code to parse and interpret records of the
files containing the data to be exported back to the database. If
these files were created with delimiters other than the defaults
(comma-separated fields with newline-separated records), you should specify
the same delimiters again so that Sqoop can parse your files.

If you specify incorrect delimiters, Sqoop will fail to find enough
columns per line. This will cause export map tasks to fail by throwing
ParseExceptions.
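For example, if the HDFS files were written with tab-separated fields, the
export might be invoked as follows (connection details illustrative):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --input-fields-terminated-by '\t' --input-lines-terminated-by '\n'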
Table 20. Code generation arguments:
Argument                               Description
--bindir <dir>                         Output directory for compiled objects
--class-name <name>                    Sets the generated class name. This
                                       overrides --package-name. When combined
                                       with --jar-file, sets the input class.
--jar-file <file>                      Disable code generation; use specified jar
--outdir <dir>                         Output directory for generated code
--package-name <name>                  Put auto-generated classes in this package
--map-column-java <m>                  Override default mapping from SQL type to
                                       Java type for configured columns.
If the records to be exported were generated as the result of a
previous import, then the original generated class can be used to read
the data back. Specifying --jar-file and --class-name obviates
the need to specify delimiters in this case.

The use of existing generated code is incompatible with
--update-key; an update-mode export requires new code generation to
perform the update. You cannot use --jar-file, and must fully specify
any non-default delimiters.
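As a sketch, if a prior import of the bar table left behind a jar named
bar.jar containing a generated class named bar, the export could reuse it
like this (the jar, class name, and connection details are illustrative):

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data \
    --jar-file bar.jar --class-name bar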
9.4. Exports and Transactions

Exports are performed by multiple writers in parallel. Each writer
uses a separate connection to the database; these have separate
transactions from one another. Sqoop uses the multi-row INSERT
syntax to insert up to 100 records per statement. Every 100
statements, the current transaction within a writer task is committed,
causing a commit every 10,000 rows. This ensures that transaction
buffers do not grow without bound and cause out-of-memory conditions.
Therefore, an export is not an atomic process. Partial results from
the export will become visible before the export is complete.

9.5. Failed Exports

Exports may fail for a number of reasons:
Loss of connectivity from the Hadoop cluster to the database (either
due to hardware fault, or server software crashes)
Attempting to INSERT a row which violates a consistency constraint
(for example, inserting a duplicate primary key value)
Attempting to parse an incomplete or malformed record from the HDFS
source data
Attempting to parse records using incorrect delimiters
Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will
cause the export job to fail. The results of a failed export are
undefined. Each export map task operates in a separate transaction.
Furthermore, individual map tasks commit their current transaction
periodically. If a task fails, the current transaction will be rolled
back. Any previously-committed transactions will remain durable in the
database, leading to a partially-complete export.

9.6. Example Invocations

A basic export to populate a table named bar:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data

This example takes the files in /results/bar_data and injects their
contents into the bar table in the foo database on db.example.com.
The target table must already exist in the database. Sqoop performs
a set of INSERT INTO operations, without regard for existing content. If
Sqoop attempts to insert rows which violate constraints in the database
(for example, a particular primary key value already exists), then the export
fails.
10. Saved Jobs

Imports and exports can be repeatedly performed by issuing the same command
multiple times. Especially when using the incremental import capability,
this is an expected scenario.

Sqoop allows you to define saved jobs which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the sqoop-job tool
describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored
in $HOME/.sqoop/. You can configure Sqoop to instead use a shared
metastore, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
sqoop-metastore tool.

11. sqoop-job

11.1. Purpose

The job tool allows you to create and work with saved jobs. Saved jobs
remember the parameters used to specify a job, so they can be
re-executed by invoking the job by its handle.

If a saved job is configured to perform an incremental import, state regarding
the most recently imported rows is updated in the saved job to allow the job
to continually import only the newest rows.
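For illustration, a saved job might be created and later re-executed with
invocations like these (the job name myjob and the import arguments are
hypothetical):

$ sqoop job --create myjob -- import --connect jdbc:mysql://db.example.com/foo \
    --table bar --incremental append --check-column id

$ sqoop job --list
$ sqoop job --exec myjob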
