About big data file shares
A big data file share is an item created in your portal that references feature data (points, polylines, polygons, or tabular data) on a location available to your ArcGIS GeoAnalytics Server. The big data file share item in your portal allows you to browse for your registered data from ArcGIS GeoAnalytics Server tools. Big data file shares can reference the following data sources:
- File share—A directory of datasets on a local disk or network share.
- HDFS—Apache Hadoop Distributed File System (HDFS) directory of datasets.
- Hive—Apache Hive metastore databases.
- Cloud store—An Amazon Simple Storage Service (S3) bucket, Microsoft Azure Blob container, or Microsoft Azure Data Lake Store containing a directory of datasets.
Note:
Support for Microsoft Azure Data Lake Store was added at ArcGIS Enterprise 10.6.1.
Note:
A big data file share is only available for use if the portal administrator has enabled GeoAnalytics Server. To learn more about enabling GeoAnalytics Server, see Set up ArcGIS GeoAnalytics Server.
Several benefits of using a big data file share are common to all data sources. You can keep your data in an accessible location until you are ready to perform analysis. A big data file share accesses the data when the analysis is run, so you can continue to add more data to an existing dataset in your big data file share without having to re-register or publish your data. You can also modify the manifest to remove, add, or update datasets in the big data file share. Big data file shares are flexible in how time and geometry can be defined, and allow for multiple time formats on a single dataset. Big data file shares also allow you to partition your datasets while still treating multiple partitions as a single dataset.
Note:
Big data file shares are only accessed when you run GeoAnalytics Tools. This means that you can only browse to big data file share datasets and add them as analysis inputs; you cannot visualize the data on a map.
Big data file shares are one of several ways GeoAnalytics Tools can access your data. See Use the GeoAnalytics Tools in Map Viewer for a list of possible GeoAnalytics Tools data inputs.
The following file types are supported as datasets in big data file shares:
- Delimited files (such as .csv, .tsv, and .txt)
- Shapefiles (.shp)
- Parquet files (.gz.parquet)
- ORC files (orc.crc)
Prepare your data to be registered as a big data file share
File shares and HDFS
To prepare your data for a big data file share, you need to format your datasets as subfolders under a single parent folder that will be registered. In this parent folder you register, the names of the subfolders represent the dataset names. If your subfolders contain multiple folders or files, all of the contents of the top-level subfolders are read as a single dataset, and must share the same schema. The following is an example of how to register the folder FileShareFolder that contains three datasets, named Earthquakes, Hurricanes, and GlobalOceans. When you register a parent folder, all subdirectories under the folder you specify are also registered with the GeoAnalytics Server. Always register the parent folder (for example, \\machinename\FileShareFolder) that contains one or more individual dataset folders.
Example of a big data file share that contains three datasets: Earthquakes, Hurricanes, and GlobalOceans.
|---FileShareFolder                 <-- The top-level folder is what is registered as a big data file share
   |---Earthquakes                  <-- A dataset "Earthquakes", composed of 4 csvs with the same schema
      |---1960
         |---01_1960.csv
         |---02_1960.csv
      |---1961
         |---01_1961.csv
         |---02_1961.csv
   |---Hurricanes                   <-- The dataset "Hurricanes", composed of 3 shapefiles with the same schema
      |---atlantic_hur.shp
      |---pacific_hur.shp
      |---otherhurricanes.shp
   |---GlobalOceans                 <-- The dataset "GlobalOceans", composed of a single shapefile
      |---oceans.shp
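The folder convention above can be checked before registration with a short script. The following is a minimal sketch, not part of any ArcGIS tooling: the function name `list_datasets` is hypothetical, and it only mirrors the rule described here, that each top-level subfolder of the parent folder is one dataset made up of every file beneath it.

```python
from pathlib import Path


def list_datasets(parent_folder):
    """Map each top-level subfolder name to the files found beneath it.

    Mirrors the convention described above: the subfolder name is the
    dataset name, and all nested files belong to that single dataset.
    """
    parent = Path(parent_folder)
    datasets = {}
    for sub in sorted(p for p in parent.iterdir() if p.is_dir()):
        # Collect every file under the subfolder, at any depth,
        # as a path relative to the dataset folder.
        files = sorted(
            f.relative_to(sub).as_posix() for f in sub.rglob("*") if f.is_file()
        )
        datasets[sub.name] = files
    return datasets
```

Calling `list_datasets` on a local copy of FileShareFolder would return one entry per dataset, for example an `Earthquakes` key listing `1960/01_1960.csv` and the other CSV files. The script does not verify that the files share a schema; that still needs to be checked separately.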
This same structure is applied to file shares and HDFS, although the terminology differs. In a file share, there is a top-level folder or directory, and datasets are represented by the subdirectories. In HDFS, the file share location is registered and contains datasets. The following table outlines the differences:
| | File share | HDFS |
|---|---|---|
| Big data file share location | A folder or directory | An HDFS path |
| Datasets | Top-level subfolders | Datasets within the HDFS path |
Once your data is organized as a folder with dataset subfolders, make your data accessible to your GeoAnalytics Server by following the steps in Make your data accessible to ArcGIS Server and registering the parent folder that contains the dataset subfolders.
Accessing HDFS using Kerberos
At ArcGIS Enterprise 10.6.1, GeoAnalytics Server can access HDFS using Kerberos authentication.
Follow these steps to register the HDFS file share using Kerberos authentication:
- Sign in to your GeoAnalytics Server site from ArcGIS Server Administrator Directory.
ArcGIS Server Administrator Directory requires you to sign in as an administrator. To connect to your federated GeoAnalytics Server site, you must sign in using a portal token, which requires the portal administrator's credentials, or as the GeoAnalytics Server site's primary site administrator. If you are not a portal administrator or do not have access to the primary site administrator account information, contact your portal administrator to complete these steps for you.
- Go to data > registerItem.
- Copy the following text and paste it into the Item text box. Update the following values:
- <bigDataFileShareName>: Replace with the name you want for the big data file share.
- <hdfs path>: Replace with the fully qualified file system path to the big data file share, for example, hdfs://domainname:port/folder.
- <user@realm>: Replace with the user and realm of the principal.
- <keytab location>: Replace with the location of the keytab file. The keytab file must be accessible to all machines in the GeoAnalytics Server site, for example, //shared/keytab/hadoop.keytab.
{ "path": "/bigDataFileShares/<bigDataFileShareName>", "type": "bigDataFileShare", "info": { "connectionString": "{\"path\":\"<hdfs path>\",\"accessMode\":\"Kerberos\",\"principal\":\"<user@realm>\",\"keytab\":\"<keytab location>\"}", "connectionType": "hdfs" } }
- Click Register Item.
Once the item is registered, the big data file share appears as a data store in ArcGIS Server Manager with a populated manifest. If the manifest is not populated, continue with the remaining steps to regenerate it.
- Sign in to your GeoAnalytics Server site from ArcGIS Server Manager.
You can sign in as a publisher or administrator.
- Go to Site > Data Stores and click the Regenerate Manifest button next to your new big data file share.
You now have a big data file share and manifest for your HDFS, which you will access through Kerberos authentication. The big data file share item in your portal points to a big data catalog service in the GeoAnalytics Server.
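The item text in the steps above nests one JSON document (the connection string) inside another as an escaped string, which is easy to get wrong by hand. The following is a minimal sketch that builds the same text with Python's standard `json` module so the inner quotes are escaped automatically. The function name `build_hdfs_kerberos_item` is hypothetical, and the values passed to it are your own placeholders; the field names are taken from the item text shown above.

```python
import json


def build_hdfs_kerberos_item(share_name, hdfs_path, principal, keytab):
    """Return item text for registerItem, based on the template shown above."""
    connection = {
        "path": hdfs_path,
        "accessMode": "Kerberos",
        "principal": principal,
        "keytab": keytab,
    }
    item = {
        "path": f"/bigDataFileShares/{share_name}",
        "type": "bigDataFileShare",
        "info": {
            # connectionString is itself JSON stored as a string, so it is
            # serialized first and then escaped by the outer json.dumps call.
            "connectionString": json.dumps(connection),
            "connectionType": "hdfs",
        },
    }
    return json.dumps(item, indent=2)
```

Printing the return value gives text that can be pasted into the Item text box. Registration itself is still done through the Administrator Directory as described in the steps.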
Hive
In Hive, all tables in a database are recognized as datasets in a big data file share. In the following example, there is a metastore with two databases, default and CityData. When registering a Hive big data file share through ArcGIS Server with your GeoAnalytics Server, only one database can be selected. In this example, if the CityData database was selected, there would be two datasets in the big data file share, FireData and LandParcels.
|---HiveMetastore                   <-- The Hive metastore that contains the databases
   |---default                      <-- A database
      |---Earthquakes
      |---Hurricanes
      |---GlobalOceans
   |---CityData                     <-- A database that is registered (specified in Server Manager)
      |---FireData
      |---LandParcels
Cloud stores
There are three steps to registering a big data file share of type cloud store.
Prepare your data
To prepare your data for a big data file share in a cloud store, format your datasets as subfolders under a single parent folder.
The following is an example of how to structure your data. This example registers the parent folder, FileShareFolder, which contains three datasets: Earthquakes, Hurricanes, and GlobalOceans. When you register a parent folder, all subdirectories under the folder you specify are also registered with GeoAnalytics Server.
Example of how to structure data in a cloud store that will be used as a big data file share. This big data file share contains three datasets: Earthquakes, Hurricanes, and GlobalOceans.
|---Cloud Store                           <-- The cloud store being registered
   |---Container or S3 Bucket Name        <-- The container (Azure) or bucket (Amazon) being registered as part of the cloud store
      |---FileShareFolder                 <-- The parent folder that is registered as the 'folder' during cloud store registration
         |---Earthquakes                  <-- The dataset "Earthquakes", composed of 4 csvs with the same schema
            |---1960
               |---01_1960.csv
               |---02_1960.csv
            |---1961
               |---01_1961.csv
               |---02_1961.csv
         |---Hurricanes                   <-- The dataset "Hurricanes", composed of 3 shapefiles with the same schema
            |---atlantic_hur.shp
            |---pacific_hur.shp
            |---otherhurricanes.shp
         |---GlobalOceans                 <-- The dataset "GlobalOceans", composed of 1 shapefile
            |---oceans.shp
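In an object store such as an S3 bucket or an Azure Blob container, folders are typically represented as prefixes in object keys rather than as real directories. The following is a minimal sketch of how the layout above maps onto keys: given a list of keys and the registered folder, it groups files by the first path segment under that folder. The function name `datasets_from_keys` is hypothetical, and the key list is assumed to come from whatever listing tool you use; no cloud SDK is called here.

```python
def datasets_from_keys(keys, folder):
    """Group object keys under `folder` by their first path segment.

    Each first-level segment under the folder is treated as a dataset name,
    matching the subfolder convention described above.
    """
    prefix = folder.strip("/") + "/"
    datasets = {}
    for key in keys:
        if not key.startswith(prefix):
            continue  # Object lies outside the registered folder.
        dataset, sep, rest = key[len(prefix):].partition("/")
        # Skip files sitting directly in the parent folder and
        # zero-length "folder marker" keys that end with a slash.
        if sep and rest:
            datasets.setdefault(dataset, []).append(rest)
    return datasets
```

For the example structure, a key such as `FileShareFolder/Earthquakes/1960/01_1960.csv` would be grouped under the `Earthquakes` dataset.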
Register the cloud store with your GeoAnalytics Server
Connect to your GeoAnalytics Server site from ArcGIS Server Manager to register a cloud store. When you register a cloud store, you must include an Azure container name, an Amazon S3 bucket name, or an Azure Data Lake Store account name. It is recommended that you also specify a folder within the container or bucket. The specified folder is composed of subfolders, each of which represents an individual dataset. Each dataset is composed of all the contents of its subfolder.
Register the cloud store as a big data file share
Follow these steps to register the cloud store you created in the previous section as a big data file share:
- Sign in to your GeoAnalytics Server site from ArcGIS Server Manager.
You can sign in as a publisher or administrator.
- Go to Site > Data Stores and choose Big Data File Share from the Register drop-down list.
- Provide the following information in the Register Big Data File Share dialog box:
- Type a name for the big data file share.
- Choose Cloud Store from the Type drop-down list.
- Choose the name of your cloud store from the Cloud Store drop-down list.
- Click Create to register your cloud store as a big data file share.
You now have a big data file share and manifest for your cloud store. The big data file share item in your portal points to a big data catalog service in the GeoAnalytics Server.
Register your big data file share
To register a file share, HDFS, or Hive data source as a big data file share, connect to your GeoAnalytics Server site through ArcGIS Server Manager. See Register your data with ArcGIS Server using Manager in the ArcGIS Server help for details on the necessary steps.
Tip:
Steps for registering a cloud store as a big data file share were covered in the previous section.
When a big data file share is registered, a manifest is generated that outlines the format of the datasets within your share location, including the fields representing the geometry and time. A big data file share item is created in your portal that points to a big data catalog service in the GeoAnalytics Server where you registered the data. To learn more about big data catalog services, see the Big Data Catalog Service documentation in the ArcGIS Services REST API help.
Modify a big data file share
When a big data catalog service is created, a manifest is automatically generated and uploaded to the GeoAnalytics Server site where you registered the data. The process of generating a manifest may not always correctly estimate the fields representing geometry and time, and you may need to apply edits. To edit a manifest, follow the steps in Edit big data file shares in Manager. To learn more about the big data file share manifest, see Understanding the big data file share manifest in the ArcGIS Server help.
Run analysis on a big data file share
You can run analysis on a dataset in a big data file share through any clients that support GeoAnalytics Server, which include the following:
- ArcGIS Pro
- Map Viewer
- ArcGIS REST API
- ArcGIS API for Python
To run your analysis on a big data file share through ArcGIS Pro or Map Viewer, select the GeoAnalytics Tools you want to use. For the input to the tool, browse to where your data is located under Portal in ArcGIS Pro or on the Browse Layers dialog box in Map Viewer. Data will be in My Content if you registered the data yourself. Otherwise, look in your Groups or All Portal. Note that a big data file share layer selected for analysis will not be displayed in the map.
Note:
Make sure you are signed in with a portal account that has access to the registered big data file share. You can search your portal with the term bigDataFileShare* to quickly find all the big data file shares you can access.
To run analysis on a big data file share through the ArcGIS REST API, use the big data catalog service URL as the input. This will be in the format {"url":"https://webadaptorhost.domain.com/webadaptorname/rest/DataStoreCatalogs/bigDataFileShares_filesharename/BigDataCatalogServer/dataset"}. For example, with a machine named example, a domain named esri, a Web Adaptor named server, a big data file share named MyData, and a dataset named Earthquakes, the URL would be: {"url":"https://example.esri.com/server/rest/DataStoreCatalogs/bigDataFileShares_MyData/BigDataCatalogServer/Earthquakes"}. To learn more about input to big data analysis through REST, see the Feature Input topic in the ArcGIS Services REST API documentation.
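The input string follows a fixed pattern, so it can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The function name `feature_input` and its parameters are hypothetical, and `base_url` is assumed to be the Web Adaptor URL (host, domain, and Web Adaptor name) described above.

```python
import json


def feature_input(base_url, share_name, dataset):
    """Build the JSON input for a dataset in a big data file share."""
    url = (
        f"{base_url.rstrip('/')}/rest/DataStoreCatalogs/"
        f"bigDataFileShares_{share_name}/BigDataCatalogServer/{dataset}"
    )
    return json.dumps({"url": url})
```

For example, `feature_input("https://example.esri.com/server", "MyData", "Earthquakes")` produces JSON whose `url` value matches the Earthquakes URL shown above.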