Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2167] Allow filtering of Hive datasets by underlying HDFS folder location #4069

Merged

Conversation

Will-Lo
Copy link
Contributor

@Will-Lo Will-Lo commented Oct 19, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Hive tables can be located in different folders in HDFS even if they belong to the same database.
    This becomes tricky to manage within a single Gobblin job especially when there are different permissions and handling based on underlying files for viewFS.

This PR adds a configuration to have a regex to filter tables based on their table location:

hive.dataset.tableFolderAllowlistFilter=<regex>

where tables with paths matching this filter will be selected, otherwise ignore

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Oct 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.36%. Comparing base (45ad13e) to head (d4e60be).
Report is 15 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4069      +/-   ##
============================================
+ Coverage     45.12%   45.36%   +0.24%     
+ Complexity     3199     3181      -18     
============================================
  Files           705      695      -10     
  Lines         26949    26587     -362     
  Branches       2680     2655      -25     
============================================
- Hits          12160    12061      -99     
+ Misses        13781    13523     -258     
+ Partials       1008     1003       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Assert.assertEquals(datasets.size(), 1);

properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, "");
// The table located at /tmp/test should be filtered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be helpful to call out that the dataset table is created at path /tmp/test on line 221 since there is another assertion above using a different regex which doesn't filter the table

finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool);
datasets = Lists.newArrayList(finder.getDatasetsIterator());

Assert.assertEquals(datasets.size(), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also add a test for the case where the regex is empty or null

@@ -215,6 +215,32 @@ public void testDatasetConfig() throws Exception {

}

@Test
public void testHiveTableFolderFilter() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be renamed to testHiveTableFolderAllowlistFilter

if (!regex.isPresent()) {
return true;
}
return Pattern.compile(regex.get()).matcher(table.getSd().getLocation()).matches();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern.compile(regex.get()) is called every time the method shouldAllowTableLocation is executed when the regex is present. Compiling a regex every time can be inefficient, we can compile the pattern once and store it, eg.:
private static Pattern compiledAllowlistPattern = regex.map(Pattern::compile).orElse(null);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compiled it as part of the class field instead, good callout

@khandelwal-prateek
Copy link
Contributor

LGTM!

@Will-Lo Will-Lo merged commit 916d570 into apache:master Oct 22, 2024
6 checks passed
@Will-Lo Will-Lo deleted the add-underlying-fs-allowlist-hive-dataset-finder branch October 22, 2024 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants