Inner Join
Syntax
- inner_join ( ds1 { as alias1 }, ds2 { as alias2 } { , dsN { as aliasN } }*
{ using usingComp { , usingComp }* }{ filter filterCondition }{ apply applyExpr| calc calcClause| aggr aggrClause { groupingClause } }{ keep comp {, comp }* | drop comp {, comp }* }{ rename compFrom to compTo { , compFrom to compTo }* } )calcClause ::= { calcRole } calcComp := calcExpr { , { calcRole } calcComp := calcExpr }*
calcRole ::= { identifier | measure | attribute | viral attribute }¹
aggrClause ::= { aggrRole } aggrComp := aggrExpr { , { aggrRole } aggrComp := aggrExpr }*
aggrRole ::= { measure | attribute | viral attribute }¹
groupingClause ::= { group by groupingId { , groupingId }* | group except groupingId { , groupingId }* | group all conversionExpr }¹ { having havingCondition }
Input parameters
ds1, …, dsN |
a minimum of two data set operands |
alias |
optional aliases for the input data sets, valid only within the “join” operation
to make it easier to refer to them. If omitted, the data set name must be used.
|
usingComp |
component of the input data sets whose values have to match in the join
|
filterCondition |
a condition (boolean expression) at component level, having only Components
of the input data sets as operands, which is evaluated for each joined
Data Point and filters them (when true the joined Data Point is kept,
otherwise it is not kept)
|
applyExpr |
an expression, having the input data sets as operands, which is pairwise applied
to all their homonym Measure Components and produces homonym Measure
Components in the result; for example if both the data sets ds1 and ds2
have the numeric measures m1 and m2, the clause apply ds1 + ds2 would
result in calculating m1 := ds1#m1 + ds2#m1 and m2 := ds1#m2 + ds2#m2
|
calcClause |
clause that specifies the Components to be calculated, their roles and their
calculation algorithms, to be applied on the joined and filtered Data Points.
|
calcRole |
the role of the Component to be calculated |
calcComp |
the name of the Component to be calculated |
calcExpr |
expression at component level, having only Components of the input
data sets as operands, used to calculate a Component
|
aggrClause |
clause that specifies the required aggregations, i.e., the aggregated
Components to be calculated, their roles and their calculation algorithm,
to be applied on the joined and filtered Data Points
|
aggrRole |
the role of the aggregated Component to be calculated;
if omitted, the Measure role is assumed
|
aggrComp |
the name of the aggregated Component to be calculated; this is a dependent
Component of the result (Measure or Attribute, not Identifier)
|
aggrExpr |
expression at component level, having only Components of the input
data sets as operands, which invokes an aggregate operator
(e.g. avg, count, max…, see also the corresponding sections)
to perform the desired aggregation.
|
groupingClause |
see the section “Aggregate Operators”
|
groupingId |
see the section “Aggregate Operators”
|
conversionExpr |
see the section “Aggregate Operators”
|
havingCondition |
see the section “Aggregate Operators”
|
comp |
dependent Component (Measure or Attribute, not Identifier) to be kept
(in the keep clause) or dropped (in the drop clause)
|
compFrom |
the original name of the Component to be renamed |
compTo |
the new name of the Component after the renaming |
Examples of valid syntaxes
inner_join ( ds1 as d1, ds2 as d2 using Id1, Id2
filter d1#Me1 + d2#Me1 <10
apply d1 / d2
keep Me1, Me2, Me3
rename Id1 to Id10, id2 to id20
)
Semantics for scalar operations
The join operator does not perform scalar operations.
Input parameters type
ds1, …, dsN
dataset
alias1, …, aliasN
name
usingId
name<component>
filterCondition
component<boolean>
applyExpr
dataset
calcComp
name<component>
calcExpr
component<scalar>
aggrComp
name<component>
aggrExpr
component<scalar>
groupingId
name<identifier>
conversionExpr
component<scalar>
havingCondition
component<boolean>
comp
name<component>
compFrom
component<scalar>
compTo
component<scalar>
Result type
result
dataset
Additional Constraints
The aliases must be all distinct, and are mandatory for data sets which appear more than once in the Join (self-join) and for non-named data set obtained as result of a sub-expression.
Let \(DS_i\) denote a join operand, with \(i = 1,\ldots,n\) where \(n\) is the number of join operands; let also \(I_i\) denote the set of its identifiers.
inner_join requires that \(\exists\, P = { DS_{p_1},\ldots,DS_{p_n} \mid \forall p_i \in S_n \wedge \bigcap_{j=1}^{i} I_{p_j} \neq \varnothing }\), where \(S_n\) denotes the lexicographically-ordered permutation of the first \(n\) integers, or, in other words, that there exists a permutation of the join operands such that each operand has at least one identifier in common with any of the operands preceding it in that permutation.
Moreover, an optional using clause may be specified to restrict which actual identifiers are used in the join.
if using is not specified, inner_join performs a natural join, that is, it uses all the identifiers appearing in more than one dataset as join keys; if the required condition does not hold, an error is raised.
if using is specified, inner_join removes from each of the \(I_i\) sets those identifiers not appearing in the using clause, and only uses the remaining identifiers as join keys if they appear in more than one dataset. If an identifier is named, but does not appear in multiple datasets, an error is raised. The resulting \(I_i\) sets must still satisfy the required condition for inner_join. If a valid permutation that satisfies that condition cannot be found as a result of this manipulation, an error is raised.
The inner_join operator must fulfil also other constraints:
apply, calc and aggr clauses are mutually exclusive
keep and drop clauses are mutually exclusive
comp can be only dependent Components (Measures and Attributes, not Identifiers)
An Identifier not included in the group by clause (if any) cannot be included in the rename clause
An Identifier included in the group except clause (if any) cannot be included in the rename clause. If the aggr clause is invoked and the grouping clause is omitted, no Identifier can be included in the rename clause
A dependent Component not included in the keep clause (if any) cannot be renamed
A dependent Component included in the drop clause (if any) cannot be renamed
Behaviour
First, the join keys are determined, either implicitly or by the using clause, and if the determined keys do not satisfy the constraints of inner_join, an error is raised; then a relational join of the input operands is performed, by matching the join keys according to SQL inner join (inner_join).
The SQL relational join produces an intermediate result, called virtual data set (VDS₁); this virtual data set VDS₁ has the following components:
The join keys, which appear once and maintain their names;
The remaining components, identifier or not, coming from exactly one input data set, which appear once and maintain their original name and role.
The remaining components, identifier or not, coming from multiple data sets, which appear as many times as the data sets they come from; names of each of these components are prefixed with the alias of the data set they come from, separated by the “#” symbol; in this context, the symbol “#” does not denote the membership operator, but acts just as a separator between the data set and the component name. If the aliases are not defined, the names are prefixed with the data set name. If the data set name can’t be determined (for example the join operand is an expression), an error is raised. For example, if “population” appears in two input data sets “ds1” and “ds2”, that have the aliases “a” and “b” respectively, both “a#population” and “b#population” will appear in the virtual Data Set; If the aliases were not specified, the names must be used (i.e. “ds1#population” and “ds2#population”).
Then, subsequent clauses in the inner_join are procedurally evaluated on the virtual data set VDS₁ as follows.
The filter is applied on VDS₁, if present, producing the Virtual data set VDS₂. It operates by filtering the data points to match a specified boolean expression; when the expression is true the Data Point is kept in the result, otherwise it is discarded.
Either one of the apply, calc or aggr expressions, if present, are applied on VDS₂ producing VDS₃:
apply combines the homonym Measures in the source operands whose type is compatible with the operators used in applyExpr, for each of them generating a single measures. The expression applyExpr can use as input the names or aliases of the operand data sets. It applies the expression to each of the n-uple of homonym Measures in the input data sets producing in the target a single homonym Measure for each n-uple. It can be thought of as the multi-measure version of the calc. For example, if d1, d2, and d3 are data sets with two measures each named M1 and M2, then the expression d1+d2+d3, respectively sums the values of M1 and M2, so as to obtain in the result: calc M1 := d1#M1 + d2#M1 + d3#M1, M2 := d1#M2 + d2#M2 + d3#M2. All the input measures do not appear in VDS₂.
calc calculates new components by independent sub-expressions evaluated at component level. The role of the new components can be specified, so the original role can be changed if the name is kept the same; if the role is not specified, it will be measure. Each sub-expression can only reference components in VDS₂, and cannot use components generated by other sub-expressions. If a generated component already exists in any of the input data sets, all input components homonymous to the former are removed, and only the generated component is kept; if any of the removed components were identifiers, an error is raised. Analytic operators can be used in the expressions.
aggr aggregates values of non-identifier components by indipendent sub-expressions evaluated at component level. The role of the components can be specified, so the original role can be changed if the name is kept the same; if the role is not specified, it will be measure. All non-identifier components are removed, and only the components generated by the sub-expressions are kept; if any of the removed components were identifiers, an error is raised. The sub-expressions must contain only Aggregate operators, which may specify a grouping clause and/or a filtering expression, as specified in the section for “Aggregation operators”. If no grouping clause is specified, then all the input Data Points are aggregated in a single group and the clause returns a data set that contains a single Data Point and has no Identifier Components.
Either a keep or drop clause, whichever is present, is applied on VDS₃, producing the Virtual data set VDS₄:
keep will maintain all the identifiers and only the specified non-identifier components of VDS₃;
drop will remove all specified non-identifier components of VDS₃.
The rename clause, if present, is applied on VDS₄, producing the Virtual data set VDS₅ by giving each specified component with a new name. If the chosen name already exists in VDS₄, an error is raised.
Finally, all components that originally appeared in multiple input data sets, are renamed by stripping their previously determined prefix; if this step determines a structure with homonymous components, an error is raised.
The contents of inner_join are ideally determined stepwise, by using the left-most dataset in the order defined by the chosen permutation as the initial partial result, and joining each of the other input data sets to the partial result, starting from the left side and proceeding towards the right side. In each step, a data point in VDS₁ is generated for each pair of data points in the partial result and the joined data set in which the common join keys assume the same value. Then, the step is repeated by joining this partial result to the next data set.
The Viral Attribute propagation in the join is the following. The Attributes explicitly calculated through the calc or aggr clauses are maintained unchanged. Other viral attributes, present in exactly one input data set, are also kept unchanged. For all the other viral attributes, which are present in multiple data sets, the Attribute propagation rule is applied on VDS₂ (see “Attribute Propagation Rule” section in the User Manual).
Examples
Given the operand datasets DS_1 DS_2 and DS_3:
Input DS_1 (see structure)
Id_1 |
Id_2 |
Me_1 |
Me_2 |
|---|---|---|---|
1 |
A |
A |
B |
1 |
B |
C |
D |
2 |
A |
E |
F |
Input DS_2 (see structure)
Id_1 |
Id_2 |
Me_1A |
Me_2 |
|---|---|---|---|
1 |
A |
B |
Q |
1 |
B |
S |
T |
3 |
A |
Z |
M |
Input DS_3 (see structure)
Id_1 |
Id_2 |
Me_1 |
Me_2 |
|---|---|---|---|
1 |
A |
B |
Q |
1 |
B |
S |
T |
3 |
A |
Z |
M |
Input DS_4 (see structure)
Id_1 |
Me_1 |
|---|---|
1 |
200 |
2 |
300 |
3 |
100 |
Input DS_5 (see structure)
Id_2 |
Id_3 |
Me_2 |
|---|---|---|
10 |
S11 |
12345678 |
20 |
S2 |
87654321 |
30 |
S121 |
18273645 |
Input DS_6 (see structure)
Id_1 |
Id_2 |
Id_4 |
Me_3 |
|---|---|---|---|
1 |
30 |
c |
|
1 |
10 |
d |
|
2 |
30 |
c |
|
2 |
20 |
c |
|
2 |
10 |
d |
|
3 |
30 |
c |
|
3 |
10 |
d |
50 |
3 |
20 |
d |
50 |
Example 1
DS_r := inner_join (DS_1 as d1, DS_2 as d2 keep Me_1, d2#Me_2, Me_1A );
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
Me_2 |
Me_1A |
|---|---|---|---|---|
1 |
A |
A |
Q |
B |
1 |
B |
C |
T |
S |
Example 2
DS_r := inner_join (DS_1 as d1, DS_2 as d2 filter Me_1 = "A" calc Me_4 := Me_1 || Me_1A drop d1#Me_2);
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
Me_1A |
Me_2 |
Me_4 |
|---|---|---|---|---|---|
1 |
A |
A |
B |
Q |
AB |
Example 3
DS_r := inner_join ( DS_1 filter Id_2 ="B" calc Me_2 := Me_2 || "_NEW" keep Me_1, Me_2);
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
Me_2 |
|---|---|---|---|
1 |
B |
C |
D_NEW |
Example 4
DS_r := inner_join (DS_1 as d1, DS_3 as d2 apply d1 || d2);
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
Me_2 |
|---|---|---|---|
1 |
A |
AB |
BQ |
1 |
B |
CS |
DT |
Example 5
IBSC := inner_join(DS_5, DS_6[sub Id_4 = "c"] as CI);
IBSD := inner_join(DS_5, DS_6[sub Id_4 = "d"] as DI);
DS_r := inner_join (
DS_4,
IBSC[rename Id_2 to Id_21, Id_3 to Id_31, Me_2 to Me_21] as IBSC,
IBSD[rename Id_2 to Id_22, Id_3 to Id_32, Me_2 to Me_22] as IBSD
);
results in (see structure):
Id_1 |
Id_21 |
Id_22 |
Me1 |
Id_31 |
Me_21 |
Id_32 |
Me_22 |
Me_3 |
|---|---|---|---|---|---|---|---|---|
1 |
30 |
10 |
200 |
S121 |
18273645 |
S11 |
12345678 |
|
2 |
30 |
10 |
300 |
S121 |
18273645 |
S11 |
12345678 |
|
2 |
20 |
10 |
300 |
S2 |
87654321 |
S11 |
12345678 |
|
3 |
30 |
10 |
100 |
S121 |
18273645 |
S11 |
12345678 |
50 |
3 |
30 |
20 |
100 |
S121 |
18273645 |
S2 |
87654321 |
50 |