java – Aggregate data from a huge list under 50ms

I got this question as a coding challenge and was unable to get it done under 50 milliseconds (my solution takes >100 ms) 😀

Would you please review my code and share any ideas on how to do this within 50 ms?

Problem Description
One of our customers, a multinational company that manufactures industrial appliances, has an internal system to procure (purchase) all resources the company needs to operate. The procurement is done through the company’s own ERP (Enterprise Resource Planning) system.

A typical business process represented by the ERP system is procure-to-pay, which generally includes the following activities:

create purchase request
request approved
create purchase order
select supplier
receive goods
pay invoice
Whenever the company wants to buy something, they do so through their ERP system.

The company buys many resources, always using their ERP system. Each resource purchase can be considered a case, or single instance of this process. As it happens, the actual as-is process often deviates from the ideal to-be process. Sometimes purchase requests are raised but never get approved, sometimes a supplier is selected but the goods are never received, sometimes it simply takes a long time to complete the process, and so on. We call each unique sequence of activities a variant.

The customer provides us with extracted process data from their existing ERP system. The customer extracted one of their processes for analysis: Procure-to-pay. The logfiles contain three columns:

activity name
case id
timestamp
We want to analyse and compare process instances (cases) with each other.

Acceptance Criteria

  • Aggregate cases that have the same event execution order and list the
    10 variants with the most cases.
  • As that output is used by other highly interactive components, we
    need to be able to get the query results in well under 50
    milliseconds.

Notes:

  • The sample data set is not sorted; please use the timestamp in the
    last column to ensure the correct order.
  • The time required to read the CSV file is not considered part of the
    50 milliseconds specified in the acceptance criteria.

Sample data (the actual file, which contains 62,000 rows, is here):

CaseID;ActivityName;Timestamp
100430035020241420012015;Create purchase order item;2015-05-27 12:44:47.000
100430035020261980012015;Create MM invoice by vendor;2015-07-13 00:00:00.000
100430035020119700012015;Reduce purchase order item net value;2015-02-13 10:24:02.000
100430035020066380012015;Change purchase order item;2015-01-23 09:39:33.000
100430035020232560012015;Change purchase order item;2015-05-11 07:58:29.000
100430031000134820012015;Clear open item;2015-07-28 23:59:59.000
100430035020241250012015;Remove payment block;2015-06-04 16:36:26.000
100430035020193960012015;Enter goods receipt;2015-03-12 20:00:06.000
100430031000151590012015;Clear open item;2015-11-24 23:59:59.000
100430031000129230012015;Post invoice in FI;2015-06-01 12:00:37.000
100430035020228280012015;Create MM invoice by vendor;2015-04-07 00:00:00.000
100430031000113630012015;Clear open item;2015-03-24 23:59:59.000
100430035020260940012015;Enter goods receipt;2015-07-16 15:07:49.000
100430035020244540012015;Create purchase order item;2015-06-02 11:06:11.000

My rejected code:

fun main(args: Array<String>) {
    val eventlogRows = CSVReader.readFile("samples/Activity_Log.csv")

    val begin = System.currentTimeMillis()

    // Group events by case, order each case's events by timestamp,
    // and use the resulting activity sequence as the variant key.
    val grouped = eventlogRows.groupBy { it.caseId }
    val map = hashMapOf<String, Int>()
    grouped.forEach { (_, events) ->
        val sorted = events.sortedBy { it.timestamp } // a sorted set would drop events with equal timestamps
        val variant = sorted.joinToString { it.eventName }
        map[variant] = (map[variant] ?: 0) + 1
    }

    // Rank variants by case count; the top 10 are the requested output.
    val sortedByDescending = map.entries.sortedByDescending { it.value }

    val end = System.currentTimeMillis()

    println(String.format("Duration: %s milliseconds", end - begin))
}
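
For comparison, here is a sketch (not an accepted answer) of the leaner direction I would try: pre-parse timestamps to epoch millis, group into plain lists, sort each small case in place, and build the variant key with a reused StringBuilder. The EventRow type below is my own stand-in for whatever CSVReader actually returns.

// Hypothetical row type mirroring the fields used above.
data class EventRow(val caseId: String, val eventName: String, val timestamp: Long)

// One pass to group, an in-place sort per (small) case, cheap keys, then top 10.
fun topVariants(rows: List<EventRow>, k: Int = 10): List<Pair<String, Int>> {
    val byCase = HashMap<String, MutableList<EventRow>>(1 shl 16)
    for (row in rows) {
        byCase.getOrPut(row.caseId) { ArrayList(8) }.add(row)
    }

    val counts = HashMap<String, Int>(1 shl 12)
    val sb = StringBuilder(128)
    for (events in byCase.values) {
        events.sortBy { it.timestamp }          // keeps duplicate timestamps, unlike a sorted set
        sb.setLength(0)
        for (e in events) sb.append(e.eventName).append('|')
        counts.merge(sb.toString(), 1, Int::plus)
    }

    // Sorting the (relatively few) distinct variants is cheap; take the top k.
    return counts.entries
        .sortedByDescending { it.value }
        .take(k)
        .map { it.key to it.value }
}

Most of the time in the original version goes into building a sorted set and a joined string per case; avoiding those allocations should get ~62,000 rows well under the 50 ms budget, though that last claim is an expectation, not a measurement.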

Efficient algorithm to aggregate a heightmap to a lower resolution

I have a raw height map which basically consists of cells of the following structure:

type Cell =
    { Coordinate: GeoCoordinate // contains Latitude and Longitude of the coordinate
      Elevation: int16 }

I generate this height map from real world data.

Now I want to aggregate the height map to a lower resolution, say from a cell grid length of 300 meters to 10 kilometers, that is, by averaging the elevations. Of course I can apply a brute-force algorithm for that, e.g. starting from the center cell and aggregating it into a “bigger” cell, memorizing which cells have already been considered, and so forth. But maybe this is not the best way of doing it. Are there more efficient ways (algorithms) to aggregate such a height map?
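
For reference, a minimal sketch of plain block averaging (in Kotlin rather than F#, with a flat row-major grid and an aggregation factor chosen for illustration):

// Downsample a row-major elevation grid by averaging factor x factor blocks.
// Edge blocks may be smaller than `factor` if the grid size is not a multiple of it.
fun downsample(elevations: Array<ShortArray>, factor: Int): Array<DoubleArray> {
    val rows = elevations.size
    val cols = elevations[0].size
    val outRows = (rows + factor - 1) / factor
    val outCols = (cols + factor - 1) / factor
    val result = Array(outRows) { DoubleArray(outCols) }
    for (r in 0 until outRows) {
        for (c in 0 until outCols) {
            var sum = 0L
            var count = 0
            for (i in r * factor until minOf((r + 1) * factor, rows)) {
                for (j in c * factor until minOf((c + 1) * factor, cols)) {
                    sum += elevations[i][j]
                    count++
                }
            }
            result[r][c] = sum.toDouble() / count
        }
    }
    return result
}

Each fine cell is visited exactly once, so this is already O(n) in the number of cells; something like a summed-area table only pays off if you need many different target resolutions from the same source grid.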

domain driven design – Aggregate roots always have to be “complete”?

Let's think about the most common Aggregate Root example: Order and OrderItems (or OrderLines).

So I want a use case called UpdateOrderItem. Given an OrderItemID and a complete OrderItemDTO in the request, I want to update the OrderItem info. Following the Aggregate Root pattern, OrderItems should not be handled alone, so I'll need to instantiate Order. My question is: to edit a specific OrderItem, should I load all OrderItems and build the complete Aggregate, or can I retrieve just the Order data and the specific OrderItem data from the repository and proceed with the update?
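
For context, a minimal sketch of the pattern under discussion, with illustrative names and an invented invariant (an order-wide total cap) to show why the root may need its sibling items:

// Items are only modified through the root, so the root can enforce
// invariants that span the whole order.
class OrderItem(val id: String, var quantity: Int, var unitPrice: Double)

class Order(val id: String, private val items: MutableList<OrderItem>) {

    fun updateItem(itemId: String, newQuantity: Int, newUnitPrice: Double) {
        val item = items.find { it.id == itemId }
            ?: throw IllegalArgumentException("Unknown item $itemId")

        // Hypothetical invariant that needs the sibling items, not just this one.
        val newTotal = items.sumOf {
            if (it.id == itemId) newQuantity * newUnitPrice else it.quantity * it.unitPrice
        }
        require(newTotal <= 10_000.0) { "Order total exceeds the allowed limit" }

        item.quantity = newQuantity
        item.unitPrice = newUnitPrice
    }
}

Whether the repository has to hydrate every item then comes down to whether the invariants you actually enforce (like the total cap above) need the siblings at all; if none do, loading only the Order data plus the one item is a defensible shortcut.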

php – Identifying Aggregate Root and Services Logic Responsibility

I'm attempting to refactor a small meal planner following DDD, but I'm completely lost and overwhelmed trying to identify the separation of concerns within my domain models.

Here's the business requirement: a user can generate a 14-day plan consisting of 3 meals per day (a collection of 42 Meal objects); they must provide their body info along with their meal plan preferences before they're allowed to generate a plan. The body info is only required to calculate the optimal caloric range per meal, whilst the meal preferences are mostly for sorting the recipes (provided by the Recipes subdomain).

I've come up with the following:

src
|-- App
|   `-- MealPlan
|       `-- Application
|       |   `-- Create
|       |   |   |-- CreateMealPlanService.php
|       |   `-- Find
|       `-- Domain
|       |   |-- Exception
|       |   |-- Services
|       |   |   |-- BuildMealsService.php
|       |   `-- Meal
|       |   |   |-- Meal.php
|       |   |   |-- MealId.php // uuid 
|       |   |   |-- Ingredients.php // VO
|       |   |   |-- Macros.php // vo
|       |   |   |-- Meals.php
|       |   `-- Profile
|       |   |   |-- Gender.php
|       |   |   |-- Height.php
|       |   |   |-- Unit.php
|       |   |   |-- Weight.php
|       |   |   |-- UserId.php // User Id from third party service sent through http requests
|       |   |   |-- Profile.php // ValueObject
|       |   `-- Preferences 
|       |   |   |-- Allergen.php
|       |   |   |-- Intolerance.php
|       |   |   |-- Preferences.php // ValueObject
|       |   |-- MealPlan.php // The aggregate root
|       |   |-- MealPlanId.php // uuid
|       |   |-- MealPlanRepositoryInterface.php
|       `-- Infrastructure
|           `-- Http
|           `-- Persistence
|               `--MealPlanRepository.php
`-- Recipes
|   `-- Application
|   `-- Domain
|   `-- Infrastructure
|
`-- Shared
    `-- Domain
    `-- Infrastructure

Use case: after authentication, during the onboarding stage, the Profile and Preferences form data gets sent to CreateMealPlanService.

At this stage I'm not sure if CreateMealPlanService should call BuildMealsService, or if that logic should be within the MealPlan aggregate root. Obviously a MealPlan belongs to a user (the UserId I put in the Profile as a VO, but I also think the MealPlan should hold that UserId reference…).

I figured BuildMealsService would inject the RecipesService (from the Recipes bounded context), generate the 42 meals there, and pass them to the MealPlan to be saved, but at this point I'm no longer sure whose responsibility it really is.

Here's the MealPlan aggregate root:

final class MealPlan {

    public function __construct(
        private MealPlanId $id,
        private Profile $profile,
        private Preferences $preferences,
        private Meals $meals
    ){}

    public static function create(MealPlanId $id, Profile $profile, Preferences $preferences, Meals $meals){
        return new self($id, $profile, $preferences, $meals);
    }

    public function id(): MealPlanId {
        return $this->id;
    }

    public function profile(): Profile {
        return $this->profile;
    }

    public function preferences(): Preferences {
        return $this->preferences;
    }

    public function meals(): Meals {
        return $this->meals;
    }

}
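
To make the trade-off concrete, here is a small sketch (Kotlin rather than PHP, and not your actual classes) of the variant where the application service only orchestrates and the aggregate owns the "exactly 42 meals" rule:

// Illustrative only: the service fetches candidate meals through a port to the
// Recipes context, and MealPlan.create enforces the plan-size invariant.
data class Meal(val name: String, val calories: Int)

class MealPlan private constructor(
    val id: String,
    val userId: String,
    val meals: List<Meal>
) {
    companion object {
        const val MEALS_PER_PLAN = 14 * 3

        fun create(id: String, userId: String, meals: List<Meal>): MealPlan {
            require(meals.size == MEALS_PER_PLAN) { "A plan needs exactly $MEALS_PER_PLAN meals" }
            return MealPlan(id, userId, meals)
        }
    }
}

// Port to the Recipes bounded context.
interface RecipeCatalog {
    fun mealsFor(caloriesPerMeal: Int, excludedAllergens: Set<String>): List<Meal>
}

class CreateMealPlanService(private val recipes: RecipeCatalog) {
    fun handle(planId: String, userId: String, caloriesPerMeal: Int, excludedAllergens: Set<String>): MealPlan {
        val meals = recipes.mealsFor(caloriesPerMeal, excludedAllergens)
        return MealPlan.create(planId, userId, meals) // the invariant lives in the aggregate
    }
}

In this reading, a separate BuildMealsService is optional: recipe selection stays behind a port to the Recipes context, while the rule about what makes a valid plan stays inside MealPlan.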

latex – How to express mathematically an aggregate operation in a set

I'm having a problem expressing an aggregate function mathematically.

I have a set C which is composed of m elements e:

C = {e1, e2, …, em}

select
property_1
, property_2
, property_3
, property_N
from element

Each element e has n properties:

eProperties = {p1, p2, …, pn}

I need to apply an aggregate function like sum, min or max, grouping by some of the eProperties, over the elements e in C:

select 
property_2
, property_3
, property_N
, count(property_1) [countElements]
, min(property_1) [min_Property_3]
, max(property_1) [max_Property_3]
from element
group by property_2
, property_3
, property_N

and the result will be p subsets SC, where each element e is classified by the values of its properties. For example, one subset (the first row) will contain all elements e with property2 = 1, property3 = 3 and propertyN = 2; this subset has 26 elements.

How can I express that function from C to p subsets SC?
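
For what it's worth, here is one possible formulation in LaTeX; the symbols $\sim$, $G$ and $\Phi$ are my own choices, and $p_j(e)$ denotes the value of property $j$ for element $e$:

% Group-by as the partition induced by an equivalence relation on the
% grouping properties, with an aggregate applied to each block.
\[
  e \sim e' \iff p_j(e) = p_j(e') \;\; \forall\, j \in G,
  \qquad G \subseteq \{2, 3, \ldots, n\}
\]
\[
  C/\!\sim \; = \{ SC_1, SC_2, \ldots, SC_p \},
  \qquad
  \Phi(SC_k) = \Bigl( |SC_k|,\; \min_{e \in SC_k} p_1(e),\; \max_{e \in SC_k} p_1(e) \Bigr)
\]

Read this as: two elements are equivalent when they agree on every grouping property in G; the quotient set gives the p subsets SC, and $\Phi$ applies count, min and max of property 1 to each subset, which mirrors the GROUP BY query above.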

Can you recommend me a book?

Thanks in advance.

How to aggregate rows with the same key in google sheets

I have a table of dates with names that I would like to aggregate. As time passes, more dates and more names can be added, so I’d like this solution to auto-populate itself.
Basically,

       Abe  Bob  Charlie
5/12   2    1    x
5/12        x
5/13   1    1    1
5/13        1    1

should become

       A    B    C
5/12   1    2    1
5/13   1    2    2

I tried COUNTA(QUERY({Input!$A$1:$H$10}, "SELECT Col"&Column(B$1) - Column($A$1) + 1&" WHERE Col1 = '"&$A2&"'")) – This works, but it doesn’t allow me to autopopulate the rest of the table.

I think I need to use ArrayFormula somehow, but from what I’ve read, it’s only able to iterate over the input to the query and not the values in the various columns. I tried ARRAYFORMULA(COUNTA(QUERY({A1:H1; ARRAYFORMULA(VLOOKUP($A$2, Input!$A$1:$H$10, COLUMN(Input!$A$1:$H$1), 0))}, "SELECT Col2 WHERE Col1 = '"&$A$2:$A$4&"'"))), but it doesn’t iterate as desired (over $A$2:$A$4).

I’m stuck on how to achieve this.

I have a sample sheet here

event sourcing – In CQRS/ES where does an Aggregate Root belong?

Disclaimer: This question may be related to the framework I'm using to support CQRS/ES rather than the concepts themselves, but many of these frameworks implement the same strategies, making me think the two are tightly coupled regardless.

CQRS tells us to…

use a different model to update information than the model you use to read information [1]

And in event sourcing…

The fundamental idea of Event Sourcing is that of ensuring every change to the state of an application is captured in an event object [2]

My design includes aggregate root objects, upon which methods are called to make changes (in my particular case called from Commands/Handlers). Those methods check the invariants and then publish an event to a bus, which in turn updates some aspect of the aggregate, typically setting properties or adding items to a collection. These events also update my read model so that I have a projection of the most recent state of the system that can be easily queried. Most of my queries simply act upon the most recent state, but occasionally I need to create a projection for an aggregate as it existed at a point in the past (hence the use of event sourcing).

As such my aggregate root and read model share a very similar “shape”, so similar that I’ve created an interface that both implement so that I can treat them equally depending on the type of query being executed.
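
As a concrete illustration of that shared shape (hypothetical names in Kotlin, not tied to any particular framework):

// Shared "shape" exposed by both the write-side aggregate and the read-side
// projection, so queries can treat them interchangeably.
interface InvoiceState {
    val id: String
    val lineTotals: List<Double>
}

sealed interface InvoiceEvent
data class LineAdded(val invoiceId: String, val amount: Double) : InvoiceEvent

// Write model: checks invariants, emits an event, and applies it to itself.
class InvoiceAggregate(override val id: String) : InvoiceState {
    override val lineTotals = mutableListOf<Double>()

    fun addLine(amount: Double): InvoiceEvent {
        require(amount > 0) { "Line amount must be positive" }   // invariant check
        val event = LineAdded(id, amount)
        applyEvent(event)
        return event   // the caller publishes this to the bus
    }

    fun applyEvent(event: InvoiceEvent) {
        when (event) {
            is LineAdded -> lineTotals.add(event.amount)
        }
    }
}

// Read model: folds the same events with no invariants of its own.
class InvoiceProjection(override val id: String) : InvoiceState {
    override val lineTotals = mutableListOf<Double>()
    fun on(event: LineAdded) { lineTotals.add(event.amount) }
}

The two types end up nearly identical in shape, but only the aggregate guards invariants, which is the usual argument for leaving it in the write model even when a common interface is shared for querying.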

Given that the aggregate root and read model are so similar, and even though the aggregate appears to belong to the write model (as the commands act upon it) does it in fact belong in the read model?

Or, where does the aggregate root belong? In the write model, the read model or in a shared domain model, which seems to go against the whole CRQS idea?

google sheets – Using VLOOKUP with pivot table – OR – another way to aggregate data

You don’t need the pivot table at all (i.e., you can delete it altogether).

Delete everything in Column C (including the header) and place the following formula in C1:

=ArrayFormula({"REALlastContact";IF(B2:B="",,VLOOKUP(B2:B,SORT({B2:B,D2:D},2,0),2,FALSE))})

This reads, in plain English, as follows: “Process an entire array, not just one cell. First, put the header. Under that (as indicated by the semicolon), if any row is blank in Column B, leave it null in Column C. Otherwise, look up whatever is in that row of Column B within a two-column array of the family paired with the contact date, sorted upside-down by contact date, and return the contact date (which will be the most recent, because the highest/most recent dates will be found first when sorted upside-down).”

MongoDB Aggregate Poor Index Usage

I've been trying to understand the MongoDB aggregate process so I can better optimize my queries, and I'm confused by the use of $match and $sort together.

The sample DB has only one collection, people:

[{
    "name": "Joe Smith",
    "age": 40,
    "admin": false
},
{
    "name": "Jen Ford",
    "age": 45,
    "admin": true
},
{
    "name": "Steve Nash",
    "age": 45,
    "admin": true
},
{
    "name": "Ben Simmons",
    "age": 45,
    "admin": true
}]

I’ve multiplied this data x1000 just as a POC.

The DB above has one index, name_1.

The following query

db.people.find({"name": "Jen Ford"}).sort({"_id": -1}).explain()

has the following output:

{ queryPlanner: 
   { plannerVersion: 1,
     namespace: 'db.people',
     indexFilterSet: false,
     parsedQuery: { name: { '$eq': 'Jen Ford' } },
     queryHash: '3AE4BDA3',
     planCacheKey: '2A9CC473',
     winningPlan: 
      { stage: 'SORT',
        sortPattern: { _id: -1 },
        inputStage: 
         { stage: 'SORT_KEY_GENERATOR',
           inputStage: 
            { stage: 'FETCH',
              inputStage: 
               { stage: 'IXSCAN',
                 keyPattern: { name: 1 },
                 indexName: 'name_1',
                 isMultiKey: false,
                  multiKeyPaths: { name: [] },
                 isUnique: false,
                 isSparse: false,
                 isPartial: false,
                 indexVersion: 2,
                 direction: 'forward',
                  indexBounds: { name: [ '["Jen Ford", "Jen Ford"]' ] } } } } },
     rejectedPlans: 
       [ { stage: 'FETCH',
          filter: { name: { '$eq': 'Jen Ford' } },
          inputStage: 
           { stage: 'IXSCAN',
             keyPattern: { _id: 1 },
             indexName: '_id_',
             isMultiKey: false,
              multiKeyPaths: { _id: [] },
             isUnique: true,
             isSparse: false,
             isPartial: false,
             indexVersion: 2,
             direction: 'backward',
              indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } } ] },
  serverInfo: 
   { host: '373ea645996b',
     port: 27017,
     version: '4.2.0',
     gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
  ok: 1 }

This makes total sense.

However

The following query results in the same set but uses the aggregate pipeline

db.people.aggregate([ { $match: { $and: [ { name: "Jen Ford" } ] } }, { $sort: { "_id": -1 } } ], { "explain": true })

and has the following output:

{ queryPlanner: 
   { plannerVersion: 1,
     namespace: 'db.people',
     indexFilterSet: false,
     parsedQuery: { name: { '$eq': 'Jen Ford' } },
     queryHash: '3AE4BDA3',
     planCacheKey: '2A9CC473',
     optimizedPipeline: true,
     winningPlan: 
      { stage: 'FETCH',
        filter: { name: { '$eq': 'Jen Ford' } },
        inputStage: 
         { stage: 'IXSCAN',
           keyPattern: { _id: 1 },
           indexName: '_id_',
           isMultiKey: false,
            multiKeyPaths: { _id: [] },
           isUnique: true,
           isSparse: false,
           isPartial: false,
           indexVersion: 2,
           direction: 'backward',
            indexBounds: { _id: [ '[MaxKey, MinKey]' ] } } },
     rejectedPlans: [] },
  serverInfo: 
   { host: '373ea645996b',
     port: 27017,
     version: '4.2.0',
     gitVersion: 'a4b751dcf51dd249c5865812b390cfd1c0129c30' },
  ok: 1 }

Notice how the aggregate query does not recognize that it should use the name index for the $match. This has massive implications as the size of the collection grows.

I’ve seen this behavior now in Mongo 3.4, 3.6, and 4.2.

https://docs.mongodb.com/v4.2/core/aggregation-pipeline-optimization/ provides this blurb

$sort + $match Sequence Optimization:
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort.

From all this, I think I’m fundamentally misunderstanding something with the Mongo aggregate command.

I already understand that if I create a compound index on name and _id then it will work, as it includes the fields used in both my $match and my $sort clause.

But why must an index include a field from the $sort clause in order to be used to restrict my $match set? It seems obvious that we would prefer to $sort the smallest possible set.
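
For reference, the compound index mentioned above could be created like this; a sketch using the MongoDB Java sync driver from Kotlin, with the connection string and db/collection names assumed from the question:

import com.mongodb.client.MongoClients
import com.mongodb.client.model.Indexes

fun main() {
    MongoClients.create("mongodb://localhost:27017").use { client ->
        val people = client.getDatabase("db").getCollection("people")

        // Equality field first, sort field second, so the planner can use the
        // index for both the $match filter and the $sort order.
        val indexName = people.createIndex(
            Indexes.compoundIndex(Indexes.ascending("name"), Indexes.descending("_id"))
        )
        println("Created index: $indexName")
    }
}

In the mongo shell this is simply db.people.createIndex({ name: 1, _id: -1 }); with the equality match on name, the index can also be traversed in reverse, so the direction chosen for _id is not critical.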

aggregate – How to display the factor names in my descriptive statistics instead of the numbers in R

In the DataAnalyst data set (from Kaggle), I am trying to show descriptive statistics of Rating (a numeric value) by state (a categorical factor). I am able to successfully display everything but the state names, which show up as numbers:

m <- aggregate(Rating~state, data=df, mean)
sd <- aggregate(Rating~state, data=df, sd)
n <- aggregate(Rating~state, data=df, length)
## summary descriptive table
(df.des <- cbind(n[,1], n=n[,2], mean=m[,2], sd=round(sd[,2],3), se=round(sd[,2]/sqrt(n[,2]),3)))

For df.des, I understand that n[,1] selects the column you want to display. I have tried n[,2], which brings up the count per state. How can I get the table to display the state names and not the numbers?
P.S. “State” is listed as characters (e.g. CA, NY, IL) and not numbers.

(Screenshot: descriptive statistics showing numbers instead of state categories)

(Screenshot: what n looks like)